* Optimizing instruction-cache, more packets at each stage
From: Jesper Dangaard Brouer @ 2016-01-15 13:22 UTC
To: netdev
Cc: brouer, David Miller, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
    Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
    John Fastabend

Given net-next is closed, we have time to discuss controversial core
changes, right? ;-)

I want to do some instruction-cache level optimizations.

What do I mean by that...

The kernel network stack code path (that a packet travels) is obviously
larger than the instruction cache (icache).  Today, every packet
travels individually through the network stack, experiencing the exact
same icache misses as the previous packet.

I imagine that we could process several packets at each stage in the
packet-processing code path, thereby making better use of the icache.

Today, we already allow NAPI net_rx_action() to process many
(e.g. up to 64) packets in the driver RX-poll routine.  But the driver
then calls the "full" stack for every single packet (e.g. via
napi_gro_receive()) in its processing loop, thus thrashing the icache
for every packet.

I have a proof-of-concept patch for ixgbe, which gives me a 10% speedup
on full IP forwarding.  (The patch also delays when I touch the packet
data, so it optimizes data-cache misses too.)  The basic idea is that I
delay calling ixgbe_rx_skb()/napi_gro_receive(), and allow the RX loop
(in ixgbe_clean_rx_irq()) to run more iterations before "flushing" the
icache (by calling the stack).

This was only at the driver level.  I also would like some API towards
the stack.  Maybe we could simply pass an skb list?

Changing/adjusting the stack to support processing in "stages" might be
more difficult/controversial?
Maybe we should view the packets stuck/available in the RX ring as
packets that all arrived at the same "time", and thus process them at
the same time.  By letting the "bulking" depend on the packets
available in the RX ring, we automatically amortize processing cost in
a scalable manner.

One challenge with icache optimizations is that they are hard to
profile.  But hopefully the new Skylake CPUs can profile this.
Because, as I always say, if you cannot measure it, you cannot improve
it.

p.s. I'm doing a Network Performance BoF[1] at NetDev 1.1, where this
and many more subjects will be brought up face-to-face.

[1] http://netdevconf.org/1.1/bof-network-performance-bof-jesper-dangaard-brouer.html

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
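The bulking idea above — drain whatever is available in the RX ring
first, then hand the whole batch to the stack in one pass — can be
sketched in plain userspace C.  Everything here (`struct pkt`,
`drain_rx_ring`, `stack_receive_list`) is a hypothetical stand-in for
illustration, not a kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet descriptor; stands in for an skb. */
struct pkt {
    int id;
    struct pkt *next;
};

/* Stage 1: drain what is available in the RX ring into a local list,
 * running only driver code (small icache footprint), up to a budget. */
static struct pkt *drain_rx_ring(struct pkt **ring, int avail, int budget)
{
    struct pkt *head = NULL, **tail = &head;
    int n = avail < budget ? avail : budget;

    for (int i = 0; i < n; i++) {
        ring[i]->next = NULL;
        *tail = ring[i];
        tail = &ring[i]->next;
    }
    return head;
}

/* Stage 2: process the whole batch in one loop, so the stack-side code
 * is pulled into the icache once per batch instead of once per packet. */
static int stack_receive_list(struct pkt *head)
{
    int count = 0;
    for (struct pkt *p = head; p; p = p->next)
        count++;            /* real per-packet stack work would go here */
    return count;
}
```

The batch size automatically tracks load: an idle ring yields batches
of one, a busy ring yields large batches, which is the amortization
property described above.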
* Re: Optimizing instruction-cache, more packets at each stage
From: Hannes Frederic Sowa @ 2016-01-15 13:32 UTC
To: Jesper Dangaard Brouer, netdev
Cc: David Miller, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
    Marek Majkowski, Florian Westphal, Paolo Abeni, John Fastabend

On 15.01.2016 14:22, Jesper Dangaard Brouer wrote:
[...]
> This was only at the driver level.  I also would like some API towards
> the stack.  Maybe we could simply pass an skb list?
>
> Changing/adjusting the stack to support processing in "stages" might
> be more difficult/controversial?

I once tried this up to the vlan layer, and the error handling got so
complex and complicated that I stopped there.  Maybe it is possible in
some separate stages.

This needs a redesign of a lot of stuff, and while doing so I would
switch from the current recursion-based approach to building the stack
to a more iterative one (see e.g. the stack-space consumption
problems).

Just my 2 cents,
Hannes
* Re: Optimizing instruction-cache, more packets at each stage
From: Jesper Dangaard Brouer @ 2016-01-15 14:17 UTC
To: Hannes Frederic Sowa
Cc: netdev, David Miller, Alexander Duyck, Alexei Starovoitov,
    Daniel Borkmann, Marek Majkowski, Florian Westphal, Paolo Abeni,
    John Fastabend, brouer

On Fri, 15 Jan 2016 14:32:12 +0100
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
[...]
> I once tried this up to the vlan layer, and the error handling got so
> complex and complicated that I stopped there.  Maybe it is possible in
> some separate stages.

I've already split the driver layer into a stage.  Next I will split
the GRO layer into a stage.  The GRO layer is actually quite expensive
icache-wise, as it has deep call chains and the compiler cannot inline
the functions due to the flexible function-pointer approach.  Simply
enabling/disabling GRO shows a 10% CPU usage drop (and a performance
increase).

> This needs a redesign of a lot of stuff, and while doing so I would
> switch from the current recursion-based approach to building the
> stack to a more iterative one (see e.g. the stack-space consumption
> problems).

The recursive nature of the rx handler
(__netif_receive_skb_core/another_round) is not necessarily a bad
approach for icache usage (unless the rx_handler() call indirectly
flushes the icache).  But as you have shown, it _is_ bad for
stack-space consumption.
* RE: Optimizing instruction-cache, more packets at each stage
From: David Laight @ 2016-01-15 13:36 UTC
To: 'Jesper Dangaard Brouer', netdev
Cc: David Miller, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
    Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
    John Fastabend

From: Jesper Dangaard Brouer
> Sent: 15 January 2016 13:22
...
> The kernel network stack code path (that a packet travels) is
> obviously larger than the instruction cache (icache).  Today, every
> packet travels individually through the network stack, experiencing
> the exact same icache misses as the previous packet.
...

Is that actually true for modern server processors that have a large
i-cache?  While the total size of the networking code may well be
larger, the part used for transmitting data packets will be much
smaller and could easily fit in the icache.

	David
* Re: Optimizing instruction-cache, more packets at each stage
From: Jesper Dangaard Brouer @ 2016-01-15 14:00 UTC
To: David Laight
Cc: netdev, David Miller, Alexander Duyck, Alexei Starovoitov,
    Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
    Florian Westphal, Paolo Abeni, John Fastabend, brouer

On Fri, 15 Jan 2016 13:36:04 +0000
David Laight <David.Laight@ACULAB.COM> wrote:
[...]
> Is that actually true for modern server processors that have a large
> i-cache?  While the total size of the networking code may well be
> larger, the part used for transmitting data packets will be much
> smaller and could easily fit in the icache.

Yes, exactly.  That is what I'm betting on.  If I can split it into
stages (e.g. the part used for transmitting) that fit into the icache,
then I should see a win.

The icache is still quite small, 32 KB, on modern server processors.
I don't know if smaller embedded processors also have an icache and
how large it is.  I speculate this approach would also be a benefit
for them (if they have an icache).
* Re: Optimizing instruction-cache, more packets at each stage
From: Felix Fietkau @ 2016-01-15 14:38 UTC
To: Jesper Dangaard Brouer, David Laight
Cc: netdev, David Miller, Alexander Duyck, Alexei Starovoitov,
    Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
    Florian Westphal, Paolo Abeni, John Fastabend

On 2016-01-15 15:00, Jesper Dangaard Brouer wrote:
[...]
> The icache is still quite small, 32 KB, on modern server processors.
> I don't know if smaller embedded processors also have an icache and
> how large it is.  I speculate this approach would also be a benefit
> for them (if they have an icache).

All of the router devices that I work with have an icache.  Typical
sizes are 32 or 64 KiB.  FWIW, I'm really looking forward to having
such optimizations in the network stack ;)

- Felix
* Re: Optimizing instruction-cache, more packets at each stage
From: Jesper Dangaard Brouer @ 2016-01-18 11:54 UTC
To: Felix Fietkau
Cc: David Laight, netdev, David Miller, Alexander Duyck,
    Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
    Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
    John Fastabend, brouer

On Fri, 15 Jan 2016 15:38:43 +0100 Felix Fietkau <nbd@openwrt.org> wrote:
[...]
> All of the router devices that I work with have an icache.  Typical
> sizes are 32 or 64 KiB.  FWIW, I'm really looking forward to having
> such optimizations in the network stack ;)

That is very interesting.  This kind of icache optimization will then
likely benefit lower-end devices more than high-end Intel CPUs :-)

AFAIK the Intel CPUs mask this icache problem by having an icache
prefetcher and by optimizing how fast the CPU can load/refill from
higher-level caches.  Intel CPUs have a lot of HW logic around this,
which I assume the smaller CPUs don't.  E.g. a quote from the Intel
Optimization Reference Manual:

 "The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned
  instruction bytes each cycle from the instruction cache to the
  instruction length decoder (ILD).  The instruction queue (IQ) buffers
  the ILD-processed instructions and can deliver up to four
  instructions in one cycle to the instruction decoder."
* Re: Optimizing instruction-cache, more packets at each stage
From: Eric Dumazet @ 2016-01-18 17:01 UTC
To: Jesper Dangaard Brouer
Cc: Felix Fietkau, David Laight, netdev, David Miller, Alexander Duyck,
    Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
    Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend

On Mon, 2016-01-18 at 12:54 +0100, Jesper Dangaard Brouer wrote:
[...]
>  "The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned
>   instruction bytes each cycle from the instruction cache to the
>   instruction length decoder (ILD). ..."

This does not tell how many cores/threads can fetch 16 bytes per cycle.
With more than 36 execution units per socket, the peak performance of a
single unit does not reflect what happens when all units are busy and
contend on shared resources.

If we want to properly exploit the L1 caches of each execution unit, we
need to split the load into a pipeline.  But the number of units
depends on hardware capabilities (like L1 cache size) -- something hard
to code in a generic way (linux kernel).

For example, having the same core handle RX and TX interrupts is not
the best choice, especially when TX interrupts have to call expensive
callbacks into upper layers (TCP Small Queues).
* Re: Optimizing instruction-cache, more packets at each stage
From: Florian Fainelli @ 2016-01-25 0:08 UTC
To: Jesper Dangaard Brouer, Felix Fietkau
Cc: David Laight, netdev, David Miller, Alexander Duyck,
    Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
    Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend

Hi Jesper,

On 18/01/2016 03:54, Jesper Dangaard Brouer wrote:
[...]
> That is very interesting.  This kind of icache optimization will then
> likely benefit lower-end devices more than high-end Intel CPUs :-)

Typical embedded routers have small I and D caches, but they also have
fairly small cache-line sizes (16, 32 or 64 bytes), and not necessarily
an L2 cache to help them; the memory bandwidth is also very limited
(DDR/DDR2 speeds are not uncommon), so the fewer I/D cache lines you
thrash, the better, obviously.

One thing that some HW vendors did, before they started introducing HW
capable of offloading routing/NAT workloads to specialized hardware, is
to hack the heck out of the Linux network stack to allow a lightweight
SKB structure to be used for forwarding, and to allocate these "meta"
bookkeeping SKBs from a dedicated kmem cache pool to get relatively
predictable latencies.

There is also the notion of a dirty pointer within the skbuff itself,
such that instead of e.g. having your Ethernet NIC driver do a DMA-API
call which can potentially invalidate the D-cache for an entire
1500-ish byte Ethernet frame, the packet contents are "valid" up until
the dirty pointer.  This is a nice trick if you are just forwarding,
but it requires both the SKB accessor/manipulation functions to check
for it and your Ethernet driver to be cooperative as well, so it may
not scale well.

Broadcom's implementation of such a thing can be found among these
files; the code is not kernel-style compliant, but there might be some
reusable ideas for you.  NBUFF/FKBUFF/SKBUFF are the actual
packet-bookkeeping data structures that replace and/or extend the use
of SKBs:

https://code.google.com/p/gfiber-gflt100/source/browse/kernel/linux/include/linux/nbuff.h
https://code.google.com/p/gfiber-gflt100/source/browse/kernel/linux/net/core/nbuff.c

# Check for CONFIG_MIPS_BRCM changes here:
https://code.google.com/p/gfiber-gflt100/source/browse/kernel/linux/net/core/skbuff.c
https://code.google.com/p/gfiber-gflt100/source/browse/kernel/linux/include/linux/skbuff.h
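The dirty-pointer trick Florian describes can be modeled in a few lines
of userspace C: track how far into the packet buffer the CPU has
actually written, and only that prefix needs a cache
writeback/invalidate before DMA.  The `fkb` struct and function names
below are illustrative stand-ins, not Broadcom's actual nbuff API:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical forwarding buffer with a dirty watermark. */
struct fkb {
    unsigned char data[1514];
    unsigned int len;
    unsigned int dirty;   /* bytes of data[] modified by the CPU */
};

/* All CPU writes go through this helper so the watermark stays honest. */
static void fkb_write(struct fkb *f, unsigned int off,
                      const void *src, unsigned int n)
{
    memcpy(f->data + off, src, n);
    if (off + n > f->dirty)
        f->dirty = off + n;
}

/* Bytes a (hypothetical) DMA sync would actually have to cover --
 * for pure forwarding this is typically just the rewritten MAC
 * header, not the whole 1500-byte frame. */
static unsigned int fkb_sync_len(const struct fkb *f)
{
    return f->dirty;
}
```

When only the 14-byte Ethernet header is rewritten, the sync length
stays at 14 bytes, which is the cache-line saving the post describes.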
* Re: Optimizing instruction-cache, more packets at each stage
From: David Miller @ 2016-01-15 20:47 UTC
To: brouer
Cc: netdev, alexander.duyck, alexei.starovoitov, borkmann, marek,
    hannes, fw, pabeni, john.r.fastabend

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 15 Jan 2016 14:22:23 +0100

> This was only at the driver level.  I also would like some API towards
> the stack.  Maybe we could simply pass an skb list?

Data structures are everything, so maybe we can create some kind of SKB
bundle abstraction.  Whether it's a lockless array or a linked list
behind it doesn't really matter.

We could have two categories: Related and Unrelated.

If you think about GRO and routing keys you might see what I am getting
at. :-)
* Re: Optimizing instruction-cache, more packets at each stage
From: Jesper Dangaard Brouer @ 2016-01-18 10:27 UTC
To: David Miller
Cc: netdev, alexander.duyck, alexei.starovoitov, borkmann, marek,
    hannes, fw, pabeni, john.r.fastabend, brouer

On Fri, 15 Jan 2016 15:47:21 -0500 (EST)
David Miller <davem@davemloft.net> wrote:
[...]
> Data structures are everything, so maybe we can create some kind of
> SKB bundle abstraction. ...
>
> We could have two categories: Related and Unrelated.

Yes, I think I get it.  I like the idea of Related and Unrelated.  We
already have GRO packets, which are in the "Related" category/type.

I'm wondering about the API between the driver and the "GRO layer"
(calling napi_gro_receive):

Down in the driver layer (RX), I think it is too early to categorize
Related/Unrelated SKBs, because we want to delay touching packet data
as long as possible (waiting for the prefetcher to get data into the
cache).

We could keep the napi_gro_receive() call, but in order to save icache
the driver could just create its own simple loop around
napi_gro_receive().  This loop's icache footprint and extra function
call per packet would cost something.  The downside is that the GRO
layer will have no idea how many "more" packets are coming.  Thus, it
depends on a "flush" API, which for "xmit_more" didn't work out that
well.

The NAPI drivers actually already have a flush API (calling
napi_complete_done()), BUT it does not always get invoked, e.g. if the
driver has more work to do and wants to keep polling.  I'm not sure we
want to delay "flushing" packets queued in the GRO layer for that
long(?).

The simplest solution to get around this (flush and driver-loop
complexity) would be to create an SKB list down in the driver, and
call napi_gro_receive() with this list -- simply extending
napi_gro_receive() with an SKB-list loop.
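The list-passing API sketched at the end of that mail — the driver
accumulates a batch and a single call flushes it through one receive
loop — can be modeled in userspace C.  The `bundle` struct and
`gro_receive_list()` below are hypothetical stand-ins for an skb list
and a list-aware napi_gro_receive(), not actual kernel interfaces:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an skb. */
struct sk {
    int proto;
    struct sk *next;
};

/* The driver-side batch: a singly linked list with a tail pointer so
 * appends are O(1), plus a length the GRO side can see up front. */
struct bundle {
    struct sk *head;
    struct sk **tail;
    int len;
};

static void bundle_init(struct bundle *b)
{
    b->head = NULL;
    b->tail = &b->head;
    b->len = 0;
}

static void bundle_add(struct bundle *b, struct sk *s)
{
    s->next = NULL;
    *b->tail = s;
    b->tail = &s->next;
    b->len++;
}

/* Stand-in for a list-aware receive entry point: one loop over the
 * whole batch keeps the receive-side code hot in the icache, and the
 * explicit per-batch flush removes the need for a separate flush API. */
static int gro_receive_list(struct bundle *b)
{
    int n = 0;
    for (struct sk *s = b->head; s; s = s->next)
        n++;              /* real GRO matching would happen here */
    bundle_init(b);       /* the flush empties the bundle */
    return n;
}
```

Because the batch length is known before the call, the receive side no
longer has to guess how many "more" packets are coming — the problem
the xmit_more-style flush API had.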
* Re: Optimizing instruction-cache, more packets at each stage
From: David Miller @ 2016-01-18 16:24 UTC
To: brouer
Cc: netdev, alexander.duyck, alexei.starovoitov, borkmann, marek,
    hannes, fw, pabeni, john.r.fastabend

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Mon, 18 Jan 2016 11:27:03 +0100

> Down in the driver layer (RX), I think it is too early to categorize
> Related/Unrelated SKBs, because we want to delay touching packet data
> as long as possible (waiting for the prefetcher to get data into the
> cache).

You don't need to touch the headers in order to have a good idea as to
whether there is a strong possibility packets are related or not.

We have the hash available.
* Re: Optimizing instruction-cache, more packets at each stage
From: Or Gerlitz @ 2016-01-20 22:20 UTC
To: David Miller, Eric Dumazet
Cc: Jesper Dangaard Brouer, Linux Netdev List, Alexander Duyck,
    Alexei Starovoitov, borkmann, marek, hannes, Florian Westphal,
    Paolo Abeni, John Fastabend, Amir Vadai

On Mon, Jan 18, 2016 at 6:24 PM, David Miller <davem@davemloft.net> wrote:
[...]
> You don't need to touch the headers in order to have a good idea as
> to whether there is a strong possibility packets are related or not.
>
> We have the hash available.

Dave, I assume you refer to the RSS hash result, which is written by
NIC HW to the completion descriptor and then fed to the stack by the
driver calling skb_set_hash()?  Well, this can be taken even further.

Suppose the NIC can be programmed by the kernel to provide a unique
flow tag on the completion descriptor for a given 5/12-tuple which
represents a TCP (or other logical) stream that a higher level in the
stack has identified to be in progress, and the driver plants that in
skb->mark before calling into the stack.

I guess this could yield a nice speedup for the GRO stack -- matching
based on a single 32-bit value instead of per-protocol (eth, vlan, ip,
tcp) checks [1] -- or hint at which packets from the current window of
"ready" completion descriptors could be grouped together for upper
processing?

Or.

[1] some details to complete (...) here; on the last protocol hop we
do need to verify that it would be correct to stick the incoming
packet to the existing pending packet of this stream
* Re: Optimizing instruction-cache, more packets at each stage
From: Eric Dumazet @ 2016-01-20 23:02 UTC
To: Or Gerlitz
Cc: David Miller, Eric Dumazet, Jesper Dangaard Brouer,
    Linux Netdev List, Alexander Duyck, Alexei Starovoitov, borkmann,
    marek, hannes, Florian Westphal, Paolo Abeni, John Fastabend,
    Amir Vadai

On Thu, 2016-01-21 at 00:20 +0200, Or Gerlitz wrote:
[...]
> I guess this could yield a nice speedup for the GRO stack -- matching
> based on a single 32-bit value instead of per-protocol (eth, vlan,
> ip, tcp) checks -- or hint at which packets from the current window
> of "ready" completion descriptors could be grouped together for
> upper processing?

We already use the RSS hash (skb->hash) in the GRO engine to speed up
the parsing: if skb->hash differs, then there is no point trying to
aggregate two packets.

Note that if we had an L4 hash for all provided packets, GRO could use
a hash table instead of one single list of skbs.
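Eric's hash-table suggestion can be shown in miniature: pick a bucket
from the flow hash, so only same-bucket held packets are ever compared
at all.  The bucket count and helper names below are illustrative, not
the kernel's GRO internals:

```c
#include <assert.h>
#include <stdint.h>

#define GRO_HASH_BUCKETS 8   /* illustrative; must be a power of two */

/* Map a (hardware or software) flow hash to a bucket index. */
static unsigned int gro_bucket(uint32_t skb_hash)
{
    return skb_hash & (GRO_HASH_BUCKETS - 1);
}

/* Two packets can only be GRO candidates if their hashes match;
 * bucketing prunes most comparisons before the full-hash check,
 * and different-bucket packets are never compared at all. */
static int may_aggregate(uint32_t hash_a, uint32_t hash_b)
{
    return gro_bucket(hash_a) == gro_bucket(hash_b) && hash_a == hash_b;
}
```

Bucket collisions can still pair unrelated flows, which is why the
full hash (and ultimately the header compare GRO already does) remains
the final arbiter; the bucket only shortens the candidate list.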
* Re: Optimizing instruction-cache, more packets at each stage
From: Tom Herbert @ 2016-01-20 23:27 UTC
To: Eric Dumazet
Cc: Or Gerlitz, David Miller, Eric Dumazet, Jesper Dangaard Brouer,
    Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
    Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
    Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Wed, Jan 20, 2016 at 3:02 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
[...]
> We already use the RSS hash (skb->hash) in the GRO engine to speed up
> the parsing: if skb->hash differs, then there is no point trying to
> aggregate two packets.
>
> Note that if we had an L4 hash for all provided packets, GRO could
> use a hash table instead of one single list of skbs.

Besides that, GRO requires parsing the packet anyway, so I don't see
much value in trying to optimize GRO by using the hash.

Unfortunately, the hardware hash from devices hasn't really lived up
to its potential.  The original intent of getting the hash from the
device was to be able to do packet steering (RPS and RFS) without
touching the header, but this never was implemented:
eth_type_trans touches headers, and GRO is best when done before
steering.  Given the weaknesses of Toeplitz we talked about recently,
and the fact that Jenkins is really fast to compute, I am starting to
think maybe we should always do a software hash and not rely on HW for
it...
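For reference, the shape of a software flow hash over the connection
tuple looks like the following.  This is a simplified Jenkins-style
mix written for illustration only — it is NOT the kernel's jhash, and
the constants are arbitrary assumptions, not tuned values:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int r)
{
    return (x << r) | (x >> (32 - r));
}

/* Mix the 4-tuple (plus a boot-time seed, so the hash is not
 * predictable to remote senders) into a 32-bit flow hash. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport, uint32_t seed)
{
    uint32_t h = seed ^ 0xdeadbeefu;

    h += saddr;
    h = rotl32(h, 13) * 0x9e3779b1u;
    h += daddr;
    h = rotl32(h, 13) * 0x9e3779b1u;
    h += ((uint32_t)sport << 16) | dport;
    h = rotl32(h, 13) * 0x9e3779b1u;
    return h ^ (h >> 16);       /* final avalanche */
}
```

The point of the thread's argument is that a mix like this costs a
handful of ALU operations per packet, cheap enough that relying on the
NIC's Toeplitz result becomes optional.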
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-20 23:27 ` Tom Herbert @ 2016-01-21 11:27 ` Jesper Dangaard Brouer 2016-01-21 12:49 ` Or Gerlitz ` (2 more replies) 2016-01-21 12:23 ` Jesper Dangaard Brouer 2016-02-02 16:13 ` Or Gerlitz 2 siblings, 3 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-21 11:27 UTC (permalink / raw) To: Tom Herbert Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, brouer On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote: > eth_type_trans touches headers True, the eth_type_trans() call in the driver is a major bottleneck, because it touches the packet header and happens very early in the driver. In my experiments, I extract several packets before calling napi_gro_receive(), and I also delay calling eth_type_trans(). Most of my speedup comes from this trick, as the prefetch() now has enough time. while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) { skb->protocol = eth_type_trans(skb, rq->netdev); napi_gro_receive(cq->napi, skb); } What if the HW could provide the info we need in the descriptor?! eth_type_trans() does two things: 1) determines skb->protocol 2) sets up skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} Could the HW descriptor deliver the "proto", or perhaps just some bits for the most common protos? The skb->pkt_type doesn't need many bits, and I bet the HW already has the information. The BROADCAST and MULTICAST indications are easy. PACKET_OTHERHOST can be turned around by instead setting a PACKET_HOST indication if the eth->h_dest matches the device's dev->dev_addr (else a SW compare is required). Is that doable in hardware? 
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 11:27 ` Jesper Dangaard Brouer @ 2016-01-21 12:49 ` Or Gerlitz 2016-01-21 13:57 ` Jesper Dangaard Brouer 2016-01-21 18:56 ` David Miller 2016-01-21 16:38 ` Eric Dumazet 2016-01-21 18:54 ` David Miller 2 siblings, 2 replies; 59+ messages in thread From: Or Gerlitz @ 2016-01-21 12:49 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Tom Herbert, Eric Dumazet, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Matan Barak On Thu, Jan 21, 2016 at 1:27 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote: > >> eth_type_trans touches headers > > True, the eth_type_trans() call in the driver is a major bottleneck, > because it touch the packet header and happens very early in the driver. > > In my experiments, where I extract several packet before calling > napi_gro_receive(), and I also delay calling eth_type_trans(). Most of > my speedup comes from this trick, as the prefetch() now that enough > time. > > while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) { > skb->protocol = eth_type_trans(skb, rq->netdev); > napi_gro_receive(cq->napi, skb); > } > > What is the HW could provide the info we need in the descriptor?!? > > > eth_type_trans() does two things: > > 1) determine skb->protocol > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} > > Could the HW descriptor deliver the "proto", or perhaps just some bits > on the most common proto's? > > The skb->pkt_type don't need many bits. And I bet the HW already have > the information. The BROADCAST and MULTICAST indication are easy. The > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST > indication, if the eth->h_dest match the devices dev->dev_addr (else a > SW compare is required). 
> > Is that doable in hardware? As I wrote earlier, for determination of the eth-type, HWs can do what you ask here and more. Whether the protocol is IP or not (and only then do you look in the data) you could, I guess, get from many NICs: e.g. if the NIC sets PKT_HASH_TYPE_L4 or PKT_HASH_TYPE_L3 then we know it's an IP packet, and only if we don't see this indication do we look into the data. As for pkt_type, we can use the NIC steering HW to provide us a tag saying whether it was broadcast, other multicast, or "our" unicast. Or. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 12:49 ` Or Gerlitz @ 2016-01-21 13:57 ` Jesper Dangaard Brouer 2016-01-21 18:56 ` David Miller 1 sibling, 0 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-21 13:57 UTC (permalink / raw) To: Or Gerlitz Cc: Tom Herbert, Eric Dumazet, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Matan Barak, brouer On Thu, 21 Jan 2016 14:49:25 +0200 Or Gerlitz <gerlitz.or@gmail.com> wrote: > On Thu, Jan 21, 2016 at 1:27 PM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote: > > > >> eth_type_trans touches headers > > > > True, the eth_type_trans() call in the driver is a major bottleneck, > > because it touch the packet header and happens very early in the driver. > > > > In my experiments, where I extract several packet before calling > > napi_gro_receive(), and I also delay calling eth_type_trans(). Most of > > my speedup comes from this trick, as the prefetch() now that enough > > time. > > > > while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) { > > skb->protocol = eth_type_trans(skb, rq->netdev); > > napi_gro_receive(cq->napi, skb); > > } > > > > What is the HW could provide the info we need in the descriptor?!? > > > > > > eth_type_trans() does two things: > > > > 1) determine skb->protocol > > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} > > > > Could the HW descriptor deliver the "proto", or perhaps just some bits > > on the most common proto's? > > > > The skb->pkt_type don't need many bits. And I bet the HW already have > > the information. The BROADCAST and MULTICAST indication are easy. 
The > > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST > > indication, if the eth->h_dest match the devices dev->dev_addr (else a > > SW compare is required). > > > > Is that doable in hardware? > > As I wrote earlier, for determination of the eth-type HWs can do what > you ask here and more. That is great! Is this already being delivered in the descriptor? > Protocol being IP or not (and only then you look in the data) you > could get I guess from many NICs, e.g if the NIC sets PKT_HASH_TYPE_L4 > or PKT_HASH_TYPE_L3 then we know it's an IP packets and only if > we don't see this indication we look into the data. It is a good trick. But at this very early stage we only need the eth-proto/type. Once we get to processing the IP layer, then packet-data will have been pulled / prefetched into L1 cache, thus cost of determining that should be almost free. > As for pkt_type we can use NIC steering HW to provide us a tag saying > if it was our broadcast, other multicast or "our" unicast. That would be good. Does that conflict with other programming of the NIC HW, or can we always have it turned on? If we can pull this off, then we can do some very interesting cache latency hiding! :-) (In my perf top eth_type_trans() is one of the top contenders, especially for your mlx5 driver). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 12:49 ` Or Gerlitz 2016-01-21 13:57 ` Jesper Dangaard Brouer @ 2016-01-21 18:56 ` David Miller 2016-01-21 22:45 ` Or Gerlitz 1 sibling, 1 reply; 59+ messages in thread From: David Miller @ 2016-01-21 18:56 UTC (permalink / raw) To: gerlitz.or Cc: brouer, tom, eric.dumazet, edumazet, netdev, alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw, pabeni, john.r.fastabend, amirva, matanb From: Or Gerlitz <gerlitz.or@gmail.com> Date: Thu, 21 Jan 2016 14:49:25 +0200 > On Thu, Jan 21, 2016 at 1:27 PM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: >> On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote: >> >>> eth_type_trans touches headers >> >> True, the eth_type_trans() call in the driver is a major bottleneck, >> because it touch the packet header and happens very early in the driver. >> >> In my experiments, where I extract several packet before calling >> napi_gro_receive(), and I also delay calling eth_type_trans(). Most of >> my speedup comes from this trick, as the prefetch() now that enough >> time. >> >> while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) { >> skb->protocol = eth_type_trans(skb, rq->netdev); >> napi_gro_receive(cq->napi, skb); >> } >> >> What is the HW could provide the info we need in the descriptor?!? >> >> >> eth_type_trans() does two things: >> >> 1) determine skb->protocol >> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} >> >> Could the HW descriptor deliver the "proto", or perhaps just some bits >> on the most common proto's? >> >> The skb->pkt_type don't need many bits. And I bet the HW already have >> the information. The BROADCAST and MULTICAST indication are easy. The >> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST >> indication, if the eth->h_dest match the devices dev->dev_addr (else a >> SW compare is required). >> >> Is that doable in hardware? 
> > As I wrote earlier, for determination of the eth-type HWs can do what you ask > here and more. > > Protocol being IP or not (and only then you look in the data) you could > get I guess from many NICs, e.g if the NIC sets PKT_HASH_TYPE_L4 > or PKT_HASH_TYPE_L3 then we know it's an IP packets and only if > we don't see this indication we look into the data. This doesn't differentiate ipv4 vs. ipv6 which is critical here, so this mechanism is not sufficient. We must know the exact ETH_P_* value. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 18:56 ` David Miller @ 2016-01-21 22:45 ` Or Gerlitz 2016-01-21 22:59 ` David Miller 0 siblings, 1 reply; 59+ messages in thread From: Or Gerlitz @ 2016-01-21 22:45 UTC (permalink / raw) To: David Miller Cc: Jesper Dangaard Brouer, Tom Herbert, Eric Dumazet, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Matan Barak On Thu, Jan 21, 2016 at 8:56 PM, David Miller <davem@davemloft.net> wrote: > From: Or Gerlitz <gerlitz.or@gmail.com> > Date: Thu, 21 Jan 2016 14:49:25 +0200 > >> On Thu, Jan 21, 2016 at 1:27 PM, Jesper Dangaard Brouer >> <brouer@redhat.com> wrote: >>> On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote: >>> >>>> eth_type_trans touches headers >>> >>> True, the eth_type_trans() call in the driver is a major bottleneck, >>> because it touch the packet header and happens very early in the driver. >>> >>> In my experiments, where I extract several packet before calling >>> napi_gro_receive(), and I also delay calling eth_type_trans(). Most of >>> my speedup comes from this trick, as the prefetch() now that enough >>> time. >>> >>> while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) { >>> skb->protocol = eth_type_trans(skb, rq->netdev); >>> napi_gro_receive(cq->napi, skb); >>> } >>> >>> What is the HW could provide the info we need in the descriptor?!? >>> >>> >>> eth_type_trans() does two things: >>> >>> 1) determine skb->protocol >>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} >>> >>> Could the HW descriptor deliver the "proto", or perhaps just some bits >>> on the most common proto's? >>> >>> The skb->pkt_type don't need many bits. And I bet the HW already have >>> the information. The BROADCAST and MULTICAST indication are easy. 
The >>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST >>> indication, if the eth->h_dest match the devices dev->dev_addr (else a >>> SW compare is required). >>> >>> Is that doable in hardware? >> >> As I wrote earlier, for determination of the eth-type HWs can do what you ask >> here and more. >> >> Protocol being IP or not (and only then you look in the data) you could >> get I guess from many NICs, e.g if the NIC sets PKT_HASH_TYPE_L4 >> or PKT_HASH_TYPE_L3 then we know it's an IP packets and only if >> we don't see this indication we look into the data. > > This doesn't differentiate ipv4 vs. ipv6 which is critical here, so this > mechanism is not sufficient. Dave, at least in ConnectX-4 (the mlx5e driver), as I commented earlier in this thread, we can use programmed tags, reported by the HW on packet completions, that tell whether the ethtype is ipv4 or ipv6 or something else, and let the kernel branch to look into the packet memory only in the last case. > We must know the exact ETH_P_* value. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 22:45 ` Or Gerlitz @ 2016-01-21 22:59 ` David Miller 0 siblings, 0 replies; 59+ messages in thread From: David Miller @ 2016-01-21 22:59 UTC (permalink / raw) To: gerlitz.or Cc: brouer, tom, eric.dumazet, edumazet, netdev, alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw, pabeni, john.r.fastabend, amirva, matanb From: Or Gerlitz <gerlitz.or@gmail.com> Date: Fri, 22 Jan 2016 00:45:13 +0200 > Dave, at least in the ConnectX4 (mlx5e driver), as I commented earlier > on this thread, we can use programmed tags reported by the HW on the > completion of packets whether the ethtype is ipv4 or ipv6 or > something else, and let the kernel > branch look into the packet memory on in the last case. Fair enough. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 11:27 ` Jesper Dangaard Brouer 2016-01-21 12:49 ` Or Gerlitz @ 2016-01-21 16:38 ` Eric Dumazet 2016-01-21 18:54 ` David Miller 2 siblings, 0 replies; 59+ messages in thread From: Eric Dumazet @ 2016-01-21 16:38 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Tom Herbert, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai On Thu, 2016-01-21 at 12:27 +0100, Jesper Dangaard Brouer wrote: > In my experiments, where I extract several packet before calling > napi_gro_receive(), and I also delay calling eth_type_trans(). Most of > my speedup comes from this trick, as the prefetch() now that enough > time. It really depends on the CPU. Many CPUs have very poor prefetch performance; prefetch instructions are only loosely defined by Intel/AMD. The Ivy Bridge prefetcher, for example, is known to be not that good. http://www.agner.org/optimize/blog/read.php?i=415 http://www.agner.org/optimize/blog/read.php?i=285 https://groups.google.com/forum/#!topic/comp.arch/71wnqr_F9sw Really, refrain from adding stuff that might only look good on one CPU. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 11:27 ` Jesper Dangaard Brouer 2016-01-21 12:49 ` Or Gerlitz 2016-01-21 16:38 ` Eric Dumazet @ 2016-01-21 18:54 ` David Miller 2016-01-24 14:28 ` Jesper Dangaard Brouer 2 siblings, 1 reply; 59+ messages in thread From: David Miller @ 2016-01-21 18:54 UTC (permalink / raw) To: brouer Cc: tom, eric.dumazet, gerlitz.or, edumazet, netdev, alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw, pabeni, john.r.fastabend, amirva From: Jesper Dangaard Brouer <brouer@redhat.com> Date: Thu, 21 Jan 2016 12:27:30 +0100 > eth_type_trans() does two things: > > 1) determine skb->protocol > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} > > Could the HW descriptor deliver the "proto", or perhaps just some bits > on the most common proto's? > > The skb->pkt_type don't need many bits. And I bet the HW already have > the information. The BROADCAST and MULTICAST indication are easy. The > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST > indication, if the eth->h_dest match the devices dev->dev_addr (else a > SW compare is required). > > Is that doable in hardware? I feel like we've had this discussion before several years ago. I think having just the protocol value would be enough. skb->pkt_type we could deal with by using always an accessor and evaluating it lazily. Nothing needs it until we hit ip_rcv() or similar. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 18:54 ` David Miller @ 2016-01-24 14:28 ` Jesper Dangaard Brouer 2016-01-24 14:44 ` Michael S. Tsirkin 2016-01-24 20:09 ` Optimizing instruction-cache, more packets at each stage Tom Herbert 0 siblings, 2 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-24 14:28 UTC (permalink / raw) To: David Miller Cc: tom, eric.dumazet, gerlitz.or, edumazet, netdev, alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw, pabeni, john.r.fastabend, amirva, brouer, Michael S. Tsirkin On Thu, 21 Jan 2016 10:54:01 -0800 (PST) David Miller <davem@davemloft.net> wrote: > From: Jesper Dangaard Brouer <brouer@redhat.com> > Date: Thu, 21 Jan 2016 12:27:30 +0100 > > > eth_type_trans() does two things: > > > > 1) determine skb->protocol > > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} > > > > Could the HW descriptor deliver the "proto", or perhaps just some bits > > on the most common proto's? > > > > The skb->pkt_type don't need many bits. And I bet the HW already have > > the information. The BROADCAST and MULTICAST indication are easy. The > > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST > > indication, if the eth->h_dest match the devices dev->dev_addr (else a > > SW compare is required). > > > > Is that doable in hardware? > > I feel like we've had this discussion before several years ago. > > I think having just the protocol value would be enough. > > skb->pkt_type we could deal with by using always an accessor and > evaluating it lazily. Nothing needs it until we hit ip_rcv() or > similar. First I thought, I liked the idea delaying the eval of skb->pkt_type. BUT then I realized, what if we take this even further. What if we actually use this information, for something useful, at this very early RX stage. The information I'm interested in, from the HW descriptor, is if this packet is NOT for local delivery. 
If so, we can send the packet down a "fast-forward" code path. Think about bridging packets to a guest OS. Because we know very early at RX (from the packet's HW descriptor), we might even avoid allocating an SKB. We could just "forward" the packet-page to the guest OS. Taking Eric's idea of remote CPUs, we could even send these packet-pages to a remote CPU (e.g. where the guest OS is running) without having touched a single cache-line in the packet-data. I would still bundle them up first, to amortize the (100-133ns) cost of transferring something to another CPU. The data-cache trick would be to instruct the prefetcher to only start prefetching into L3 or L2 when these packets are destined for a remote CPU. At least Intel CPUs have prefetch operations that target only the L2/L3 cache. Maybe we need a combined solution: lazy eval of skb->pkt_type for local delivery, but set the information if available from the HW descriptor. And fast page-forward doesn't even need an SKB. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-24 14:28 ` Jesper Dangaard Brouer @ 2016-01-24 14:44 ` Michael S. Tsirkin 2016-01-24 17:28 ` John Fastabend 2016-01-24 20:09 ` Optimizing instruction-cache, more packets at each stage Tom Herbert 1 sibling, 1 reply; 59+ messages in thread From: Michael S. Tsirkin @ 2016-01-24 14:44 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: David Miller, tom, eric.dumazet, gerlitz.or, edumazet, netdev, alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw, pabeni, john.r.fastabend, amirva On Sun, Jan 24, 2016 at 03:28:14PM +0100, Jesper Dangaard Brouer wrote: > On Thu, 21 Jan 2016 10:54:01 -0800 (PST) > David Miller <davem@davemloft.net> wrote: > > > From: Jesper Dangaard Brouer <brouer@redhat.com> > > Date: Thu, 21 Jan 2016 12:27:30 +0100 > > > > > eth_type_trans() does two things: > > > > > > 1) determine skb->protocol > > > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} > > > > > > Could the HW descriptor deliver the "proto", or perhaps just some bits > > > on the most common proto's? > > > > > > The skb->pkt_type don't need many bits. And I bet the HW already have > > > the information. The BROADCAST and MULTICAST indication are easy. The > > > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST > > > indication, if the eth->h_dest match the devices dev->dev_addr (else a > > > SW compare is required). > > > > > > Is that doable in hardware? > > > > I feel like we've had this discussion before several years ago. > > > > I think having just the protocol value would be enough. > > > > skb->pkt_type we could deal with by using always an accessor and > > evaluating it lazily. Nothing needs it until we hit ip_rcv() or > > similar. > > First I thought, I liked the idea delaying the eval of skb->pkt_type. > > BUT then I realized, what if we take this even further. What if we > actually use this information, for something useful, at this very > early RX stage. 
> > The information I'm interested in, from the HW descriptor, is if this > packet is NOT for local delivery. If so, we can send the packet on a > "fast-forward" code path. > > Think about bridging packets to a guest OS. Because we know very > early at RX (from packet HW descriptor) we might even avoid allocating > a SKB. We could just "forward" the packet-page to the guest OS. OK, so you would build a new kind of rx handler, and then e.g. macvtap could maybe get packets this way? Sure - e.g. vhost expects an skb at the moment but it won't be too hard to teach it that there's some other option. Or maybe some kind of stub skb that just has the correct length but no data is easier, I'm not sure. > Taking Eric's idea, of remote CPUs, we could even send these > packet-pages to a remote CPU (e.g. where the guest OS is running), > without having touched a single cache-line in the packet-data. I > would still bundle them up first, to amortize the (100-133ns) cost of > transferring something to another CPU. This bundling would have to happen in a guest specific way then, so in vhost. I'd be curious to see what you come up with. > The data-cache trick, would be to instruct prefetcher only to start > prefetching to L3 or L2, when these packet are destined for a remote > CPU. At-least Intel CPUs have prefetch operations that specify only > L2/L3 cache. > > > Maybe, we need a combined solution. Lazy eval skb->pkt_type, for > local delivery, but set the information if avail from HW desc. And > fast page-forward don't even need a SKB. > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-24 14:44 ` Michael S. Tsirkin @ 2016-01-24 17:28 ` John Fastabend 2016-01-25 13:15 ` Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) Jesper Dangaard Brouer 0 siblings, 1 reply; 59+ messages in thread From: John Fastabend @ 2016-01-24 17:28 UTC (permalink / raw) To: Michael S. Tsirkin, Jesper Dangaard Brouer Cc: David Miller, tom, eric.dumazet, gerlitz.or, edumazet, netdev, alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw, pabeni, john.r.fastabend, amirva, Daniel Borkmann, vyasevich On 16-01-24 06:44 AM, Michael S. Tsirkin wrote: > On Sun, Jan 24, 2016 at 03:28:14PM +0100, Jesper Dangaard Brouer wrote: >> On Thu, 21 Jan 2016 10:54:01 -0800 (PST) >> David Miller <davem@davemloft.net> wrote: >> >>> From: Jesper Dangaard Brouer <brouer@redhat.com> >>> Date: Thu, 21 Jan 2016 12:27:30 +0100 >>> >>>> eth_type_trans() does two things: >>>> >>>> 1) determine skb->protocol >>>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} >>>> >>>> Could the HW descriptor deliver the "proto", or perhaps just some bits >>>> on the most common proto's? >>>> >>>> The skb->pkt_type don't need many bits. And I bet the HW already have >>>> the information. The BROADCAST and MULTICAST indication are easy. The >>>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST >>>> indication, if the eth->h_dest match the devices dev->dev_addr (else a >>>> SW compare is required). >>>> >>>> Is that doable in hardware? >>> >>> I feel like we've had this discussion before several years ago. >>> >>> I think having just the protocol value would be enough. >>> >>> skb->pkt_type we could deal with by using always an accessor and >>> evaluating it lazily. Nothing needs it until we hit ip_rcv() or >>> similar. >> >> First I thought, I liked the idea delaying the eval of skb->pkt_type. >> >> BUT then I realized, what if we take this even further. 
What if we >> actually use this information, for something useful, at this very >> early RX stage. >> >> The information I'm interested in, from the HW descriptor, is if this >> packet is NOT for local delivery. If so, we can send the packet on a >> "fast-forward" code path. >> >> Think about bridging packets to a guest OS. Because we know very >> early at RX (from packet HW descriptor) we might even avoid allocating >> a SKB. We could just "forward" the packet-page to the guest OS. > > OK, so you would build a new kind of rx handler, and then > e.g. macvtap could maybe get packets this way? > Sure - e.g. vhost expects an skb at the moment > but it won't be too hard to teach it that there's > some other option. + Daniel, Vlad If you use the macvtap device with the offload features, you can "know" via MAC address that all packets on a specific hardware queue set belong to a specific guest (the queues are bound to a new netdev). This works well with the passthru mode of macvlan, so you can do hardware bridging this way. Supporting similar L3 modes (probably not via macvlan) has been on my todo list for a while, but I haven't got there yet. The ixgbe and fm10k Intel drivers support this now; maybe others do too, but those are the two I've worked with recently. The idea here is that you remove any overhead from running bridge code, etc., while still allowing users to stick netfilter, qos, etc. hooks in the datapath. Also, Daniel and I started working on a zero-copy RX mode which would further help this by letting vhost-net pass down a set of DMA buffers; we should probably get this working and submit it. IIRC Vlad also had the same sort of idea. The initial data for this looked good, but not as good as the solution below. However, it had a similar issue as below, in that you just jumped over netfilter, qos, etc. Our initial implementation used af_packet. > > Or maybe some kind of stub skb that just has > the correct length but no data is easier, > I'm not sure. 
> Another option is to use perfect filters to push traffic to a VF and then map the VF into user space and use the vhost dpdk bits. This works fairly well and gets pkts into the guest with little hypervisor overhead and no(?) kernel network stack overhead. But the trade-off is you cut out netfilter, qos, etc. This is really slick if you "trust" your guest or have enough ACLs/etc in your hardware to "trust' the guest. A compromise is to use a VF and do not unbind it from the OS then you can use macvtap again and map the netdev 1:1 to a guest. With this mode you can still use your netfilter, qos, etc. but do l2,l3,l4 hardware forwarding with perfect filters. As an aside if you don't like ethtool perfect filters I have a set of patches to control this via 'tc' that I'll submit when net-next opens up again which would let you support filtering on more field options using offset:mask:value notation. >> Taking Eric's idea, of remote CPUs, we could even send these >> packet-pages to a remote CPU (e.g. where the guest OS is running), >> without having touched a single cache-line in the packet-data. I >> would still bundle them up first, to amortize the (100-133ns) cost of >> transferring something to another CPU. > > This bundling would have to happen in a guest > specific way then, so in vhost. > I'd be curious to see what you come up with. > >> The data-cache trick, would be to instruct prefetcher only to start >> prefetching to L3 or L2, when these packet are destined for a remote >> CPU. At-least Intel CPUs have prefetch operations that specify only >> L2/L3 cache. >> >> >> Maybe, we need a combined solution. Lazy eval skb->pkt_type, for >> local delivery, but set the information if avail from HW desc. And >> fast page-forward don't even need a SKB. 
>> >> -- >> Best regards, >> Jesper Dangaard Brouer >> MSc.CS, Principal Kernel Engineer at Red Hat >> Author of http://www.iptv-analyzer.org >> LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-24 17:28 ` John Fastabend @ 2016-01-25 13:15 ` Jesper Dangaard Brouer 2016-01-25 17:09 ` Tom Herbert 0 siblings, 1 reply; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-25 13:15 UTC (permalink / raw) To: John Fastabend Cc: Michael S. Tsirkin, David Miller, tom, eric.dumazet, gerlitz.or, edumazet, netdev, alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw, pabeni, john.r.fastabend, amirva, Daniel Borkmann, vyasevich, brouer After reading John's reply about perfect filters, I want to re-state my idea for this very early RX stage, and describe a packet-page-level bypass use-case that John indirectly mentions. There are two ideas getting mixed up here: (1) bundling from the RX-ring, and (2) allowing the "packet-page" to be picked up directly. Bundling (1) is something that seems natural, and which helps us amortize the cost between layers (and utilizes the icache better). Let's keep that in another thread. This (2) direct forwarding of "packet-pages" is a fairly extreme idea, BUT it has the potential of being a new integration point for "selective" bypass-solutions, and of bringing RAW/af_packet (RX) up to speed with bypass-solutions. Today, the bypass-solutions grab and control the entire NIC HW. In many cases this is not very practical, if you also want to use the NIC for something else. Solutions for bypassing only part of the traffic are starting to show up: both a netmap[1]- and a DPDK[2]-based approach. [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/ [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/ Both approaches install a HW filter in the NIC and redirect packets to a separate RX HW queue (via ethtool ntuple + flow-type). DPDK needs a PCI SR-IOV setup and then runs its own poll-mode driver on top. 
Netmap patches the original ixgbe driver and, since CloudFlare/Gilberto's changes[3], supports a single-RX-queue mode. [3] https://github.com/luigirizzo/netmap/pull/87 I'm thinking: why run all this extra driver software on top? Why don't we just pick up the (packet)-page from the RX ring, and hand it over to a registered bypass handler? (As mentioned before, the HW descriptor needs to somehow "mark" these packets for us.) I imagine some kind of page ring structure, and I also imagine RAW/af_packet being a "bypass" consumer. I guess the af_packet part was also something John and Daniel have been looking at. (top post, but left John's reply below, because it got me thinking) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer On Sun, 24 Jan 2016 09:28:36 -0800 John Fastabend <john.fastabend@gmail.com> wrote: > On 16-01-24 06:44 AM, Michael S. Tsirkin wrote: > > On Sun, Jan 24, 2016 at 03:28:14PM +0100, Jesper Dangaard Brouer wrote: > >> On Thu, 21 Jan 2016 10:54:01 -0800 (PST) > >> David Miller <davem@davemloft.net> wrote: > >> > >>> From: Jesper Dangaard Brouer <brouer@redhat.com> > >>> Date: Thu, 21 Jan 2016 12:27:30 +0100 > >>> [...] > >> > >> BUT then I realized, what if we take this even further. What if we > >> actually use this information, for something useful, at this very > >> early RX stage. > >> > >> The information I'm interested in, from the HW descriptor, is if this > >> packet is NOT for local delivery. If so, we can send the packet on a > >> "fast-forward" code path. > >> > >> Think about bridging packets to a guest OS. Because we know very > >> early at RX (from packet HW descriptor) we might even avoid allocating > >> a SKB. We could just "forward" the packet-page to the guest OS. > > > > OK, so you would build a new kind of rx handler, and then > > e.g. macvtap could maybe get packets this way? > > Sure - e.g.
vhost expects an skb at the moment > > but it won't be too hard to teach it that there's > > some other option. > > + Daniel, Vlad > > If you use the macvtap device with the offload features you can "know" > via mac address that all packets on a specific hardware queue set belong > to a specific guest. (the queues are bound to a new netdev) This works > well with the passthru mode of macvlan. So you can do hardware bridging > this way. Supporting similar L3 modes probably not via macvlan has been > on my todo list for awhile but I haven't got there yet. ixgbe and fm10k > intel drivers support this now maybe others but those are the two I've > worked with recently. > > The idea here is you remove any overhead from running bridge code, etc. > but still allowing users to stick netfilter, qos, etc hooks in the > datapath. > > Also Daniel and I started working on a zero-copy RX mode which would > further help this by letting vhost-net pass down a set of dma buffers > we should probably get this working and submit it. iirc Vlad also > had the same sort of idea. The initial data for this looked good but > not as good as the solution below. However it had a similar issue as > below in that you just jumped over netfilter, qos, etc. Our initial > implementation used af_packet. > > > > > Or maybe some kind of stub skb that just has > > the correct length but no data is easier, > > I'm not sure. > > > > Another option is to use perfect filters to push traffic to a VF and > then map the VF into user space and use the vhost dpdk bits. This > works fairly well and gets pkts into the guest with little hypervisor > overhead and no(?) kernel network stack overhead. But the trade-off is > you cut out netfilter, qos, etc. This is really slick if you "trust" > your guest or have enough ACLs/etc in your hardware to "trust' the > guest. > > A compromise is to use a VF and do not unbind it from the OS then > you can use macvtap again and map the netdev 1:1 to a guest. 
With > this mode you can still use your netfilter, qos, etc. but do l2,l3,l4 > hardware forwarding with perfect filters. > > As an aside if you don't like ethtool perfect filters I have a set of > patches to control this via 'tc' that I'll submit when net-next opens > up again which would let you support filtering on more field options > using offset:mask:value notation. > > >> Taking Eric's idea, of remote CPUs, we could even send these > >> packet-pages to a remote CPU (e.g. where the guest OS is running), > >> without having touched a single cache-line in the packet-data. I > >> would still bundle them up first, to amortize the (100-133ns) cost of > >> transferring something to another CPU. > > > > This bundling would have to happen in a guest > > specific way then, so in vhost. > > I'd be curious to see what you come up with. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-25 13:15 ` Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) Jesper Dangaard Brouer @ 2016-01-25 17:09 ` Tom Herbert 2016-01-25 17:50 ` John Fastabend 0 siblings, 1 reply; 59+ messages in thread From: Tom Herbert @ 2016-01-25 17:09 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: John Fastabend, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > > After reading John's reply about perfect filters, I want to re-state > my idea, for this very early RX stage. And describe a packet-page > level bypass use-case, that John indirectly mentions. > > > There are two ideas, getting mixed up here. (1) bundling from the > RX-ring, (2) allowing to pick up the "packet-page" directly. > > Bundling (1) is something that seems natural, and which help us > amortize the cost between layers (and utilizes icache better). Lets > keep that in another thread. > > This (2) direct forward of "packet-pages" is a fairly extreme idea, > BUT it have the potential of being an new integration point for > "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to > speed with bypass-solutions. > > > Today, the bypass-solutions grab and control the entire NIC HW. In > many cases this is not very practical, if you also want to use the NIC > for something else. > > Solutions for bypassing only part of the traffic is starting to show > up. Both a netmap[1] and a DPDK[2] based approach. 
> > [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/ > [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/ > > Both approaches install a HW filter in the NIC, and redirect packets > to a separate RX HW queue (via ethtool ntuple + flow-type). DPDK > needs pci SRIOV setup and then run it own poll-mode driver on top. > Netmap patch the orig ixgbe driver, and since CloudFlare/Gilberto's > changes[3] support a single RX queue mode. > Jepser, thanks for providing more specifics. One comment: If you intend to change core code paths or APIs for this, then I think that we should require up front that the associated HW support is protocol agnostic (i.e. HW filters must be programmable and generic ). We don't want a promising feature like this to be undermined by protocol ossification. Thanks, Tom > [3] https://github.com/luigirizzo/netmap/pull/87 > > > I'm thinking, why run all this extra driver software on top. Why > don't we just pickup the (packet)-page from the RX ring, and > hand-it-over to a registered bypass handler? (as mentioned before, > the HW descriptor need to somehow "mark" these packets for us). > > I imagine some kind of page ring structure, and I also imagine > RAW/af_packet being a "bypass" consumer. I guess the af_packet part > was also something John and Daniel have been looking at. > > > (top post, but left John's replay below, because it got me thinking) > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer > > > > > On Sun, 24 Jan 2016 09:28:36 -0800 > John Fastabend <john.fastabend@gmail.com> wrote: > >> On 16-01-24 06:44 AM, Michael S. 
Tsirkin wrote: >> > On Sun, Jan 24, 2016 at 03:28:14PM +0100, Jesper Dangaard Brouer wrote: >> >> On Thu, 21 Jan 2016 10:54:01 -0800 (PST) >> >> David Miller <davem@davemloft.net> wrote: >> >> >> >>> From: Jesper Dangaard Brouer <brouer@redhat.com> >> >>> Date: Thu, 21 Jan 2016 12:27:30 +0100 >> >>> > [...] > >> >> >> >> BUT then I realized, what if we take this even further. What if we >> >> actually use this information, for something useful, at this very >> >> early RX stage. >> >> >> >> The information I'm interested in, from the HW descriptor, is if this >> >> packet is NOT for local delivery. If so, we can send the packet on a >> >> "fast-forward" code path. >> >> >> >> Think about bridging packets to a guest OS. Because we know very >> >> early at RX (from packet HW descriptor) we might even avoid allocating >> >> a SKB. We could just "forward" the packet-page to the guest OS. >> > >> > OK, so you would build a new kind of rx handler, and then >> > e.g. macvtap could maybe get packets this way? >> > Sure - e.g. vhost expects an skb at the moment >> > but it won't be too hard to teach it that there's >> > some other option. >> >> + Daniel, Vlad >> >> If you use the macvtap device with the offload features you can "know" >> via mac address that all packets on a specific hardware queue set belong >> to a specific guest. (the queues are bound to a new netdev) This works >> well with the passthru mode of macvlan. So you can do hardware bridging >> this way. Supporting similar L3 modes probably not via macvlan has been >> on my todo list for awhile but I haven't got there yet. ixgbe and fm10k >> intel drivers support this now maybe others but those are the two I've >> worked with recently. >> >> The idea here is you remove any overhead from running bridge code, etc. >> but still allowing users to stick netfilter, qos, etc hooks in the >> datapath. 
>> >> Also Daniel and I started working on a zero-copy RX mode which would >> further help this by letting vhost-net pass down a set of dma buffers >> we should probably get this working and submit it. iirc Vlad also >> had the same sort of idea. The initial data for this looked good but >> not as good as the solution below. However it had a similar issue as >> below in that you just jumped over netfilter, qos, etc. Our initial >> implementation used af_packet. >> >> > >> > Or maybe some kind of stub skb that just has >> > the correct length but no data is easier, >> > I'm not sure. >> > >> >> Another option is to use perfect filters to push traffic to a VF and >> then map the VF into user space and use the vhost dpdk bits. This >> works fairly well and gets pkts into the guest with little hypervisor >> overhead and no(?) kernel network stack overhead. But the trade-off is >> you cut out netfilter, qos, etc. This is really slick if you "trust" >> your guest or have enough ACLs/etc in your hardware to "trust' the >> guest. >> >> A compromise is to use a VF and do not unbind it from the OS then >> you can use macvtap again and map the netdev 1:1 to a guest. With >> this mode you can still use your netfilter, qos, etc. but do l2,l3,l4 >> hardware forwarding with perfect filters. >> >> As an aside if you don't like ethtool perfect filters I have a set of >> patches to control this via 'tc' that I'll submit when net-next opens >> up again which would let you support filtering on more field options >> using offset:mask:value notation. >> >> >> Taking Eric's idea, of remote CPUs, we could even send these >> >> packet-pages to a remote CPU (e.g. where the guest OS is running), >> >> without having touched a single cache-line in the packet-data. I >> >> would still bundle them up first, to amortize the (100-133ns) cost of >> >> transferring something to another CPU. >> > >> > This bundling would have to happen in a guest >> > specific way then, so in vhost. 
>> > I'd be curious to see what you come up with. > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-25 17:09 ` Tom Herbert @ 2016-01-25 17:50 ` John Fastabend 2016-01-25 21:32 ` Tom Herbert 2016-01-25 22:10 ` Jesper Dangaard Brouer 0 siblings, 2 replies; 59+ messages in thread From: John Fastabend @ 2016-01-25 17:50 UTC (permalink / raw) To: Tom Herbert, Jesper Dangaard Brouer Cc: Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On 16-01-25 09:09 AM, Tom Herbert wrote: > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: >> >> After reading John's reply about perfect filters, I want to re-state >> my idea, for this very early RX stage. And describe a packet-page >> level bypass use-case, that John indirectly mentions. >> >> >> There are two ideas, getting mixed up here. (1) bundling from the >> RX-ring, (2) allowing to pick up the "packet-page" directly. >> >> Bundling (1) is something that seems natural, and which help us >> amortize the cost between layers (and utilizes icache better). Lets >> keep that in another thread. >> >> This (2) direct forward of "packet-pages" is a fairly extreme idea, >> BUT it have the potential of being an new integration point for >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to >> speed with bypass-solutions. >> >> >> Today, the bypass-solutions grab and control the entire NIC HW. In >> many cases this is not very practical, if you also want to use the NIC >> for something else. >> >> Solutions for bypassing only part of the traffic is starting to show >> up. Both a netmap[1] and a DPDK[2] based approach. 
>> >> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/ >> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/ >> >> Both approaches install a HW filter in the NIC, and redirect packets >> to a separate RX HW queue (via ethtool ntuple + flow-type). DPDK >> needs pci SRIOV setup and then run it own poll-mode driver on top. >> Netmap patch the orig ixgbe driver, and since CloudFlare/Gilberto's >> changes[3] support a single RX queue mode. >> FWIW I wrote a version of the patch talked about in the queue splitting article that didn't require SR-IOV; we also talked about it at the last netconf in Ottawa. The problem is that without SR-IOV, if you map a queue directly into userspace so you can run the poll-mode drivers, there is nothing protecting the DMA engine, so userspace can put arbitrary addresses in there. There is something called Process Address Space ID (PASID), also part of the PCI-SIG spec, that could help you here, but I don't know of any hardware that supports it. The other option is to use system calls and validate the descriptors in the kernel, but this incurs some overhead; we had it at 15% or so when I did the numbers last year. However I'm told there is some interesting work going on around syscall overhead that may help. One thing to note is that SR-IOV does somewhat limit the number of these types of interfaces you can support to the max VFs, whereas the queue mechanism, although slower with a function call, would be limited to the max number of queues. Also busy polling will help here if you are worried about pps. Jesper, at least for your (2) case, what are we missing with the bifurcated/queue splitting work? Are you really after systems without SR-IOV support, or are you trying to get this on the order of queues instead of VFs? > Jepser, thanks for providing more specifics.
> > One comment: If you intend to change core code paths or APIs for this, > then I think that we should require up front that the associated HW > support is protocol agnostic (i.e. HW filters must be programmable and > generic ). We don't want a promising feature like this to be > undermined by protocol ossification. At the moment we use ethtool ntuple filters, which is basically adding a new set of enums and structures every time we need a new protocol, so it's painful: you need your vendor to support you, and you need a new kernel. The flow API was shot down (which would get you to the point where the user could specify the protocols for the driver to implement, e.g. put_parse_graph) and the only new proposals I've seen are bpf translations in drivers and 'tc'. I plan to take another shot at this in net-next. > > Thanks, > Tom > >> [3] https://github.com/luigirizzo/netmap/pull/87 >> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-25 17:50 ` John Fastabend @ 2016-01-25 21:32 ` Tom Herbert 2016-01-25 21:58 ` John Fastabend 2016-01-25 22:10 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 59+ messages in thread From: Tom Herbert @ 2016-01-25 21:32 UTC (permalink / raw) To: John Fastabend Cc: Jesper Dangaard Brouer, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Mon, Jan 25, 2016 at 9:50 AM, John Fastabend <john.fastabend@gmail.com> wrote: > On 16-01-25 09:09 AM, Tom Herbert wrote: >> On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer >> <brouer@redhat.com> wrote: >>> >>> After reading John's reply about perfect filters, I want to re-state >>> my idea, for this very early RX stage. And describe a packet-page >>> level bypass use-case, that John indirectly mentions. >>> >>> >>> There are two ideas, getting mixed up here. (1) bundling from the >>> RX-ring, (2) allowing to pick up the "packet-page" directly. >>> >>> Bundling (1) is something that seems natural, and which help us >>> amortize the cost between layers (and utilizes icache better). Lets >>> keep that in another thread. >>> >>> This (2) direct forward of "packet-pages" is a fairly extreme idea, >>> BUT it have the potential of being an new integration point for >>> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to >>> speed with bypass-solutions. >>> >>> >>> Today, the bypass-solutions grab and control the entire NIC HW. In >>> many cases this is not very practical, if you also want to use the NIC >>> for something else. >>> >>> Solutions for bypassing only part of the traffic is starting to show >>> up. Both a netmap[1] and a DPDK[2] based approach. 
>>> >>> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/ >>> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/ >>> >>> Both approaches install a HW filter in the NIC, and redirect packets >>> to a separate RX HW queue (via ethtool ntuple + flow-type). DPDK >>> needs pci SRIOV setup and then run it own poll-mode driver on top. >>> Netmap patch the orig ixgbe driver, and since CloudFlare/Gilberto's >>> changes[3] support a single RX queue mode. >>> > > FWIW I wrote a version of the patch talked about in the queue splitting > article that didn't require SR-IOV and we also talked about it at last > netconf in ottowa. The problem is without SR-IOV if you map a queue > directly into userspace so you can run the poll mode drivers there is > nothing protecting the DMA engine. So userspace can put arbitrary > addresses in there. There is something called Process Address Space ID > (PASID) also part of the PCI-SIG spec that could help you here but I > don't know of any hardware that supports it. The other option is to > use system calls and validate the descriptors in the kernel but this > incurs some overhead we had it at 15% or so when I did the numbers > last year. However I'm told there is some interesting work going on > around syscall overhead that may help. > > One thing to note is SRIOV does somewhat limit the number of these > types of interfaces you can support to the max VFs where as the > queue mechanism although slower with a function call would be limited > to max number of queues. Also busy polling will help here if you > are worried about pps. > I think you're understating that a bit :-) We know that busy polling helps with both pps and latency. IIRC, busy polling in the kernel reduced latency by 2/3. Any latency or pps comparison between an interrupt driven kernel stack and a userspace stack doing polling would be invalid. 
If this work is all about latency (like burning cores is not an issue), maybe busy polling should be assumed for all test cases? > Jesper, at least for you (2) case what are we missing with the > bifurcated/queue splitting work? Are you really after systems > without SR-IOV support or are you trying to get this on the order > of queues instead of VFs. > >> Jepser, thanks for providing more specifics. >> >> One comment: If you intend to change core code paths or APIs for this, >> then I think that we should require up front that the associated HW >> support is protocol agnostic (i.e. HW filters must be programmable and >> generic ). We don't want a promising feature like this to be >> undermined by protocol ossification. > > At the moment we use ethtool ntuple filters which is basically adding > a new set of enums and structures every time we need a new protocol > so its painful and you need your vendor to support you and you need a > new kernel. > > The flow api was shot down (which would get you to the point where > the user could specify the protocols for the driver to implement e.g. > put_parse_graph) and the only new proposals I've seen are bpf > translations in drivers and 'tc'. I plan to take another shot at this in > net-next. > >> >> Thanks, >> Tom >> >>> [3] https://github.com/luigirizzo/netmap/pull/87 >>> > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-25 21:32 ` Tom Herbert @ 2016-01-25 21:58 ` John Fastabend 0 siblings, 0 replies; 59+ messages in thread From: John Fastabend @ 2016-01-25 21:58 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On 16-01-25 01:32 PM, Tom Herbert wrote: > On Mon, Jan 25, 2016 at 9:50 AM, John Fastabend > <john.fastabend@gmail.com> wrote: >> On 16-01-25 09:09 AM, Tom Herbert wrote: >>> On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer >>> <brouer@redhat.com> wrote: >>>> >>>> After reading John's reply about perfect filters, I want to re-state >>>> my idea, for this very early RX stage. And describe a packet-page >>>> level bypass use-case, that John indirectly mentions. >>>> >>>> >>>> There are two ideas, getting mixed up here. (1) bundling from the >>>> RX-ring, (2) allowing to pick up the "packet-page" directly. >>>> >>>> Bundling (1) is something that seems natural, and which help us >>>> amortize the cost between layers (and utilizes icache better). Lets >>>> keep that in another thread. >>>> >>>> This (2) direct forward of "packet-pages" is a fairly extreme idea, >>>> BUT it have the potential of being an new integration point for >>>> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to >>>> speed with bypass-solutions. >>>> >>>> >>>> Today, the bypass-solutions grab and control the entire NIC HW. In >>>> many cases this is not very practical, if you also want to use the NIC >>>> for something else. >>>> >>>> Solutions for bypassing only part of the traffic is starting to show >>>> up. Both a netmap[1] and a DPDK[2] based approach. 
>>>> >>>> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/ >>>> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/ >>>> >>>> Both approaches install a HW filter in the NIC, and redirect packets >>>> to a separate RX HW queue (via ethtool ntuple + flow-type). DPDK >>>> needs pci SRIOV setup and then run it own poll-mode driver on top. >>>> Netmap patch the orig ixgbe driver, and since CloudFlare/Gilberto's >>>> changes[3] support a single RX queue mode. >>>> >> >> FWIW I wrote a version of the patch talked about in the queue splitting >> article that didn't require SR-IOV and we also talked about it at last >> netconf in ottowa. The problem is without SR-IOV if you map a queue >> directly into userspace so you can run the poll mode drivers there is >> nothing protecting the DMA engine. So userspace can put arbitrary >> addresses in there. There is something called Process Address Space ID >> (PASID) also part of the PCI-SIG spec that could help you here but I >> don't know of any hardware that supports it. The other option is to >> use system calls and validate the descriptors in the kernel but this >> incurs some overhead we had it at 15% or so when I did the numbers >> last year. However I'm told there is some interesting work going on >> around syscall overhead that may help. >> >> One thing to note is SRIOV does somewhat limit the number of these >> types of interfaces you can support to the max VFs where as the >> queue mechanism although slower with a function call would be limited >> to max number of queues. Also busy polling will help here if you >> are worried about pps. >> > I think you're understating that a bit :-) We know that busy polling > helps with both pps and latency. IIRC, busy polling in the kernel > reduced latency by 2/3. Any latency or pps comparison between an > interrupt driven kernel stack and a userspace stack doing polling > would be invalid. 
If this work is all about latency (like burning > cores is not an issue), maybe busy polling should be assumed for > all test cases? Probably if you're going to try and report pps numbers and chart them we might as well play the game and use the best configuration we can. Although I did want to make busy polling per queue, or maybe create L3/L4 netdevs like macvlan and put those in busy polling. It's a bit overkill to put the entire device in busy polling mode when we have only a couple sockets doing it. net-next is opening soon right ;) > >> Jesper, at least for you (2) case what are we missing with the >> bifurcated/queue splitting work? Are you really after systems >> without SR-IOV support or are you trying to get this on the order >> of queues instead of VFs. >> >>> Jepser, thanks for providing more specifics. >>> >>> One comment: If you intend to change core code paths or APIs for this, >>> then I think that we should require up front that the associated HW >>> support is protocol agnostic (i.e. HW filters must be programmable and >>> generic ). We don't want a promising feature like this to be >>> undermined by protocol ossification. >> >> At the moment we use ethtool ntuple filters which is basically adding >> a new set of enums and structures every time we need a new protocol >> so its painful and you need your vendor to support you and you need a >> new kernel. >> >> The flow api was shot down (which would get you to the point where >> the user could specify the protocols for the driver to implement e.g. >> put_parse_graph) and the only new proposals I've seen are bpf >> translations in drivers and 'tc'. I plan to take another shot at this in >> net-next. >> >>> >>> Thanks, >>> Tom >>> >>>> [3] https://github.com/luigirizzo/netmap/pull/87 >>>> >> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-25 17:50 ` John Fastabend 2016-01-25 21:32 ` Tom Herbert @ 2016-01-25 22:10 ` Jesper Dangaard Brouer 2016-01-27 20:47 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-25 22:10 UTC (permalink / raw) To: John Fastabend Cc: Tom Herbert, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich, brouer On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend <john.fastabend@gmail.com> wrote: > On 16-01-25 09:09 AM, Tom Herbert wrote: > > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer > > <brouer@redhat.com> wrote: > >> [...] > >> > >> There are two ideas, getting mixed up here. (1) bundling from the > >> RX-ring, (2) allowing to pick up the "packet-page" directly. > >> > >> Bundling (1) is something that seems natural, and which help us > >> amortize the cost between layers (and utilizes icache better). Lets > >> keep that in another thread. > >> > >> This (2) direct forward of "packet-pages" is a fairly extreme idea, > >> BUT it have the potential of being an new integration point for > >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to > >> speed with bypass-solutions. > [...] > > Jesper, at least for you (2) case what are we missing with the > bifurcated/queue splitting work? Are you really after systems > without SR-IOV support or are you trying to get this on the order > of queues instead of VFs. I'm not saying something is missing for bifurcated/queue splitting work. I'm not trying to work-around SR-IOV. This an extreme idea, which I got while looking at the lowest RX layer. 
Before working any further on this idea/path, I need/want to evaluate if it makes sense from a performance point of view. I need to evaluate if "pulling" out these "packet-pages" is fast enough to compete with DPDK/netmap. Else it makes no sense to work on this path. As a first step to evaluate this lowest RX layer, I'm simply hacking the drivers (ixgbe and mlx5) to drop/discard packets within-the-driver. For now, simply replacing napi_gro_receive() with dev_kfree_skb(), and measuring the "RX-drop" performance. The next step was to avoid the skb alloc+free calls, but doing so is more complicated than I first anticipated, as the SKB is tied in fairly heavily. Thus, right now I'm instead hooking in my bulk alloc+free API, as that will remove/mitigate most of the overhead of the kmem_cache/slab-allocators. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-25 22:10 ` Jesper Dangaard Brouer @ 2016-01-27 20:47 ` Jesper Dangaard Brouer 2016-01-27 21:56 ` Alexei Starovoitov 2016-01-28 2:50 ` Tom Herbert 0 siblings, 2 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-27 20:47 UTC (permalink / raw) To: John Fastabend Cc: Tom Herbert, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich, brouer On Mon, 25 Jan 2016 23:10:16 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend <john.fastabend@gmail.com> wrote: > > > On 16-01-25 09:09 AM, Tom Herbert wrote: > > > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer > > > <brouer@redhat.com> wrote: > > >> > [...] > > >> > > >> There are two ideas, getting mixed up here. (1) bundling from the > > >> RX-ring, (2) allowing to pick up the "packet-page" directly. > > >> > > >> Bundling (1) is something that seems natural, and which help us > > >> amortize the cost between layers (and utilizes icache better). Lets > > >> keep that in another thread. > > >> > > >> This (2) direct forward of "packet-pages" is a fairly extreme idea, > > >> BUT it have the potential of being an new integration point for > > >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to > > >> speed with bypass-solutions. > > > [...] > > > > Jesper, at least for you (2) case what are we missing with the > > bifurcated/queue splitting work? Are you really after systems > > without SR-IOV support or are you trying to get this on the order > > of queues instead of VFs. > > I'm not saying something is missing for bifurcated/queue splitting work. 
> I'm not trying to work-around SR-IOV. > > This an extreme idea, which I got while looking at the lowest RX layer. > > > Before working any further on this idea/path, I need/want to evaluate > if it makes sense from a performance point of view. I need to evaluate > if "pulling" out these "packet-pages" is fast enough to compete with > DPDK/netmap. Else it makes no sense to work on this path. > > As a first step to evaluate this lowest RX layer, I'm simply hacking > the drivers (ixgbe and mlx5) to drop/discard packets within-the-driver. > For now, simply replacing napi_gro_receive() with dev_kfree_skb(), and > measuring the "RX-drop" performance. > > Next step was to avoid the skb alloc+free calls, but doing so is more > complicated that I first anticipated, as the SKB is tied in fairly > heavily. Thus, right now I'm instead hooking in my bulk alloc+free > API, as that will remove/mitigate most of the overhead of the > kmem_cache/slab-allocators. I've tried to deduce what kind of speeds we can achieve at this lowest RX layer, by dropping packets directly in the mlx5/100G driver. Just replacing napi_gro_receive() with dev_kfree_skb() was fairly depressing, showing only 6.2 Mpps (6,253,970 pps => 159.9 ns) on a single core. Looking at the perf report showed a major cache-miss in eth_type_trans() (29% / 47 ns).
And the driver is hitting the SLUB slowpath quite badly (because it preallocates SKBs and binds them to the RX ring; usually this test case would hit the SLUB "recycle" fastpath):

Group-report: kmem_cache/SLUB allocator functions ::
 5.00 % ~= 8.0 ns <= __slab_free
 4.91 % ~= 7.9 ns <= cmpxchg_double_slab.isra.65
 4.22 % ~= 6.7 ns <= kmem_cache_alloc
 1.68 % ~= 2.7 ns <= kmem_cache_free
 1.10 % ~= 1.8 ns <= ___slab_alloc
 0.93 % ~= 1.5 ns <= __cmpxchg_double_slab.isra.54
 0.65 % ~= 1.0 ns <= __slab_alloc.isra.74
 0.26 % ~= 0.4 ns <= put_cpu_partial
 Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns

To get around the cache-miss in eth_type_trans(), I created an "icache-loop" in mlx5e_poll_rx_cq() that pulls all RX-ring packets "out" before calling eth_type_trans(), reducing its cost to 2.45%.

To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API. And also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger slab-pages, thus bigger bulk opportunities.

This helped a lot; I can now drop 12 Mpps (12,088,767 pps => 82.7 ns).

Group-report: kmem_cache/SLUB allocator functions ::
 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk
 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk
 0.24 % ~= 0.2 ns <= ___slab_alloc
 0.23 % ~= 0.2 ns <= __slab_free
 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54
 0.17 % ~= 0.1 ns <= cmpxchg_double_slab.isra.65
 0.07 % ~= 0.1 ns <= put_cpu_partial
 0.04 % ~= 0.0 ns <= unfreeze_partials.isra.71
 0.03 % ~= 0.0 ns <= get_partial_node.isra.72
 Sum: 8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns

The full perf report output below the signature is from the optimized case.

SKB-related cost is 22.9 ns; however, 51.7% (11.84 ns) of that cost originates from the memset of the SKB.
Group-report: related to pattern "skb" ::
 17.92 % ~= 14.8 ns <= __napi_alloc_skb <== 80% memset(0) / rep stos
 3.29 % ~= 2.7 ns <= skb_release_data
 2.20 % ~= 1.8 ns <= napi_consume_skb
 1.86 % ~= 1.5 ns <= skb_release_head_state
 1.20 % ~= 1.0 ns <= skb_put
 1.14 % ~= 0.9 ns <= skb_release_all
 0.02 % ~= 0.0 ns <= __kfree_skb_flush
 Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns

Doing a crude extrapolation: 82.7 ns minus the SLUB (7.3 ns) and SKB (22.9 ns) related costs => 52.5 ns, which extrapolates to 19 Mpps as the maximum speed at which we can pull packet-pages off the RX ring.

I don't know if 19 Mpps (52.5 ns "overhead") is fast enough to compete with just mapping an RX HW queue/ring to netmap, or via SR-IOV to DPDK(?)

But it was interesting to see how the lowest RX layer performs...

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Perf-report script:
 * https://github.com/netoptimizer/network-testing/blob/master/bin/perf_report_pps_stats.pl

Report: ALL functions :: 19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq 17.92 % ~= 14.8 ns <= __napi_alloc_skb 9.54 % ~= 7.9 ns <= __free_page_frag 7.16 % ~= 5.9 ns <= mlx5e_get_cqe 6.37 % ~= 5.3 ns <= mlx5e_post_rx_wqes 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk 3.70 % ~= 3.1 ns <= __alloc_page_frag 3.29 % ~= 2.7 ns <= skb_release_data 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk 2.45 % ~= 2.0 ns <= eth_type_trans 2.43 % ~= 2.0 ns <= get_page_from_freelist 2.36 % ~= 2.0 ns <= swiotlb_map_page 2.20 % ~= 1.8 ns <= napi_consume_skb 1.86 % ~= 1.5 ns <= skb_release_head_state 1.25 % ~= 1.0 ns <= free_pages_prepare 1.20 % ~= 1.0 ns <= skb_put 1.14 % ~= 0.9 ns <= skb_release_all 0.77 % ~= 0.6 ns <= __free_pages_ok 0.59 % ~= 0.5 ns <= get_pfnblock_flags_mask 0.59 % ~= 0.5 ns <= swiotlb_dma_mapping_error 0.59 % ~= 0.5 ns <= unmap_single 0.58 % ~= 0.5 ns <= _raw_spin_lock_irqsave 0.57 % ~= 0.5 ns <= free_one_page 0.56 % ~= 0.5 ns <=
swiotlb_unmap_page 0.52 % ~= 0.4 ns <= _raw_spin_lock 0.46 % ~= 0.4 ns <= __mod_zone_page_state 0.36 % ~= 0.3 ns <= __rmqueue 0.36 % ~= 0.3 ns <= net_rx_action 0.34 % ~= 0.3 ns <= __alloc_pages_nodemask 0.31 % ~= 0.3 ns <= __zone_watermark_ok 0.27 % ~= 0.2 ns <= mlx5e_napi_poll 0.24 % ~= 0.2 ns <= ___slab_alloc 0.23 % ~= 0.2 ns <= __slab_free 0.22 % ~= 0.2 ns <= __list_del_entry 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54 0.21 % ~= 0.2 ns <= next_zones_zonelist 0.20 % ~= 0.2 ns <= __list_add 0.17 % ~= 0.1 ns <= __do_softirq 0.17 % ~= 0.1 ns <= cmpxchg_double_slab.isra.65 0.16 % ~= 0.1 ns <= __inc_zone_state 0.12 % ~= 0.1 ns <= _raw_spin_unlock 0.12 % ~= 0.1 ns <= zone_statistics (Percent limit(0.1%) stop at "mlx5e_poll_tx_cq") Sum: 99.45 % => calc: 82.3 ns (sum: 82.3 ns) => Total: 82.7 ns Group-report: related to pattern "eth_type_trans|mlx5|ixgbe|__iowrite64_copy" :: (Driver related) 19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq 7.16 % ~= 5.9 ns <= mlx5e_get_cqe 6.37 % ~= 5.3 ns <= mlx5e_post_rx_wqes 2.45 % ~= 2.0 ns <= eth_type_trans 0.27 % ~= 0.2 ns <= mlx5e_napi_poll 0.09 % ~= 0.1 ns <= mlx5e_poll_tx_cq Sum: 36.05 % => calc: 29.8 ns (sum: 29.8 ns) => Total: 82.7 ns Group-report: DMA functions :: 2.36 % ~= 2.0 ns <= swiotlb_map_page 0.59 % ~= 0.5 ns <= unmap_single 0.59 % ~= 0.5 ns <= swiotlb_dma_mapping_error 0.56 % ~= 0.5 ns <= swiotlb_unmap_page Sum: 4.10 % => calc: 3.4 ns (sum: 3.4 ns) => Total: 82.7 ns Group-report: page_frag_cache functions :: 9.54 % ~= 7.9 ns <= __free_page_frag 3.70 % ~= 3.1 ns <= __alloc_page_frag 2.43 % ~= 2.0 ns <= get_page_from_freelist 1.25 % ~= 1.0 ns <= free_pages_prepare 0.77 % ~= 0.6 ns <= __free_pages_ok 0.59 % ~= 0.5 ns <= get_pfnblock_flags_mask 0.57 % ~= 0.5 ns <= free_one_page 0.46 % ~= 0.4 ns <= __mod_zone_page_state 0.36 % ~= 0.3 ns <= __rmqueue 0.34 % ~= 0.3 ns <= __alloc_pages_nodemask 0.31 % ~= 0.3 ns <= __zone_watermark_ok 0.21 % ~= 0.2 ns <= next_zones_zonelist 0.16 % ~= 0.1 ns <= __inc_zone_state 0.12 % ~= 0.1 ns <= 
zone_statistics 0.02 % ~= 0.0 ns <= mod_zone_page_state Sum: 20.83 % => calc: 17.2 ns (sum: 17.2 ns) => Total: 82.7 ns Group-report: kmem_cache/SLUB allocator functions :: 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk 0.24 % ~= 0.2 ns <= ___slab_alloc 0.23 % ~= 0.2 ns <= __slab_free 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54 0.17 % ~= 0.1 ns <= cmpxchg_double_slab.isra.65 0.07 % ~= 0.1 ns <= put_cpu_partial 0.04 % ~= 0.0 ns <= unfreeze_partials.isra.71 0.03 % ~= 0.0 ns <= get_partial_node.isra.72 Sum: 8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns Group-report: related to pattern "skb" :: 17.92 % ~= 14.8 ns <= __napi_alloc_skb <== 80% memset(0) / rep stos 3.29 % ~= 2.7 ns <= skb_release_data 2.20 % ~= 1.8 ns <= napi_consume_skb 1.86 % ~= 1.5 ns <= skb_release_head_state 1.20 % ~= 1.0 ns <= skb_put 1.14 % ~= 0.9 ns <= skb_release_all 0.02 % ~= 0.0 ns <= __kfree_skb_flush Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns Group-report: Core network-stack functions :: 0.36 % ~= 0.3 ns <= net_rx_action 0.17 % ~= 0.1 ns <= __do_softirq 0.02 % ~= 0.0 ns <= __raise_softirq_irqoff 0.01 % ~= 0.0 ns <= run_ksoftirqd 0.00 % ~= 0.0 ns <= run_timer_softirq 0.00 % ~= 0.0 ns <= ksoftirqd_should_run 0.00 % ~= 0.0 ns <= raise_softirq Sum: 0.56 % => calc: 0.5 ns (sum: 0.5 ns) => Total: 82.7 ns Group-report: GRO network-stack functions :: Sum: 0.00 % => calc: 0.0 ns (sum: 0.0 ns) => Total: 82.7 ns Group-report: related to pattern "spin_.*lock|mutex" :: 0.58 % ~= 0.5 ns <= _raw_spin_lock_irqsave 0.52 % ~= 0.4 ns <= _raw_spin_lock 0.12 % ~= 0.1 ns <= _raw_spin_unlock 0.01 % ~= 0.0 ns <= _raw_spin_unlock_irqrestore 0.00 % ~= 0.0 ns <= __mutex_lock_slowpath 0.00 % ~= 0.0 ns <= _raw_spin_lock_irq Sum: 1.23 % => calc: 1.0 ns (sum: 1.0 ns) => Total: 82.7 ns Negative Report: functions NOT included in group reports:: 0.22 % ~= 0.2 ns <= __list_del_entry 0.20 % ~= 0.2 ns <= __list_add 0.07 % ~= 0.1 ns <= list_del 0.05 % ~= 0.0 ns 
<= native_sched_clock 0.04 % ~= 0.0 ns <= irqtime_account_irq 0.02 % ~= 0.0 ns <= rcu_bh_qs 0.01 % ~= 0.0 ns <= task_tick_fair 0.01 % ~= 0.0 ns <= net_rps_action_and_irq_enable.isra.112 0.01 % ~= 0.0 ns <= perf_event_task_tick 0.01 % ~= 0.0 ns <= apic_timer_interrupt 0.01 % ~= 0.0 ns <= lapic_next_deadline 0.01 % ~= 0.0 ns <= rcu_check_callbacks 0.01 % ~= 0.0 ns <= smpboot_thread_fn 0.01 % ~= 0.0 ns <= irqtime_account_process_tick.isra.3 0.00 % ~= 0.0 ns <= intel_bts_enable_local 0.00 % ~= 0.0 ns <= kthread_should_park 0.00 % ~= 0.0 ns <= native_apic_mem_write 0.00 % ~= 0.0 ns <= hrtimer_forward 0.00 % ~= 0.0 ns <= get_work_pool 0.00 % ~= 0.0 ns <= cpu_startup_entry 0.00 % ~= 0.0 ns <= acct_account_cputime 0.00 % ~= 0.0 ns <= set_next_entity 0.00 % ~= 0.0 ns <= worker_thread 0.00 % ~= 0.0 ns <= dbs_timer_handler 0.00 % ~= 0.0 ns <= delay_tsc 0.00 % ~= 0.0 ns <= idle_cpu 0.00 % ~= 0.0 ns <= timerqueue_add 0.00 % ~= 0.0 ns <= hrtimer_interrupt 0.00 % ~= 0.0 ns <= dbs_work_handler 0.00 % ~= 0.0 ns <= dequeue_entity 0.00 % ~= 0.0 ns <= update_cfs_shares 0.00 % ~= 0.0 ns <= update_fast_timekeeper 0.00 % ~= 0.0 ns <= smp_trace_apic_timer_interrupt 0.00 % ~= 0.0 ns <= __update_cpu_load 0.00 % ~= 0.0 ns <= cpu_needs_another_gp 0.00 % ~= 0.0 ns <= ret_from_intr 0.00 % ~= 0.0 ns <= __intel_pmu_enable_all 0.00 % ~= 0.0 ns <= trigger_load_balance 0.00 % ~= 0.0 ns <= __schedule 0.00 % ~= 0.0 ns <= nsecs_to_jiffies64 0.00 % ~= 0.0 ns <= account_entity_dequeue 0.00 % ~= 0.0 ns <= worker_enter_idle 0.00 % ~= 0.0 ns <= __hrtimer_get_next_event 0.00 % ~= 0.0 ns <= rcu_irq_exit 0.00 % ~= 0.0 ns <= rb_erase 0.00 % ~= 0.0 ns <= __intel_pmu_disable_all 0.00 % ~= 0.0 ns <= tick_sched_do_timer 0.00 % ~= 0.0 ns <= cpuacct_account_field 0.00 % ~= 0.0 ns <= update_wall_time 0.00 % ~= 0.0 ns <= notifier_call_chain 0.00 % ~= 0.0 ns <= timekeeping_update 0.00 % ~= 0.0 ns <= ktime_get_update_offsets_now 0.00 % ~= 0.0 ns <= rb_next 0.00 % ~= 0.0 ns <= rcu_all_qs 0.00 % ~= 0.0 ns <= 
x86_pmu_disable 0.00 % ~= 0.0 ns <= _cond_resched 0.00 % ~= 0.0 ns <= __rcu_read_lock 0.00 % ~= 0.0 ns <= __local_bh_enable 0.00 % ~= 0.0 ns <= update_cpu_load_active 0.00 % ~= 0.0 ns <= x86_pmu_enable 0.00 % ~= 0.0 ns <= insert_work 0.00 % ~= 0.0 ns <= ktime_get 0.00 % ~= 0.0 ns <= __usecs_to_jiffies 0.00 % ~= 0.0 ns <= __acct_update_integrals 0.00 % ~= 0.0 ns <= scheduler_tick 0.00 % ~= 0.0 ns <= update_vsyscall 0.00 % ~= 0.0 ns <= memcpy_erms 0.00 % ~= 0.0 ns <= get_cpu_idle_time_us 0.00 % ~= 0.0 ns <= sched_clock_cpu 0.00 % ~= 0.0 ns <= tick_do_update_jiffies64 0.00 % ~= 0.0 ns <= hrtimer_active 0.00 % ~= 0.0 ns <= profile_tick 0.00 % ~= 0.0 ns <= __hrtimer_run_queues 0.00 % ~= 0.0 ns <= kthread_should_stop 0.00 % ~= 0.0 ns <= run_posix_cpu_timers 0.00 % ~= 0.0 ns <= read_tsc 0.00 % ~= 0.0 ns <= __remove_hrtimer 0.00 % ~= 0.0 ns <= calc_global_load_tick 0.00 % ~= 0.0 ns <= hrtimer_run_queues 0.00 % ~= 0.0 ns <= irq_work_tick 0.00 % ~= 0.0 ns <= cpuacct_charge 0.00 % ~= 0.0 ns <= clockevents_program_event 0.00 % ~= 0.0 ns <= update_blocked_averages Sum: 0.68 % => calc: 0.6 ns (sum: 0.6 ns) => Total: 82.7 ns ^ permalink raw reply [flat|nested] 59+ messages in thread
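The two-stage "icache-loop" restructuring described in the message above can be sketched as follows. All names (mock_pkt, mock_eth_type_trans, rx_poll_batched) are hypothetical stand-ins for the mlx5e descriptor handling and eth_type_trans(); the point is only the loop structure: drain the ring first, then run the data-touching step over the whole batch, so each stage's code stays hot in the icache.

```c
/* Sketch of the "icache-loop": instead of running the full per-packet
 * path once per packet, stage 1 drains the RX ring into a local
 * array (only prefetching data), and stage 2 runs the data-touching
 * step (eth_type_trans() in the real driver) over the whole batch. */
#include <assert.h>

#define RX_BATCH 64

struct mock_pkt { int proto; int touched; };

/* Stage-2 work: stand-in for eth_type_trans(), the first place the
 * packet-data cache line is actually read. */
static void mock_eth_type_trans(struct mock_pkt *p)
{
    p->proto = 0x0800;       /* pretend ETH_P_IP */
    p->touched = 1;
}

static int rx_poll_batched(struct mock_pkt *ring, int avail)
{
    struct mock_pkt *batch[RX_BATCH];
    int n = 0;

    /* Stage 1: pull descriptors out of the ring; in the real driver
     * this would issue prefetches but not touch packet data yet. */
    for (int i = 0; i < avail && n < RX_BATCH; i++)
        batch[n++] = &ring[i];

    /* Stage 2: run the data-touching step for every packet in the
     * batch back-to-back, with its code hot in the icache. */
    for (int i = 0; i < n; i++)
        mock_eth_type_trans(batch[i]);

    return n;
}
```

Letting the batch size follow the packets available in the RX ring amortizes the icache cost automatically under load, which is the effect the 6.2 Mpps -> 12 Mpps numbers above demonstrate.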
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-27 20:47 ` Jesper Dangaard Brouer @ 2016-01-27 21:56 ` Alexei Starovoitov 2016-01-28 9:52 ` Jesper Dangaard Brouer 2016-01-28 2:50 ` Tom Herbert 1 sibling, 1 reply; 59+ messages in thread From: Alexei Starovoitov @ 2016-01-27 21:56 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: John Fastabend, Tom Herbert, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Wed, Jan 27, 2016 at 09:47:50PM +0100, Jesper Dangaard Brouer wrote: > Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns > > To get around the cache-miss in eth_type_trans(), I created a > "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out", > before calling eth_type_trans(), reducing cost to 2.45%. > > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API . And > also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger > slab-pages, thus bigger bulk opportunities. > > This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns). great stuff. I think such batching loop will reduce the cost of eth_type_trans() for all use cases. Only unfortunate that it would need to be implemented in every driver, but there is only a handful that people care about in high performance setups, so I think it's worth getting this patch in for mlx5 and the other drivers will catch up. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-27 21:56 ` Alexei Starovoitov @ 2016-01-28 9:52 ` Jesper Dangaard Brouer 2016-01-28 12:54 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-28 9:52 UTC (permalink / raw) To: Alexei Starovoitov Cc: John Fastabend, Tom Herbert, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich, brouer On Wed, 27 Jan 2016 13:56:03 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Wed, Jan 27, 2016 at 09:47:50PM +0100, Jesper Dangaard Brouer wrote: > > Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns > > > > To get around the cache-miss in eth_type_trans(), I created a > > "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out", > > before calling eth_type_trans(), reducing cost to 2.45%. > > > > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API . And > > also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger > > slab-pages, thus bigger bulk opportunities. > > > > This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns). > > great stuff. I think such batching loop will reduce the cost of > eth_type_trans() for all use cases. > Only unfortunate that it would need to be implemented in every driver, > but there is only a handful that people care about in high performance > setups, so I think it's worth getting this patch in for mlx5 and > the other drivers will catch up. I'm still in flux/undecided how long we should delay the first touching of pkt-data, which happens when calling eth_type_trans(). Should it stay in the driver or not(?). 
In the extreme case, to optimize for RPS sending to remote CPUs, delay calling eth_type_trans() as long as possible:

1. In the driver, only start prefetching data into the L2/L3 cache
2. Stack calls get_rps_cpu() and assumes skb_get_hash() has the HW hash
3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue
4. On the remote CPU, process_backlog calls eth_type_trans() on sd->input_pkt_queue

On the other hand, if the HW desc can provide skb->proto, and we can lazy-eval skb->pkt_type, then it is okay to keep that responsibility in the driver (as the call to eth_type_trans() basically disappears).

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 59+ messages in thread
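The four-step RPS idea above can be sketched as below, with hypothetical stand-in names (mock_get_rps_cpu, mock_bulk_enqueue, mock_process_backlog) for get_rps_cpu(), the sd->input_pkt_queue enqueue and process_backlog(): the RX CPU never reads packet data (steering uses only the HW RSS hash), and the eth_type_trans()-equivalent first data touch happens on the remote CPU.

```c
/* Sketch of deferring the first packet-data touch until the remote
 * CPU's backlog processing.  All types/names are mock stand-ins. */
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4
#define QLEN    256

struct mock_pkt { unsigned int hw_hash; int proto; };

/* Per-remote-CPU input queue: stand-in for sd->input_pkt_queue. */
static struct mock_pkt *backlog[NR_CPUS][QLEN];
static int backlog_len[NR_CPUS];

/* Step 2: get_rps_cpu() equivalent, trusting the HW RSS hash so no
 * packet data is read on the RX CPU. */
static int mock_get_rps_cpu(const struct mock_pkt *p)
{
    return (int)(p->hw_hash % NR_CPUS);
}

/* Step 3: bulk enqueue onto the remote CPU's backlog. */
static void mock_bulk_enqueue(struct mock_pkt **pkts, int n)
{
    for (int i = 0; i < n; i++) {
        int cpu = mock_get_rps_cpu(pkts[i]);
        if (backlog_len[cpu] < QLEN)
            backlog[cpu][backlog_len[cpu]++] = pkts[i];
    }
}

/* Step 4: on the remote CPU, process_backlog() finally touches the
 * packet data (eth_type_trans() in the real stack). */
static int mock_process_backlog(int cpu)
{
    int n = backlog_len[cpu];
    for (int i = 0; i < n; i++)
        backlog[cpu][i]->proto = 0x0800;  /* first data touch */
    backlog_len[cpu] = 0;
    return n;
}
```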
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-28 9:52 ` Jesper Dangaard Brouer @ 2016-01-28 12:54 ` Eric Dumazet 2016-01-28 13:25 ` Eric Dumazet 2016-01-28 16:43 ` Tom Herbert 2 siblings, 0 replies; 59+ messages in thread From: Eric Dumazet @ 2016-01-28 12:54 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Alexei Starovoitov, John Fastabend, Tom Herbert, Michael S. Tsirkin, David Miller, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Thu, 2016-01-28 at 10:52 +0100, Jesper Dangaard Brouer wrote: > I'm still in flux/undecided how long we should delay the first touching > of pkt-data, which happens when calling eth_type_trans(). Should it > stay in the driver or not(?). > > In the extreme case, for optimize for RPS sending to remote CPUs, delay > calling eth_type_trans() as long as possible. > > 1. In driver only start prefetch data to L2/L3 cache > 2. Stack calls get_rps_cpu() and assume skb_get_hash() have HW hash > 3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue > 4. On remote CPU in process_backlog call eth_type_trans() on sd->input_pkt_queue > > > On the other hand, if the HW desc can provide skb->proto, and we can > lazy eval skb->pkt_type, then it is okay to keep that responsibility in > the driver (as the call to eth_type_trans() basically disappears). Delaying means GRO won't be able to recycle its super-hot skb (see napi_get_frags()). You might optimize the reception of packets in the router case (poor GRO aggregation rate), but you'll slow down GRO efficiency when receiving nice GRO trains.
When we receive a train of 10 MSS, driver keeps using the same sk_buff, very hot in its L1 (This was the original idea of build_skb() to get nice cache locality for the metadata, since it is 4 cache lines per sk_buff) Now most drivers have no clue why it is important to allocate the skb _after_ receiving the ethernet frame and not in advance. (The lazy drivers allocate ~1024 skbs to prefill their ~1024 slot RX ring) ^ permalink raw reply [flat|nested] 59+ messages in thread
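Eric's point about allocating the skb only after the frame has arrived can be illustrated with a minimal mock (mock_skb and mock_build_skb() are hypothetical stand-ins for the real struct sk_buff and build_skb()): the metadata is written only once the (simulated) DMA has completed, so its cache lines are written while hot, exactly when the driver is about to use them, rather than going cold in a pre-filled RX ring.

```c
/* Sketch of the build_skb()-style "allocate metadata late" pattern,
 * using mock stand-ins for the real kernel structures. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct mock_skb { size_t len; int proto; unsigned char *data; };

/* "Lazy" style: the frame buffer has already been filled by
 * (simulated) DMA; the metadata is created only now, so its cache
 * lines are hot, and a recycled skb (cf. napi_get_frags()) can stay
 * in L1 across a GRO train. */
static struct mock_skb *mock_build_skb(unsigned char *frame, size_t len)
{
    struct mock_skb *skb = malloc(sizeof(*skb));
    if (!skb)
        return NULL;
    memset(skb, 0, sizeof(*skb));   /* the memset the profile blames */
    skb->data = frame;              /* attach, don't copy, the frame */
    skb->len = len;
    return skb;
}
```

The contrast is with the "eager" drivers Eric criticizes, which prefill ~1024 skbs into the RX ring: by the time a slot's descriptor completes, that skb's metadata has long since been evicted from L1.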
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-28 9:52 ` Jesper Dangaard Brouer 2016-01-28 12:54 ` Eric Dumazet @ 2016-01-28 13:25 ` Eric Dumazet 2016-01-28 16:43 ` Tom Herbert 2 siblings, 0 replies; 59+ messages in thread From: Eric Dumazet @ 2016-01-28 13:25 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Alexei Starovoitov, John Fastabend, Tom Herbert, Michael S. Tsirkin, David Miller, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Thu, 2016-01-28 at 10:52 +0100, Jesper Dangaard Brouer wrote: > I'm still in flux/undecided how long we should delay the first touching > of pkt-data, which happens when calling eth_type_trans(). Should it > stay in the driver or not(?). Some cpus have limited prefetch capabilities. Sometimes, prefetches need to be spaced, otherwise they are ignored. A driver author might be tempted to 'optimize' its rx handler for few cpus. Also, removing eth_type_trans() from the drivers would require quite some work, but would be generic and certainly helpful. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-28 9:52 ` Jesper Dangaard Brouer 2016-01-28 12:54 ` Eric Dumazet 2016-01-28 13:25 ` Eric Dumazet @ 2016-01-28 16:43 ` Tom Herbert 2 siblings, 0 replies; 59+ messages in thread From: Tom Herbert @ 2016-01-28 16:43 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Alexei Starovoitov, John Fastabend, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Thu, Jan 28, 2016 at 1:52 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Wed, 27 Jan 2016 13:56:03 -0800 > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > >> On Wed, Jan 27, 2016 at 09:47:50PM +0100, Jesper Dangaard Brouer wrote: >> > Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns >> > >> > To get around the cache-miss in eth_type_trans(), I created a >> > "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out", >> > before calling eth_type_trans(), reducing cost to 2.45%. >> > >> > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API . And >> > also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger >> > slab-pages, thus bigger bulk opportunities. >> > >> > This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns). >> >> great stuff. I think such batching loop will reduce the cost of >> eth_type_trans() for all use cases. >> Only unfortunate that it would need to be implemented in every driver, >> but there is only a handful that people care about in high performance >> setups, so I think it's worth getting this patch in for mlx5 and >> the other drivers will catch up. > > I'm still in flux/undecided how long we should delay the first touching > of pkt-data, which happens when calling eth_type_trans(). 
Should it > stay in the driver or not(?). > > In the extreme case, for optimize for RPS sending to remote CPUs, delay > calling eth_type_trans() as long as possible. > > 1. In driver only start prefetch data to L2/L3 cache > 2. Stack calls get_rps_cpu() and assume skb_get_hash() have HW hash > 3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue > 4. On remote CPU in process_backlog call eth_type_trans() on sd->input_pkt_queue > There is also GRO to consider which still might be better to do before packet steering? One thing that could be exploited is that we probably don't need to look at packet data for GRO until we get a second packet that matches the hash. > > On the other hand, if the HW desc can provide skb->proto, and we can > lazy eval skb->pkt_type, then it is okay to keep that responsibility in > the driver (as the call to eth_type_trans() basically disappears). > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-27 20:47 ` Jesper Dangaard Brouer 2016-01-27 21:56 ` Alexei Starovoitov @ 2016-01-28 2:50 ` Tom Herbert 2016-01-28 9:25 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 59+ messages in thread From: Tom Herbert @ 2016-01-28 2:50 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: John Fastabend, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Wed, Jan 27, 2016 at 12:47 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Mon, 25 Jan 2016 23:10:16 +0100 > Jesper Dangaard Brouer <brouer@redhat.com> wrote: > >> On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend <john.fastabend@gmail.com> wrote: >> >> > On 16-01-25 09:09 AM, Tom Herbert wrote: >> > > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer >> > > <brouer@redhat.com> wrote: >> > >> >> [...] >> > >> >> > >> There are two ideas, getting mixed up here. (1) bundling from the >> > >> RX-ring, (2) allowing to pick up the "packet-page" directly. >> > >> >> > >> Bundling (1) is something that seems natural, and which help us >> > >> amortize the cost between layers (and utilizes icache better). Lets >> > >> keep that in another thread. >> > >> >> > >> This (2) direct forward of "packet-pages" is a fairly extreme idea, >> > >> BUT it have the potential of being an new integration point for >> > >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to >> > >> speed with bypass-solutions. >> > >> [...] >> > >> > Jesper, at least for you (2) case what are we missing with the >> > bifurcated/queue splitting work? Are you really after systems >> > without SR-IOV support or are you trying to get this on the order >> > of queues instead of VFs. 
>> >> I'm not saying something is missing for bifurcated/queue splitting work. >> I'm not trying to work-around SR-IOV. >> >> This an extreme idea, which I got while looking at the lowest RX layer. >> >> >> Before working any further on this idea/path, I need/want to evaluate >> if it makes sense from a performance point of view. I need to evaluate >> if "pulling" out these "packet-pages" is fast enough to compete with >> DPDK/netmap. Else it makes no sense to work on this path. >> >> As a first step to evaluate this lowest RX layer, I'm simply hacking >> the drivers (ixgbe and mlx5) to drop/discard packets within-the-driver. >> For now, simply replacing napi_gro_receive() with dev_kfree_skb(), and >> measuring the "RX-drop" performance. >> >> Next step was to avoid the skb alloc+free calls, but doing so is more >> complicated that I first anticipated, as the SKB is tied in fairly >> heavily. Thus, right now I'm instead hooking in my bulk alloc+free >> API, as that will remove/mitigate most of the overhead of the >> kmem_cache/slab-allocators. > > I've tried to deduct that kind of speeds we can achieve, at this lowest > RX layer. By in the mlx5/100G driver drop packets directly in the driver. > Just replacing replacing napi_gro_receive() with dev_kfree_skb(), was > fairly depressing, showing only 6.2Mpps (6253970 pps => 159.9 ns) (single core). > > Looking at the perf report showed major cache-miss in eth_type_trans(29%/47ns). 
> > And driver is hitting the SLUB slowpath quite badly (because it > prealloc SKBs and binds to RX ring, usually this test case would hits > SLUB "recycle" fastpath): > > Group-report: kmem_cache/SLUB allocator functions :: > 5.00 % ~= 8.0 ns <= __slab_free > 4.91 % ~= 7.9 ns <= cmpxchg_double_slab.isra.65 > 4.22 % ~= 6.7 ns <= kmem_cache_alloc > 1.68 % ~= 2.7 ns <= kmem_cache_free > 1.10 % ~= 1.8 ns <= ___slab_alloc > 0.93 % ~= 1.5 ns <= __cmpxchg_double_slab.isra.54 > 0.65 % ~= 1.0 ns <= __slab_alloc.isra.74 > 0.26 % ~= 0.4 ns <= put_cpu_partial > Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns > > To get around the cache-miss in eth_type_trans(), I created a > "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out", > before calling eth_type_trans(), reducing cost to 2.45%. > > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API . And > also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger > slab-pages, thus bigger bulk opportunities. > > This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns). > > Group-report: kmem_cache/SLUB allocator functions :: > 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk > 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk > 0.24 % ~= 0.2 ns <= ___slab_alloc > 0.23 % ~= 0.2 ns <= __slab_free > 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54 > 0.17 % ~= 0.1 ns <= cmpxchg_double_slab.isra.65 > 0.07 % ~= 0.1 ns <= put_cpu_partial > 0.04 % ~= 0.0 ns <= unfreeze_partials.isra.71 > 0.03 % ~= 0.0 ns <= get_partial_node.isra.72 > Sum: 8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns > > Full perf report output below signature, is from optimized case. > > SKB related cost is 22.9 ns. However 51.7% (11.84ns) cost originates > from memset of the SKB. 
> > Group-report: related to pattern "skb" :: > 17.92 % ~= 14.8 ns <= __napi_alloc_skb <== 80% memset(0) / rep stos > 3.29 % ~= 2.7 ns <= skb_release_data > 2.20 % ~= 1.8 ns <= napi_consume_skb > 1.86 % ~= 1.5 ns <= skb_release_head_state > 1.20 % ~= 1.0 ns <= skb_put > 1.14 % ~= 0.9 ns <= skb_release_all > 0.02 % ~= 0.0 ns <= __kfree_skb_flush > Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns > > Doing a crude extrapolation, 82.7 ns subtract, SLUB (7.3 ns) and SKB > (22.9 ns) related => 52.5 ns -> extrapolate 19 Mpps would be the > maximum speed we can pull off packet-pages from the RX ring. > > I don't know if 19Mpps (52.5 ns "overhead") is fast enough, to compete > with just mapping a RX HW queue/ring to netmap or via SR-IOV to DPDK(?) > > But it was interesting to see how the lowest RX layer performs... Cool stuff! Looking at the typical driver receive path, I wonder if we should break netif_receive_skb (napi_gro_receive) into two parts: one utility function to create a list of received skbs and prefetch the data, called as the ring is processed, and another to give the list to the stack (e.g. netif_receive_skbs) and defer eth_type_trans as long as possible. Is something like this what you are contemplating?
Tom > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer > > > Perf-report script: > * https://github.com/netoptimizer/network-testing/blob/master/bin/perf_report_pps_stats.pl > > Report: ALL functions :: > 19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq > 17.92 % ~= 14.8 ns <= __napi_alloc_skb > 9.54 % ~= 7.9 ns <= __free_page_frag > 7.16 % ~= 5.9 ns <= mlx5e_get_cqe > 6.37 % ~= 5.3 ns <= mlx5e_post_rx_wqes > 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk > 3.70 % ~= 3.1 ns <= __alloc_page_frag > 3.29 % ~= 2.7 ns <= skb_release_data > 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk > 2.45 % ~= 2.0 ns <= eth_type_trans > 2.43 % ~= 2.0 ns <= get_page_from_freelist > 2.36 % ~= 2.0 ns <= swiotlb_map_page > 2.20 % ~= 1.8 ns <= napi_consume_skb > 1.86 % ~= 1.5 ns <= skb_release_head_state > 1.25 % ~= 1.0 ns <= free_pages_prepare > 1.20 % ~= 1.0 ns <= skb_put > 1.14 % ~= 0.9 ns <= skb_release_all > 0.77 % ~= 0.6 ns <= __free_pages_ok > 0.59 % ~= 0.5 ns <= get_pfnblock_flags_mask > 0.59 % ~= 0.5 ns <= swiotlb_dma_mapping_error > 0.59 % ~= 0.5 ns <= unmap_single > 0.58 % ~= 0.5 ns <= _raw_spin_lock_irqsave > 0.57 % ~= 0.5 ns <= free_one_page > 0.56 % ~= 0.5 ns <= swiotlb_unmap_page > 0.52 % ~= 0.4 ns <= _raw_spin_lock > 0.46 % ~= 0.4 ns <= __mod_zone_page_state > 0.36 % ~= 0.3 ns <= __rmqueue > 0.36 % ~= 0.3 ns <= net_rx_action > 0.34 % ~= 0.3 ns <= __alloc_pages_nodemask > 0.31 % ~= 0.3 ns <= __zone_watermark_ok > 0.27 % ~= 0.2 ns <= mlx5e_napi_poll > 0.24 % ~= 0.2 ns <= ___slab_alloc > 0.23 % ~= 0.2 ns <= __slab_free > 0.22 % ~= 0.2 ns <= __list_del_entry > 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54 > 0.21 % ~= 0.2 ns <= next_zones_zonelist > 0.20 % ~= 0.2 ns <= __list_add > 0.17 % ~= 0.1 ns <= __do_softirq > 0.17 % ~= 0.1 ns <= cmpxchg_double_slab.isra.65 > 0.16 % ~= 0.1 ns <= __inc_zone_state > 0.12 % ~= 0.1 ns <= _raw_spin_unlock > 0.12 % ~= 0.1 ns <= 
zone_statistics > (Percent limit(0.1%) stop at "mlx5e_poll_tx_cq") > Sum: 99.45 % => calc: 82.3 ns (sum: 82.3 ns) => Total: 82.7 ns > > Group-report: related to pattern "eth_type_trans|mlx5|ixgbe|__iowrite64_copy" :: > (Driver related) > 19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq > 7.16 % ~= 5.9 ns <= mlx5e_get_cqe > 6.37 % ~= 5.3 ns <= mlx5e_post_rx_wqes > 2.45 % ~= 2.0 ns <= eth_type_trans > 0.27 % ~= 0.2 ns <= mlx5e_napi_poll > 0.09 % ~= 0.1 ns <= mlx5e_poll_tx_cq > Sum: 36.05 % => calc: 29.8 ns (sum: 29.8 ns) => Total: 82.7 ns > > Group-report: DMA functions :: > 2.36 % ~= 2.0 ns <= swiotlb_map_page > 0.59 % ~= 0.5 ns <= unmap_single > 0.59 % ~= 0.5 ns <= swiotlb_dma_mapping_error > 0.56 % ~= 0.5 ns <= swiotlb_unmap_page > Sum: 4.10 % => calc: 3.4 ns (sum: 3.4 ns) => Total: 82.7 ns > > Group-report: page_frag_cache functions :: > 9.54 % ~= 7.9 ns <= __free_page_frag > 3.70 % ~= 3.1 ns <= __alloc_page_frag > 2.43 % ~= 2.0 ns <= get_page_from_freelist > 1.25 % ~= 1.0 ns <= free_pages_prepare > 0.77 % ~= 0.6 ns <= __free_pages_ok > 0.59 % ~= 0.5 ns <= get_pfnblock_flags_mask > 0.57 % ~= 0.5 ns <= free_one_page > 0.46 % ~= 0.4 ns <= __mod_zone_page_state > 0.36 % ~= 0.3 ns <= __rmqueue > 0.34 % ~= 0.3 ns <= __alloc_pages_nodemask > 0.31 % ~= 0.3 ns <= __zone_watermark_ok > 0.21 % ~= 0.2 ns <= next_zones_zonelist > 0.16 % ~= 0.1 ns <= __inc_zone_state > 0.12 % ~= 0.1 ns <= zone_statistics > 0.02 % ~= 0.0 ns <= mod_zone_page_state > Sum: 20.83 % => calc: 17.2 ns (sum: 17.2 ns) => Total: 82.7 ns > > Group-report: kmem_cache/SLUB allocator functions :: > 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk > 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk > 0.24 % ~= 0.2 ns <= ___slab_alloc > 0.23 % ~= 0.2 ns <= __slab_free > 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54 > 0.17 % ~= 0.1 ns <= cmpxchg_double_slab.isra.65 > 0.07 % ~= 0.1 ns <= put_cpu_partial > 0.04 % ~= 0.0 ns <= unfreeze_partials.isra.71 > 0.03 % ~= 0.0 ns <= get_partial_node.isra.72 > Sum: 8.85 % => calc: 7.3 ns (sum: 
7.3 ns) => Total: 82.7 ns > > Group-report: related to pattern "skb" :: > 17.92 % ~= 14.8 ns <= __napi_alloc_skb <== 80% memset(0) / rep stos > 3.29 % ~= 2.7 ns <= skb_release_data > 2.20 % ~= 1.8 ns <= napi_consume_skb > 1.86 % ~= 1.5 ns <= skb_release_head_state > 1.20 % ~= 1.0 ns <= skb_put > 1.14 % ~= 0.9 ns <= skb_release_all > 0.02 % ~= 0.0 ns <= __kfree_skb_flush > Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns > > Group-report: Core network-stack functions :: > 0.36 % ~= 0.3 ns <= net_rx_action > 0.17 % ~= 0.1 ns <= __do_softirq > 0.02 % ~= 0.0 ns <= __raise_softirq_irqoff > 0.01 % ~= 0.0 ns <= run_ksoftirqd > 0.00 % ~= 0.0 ns <= run_timer_softirq > 0.00 % ~= 0.0 ns <= ksoftirqd_should_run > 0.00 % ~= 0.0 ns <= raise_softirq > Sum: 0.56 % => calc: 0.5 ns (sum: 0.5 ns) => Total: 82.7 ns > > Group-report: GRO network-stack functions :: > Sum: 0.00 % => calc: 0.0 ns (sum: 0.0 ns) => Total: 82.7 ns > > Group-report: related to pattern "spin_.*lock|mutex" :: > 0.58 % ~= 0.5 ns <= _raw_spin_lock_irqsave > 0.52 % ~= 0.4 ns <= _raw_spin_lock > 0.12 % ~= 0.1 ns <= _raw_spin_unlock > 0.01 % ~= 0.0 ns <= _raw_spin_unlock_irqrestore > 0.00 % ~= 0.0 ns <= __mutex_lock_slowpath > 0.00 % ~= 0.0 ns <= _raw_spin_lock_irq > Sum: 1.23 % => calc: 1.0 ns (sum: 1.0 ns) => Total: 82.7 ns > > Negative Report: functions NOT included in group reports:: > 0.22 % ~= 0.2 ns <= __list_del_entry > 0.20 % ~= 0.2 ns <= __list_add > 0.07 % ~= 0.1 ns <= list_del > 0.05 % ~= 0.0 ns <= native_sched_clock > 0.04 % ~= 0.0 ns <= irqtime_account_irq > 0.02 % ~= 0.0 ns <= rcu_bh_qs > 0.01 % ~= 0.0 ns <= task_tick_fair > 0.01 % ~= 0.0 ns <= net_rps_action_and_irq_enable.isra.112 > 0.01 % ~= 0.0 ns <= perf_event_task_tick > 0.01 % ~= 0.0 ns <= apic_timer_interrupt > 0.01 % ~= 0.0 ns <= lapic_next_deadline > 0.01 % ~= 0.0 ns <= rcu_check_callbacks > 0.01 % ~= 0.0 ns <= smpboot_thread_fn > 0.01 % ~= 0.0 ns <= irqtime_account_process_tick.isra.3 > 0.00 % ~= 0.0 ns <= 
intel_bts_enable_local > 0.00 % ~= 0.0 ns <= kthread_should_park > 0.00 % ~= 0.0 ns <= native_apic_mem_write > 0.00 % ~= 0.0 ns <= hrtimer_forward > 0.00 % ~= 0.0 ns <= get_work_pool > 0.00 % ~= 0.0 ns <= cpu_startup_entry > 0.00 % ~= 0.0 ns <= acct_account_cputime > 0.00 % ~= 0.0 ns <= set_next_entity > 0.00 % ~= 0.0 ns <= worker_thread > 0.00 % ~= 0.0 ns <= dbs_timer_handler > 0.00 % ~= 0.0 ns <= delay_tsc > 0.00 % ~= 0.0 ns <= idle_cpu > 0.00 % ~= 0.0 ns <= timerqueue_add > 0.00 % ~= 0.0 ns <= hrtimer_interrupt > 0.00 % ~= 0.0 ns <= dbs_work_handler > 0.00 % ~= 0.0 ns <= dequeue_entity > 0.00 % ~= 0.0 ns <= update_cfs_shares > 0.00 % ~= 0.0 ns <= update_fast_timekeeper > 0.00 % ~= 0.0 ns <= smp_trace_apic_timer_interrupt > 0.00 % ~= 0.0 ns <= __update_cpu_load > 0.00 % ~= 0.0 ns <= cpu_needs_another_gp > 0.00 % ~= 0.0 ns <= ret_from_intr > 0.00 % ~= 0.0 ns <= __intel_pmu_enable_all > 0.00 % ~= 0.0 ns <= trigger_load_balance > 0.00 % ~= 0.0 ns <= __schedule > 0.00 % ~= 0.0 ns <= nsecs_to_jiffies64 > 0.00 % ~= 0.0 ns <= account_entity_dequeue > 0.00 % ~= 0.0 ns <= worker_enter_idle > 0.00 % ~= 0.0 ns <= __hrtimer_get_next_event > 0.00 % ~= 0.0 ns <= rcu_irq_exit > 0.00 % ~= 0.0 ns <= rb_erase > 0.00 % ~= 0.0 ns <= __intel_pmu_disable_all > 0.00 % ~= 0.0 ns <= tick_sched_do_timer > 0.00 % ~= 0.0 ns <= cpuacct_account_field > 0.00 % ~= 0.0 ns <= update_wall_time > 0.00 % ~= 0.0 ns <= notifier_call_chain > 0.00 % ~= 0.0 ns <= timekeeping_update > 0.00 % ~= 0.0 ns <= ktime_get_update_offsets_now > 0.00 % ~= 0.0 ns <= rb_next > 0.00 % ~= 0.0 ns <= rcu_all_qs > 0.00 % ~= 0.0 ns <= x86_pmu_disable > 0.00 % ~= 0.0 ns <= _cond_resched > 0.00 % ~= 0.0 ns <= __rcu_read_lock > 0.00 % ~= 0.0 ns <= __local_bh_enable > 0.00 % ~= 0.0 ns <= update_cpu_load_active > 0.00 % ~= 0.0 ns <= x86_pmu_enable > 0.00 % ~= 0.0 ns <= insert_work > 0.00 % ~= 0.0 ns <= ktime_get > 0.00 % ~= 0.0 ns <= __usecs_to_jiffies > 0.00 % ~= 0.0 ns <= __acct_update_integrals > 0.00 % ~= 0.0 ns <= 
scheduler_tick > 0.00 % ~= 0.0 ns <= update_vsyscall > 0.00 % ~= 0.0 ns <= memcpy_erms > 0.00 % ~= 0.0 ns <= get_cpu_idle_time_us > 0.00 % ~= 0.0 ns <= sched_clock_cpu > 0.00 % ~= 0.0 ns <= tick_do_update_jiffies64 > 0.00 % ~= 0.0 ns <= hrtimer_active > 0.00 % ~= 0.0 ns <= profile_tick > 0.00 % ~= 0.0 ns <= __hrtimer_run_queues > 0.00 % ~= 0.0 ns <= kthread_should_stop > 0.00 % ~= 0.0 ns <= run_posix_cpu_timers > 0.00 % ~= 0.0 ns <= read_tsc > 0.00 % ~= 0.0 ns <= __remove_hrtimer > 0.00 % ~= 0.0 ns <= calc_global_load_tick > 0.00 % ~= 0.0 ns <= hrtimer_run_queues > 0.00 % ~= 0.0 ns <= irq_work_tick > 0.00 % ~= 0.0 ns <= cpuacct_charge > 0.00 % ~= 0.0 ns <= clockevents_program_event > 0.00 % ~= 0.0 ns <= update_blocked_averages > Sum: 0.68 % => calc: 0.6 ns (sum: 0.6 ns) => Total: 82.7 ns > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-28 2:50 ` Tom Herbert @ 2016-01-28 9:25 ` Jesper Dangaard Brouer 2016-01-28 12:45 ` Eric Dumazet 0 siblings, 1 reply; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-28 9:25 UTC (permalink / raw) To: Tom Herbert Cc: John Fastabend, Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich, brouer On Wed, 27 Jan 2016 18:50:27 -0800 Tom Herbert <tom@herbertland.com> wrote: > On Wed, Jan 27, 2016 at 12:47 PM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > On Mon, 25 Jan 2016 23:10:16 +0100 > > Jesper Dangaard Brouer <brouer@redhat.com> wrote: > > > >> On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend <john.fastabend@gmail.com> wrote: > >> > >> > On 16-01-25 09:09 AM, Tom Herbert wrote: > >> > > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer > >> > > <brouer@redhat.com> wrote: > >> > >> > >> [...] > >> > >> > >> > >> There are two ideas, getting mixed up here. (1) bundling from the > >> > >> RX-ring, (2) allowing to pick up the "packet-page" directly. > >> > >> > >> > >> Bundling (1) is something that seems natural, and which help us > >> > >> amortize the cost between layers (and utilizes icache better). Lets > >> > >> keep that in another thread. > >> > >> > >> > >> This (2) direct forward of "packet-pages" is a fairly extreme idea, > >> > >> BUT it have the potential of being an new integration point for > >> > >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to > >> > >> speed with bypass-solutions. > >> > > >> [...] > >> > > >> > Jesper, at least for you (2) case what are we missing with the > >> > bifurcated/queue splitting work? 
Are you really after systems > >> > without SR-IOV support or are you trying to get this on the order > >> > of queues instead of VFs? > >> > >> I'm not saying something is missing for bifurcated/queue splitting work. > >> I'm not trying to work-around SR-IOV. > >> > >> This is an extreme idea, which I got while looking at the lowest RX layer. > >> > >> > >> Before working any further on this idea/path, I need/want to evaluate > >> if it makes sense from a performance point of view. I need to evaluate > >> if "pulling" out these "packet-pages" is fast enough to compete with > >> DPDK/netmap. Else it makes no sense to work on this path. > >> > >> As a first step to evaluate this lowest RX layer, I'm simply hacking > >> the drivers (ixgbe and mlx5) to drop/discard packets within-the-driver. > >> For now, simply replacing napi_gro_receive() with dev_kfree_skb(), and > >> measuring the "RX-drop" performance. > >> > >> Next step was to avoid the skb alloc+free calls, but doing so is more > >> complicated than I first anticipated, as the SKB is tied in fairly > >> heavily. Thus, right now I'm instead hooking in my bulk alloc+free > >> API, as that will remove/mitigate most of the overhead of the > >> kmem_cache/slab-allocators. > > > > I've tried to deduce what kind of speeds we can achieve at this lowest > > RX layer, by dropping packets directly in the mlx5/100G driver. > > Just replacing napi_gro_receive() with dev_kfree_skb() was > > fairly depressing, showing only 6.2Mpps (6253970 pps => 159.9 ns) (single core). > > > > Looking at the perf report showed major cache-miss in eth_type_trans(29%/47ns). 
> > > > And the driver is hitting the SLUB slowpath quite badly (because it > > preallocs SKBs and binds them to the RX ring; usually this test case would hit > > the SLUB "recycle" fastpath): > > > > Group-report: kmem_cache/SLUB allocator functions :: > > 5.00 % ~= 8.0 ns <= __slab_free > > 4.91 % ~= 7.9 ns <= cmpxchg_double_slab.isra.65 > > 4.22 % ~= 6.7 ns <= kmem_cache_alloc > > 1.68 % ~= 2.7 ns <= kmem_cache_free > > 1.10 % ~= 1.8 ns <= ___slab_alloc > > 0.93 % ~= 1.5 ns <= __cmpxchg_double_slab.isra.54 > > 0.65 % ~= 1.0 ns <= __slab_alloc.isra.74 > > 0.26 % ~= 0.4 ns <= put_cpu_partial > > Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns > > > > To get around the cache-miss in eth_type_trans(), I created an > > "icache-loop" in mlx5e_poll_rx_cq() and pulled all RX-ring packets "out" > > before calling eth_type_trans(), reducing its cost to 2.45%. > > > > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API. And > > also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger > > slab-pages, thus bigger bulk opportunities. > > > > This helped a lot; I can now drop 12Mpps (12,088,767 => 82.7 ns). > > > > Group-report: kmem_cache/SLUB allocator functions :: > > 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk > > 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk > > 0.24 % ~= 0.2 ns <= ___slab_alloc > > 0.23 % ~= 0.2 ns <= __slab_free > > 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54 > > 0.17 % ~= 0.1 ns <= cmpxchg_double_slab.isra.65 > > 0.07 % ~= 0.1 ns <= put_cpu_partial > > 0.04 % ~= 0.0 ns <= unfreeze_partials.isra.71 > > 0.03 % ~= 0.0 ns <= get_partial_node.isra.72 > > Sum: 8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns > > > > The full perf report output below my signature is from the optimized case. > > > > SKB related cost is 22.9 ns. However, 51.7% (11.84ns) of that cost originates > > from the memset of the SKB. 
> > > > Group-report: related to pattern "skb" :: > > 17.92 % ~= 14.8 ns <= __napi_alloc_skb <== 80% memset(0) / rep stos > > 3.29 % ~= 2.7 ns <= skb_release_data > > 2.20 % ~= 1.8 ns <= napi_consume_skb > > 1.86 % ~= 1.5 ns <= skb_release_head_state > > 1.20 % ~= 1.0 ns <= skb_put > > 1.14 % ~= 0.9 ns <= skb_release_all > > 0.02 % ~= 0.0 ns <= __kfree_skb_flush > > Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns > > > > Doing a crude extrapolation: 82.7 ns minus the SLUB (7.3 ns) and SKB > > (22.9 ns) related costs => 52.5 ns -> extrapolating, 19 Mpps would be the > > maximum speed at which we can pull packet-pages off the RX ring. > > > > I don't know if 19Mpps (52.5 ns "overhead") is fast enough to compete > > with just mapping a RX HW queue/ring to netmap or via SR-IOV to DPDK(?) > > > > But it was interesting to see how the lowest RX layer performs... > > Cool stuff! Thanks :-) > Looking at the typical driver receive path, I'm wondering if we should > break netif_receive_skb (napi_gro_receive) into two parts. One utility > function to create a list of received skb's and prefetch the data, > called as the ring is processed; the other one to give the list to the > stack (e.g. netif_receive_skbs) and defer eth_type_trans as long as > possible. Is something like this what you are contemplating? Yes, that is exactly what I'm contemplating :-) That is idea "(1)". A natural extension to this work, which I expect Tom will love, is to also use the idea for RPS. Once we have an SKB list in the stack/GRO-layer, then we could build a local sk_buff_head list for each remote CPU, by calling get_rps_cpu(). And then enqueue_list_to_backlog, by a skb_queue_splice_tail(&cpu_list, &cpu->sd->input_pkt_queue) call. This would amortize the cost of transferring packets to a remote CPU, which Eric AFAIK points out costs approx ~133ns. 
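The two-stage split being discussed — one utility that builds a list of received packets while the ring is drained, and a netif_receive_skbs()-style call that hands the whole list to the stack with eth_type_trans deferred — can be sketched as a toy user-space model. Everything below (struct pkt, rx_build_list, stack_receive_list) is an invented stand-in for illustration, not the proposed kernel API:

```c
#include <stddef.h>

/* Toy model of the two-stage receive idea (NOT kernel code):
 * stage 1 drains the RX ring into a simple list, stage 2 hands the
 * whole list to the "stack" in one call, so the stack's code stays
 * hot in the icache for every packet in the bundle. */

struct pkt {
	struct pkt *next;
	int proto;              /* filled in lazily in stage 2, not at ring time */
	unsigned char data[64];
};

struct pkt_list {
	struct pkt *head, *tail;
	int count;
};

/* Stage 1: pull packets off the RX ring; defer header parsing
 * (the eth_type_trans() equivalent) so the data cache miss is not
 * taken here. */
static void rx_build_list(struct pkt_list *l, struct pkt *ring, int n)
{
	for (int i = 0; i < n; i++) {
		struct pkt *p = &ring[i];

		p->next = NULL;
		/* a real driver would prefetch(p->data) at this point */
		if (l->tail)
			l->tail->next = p;
		else
			l->head = p;
		l->tail = p;
		l->count++;
	}
}

/* Stage 2: a netif_receive_skbs()-style entry point taking the list. */
static int stack_receive_list(struct pkt_list *l)
{
	int handled = 0;

	for (struct pkt *p = l->head; p != NULL; p = p->next) {
		/* deferred eth_type_trans(): read the EtherType only now */
		p->proto = (p->data[12] << 8) | p->data[13];
		handled++;
	}
	return handled;
}
```

The point of the two loops is that stage 2 runs the same code over the whole bundle, so its instructions are fetched into the icache once per bundle rather than once per packet.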
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer > > Perf-report script: > > * https://github.com/netoptimizer/network-testing/blob/master/bin/perf_report_pps_stats.pl > > > > Report: ALL functions :: > > 19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq > > 17.92 % ~= 14.8 ns <= __napi_alloc_skb > > 9.54 % ~= 7.9 ns <= __free_page_frag > > 7.16 % ~= 5.9 ns <= mlx5e_get_cqe > > 6.37 % ~= 5.3 ns <= mlx5e_post_rx_wqes > > 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk > > 3.70 % ~= 3.1 ns <= __alloc_page_frag > > 3.29 % ~= 2.7 ns <= skb_release_data > > 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk > > 2.45 % ~= 2.0 ns <= eth_type_trans > > 2.43 % ~= 2.0 ns <= get_page_from_freelist > > 2.36 % ~= 2.0 ns <= swiotlb_map_page > > 2.20 % ~= 1.8 ns <= napi_consume_skb > > 1.86 % ~= 1.5 ns <= skb_release_head_state > > 1.25 % ~= 1.0 ns <= free_pages_prepare > > 1.20 % ~= 1.0 ns <= skb_put > > 1.14 % ~= 0.9 ns <= skb_release_all > > 0.77 % ~= 0.6 ns <= __free_pages_ok > > 0.59 % ~= 0.5 ns <= get_pfnblock_flags_mask > > 0.59 % ~= 0.5 ns <= swiotlb_dma_mapping_error > > 0.59 % ~= 0.5 ns <= unmap_single > > 0.58 % ~= 0.5 ns <= _raw_spin_lock_irqsave > > 0.57 % ~= 0.5 ns <= free_one_page > > 0.56 % ~= 0.5 ns <= swiotlb_unmap_page > > 0.52 % ~= 0.4 ns <= _raw_spin_lock > > 0.46 % ~= 0.4 ns <= __mod_zone_page_state > > 0.36 % ~= 0.3 ns <= __rmqueue > > 0.36 % ~= 0.3 ns <= net_rx_action > > 0.34 % ~= 0.3 ns <= __alloc_pages_nodemask > > 0.31 % ~= 0.3 ns <= __zone_watermark_ok > > 0.27 % ~= 0.2 ns <= mlx5e_napi_poll > > 0.24 % ~= 0.2 ns <= ___slab_alloc > > 0.23 % ~= 0.2 ns <= __slab_free > > 0.22 % ~= 0.2 ns <= __list_del_entry > > 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54 > > 0.21 % ~= 0.2 ns <= next_zones_zonelist > > 0.20 % ~= 0.2 ns <= __list_add > > 0.17 % ~= 0.1 ns <= __do_softirq > > 0.17 % ~= 0.1 ns <= cmpxchg_double_slab.isra.65 > > 0.16 % ~= 0.1 ns <= 
__inc_zone_state > > 0.12 % ~= 0.1 ns <= _raw_spin_unlock > > 0.12 % ~= 0.1 ns <= zone_statistics > > (Percent limit(0.1%) stop at "mlx5e_poll_tx_cq") > > Sum: 99.45 % => calc: 82.3 ns (sum: 82.3 ns) => Total: 82.7 ns > > > > Group-report: related to pattern "eth_type_trans|mlx5|ixgbe|__iowrite64_copy" :: > > (Driver related) > > 19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq > > 7.16 % ~= 5.9 ns <= mlx5e_get_cqe > > 6.37 % ~= 5.3 ns <= mlx5e_post_rx_wqes > > 2.45 % ~= 2.0 ns <= eth_type_trans > > 0.27 % ~= 0.2 ns <= mlx5e_napi_poll > > 0.09 % ~= 0.1 ns <= mlx5e_poll_tx_cq > > Sum: 36.05 % => calc: 29.8 ns (sum: 29.8 ns) => Total: 82.7 ns > > > > Group-report: DMA functions :: > > 2.36 % ~= 2.0 ns <= swiotlb_map_page > > 0.59 % ~= 0.5 ns <= unmap_single > > 0.59 % ~= 0.5 ns <= swiotlb_dma_mapping_error > > 0.56 % ~= 0.5 ns <= swiotlb_unmap_page > > Sum: 4.10 % => calc: 3.4 ns (sum: 3.4 ns) => Total: 82.7 ns > > > > Group-report: page_frag_cache functions :: > > 9.54 % ~= 7.9 ns <= __free_page_frag > > 3.70 % ~= 3.1 ns <= __alloc_page_frag > > 2.43 % ~= 2.0 ns <= get_page_from_freelist > > 1.25 % ~= 1.0 ns <= free_pages_prepare > > 0.77 % ~= 0.6 ns <= __free_pages_ok > > 0.59 % ~= 0.5 ns <= get_pfnblock_flags_mask > > 0.57 % ~= 0.5 ns <= free_one_page > > 0.46 % ~= 0.4 ns <= __mod_zone_page_state > > 0.36 % ~= 0.3 ns <= __rmqueue > > 0.34 % ~= 0.3 ns <= __alloc_pages_nodemask > > 0.31 % ~= 0.3 ns <= __zone_watermark_ok > > 0.21 % ~= 0.2 ns <= next_zones_zonelist > > 0.16 % ~= 0.1 ns <= __inc_zone_state > > 0.12 % ~= 0.1 ns <= zone_statistics > > 0.02 % ~= 0.0 ns <= mod_zone_page_state > > Sum: 20.83 % => calc: 17.2 ns (sum: 17.2 ns) => Total: 82.7 ns > > > > Group-report: kmem_cache/SLUB allocator functions :: > > 4.99 % ~= 4.1 ns <= kmem_cache_alloc_bulk > > 2.87 % ~= 2.4 ns <= kmem_cache_free_bulk > > 0.24 % ~= 0.2 ns <= ___slab_alloc > > 0.23 % ~= 0.2 ns <= __slab_free > > 0.21 % ~= 0.2 ns <= __cmpxchg_double_slab.isra.54 > > 0.17 % ~= 0.1 ns <= 
cmpxchg_double_slab.isra.65 > > 0.07 % ~= 0.1 ns <= put_cpu_partial > > 0.04 % ~= 0.0 ns <= unfreeze_partials.isra.71 > > 0.03 % ~= 0.0 ns <= get_partial_node.isra.72 > > Sum: 8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns > > > > Group-report: related to pattern "skb" :: > > 17.92 % ~= 14.8 ns <= __napi_alloc_skb <== 80% memset(0) / rep stos > > 3.29 % ~= 2.7 ns <= skb_release_data > > 2.20 % ~= 1.8 ns <= napi_consume_skb > > 1.86 % ~= 1.5 ns <= skb_release_head_state > > 1.20 % ~= 1.0 ns <= skb_put > > 1.14 % ~= 0.9 ns <= skb_release_all > > 0.02 % ~= 0.0 ns <= __kfree_skb_flush > > Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns > > > > Group-report: Core network-stack functions :: > > 0.36 % ~= 0.3 ns <= net_rx_action > > 0.17 % ~= 0.1 ns <= __do_softirq > > 0.02 % ~= 0.0 ns <= __raise_softirq_irqoff > > 0.01 % ~= 0.0 ns <= run_ksoftirqd > > 0.00 % ~= 0.0 ns <= run_timer_softirq > > 0.00 % ~= 0.0 ns <= ksoftirqd_should_run > > 0.00 % ~= 0.0 ns <= raise_softirq > > Sum: 0.56 % => calc: 0.5 ns (sum: 0.5 ns) => Total: 82.7 ns > > > > Group-report: GRO network-stack functions :: > > Sum: 0.00 % => calc: 0.0 ns (sum: 0.0 ns) => Total: 82.7 ns > > > > Group-report: related to pattern "spin_.*lock|mutex" :: > > 0.58 % ~= 0.5 ns <= _raw_spin_lock_irqsave > > 0.52 % ~= 0.4 ns <= _raw_spin_lock > > 0.12 % ~= 0.1 ns <= _raw_spin_unlock > > 0.01 % ~= 0.0 ns <= _raw_spin_unlock_irqrestore > > 0.00 % ~= 0.0 ns <= __mutex_lock_slowpath > > 0.00 % ~= 0.0 ns <= _raw_spin_lock_irq > > Sum: 1.23 % => calc: 1.0 ns (sum: 1.0 ns) => Total: 82.7 ns > > > > Negative Report: functions NOT included in group reports:: > > 0.22 % ~= 0.2 ns <= __list_del_entry > > 0.20 % ~= 0.2 ns <= __list_add > > 0.07 % ~= 0.1 ns <= list_del > > 0.05 % ~= 0.0 ns <= native_sched_clock > > 0.04 % ~= 0.0 ns <= irqtime_account_irq > > 0.02 % ~= 0.0 ns <= rcu_bh_qs > > 0.01 % ~= 0.0 ns <= task_tick_fair > > 0.01 % ~= 0.0 ns <= net_rps_action_and_irq_enable.isra.112 > > 0.01 % ~= 0.0 ns 
<= perf_event_task_tick > > 0.01 % ~= 0.0 ns <= apic_timer_interrupt > > 0.01 % ~= 0.0 ns <= lapic_next_deadline > > 0.01 % ~= 0.0 ns <= rcu_check_callbacks > > 0.01 % ~= 0.0 ns <= smpboot_thread_fn > > 0.01 % ~= 0.0 ns <= irqtime_account_process_tick.isra.3 > > 0.00 % ~= 0.0 ns <= intel_bts_enable_local > > 0.00 % ~= 0.0 ns <= kthread_should_park > > 0.00 % ~= 0.0 ns <= native_apic_mem_write > > 0.00 % ~= 0.0 ns <= hrtimer_forward > > 0.00 % ~= 0.0 ns <= get_work_pool > > 0.00 % ~= 0.0 ns <= cpu_startup_entry > > 0.00 % ~= 0.0 ns <= acct_account_cputime > > 0.00 % ~= 0.0 ns <= set_next_entity > > 0.00 % ~= 0.0 ns <= worker_thread > > 0.00 % ~= 0.0 ns <= dbs_timer_handler > > 0.00 % ~= 0.0 ns <= delay_tsc > > 0.00 % ~= 0.0 ns <= idle_cpu > > 0.00 % ~= 0.0 ns <= timerqueue_add > > 0.00 % ~= 0.0 ns <= hrtimer_interrupt > > 0.00 % ~= 0.0 ns <= dbs_work_handler > > 0.00 % ~= 0.0 ns <= dequeue_entity > > 0.00 % ~= 0.0 ns <= update_cfs_shares > > 0.00 % ~= 0.0 ns <= update_fast_timekeeper > > 0.00 % ~= 0.0 ns <= smp_trace_apic_timer_interrupt > > 0.00 % ~= 0.0 ns <= __update_cpu_load > > 0.00 % ~= 0.0 ns <= cpu_needs_another_gp > > 0.00 % ~= 0.0 ns <= ret_from_intr > > 0.00 % ~= 0.0 ns <= __intel_pmu_enable_all > > 0.00 % ~= 0.0 ns <= trigger_load_balance > > 0.00 % ~= 0.0 ns <= __schedule > > 0.00 % ~= 0.0 ns <= nsecs_to_jiffies64 > > 0.00 % ~= 0.0 ns <= account_entity_dequeue > > 0.00 % ~= 0.0 ns <= worker_enter_idle > > 0.00 % ~= 0.0 ns <= __hrtimer_get_next_event > > 0.00 % ~= 0.0 ns <= rcu_irq_exit > > 0.00 % ~= 0.0 ns <= rb_erase > > 0.00 % ~= 0.0 ns <= __intel_pmu_disable_all > > 0.00 % ~= 0.0 ns <= tick_sched_do_timer > > 0.00 % ~= 0.0 ns <= cpuacct_account_field > > 0.00 % ~= 0.0 ns <= update_wall_time > > 0.00 % ~= 0.0 ns <= notifier_call_chain > > 0.00 % ~= 0.0 ns <= timekeeping_update > > 0.00 % ~= 0.0 ns <= ktime_get_update_offsets_now > > 0.00 % ~= 0.0 ns <= rb_next > > 0.00 % ~= 0.0 ns <= rcu_all_qs > > 0.00 % ~= 0.0 ns <= x86_pmu_disable > > 0.00 % ~= 0.0 
ns <= _cond_resched > > 0.00 % ~= 0.0 ns <= __rcu_read_lock > > 0.00 % ~= 0.0 ns <= __local_bh_enable > > 0.00 % ~= 0.0 ns <= update_cpu_load_active > > 0.00 % ~= 0.0 ns <= x86_pmu_enable > > 0.00 % ~= 0.0 ns <= insert_work > > 0.00 % ~= 0.0 ns <= ktime_get > > 0.00 % ~= 0.0 ns <= __usecs_to_jiffies > > 0.00 % ~= 0.0 ns <= __acct_update_integrals > > 0.00 % ~= 0.0 ns <= scheduler_tick > > 0.00 % ~= 0.0 ns <= update_vsyscall > > 0.00 % ~= 0.0 ns <= memcpy_erms > > 0.00 % ~= 0.0 ns <= get_cpu_idle_time_us > > 0.00 % ~= 0.0 ns <= sched_clock_cpu > > 0.00 % ~= 0.0 ns <= tick_do_update_jiffies64 > > 0.00 % ~= 0.0 ns <= hrtimer_active > > 0.00 % ~= 0.0 ns <= profile_tick > > 0.00 % ~= 0.0 ns <= __hrtimer_run_queues > > 0.00 % ~= 0.0 ns <= kthread_should_stop > > 0.00 % ~= 0.0 ns <= run_posix_cpu_timers > > 0.00 % ~= 0.0 ns <= read_tsc > > 0.00 % ~= 0.0 ns <= __remove_hrtimer > > 0.00 % ~= 0.0 ns <= calc_global_load_tick > > 0.00 % ~= 0.0 ns <= hrtimer_run_queues > > 0.00 % ~= 0.0 ns <= irq_work_tick > > 0.00 % ~= 0.0 ns <= cpuacct_charge > > 0.00 % ~= 0.0 ns <= clockevents_program_event > > 0.00 % ~= 0.0 ns <= update_blocked_averages > > Sum: 0.68 % => calc: 0.6 ns (sum: 0.6 ns) => Total: 82.7 ns > > > > ^ permalink raw reply [flat|nested] 59+ messages in thread
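The per-remote-CPU bundling idea from the message above — build a local sk_buff_head per destination CPU via get_rps_cpu(), then splice each bundle into that CPU's backlog with one skb_queue_splice_tail() call — can be modeled in user space. get_rps_cpu() and skb_queue_splice_tail() are the real kernel names being alluded to; everything in the sketch (node, queue, toy_rps_cpu, deliver_burst) is an invented stand-in that only demonstrates the amortization pattern:

```c
#include <stddef.h>

#define TOY_NCPUS 4

struct node {
	struct node *next;
	unsigned int hash;	/* stands in for the skb rxhash */
};

struct queue {
	struct node *head, *tail;
	int len;
	int splices;		/* how often the "backlog" was touched */
};

static unsigned int toy_rps_cpu(const struct node *n)
{
	return n->hash % TOY_NCPUS;	/* get_rps_cpu() stand-in */
}

static void q_push(struct queue *q, struct node *n)
{
	n->next = NULL;
	if (q->tail)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
	q->len++;
}

/* skb_queue_splice_tail() stand-in: one "locked" op per bundle. */
static void q_splice_tail(struct queue *dst, struct queue *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	dst->len += src->len;
	dst->splices++;
	src->head = src->tail = NULL;
	src->len = 0;
}

/* Bucket a burst locally per destination CPU, then splice each bucket
 * into its backlog exactly once, however many packets it holds. */
static void deliver_burst(struct node *pkts, int n, struct queue *backlogs)
{
	struct queue local[TOY_NCPUS] = {{0}};

	for (int i = 0; i < n; i++)
		q_push(&local[toy_rps_cpu(&pkts[i])], &pkts[i]);
	for (int cpu = 0; cpu < TOY_NCPUS; cpu++)
		q_splice_tail(&backlogs[cpu], &local[cpu]);
}
```

Whatever the burst size, each remote backlog is touched (and its cache line bounced) at most once per NAPI poll in this scheme.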
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-28 9:25 ` Jesper Dangaard Brouer @ 2016-01-28 12:45 ` Eric Dumazet 2016-01-28 16:37 ` Tom Herbert 0 siblings, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2016-01-28 12:45 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Tom Herbert, John Fastabend, Michael S. Tsirkin, David Miller, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Thu, 2016-01-28 at 10:25 +0100, Jesper Dangaard Brouer wrote: > Yes, that is exactly what I'm contemplating :-) That is idea "(1)". > > A natural extension to this work, which I expect Tom will love, is to > also use the idea for RPS. Once we have a SKB list in stack/GRO-layer, > then we could build a local sk_buff_head list for each remote CPU, by > calling get_rps_cpu(). And then enqueue_list_to_backlog, by a > skb_queue_splice_tail(&cpu_list, &cpu->sd->input_pkt_queue) call. > > This would amortize the cost of transferring packets to a remote CPU, > which Eric AFAIK points out is costing approx ~133ns. > Jesper, RPS and RFS already defer sending the IPI and submit batches to remote cpus. See commits e326bed2f47d0365da5a8faaf8ee93ed2d86325b ("rps: immediate send IPI in process_backlog()") 88751275b8e867d756e4f86ae92afe0232de129f ("rps: shortcut net_rps_action()") And of course all the discussions we had to come up with 0a9627f2649a02bea165cfd529d7bcb625c2fcad ("rps: Receive Packet Steering") The current state : net_rps_action_and_irq_enable() sends the IPI at the end of net_rx_action() once all NAPI handlers have been called, and therefore have accumulated packets and cook rps_ipi_list (via calls to rps_ipi_queued() from enqueue_to_backlog()) Adding another stage in the pipeline would not help. 
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-28 12:45 ` Eric Dumazet @ 2016-01-28 16:37 ` Tom Herbert 2016-01-28 16:43 ` Eric Dumazet 2016-01-28 17:04 ` Jesper Dangaard Brouer 0 siblings, 2 replies; 59+ messages in thread From: Tom Herbert @ 2016-01-28 16:37 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, John Fastabend, Michael S. Tsirkin, David Miller, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Thu, Jan 28, 2016 at 4:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Thu, 2016-01-28 at 10:25 +0100, Jesper Dangaard Brouer wrote: > >> Yes, that is exactly what I'm contemplating :-) That is idea "(1)". >> >> A natural extension to this work, which I expect Tom will love, is to >> also use the idea for RPS. Once we have a SKB list in stack/GRO-layer, >> then we could build a local sk_buff_head list for each remote CPU, by >> calling get_rps_cpu(). And then enqueue_list_to_backlog, by a >> skb_queue_splice_tail(&cpu_list, &cpu->sd->input_pkt_queue) call. >> >> This would amortize the cost of transferring packets to a remote CPU, >> which Eric AFAIK points out is costing approx ~133ns. >> > > Jesper, RPS and RFS already defer sending the IPI and submit batches to > remote cpus. 
> > See commits > > e326bed2f47d0365da5a8faaf8ee93ed2d86325b ("rps: immediate send IPI in > process_backlog()") > > 88751275b8e867d756e4f86ae92afe0232de129f ("rps: shortcut > net_rps_action()") > > And of course all the discussions we had to come up with > 0a9627f2649a02bea165cfd529d7bcb625c2fcad ("rps: Receive Packet > Steering") > > The current state : > > net_rps_action_and_irq_enable() sends the IPI at the end of > net_rx_action() once all NAPI handlers have been called, and therefore > have accumulated packets and cook rps_ipi_list (via calls to > rps_ipi_queued() from enqueue_to_backlog()) > > > Adding another stage in the pipeline would not help. > skbs are enqueued on a CPU queue one at a time through enqueue_to_backlog. It would be nice to do that as a batch of skbs. > ^ permalink raw reply [flat|nested] 59+ messages in thread
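The batching Tom asks for here amounts to taking the backlog queue lock once per bundle instead of once per skb. A toy model (invented names — toy_backlog, enqueue_per_skb, enqueue_batch — not the kernel API) that just counts lock operations makes the amortization concrete:

```c
/* Toy model contrasting per-skb vs batched enqueue_to_backlog.
 * The "lock" is a counter standing in for the spinlock protecting
 * input_pkt_queue; nothing here is real kernel code. */

struct toy_backlog {
	int queued;
	int lock_ops;	/* times the backlog "lock" was taken */
};

static void toy_lock(struct toy_backlog *b)   { b->lock_ops++; }
static void toy_unlock(struct toy_backlog *b) { (void)b; }

/* Current model: enqueue_to_backlog() is called once per skb. */
static void enqueue_per_skb(struct toy_backlog *b, int n_pkts)
{
	for (int i = 0; i < n_pkts; i++) {
		toy_lock(b);
		b->queued++;
		toy_unlock(b);
	}
}

/* Batched model: take the lock once and splice the whole bundle
 * (the skb_queue_splice_tail() equivalent). */
static void enqueue_batch(struct toy_backlog *b, int n_pkts)
{
	toy_lock(b);
	b->queued += n_pkts;
	toy_unlock(b);
}
```

For a 64-packet NAPI poll the batched variant takes the lock once instead of 64 times, which is where the per-packet ~100ns-scale cross-CPU cost gets amortized.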
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-28 16:37 ` Tom Herbert @ 2016-01-28 16:43 ` Eric Dumazet 2016-01-28 17:04 ` Jesper Dangaard Brouer 1 sibling, 0 replies; 59+ messages in thread From: Eric Dumazet @ 2016-01-28 16:43 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, John Fastabend, Michael S. Tsirkin, David Miller, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich On Thu, 2016-01-28 at 08:37 -0800, Tom Herbert wrote: > skbs are enqueued on a CPU queue one at at time through > enqueue_to_backlog. It would be nice to do that as a batch of skbs. Adding yet another layer and cache misses. This might be a win for stress situations, not for nominal traffic, when very few packets are delivered per NAPI poll. For stress situations, we do not rely on RPS/RFS at all, but prefer RSS and appropriate number of RX queues, to have true silos. For the router case, where Jesper wants 15 Mpps on a single core, RPS/RFS is not used. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) 2016-01-28 16:37 ` Tom Herbert 2016-01-28 16:43 ` Eric Dumazet @ 2016-01-28 17:04 ` Jesper Dangaard Brouer 1 sibling, 0 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-28 17:04 UTC (permalink / raw) To: Tom Herbert Cc: Eric Dumazet, John Fastabend, Michael S. Tsirkin, David Miller, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich, brouer On Thu, 28 Jan 2016 08:37:07 -0800 Tom Herbert <tom@herbertland.com> wrote: > On Thu, Jan 28, 2016 at 4:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Thu, 2016-01-28 at 10:25 +0100, Jesper Dangaard Brouer wrote: > > > >> Yes, that is exactly what I'm contemplating :-) That is idea "(1)". > >> > >> A natural extension to this work, which I expect Tom will love, is to > >> also use the idea for RPS. Once we have a SKB list in stack/GRO-layer, > >> then we could build a local sk_buff_head list for each remote CPU, by > >> calling get_rps_cpu(). And then enqueue_list_to_backlog, by a > >> skb_queue_splice_tail(&cpu_list, &cpu->sd->input_pkt_queue) call. > >> > >> This would amortize the cost of transferring packets to a remote CPU, > >> which Eric AFAIK points out is costing approx ~133ns. > >> > > > > Jesper, RPS and RFS already defer sending the IPI and submit batches to > > remote cpus. 
> > > > See commits > > > > e326bed2f47d0365da5a8faaf8ee93ed2d86325b ("rps: immediate send IPI in > > process_backlog()") > > > > 88751275b8e867d756e4f86ae92afe0232de129f ("rps: shortcut > > net_rps_action()") > > > > And of course all the discussions we had to come up with > > 0a9627f2649a02bea165cfd529d7bcb625c2fcad ("rps: Receive Packet > > Steering") > > > > The current state : > > > > net_rps_action_and_irq_enable() sends the IPI at the end of > > net_rx_action() once all NAPI handlers have been called, and therefore > > have accumulated packets and cook rps_ipi_list (via calls to > > rps_ipi_queued() from enqueue_to_backlog()) Yes, thanks for pointing this out. Then we already have amortized the IPI call. Great. > > Adding another stage in the pipeline would not help. > > > skbs are enqueued on a CPU queue one at at time through > enqueue_to_backlog. It would be nice to do that as a batch of skbs. Yes, this was what I was looking at doing, a bulk enqueue to backlog. Thus, amortizing the lock. And if some remote CPU is reading/using input_pkt_queue, then we don't bounce that cache line. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-24 14:28 ` Jesper Dangaard Brouer 2016-01-24 14:44 ` Michael S. Tsirkin @ 2016-01-24 20:09 ` Tom Herbert 2016-01-24 21:41 ` John Fastabend 1 sibling, 1 reply; 59+ messages in thread From: Tom Herbert @ 2016-01-24 20:09 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Michael S. Tsirkin On Sun, Jan 24, 2016 at 6:28 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Thu, 21 Jan 2016 10:54:01 -0800 (PST) > David Miller <davem@davemloft.net> wrote: > >> From: Jesper Dangaard Brouer <brouer@redhat.com> >> Date: Thu, 21 Jan 2016 12:27:30 +0100 >> >> > eth_type_trans() does two things: >> > >> > 1) determine skb->protocol >> > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} >> > >> > Could the HW descriptor deliver the "proto", or perhaps just some bits >> > on the most common proto's? >> > >> > The skb->pkt_type don't need many bits. And I bet the HW already have >> > the information. The BROADCAST and MULTICAST indication are easy. The >> > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST >> > indication, if the eth->h_dest match the devices dev->dev_addr (else a >> > SW compare is required). >> > >> > Is that doable in hardware? >> >> I feel like we've had this discussion before several years ago. >> >> I think having just the protocol value would be enough. >> >> skb->pkt_type we could deal with by using always an accessor and >> evaluating it lazily. Nothing needs it until we hit ip_rcv() or >> similar. > > First I thought, I liked the idea delaying the eval of skb->pkt_type. > > BUT then I realized, what if we take this even further. 
What if we > actually use this information, for something useful, at this very > early RX stage. > > The information I'm interested in, from the HW descriptor, is if this > packet is NOT for local delivery. If so, we can send the packet on a > "fast-forward" code path. > > Think about bridging packets to a guest OS. Because we know very > early at RX (from the packet HW descriptor) we might even avoid allocating > a SKB. We could just "forward" the packet-page to the guest OS. > > Taking Eric's idea of remote CPUs, we could even send these > packet-pages to a remote CPU (e.g. where the guest OS is running), > without having touched a single cache-line in the packet-data. I > would still bundle them up first, to amortize the (100-133ns) cost of > transferring something to another CPU. > You mean like RPS/RFS/aRFS/flow_director already does (except for the zero-touch part)? > The data-cache trick would be to instruct the prefetcher only to start > prefetching to L3 or L2, when these packets are destined for a remote > CPU. At least Intel CPUs have prefetch operations that specify only > L2/L3 cache. > > > Maybe we need a combined solution. Lazy eval of skb->pkt_type for > local delivery, but set the information if avail from HW desc. And > fast page-forward doesn't even need a SKB. > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
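DaveM's suggestion quoted above — always go through an accessor and evaluate skb->pkt_type lazily — can be sketched as follows. All names (toy_skb, toy_classify, skb_pkt_type, toy_dev_addr) are illustrative user-space stand-ins; the real skb field layout and the eth_type_trans() logic differ in detail:

```c
#include <string.h>

/* Lazy pkt_type sketch (NOT kernel code): the driver leaves pkt_type
 * unset, and the cost of inspecting the Ethernet destination address
 * is only paid if/when ip_rcv()-level code actually asks for it. */

enum toy_pkt_type {
	TOY_PKT_UNSET = 0,
	TOY_PKT_HOST,
	TOY_PKT_BROADCAST,
	TOY_PKT_MULTICAST,
	TOY_PKT_OTHERHOST,
};

struct toy_skb {
	unsigned char dst[6];		/* Ethernet destination address */
	enum toy_pkt_type pkt_type;	/* stays TOY_PKT_UNSET until queried */
};

/* Our device's MAC address (arbitrary locally-administered example). */
static const unsigned char toy_dev_addr[6] = { 0x02, 0, 0, 0, 0, 0x01 };

static enum toy_pkt_type toy_classify(const unsigned char *dst)
{
	static const unsigned char bcast[6] = { 0xff, 0xff, 0xff,
						0xff, 0xff, 0xff };

	if (dst[0] & 1)		/* group bit: broadcast or multicast */
		return memcmp(dst, bcast, 6) == 0 ? TOY_PKT_BROADCAST
						  : TOY_PKT_MULTICAST;
	return memcmp(dst, toy_dev_addr, 6) == 0 ? TOY_PKT_HOST
						 : TOY_PKT_OTHERHOST;
}

/* The accessor: classify on first use and cache the answer. */
static enum toy_pkt_type skb_pkt_type(struct toy_skb *skb)
{
	if (skb->pkt_type == TOY_PKT_UNSET)
		skb->pkt_type = toy_classify(skb->dst);
	return skb->pkt_type;
}
```

If the HW descriptor already carries BROADCAST/MULTICAST/HOST bits, as proposed in this thread, the driver could pre-fill pkt_type and the accessor's classify step would simply never run.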
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-24 20:09 ` Optimizing instruction-cache, more packets at each stage Tom Herbert @ 2016-01-24 21:41 ` John Fastabend 2016-01-24 23:50 ` Tom Herbert 0 siblings, 1 reply; 59+ messages in thread From: John Fastabend @ 2016-01-24 21:41 UTC (permalink / raw) To: Tom Herbert, Jesper Dangaard Brouer Cc: David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Michael S. Tsirkin On 16-01-24 12:09 PM, Tom Herbert wrote: > On Sun, Jan 24, 2016 at 6:28 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: >> On Thu, 21 Jan 2016 10:54:01 -0800 (PST) >> David Miller <davem@davemloft.net> wrote: >> >>> From: Jesper Dangaard Brouer <brouer@redhat.com> >>> Date: Thu, 21 Jan 2016 12:27:30 +0100 >>> >>>> eth_type_trans() does two things: >>>> >>>> 1) determine skb->protocol >>>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} >>>> >>>> Could the HW descriptor deliver the "proto", or perhaps just some bits >>>> on the most common proto's? >>>> >>>> The skb->pkt_type don't need many bits. And I bet the HW already have >>>> the information. The BROADCAST and MULTICAST indication are easy. The >>>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST >>>> indication, if the eth->h_dest match the devices dev->dev_addr (else a >>>> SW compare is required). >>>> >>>> Is that doable in hardware? >>> >>> I feel like we've had this discussion before several years ago. >>> >>> I think having just the protocol value would be enough. >>> >>> skb->pkt_type we could deal with by using always an accessor and >>> evaluating it lazily. Nothing needs it until we hit ip_rcv() or >>> similar. >> >> First I thought, I liked the idea delaying the eval of skb->pkt_type. 
>> >> BUT then I realized, what if we take this even further. What if we >> actually use this information, for something useful, at this very >> early RX stage. >> >> The information I'm interested in, from the HW descriptor, is if this >> packet is NOT for local delivery. If so, we can send the packet on a >> "fast-forward" code path. >> >> Think about bridging packets to a guest OS. Because we know very >> early at RX (from packet HW descriptor) we might even avoid allocating >> a SKB. We could just "forward" the packet-page to the guest OS. >> >> Taking Eric's idea, of remote CPUs, we could even send these >> packet-pages to a remote CPU (e.g. where the guest OS is running), >> without having touched a single cache-line in the packet-data. I >> would still bundle them up first, to amortize the (100-133ns) cost of >> transferring something to another CPU. >> > You mean like RPS/RFS/aRFS/flow_director already does (except for the > zero-touch part)? > You could also look at ATR in the ixgbe/i40e drivers which on xmit uses a tuple to try and force the hardware to recv on the same queue pair as the sending side. The idea being you can bind tx/rx queue pairs to a core and send/recv on the same core which tends to be an OK strategy although not always. It is sometimes better to tx and rx on separate cores. >> The data-cache trick, would be to instruct prefetcher only to start >> prefetching to L3 or L2, when these packet are destined for a remote >> CPU. At-least Intel CPUs have prefetch operations that specify only >> L2/L3 cache. >> >> >> Maybe, we need a combined solution. Lazy eval skb->pkt_type, for >> local delivery, but set the information if avail from HW desc. And >> fast page-forward don't even need a SKB. >> >> -- >> Best regards, >> Jesper Dangaard Brouer >> MSc.CS, Principal Kernel Engineer at Red Hat >> Author of http://www.iptv-analyzer.org >> LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-24 21:41 ` John Fastabend @ 2016-01-24 23:50 ` Tom Herbert 0 siblings, 0 replies; 59+ messages in thread From: Tom Herbert @ 2016-01-24 23:50 UTC (permalink / raw) To: John Fastabend Cc: Jesper Dangaard Brouer, David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, Michael S. Tsirkin On Sun, Jan 24, 2016 at 1:41 PM, John Fastabend <john.fastabend@gmail.com> wrote: > On 16-01-24 12:09 PM, Tom Herbert wrote: >> On Sun, Jan 24, 2016 at 6:28 AM, Jesper Dangaard Brouer >> <brouer@redhat.com> wrote: >>> On Thu, 21 Jan 2016 10:54:01 -0800 (PST) >>> David Miller <davem@davemloft.net> wrote: >>> >>>> From: Jesper Dangaard Brouer <brouer@redhat.com> >>>> Date: Thu, 21 Jan 2016 12:27:30 +0100 >>>> >>>>> eth_type_trans() does two things: >>>>> >>>>> 1) determine skb->protocol >>>>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST} >>>>> >>>>> Could the HW descriptor deliver the "proto", or perhaps just some bits >>>>> on the most common proto's? >>>>> >>>>> The skb->pkt_type don't need many bits. And I bet the HW already have >>>>> the information. The BROADCAST and MULTICAST indication are easy. The >>>>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST >>>>> indication, if the eth->h_dest match the devices dev->dev_addr (else a >>>>> SW compare is required). >>>>> >>>>> Is that doable in hardware? >>>> >>>> I feel like we've had this discussion before several years ago. >>>> >>>> I think having just the protocol value would be enough. >>>> >>>> skb->pkt_type we could deal with by using always an accessor and >>>> evaluating it lazily. Nothing needs it until we hit ip_rcv() or >>>> similar. >>> >>> First I thought, I liked the idea delaying the eval of skb->pkt_type. 
>>> >>> BUT then I realized, what if we take this even further. What if we >>> actually use this information, for something useful, at this very >>> early RX stage. >>> >>> The information I'm interested in, from the HW descriptor, is if this >>> packet is NOT for local delivery. If so, we can send the packet on a >>> "fast-forward" code path. >>> >>> Think about bridging packets to a guest OS. Because we know very >>> early at RX (from packet HW descriptor) we might even avoid allocating >>> a SKB. We could just "forward" the packet-page to the guest OS. >>> >>> Taking Eric's idea, of remote CPUs, we could even send these >>> packet-pages to a remote CPU (e.g. where the guest OS is running), >>> without having touched a single cache-line in the packet-data. I >>> would still bundle them up first, to amortize the (100-133ns) cost of >>> transferring something to another CPU. >>> >> You mean like RPS/RFS/aRFS/flow_director already does (except for the >> zero-touch part)? >> > > You could also look at ATR in the ixgbe/i40e drivers which on xmit > uses a tuple to try and force the hardware to recv on the same queue > pair as the sending side. The idea being you can bind tx/rx queue > pairs to a core and send/recv on the same core which tends to be an > OK strategy although not always. It is sometimes better to tx and rx > on separate cores. > Right, we have seen cases where HW attempting to autonomously bind tx/rx to the same CPU does nothing more than create a whole bunch of OOO packets and a big mess otherwise. The better approach is to allow the stack to indicate to HW where *it* wants received packets for each flow to go. If it wants to bind tx/rx it can do that, if it wants to split that's fine too. This is possible with aRFS, and in fact I don't see any reason why virtual drivers shouldn't also support aRFS to allow guests control over steering within their CPUs.
>>> The data-cache trick, would be to instruct prefetcher only to start >>> prefetching to L3 or L2, when these packet are destined for a remote >>> CPU. At-least Intel CPUs have prefetch operations that specify only >>> L2/L3 cache. >>> >>> >>> Maybe, we need a combined solution. Lazy eval skb->pkt_type, for >>> local delivery, but set the information if avail from HW desc. And >>> fast page-forward don't even need a SKB. >>> >>> -- >>> Best regards, >>> Jesper Dangaard Brouer >>> MSc.CS, Principal Kernel Engineer at Red Hat >>> Author of http://www.iptv-analyzer.org >>> LinkedIn: http://www.linkedin.com/in/brouer > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-20 23:27 ` Tom Herbert 2016-01-21 11:27 ` Jesper Dangaard Brouer @ 2016-01-21 12:23 ` Jesper Dangaard Brouer 2016-01-21 16:38 ` Tom Herbert 2016-02-02 16:13 ` Or Gerlitz 2 siblings, 1 reply; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-21 12:23 UTC (permalink / raw) To: Tom Herbert Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, brouer On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote: > weaknesses of Toeplitz we talked about recently and that fact that > Jenkins is really fast to compute, I am starting to think maybe we > should always do a software hash and not rely on HW for it... Please don't enforce a software hash. You are proposing a hash computation per packet which costs in the area of 50-100 nanosec (?). And on data which is cache cold (even with DDIO, you take the L3 cache cost/hit). Consider the increase in network hardware speeds. Worst-case (pkt size 64 bytes) time between packets:
 * 10 Gbit/s -> 67.2 nanosec
 * 40 Gbit/s -> 16.8 nanosec
 * 100 Gbit/s -> 6.7 nanosec
Adding such a per packet cost is not going to fly. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
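The worst-case figures above follow directly from Ethernet framing overhead: a minimal 64 byte frame occupies 64 + 8 (preamble/SFD) + 12 (inter-frame gap) = 84 bytes, i.e. 672 bit-times, on the wire. A quick sketch reproducing the arithmetic (the helper name is made up, not from the thread):

```c
/* Worst-case time between packets: a minimal 64 byte Ethernet frame
 * occupies 64 + 8 (preamble + SFD) + 12 (inter-frame gap) = 84 bytes
 * = 672 bits on the wire.  One bit at 1 Gbit/s lasts exactly 1 ns,
 * so dividing bit-time count by the Gbit/s rate yields nanoseconds. */
static double wire_time_ns(double frame_bytes, double gbit_per_sec)
{
	double wire_bits = (frame_bytes + 8 + 12) * 8;
	return wire_bits / gbit_per_sec;
}
/* wire_time_ns(64, 10)  -> 67.2 ns
 * wire_time_ns(64, 40)  -> 16.8 ns
 * wire_time_ns(64, 100) ->  6.72 ns (the 6.7 above, rounded) */
```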
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 12:23 ` Jesper Dangaard Brouer @ 2016-01-21 16:38 ` Tom Herbert 2016-01-21 17:48 ` Eric Dumazet 0 siblings, 1 reply; 59+ messages in thread From: Tom Herbert @ 2016-01-21 16:38 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai On Thu, Jan 21, 2016 at 4:23 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Wed, 20 Jan 2016 15:27:38 -0800 > Tom Herbert <tom@herbertland.com> wrote: > >> weaknesses of Toeplitz we talked about recently and that fact that >> Jenkins is really fast to compute, I am starting to think maybe we >> should always do a software hash and not rely on HW for it... > > Please don't enforce a software hash. You are proposing a hash > computation per packet which cost in the area 50-100 nanosec (?). And > on data which is cache cold (even with DDIO, you take the L3 cache > cost/hit). > I clock Jenkins hash computation itself at ~6nsecs (not taking cache miss), but your point is taken. > Consider the increase in network hardware speeds. > > Worst-case (pkt size 64 bytes) time between packets: > * 10 Gbit/s -> 67.2 nanosec > * 40 Gbit/s -> 16.8 nanosec > * 100 Gbit/s -> 6.7 nanosec > > Adding such a per packet cost is not going to fly. > Sure, but the receive path is parallelized. Improving parallelism has continuously shown to have much more impact than attempting to optimize for cache misses. The primary goal is not to drive 100Gbps with 64 packets from a single CPU. It is one benchmark of many we should look at to measure efficiency of the data path, but I've yet to see any real workload that requires that... Regardless of anything, we need to load packet headers into CPU cache to do protocol processing. 
I'm not sure I see how trying to defer that as long as possible helps except in cases where the packet is crossing CPU cache boundaries and can eliminate cache misses completely (not just move them around from one function to another). Tom > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 16:38 ` Tom Herbert @ 2016-01-21 17:48 ` Eric Dumazet 2016-01-22 12:33 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2016-01-21 17:48 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote: > Sure, but the receive path is parallelized. This is true for multiqueue processing, assuming you can dedicate many cores to process RX. > Improving parallelism has > continuously shown to have much more impact than attempting to > optimize for cache misses. The primary goal is not to drive 100Gbps > with 64 packets from a single CPU. It is one benchmark of many we > should look at to measure efficiency of the data path, but I've yet to > see any real workload that requires that... > > Regardless of anything, we need to load packet headers into CPU cache > to do protocol processing. I'm not sure I see how trying to defer that > as long as possible helps except in cases where the packet is crossing > CPU cache boundaries and can eliminate cache misses completely (not > just move them around from one function to another). Note that some user space programs use multiple cores (or hyper threads) to implement a pipeline, using a single RX queue. One thread can handle one stage (device RX drain) and prefetch data into shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads). The second thread processes packets with headers already in L1/L2. This way, the ~100 ns (or even more if you also consider skb allocations) penalty to bring packet headers in does not hurt PPS. ^ permalink raw reply [flat|nested] 59+ messages in thread
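Eric's two-thread pipeline can be sketched in user space with a trivial single-producer/single-consumer handoff. This is an illustrative toy under stated assumptions, not the stack's code: one thread plays the "device RX drain" stage and prefetches each slot towards the shared cache, the other plays the protocol-processing stage; all names (`rx_drain`, `proto_process`, `run_pipeline`) are invented here.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NPKTS 512 /* toy "RX ring"; small enough to skip wrap-around handling */

static unsigned long slots[NPKTS];
static _Atomic unsigned long rx_head; /* count of produced slots */

/* Stage 1: drain the "device RX ring" and warm the shared cache. */
static void *rx_drain(void *arg)
{
	(void)arg;
	for (unsigned long i = 0; i < NPKTS; i++) {
		slots[i] = i;                        /* stand-in for pulling a descriptor */
		__builtin_prefetch(&slots[i], 0, 2); /* bias towards shared L2/L3 */
		atomic_store_explicit(&rx_head, i + 1, memory_order_release);
	}
	return NULL;
}

/* Stage 2: "protocol" work; headers should already be cache-warm. */
static void *proto_process(void *arg)
{
	unsigned long sum = 0, seen = 0;

	while (seen < NPKTS) {
		while (atomic_load_explicit(&rx_head, memory_order_acquire) <= seen)
			; /* busy-wait, as a dedicated pipeline core would */
		sum += slots[seen++];
	}
	*(unsigned long *)arg = sum;
	return NULL;
}

static unsigned long run_pipeline(void)
{
	pthread_t prod, cons;
	unsigned long sum = 0;

	atomic_store(&rx_head, 0);
	pthread_create(&cons, NULL, proto_process, &sum);
	pthread_create(&prod, NULL, rx_drain, NULL);
	pthread_join(prod, NULL);
	pthread_join(cons, NULL);
	return sum; /* 0 + 1 + ... + 511 */
}
```

The release/acquire pair on `rx_head` is what makes the producer's writes to `slots[]` visible to the consumer; a real pipeline would add backpressure and batch the index updates.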
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-21 17:48 ` Eric Dumazet @ 2016-01-22 12:33 ` Jesper Dangaard Brouer 2016-01-22 14:33 ` Eric Dumazet 2016-01-22 17:07 ` Tom Herbert 0 siblings, 2 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-22 12:33 UTC (permalink / raw) To: Eric Dumazet Cc: Tom Herbert, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, brouer On Thu, 21 Jan 2016 09:48:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote: > > > Sure, but the receive path is parallelized. > > This is true for multiqueue processing, assuming you can dedicate many > cores to process RX. > > > Improving parallelism has > > continuously shown to have much more impact than attempting to > > optimize for cache misses. The primary goal is not to drive 100Gbps > > with 64 packets from a single CPU. It is one benchmark of many we > > should look at to measure efficiency of the data path, but I've yet to > > see any real workload that requires that... > > > > Regardless of anything, we need to load packet headers into CPU cache > > to do protocol processing. I'm not sure I see how trying to defer that > > as long as possible helps except in cases where the packet is crossing > > CPU cache boundaries and can eliminate cache misses completely (not > > just move them around from one function to another). > > Note that some user space use multiple core (or hyper threads) to > implement a pipeline, using a single RX queue. > > One thread can handle one stage (device RX drain) and prefetch data into > shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads) > > The second thread process packets with headers already in L1/L2 I agree. 
I've heard experiences where DPDK users use 2 cores for RX, and 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding with full Internet routing table lookup. One of the ideas behind my alf_queue is that it can be used for efficiently distributing objects (pointers) between threads: 1. because it only transfers the pointers (not touching the objects), and 2. because it enqueues/dequeues multiple objects with a single locked cmpxchg. Thus, lowering the message-passing cost between threads. > This way, the ~100 ns (or even more if you also consider skb > allocations) penalty to bring packet headers do not hurt PPS. I've studied the allocation cost in great detail, thus let me share my numbers; 100 ns is too high: Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz). The cycles count should be comparable with other CPUs, but the nanosec measurement is affected by the very high clock freq of this CPU.
Kmem_cache fastpath "recycle" case:
 SLUB => 44 cycles(tsc) 11.205 ns
 SLAB => 96 cycles(tsc) 24.119 ns
The problem is that real use-cases in the network stack almost always hit the slowpath in kmem_cache allocators.
Kmem_cache "slowpath" case:
 SLUB => 117 cycles(tsc) 29.276 ns
 SLAB => 101 cycles(tsc) 25.342 ns
I've addressed this "slowpath" problem in the SLUB and SLAB allocators by introducing a bulk API, which amortizes the needed sync-mechanisms.
Kmem_cache using bulk API:
 SLUB => 37 cycles(tsc) 9.280 ns
 SLAB => 20 cycles(tsc) 5.035 ns
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
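The "multiple objects per locked cmpxchg" point can be illustrated with a much-simplified bulk-claim loop. To be clear, this is not the actual alf_queue code (which is a ring with separate head/tail); `grab_bulk()` and `demo_claimed()` are invented names, and a single shared index stands in for the queue, just to show how one successful CAS amortizes over a whole bulk of pointers.

```c
#include <stdatomic.h>
#include <stddef.h>

static _Atomic size_t next_idx; /* shared consumer index over a prefilled array */

/* Claim up to 'want' consecutive objects out of 'total' with a single
 * successful cmpxchg, amortizing the locked-bus operation over the
 * whole bulk instead of paying it once per object. */
static size_t grab_bulk(size_t total, size_t want, size_t *first)
{
	size_t old = atomic_load_explicit(&next_idx, memory_order_relaxed);

	for (;;) {
		size_t left = total - old;
		size_t n = left < want ? left : want;

		if (n == 0)
			return 0; /* queue drained */
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&next_idx, &old, old + n)) {
			*first = old;
			return n;
		}
	}
}

static size_t demo_claimed(void)
{
	size_t first, n, claimed = 0;

	atomic_store(&next_idx, 0);
	while ((n = grab_bulk(100, 16, &first)) > 0)
		claimed += n; /* would process objects [first, first+n) */
	return claimed;
}
```

Several consumer threads can run `grab_bulk()` concurrently and each claim disjoint ranges; the demo simply verifies that the claims partition all 100 objects (six bulks of 16, then one of 4).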
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-22 12:33 ` Jesper Dangaard Brouer @ 2016-01-22 14:33 ` Eric Dumazet 2016-01-22 17:07 ` Tom Herbert 1 sibling, 0 replies; 59+ messages in thread From: Eric Dumazet @ 2016-01-22 14:33 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Tom Herbert, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai On Fri, 2016-01-22 at 13:33 +0100, Jesper Dangaard Brouer wrote: > On Thu, 21 Jan 2016 09:48:36 -0800 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote: > > > > > Sure, but the receive path is parallelized. > > > > This is true for multiqueue processing, assuming you can dedicate many > > cores to process RX. > > > > > Improving parallelism has > > > continuously shown to have much more impact than attempting to > > > optimize for cache misses. The primary goal is not to drive 100Gbps > > > with 64 packets from a single CPU. It is one benchmark of many we > > > should look at to measure efficiency of the data path, but I've yet to > > > see any real workload that requires that... > > > > > > Regardless of anything, we need to load packet headers into CPU cache > > > to do protocol processing. I'm not sure I see how trying to defer that > > > as long as possible helps except in cases where the packet is crossing > > > CPU cache boundaries and can eliminate cache misses completely (not > > > just move them around from one function to another). > > > > Note that some user space use multiple core (or hyper threads) to > > implement a pipeline, using a single RX queue. 
> > > > One thread can handle one stage (device RX drain) and prefetch data into > > shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads) > > > > The second thread process packets with headers already in L1/L2 > > I agree. I've heard experiences where DPDK users use 2 core for RX, and > 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding > with full Internet routing table look up. > > One of the ideas behind my alf_queue, is that it can be used for > efficiently distributing object (pointers) between threads. > 1. because it only transfers the pointers (not touching object), and > 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg. > Thus, lower in the message passing cost between threads. > > > > This way, the ~100 ns (or even more if you also consider skb > > allocations) penalty to bring packet headers do not hurt PPS. > > I've studied the allocation cost in great detail, thus let me share my > numbers, 100 ns is too high: > > Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz). > The cycles count should be comparable with other CPUs, but that nanosec > measurement is affected by the very high clock freq of this CPU. > > Kmem_cache fastpath "recycle" case: > SLUB => 44 cycles(tsc) 11.205 ns > SLAB => 96 cycles(tsc) 24.119 ns. > > The problem is that real use-cases in the network stack, almost always > hit the slowpath in kmem_cache allocators. > > Kmem_cache "slowpath" case: > SLUB => 117 cycles(tsc) 29.276 ns > SLAB => 101 cycles(tsc) 25.342 ns > > I've addressed this "slowpath" problem in the SLUB and SLAB allocators, > by introducing a bulk API, which amortize the needed sync-mechanisms. > > Kmem_cache using bulk API: > SLUB => 37 cycles(tsc) 9.280 ns > SLAB => 20 cycles(tsc) 5.035 ns Your numbers are nice, but the reality of most applications is they run on hosts with ~72 hyperthreads, soon to be ~128 ht. 
(Two physical sockets, with their corresponding memory) The perf numbers show about 100 ns penalty per cache line miss, when all these threads perform real work and applications are properly tuned, because it is very rare that the working set is all in caches. In the following real case, we can see these numbers.

$ perf guncore -M miss_lat_rem,miss_lat_loc
#------------------------------------------------------------------------------------
#                 Socket0                  |                 Socket1                  |
#------------------------------------------------------------------------------------
# Load Miss Latency | Load Miss Latency | Load Miss Latency | Load Miss Latency |
#     Remote RAM    |     Local RAM     |     Remote RAM    |     Local RAM     |
#                 ns|                 ns|                 ns|                 ns|
#------------------------------------------------------------------------------------
              162.25              130.61              173.74              116.80
              162.40              130.41              173.33              116.59
              163.11              132.28              175.90              117.09
              163.36              132.86              176.69              117.45
              161.92              130.32              173.20              117.35
              163.46              130.99              174.80              117.42
              163.54              130.55              174.09              117.26
              163.29              129.75              173.84              117.36
              162.38              130.31              173.44              117.18
              163.00              130.81              174.47              117.24
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-22 12:33 ` Jesper Dangaard Brouer 2016-01-22 14:33 ` Eric Dumazet @ 2016-01-22 17:07 ` Tom Herbert 2016-01-22 17:17 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 59+ messages in thread From: Tom Herbert @ 2016-01-22 17:07 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai On Fri, Jan 22, 2016 at 4:33 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Thu, 21 Jan 2016 09:48:36 -0800 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote: >> >> > Sure, but the receive path is parallelized. >> >> This is true for multiqueue processing, assuming you can dedicate many >> cores to process RX. >> >> > Improving parallelism has >> > continuously shown to have much more impact than attempting to >> > optimize for cache misses. The primary goal is not to drive 100Gbps >> > with 64 packets from a single CPU. It is one benchmark of many we >> > should look at to measure efficiency of the data path, but I've yet to >> > see any real workload that requires that... >> > >> > Regardless of anything, we need to load packet headers into CPU cache >> > to do protocol processing. I'm not sure I see how trying to defer that >> > as long as possible helps except in cases where the packet is crossing >> > CPU cache boundaries and can eliminate cache misses completely (not >> > just move them around from one function to another). >> >> Note that some user space use multiple core (or hyper threads) to >> implement a pipeline, using a single RX queue. 
>> >> One thread can handle one stage (device RX drain) and prefetch data into >> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads) >> >> The second thread process packets with headers already in L1/L2 > > I agree. I've heard experiences where DPDK users use 2 core for RX, and > 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding > with full Internet routing table look up. > > One of the ideas behind my alf_queue, is that it can be used for > efficiently distributing object (pointers) between threads. > 1. because it only transfers the pointers (not touching object), and > 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg. > Thus, lower in the message passing cost between threads. > > >> This way, the ~100 ns (or even more if you also consider skb >> allocations) penalty to bring packet headers do not hurt PPS. > > I've studied the allocation cost in great detail, thus let me share my > numbers, 100 ns is too high: > > Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz). > The cycles count should be comparable with other CPUs, but that nanosec > measurement is affected by the very high clock freq of this CPU. > > Kmem_cache fastpath "recycle" case: > SLUB => 44 cycles(tsc) 11.205 ns > SLAB => 96 cycles(tsc) 24.119 ns. > > The problem is that real use-cases in the network stack, almost always > hit the slowpath in kmem_cache allocators. > > Kmem_cache "slowpath" case: > SLUB => 117 cycles(tsc) 29.276 ns > SLAB => 101 cycles(tsc) 25.342 ns > > I've addressed this "slowpath" problem in the SLUB and SLAB allocators, > by introducing a bulk API, which amortize the needed sync-mechanisms. > > Kmem_cache using bulk API: > SLUB => 37 cycles(tsc) 9.280 ns > SLAB => 20 cycles(tsc) 5.035 ns > Hi Jesper, I am a little confused. I believe the 100ns hit refers specifically cache miss on packet headers. 
Memory object allocation seems like a different problem; the latency might depend on cache misses, but it's not on packet data (which we seem to assume is always a cache miss). For the cache miss problem on the packet headers I think we really need to evaluate whether DDIO adequately solves it (need more numbers :) ). As I read it, DDIO is enabled by default since Sandy Bridge-EP and is transparent to both HW and SW. It seems like we should have seen some sort of measurable benefit by now... Tom > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-22 17:07 ` Tom Herbert @ 2016-01-22 17:17 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-22 17:17 UTC (permalink / raw) To: Tom Herbert Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai, brouer On Fri, 22 Jan 2016 09:07:43 -0800 Tom Herbert <tom@herbertland.com> wrote: > On Fri, Jan 22, 2016 at 4:33 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > On Thu, 21 Jan 2016 09:48:36 -0800 > > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > >> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote: > >> > >> > Sure, but the receive path is parallelized. > >> > >> This is true for multiqueue processing, assuming you can dedicate many > >> cores to process RX. > >> > >> > Improving parallelism has > >> > continuously shown to have much more impact than attempting to > >> > optimize for cache misses. The primary goal is not to drive 100Gbps > >> > with 64 packets from a single CPU. It is one benchmark of many we > >> > should look at to measure efficiency of the data path, but I've yet to > >> > see any real workload that requires that... > >> > > >> > Regardless of anything, we need to load packet headers into CPU cache > >> > to do protocol processing. I'm not sure I see how trying to defer that > >> > as long as possible helps except in cases where the packet is crossing > >> > CPU cache boundaries and can eliminate cache misses completely (not > >> > just move them around from one function to another). > >> > >> Note that some user space use multiple core (or hyper threads) to > >> implement a pipeline, using a single RX queue. 
> >> > >> One thread can handle one stage (device RX drain) and prefetch data into > >> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads) > >> > >> The second thread process packets with headers already in L1/L2 > > > > I agree. I've heard experiences where DPDK users use 2 core for RX, and > > 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding > > with full Internet routing table look up. > > > > One of the ideas behind my alf_queue, is that it can be used for > > efficiently distributing object (pointers) between threads. > > 1. because it only transfers the pointers (not touching object), and > > 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg. > > Thus, lower in the message passing cost between threads. > > > > > >> This way, the ~100 ns (or even more if you also consider skb > >> allocations) penalty to bring packet headers do not hurt PPS. > > > > I've studied the allocation cost in great detail, thus let me share my > > numbers, 100 ns is too high: > > > > Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz). > > The cycles count should be comparable with other CPUs, but that nanosec > > measurement is affected by the very high clock freq of this CPU. > > > > Kmem_cache fastpath "recycle" case: > > SLUB => 44 cycles(tsc) 11.205 ns > > SLAB => 96 cycles(tsc) 24.119 ns. > > > > The problem is that real use-cases in the network stack, almost always > > hit the slowpath in kmem_cache allocators. > > > > Kmem_cache "slowpath" case: > > SLUB => 117 cycles(tsc) 29.276 ns > > SLAB => 101 cycles(tsc) 25.342 ns > > > > I've addressed this "slowpath" problem in the SLUB and SLAB allocators, > > by introducing a bulk API, which amortize the needed sync-mechanisms. > > > > Kmem_cache using bulk API: > > SLUB => 37 cycles(tsc) 9.280 ns > > SLAB => 20 cycles(tsc) 5.035 ns > > > Hi Jesper, > > I am a little confused. 
I believe the 100ns hit refers specifically > cache miss on packet headers. Sorry, I misread Eric's statement. You are right. > Memory object allocation seems like different problem; Yes, it is, I just misread it, and thought we were talking about memory object alloc overhead. Sorry for the confusion. > the latency might depend on cache misses, but it's > not on packet data (which we seem to assume is always a cache miss). > For the cache miss problem on the packet headers I think we really > need to evaluate whether DDIO adequately solves the it (need more > numbers :) ). As I read it, DDIO is enabled by default since Sandy > Bridge-EP and is transparent to both HW and SW. It seems like we > should have seen some sort of measurable benefit by now... -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-20 23:27 ` Tom Herbert 2016-01-21 11:27 ` Jesper Dangaard Brouer 2016-01-21 12:23 ` Jesper Dangaard Brouer @ 2016-02-02 16:13 ` Or Gerlitz 2016-02-02 16:37 ` Eric Dumazet 2 siblings, 1 reply; 59+ messages in thread From: Or Gerlitz @ 2016-02-02 16:13 UTC (permalink / raw) To: Tom Herbert Cc: Eric Dumazet, David Miller, Eric Dumazet, Jesper Dangaard Brouer, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai On Thu, Jan 21, 2016 at 1:27 AM, Tom Herbert <tom@herbertland.com> wrote: > Unfortunately, the hardware hash from devices hasn't really lived up > to its potential. The original intent of getting the hash from the device > was to be able to do packet steering (RPS and RFS) without touching > the header. But this was never implemented. eth_type_trans touches > headers and GRO is best when done before steering. Given the > weaknesses of Toeplitz we talked about recently and the fact that > Jenkins is really fast to compute, I am starting to think maybe we > should always do a software hash and not rely on HW for it... Could you provide some details on the weaknesses of Toeplitz? FYI, the admin is able to configure non-default keys for Toeplitz through ethtool. Or. ^ permalink raw reply [flat|nested] 59+ messages in thread
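[Editor's note] For readers unfamiliar with the algorithm under discussion: Toeplitz, as specified for RSS, is bit-serial — every input bit conditionally XORs a sliding 32-bit window of the key — which is part of why it is expensive in software compared with Jenkins' word-at-a-time mixing. A reference sketch of the algorithm as commonly described, assuming a key at least 4 bytes longer than the input:

```c
/* Bit-serial Toeplitz hash sketch: for each set input bit, XOR in the
 * current leftmost 32 bits of the key, then slide the key window one
 * bit to the left. The key must be at least len+4 bytes long. */
#include <stdint.h>
#include <stddef.h>

static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data,
                              size_t len)
{
    uint32_t hash = 0;
    /* initial window: the first 32 bits of the key */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    size_t kbit = 32;                  /* next key bit to slide in */

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window;
            window = (window << 1) |
                     ((key[kbit / 8] >> (7 - kbit % 8)) & 1u);
            kbit++;
        }
    }
    return hash;
}
```

Eight shift-and-test steps per input byte is the cost being weighed here against a software Jenkins hash, which mixes 32 bits at a time.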
* Re: Optimizing instruction-cache, more packets at each stage 2016-02-02 16:13 ` Or Gerlitz @ 2016-02-02 16:37 ` Eric Dumazet 0 siblings, 0 replies; 59+ messages in thread From: Eric Dumazet @ 2016-02-02 16:37 UTC (permalink / raw) To: Or Gerlitz Cc: Tom Herbert, David Miller, Eric Dumazet, Jesper Dangaard Brouer, Linux Netdev List, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai On Tue, 2016-02-02 at 18:13 +0200, Or Gerlitz wrote: > Could you provide some details on the weaknesses of Toeplitz? > FYI, the admin is able to configure non-default keys for Toeplitz > through ethtool. Well, Toeplitz keys are no longer default anyway, I hope. d682d2bdc306 bnx2x: byte swap rss_key to comply to Toeplitz specs 4671fc6d47e0 net/mlx4_en: really allow to change RSS key c33d23c21501 enic: use netdev_rss_key_fill() helper 6bf79cdddd50 vmxnet3: use netdev_rss_key_fill() helper 7a20db379ce7 sfc: use netdev_rss_key_fill() helper b9d1ab7eb42e mlx4: use netdev_rss_key_fill() helper 9913c61c4486 ixgbe: use netdev_rss_key_fill() helper eb31f8493eee igb: use netdev_rss_key_fill() helper 22f258a1cc2f i40e: use netdev_rss_key_fill() helper c41a4fba4a22 fm10k: use netdev_rss_key_fill() helper 5c8d19da9508 e100e: use netdev_rss_key_fill() helper 1dcf7b1c5f57 be2net:use netdev_rss_key_fill() helper 0fa6aa4ac4e0 bna: use netdev_rss_key_fill() helper 396483564409 tg3: use netdev_rss_key_fill() helper e3ec69ca80a2 bnx2x: use netdev_rss_key_fill() helper b23063034f11 amd-xgbe: use netdev_rss_key_fill() helper 960fb622f851 net: provide a per host RSS key generic infrastructure ^ permalink raw reply [flat|nested] 59+ messages in thread
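[Editor's note] The netdev_rss_key_fill() helper that the commit list above converts drivers to (introduced by "net: provide a per host RSS key generic infrastructure") gives every NIC the same lazily generated random key, so all devices on a host hash flows consistently. A userspace model of that behavior — the function name and the rand() stand-in for the kernel's entropy source are illustrative, and 52 is the key length (NETDEV_RSS_KEY_LEN) in kernels of this era:

```c
/* Userspace model of the per-host RSS key infrastructure: generate
 * one random key lazily on first use, then hand every caller the
 * same bytes. rand() is only a stand-in for proper kernel entropy. */
#include <stdlib.h>
#include <stddef.h>
#include <string.h>

#define HOST_RSS_KEY_LEN 52   /* assumed to mirror NETDEV_RSS_KEY_LEN */

static unsigned char host_rss_key[HOST_RSS_KEY_LEN];
static int key_ready;

static void rss_key_fill(void *buf, size_t len)
{
    if (!key_ready) {
        for (size_t i = 0; i < HOST_RSS_KEY_LEN; i++)
            host_rss_key[i] = (unsigned char)rand();
        key_ready = 1;
    }
    /* every caller gets a prefix of the same per-host key */
    memcpy(buf, host_rss_key,
           len < HOST_RSS_KEY_LEN ? len : HOST_RSS_KEY_LEN);
}
```

This is why the commits matter for the Toeplitz discussion: with a random per-host key, the known weaknesses of the fixed default key no longer apply directly.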
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-18 10:27 ` Jesper Dangaard Brouer 2016-01-18 16:24 ` David Miller @ 2016-01-18 16:53 ` Eric Dumazet 2016-01-18 17:36 ` Tom Herbert 2 siblings, 0 replies; 59+ messages in thread From: Eric Dumazet @ 2016-01-18 16:53 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: David Miller, netdev, alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw, pabeni, john.r.fastabend On Mon, 2016-01-18 at 11:27 +0100, Jesper Dangaard Brouer wrote: > The NAPI drivers actually already have a flush API (calling > napi_complete_done()), BUT it does not always get invoked, e.g. if the > driver has more work to do, and wants to keep polling. > I'm not sure we want to delay "flushing" packets queued in the GRO > layer for this long(?). Since linux-3.7 we have this logic: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2e71a6f8084e7ac87166dd77d99c44190fb844fc (Some arches have quite expensive high resolution timestamps) ^ permalink raw reply [flat|nested] 59+ messages in thread
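[Editor's note] The commit Eric points to amortizes an expensive high-resolution clock read over a NAPI poll instead of paying it per packet. The shape of that optimization, sketched in userspace with a hypothetical expensive_clock() standing in for the costly timestamp source:

```c
/* Amortizing an expensive clock read: sample once per RX burst and
 * stamp every packet in the burst with that value, instead of reading
 * the clock once per packet. expensive_clock() is a stand-in for a
 * costly high-resolution timestamp source. */
#include <stddef.h>

static unsigned long clock_reads;   /* counts how often we pay the cost */

static unsigned long expensive_clock(void)
{
    clock_reads++;
    return 1000 + clock_reads;      /* fake monotonic value */
}

struct rx_pkt {
    unsigned long tstamp;
};

static void stamp_burst(struct rx_pkt *pkts, size_t n)
{
    unsigned long now = expensive_clock();  /* one read per burst */
    for (size_t i = 0; i < n; i++)
        pkts[i].tstamp = now;
}
```

This is the same amortize-per-burst pattern the whole thread is about, applied to the clock rather than to icache.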
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-18 10:27 ` Jesper Dangaard Brouer 2016-01-18 16:24 ` David Miller 2016-01-18 16:53 ` Eric Dumazet @ 2016-01-18 17:36 ` Tom Herbert 2016-01-18 17:49 ` Jesper Dangaard Brouer 2 siblings, 1 reply; 59+ messages in thread From: Tom Herbert @ 2016-01-18 17:36 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: David Miller, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, marek, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend On Mon, Jan 18, 2016 at 2:27 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Fri, 15 Jan 2016 15:47:21 -0500 (EST) > David Miller <davem@davemloft.net> wrote: > >> From: Jesper Dangaard Brouer <brouer@redhat.com> >> Date: Fri, 15 Jan 2016 14:22:23 +0100 >> >> > This was only at the driver level. I also would like some API towards >> > the stack. Maybe we could simply pass a skb-list? >> >> Datastructures are everything so maybe we can create some kind of SKB >> bundle abstractions. Whether it's a lockless array or a linked list >> behind it doesn't really matter. >> >> We could have two categories: Related and Unrelated. >> >> If you think about GRO and routing keys you might see what I am getting >> at. :-) > > Yes, I think I get it. I like the idea of Related and Unrelated. > We already have GRO packets, which are in the "Related" category/type. > > I'm wondering about the API between driver and "GRO-layer" (calling > napi_gro_receive): > > Down in the driver layer (RX), I think it is too early to categorize > Related/Unrelated SKB's, because we want to delay touching packet-data > as long as possible (waiting for the prefetcher to get data into > cache). > Does DDIO address this? > We could keep the napi_gro_receive() call. But in order to save > icache, the driver could just create its own simple loop around > napi_gro_receive(). 
This loop's icache and extra function call per > packet would cost something. > > The downside is: the GRO layer will have no idea how many "more" > packets are coming. Thus, it depends on a "flush" API, which for > "xmit_more" didn't work out that well. > > The NAPI drivers actually already have a flush API (calling > napi_complete_done()), BUT it does not always get invoked, e.g. if the > driver has more work to do, and wants to keep polling. > I'm not sure we want to delay "flushing" packets queued in the GRO > layer for this long(?). > > > The simplest solution to get around this (flush and driver-loop > complexity) would be to create a SKB-list down in the driver, and > call napi_gro_receive() with this list, simply extending napi_gro_receive() > with a SKB-list loop. > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
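[Editor's note] The skb-list idea can be modeled in userspace (the struct and function names below are illustrative, not kernel API): the driver drains its RX ring into a list, then a single call walks the whole list through the receive code, so the stack's instructions are fetched once per burst rather than once per packet.

```c
/* Userspace model of passing a whole SKB list into the stack in one
 * call, instead of one full stack traversal per packet. 'pkt',
 * 'rx_collect' and 'stack_receive_list' are illustrative names. */
#include <stddef.h>

struct pkt {
    struct pkt *next;
    int len;
};

/* Stage 1: the driver RX loop collects available descriptors
 * into a singly linked list without touching packet data. */
static struct pkt *rx_collect(struct pkt *ring, size_t avail)
{
    struct pkt *head = NULL, **tail = &head;
    for (size_t i = 0; i < avail; i++) {
        ring[i].next = NULL;
        *tail = &ring[i];
        tail = &ring[i].next;
    }
    return head;
}

/* Stage 2: one call runs the (hot) receive code over every packet,
 * so its instructions stay in icache for the whole burst. */
static int stack_receive_list(struct pkt *head)
{
    int count = 0;
    for (struct pkt *p = head; p; p = p->next)
        count++;    /* stand-in for per-packet protocol processing */
    return count;
}
```

Letting the list length follow the number of packets available in the RX ring gives the self-scaling amortization described in the opening mail: under load, bursts grow and the per-packet cost shrinks.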
* Re: Optimizing instruction-cache, more packets at each stage 2016-01-18 17:36 ` Tom Herbert @ 2016-01-18 17:49 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 59+ messages in thread From: Jesper Dangaard Brouer @ 2016-01-18 17:49 UTC (permalink / raw) To: Tom Herbert Cc: David Miller, Linux Kernel Network Developers, Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, marek, Hannes Frederic Sowa, Florian Westphal, Paolo Abeni, John Fastabend, brouer On Mon, 18 Jan 2016 09:36:32 -0800 Tom Herbert <tom@herbertland.com> wrote: > On Mon, Jan 18, 2016 at 2:27 AM, Jesper Dangaard Brouer [...] > > Down in the driver layer (RX), I think it is too early to categorize > > Related/Unrelated SKB's, because we want to delay touching packet-data > > as long as possible (waiting for the prefetcher to get data into > > cache). > > > Does DDIO address this? Data Direct IO (DDIO) delivers packet-data into the L3 cache, which is great for avoiding the first cache miss on data. But not all CPUs have this feature. And it is difficult to deduce which CPUs support this feature. For test purposes, I do have systems both with and without DDIO. I'm currently setting up a Skylake CPU based system, which I believe doesn't have DDIO. The reason for this system is that the Skylake CPU should have better PMU support for profiling icache and front-end. I'll soon verify this... -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 59+ messages in thread
end of thread, other threads:[~2016-02-02 16:37 UTC | newest] Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-01-15 13:22 Optimizing instruction-cache, more packets at each stage Jesper Dangaard Brouer 2016-01-15 13:32 ` Hannes Frederic Sowa 2016-01-15 14:17 ` Jesper Dangaard Brouer 2016-01-15 13:36 ` David Laight 2016-01-15 14:00 ` Jesper Dangaard Brouer 2016-01-15 14:38 ` Felix Fietkau 2016-01-18 11:54 ` Jesper Dangaard Brouer 2016-01-18 17:01 ` Eric Dumazet 2016-01-25 0:08 ` Florian Fainelli 2016-01-15 20:47 ` David Miller 2016-01-18 10:27 ` Jesper Dangaard Brouer 2016-01-18 16:24 ` David Miller 2016-01-20 22:20 ` Or Gerlitz 2016-01-20 23:02 ` Eric Dumazet 2016-01-20 23:27 ` Tom Herbert 2016-01-21 11:27 ` Jesper Dangaard Brouer 2016-01-21 12:49 ` Or Gerlitz 2016-01-21 13:57 ` Jesper Dangaard Brouer 2016-01-21 18:56 ` David Miller 2016-01-21 22:45 ` Or Gerlitz 2016-01-21 22:59 ` David Miller 2016-01-21 16:38 ` Eric Dumazet 2016-01-21 18:54 ` David Miller 2016-01-24 14:28 ` Jesper Dangaard Brouer 2016-01-24 14:44 ` Michael S. 
Tsirkin 2016-01-24 17:28 ` John Fastabend 2016-01-25 13:15 ` Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) Jesper Dangaard Brouer 2016-01-25 17:09 ` Tom Herbert 2016-01-25 17:50 ` John Fastabend 2016-01-25 21:32 ` Tom Herbert 2016-01-25 21:58 ` John Fastabend 2016-01-25 22:10 ` Jesper Dangaard Brouer 2016-01-27 20:47 ` Jesper Dangaard Brouer 2016-01-27 21:56 ` Alexei Starovoitov 2016-01-28 9:52 ` Jesper Dangaard Brouer 2016-01-28 12:54 ` Eric Dumazet 2016-01-28 13:25 ` Eric Dumazet 2016-01-28 16:43 ` Tom Herbert 2016-01-28 2:50 ` Tom Herbert 2016-01-28 9:25 ` Jesper Dangaard Brouer 2016-01-28 12:45 ` Eric Dumazet 2016-01-28 16:37 ` Tom Herbert 2016-01-28 16:43 ` Eric Dumazet 2016-01-28 17:04 ` Jesper Dangaard Brouer 2016-01-24 20:09 ` Optimizing instruction-cache, more packets at each stage Tom Herbert 2016-01-24 21:41 ` John Fastabend 2016-01-24 23:50 ` Tom Herbert 2016-01-21 12:23 ` Jesper Dangaard Brouer 2016-01-21 16:38 ` Tom Herbert 2016-01-21 17:48 ` Eric Dumazet 2016-01-22 12:33 ` Jesper Dangaard Brouer 2016-01-22 14:33 ` Eric Dumazet 2016-01-22 17:07 ` Tom Herbert 2016-01-22 17:17 ` Jesper Dangaard Brouer 2016-02-02 16:13 ` Or Gerlitz 2016-02-02 16:37 ` Eric Dumazet 2016-01-18 16:53 ` Eric Dumazet 2016-01-18 17:36 ` Tom Herbert 2016-01-18 17:49 ` Jesper Dangaard Brouer