* Created benchmark modules for page_pool
@ 2020-01-21 16:09 Jesper Dangaard Brouer
  2020-01-22 10:42 ` Ilias Apalodimas
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-21 16:09 UTC
  To: Ilias Apalodimas, Lorenzo Bianconi
  Cc: brouer, Saeed Mahameed, Matteo Croce, Tariq Toukan,
	Toke Høiland-Jørgensen, Jonathan Lemon, netdev

Hi Ilias and Lorenzo, (Cc others + netdev)

I've created two benchmark modules for page_pool.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c

I think we/you could actually use this as part of your presentation[3]?

The first benchmark[1] illustrates/measures what happens when page_pool
alloc and free/return happen on the same CPU.  Here there are 3 modes
of operation with different performance characteristics.

Fast_path NAPI recycle (XDP_DROP use-case)
 - cost per elem: 15 cycles(tsc) 4.437 ns

Recycle via ptr_ring
 - cost per elem: 48 cycles(tsc) 13.439 ns

Failed recycle, return to page-allocator
 - cost per elem: 256 cycles(tsc) 71.169 ns
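
For reference, the inner loop of [1] boils down to something like the
sketch below.  This is a simplified illustration, not the module's
exact code: the pool parameters are made up, error handling is
trimmed, and the failed-recycle mode is emulated with a bare
put_page() (which bypasses page_pool's inflight accounting).

#include <net/page_pool.h>

/* Sketch: small pool, no DMA-mapping flags, as we only time the
 * alloc/return paths (illustrative parameters).
 */
static struct page_pool *make_pool(void)
{
	struct page_pool_params pp_params = {
		.order     = 0,
		.flags     = 0,
		.pool_size = 1024,
		.nid       = NUMA_NO_NODE,
	};

	return page_pool_create(&pp_params);
}

/* Same-CPU alloc+return loop; one iteration per measured elem */
static int run_bench(struct page_pool *pp, int mode, int loops)
{
	struct page *page;
	int i;

	for (i = 0; i < loops; i++) {
		page = page_pool_alloc_pages(pp, GFP_ATOMIC);
		if (!page)
			return -ENOMEM;

		switch (mode) {
		case 0: /* fast-path NAPI recycle (XDP_DROP use-case) */
			page_pool_recycle_direct(pp, page);
			break;
		case 1: /* recycle via ptr_ring (outside NAPI context) */
			page_pool_put_page(pp, page, false);
			break;
		case 2: /* emulate failed recycle: hand page straight
			 * back to the page allocator (sketch only) */
			put_page(page);
			break;
		}
	}
	return 0;
}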


The second benchmark[2] measures what happens cross-CPU.  It is
primarily the concurrent return-path that I want to capture, as this
is page_pool's weak spot that we/I need to improve the performance of.
Hint: when SKBs use the page_pool return path, this will happen more
often.  It is a little more tricky to get a proper measurement, as we
want to observe the case where the return-path isn't stalling/waiting
on pages to return.

- 1 CPU returning  , cost per elem: 110 cycles(tsc)   30.709 ns
- 2 concurrent CPUs, cost per elem: 989 cycles(tsc)  274.861 ns
- 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
- 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns
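
The basic pattern in [2] is: allocate a batch on one CPU, return the
pages from kthreads pinned to other CPUs.  A heavily condensed sketch
follows; the names, batch size, and the missing timing/completion
logic and error handling are illustrative, not the module's exact
code.

#include <linux/kthread.h>
#include <net/page_pool.h>

#define BATCH 1024

struct returner {
	struct page_pool *pp;
	struct page *pages[BATCH];
};

/* Return pages from a remote CPU, outside NAPI context, forcing
 * page_pool to take the concurrent ptr_ring path.
 */
static int returner_fn(void *arg)
{
	struct returner *r = arg;
	int i;

	for (i = 0; i < BATCH; i++)
		page_pool_put_page(r->pp, r->pages[i], false);
	return 0;
}

static void run_cross_cpu(struct page_pool *pp, int remote_cpu)
{
	static struct returner r;
	struct task_struct *t;
	int i;

	r.pp = pp;
	for (i = 0; i < BATCH; i++)
		r.pages[i] = page_pool_alloc_pages(pp, GFP_KERNEL);

	t = kthread_create(returner_fn, &r, "pp_returner/%d", remote_cpu);
	if (IS_ERR(t))
		return;
	kthread_bind(t, remote_cpu);	/* pin the return-path CPU */
	wake_up_process(t);
	/* real module: time until all pages are back, while making
	 * sure the alloc side isn't measured stalling on returns */
}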

[3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: Created benchmark modules for page_pool
  2020-01-21 16:09 Created benchmark modules for page_pool Jesper Dangaard Brouer
@ 2020-01-22 10:42 ` Ilias Apalodimas
  2020-01-22 12:09   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 6+ messages in thread
From: Ilias Apalodimas @ 2020-01-22 10:42 UTC
  To: Jesper Dangaard Brouer
  Cc: Lorenzo Bianconi, Saeed Mahameed, Matteo Croce, Tariq Toukan,
	Toke Høiland-Jørgensen, Jonathan Lemon, netdev

Hi Jesper, 

On Tue, Jan 21, 2020 at 05:09:45PM +0100, Jesper Dangaard Brouer wrote:
> Hi Ilias and Lorenzo, (Cc others + netdev)
> 
> I've created two benchmark modules for page_pool.
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
> 
> I think we/you could actually use this as part of your presentation[3]?

I think we can mention this as part of the improvements we can offer,
alongside native SKB recycling.

> 
> The first benchmark[1] illustrates/measures what happens when page_pool
> alloc and free/return happen on the same CPU.  Here there are 3 modes
> of operation with different performance characteristics.
> 
> Fast_path NAPI recycle (XDP_DROP use-case)
>  - cost per elem: 15 cycles(tsc) 4.437 ns
> 
> Recycle via ptr_ring
>  - cost per elem: 48 cycles(tsc) 13.439 ns
> 
> Failed recycle, return to page-allocator
>  - cost per elem: 256 cycles(tsc) 71.169 ns
> 
> 
> The second benchmark[2] measures what happens cross-CPU.  It is
> primarily the concurrent return-path that I want to capture, as this
> is page_pool's weak spot that we/I need to improve the performance of.
> Hint: when SKBs use the page_pool return path, this will happen more
> often.  It is a little more tricky to get a proper measurement, as we
> want to observe the case where the return-path isn't stalling/waiting
> on pages to return.
> 
> - 1 CPU returning  , cost per elem: 110 cycles(tsc)   30.709 ns
> - 2 concurrent CPUs, cost per elem: 989 cycles(tsc)  274.861 ns
> - 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
> - 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns

Interesting, I'll try having a look at the code and maybe run them on my
armv8 board.

Thanks!
/Ilias
> 
> [3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 


* Re: Created benchmark modules for page_pool
  2020-01-22 10:42 ` Ilias Apalodimas
@ 2020-01-22 12:09   ` Jesper Dangaard Brouer
  2020-01-28 16:22     ` Matteo Croce
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-22 12:09 UTC
  To: Ilias Apalodimas
  Cc: Lorenzo Bianconi, Saeed Mahameed, Matteo Croce, Tariq Toukan,
	Toke Høiland-Jørgensen, Jonathan Lemon, netdev, brouer

On Wed, 22 Jan 2020 12:42:05 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> Hi Jesper, 
> 
> On Tue, Jan 21, 2020 at 05:09:45PM +0100, Jesper Dangaard Brouer wrote:
> > Hi Ilias and Lorenzo, (Cc others + netdev)
> > 
> > I've created two benchmark modules for page_pool.
> > 
> > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
> > 
> > I think we/you could actually use this as part of your presentation[3]?  
> 
> I think we can mention this as part of the improvements we can offer,
> alongside native SKB recycling.

Yes, but you should notice that the cross-CPU return benchmark test
shows that we/page_pool is too slow...


> > 
> > The first benchmark[1] illustrates/measures what happens when page_pool
> > alloc and free/return happen on the same CPU.  Here there are 3
> > modes of operation with different performance characteristics.
> > 
> > Fast_path NAPI recycle (XDP_DROP use-case)
> >  - cost per elem: 15 cycles(tsc) 4.437 ns
> > 
> > Recycle via ptr_ring
> >  - cost per elem: 48 cycles(tsc) 13.439 ns
> > 
> > Failed recycle, return to page-allocator
> >  - cost per elem: 256 cycles(tsc) 71.169 ns
> > 
> > 
> > The second benchmark[2] measures what happens cross-CPU.  It is
> > primarily the concurrent return-path that I want to capture, as this
> > is page_pool's weak spot that we/I need to improve the performance of.
> > Hint: when SKBs use the page_pool return path, this will happen
> > more often.  It is a little more tricky to get a proper measurement,
> > as we want to observe the case where the return-path isn't
> > stalling/waiting on pages to return.
> > 
> > - 1 CPU returning  , cost per elem: 110 cycles(tsc)   30.709 ns
> > - 2 concurrent CPUs, cost per elem: 989 cycles(tsc)  274.861 ns
> > - 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
> > - 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns  

Fixed a small bug, thus a re-run of the cross_cpu bench numbers:

- 2 concurrent CPUs, cost per elem:  462 cycles(tsc) 128.502 ns
- 3 concurrent CPUs, cost per elem: 1992 cycles(tsc) 553.507 ns
- 4 concurrent CPUs, cost per elem: 2323 cycles(tsc) 645.389 ns


> Interesting, I'll try having a look at the code and maybe run them on
> my armv8 board.

That will be great, but we/you have to fix up the Intel-specific ASM
instructions in time_bench.c (which we already discussed on IRC).

> > 
> > [3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: Created benchmark modules for page_pool
  2020-01-22 12:09   ` Jesper Dangaard Brouer
@ 2020-01-28 16:22     ` Matteo Croce
  2020-01-28 18:41       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 6+ messages in thread
From: Matteo Croce @ 2020-01-28 16:22 UTC
  To: Jesper Dangaard Brouer
  Cc: Ilias Apalodimas, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
	Toke Høiland-Jørgensen, Jonathan Lemon, netdev

On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 22 Jan 2020 12:42:05 +0200
> > Interesting, I'll try having a look at the code and maybe run them on
> > my armv8 board.
>
> That will be great, but we/you have to fix up the Intel-specific ASM
> instructions in time_bench.c (which we already discussed on IRC).
>

What does it need to work on arm64? Replace RDPMC with something generic?


--
Matteo Croce
per aspera ad upstream



* Re: Created benchmark modules for page_pool
  2020-01-28 16:22     ` Matteo Croce
@ 2020-01-28 18:41       ` Jesper Dangaard Brouer
  2020-01-29  9:07         ` Ilias Apalodimas
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-28 18:41 UTC
  To: Matteo Croce
  Cc: Ilias Apalodimas, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
	Toke Høiland-Jørgensen, Jonathan Lemon, netdev, brouer

On Tue, 28 Jan 2020 17:22:47 +0100
Matteo Croce <mcroce@redhat.com> wrote:

> On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Wed, 22 Jan 2020 12:42:05 +0200  
> > > Interesting, I'll try having a look at the code and maybe run them on
> > > my armv8 board.
> >
> > That will be great, but we/you have to fix up the Intel-specific ASM
> > instructions in time_bench.c (which we already discussed on IRC).
> >  
> 
> What does it need to work on arm64? Replace RDPMC with something generic?

Replacing the RDTSC. I'm hoping Ilias will fix it for ARM ;-)

You can also fix it yourself by using get_cycles() from include
<linux/timex.h>.  If the ARCH doesn't have support, it will just
return 0.
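
Something like this untested sketch (an illustration, not
time_bench.c's actual wrapper):

#include <linux/timex.h>	/* get_cycles(), cycles_t */

/* Portable replacement for the rdtsc inline asm: get_cycles() maps
 * to RDTSC on x86 and to the architected counter (CNTVCT_EL0) on
 * arm64, and returns 0 on archs without support.  Caveat: on arm64
 * it ticks at the arch-timer frequency, not the CPU clock, so the
 * "cycles" are not directly comparable to the x86 TSC numbers.
 */
static inline u64 bench_read_counter(void)
{
	return (u64)get_cycles();
}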

Have you tried it out on your normal x86/Intel box?
Hint:
 https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: Created benchmark modules for page_pool
  2020-01-28 18:41       ` Jesper Dangaard Brouer
@ 2020-01-29  9:07         ` Ilias Apalodimas
  0 siblings, 0 replies; 6+ messages in thread
From: Ilias Apalodimas @ 2020-01-29  9:07 UTC
  To: Jesper Dangaard Brouer
  Cc: Matteo Croce, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
	Toke Høiland-Jørgensen, Jonathan Lemon, netdev

On Tue, Jan 28, 2020 at 07:41:36PM +0100, Jesper Dangaard Brouer wrote:
> On Tue, 28 Jan 2020 17:22:47 +0100
> Matteo Croce <mcroce@redhat.com> wrote:
> 
> > On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > > On Wed, 22 Jan 2020 12:42:05 +0200  
> > > > Interesting, I'll try having a look at the code and maybe run them on
> > > > my armv8 board.
> > >
> > > That will be great, but we/you have to fix up the Intel-specific ASM
> > > instructions in time_bench.c (which we already discussed on IRC).
> > >  
> > 
> > What does it need to work on arm64? Replace RDPMC with something generic?
> 
> Replacing the RDTSC. I'm hoping Ilias will fix it for ARM ;-)

I'll have a look today and run it on my armv8 box

Cheers
/Ilias

