* Created benchmarks modules for page_pool
@ 2020-01-21 16:09 Jesper Dangaard Brouer
2020-01-22 10:42 ` Ilias Apalodimas
0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-21 16:09 UTC (permalink / raw)
To: Ilias Apalodimas, Lorenzo Bianconi
Cc: brouer, Saeed Mahameed, Matteo Croce, Tariq Toukan,
Toke Høiland-Jørgensen, Jonathan Lemon, netdev
Hi Ilias and Lorenzo, (Cc others + netdev)
I've created two benchmarks modules for page_pool.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
I think we/you could actually use this as part of your presentation[3]?
The first benchmark[1] illustrates/measures what happens when page_pool
alloc and free/return happen on the same CPU. Here there are 3 modes
of operation with different performance characteristics.
Fast_path NAPI recycle (XDP_DROP use-case)
- cost per elem: 15 cycles(tsc) 4.437 ns
Recycle via ptr_ring
- cost per elem: 48 cycles(tsc) 13.439 ns
Failed recycle, return to page-allocator
- cost per elem: 256 cycles(tsc) 71.169 ns
The second benchmark[2] measures what happens cross-CPU. It is
primarily the concurrent return-path that I want to capture, as this
is page_pool's weak spot, which we/I need to improve the performance of.
Hint: when SKBs use page_pool return, this will happen more often.
It is a little more tricky to get a proper measurement, as we want to
observe the case where the return-path isn't stalling/waiting on pages
to return.
- 1 CPU returning , cost per elem: 110 cycles(tsc) 30.709 ns
- 2 concurrent CPUs, cost per elem: 989 cycles(tsc) 274.861 ns
- 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
- 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns
[3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: Created benchmarks modules for page_pool
2020-01-21 16:09 Created benchmarks modules for page_pool Jesper Dangaard Brouer
@ 2020-01-22 10:42 ` Ilias Apalodimas
2020-01-22 12:09 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 6+ messages in thread
From: Ilias Apalodimas @ 2020-01-22 10:42 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Lorenzo Bianconi, Saeed Mahameed, Matteo Croce, Tariq Toukan,
Toke Høiland-Jørgensen, Jonathan Lemon, netdev
Hi Jesper,
On Tue, Jan 21, 2020 at 05:09:45PM +0100, Jesper Dangaard Brouer wrote:
> Hi Ilias and Lorenzo, (Cc others + netdev)
>
> I've created two benchmarks modules for page_pool.
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
>
> I think we/you could actually use this as part of your presentation[3]?
I think we can mention this as part of the improvements we can offer,
alongside native SKB recycling.
>
> The first benchmark[1] illustrates/measures what happens when page_pool
> alloc and free/return happen on the same CPU. Here there are 3 modes
> of operation with different performance characteristics.
>
> Fast_path NAPI recycle (XDP_DROP use-case)
> - cost per elem: 15 cycles(tsc) 4.437 ns
>
> Recycle via ptr_ring
> - cost per elem: 48 cycles(tsc) 13.439 ns
>
> Failed recycle, return to page-allocator
> - cost per elem: 256 cycles(tsc) 71.169 ns
>
>
> The second benchmark[2] measures what happens cross-CPU. It is
> primarily the concurrent return-path that I want to capture, as this
> is page_pool's weak spot, which we/I need to improve the performance of.
> Hint: when SKBs use page_pool return, this will happen more often.
> It is a little more tricky to get a proper measurement, as we want to
> observe the case where the return-path isn't stalling/waiting on pages
> to return.
>
> - 1 CPU returning , cost per elem: 110 cycles(tsc) 30.709 ns
> - 2 concurrent CPUs, cost per elem: 989 cycles(tsc) 274.861 ns
> - 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
> - 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns
Interesting, I'll try having a look at the code and maybe run them on
my armv8 board.
Thanks!
/Ilias
>
> [3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
>
* Re: Created benchmarks modules for page_pool
2020-01-22 10:42 ` Ilias Apalodimas
@ 2020-01-22 12:09 ` Jesper Dangaard Brouer
2020-01-28 16:22 ` Matteo Croce
0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-22 12:09 UTC (permalink / raw)
To: Ilias Apalodimas
Cc: Lorenzo Bianconi, Saeed Mahameed, Matteo Croce, Tariq Toukan,
Toke Høiland-Jørgensen, Jonathan Lemon, netdev, brouer
On Wed, 22 Jan 2020 12:42:05 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> Hi Jesper,
>
> On Tue, Jan 21, 2020 at 05:09:45PM +0100, Jesper Dangaard Brouer wrote:
> > Hi Ilias and Lorenzo, (Cc others + netdev)
> >
> > I've created two benchmarks modules for page_pool.
> >
> > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
> >
> > I think we/you could actually use this as part of your presentation[3]?
>
> I think we can mention this as part of the improvements we can offer,
> alongside native SKB recycling.
Yes, but you should notice that the cross-CPU return benchmark shows
that we/page_pool are too slow...
> >
> > The first benchmark[1] illustrates/measures what happens when page_pool
> > alloc and free/return happen on the same CPU. Here there are 3
> > modes of operation with different performance characteristics.
> >
> > Fast_path NAPI recycle (XDP_DROP use-case)
> > - cost per elem: 15 cycles(tsc) 4.437 ns
> >
> > Recycle via ptr_ring
> > - cost per elem: 48 cycles(tsc) 13.439 ns
> >
> > Failed recycle, return to page-allocator
> > - cost per elem: 256 cycles(tsc) 71.169 ns
> >
> >
> > The second benchmark[2] measures what happens cross-CPU. It is
> > primarily the concurrent return-path that I want to capture, as this
> > is page_pool's weak spot, which we/I need to improve the performance of.
> > Hint: when SKBs use page_pool return, this will happen more often.
> > It is a little more tricky to get a proper measurement, as we want to
> > observe the case where the return-path isn't stalling/waiting on pages
> > to return.
> >
> > - 1 CPU returning , cost per elem: 110 cycles(tsc) 30.709 ns
> > - 2 concurrent CPUs, cost per elem: 989 cycles(tsc) 274.861 ns
> > - 3 concurrent CPUs, cost per elem: 2089 cycles(tsc) 580.530 ns
> > - 4 concurrent CPUs, cost per elem: 2339 cycles(tsc) 649.984 ns
Found a small bug, thus a re-run of the cross_cpu bench numbers:
- 2 concurrent CPUs, cost per elem: 462 cycles(tsc) 128.502 ns
- 3 concurrent CPUs, cost per elem: 1992 cycles(tsc) 553.507 ns
- 4 concurrent CPUs, cost per elem: 2323 cycles(tsc) 645.389 ns
> Interesting, I'll try having a look at the code and maybe run them on
> my armv8 board.
That will be great, but we/you have to fix up the Intel-specific ASM
instructions in time_bench.c (which we already discussed on IRC).
> >
> > [3] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: Created benchmarks modules for page_pool
2020-01-22 12:09 ` Jesper Dangaard Brouer
@ 2020-01-28 16:22 ` Matteo Croce
2020-01-28 18:41 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 6+ messages in thread
From: Matteo Croce @ 2020-01-28 16:22 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Ilias Apalodimas, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
Toke Høiland-Jørgensen, Jonathan Lemon, netdev
On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 22 Jan 2020 12:42:05 +0200
> > Interesting, I'll try having a look at the code and maybe run them on
> > my armv8 board.
>
> That will be great, but we/you have to fix up the Intel-specific ASM
> instructions in time_bench.c (which we already discussed on IRC).
>
What does it need to work on arm64? Replace RDPMC with something generic?
--
Matteo Croce
per aspera ad upstream
* Re: Created benchmarks modules for page_pool
2020-01-28 16:22 ` Matteo Croce
@ 2020-01-28 18:41 ` Jesper Dangaard Brouer
2020-01-29 9:07 ` Ilias Apalodimas
0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-28 18:41 UTC (permalink / raw)
To: Matteo Croce
Cc: Ilias Apalodimas, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
Toke Høiland-Jørgensen, Jonathan Lemon, netdev, brouer
On Tue, 28 Jan 2020 17:22:47 +0100
Matteo Croce <mcroce@redhat.com> wrote:
> On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Wed, 22 Jan 2020 12:42:05 +0200
> > > Interesting, I'll try having a look at the code and maybe run them on
> > > my armv8 board.
> >
> > That will be great, but we/you have to fix up the Intel-specific ASM
> > instructions in time_bench.c (which we already discussed on IRC).
> >
>
> What does it need to work on arm64? Replace RDPMC with something generic?
Replacing the RDTSC. Hoping Ilias will fix it for ARM ;-)
You can also fix it yourself by using get_cycles() from <linux/timex.h>.
If the arch doesn't have support, it will just return 0.
Have you tried it out on your normal x86/Intel box?
Hint:
https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: Created benchmarks modules for page_pool
2020-01-28 18:41 ` Jesper Dangaard Brouer
@ 2020-01-29 9:07 ` Ilias Apalodimas
0 siblings, 0 replies; 6+ messages in thread
From: Ilias Apalodimas @ 2020-01-29 9:07 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Matteo Croce, Lorenzo Bianconi, Saeed Mahameed, Tariq Toukan,
Toke Høiland-Jørgensen, Jonathan Lemon, netdev
On Tue, Jan 28, 2020 at 07:41:36PM +0100, Jesper Dangaard Brouer wrote:
> On Tue, 28 Jan 2020 17:22:47 +0100
> Matteo Croce <mcroce@redhat.com> wrote:
>
> > On Wed, Jan 22, 2020 at 1:09 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > > On Wed, 22 Jan 2020 12:42:05 +0200
> > > > Interesting, I'll try having a look at the code and maybe run them on
> > > > my armv8 board.
> > >
> > > That will be great, but we/you have to fix up the Intel-specific ASM
> > > instructions in time_bench.c (which we already discussed on IRC).
> > >
> >
> > What does it need to work on arm64? Replace RDPMC with something generic?
>
> Replacing the RDTSC. Hoping Ilias will fix it for ARM ;-)
I'll have a look today and run it on my armv8 box
Cheers
/Ilias