All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yonghong Song <yhs@fb.com>
To: Andrii Nakryiko <andriin@fb.com>, <bpf@vger.kernel.org>,
	<netdev@vger.kernel.org>, <ast@fb.com>, <daniel@iogearbox.net>
Cc: <andrii.nakryiko@gmail.com>, <kernel-team@fb.com>
Subject: Re: [PATCH v3 bpf-next 2/4] selftests/bpf: add benchmark runner infrastructure
Date: Tue, 12 May 2020 18:09:11 -0700	[thread overview]
Message-ID: <65f234b5-0c97-2925-f440-d43a9a9afef2@fb.com> (raw)
In-Reply-To: <20200512192445.2351848-3-andriin@fb.com>



On 5/12/20 12:24 PM, Andrii Nakryiko wrote:
> While working on BPF ringbuf implementation, testing, and benchmarking, I've
> developed a pretty generic and modular benchmark runner, which seems to be
> generically useful, as I've already used it for one more purpose (testing
> fastest way to trigger BPF program, to minimize overhead of in-kernel code).
> 
> This patch adds generic part of benchmark runner and sets up Makefile for
> extending it with more sets of benchmarks.
> 
> Benchmarker itself operates by spinning up specified number of producer and
> consumer threads, setting up interval timer sending SIGALARM signal to
> application once a second. Every second, current snapshot with hits/drops
> counters are collected and stored in an array. Drops are useful for
> producer/consumer benchmarks in which producer might overwhelm consumers.
> 
> Once test finishes after given amount of warm-up and testing seconds, mean and
> stddev are calculated (ignoring warm-up results) and is printed out to stdout.
> This setup seems to give consistent and accurate results.
> 
> To validate behavior, I added two atomic counting tests: global and local.
> For global one, all the producer threads are atomically incrementing same
> counter as fast as possible. This, of course, leads to huge drop of
> performance once there is more than one producer thread due to CPUs fighting
> for the same memory location.
> 
> Local counting, on the other hand, maintains one counter per each producer
> thread, incremented independently. Once per second, all counters are read and
> added together to form final "counting throughput" measurement. As expected,
> such setup demonstrates linear scalability with number of producers (as long
> as there are enough physical CPU cores, of course). See example output below.
> Also, this setup can nicely demonstrate disastrous effects of false sharing,
> if care is not taken to take those per-producer counters apart into
> independent cache lines.
> 
> Demo output shows global counter first with 1 producer, then with 4. Both
> total and per-producer performance significantly drop. The last run is local
> counter with 4 producers, demonstrating near-perfect scalability.
> 
> $ ./bench -a -w1 -d2 -p1 count-global
> Setting up benchmark 'count-global'...
> Benchmark 'count-global' started.
> Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
> Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
> Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
> Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
> Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s
> 
> $ ./bench -a -w1 -d2 -p4 count-global
> Setting up benchmark 'count-global'...
> Benchmark 'count-global' started.
> Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
> Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
> Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
> Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
> Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s
> 
> $ ./bench -a -w1 -d2 -p4 count-local
> Setting up benchmark 'count-local'...
> Benchmark 'count-local' started.
> Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
> Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
> Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
> Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
> Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s
> 
> Benchmark runner supports setting thread affinity for producer and consumer
> threads. You can use -a flag for default CPU selection scheme, where first
> consumer gets CPU #0, next one gets CPU #1, and so on. Then producer threads
> pick up next CPU and increment one-by-one as well. But user can also specify
> a set of CPUs independently for producers and consumers with --prod-affinity
> 1,2-10,15 and --cons-affinity <set-of-cpus>. The latter allows to force
> producers and consumers to share same set of CPUs, if necessary.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>

  reply	other threads:[~2020-05-13  1:09 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-12 19:24 [PATCH v3 bpf-next 0/4] Add benchmark runner and few benchmarks Andrii Nakryiko
2020-05-12 19:24 ` [PATCH v3 bpf-next 1/4] selftests/bpf: extract parse_num_list into generic testing_helpers.c Andrii Nakryiko
2020-05-13  1:08   ` Yonghong Song
2020-05-12 19:24 ` [PATCH v3 bpf-next 2/4] selftests/bpf: add benchmark runner infrastructure Andrii Nakryiko
2020-05-13  1:09   ` Yonghong Song [this message]
2020-05-12 19:24 ` [PATCH v3 bpf-next 3/4] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench Andrii Nakryiko
2020-05-13  1:10   ` Yonghong Song
2020-05-12 19:24 ` [PATCH v3 bpf-next 4/4] selftest/bpf: add BPF triggering benchmark Andrii Nakryiko
2020-05-13 19:23 ` [PATCH v3 bpf-next 0/4] Add benchmark runner and few benchmarks Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=65f234b5-0c97-2925-f440-d43a9a9afef2@fb.com \
    --to=yhs@fb.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andriin@fb.com \
    --cc=ast@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.