From: Riccardo Mancini <rickyman7@gmail.com>
To: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>,
Namhyung Kim <namhyung@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Mark Rutland <mark.rutland@arm.com>, Jiri Olsa <jolsa@redhat.com>,
linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
Riccardo Mancini <rickyman7@gmail.com>
Subject: [RFC PATCH v1 00/37] perf: use workqueue for evlist operations
Date: Sat, 21 Aug 2021 11:19:06 +0200 [thread overview]
Message-ID: <cover.1629490974.git.rickyman7@gmail.com> (raw)
Hi,
this patchset adds multithreading through the workqueue to the most
important evlist operations: enable, disable, close, and open.
Their multithreaded implementation is then used in perf-record through
a new option '--threads'.
It is dependent on the workqueue patchset (v3):
https://lore.kernel.org/lkml/cover.1629454773.git.rickyman7@gmail.com/
In each operation, each worker is assigned a cpu and pins itself to that
cpu to perform the operation on all evsels. If enough threads are
provided, the underlying threads stay pinned to their cpus; otherwise,
they just change their affinity temporarily.
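The pinning itself boils down to a sched-affinity change on the worker
thread. A minimal sketch, assuming Linux/glibc; pin_to_cpu() and unpin()
are hypothetical helper names for illustration, not the workqueue API:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <assert.h>

/* Hypothetical helper: pin the calling thread to one cpu, saving the
 * previous mask so the temporary affinity change can be undone. */
static int pin_to_cpu(int cpu, cpu_set_t *saved)
{
	cpu_set_t mask;

	if (pthread_getaffinity_np(pthread_self(), sizeof(*saved), saved))
		return -1;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}

/* Restore the affinity mask saved by pin_to_cpu(). */
static int unpin(const cpu_set_t *saved)
{
	return pthread_setaffinity_np(pthread_self(), sizeof(*saved), saved);
}
```

A worker would call pin_to_cpu() before touching its assigned cpu's fds
and unpin() afterwards in the temporary-affinity case.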
Parallelization of enable, disable, and close is pretty straightforward,
while open requires more work to separate the actual open from all
fallback mechanisms.
In the multithreaded implementation of open, each thread runs until it
finishes or hits an error. When all threads have finished, the main
process checks for errors. If it finds one, it applies a fallback and
resumes the open on each cpu from the point where it encountered the
error.
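The resulting control flow is roughly the following. Everything here is
a toy model: try_open(), apply_fallback(), and the precise_ip failure
mode are made-up stand-ins for the real evsel code, and the per-worker
parallelism is collapsed into a single loop:

```c
#include <stdbool.h>

#define NCPUS 4

static int precise_ip = 3;

/* Stub "open" that fails on cpu 2 until precise_ip is lowered to 0
 * (both the failure and the fix are invented for this sketch). */
static int try_open(int cpu)
{
	if (cpu == 2 && precise_ip > 0)
		return -1; /* pretend the kernel rejected the attr */
	return 0;
}

/* Apply one fallback; return false when nothing is left to try. */
static bool apply_fallback(void)
{
	if (precise_ip == 0)
		return false;
	precise_ip--;
	return true;
}

/* Run until finished or an error occurs; on error, apply a fallback
 * and resume from the cpu that failed, keeping earlier fds intact. */
static int open_all_cpus(void)
{
	int cpu = 0;

	while (cpu < NCPUS) {
		if (try_open(cpu) == 0) {
			cpu++;		/* this cpu is done, move on */
			continue;
		}
		if (!apply_fallback())
			return -1;	/* unrecoverable error */
		/* retry the same cpu with the fallback applied */
	}
	return 0;
}
```

The key point the sketch shows is that a fallback does not restart the
whole open: cpus already opened are kept, and only the failing point is
retried.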
I have tested the main fallback mechanisms (precise_ip, ignore missing
thread, fd limit increase), but not all of the missing-feature ones.
I also ran perf test with no errors. Below you can find the skipped
tests (I omitted successful results for brevity):
$ sudo ./perf test
23: Watchpoint :
23.1: Read Only Watchpoint : Skip (missing hardware support)
58: builtin clang support : Skip (not compiled in)
63: Test libpfm4 support : Skip (not compiled in)
89: perf stat --bpf-counters test : Skip
90: Check Arm CoreSight trace data recording and synthesized samples: Skip
I know the patchset is huge, but I didn't have time to split it (and
my time is running out). In any case, I tried to keep related patches
close together. It is organized as follows:
- 1 - 3: remove the cpu iterator inside evsel to simplify
parallelization. Along the way, the cpumap idx and max methods are
improved, relying on the assumption that the cpumap is sorted.
- 4 - 5: preparation patches for adding affinities to threadpool.
- 6 - 8: add affinity support to threadpool and workqueue (preparation
for adding workqueue to evsel).
- 9: preparation for adding workqueue to evsel.
- 10 - 13: add multithreading to evlist enable, disable, and close.
- 14 - 27: preparation for adding multithreading to evlist__open.
- 28: add multithreading to evlist__open.
- 29 - 34: use multithreaded evlist operations in perf-record.
- 35 - 37: improve the evlist-open-close benchmark, adding
multithreading and detailed output.
I'll be happy to split it if necessary in the future, with the goal of
merging my GSoC work, but, for now, I need to send it as is to include it
in my final report.
Below are some experimental results of evlist-open-close benchmark run on:
- laptop (2 cores + hyperthreading)
- vm (16 vCPUs)
The command line was:
$ ./perf bench internals evlist-open-close [-j] <options>
where [-j] was included only in the specified rows and <options> refers
to the column label:
- "" (dummy): open one dummy event on all cpus.
- "-n 100": open 100 dummy events on all cpus.
- "-e '{cs,cycles}'": open the "{cs,cycles}" group on all cpus.
- "-u 0": open one dummy event on all cpus for all processes of root user
(~300 on my laptop; ~360 on my VM).
Results (times in usec):
machine  configuration   (dummy)        -n 100         -e '{cs,cycles}'  -u 0
laptop   perf/core       980 +- 130     10514 +- 313   31950 +- 526      14529 +- 241
laptop   this w/o -j     698 +- 102     11302 +- 283   31807 +- 448      13885 +- 143
laptop   this w/ -j      2233 +- 261    5434 +- 386    13586 +- 443      9465 +- 568
vm       perf/core       9818 +- 94     89993 +- 941   N/A               266414 +-1431
vm       this w/o -j     5245 +- 88     82806 +- 922   N/A               260416 +-1563
vm       this w/ -j      37787 +- 748   54844 +-1089   N/A               101088 +-1900
Comments:
- opening one dummy event in single-threaded mode is faster than in
perf/core, probably due to the changes in how evsel cpus are iterated.
- opening one event in multithreaded mode is not worth it.
- in all other cases, multithreaded mode is between 1.5x and 2.5x faster.
Time breakdown on my laptop:
One dummy event per cpu:
$ ./perf bench internals evlist-open-close -d
# Running 'internals/evlist-open-close' benchmark:
Number of workers: 1
Number of cpus: 4
Number of threads: 1
Number of events: 1 (4 fds)
Number of iterations: 100
Average open-close took: 879.250 usec (+- 103.238 usec)
init took: 0.040 usec (+- 0.020 usec)
open took: 28.900 usec (+- 6.675 usec)
mmap took: 415.870 usec (+- 17.391 usec)
enable took: 240.950 usec (+- 92.641 usec)
disable took: 64.670 usec (+- 9.722 usec)
munmap took: 16.450 usec (+- 3.571 usec)
close took: 112.100 usec (+- 25.465 usec)
fini took: 0.220 usec (+- 0.042 usec)
$ ./perf bench internals evlist-open-close -j -d
# Running 'internals/evlist-open-close' benchmark:
Number of workers: 4
Number of cpus: 4
Number of threads: 1
Number of events: 1 (4 fds)
Number of iterations: 100
Average open-close took: 1979.670 usec (+- 271.772 usec)
init took: 30.860 usec (+- 1.190 usec)
open took: 552.040 usec (+- 96.166 usec)
mmap took: 428.970 usec (+- 9.492 usec)
enable took: 222.740 usec (+- 56.112 usec)
disable took: 191.990 usec (+- 55.029 usec)
munmap took: 13.670 usec (+- 0.754 usec)
close took: 155.660 usec (+- 44.079 usec)
fini took: 383.520 usec (+- 87.476 usec)
Comments:
- the overhead comes from open (spinning up threads) and fini
(terminating threads). There could be some improvements there (e.g.
not waiting for the threads to spawn).
- enable, disable, and close also take longer due to the overhead of
assigning the tasks to the workers.
One dummy event per process per cpu:
$ ./perf bench internals evlist-open-close -d -u0
# Running 'internals/evlist-open-close' benchmark:
Number of workers: 1
Number of cpus: 4
Number of threads: 295
Number of events: 1 (1180 fds)
Number of iterations: 100
Average open-close took: 15101.380 usec (+- 247.959 usec)
init took: 0.010 usec (+- 0.010 usec)
open took: 4224.460 usec (+- 119.028 usec)
mmap took: 3235.210 usec (+- 55.867 usec)
enable took: 2359.570 usec (+- 44.923 usec)
disable took: 2321.100 usec (+- 80.779 usec)
munmap took: 304.440 usec (+- 11.558 usec)
close took: 2655.920 usec (+- 59.089 usec)
fini took: 0.380 usec (+- 0.051 usec)
$ ./perf bench internals evlist-open-close -j -d -u0
# Running 'internals/evlist-open-close' benchmark:
Number of workers: 4
Number of cpus: 4
Number of threads: 298
Number of events: 1 (1192 fds)
Number of iterations: 100
Average open-close took: 10321.060 usec (+- 771.875 usec)
init took: 31.530 usec (+- 0.721 usec)
open took: 2849.870 usec (+- 533.019 usec)
mmap took: 3267.810 usec (+- 87.465 usec)
enable took: 1041.160 usec (+- 66.324 usec)
disable took: 1176.970 usec (+- 134.291 usec)
munmap took: 253.680 usec (+- 4.525 usec)
close took: 1204.550 usec (+- 101.284 usec)
fini took: 495.260 usec (+- 136.661 usec)
Comments:
- mmap/munmap are not parallelized and account for 20% of the time.
- open time is reduced by only 33%, due to the overhead of spinning up
threads, which is around half a millisecond.
- enable, disable, and close times are halved.
It is not always worth using multithreading in evlist operations, but
in the good cases the improvements can be significant.
For this reason, we could include a heuristic to decide whether to
use it (e.g. use it only if there are more than X event fds to open).
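Such a heuristic could be as simple as a threshold on the total fd
count. A sketch with a made-up cutoff that would need tuning on real
workloads (the name and value are not from the patchset):

```c
#include <stdbool.h>

/* Hypothetical cutoff: only spin up workers when there are enough
 * event fds to amortize the ~0.5ms thread start-up cost measured in
 * the benchmarks above. The value 256 is invented for illustration. */
#define MULTITHREAD_MIN_FDS 256

/* An evlist opens one fd per event, per cpu, per monitored thread. */
static bool should_multithread(int nr_events, int nr_cpus, int nr_threads)
{
	return nr_events * nr_cpus * nr_threads >= MULTITHREAD_MIN_FDS;
}
```

With the numbers above, the single-dummy-event case (1 event x 4 cpus x
1 thread) would stay single-threaded, while the -u 0 case (~1200 fds)
would use the workqueue.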
It'd be great to see more experiments like these on a bigger machine
and in a more realistic scenario (instead of dummy events).
Thanks,
Riccardo
Riccardo Mancini (37):
libperf cpumap: improve idx function
libperf cpumap: improve max function
perf evlist: replace evsel__cpu_iter* functions with evsel__find_cpu
perf util: add mmap_cpu_mask__duplicate function
perf util/mmap: add missing bitops.h header
perf workqueue: add affinities to threadpool
perf workqueue: add support for setting affinities to workers
perf workqueue: add method to execute work on specific CPU
perf python: add workqueue dependency
perf evlist: add multithreading helper
perf evlist: add multithreading to evlist__disable
perf evlist: add multithreading to evlist__enable
perf evlist: add multithreading to evlist__close
perf evsel: remove retry_sample_id goto label
perf evsel: separate open preparation from open itself
perf evsel: save open flags in evsel
perf evsel: separate missing feature disabling from evsel__open_cpu
perf evsel: add evsel__prepare_open function
perf evsel: separate missing feature detection from evsel__open_cpu
perf evsel: separate rlimit increase from evsel__open_cpu
perf evsel: move ignore_missing_thread to fallback code
perf evsel: move test_attr__open to success path in evsel__open_cpu
perf evsel: move bpf_counter__install_pe to success path in
evsel__open_cpu
perf evsel: handle precise_ip fallback in evsel__open_cpu
perf evsel: move event open in evsel__open_cpu to separate function
perf evsel: add evsel__open_per_cpu_no_fallback function
perf evlist: add evlist__for_each_entry_from macro
perf evlist: add multithreading to evlist__open
perf evlist: add custom fallback to evlist__open
perf record: use evlist__open_custom
tools lib/subcmd: add OPT_UINTEGER_OPTARG option type
perf record: add --threads option
perf record: pin threads to monitored cpus if enough threads available
perf record: apply multithreading in init and fini phases
perf test/evlist-open-close: add multithreading
perf test/evlist-open-close: use inline func to convert timeval to
usec
perf test/evlist-open-close: add detailed output mode
tools/lib/perf/cpumap.c | 27 +-
tools/lib/subcmd/parse-options.h | 1 +
tools/perf/Documentation/perf-record.txt | 9 +
tools/perf/bench/evlist-open-close.c | 116 +++++-
tools/perf/builtin-record.c | 128 ++++--
tools/perf/builtin-stat.c | 24 +-
tools/perf/tests/workqueue.c | 87 ++++-
tools/perf/util/evlist.c | 464 +++++++++++++++++-----
tools/perf/util/evlist.h | 44 ++-
tools/perf/util/evsel.c | 473 ++++++++++++++---------
tools/perf/util/evsel.h | 36 +-
tools/perf/util/mmap.c | 12 +
tools/perf/util/mmap.h | 4 +
tools/perf/util/python-ext-sources | 2 +
tools/perf/util/record.h | 3 +
tools/perf/util/workqueue/threadpool.c | 70 ++++
tools/perf/util/workqueue/threadpool.h | 7 +
tools/perf/util/workqueue/workqueue.c | 154 +++++++-
tools/perf/util/workqueue/workqueue.h | 18 +
19 files changed, 1334 insertions(+), 345 deletions(-)
--
2.31.1