perf/x86/intel: Collecting CPU-local performance counters from all cores in parallel

* perf/x86/intel: Collecting CPU-local performance counters from all cores in parallel
@ 2017-05-23  5:42 Michael Edwards
  2017-05-23 20:53 ` Andi Kleen
  0 siblings, 1 reply; 2+ messages in thread
From: Michael Edwards @ 2017-05-23  5:42 UTC (permalink / raw)
  To: linux-kernel, peterz, linux-perf-users

I'm working on a system-wide profiling tool that uses perf_event to
gather CPU-local performance counters (L2/L3 cache misses, etc.)
across all CPUs (hyperthreads) of a multi-socket system.  We'd like
for the monitoring process to run on a single core, and to be able to
sample at frequent, regular intervals (sub-millisecond), with minimal
impact on the tasks running on other CPUs.  I've prototyped this using
perf_events (with one event group per CPU), and on a two-socket,
32-(logical)-CPU system the prototype reaches about 2,700 samples per
second per CPU, at which point it's spending about 30% of its time
inside the read() syscall.  Optimizing the other 70% (the prototype
userland) looks fairly routine, so I'm looking at what it would take
to get beyond 10K samples per second.

I'm aware of the mmap()/RDPMC path to sampling counters from userland,
but I'd prefer not to go down that road; it involves mmap()ing all the
individual perf_event fds and reading them from userland tasks on the
relevant core, which is needlessly intrusive on the actual workload.
The measured overheads of the IPI-dispatched __perf_event_read() are
acceptable, if we could just dispatch it in parallel to all CPUs from
a single read() syscall.

I've dug through the perf_event code and think I have a fair idea of
what it would take to implement a sort of "event meta-group" file.
Its read() handler would be equivalent to concatenating the read()
output of its member fds (per-CPU event group leaders), except that it
would only take the syscall / VFS indirection / locking / copy_to_user
overhead once, and would dispatch one IPI (with a per-cpu array of
cache-line-aligned struct perf_read_data arguments) via
on_each_cpu_mask() (thus effectively waiting in parallel on all the
responses).  Implementing that is a bit tedious but it's just plumbing
-- except for the small matter of taking all the perf_event_ctx::mutex
locks in the right order.  There is a logical sequence (by mutex
address; see mutex_lock_double()), but acquiring several dozen mutexes
in every read() call may be problematic.

One could add a per-meta-group mutex, and add code to
perf_event_ctx_lock() (and other callers / variants of
perf_event_ctx_lock_nested()) that checks for meta-group membership
and takes the per-meta-group mutex before taking the ctx mutex.  Then
the meta-group read() path only has to take this one mutex.  That
means an event group can only be attached to one meta-group, but
that's probably okay.  Still, it's fiddly code, what with the lock
nesting - though I think it helps that we're dealing exclusively with
the group leaders for hardware events, so the move_group code path in
perf_event_open() isn't relevant.

Am I going about this wrong?  Is there some better way to pursue the
high-level goal of gathering PMC-based statistics frequently and
efficiently from all cores, without breaking everything else that uses
perf_events?

^ permalink raw reply	[flat|nested] 2+ messages in thread