* perf/x86/intel: Collecting CPU-local performance counters from all cores in parallel
@ 2017-05-23  5:42 Michael Edwards
From: Michael Edwards @ 2017-05-23  5:42 UTC (permalink / raw)
  To: linux-kernel, peterz, linux-perf-users

I'm working on a system-wide profiling tool that uses perf_event to
gather CPU-local performance counters (L2/L3 cache misses, etc.)
across all CPUs (hyperthreads) of a multi-socket system.  We'd like
the monitoring process to run on a single core, and to be able to
sample at frequent, regular intervals (sub-millisecond), with minimal
impact on the tasks running on other CPUs.  I've prototyped this using
perf_events (with one event group per CPU), and on a two-socket,
32-(logical)-CPU system the prototype reaches about 2,700 samples per
second per CPU, at which point it's spending about 30% of its time
inside the read() syscall.  Optimizing the other 70% (the prototype
userland) looks fairly routine, so I'm looking at what it would take
to get beyond 10K samples per second.
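
For concreteness, the prototype's per-CPU collection path looks roughly
like the sketch below (event choice, error handling, and the loop over
CPUs are elided; the attr settings shown are just one plausible
configuration):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

/* One group leader per CPU, counting mode; values pulled with read().
 * The event choice and buffer sizing here are illustrative only. */
static int open_group_leader(int cpu)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* placeholder event */
        attr.read_format = PERF_FORMAT_GROUP;
        attr.disabled = 1;

        /* pid = -1, cpu = cpu: count everything running on that CPU. */
        return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

/* One read() per CPU per sample -- the syscall/IPI cost discussed above. */
static ssize_t sample_group(int group_fd, uint64_t *buf, size_t sz)
{
        return read(group_fd, buf, sz);
}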

I'm aware of the mmap()/RDPMC path to sampling counters from userland,
but I'd prefer not to go down that road; it involves mmap()ing all the
individual perf_event fds and reading them from userland tasks on the
relevant core, which is needlessly intrusive on the actual workload.
The measured overheads of the IPI-dispatched __perf_event_read() are
acceptable, if we could just dispatch it in parallel to all CPUs from
a single read() syscall.
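
For reference, the self-monitoring read I'd be avoiding looks roughly
like this (a sketch of the documented mmap() self-read protocol from
perf_event.h; note it has to execute on the CPU that owns the counter,
which is exactly the intrusion on the workload I'd like to avoid):

#include <linux/perf_event.h>
#include <stdint.h>
#include <x86intrin.h>                  /* __rdpmc() */

/* pc points at the first page of an mmap()ed perf_event fd. */
static uint64_t rdpmc_read(volatile struct perf_event_mmap_page *pc)
{
        uint64_t count;
        uint32_t seq, idx;

        do {
                seq = pc->lock;
                __sync_synchronize();
                idx = pc->index;        /* 0: not currently on a HW counter */
                count = pc->offset;
                if (pc->cap_user_rdpmc && idx)
                        count += __rdpmc(idx - 1);
                __sync_synchronize();
        } while (pc->lock != seq);

        return count;
}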

I've dug through the perf_event code and think I have a fair idea of
what it would take to implement a sort of "event meta-group" file.
Its read() handler would be equivalent to concatenating the read()
output of its member fds (per-CPU event group leaders), except that it
would only take the syscall / VFS indirection / locking / copy_to_user
overhead once, and would dispatch one IPI (with a per-cpu array of
cache-line-aligned struct perf_read_data arguments) via
on_each_cpu_mask() (thus effectively waiting in parallel on all the
responses).  Implementing that is a bit tedious but it's just plumbing
-- except for the small matter of taking all the perf_event_ctx::mutex
locks in the right order.  There is a logical sequence (by mutex
address; see mutex_lock_double()), but acquiring several dozen mutexes
in every read() call may be problematic.
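
Roughly, the read path I have in mind is the following (pseudo-kernel
code; struct meta_group and everything named meta_* below are
hypothetical, and the locking discussed next is elided):

#include <linux/smp.h>
#include <linux/mutex.h>
#include <linux/perf_event.h>

/* Everything named meta_* here is made up for illustration. */
struct meta_read_arg {
        struct perf_event *leader;      /* per-CPU group leader */
        u64 values[16];                 /* filled in by the IPI handler */
} ____cacheline_aligned;

struct meta_group {
        struct mutex lock;              /* see the locking discussion below */
        struct cpumask cpus;
        struct meta_read_arg *args;     /* indexed by CPU number */
        size_t read_size;
};

static void meta_read_group_counters(struct perf_event *leader, u64 *vals);

static void meta_group_read_local(void *info)
{
        struct meta_read_arg *arg;

        arg = &((struct meta_read_arg *)info)[smp_processor_id()];
        /* Same work __perf_event_read() does today, but dispatched once
         * for all CPUs instead of once per member fd. */
        meta_read_group_counters(arg->leader, arg->values);
}

static ssize_t meta_group_read(struct meta_group *mg, char __user *buf)
{
        /* ...take the ordering lock(s) discussed below... */
        on_each_cpu_mask(&mg->cpus, meta_group_read_local, mg->args, true);
        /* ...concatenate mg->args[cpu].values, one copy_to_user()... */
        return mg->read_size;
}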

One could add a per-meta-group mutex, and add code to
perf_event_ctx_lock() (and other callers / variants of
perf_event_ctx_lock_nested()) that checks for meta-group membership
and takes the per-meta-group mutex before taking the ctx mutex.  Then
the meta-group read() path only has to take this one mutex.  That
means an event group can only be attached to one meta-group, but
that's probably okay.  Still, it's fiddly code, what with the lock
nesting -- though I think it helps that we're dealing exclusively with
the group leaders for hardware events, so the move_group code path in
perf_event_open() isn't relevant.
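
Concretely, the check could sit in (or wrap) perf_event_ctx_lock(),
something like this sketch, where the meta_group back-pointer in
struct perf_event is hypothetical:

/* Sketch only: take the (hypothetical) per-meta-group mutex before the
 * usual ctx->mutex, so meta_group_read() needs only the outer lock. */
static struct perf_event_context *meta_aware_ctx_lock(struct perf_event *event)
{
        struct meta_group *mg = READ_ONCE(event->meta_group);  /* hypothetical field */

        if (mg)
                mutex_lock(&mg->lock);
        return perf_event_ctx_lock(event);      /* existing ctx->mutex path */
}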

Am I going about this wrong?  Is there some better way to pursue the
high-level goal of gathering PMC-based statistics frequently and
efficiently from all cores, without breaking everything else that uses
perf_events?

* Re: perf/x86/intel: Collecting CPU-local performance counters from all cores in parallel
From: Andi Kleen @ 2017-05-23 20:53 UTC (permalink / raw)
  To: Michael Edwards; +Cc: linux-kernel, peterz, linux-perf-users

Michael Edwards <michael@tensyr.com> writes:
>
> Am I going about this wrong?

It seems like a reasonable optimization, but it's likely a lot of work.

> Is there some better way to pursue the
> high-level goal of gathering PMC-based statistics frequently and
> efficiently from all cores, without breaking everything else that uses
> perf_events?

If you can drive the collection from a performance counter
(e.g. reference cycles) you could use leader sampling, and let the
PMIs log the values to the mmap'ed ring buffer. This should
be vastly more efficient than pulling everything.  This works today;
however, there are still some scaling problems with many groups.

perf record -F <frequency> \
	-e '{cpu/ref-cycles/,<three other events to collect>}:S' \
	... more groups like this ... -a ...
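
At the perf_event_open() level each group would look roughly like this
(sketch; pick your own frequency and member events):

#include <linux/perf_event.h>

struct perf_event_attr leader = {
        .size        = sizeof(struct perf_event_attr),
        .type        = PERF_TYPE_HARDWARE,
        .config      = PERF_COUNT_HW_REF_CPU_CYCLES,
        .freq        = 1,
        .sample_freq = 10000,                   /* drives the PMI */
        .sample_type = PERF_SAMPLE_READ,        /* log the whole group each PMI */
        .read_format = PERF_FORMAT_GROUP,
};
/* Members: sample_period = 0, opened with group_fd = the leader's fd.
 * Their counts land in the leader's samples in the mmap'ed ring
 * buffer, so the monitoring side never has to call read(). */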

-Andi
