From: Mark Rutland <mark.rutland@arm.com>
To: David Carrillo-Cisneros <davidcc@google.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>, Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andi Kleen <ak@linux.intel.com>, Kan Liang <kan.liang@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@suse.de>,
	Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Vikas Shivappa <vikas.shivappa@linux.intel.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Vince Weaver <vince@deater.net>, Paul Turner <pjt@google.com>,
	Stephane Eranian <eranian@google.com>
Subject: Re: [RFC 1/6] perf/core: create active and inactive event groups
Date: Fri, 13 Jan 2017 11:01:19 +0000	[thread overview]
Message-ID: <20170113110118.GB26120@leverpostej> (raw)
In-Reply-To: <CALcN6mhPmpSqKhE3Ua+j-xROLzeAyrgdCk4AGGtfF9kExXRTJg@mail.gmail.com>

On Fri, Jan 13, 2017 at 07:20:48AM +0000, David Carrillo-Cisneros wrote:
> On Thu, Jan 12, 2017 at 3:05 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Tue, Jan 10, 2017 at 12:45:31PM -0800, David Carrillo-Cisneros wrote:
> >> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> >> index 4741ecdb9817..3fa18f05c9b0 100644
> >> >> --- a/include/linux/perf_event.h
> >> >> +++ b/include/linux/perf_event.h
> >> >> @@ -573,6 +573,7 @@ struct perf_event {
> >> >>
> >> >>       struct hlist_node               hlist_entry;
> >> >>       struct list_head                active_entry;
> >> >> +     struct list_head                ctx_active_entry;
> >> >
> >> > I think we should be able to kill off active_entry as part of this
> >> > series; it's there to do the same thing (optimize iteration over active
> >> > events).
> >> >
> >> > If we expose a for_each_ctx_active_event() helper which iterates over
> >> > the pinned and flexible lists, I think we may be able to migrate
> >> > existing users over and kill off perf_event::active_entry, and the
> >> > redundant list manipulation in drivers.
> >>
> >> The problem with that would be iterating over all CPU contexts, when
> >> most users of active_entry only install events on one CPU per
> >> package/socket.
> >>
> >> Maybe we can create yet another list of cpu contexts to track the
> >> contexts with at least one active event.
> >
> > Ah. I thought these were already per-context rather than per-{pmu,box}.
> > My bad. I guess having both isn't really a problem.
> >
> > In all cases they're used by a hrtimer that iterates over events to
> > update them (and they're only used by uncore PMUs).
> >
> > We could instead register a hrtimer affine to a cpu when the first event
> > was created, which would iterate over that CPU's events alone. When you
> > only have events on one CPU, that behaves the same. When you have events
> > on multiple CPUs, you avoid repeated cross-calls to update the events.
> 
> That sounds good, although I wonder if the overhead is currently a problem.

Sure, that's not clear to me either.

> AFAIK, only system-wide uncore events use this. Since they are
> system-wide, there is not really a reason to have more than one per
> socket/package per event type.

Even with that, there's the risk that one CPU is blocked for a period
waiting for those few others to respond to IPIs. But as you say, it's
not obviously a problem in practice today.
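
For reference, the sort of thing I had in mind is roughly the below
(untested sketch; the 'box' structure and field names are made up). The
handler walks only the local CPU's active events, so no cross-calls are
needed; the list walked could equally be the per-context active list
this series proposes:

/* Hypothetical per-box state; names invented for illustration. */
struct my_box {
	struct hrtimer		hrtimer;
	ktime_t			hrtimer_duration;
	struct list_head	active_list;	/* this CPU's active events */
};

static enum hrtimer_restart my_box_hrtimer_handle(struct hrtimer *hrtimer)
{
	struct my_box *box = container_of(hrtimer, struct my_box, hrtimer);
	struct perf_event *event;

	/* Walk only this CPU's active events: no IPIs to other CPUs. */
	list_for_each_entry(event, &box->active_list, active_entry)
		event->pmu->read(event);

	hrtimer_forward_now(hrtimer, box->hrtimer_duration);
	return HRTIMER_RESTART;
}

/* Called on the target CPU when its first event is added. */
static void my_box_start_hrtimer(struct my_box *box)
{
	hrtimer_init(&box->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
	box->hrtimer.function = my_box_hrtimer_handle;
	hrtimer_start(&box->hrtimer, box->hrtimer_duration,
		      HRTIMER_MODE_REL_PINNED);
}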

> The one advantage of converting them to per cpu hrtimers is that we could
> remove event->active_entry and just use the ctx->active_groups list in this
> series.

Given it's of minor/unknown benefit, and would complicate this series,
feel free to forget about it. I'll add it to the list of things to clean
up at some point in future. ;)

[...]

> > In some cases, several PMUs share the same task context, so events in a
> > context might be for a HW PMU other than ctx->pmu. Thus, the
> > perf_pmu_disable(ctx->pmu) in perf_event_context_sched_in() isn't
> > necessarily sufficient to disable the PMU that group_sched_in() is
> > operating on. AFAICT, the PMU is only necessarily disabled in the block
> > from start_txn() to {stop,cancel}_txn().
> 
> I see the problem now. Thank you for explaining the details. It seems
> like this only matters in contexts with more than one pmu, and with pmus
> that can fail commit_txn.

For the *sched_in path, yes. 
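
To spell out the window I mean, group_sched_in() is roughly the below
(heavily simplified from kernel/events/core.c of the time; error
unwinding elided). Only the txn block is necessarily run with the
group's PMU disabled, and note that the pmu used is ctx->pmu, which may
not be the group's own PMU when contexts are shared:

static int group_sched_in(struct perf_event *group_event,
			  struct perf_cpu_context *cpuctx,
			  struct perf_event_context *ctx)
{
	struct perf_event *event;
	struct pmu *pmu = ctx->pmu;

	pmu->start_txn(pmu, PERF_PMU_TXN_ADD);	/* PMU disabled from here... */

	if (event_sched_in(group_event, cpuctx, ctx)) {
		pmu->cancel_txn(pmu);		/* ...to here on failure... */
		return -EAGAIN;
	}

	/* Schedule in siblings as one group (if any). */
	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
		if (event_sched_in(event, cpuctx, ctx))
			goto group_error;
	}

	if (!pmu->commit_txn(pmu))		/* ...or here on success. */
		return 0;

group_error:
	/* unwind the partially scheduled group (elided) */
	pmu->cancel_txn(pmu);
	return -EAGAIN;
}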

> This makes me like even more the idea you describe below of removing
> context sharing for non-sw pmus (and renaming sw contexts to
> "cannot-fail pmu->add" contexts).

If we could handle dynamic contexts, I'd try to get rid of the context
type entirely, and instead have a flag on the PMU to mark it as a pure
SW PMU.
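
Something like the below is what I have in mind (the flag name and value
are hypothetical, just to illustrate):

/* Hypothetical capability: this PMU's ->add() can never fail. */
#define PERF_PMU_CAP_NO_FAIL_ADD	0x100	/* value made up */

static inline bool pmu_is_pure_sw(struct pmu *pmu)
{
	/* would replace checks on pmu->task_ctx_nr == perf_sw_context */
	return pmu->capabilities & PERF_PMU_CAP_NO_FAIL_ADD;
}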

> > If any of those PMUs sharing a context can raise an NMI, then I believe
> > the problem I raised above stands. Currently the ARM PMU doesn't, but
> > there's ongoing work to implement NMIs for GICv3. I'm not sure about the
> > other PMUs.
> 
> Agree with this being problematic. Also, the Intel CMT PMU uses the sw
> context even though it can fail to add events (although it does not
> raise NMIs).

I hadn't realised that was the case. That'll violate one of the main
invariants regarding SW events, so we've probably got additional subtle
breakage there (e.g. SW-eligible events not being scheduled).

> > This context sharing is problematic for a number of reasons outside of
> > scheduling. I had hoped to address this by reworking the way we handle
> > task contexts to avoid sharing. It turns out that there are many subtle
> > locking and ordering issues to consider for that, so I hadn't made much
> > headway with that approach.
> 
> Do systems with heterogeneous CPUs have a large number of distinct
> CPU PMUs?

In general, I would expect a small number. 

One or two CPU PMUs is the likely case today, but we could see more in
future, and tracing PMUs will add to that (those can differ across
microarchitectures, too). I can imagine seeing at least four in the near
term.

> Could this be addressed short-term by adding extra types of task contexts
> (e.g. hw_arch_context)? So architectures could choose this as an extra
> context to accommodate their special behaviors?

Maybe. Bumping perf_nr_task_contexts, and dynamically allocating the
pmu::task_ctx_nr for HW PMUs would give some relief.
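
i.e. something along these lines in include/linux/perf_event.h (the new
entry is hypothetical, just to show where it would slot in):

enum perf_event_task_context {
	perf_invalid_context = -1,
	perf_hw_context = 0,
	perf_sw_context,
	perf_hw_arch_context,	/* hypothetical extra per-task context */
	perf_nr_task_contexts,	/* bumped implicitly by the new entry */
};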

Thanks,
Mark.

Thread overview: 33+ messages
2017-01-10 10:24 [RFC 0/6] optimize ctx switch with rb-tree David Carrillo-Cisneros
2017-01-10 10:24 ` [RFC 1/6] perf/core: create active and inactive event groups David Carrillo-Cisneros
2017-01-10 13:49   ` Mark Rutland
2017-01-10 20:45     ` David Carrillo-Cisneros
2017-01-12 11:05       ` Mark Rutland
     [not found]         ` <CALcN6mhPmpSqKhE3Ua+j-xROLzeAyrgdCk4AGGtfF9kExXRTJg@mail.gmail.com>
2017-01-13 11:01           ` Mark Rutland [this message]
2017-01-10 10:24 ` [RFC 2/6] perf/core: add a rb-tree index to inactive_groups David Carrillo-Cisneros
2017-01-10 14:14   ` Mark Rutland
2017-01-10 20:20     ` David Carrillo-Cisneros
2017-01-12 11:47       ` Mark Rutland
2017-01-13  7:34         ` David Carrillo-Cisneros
2017-01-16  2:03   ` [lkp-developer] [perf/core] 33da94bd89: BUG:unable_to_handle_kernel kernel test robot
2017-01-10 10:24 ` [RFC 3/6] perf/core: use rb-tree to sched in event groups David Carrillo-Cisneros
2017-01-10 16:38   ` Mark Rutland
2017-01-10 20:51     ` David Carrillo-Cisneros
2017-01-12 12:14       ` Mark Rutland
2017-01-13  8:01         ` David Carrillo-Cisneros
2017-01-13 10:24           ` Mark Rutland
2017-01-11 20:31     ` Liang, Kan
2017-01-12 10:11       ` Mark Rutland
2017-01-12 13:28         ` Liang, Kan
2017-01-13  8:05           ` David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 4/6] perf/core: avoid rb-tree traversal when no inactive events David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 5/6] perf/core: rotation no longer necessary. Behavior has changed. Beware David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 6/6] perf/core: use rb-tree index to optimize filtered perf_iterate_ctx David Carrillo-Cisneros
2017-01-16  2:05   ` [lkp-developer] [perf/core] 49c04ee1a7: WARNING:at_kernel/events/core.c:#perf_iterate_ctx_matching kernel test robot
2017-04-25 17:27 ` [RFC 0/6] optimize ctx switch with rb-tree Liang, Kan
2017-04-25 17:49   ` David Carrillo-Cisneros
2017-04-25 18:11     ` Budankov, Alexey
2017-04-25 18:54       ` David Carrillo-Cisneros
2017-04-26 10:34         ` Budankov, Alexey
2017-04-26 19:40           ` David Carrillo-Cisneros
2017-04-26 10:52         ` Mark Rutland
