Re: Kernel perf counter support (for apple M1 and others)

From: Yichao Yu <yyc1992@gmail.com>
To: Marc Zyngier <maz@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org,
	khuong@os.amperecomputing.com,  will@kernel.org,
	mark.rutland@arm.com, Frank.li@nxp.com,
	 zhangshaokun@hisilicon.com, liuqi115@huawei.com,
	john.garry@huawei.com,
	 Mathieu Poirier <mathieu.poirier@linaro.org>,
	Leo Yan <leo.yan@linaro.org>
Subject: Re: Kernel perf counter support (for apple M1 and others)
Date: Tue, 19 Apr 2022 08:06:37 -0400
Message-ID: <CAMvDr+R5TE4C++wNYAre-GJQMBXcjmmp=WcXx_KONAky8amE+Q@mail.gmail.com>
In-Reply-To: <87o80yior2.wl-maz@kernel.org>

> - I don't think there is any value in stashing any of these HW events
>   in the kernel. In most cases, the kernel definition only matches the
>   x86 definition, and doesn't accurately describe the vast majority of
>   the events implemented on an ARM CPU. The ARM architecture mentions
>   a handful of architectural events that actually match the kernel
>   definition, and for these CPUs the kernel carries the in-kernel
>   description.
>
> - For the M1, none of the above applies, because there is *NO*
>   architectural description for the events reported by the (non
>   architectural) PMU, and there is no guarantee that they actually
>   match the common understanding we have of these events.

You mentioned documents from Apple on IRC and below. Why is that the
only acceptable source?
The entire M1 support is based on reverse engineering and testing of
the hardware, so why would those not be acceptable sources here as
well?
My understanding is that the current cycles and instructions counters
were figured out this way, so I don't see why you want them removed.
There are also other counters that I believe match the canonical
definitions, and I don't see why those should be left out either.

> - The correct place for these non-architectural events is in a JSON
>   description that would be built into perf, which would give you
>   symbolic events. Bloating the kernel for something we're not sure
>   about seems counterproductive.

As I've mentioned before, perf isn't the only user of performance
counters. If there were a shared place, or even just good
documentation, for these event descriptions, that would be better.
Currently, going by the documentation of the hardware event type
alone, it seems that it should work if the hardware supports such
counters.
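
For illustration, this is roughly what a non-perf user can discover
today: only the event names the kernel exports under
`/sys/bus/event_source/devices/<pmu>/events`. The following is a minimal
Python sketch, not a definitive tool; the function name
`list_pmu_events` and its `root` parameter are made up for this example,
and it assumes each entry under `events` is a small text file holding
an encoding string.

```python
import os


def list_pmu_events(root="/sys/bus/event_source/devices"):
    """Map each PMU name to the {event_name: encoding} pairs it exports.

    `root` is a parameter only so the function can be pointed at a
    test directory; on a real system the default sysfs path is used.
    """
    result = {}
    if not os.path.isdir(root):
        return result
    for pmu in sorted(os.listdir(root)):
        events_dir = os.path.join(root, pmu, "events")
        if not os.path.isdir(events_dir):
            continue  # this PMU exports no named events
        events = {}
        for name in sorted(os.listdir(events_dir)):
            path = os.path.join(events_dir, name)
            if not os.path.isfile(path):
                continue
            try:
                with open(path) as f:
                    # Assumed content: an encoding string such as "event=0x..."
                    events[name] = f.read().strip()
            except OSError:
                continue  # unreadable attribute, skip it
        result[pmu] = events
    return result


if __name__ == "__main__":
    for pmu, events in list_pmu_events().items():
        for name, encoding in events.items():
            print(f"{pmu}/{name}: {encoding}")
```

Anything that is not listed there has to come from somewhere else (for
example tables shipped with a userspace tool), which is the gap being
described above.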

> > 2. Should the kernel provide names for hardware events? Here I'm
> > talking about things under
> > `/sys/bus/event_source/devices/<pmu>/events` which I assume is
> > provided by the kernel (that or my understanding of sysfs has been
> > fundamentally wrong/out-of-date...). Based on the fact that the
> > current pmu kernel driver for the M1 does provide this and this
> > comment https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/drivers/perf/apple_m1_cpu_pmu.c#L31
> > I assume it's desired. This would also agree with what I've observed
> > on other (including non-x86) systems. If this is the case, I assume
> > the kernel driver for the M1 PMU isn't fully "done" yet.
>
> See my reply above: there are no architectural descriptions for these
> events, and we don't know how closely they match the definition Linux
> has. If one day Apple shows up and tells us how close these events are
> from their Linux (and thus x86) definition, we can expand this. Until
> then, the interpretation belongs, IMHO, to userspace.
>
> I'd rather *remove* CYCLES and INSTRUCTIONS definitions from the
> kernel than add any other.

Replied above.

> > 3. For counting events on a system with asymmetric cores.
> >     I understand that if the system contains multiple processors of
> > different characteristics, it may not make sense to provide a counter
> > that counts events on both (or all) types of cores. However, there are
> > events (PERF_COUNT_HW_INSTRUCTIONS and
> > PERF_COUNT_HW_BRANCH_INSTRUCTIONS at the least) that shouldn't really
> > be affected by this (and in fact, any counters that counts events
> > visible directly to the software/userspace). I want to even say that
> > branch misses/cache reference/misses might be in this category as well
> > although certainly not as clear cut.
>
> That boat has sailed a long time ago, when the BL PMU support was
> introduced, and all counters are treated equally: they are *NOT*
> counted globally. Changing this would be an ABI break, and I seriously
> doubt we want to go there.

Sorry, I'm not familiar with the names here. What is the "BL PMU"
support? And which counters are not counted globally?

> It would also mean that the kernel would need to know which counters
> it can accumulate over the various CPU types (which is often more than
> 2, these days). All of that to save userspace adding things? I doubt
> this is worth it.
>
> > 4. There are other events that may not make as much sense to combine
> > (cycles for example). However, I feel like a combined cycle count
> > isn't going to be much trickier to use given that the cycle count on a
> > single core is still affected by frequency scaling and it can still be
> > used correctly by pinning the thread.
>
> I don't understand what frequency scaling has anything to do with this
> (a cycle is still a cycle at any frequency).

Exactly, a cycle is still a cycle, so I don't see why it's that big a
problem to count it globally.
What I meant was that if some code runs for 100 cycles at 1 GHz, it
won't necessarily run for (close to) 100 cycles at 3 GHz.
Similarly, if it runs for 100 cycles on the E core, it won't
necessarily run for 100 cycles on the P core.
We already allow the former case to be counted using the same counter
everywhere, so I don't see why the latter can't be allowed (ABI change
issue aside).
I don't have the hardware to test this, but it also seems that on the
new Intel chips the E core and the P core are counted together. (This
is purely based on the lack of multiple-counter support in rr for the
new chips...)

> > The main reasons I'm asking about 3 and 4 is that
> > 1. Right now, even to just count instructions without pinning the
> > thread, I need to create two counters.
>
> How bad is that? I mean, the counters are per-CPU anyway, so there
> *are* N counters (N being the number of CPUs). You only have to create
> a counter per PMU.
>
> > 2. Even if the number isn't exactly accurate, it can still be useful
> > as a general guideline. Right now, even if I just want to do a quick
> > check, I still need to manually specify a dozen of events in `perf
> > stat -e` rather than simply using `perf stat` (to make it worse, perf
> > doesn't even provide any useful warning about it). It is also much
> > harder to do things generically (which is at least partially because
> > of the lack of documentation....).
>
> I see this as a potential perf-tool improvement. Being able to say
> 'Count this event on all CPU PMUs'  would certainly be valuable to all
> asymmetric systems.

The short answer is: not that bad, if and only if there's a standard
and documented way to do this, whether in userspace or in the kernel.
(A userspace solution that automatically sums counter values together
would also need to handle grouping.)
However, as I mentioned above, based on the documentation I can find,
there isn't a standard interface for a userspace program to figure out
how to use these counters correctly, and the documentation for
perf_event_open doesn't mention these kinds of limitations either.
That is where my expectations of the kernel interface come from.
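
To make the "one counter per PMU, summed in userspace" approach
concrete, here is a minimal Python sketch of the bookkeeping involved.
It is only an illustration under stated assumptions: the helper names
(`find_cpu_pmus`, `sum_counts`) are made up, and treating "has a `cpus`
file next to `type`" as the mark of a CPU PMU is a heuristic that
should hold on asymmetric ARM systems but is not a documented
contract. Actually opening the counters (perf_event_open with
attr.type set to each PMU's type) is left out.

```python
import os


def find_cpu_pmus(root="/sys/bus/event_source/devices"):
    """Return {pmu_name: (type_id, cpus)} for PMUs exposing a `cpus` file.

    Heuristic (an assumption of this sketch): a PMU directory that has
    both `type` and `cpus` is taken to be a per-core-type CPU PMU.
    """
    pmus = {}
    if not os.path.isdir(root):
        return pmus
    for name in sorted(os.listdir(root)):
        base = os.path.join(root, name)
        try:
            with open(os.path.join(base, "type")) as f:
                type_id = int(f.read().strip())  # value for attr.type
            with open(os.path.join(base, "cpus")) as f:
                cpus = f.read().strip()  # e.g. a CPU list such as "0-3"
        except (OSError, ValueError):
            continue  # not a CPU PMU by this heuristic, or unreadable
        pmus[name] = (type_id, cpus)
    return pmus


def sum_counts(per_pmu_counts):
    """Add up the raw counts read from one counter per PMU."""
    return sum(per_pmu_counts.values())


if __name__ == "__main__":
    for name, (type_id, cpus) in find_cpu_pmus().items():
        print(f"{name}: type={type_id} cpus={cpus}")
```

The sum uses the raw values on purpose. As far as I understand it, a
per-task counter on one PMU simply does not run while the task is on a
core of another type, so scaling by enabled/running time before adding
would extrapolate rather than combine.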

> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
