Re: Kernel perf counter support (for apple M1 and others)

From: Marc Zyngier <maz@kernel.org>
To: Yichao Yu <yyc1992@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org,
	khuong@os.amperecomputing.com, will@kernel.org,
	mark.rutland@arm.com, Frank.li@nxp.com,
	zhangshaokun@hisilicon.com, liuqi115@huawei.com,
	john.garry@huawei.com, mathieu.poirier@linaro.org,
	leo.yan@linaro.org
Subject: Re: Kernel perf counter support (for apple M1 and others)
Date: Mon, 18 Apr 2022 13:01:53 +0100
Message-ID: <87o80yior2.wl-maz@kernel.org>
In-Reply-To: <CAMvDr+Q-4gzYozLV8f8R8DgTEDYrkvYAOC+ND6bm11L+4mQDWw@mail.gmail.com>

Hi,

Please make sure you use current email addresses (the MAINTAINERS file
should be accurate for any recent kernel version).

On Fri, 01 Apr 2022 02:39:39 +0100,
Yichao Yu <yyc1992@gmail.com> wrote:
> 
> Hi,
> 
> I am playing with the performance counters on the Apple M1 chip from
> Linux with the hope that it could help make userspace tools like
> perf and rr work on the M1. However, I was told that none of these
> info should go into the kernel (not even raw event names) and the
> userspace should only use the raw event numbers instead of
> PERF_TYPE_HARDWARE even for events that have a canonical counterpart.

Since I was the one who had a brief chat with you on IRC, let me
clarify what I said exactly:

- I don't think there is any value in stashing any of these HW events
  in the kernel. In most cases, the kernel definition only matches the
  x86 definition, and doesn't accurately describe the vast majority of
  the events implemented on an ARM CPU. The ARM architecture mentions
  a handful of architectural events that actually match the kernel
  definition, and for these CPUs the kernel carries the in-kernel
  description.

- For the M1, none of the above applies, because there is *NO*
  architectural description for the events reported by the (non
  architectural) PMU, and there is no guarantee that they actually
  match the common understanding we have of these events.

- The correct place for these non-architectural events is in a JSON
  description that would be built into perf, which would give you
  symbolic events. Bloating the kernel for something we're not sure
  about seems counterproductive.

> Although I'm not planning to submit any kernel patches anytime soon
> and I'm mostly interested in running the test right now, I do want to
> know what I should expect in the long term on the userspace side. I
> was told to ask about this on "the list" (and I'm hoping this is the
> right one after browsing through MAINTAINERS) instead. There are a few
> issues/questions, not all of which are related to M1/asymmetric
> systems. For context, see
> https://oftc.irclog.whitequark.org/asahi-dev/2022-03-30 (there also
> happens to be no other discussion on the channel that day)
>
> 1. Is it acceptable (to either kernel or perf source) to submit
> patches that are based on a14.plist from macOS. I have personally
> never looked at it but if it is acceptable then there's little point
> doing the experiment I was doing (apart from the fun doing so and as a
> practice to understand the system).

My take on this is "I am not a lawyer". The macOS file is Apple's
intellectual property, and I'm not prepared to use it, transform it,
or interpret it in any way. At best, giving people a way to use this
file on their own system without distributing it would be a step in
the right direction (it should be rather simple to turn this file into
the JSON format that perf uses).

Now, if someone with the right level of IP law expertise wants to take
responsibility for this, I'm not going to get in the way. I'm just not
going to be the one looking at it or taking the patch.

> 2. Should the kernel provide names for hardware events? Here I'm
> talking about things under
> `/sys/bus/event_source/devices/<pmu>/events` which I assume is
> provided by the kernel (that or my understanding of sysfs has been
> fundamentally wrong/out-of-date...). Based on the fact that the
> current pmu kernel driver for the M1 does provide this and this
> comment https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/drivers/perf/apple_m1_cpu_pmu.c#L31
> I assume it's desired. This would also agree with what I've observed
> on other (including non-x86) systems. If this is the case, I assume
> the kernel driver for the M1 PMU isn't fully "done" yet.

See my reply above: there are no architectural descriptions for these
events, and we don't know how closely they match the definition Linux
has. If one day Apple shows up and tells us how close these events are
to their Linux (and thus x86) definition, we can expand this. Until
then, the interpretation belongs, IMHO, to userspace.

I'd rather *remove* CYCLES and INSTRUCTIONS definitions from the
kernel than add any other.

> 3. For counting events on a system with asymmetric cores.
>     I understand that if the system contains multiple processors of
> different characteristics, it may not make sense to provide a counter
> that counts events on both (or all) types of cores. However, there are
> events (PERF_COUNT_HW_INSTRUCTIONS and
> PERF_COUNT_HW_BRANCH_INSTRUCTIONS at the least) that shouldn't really
> be affected by this (and in fact, any counter that counts events
> visible directly to the software/userspace). I want to even say that
> branch misses/cache reference/misses might be in this category as well
> although certainly not as clear cut.

That boat sailed a long time ago, when big.LITTLE PMU support was
introduced, and all counters are treated equally: they are *NOT*
counted globally. Changing this would be an ABI break, and I seriously
doubt we want to go there.

It would also mean that the kernel would need to know which counters
it can accumulate over the various CPU types (which is often more than
2, these days). All of that to save userspace adding things? I doubt
this is worth it.

> 4. There are other events that may not make as much sense to combine
> (cycles for example). However, I feel like a combined cycle count
> isn't going to be much trickier to use given that the cycle count on a
> single core is still affected by frequency scaling and it can still be
> used correctly by pinning the thread.

I don't understand what frequency scaling has to do with this
(a cycle is still a cycle at any frequency).

> 
> The main reasons I'm asking about 3 and 4 is that
> 1. Right now, even to just count instructions without pinning the
> thread, I need to create two counters.

How bad is that? I mean, the counters are per-CPU anyway, so there
*are* N counters (N being the number of CPUs). You only have to create
a counter per PMU.
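
As a rough illustration of "a counter per PMU" from the userspace side,
here is a minimal Python sketch. It lists the CPU PMUs found in sysfs
together with the dynamic type number each one exposes in its `type`
file, and sums per-PMU counts. The function names are hypothetical, and
treating "has a `cpus` file" as the marker of a per-CPU-type PMU is an
assumption that may not hold on every system.

```python
import os


def list_cpu_pmus(sysfs_root="/sys/bus/event_source/devices"):
    """Return {pmu_name: (type_id, cpus)} for PMUs exposing a 'cpus' file.

    Assumption: a per-CPU-type PMU has both a 'type' file (the dynamic
    PMU type number) and a 'cpus' file (the CPUs it covers).
    """
    pmus = {}
    if not os.path.isdir(sysfs_root):
        return pmus
    for name in sorted(os.listdir(sysfs_root)):
        type_path = os.path.join(sysfs_root, name, "type")
        cpus_path = os.path.join(sysfs_root, name, "cpus")
        if not (os.path.isfile(type_path) and os.path.isfile(cpus_path)):
            continue  # not a per-CPU-type PMU under this heuristic
        with open(type_path) as f:
            type_id = int(f.read().strip())
        with open(cpus_path) as f:
            cpus = f.read().strip()
        pmus[name] = (type_id, cpus)
    return pmus


def total_count(per_pmu_counts):
    """Add up the values read from one counter per PMU."""
    return sum(per_pmu_counts.values())
```

A tool would open one event per entry returned by list_cpu_pmus(),
using that entry's type number, and pass the values it reads back to
total_count() to get a whole-system figure.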

> 2. Even if the number isn't exactly accurate, it can still be useful
> as a general guideline. Right now, even if I just want to do a quick
> check, I still need to manually specify a dozen events in `perf
> stat -e` rather than simply using `perf stat` (to make it worse, perf
> doesn't even provide any useful warning about it). It is also much
> harder to do things generically (which is at least partially because
> of the lack of documentation....).

I see this as a potential perf-tool improvement. Being able to say
'Count this event on all CPU PMUs' would certainly be valuable to all
asymmetric systems.
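
Until something like that exists, the expansion can be done by a small
wrapper. The Python sketch below builds a `perf stat -e` argument that
names the same event once per PMU, using perf's `pmu/event/` syntax.
The helper names are hypothetical, the PMU names in the usage note are
assumed examples, and the event name must be one that perf can resolve
on each PMU.

```python
def events_for_all_pmus(pmu_names, event):
    """Build a perf '-e' value naming one event on each listed PMU."""
    return ",".join(f"{pmu}/{event}/" for pmu in pmu_names)


def perf_stat_argv(pmu_names, events, command):
    """Build an argv list for 'perf stat' covering every PMU/event pair."""
    specs = [events_for_all_pmus(pmu_names, ev) for ev in events]
    return ["perf", "stat", "-e", ",".join(specs), "--"] + list(command)
```

For example, events_for_all_pmus(["apple_icestorm_pmu",
"apple_firestorm_pmu"], "cycles") yields a string with one `cycles`
entry per PMU, and perf then reports one line per PMU that the user can
add up.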

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
