From: Yichao Yu <yyc1992@gmail.com>
To: Marc Zyngier <maz@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org,
	khuong@os.amperecomputing.com, will@kernel.org,
	mark.rutland@arm.com, Frank.li@nxp.com,
	zhangshaokun@hisilicon.com, liuqi115@huawei.com,
	john.garry@huawei.com,
	Mathieu Poirier <mathieu.poirier@linaro.org>,
	Leo Yan <leo.yan@linaro.org>
Subject: Re: Kernel perf counter support (for apple M1 and others)
Date: Tue, 19 Apr 2022 09:34:21 -0400
Message-ID: <CAMvDr+R46CKTMv+9qEz4e-xOwz6K_3Q+15YovgHDz+eC0n_DrA@mail.gmail.com>
In-Reply-To: <87pmld9q4q.wl-maz@kernel.org>

On Tue, Apr 19, 2022 at 9:09 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Tue, 19 Apr 2022 13:06:37 +0100,
> Yichao Yu <yyc1992@gmail.com> wrote:
> >
> > > - I don't think there is any value in stashing any of these HW events
> > >   in the kernel. In most cases, the kernel definition only matches the
> > >   x86 definition, and doesn't accurately describe the vast majority of
> > >   the events implemented on an ARM CPU. The ARM architecture mentions
> > >   a handful of architectural events that actually match the kernel
> > >   definition, and for these CPUs the kernel carries the in-kernel
> > >   description.
> > >
> > > - For the M1, none of the above applies, because there is *NO*
> > >   architectural description for the events reported by the (non
> > >   architectural) PMU, and there is no guarantee that they actually
> > >   match the common understanding we have of these events.
> >
> > You mentioned documents from Apple on IRC and below. Why is that the
> > only acceptable source?
>
> Because that would be the only one giving an exact definition to what
> you are counting. Anything else is guess-work. Very good guess-work,
> I'm sure, but still very much a wet finger in the air.
>
> > The entire support for M1 is based on reverse engineering/testing of
> > the hardware so why would those not be acceptable sources here as
> > well?
>
> Because there is a difference between getting something to work (the
> PMU driver itself) and interpreting the results it gives. All we know
> is that it is counting something. You can sort of guess what, but you
> don't know for sure.
>
> > My understanding is that the current cycles and instructions counters
> > were figured out this way so I don't see why you want them to be
> > removed.
>
> Because you use them as an argument to pile more crap in the
> kernel. Gee, at this stage, it is the driver itself I am going to
> remove.

I'm sorry, but I'm not sure why you are so mad about this; that is
certainly not my intention.
I specifically said that I wasn't intending to submit anything to the
kernel at this point (which I assume is the "crap" you are talking
about), because I don't know what's acceptable and I want to
understand why.
As for the comparison to other M1 support code: I'm not talking about
the perf counter driver specifically, but about all of the other
related drivers. I'm sure a lot of that code depends on knowing
exactly what a given register does, so for this particular question
I'd like to know why there's a difference between the two cases. The
answer might be blindingly obvious to you, but it certainly isn't to
me. (And FWIW, suspecting that I was missing some background
knowledge is exactly why I asked on IRC first.)

> Feel free to write a document or something else. The only thing I care
> about is in the kernel tree.

That is fair, but my point is that this is literally the first time I
have heard of the hardware event types being essentially deprecated.
I'm certainly not qualified to write such a document myself (at least
not right now), and I won't be unless someone can explain what the
actual expectation is and why, or point me at an existing document
explaining all of this, so that I can contribute to the documentation
of other projects.

> > Currently, just by reading the document of the hardware event type, it
> > seems that it should work if the hardware supports such counters.
>
> Such document would be the JSON file I mentioned. But since you have
> stated that you don't intend to write anything that ends up in the
> kernel, I guess that's a moot point.

By "document" I meant that `perf_event_open(2)` doesn't say anything
about, say, the instructions hardware counter not counting all
instructions even when the read returns a non-zero value.
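
To make this concrete, here is a minimal sketch of what the man page
leads you to write (this uses only what perf_event_open(2) documents;
the point is that nothing in the interface warns you that, on an
asymmetric system, this single event may quietly count on only one
core type):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Count retired instructions for the calling thread, per the
 * generic interface described in perf_event_open(2). */
static int count_instructions_this_thread(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.exclude_kernel = 1;

	/* pid = 0, cpu = -1: this thread, on whatever CPU it runs */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}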

> > > That boat has sailed a long time ago, when the BL PMU support was
> > > introduced, and all counters are treated equally: they are *NOT*
> > > counted globally. Changing this would be an ABI break, and I seriously
> > > doubt we want to go there.
> >
> > Sorry I'm not familiar with the names here. What's the "BL PMU"
> > support? And what are the counters that are not counted globally?
>
> BL stands for Big-Little. Asymmetric support, if you want. None of the
> counters are counted globally, only per PMU type. And this is an ABI
> we cannot break.

Are you talking about the dynamic PMU types, or about the hardware
and raw types?

> > > It would also mean that the kernel would need to know which counters
> > > it can accumulate over the various CPU types (which is often more than
> > > 2, these days). All of that to save userspace adding things? I doubt
> > > this is worth it.
> > >
> > > > 4. There are other events that may not make as much sense to combine
> > > > (cycles for example). However, I feel like a combined cycle count
> > > > isn't going to be much tricker to use given that the cycle count on a
> > > > single core is still affected by frequency scaling and it can still be
> > > > used correctly by pinning the thread.
> > >
> > > I don't understand what frequency scaling has anything to do with this
> > > (a cycle is still a cycle at any frequency).
> >
> > Exactly, a cycle is still a cycle, so I don't see why it's that big a
> > problem to count it globally.
>
> Because you are going to walk the list of events generated during a
> time slice, work out which ones are to be merged and which ones
> aren't, and accumulate them into global, userspace visible counters? I
> dread to imagine the effect on scheduling latency. All that to avoid
> adding two values into userspace. Great.

OK, if doing that will always incur a big overhead then I can accept
that. What I imagined was that this would only need to be done when
the process is moved to a different CPU, and I also assumed there was
already some perf-related logic in the scheduler (I imagined that is
where the kernel decides to add/remove counters in other cases),
which is why I thought adding such logic shouldn't make a noticeable
difference when a process uses no counters. I could certainly be
wrong about that.

Also, see below.

> > What I meant exactly was that if a code runs for 100 cycles at 1 GHz,
> > it doesn't mean it'll also run (close to) 100 cycles at 3 GHz.
> > Similarly, if it runs for 100 cycles on the E core, it doesn't mean
> > it'll run for 100 cycles on the P core.
>
> And? What do you derive from this set of statements?

This is in reply to the original argument you gave, that counting
cycles across different core types doesn't make sense. What I'm
saying here is that I don't believe counting across core types makes
any more or less sense than counting cycles across different
processor frequencies.

> > We already allow the former case to count using the same counter
> > everywhere, I don't see why the latter can't be allowed. (ABI change
> > issue aside)
>
> *blink*. If you don't see a problem with changing the ABI, I'm at a
> loss.

Yes, I do see the issue with changing ABIs. However, you brought up
multiple arguments and I'd like to understand each of them
individually. It's certainly possible that some of what I was asking
about is impossible for some specific reason, but I'd like to
understand every one of the arguments to fully understand the issue.
(What I meant here is that I understand there may be an ABI issue,
although I don't fully get it yet, which is why I was asking above; I
simply wanted to discuss this part with the ABI question set aside. I
did not mean that we can ignore all ABI issues and just change
things; if my wording implied that, I'm sorry about it.)

> > I don't have hardware to test this but it also seems that on the new
> > intel chips, the E core and the P core are counted together. (this is
> > purely based on the lack of multiple counter support in rr to support
> > the new chip...)
>
> Colour me uninterested on both count. x86 can do whatever they want.

Again, this is just to show that counting globally across both E and
P cores doesn't make as little sense as you originally suggested.

> >
> > > > The main reasons I'm asking about 3 and 4 is that
> > > > 1. Right now, even to just count instructions without pinning the
> > > > thread, I need to create two counters.
> > >
> > > How bad is that? I mean, the counters are per-CPU anyway, so there
> > > *are* N counters (N being the number of CPUs). You only have to create
> > > a counter per PMU.
> > >
> > > > 2. Even if the number isn't exactly accurate, it can still be useful
> > > > as a general guideline. Right now, even if I just want to do a quick
> > > > check, I still need to manually specify a dozen of events in `perf
> > > > stat -e` rather than simply using `perf stat` (to make it worse, perf
> > > > doesn't even provide any useful warning about it). It is also much
> > > > harder to do things generically (which is at least partially because
> > > > of the lack of documentation....).
> > >
> > > I see this as a potential perf-tool improvement. Being able to say
> > > 'Count this event on all CPU PMUs'  would certainly be valuable to all
> > > asymmetric systems.
> >
> > Short answer is not that bad if and only if there's a standard and
> > documented way to do this, userspace or kernel.
>
> Feel free to improve the kernel documentation[1], which is admittedly
> pretty sparse on the subject.
>
> The kernel gives you the tools to match PMUs and CPUs (just rummage in
> sysfs). If userspace knows which counter is what, you're in business.
> Do document your findings, by any mean.

As I said above, I can't do that without understanding all the
details first. It also seems that I don't know the right way to get
such information without putting up crap, so I'd appreciate it if you
could let me know how I can find out more about this without annoying
more people.
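
For reference, the userspace side I understand you to be suggesting
would look something like the sketch below. The sysfs layout is from
the kernel documentation; the exact M1 PMU names
(apple_icestorm_pmu/apple_firestorm_pmu) and the event encodings are
assumptions on my part:

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open one counter on a single named PMU, looking up its dynamic
 * event type under /sys/bus/event_source/devices/<pmu>/type. */
static int open_on_pmu(const char *pmu, unsigned long long config)
{
	char path[128];
	unsigned int type;
	FILE *f;
	struct perf_event_attr attr;

	snprintf(path, sizeof(path),
		 "/sys/bus/event_source/devices/%s/type", pmu);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%u", &type) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;	/* dynamic PMU type from sysfs */
	attr.config = config;	/* PMU-specific event encoding */

	/* pid = 0, cpu = -1: follow this thread; only the CPUs
	 * backing this PMU will actually count. */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

A tool would then open the same logical event once per CPU PMU and
add up the reads in userspace, which I take to be the "adding two
values into userspace" from your earlier reply.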

>
>         M.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/arm64/perf.rst#n136
>
> --
> Without deviation from the norm, progress is not possible.

