* Kernel perf counter support (for apple M1 and others)
@ 2022-04-01  1:39 Yichao Yu
  2022-04-13 12:58 ` Yichao Yu
  2022-04-18 12:01 ` Marc Zyngier
  0 siblings, 2 replies; 7+ messages in thread
From: Yichao Yu @ 2022-04-01  1:39 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: khuong, will, mark.rutland, Frank.li, zhangshaokun, liuqi115,
	john.garry, mathieu.poirier, leo.yan, marc.zyngier

Hi,

I am playing with the performance counters on the Apple M1 chip from
Linux, with the hope that it could help make userspace tools like
perf and rr work on the M1. However, I was told that none of this
info should go into the kernel (not even raw event names) and that
userspace should only use the raw event numbers instead of
PERF_TYPE_HARDWARE, even for events that have a canonical counterpart.
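
To make this concrete, here is roughly the kind of call I'm talking
about: a minimal perf_event_open(2) sketch. The commented-out 0x8c raw
event number is a made-up placeholder, not a real M1 event.

/*
 * Minimal sketch of what I mean, using perf_event_open(2) directly.
 * The commented-out raw event number (0x8c) is a made-up placeholder,
 * not a real M1 event.
 */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    uint64_t count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);

    /* The canonical, symbolic way: */
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    /* What I was told to use instead (raw, PMU-specific number):
     *   attr.type = PERF_TYPE_RAW;
     *   attr.config = 0x8c;        placeholder, not a real M1 event
     */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    fd = perf_event_open(&attr, 0 /* this thread */, -1 /* any CPU */, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the code being measured ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}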

Although I'm not planning to submit any kernel patches anytime soon
and I'm mostly interested in running the test right now, I do want to
know what I should expect in the long term on the userspace side. I
was told to ask about this on "the list" (and I'm hoping this is the
right one after browsing through MAINTAINERS) instead. There are a few
issues/questions, not all of which are related to M1/asymmetric
systems. For context, see
https://oftc.irclog.whitequark.org/asahi-dev/2022-03-30 (there also
happens to be no other discussion on the channel that day)

1. Is it acceptable (to either the kernel or the perf source) to submit
patches that are based on a14.plist from macOS? I have personally
never looked at it, but if it is acceptable then there's little point
in doing the experiment I was doing (apart from the fun of doing so and
as practice for understanding the system).

2. Should the kernel provide names for hardware events? Here I'm
talking about things under
`/sys/bus/event_source/devices/<pmu>/events`, which I assume are
provided by the kernel (that, or my understanding of sysfs is
fundamentally wrong/out-of-date...). Given that the
current PMU kernel driver for the M1 does provide this, and given this
comment https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/drivers/perf/apple_m1_cpu_pmu.c#L31
I assume it is desired. This would also agree with what I've observed
on other (including non-x86) systems. If this is the case, I assume
the kernel driver for the M1 PMU isn't fully "done" yet. (A rough
sketch of how I expect userspace to consume this interface is at the
end of this message.)

3. Counting events on a system with asymmetric cores.
    I understand that if the system contains multiple processors with
different characteristics, it may not make sense to provide a counter
that counts events on both (or all) types of cores. However, there are
events (PERF_COUNT_HW_INSTRUCTIONS and
PERF_COUNT_HW_BRANCH_INSTRUCTIONS at the least) that shouldn't really
be affected by this (and in fact, the same goes for any counter that
counts events directly visible to software/userspace). I would even say
that branch misses and cache references/misses might be in this
category as well, although that is certainly not as clear cut.

4. There are other events that may not make as much sense to combine
(cycles, for example). However, I feel like a combined cycle count
isn't going to be much trickier to use, given that the cycle count on a
single core is already affected by frequency scaling, and it can still
be used correctly by pinning the thread.

The main reasons I'm asking about 3 and 4 are:
1. Right now, even to just count instructions without pinning the
thread, I need to create two counters.
2. Even if the number isn't exactly accurate, it can still be useful
as a general guideline. Right now, even if I just want to do a quick
check, I still need to manually specify a dozen events with `perf
stat -e` rather than simply using `perf stat` (and to make it worse,
perf doesn't even provide any useful warning about it). It is also much
harder to do things generically (which is at least partially because
of the lack of documentation...).
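
As mentioned under point 2, here is a rough sketch of how I currently
expect userspace to discover the PMUs and any kernel-provided event
names. Error handling is mostly omitted and nothing here is
M1-specific.

/*
 * Rough sketch: enumerate the event_source PMUs, print the dynamic type
 * number that goes into perf_event_attr.type, and list any named events
 * the kernel exports for each PMU. Error handling is mostly omitted.
 */
#include <dirent.h>
#include <stdio.h>

static void dump_events(const char *pmu)
{
    char path[512];
    DIR *d;
    struct dirent *e;

    snprintf(path, sizeof(path),
             "/sys/bus/event_source/devices/%s/events", pmu);
    d = opendir(path);
    if (!d)
        return; /* this PMU exports no named events */
    while ((e = readdir(d)))
        if (e->d_name[0] != '.')
            printf("  event: %s\n", e->d_name);
    closedir(d);
}

int main(void)
{
    DIR *d = opendir("/sys/bus/event_source/devices");
    struct dirent *e;

    if (!d)
        return 1;
    while ((e = readdir(d))) {
        char path[512];
        unsigned int type;
        FILE *f;

        if (e->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/type", e->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (fscanf(f, "%u", &type) == 1)
            printf("%s: type=%u\n", e->d_name, type);
        fclose(f);
        dump_events(e->d_name);
    }
    closedir(d);
    return 0;
}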


Yichao Yu


* Re: Kernel perf counter support (for apple M1 and others)
  2022-04-01  1:39 Kernel perf counter support (for apple M1 and others) Yichao Yu
@ 2022-04-13 12:58 ` Yichao Yu
  2022-04-18 12:01 ` Marc Zyngier
  1 sibling, 0 replies; 7+ messages in thread
From: Yichao Yu @ 2022-04-13 12:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: khuong, will, mark.rutland, Frank.li, zhangshaokun, liuqi115,
	john.garry, Mathieu Poirier, Leo Yan, marc.zyngier

> I am playing with the performance counters on the apple M1 chip from
> linux with the hope that it could help making userspace tools like
> perf and rr works on the M1. However, I was told that none of these
> info should go into the kernel (not even raw event names) and the
> userspace should only use the raw event numbers instead of
> PERF_TYPE_HARDWARE even for events that have a canonical counterpart.
>
> Although I'm not planning to submit any kernel patches anytime soon
> and I'm mostly interested in running the test right now, I do want to
> know what I should expect in the long term on the userspace side. I
> was told to ask about this on "the list" (and I'm hoping this is the
> right one after browsing through MAINTAINERS) instead. There are a few
> issues/questions, not all of which are related to M1/asymmetric
> systems. For context, see
> https://oftc.irclog.whitequark.org/asahi-dev/2022-03-30 (there also
> happens to be no other discussion on the channel that day)
>
> 1. Is it acceptable (to either kernel or perf source) to submit
> patches that are based on a14.plist from macOS. I have personally
> never looked at it but if it is acceptable then there's little point
> doing the experiment I was doing (apart from the fun doing so and as a
> practice to understand the system).
>
> 2. Should the kernel provide names for hardware events? Here I'm
> talking about things under
> `/sys/bus/event_source/devices/<pmu>/events` which I assume is
> provided by the kernel (that or my understanding of sysfs has been
> fundamentally wrong/out-of-date...). Based on the fact that the
> current pmu kernel driver for the M1 does provide this and this
> comment https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/drivers/perf/apple_m1_cpu_pmu.c#L31
> I assume it's desired. This would also agree with what I've observed
> on other (including non-x86) systems. If this is the case, I assume
> the kernel driver for the M1 PMU isn't fully "done" yet.
>
> 3. For counting events on a system with asymmetric cores.
>     I understand that if the system contains multiple processors of
> different characteristics, it may not make sense to provide a counter
> that counts events on both (or all) types of cores. However, there are
> events (PERF_COUNT_HW_INSTRUCTIONS and
> PERF_COUNT_HW_BRANCH_INSTRUCTIONS at the least) that shouldn't really
> be affected by this (and in fact, any counters that counts events
> visible directly to the software/userspace). I want to even say that
> branch misses/cache reference/misses might be in this category as well
> although certainly not as clear cut.
>
> 4. There are other events that may not make as much sense to combine
> (cycles for example). However, I feel like a combined cycle count
> isn't going to be much tricker to use given that the cycle count on a
> single core is still affected by frequency scaling and it can still be
> used correctly by pinning the thread.
>
> The main reasons I'm asking about 3 and 4 is that
> 1. Right now, even to just count instructions without pinning the
> thread, I need to create two counters.
> 2. Even if the number isn't exactly accurate, it can still be useful
> as a general guideline. Right now, even if I just want to do a quick
> check, I still need to manually specify a dozen of events in `perf
> stat -e` rather than simply using `perf stat` (to make it worse, perf
> doesn't even provide any useful warning about it). It is also much
> harder to do things generically (which is at least partially because
> of the lack of documentation....).


Anyone got any input on this? Over at https://rr-project.org/, it
would be really nice if some counters could be handled transparently
when the process migrates between cores.

>
>
> Yichao Yu


* Re: Kernel perf counter support (for apple M1 and others)
  2022-04-01  1:39 Kernel perf counter support (for apple M1 and others) Yichao Yu
  2022-04-13 12:58 ` Yichao Yu
@ 2022-04-18 12:01 ` Marc Zyngier
  2022-04-19 12:06   ` Yichao Yu
  1 sibling, 1 reply; 7+ messages in thread
From: Marc Zyngier @ 2022-04-18 12:01 UTC (permalink / raw)
  To: Yichao Yu
  Cc: linux-arm-kernel, khuong, will, mark.rutland, Frank.li,
	zhangshaokun, liuqi115, john.garry, mathieu.poirier, leo.yan

Hi,

Please make sure you use current email addresses (the MAINTAINERS file
should be accurate for any recent kernel version).

On Fri, 01 Apr 2022 02:39:39 +0100,
Yichao Yu <yyc1992@gmail.com> wrote:
> 
> Hi,
> 
> I am playing with the performance counters on the apple M1 chip from
> linux with the hope that it could help making userspace tools like
> perf and rr works on the M1. However, I was told that none of these
> info should go into the kernel (not even raw event names) and the
> userspace should only use the raw event numbers instead of
> PERF_TYPE_HARDWARE even for events that have a canonical counterpart.

Since I was the one who had a brief chat with you on IRC, let me
clarify what I said exactly:

- I don't think there is any value in stashing any of these HW events
  in the kernel. In most cases, the kernel definition only matches the
  x86 definition, and doesn't accurately describe the vast majority of
  the events implemented on an ARM CPU. The ARM architecture mentions
  a handful of architectural events that actually match the kernel
  definition, and for these CPUs the kernel carries the in-kernel
  description.

- For the M1, none of the above applies, because there is *NO*
  architectural description for the events reported by the (non
  architectural) PMU, and there is no guarantee that they actually
  match the common understanding we have of these events.

- The correct place for these non-architectural events is in a JSON
  description that would be built into perf, which would give you
  symbolic events. Bloating the kernel for something we're not sure
  about seems counterproductive.
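
To give you an idea of the format, a pmu-events JSON entry looks
roughly like this (the event code, name and description are
placeholders, not a claim about what the M1 actually counts):

[
    {
        "EventCode": "0x08",
        "EventName": "INST_RETIRED",
        "BriefDescription": "Instructions architecturally executed (placeholder)"
    }
]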

> Although I'm not planning to submit any kernel patches anytime soon
> and I'm mostly interested in running the test right now, I do want to
> know what I should expect in the long term on the userspace side. I
> was told to ask about this on "the list" (and I'm hoping this is the
> right one after browsing through MAINTAINERS) instead. There are a few
> issues/questions, not all of which are related to M1/asymmetric
> systems. For context, see
> https://oftc.irclog.whitequark.org/asahi-dev/2022-03-30 (there also
> happens to be no other discussion on the channel that day)
>
> 1. Is it acceptable (to either kernel or perf source) to submit
> patches that are based on a14.plist from macOS. I have personally
> never looked at it but if it is acceptable then there's little point
> doing the experiment I was doing (apart from the fun doing so and as a
> practice to understand the system).

My take on this is "I am not a lawyer". The MacOS file is Apple's
intellectual property, and I'm not prepared to use it, transform it,
or interpret it in any way. At best, giving people a way to use this
file on their own system without distributing it would be a step in
the right direction (it should be rather simple to turn this file into
the JSON format that perf uses).

Now, if someone with the right level of IP law expertise wants to take
responsibility for this, I'm not going to get in the way. I'm just not
going to be the one looking at it or taking the patch.

> 2. Should the kernel provide names for hardware events? Here I'm
> talking about things under
> `/sys/bus/event_source/devices/<pmu>/events` which I assume is
> provided by the kernel (that or my understanding of sysfs has been
> fundamentally wrong/out-of-date...). Based on the fact that the
> current pmu kernel driver for the M1 does provide this and this
> comment https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/drivers/perf/apple_m1_cpu_pmu.c#L31
> I assume it's desired. This would also agree with what I've observed
> on other (including non-x86) systems. If this is the case, I assume
> the kernel driver for the M1 PMU isn't fully "done" yet.

See my reply above: there are no architectural descriptions for these
events, and we don't know how closely they match the definition Linux
has. If one day Apple shows up and tells us how close these events are
to their Linux (and thus x86) definitions, we can expand this. Until
then, the interpretation belongs, IMHO, to userspace.

I'd rather *remove* CYCLES and INSTRUCTIONS definitions from the
kernel than add any other.

> 3. For counting events on a system with asymmetric cores.
>     I understand that if the system contains multiple processors of
> different characteristics, it may not make sense to provide a counter
> that counts events on both (or all) types of cores. However, there are
> events (PERF_COUNT_HW_INSTRUCTIONS and
> PERF_COUNT_HW_BRANCH_INSTRUCTIONS at the least) that shouldn't really
> be affected by this (and in fact, any counters that counts events
> visible directly to the software/userspace). I want to even say that
> branch misses/cache reference/misses might be in this category as well
> although certainly not as clear cut.

That boat has sailed a long time ago, when the BL PMU support was
introduced, and all counters are treated equally: they are *NOT*
counted globally. Changing this would be an ABI break, and I seriously
doubt we want to go there.

It would also mean that the kernel would need to know which counters
it can accumulate over the various CPU types (which is often more than
2, these days). All of that to save userspace adding things? I doubt
this is worth it.

> 4. There are other events that may not make as much sense to combine
> (cycles for example). However, I feel like a combined cycle count
> isn't going to be much tricker to use given that the cycle count on a
> single core is still affected by frequency scaling and it can still be
> used correctly by pinning the thread.

I don't understand what frequency scaling has to do with this
(a cycle is still a cycle at any frequency).

> 
> The main reasons I'm asking about 3 and 4 is that
> 1. Right now, even to just count instructions without pinning the
> thread, I need to create two counters.

How bad is that? I mean, the counters are per-CPU anyway, so there
*are* N counters (N being the number of CPUs). You only have to create
a counter per PMU.
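
Something along these lines is really all it takes. This is a rough,
untested sketch; the PMU names and the 0x8c raw event number are
placeholders for whatever your system actually exposes.

/*
 * Rough, untested sketch of "one counter per CPU PMU": open the same
 * raw event once per PMU (dynamic type read from sysfs) for the current
 * thread on any CPU, then add up the reads in userspace. The PMU names
 * and the 0x8c raw event number are placeholders, not real M1 values.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static unsigned int pmu_type(const char *pmu)
{
    char path[256];
    unsigned int type = 0;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/bus/event_source/devices/%s/type", pmu);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%u", &type) != 1)
            type = 0;
        fclose(f);
    }
    return type;
}

int main(void)
{
    /* Placeholder names: use whatever actually shows up in sysfs. */
    const char *pmus[] = { "cpu_pmu_a", "cpu_pmu_b" };
    int fds[2];
    uint64_t total = 0;
    int i;

    for (i = 0; i < 2; i++) {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = pmu_type(pmus[i]);  /* dynamic PMU type */
        attr.config = 0x8c;             /* placeholder raw event */
        attr.exclude_kernel = 1;
        /* Each event only counts while the thread runs on a CPU served
         * by that PMU. */
        fds[i] = perf_event_open(&attr, 0, -1, -1, 0);
    }

    /* ... run the code being measured ... */

    for (i = 0; i < 2; i++) {
        uint64_t count = 0;

        if (fds[i] >= 0) {
            if (read(fds[i], &count, sizeof(count)) == sizeof(count))
                total += count;
            close(fds[i]);
        }
    }
    printf("total: %llu\n", (unsigned long long)total);
    return 0;
}

Adding the two values together at the end is the only thing userspace
has to do.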

> 2. Even if the number isn't exactly accurate, it can still be useful
> as a general guideline. Right now, even if I just want to do a quick
> check, I still need to manually specify a dozen of events in `perf
> stat -e` rather than simply using `perf stat` (to make it worse, perf
> doesn't even provide any useful warning about it). It is also much
> harder to do things generically (which is at least partially because
> of the lack of documentation....).

I see this as a potential perf-tool improvement. Being able to say
'Count this event on all CPU PMUs' would certainly be valuable to all
asymmetric systems.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: Kernel perf counter support (for apple M1 and others)
  2022-04-18 12:01 ` Marc Zyngier
@ 2022-04-19 12:06   ` Yichao Yu
  2022-04-19 13:09     ` Marc Zyngier
  0 siblings, 1 reply; 7+ messages in thread
From: Yichao Yu @ 2022-04-19 12:06 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-arm-kernel, khuong, will, mark.rutland, Frank.li,
	zhangshaokun, liuqi115, john.garry, Mathieu Poirier, Leo Yan

> - I don't think there is any value in stashing any of these HW events
>   in the kernel. In most cases, the kernel definition only matches the
>   x86 definition, and doesn't accurately describe the vast majority of
>   the events implemented on an ARM CPU. The ARM architecture mentions
>   a handful of architectural events that actually match the kernel
>   definition, and for these CPUs the kernel carries the in-kernel
>   description.
>
> - For the M1, none of the above applies, because there is *NO*
>   architectural description for the events reported by the (non
>   architectural) PMU, and there is no guarantee that they actually
>   match the common understanding we have of these events.

You mentioned documentation from Apple on IRC and below. Why is that
the only acceptable source?
The entire M1 support is based on reverse engineering/testing of
the hardware, so why would those not be acceptable sources here as
well?
My understanding is that the current cycles and instructions counters
were figured out this way, so I don't see why you would want them to be
removed.
There are also other counters that I believe match the
canonical definitions, and I don't see why those should be left out
either.

> - The correct place for these non-architectural events is in a JSON
>   description that would be built into perf, which would give you
>   symbolic events. Bloating the kernel for something we're not sure
>   about seems counterproductive.

As I've mentioned before, perf isn't the only user of performance
counters. If there were a shared place, or even a good document, for
this, it would be better.
Currently, just from reading the documentation of the hardware event
types, it seems that they should work if the hardware supports such
counters.

> > 2. Should the kernel provide names for hardware events? Here I'm
> > talking about things under
> > `/sys/bus/event_source/devices/<pmu>/events` which I assume is
> > provided by the kernel (that or my understanding of sysfs has been
> > fundamentally wrong/out-of-date...). Based on the fact that the
> > current pmu kernel driver for the M1 does provide this and this
> > comment https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/drivers/perf/apple_m1_cpu_pmu.c#L31
> > I assume it's desired. This would also agree with what I've observed
> > on other (including non-x86) systems. If this is the case, I assume
> > the kernel driver for the M1 PMU isn't fully "done" yet.
>
> See my reply above: there are no architectural descriptions for these
> events, and we don't know how closely they match the definition Linux
> has. If one day Apple shows up and tells us how close these events are
> from their Linux (and thus x86) definition, we can expand this. Until
> then, the interpretation belongs, IMHO, to userspace.
>
> I'd rather *remove* CYCLES and INSTRUCTIONS definitions from the
> kernel than add any other.

Replied above.

> > 3. For counting events on a system with asymmetric cores.
> >     I understand that if the system contains multiple processors of
> > different characteristics, it may not make sense to provide a counter
> > that counts events on both (or all) types of cores. However, there are
> > events (PERF_COUNT_HW_INSTRUCTIONS and
> > PERF_COUNT_HW_BRANCH_INSTRUCTIONS at the least) that shouldn't really
> > be affected by this (and in fact, any counters that counts events
> > visible directly to the software/userspace). I want to even say that
> > branch misses/cache reference/misses might be in this category as well
> > although certainly not as clear cut.
>
> That boat has sailed a long time ago, when the BL PMU support was
> introduced, and all counters are treated equally: they are *NOT*
> counted globally. Changing this would be an ABI break, and I seriously
> doubt we want to go there.

Sorry, I'm not familiar with the names here. What's the "BL PMU"
support? And what are the counters that are not counted globally?

> It would also mean that the kernel would need to know which counters
> it can accumulate over the various CPU types (which is often more than
> 2, these days). All of that to save userspace adding things? I doubt
> this is worth it.
>
> > 4. There are other events that may not make as much sense to combine
> > (cycles for example). However, I feel like a combined cycle count
> > isn't going to be much tricker to use given that the cycle count on a
> > single core is still affected by frequency scaling and it can still be
> > used correctly by pinning the thread.
>
> I don't understand what frequency scaling has anything to do with this
> (a cycle is still a cycle at any frequency).

Exactly, a cycle is still a cycle, so I don't see why it's that big a
problem to count it globally.
What I meant exactly was that if a piece of code runs for 100 cycles at
1 GHz, it doesn't mean it'll also run for (close to) 100 cycles at 3 GHz.
Similarly, if it runs for 100 cycles on the E core, it doesn't mean
it'll run for 100 cycles on the P core.
We already allow the former case to be counted using the same counter
everywhere, so I don't see why the latter can't be allowed (ABI change
issue aside).
I don't have hardware to test this, but it also seems that on the new
Intel chips, the E cores and the P cores are counted together. (This is
purely based on the lack of multiple-counter support in rr for
the new chips...)

> > The main reasons I'm asking about 3 and 4 is that
> > 1. Right now, even to just count instructions without pinning the
> > thread, I need to create two counters.
>
> How bad is that? I mean, the counters are per-CPU anyway, so there
> *are* N counters (N being the number of CPUs). You only have to create
> a counter per PMU.
>
> > 2. Even if the number isn't exactly accurate, it can still be useful
> > as a general guideline. Right now, even if I just want to do a quick
> > check, I still need to manually specify a dozen of events in `perf
> > stat -e` rather than simply using `perf stat` (to make it worse, perf
> > doesn't even provide any useful warning about it). It is also much
> > harder to do things generically (which is at least partially because
> > of the lack of documentation....).
>
> I see this as a potential perf-tool improvement. Being able to say
> 'Count this event on all CPU PMUs'  would certainly be valuable to all
> asymmetric systems.

The short answer is: not that bad, if and only if there's a standard
and documented way to do this, in userspace or in the kernel.
(A userspace solution that automatically sums counter values together
would also need to handle grouping as well.)
However, as I mentioned above, based on the documentation I can find,
there isn't a standard interface for a userspace program to figure out
how to use these counters correctly, and the documentation for
perf_event_open also doesn't mention these kinds of limitations. That
is where my expectations of the kernel interface come from.

> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.


* Re: Kernel perf counter support (for apple M1 and others)
  2022-04-19 12:06   ` Yichao Yu
@ 2022-04-19 13:09     ` Marc Zyngier
  2022-04-19 13:34       ` Yichao Yu
  0 siblings, 1 reply; 7+ messages in thread
From: Marc Zyngier @ 2022-04-19 13:09 UTC (permalink / raw)
  To: Yichao Yu
  Cc: linux-arm-kernel, khuong, will, mark.rutland, Frank.li,
	zhangshaokun, liuqi115, john.garry, Mathieu Poirier, Leo Yan

On Tue, 19 Apr 2022 13:06:37 +0100,
Yichao Yu <yyc1992@gmail.com> wrote:
> 
> > - I don't think there is any value in stashing any of these HW events
> >   in the kernel. In most cases, the kernel definition only matches the
> >   x86 definition, and doesn't accurately describe the vast majority of
> >   the events implemented on an ARM CPU. The ARM architecture mentions
> >   a handful of architectural events that actually match the kernel
> >   definition, and for these CPUs the kernel carries the in-kernel
> >   description.
> >
> > - For the M1, none of the above applies, because there is *NO*
> >   architectural description for the events reported by the (non
> >   architectural) PMU, and there is no guarantee that they actually
> >   match the common understanding we have of these events.
> 
> You mentioned documents from Apple on IRC and below. Why is that the
> only acceptable source?

Because that would be the only one giving an exact definition of what
you are counting. Anything else is guess-work. Very good guess-work,
I'm sure, but still very much a wet finger in the air.

> The entire support for M1 is based on reverse engineering/testing of
> the hardware so why would those not be acceptable sources here as
> well?

Because there is a difference between getting something to work (the
PMU driver itself) and interpreting the results it gives. All we know
is that it is counting something. You can sort of guess what, but you
don't know for sure.

> My understanding is that the current cycles and instructions counters
> were figured out this way so I don't see why you want them to be
> removed.

Because you use them as an argument to pile more crap in the
kernel. Gee, at this stage, it is the driver itself I am going to
remove.

> There are also other counters that I believe are matching the
> canonical definitions and I don't see why those should be left out
> either.
>
> > - The correct place for these non-architectural events is in a JSON
> >   description that would be built into perf, which would give you
> >   symbolic events. Bloating the kernel for something we're not sure
> >   about seems counterproductive.
> 
> As I've mentioned before, perf isn't the only user of performance
> counters. If there is a shared place, or even good document, for this,
> it might have been better.

Feel free to write a document or something else. The only thing I care
about is in the kernel tree.

> Currently, just by reading the document of the hardware event type, it
> seems that it should work if the hardware supports such counters.

Such a document would be the JSON file I mentioned. But since you have
stated that you don't intend to write anything that ends up in the
kernel, I guess that's a moot point.

> > > 2. Should the kernel provide names for hardware events? Here I'm
> > > talking about things under
> > > `/sys/bus/event_source/devices/<pmu>/events` which I assume is
> > > provided by the kernel (that or my understanding of sysfs has been
> > > fundamentally wrong/out-of-date...). Based on the fact that the
> > > current pmu kernel driver for the M1 does provide this and this
> > > comment https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/drivers/perf/apple_m1_cpu_pmu.c#L31
> > > I assume it's desired. This would also agree with what I've observed
> > > on other (including non-x86) systems. If this is the case, I assume
> > > the kernel driver for the M1 PMU isn't fully "done" yet.
> >
> > See my reply above: there are no architectural descriptions for these
> > events, and we don't know how closely they match the definition Linux
> > has. If one day Apple shows up and tells us how close these events are
> > from their Linux (and thus x86) definition, we can expand this. Until
> > then, the interpretation belongs, IMHO, to userspace.
> >
> > I'd rather *remove* CYCLES and INSTRUCTIONS definitions from the
> > kernel than add any other.
> 
> Replied above.
> 
> > > 3. For counting events on a system with asymmetric cores.
> > >     I understand that if the system contains multiple processors of
> > > different characteristics, it may not make sense to provide a counter
> > > that counts events on both (or all) types of cores. However, there are
> > > events (PERF_COUNT_HW_INSTRUCTIONS and
> > > PERF_COUNT_HW_BRANCH_INSTRUCTIONS at the least) that shouldn't really
> > > be affected by this (and in fact, any counters that counts events
> > > visible directly to the software/userspace). I want to even say that
> > > branch misses/cache reference/misses might be in this category as well
> > > although certainly not as clear cut.
> >
> > That boat has sailed a long time ago, when the BL PMU support was
> > introduced, and all counters are treated equally: they are *NOT*
> > counted globally. Changing this would be an ABI break, and I seriously
> > doubt we want to go there.
> 
> Sorry I'm not familiar with the names here. What's the "BL PMU"
> support? And what are the counters that are not counted globally?

BL stands for Big-Little. Asymmetric support, if you want. None of the
counters are counted globally, only per PMU type. And this is an ABI
we cannot break.

> 
> > It would also mean that the kernel would need to know which counters
> > it can accumulate over the various CPU types (which is often more than
> > 2, these days). All of that to save userspace adding things? I doubt
> > this is worth it.
> >
> > > 4. There are other events that may not make as much sense to combine
> > > (cycles for example). However, I feel like a combined cycle count
> > > isn't going to be much tricker to use given that the cycle count on a
> > > single core is still affected by frequency scaling and it can still be
> > > used correctly by pinning the thread.
> >
> > I don't understand what frequency scaling has anything to do with this
> > (a cycle is still a cycle at any frequency).
> 
> Exactly, a cycle is still a cycle, so I don't see why it's that big a
> problem to count it globally.

Because you are going to walk the list of events generated during a
time slice, work out which ones are to be merged and which ones
aren't, and accumulate them into global, userspace visible counters? I
dread to imagine the effect on scheduling latency. All that to avoid
adding two values into userspace. Great.

> What I meant exactly was that if a code runs for 100 cycles at 1 GHz,
> it doesn't mean it'll also run (close to) 100 cycles at 3 GHz.
> Similarly, if it runs for 100 cycles on the E core, it doesn't mean
> it'll run for 100 cycles on the P core.

And? What do you derive from this set of statements?

> We already allow the former case to count using the same counter
> everywhere, I don't see why the latter can't be allowed. (ABI change
> issue aside)

*blink*. If you don't see a problem with changing the ABI, I'm at a
loss.

> I don't have hardware to test this but it also seems that on the new
> intel chips, the E core and the P core are counted together. (this is
> purely based on the lack of multiple counter support in rr to support
> the new chip...)

Colour me uninterested on both counts. x86 can do whatever they want.

> 
> > > The main reasons I'm asking about 3 and 4 is that
> > > 1. Right now, even to just count instructions without pinning the
> > > thread, I need to create two counters.
> >
> > How bad is that? I mean, the counters are per-CPU anyway, so there
> > *are* N counters (N being the number of CPUs). You only have to create
> > a counter per PMU.
> >
> > > 2. Even if the number isn't exactly accurate, it can still be useful
> > > as a general guideline. Right now, even if I just want to do a quick
> > > check, I still need to manually specify a dozen of events in `perf
> > > stat -e` rather than simply using `perf stat` (to make it worse, perf
> > > doesn't even provide any useful warning about it). It is also much
> > > harder to do things generically (which is at least partially because
> > > of the lack of documentation....).
> >
> > I see this as a potential perf-tool improvement. Being able to say
> > 'Count this event on all CPU PMUs'  would certainly be valuable to all
> > asymmetric systems.
> 
> Short answer is not that bad if and only if there's a standard and
> documented way to do this, userspace or kernel.

Feel free to improve the kernel documentation[1], which is admittedly
pretty sparse on the subject.

> (A userspace solution that automatically sums counters value together
> would also need to handle the grouping as well).
> However, as I mentioned above, based on the document I can find, there
> isn't a standard interface for a userspace program to figure out how
> to use these counters correctly and the document for perf_event_open
> also doesn't mention these kinds of limitations. These are what my
> expectation of the kernel interface come from.

The kernel gives you the tools to match PMUs and CPUs (just rummage in
sysfs). If userspace knows which counter is what, you're in business.
Do document your findings, by all means.
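
For instance, something like this tells you which CPUs a given PMU
covers (untested sketch; the PMU name is a placeholder, and I'm
assuming the "cpus" file that the CPU PMUs I've looked at expose, with
uncore PMUs using "cpumask" instead):

/*
 * Untested sketch; the PMU name is a placeholder. I'm assuming the CPU
 * PMUs expose a "cpus" file (uncore PMUs use "cpumask" instead), as on
 * the systems I've looked at.
 */
#include <stdio.h>

int main(void)
{
    char buf[256];
    FILE *f = fopen("/sys/bus/event_source/devices/cpu_pmu_a/cpus", "r");

    /* The file contains a CPU list such as "0-3": the CPUs on which
     * this PMU's counters are scheduled. */
    if (f && fgets(buf, sizeof(buf), f))
        printf("cpus: %s", buf);
    if (f)
        fclose(f);
    return 0;
}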

	M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/arm64/perf.rst#n136

-- 
Without deviation from the norm, progress is not possible.


* Re: Kernel perf counter support (for apple M1 and others)
  2022-04-19 13:09     ` Marc Zyngier
@ 2022-04-19 13:34       ` Yichao Yu
  2022-04-19 13:36         ` Yichao Yu
  0 siblings, 1 reply; 7+ messages in thread
From: Yichao Yu @ 2022-04-19 13:34 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-arm-kernel, khuong, will, mark.rutland, Frank.li,
	zhangshaokun, liuqi115, john.garry, Mathieu Poirier, Leo Yan

On Tue, Apr 19, 2022 at 9:09 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Tue, 19 Apr 2022 13:06:37 +0100,
> Yichao Yu <yyc1992@gmail.com> wrote:
> >
> > > - I don't think there is any value in stashing any of these HW events
> > >   in the kernel. In most cases, the kernel definition only matches the
> > >   x86 definition, and doesn't accurately describe the vast majority of
> > >   the events implemented on an ARM CPU. The ARM architecture mentions
> > >   a handful of architectural events that actually match the kernel
> > >   definition, and for these CPUs the kernel carries the in-kernel
> > >   description.
> > >
> > > - For the M1, none of the above applies, because there is *NO*
> > >   architectural description for the events reported by the (non
> > >   architectural) PMU, and there is no guarantee that they actually
> > >   match the common understanding we have of these events.
> >
> > You mentioned documents from Apple on IRC and below. Why is that the
> > only acceptable source?
>
> Because that would be the only one giving an exact definition to what
> you are counting. Anything else is guess-work. Very good guess-work,
> I'm sure, but still very much a wet finger in the air.
>
> > The entire support for M1 is based on reverse engineering/testing of
> > the hardware so why would those not be acceptable sources here as
> > well?
>
> Because there is a difference between getting something to work (the
> PMU driver itself) and interpreting the results it gives. All we know
> is that it is counting something. You can sort of guess what, but you
> don't know for sure.
>
> > My understanding is that the current cycles and instructions counters
> > were figured out this way so I don't see why you want them to be
> > removed.
>
> Because you use them as an argument to pile more crap in the
> kernel. Gee, at this stage, it is the driver itself I am going to
> remove.

I'm sorry, but I'm not sure why you are so mad about this. That's
certainly not my intention.
I specifically said that I wasn't intending to submit anything to the
kernel at this point (which I assume is the "crap" you are talking
about) because I don't know what's acceptable and I want to understand
why.
As for the comparison to other M1 support code, I'm not talking about
the perf counter driver specifically, but about all of the other
related drivers. I'm sure there is a lot of code that depends on what
specific thing a register is doing. For this particular question I'd
like to know why there's a difference between the two. The answer might
be blindingly obvious to you, but that is certainly not the case for
me. (And FWIW, this very reason, that I think there might be some
background knowledge I'm lacking, is why I asked on IRC first.)

> Feel free to write a document or something else. The only thing I care
> about is in the kernel tree.

That is fair, but my point is that this is literally the first time
I've heard of the hardware event types being essentially deprecated.
I'm certainly not qualified to write such a document myself (at least
not right now), and I won't be unless someone can explain to me what
the actual expectation is and why, or point me to an existing document
explaining all of this, so that I can contribute to the documentation
of other projects.

> > Currently, just by reading the document of the hardware event type, it
> > seems that it should work if the hardware supports such counters.
>
> Such document would be the JSON file I mentioned. But since you have
> stated that you don't intend to write anything that ends up in the
> kernel, I guess that's a moot point.

By documentation I meant that `perf_event_open(2)` doesn't say anything
about, say, the instruction hardware counter not counting all
instructions even when you get a non-zero value.

> > > That boat has sailed a long time ago, when the BL PMU support was
> > > introduced, and all counters are treated equally: they are *NOT*
> > > counted globally. Changing this would be an ABI break, and I seriously
> > > doubt we want to go there.
> >
> > Sorry I'm not familiar with the names here. What's the "BL PMU"
> > support? And what are the counters that are not counted globally?
>
> BL stands for Big-Little. Asymmetric support, if you want. None of the
> counters are counted globally, only per PMU type. And this is an ABI
> we cannot break.

Are you talking about the dynamic PMU types, or the hardware/raw types?

> > > It would also mean that the kernel would need to know which counters
> > > it can accumulate over the various CPU types (which is often more than
> > > 2, these days). All of that to save userspace adding things? I doubt
> > > this is worth it.
> > >
> > > > 4. There are other events that may not make as much sense to combine
> > > > (cycles for example). However, I feel like a combined cycle count
> > > > isn't going to be much tricker to use given that the cycle count on a
> > > > single core is still affected by frequency scaling and it can still be
> > > > used correctly by pinning the thread.
> > >
> > > I don't understand what frequency scaling has anything to do with this
> > > (a cycle is still a cycle at any frequency).
> >
> > Exactly, a cycle is still a cycle, so I don't see why it's that big a
> > problem to count it globally.
>
> Because you are going to walk the list of events generated during a
> time slice, work out which ones are to be merged and which ones
> aren't, and accumulate them into global, userspace visible counters? I
> dread to imagine the effect on scheduling latency. All that to avoid
> adding two values into userspace. Great.

OK, if doing that will always incur a big overhead then I can accept
that. What I imagined was that this only needs to be done when the
process is moved to a different CPU, and I also thought there should
already be some scheduling logic related to perf counters (I was
imagining that's where the kernel decides to add/remove counters in
other cases), which is why I thought adding such logic shouldn't make a
big difference when no counters are used by the process. I can
certainly be wrong about that.

Also, see below.

> > What I meant exactly was that if a code runs for 100 cycles at 1 GHz,
> > it doesn't mean it'll also run (close to) 100 cycles at 3 GHz.
> > Similarly, if it runs for 100 cycles on the E core, it doesn't mean
> > it'll run for 100 cycles on the P core.
>
> And? What do you derive from this set of statements?

And this is in reply to the original argument you gave, saying that
counting cycles across different core types doesn't make sense. What
I'm saying here is that I don't believe counting across core types
makes any more or less sense than counting cycles across different
processor frequencies.

> > We already allow the former case to count using the same counter
> > everywhere, I don't see why the latter can't be allowed. (ABI change
> > issue aside)
>
> *blink*. If you don't see a problem with changing the ABI, I'm at a
> loss.

Yes, I do see the issue with changing ABIs. However, you brought up
multiple arguments, and I'd like to understand each of them
individually. It's certainly possible that some of what I was asking
about is impossible for some specific reason, but I'd like to
understand all of the arguments you brought up in order to fully
understand the issue. (Also, what I meant here is that I get that there
could be an ABI issue, although I don't fully understand it yet, which
is why I was asking above; I just wanted to discuss this part
separately from the ABI concern. I didn't mean that we can ignore all
the ABI issues and just change things. If that's what my wording
implied, I'm sorry about that.)

> > I don't have hardware to test this but it also seems that on the new
> > intel chips, the E core and the P core are counted together. (this is
> > purely based on the lack of multiple counter support in rr to support
> > the new chip...)
>
> Colour me uninterested on both count. x86 can do whatever they want.

Again, this is just to show that counting globally on both E and P
cores isn't something that makes as little sense as you originally
said.

> >
> > > > The main reasons I'm asking about 3 and 4 is that
> > > > 1. Right now, even to just count instructions without pinning the
> > > > thread, I need to create two counters.
> > >
> > > How bad is that? I mean, the counters are per-CPU anyway, so there
> > > *are* N counters (N being the number of CPUs). You only have to create
> > > a counter per PMU.
> > >
> > > > 2. Even if the number isn't exactly accurate, it can still be useful
> > > > as a general guideline. Right now, even if I just want to do a quick
> > > > check, I still need to manually specify a dozen of events in `perf
> > > > stat -e` rather than simply using `perf stat` (to make it worse, perf
> > > > doesn't even provide any useful warning about it). It is also much
> > > > harder to do things generically (which is at least partially because
> > > > of the lack of documentation....).
> > >
> > > I see this as a potential perf-tool improvement. Being able to say
> > > 'Count this event on all CPU PMUs'  would certainly be valuable to all
> > > asymmetric systems.
> >
> > Short answer is not that bad if and only if there's a standard and
> > documented way to do this, userspace or kernel.
>
> Feel free to improve the kernel documentation[1], which is admittedly
> pretty sparse on the subject.
>
> The kernel gives you the tools to match PMUs and CPUs (just rummage in
> sysfs). If userspace knows which counter is what, you're in business.
> Do document your findings, by any mean.

And as I said above, without understanding all the details I can't.
It also seems that I don't know the right way to get such
information without putting up crap, so I'd appreciate it if you could
let me know how I can find out more details about it without annoying
more people.

>
>         M.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/arm64/perf.rst#n136
>
> --
> Without deviation from the norm, progress is not possible.


* Re: Kernel perf counter support (for apple M1 and others)
  2022-04-19 13:34       ` Yichao Yu
@ 2022-04-19 13:36         ` Yichao Yu
  0 siblings, 0 replies; 7+ messages in thread
From: Yichao Yu @ 2022-04-19 13:36 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-arm-kernel, khuong, will, mark.rutland, Frank.li,
	zhangshaokun, liuqi115, john.garry, Mathieu Poirier, Leo Yan

On Tue, Apr 19, 2022 at 9:34 AM Yichao Yu <yyc1992@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 9:09 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Tue, 19 Apr 2022 13:06:37 +0100,
> > Yichao Yu <yyc1992@gmail.com> wrote:
> > >
> > > > - I don't think there is any value in stashing any of these HW events
> > > >   in the kernel. In most cases, the kernel definition only matches the
> > > >   x86 definition, and doesn't accurately describe the vast majority of
> > > >   the events implemented on an ARM CPU. The ARM architecture mentions
> > > >   a handful of architectural events that actually match the kernel
> > > >   definition, and for these CPUs the kernel carries the in-kernel
> > > >   description.
> > > >
> > > > - For the M1, none of the above applies, because there is *NO*
> > > >   architectural description for the events reported by the (non
> > > >   architectural) PMU, and there is no guarantee that they actually
> > > >   match the common understanding we have of these events.
> > >
> > > You mentioned documents from Apple on IRC and below. Why is that the
> > > only acceptable source?
> >
> > Because that would be the only one giving an exact definition to what
> > you are counting. Anything else is guess-work. Very good guess-work,
> > I'm sure, but still very much a wet finger in the air.
> >
> > > The entire support for M1 is based on reverse engineering/testing of
> > > the hardware so why would those not be acceptable sources here as
> > > well?
> >
> > Because there is a difference between getting something to work (the
> > PMU driver itself) and interpreting the results it gives. All we know
> > is that it is counting something. You can sort of guess what, but you
> > don't know for sure.
> >
> > > My understanding is that the current cycles and instructions counters
> > > were figured out this way so I don't see why you want them to be
> > > removed.
> >
> > Because you use them as an argument to pile more crap in the
> > kernel. Gee, at this stage, it is the driver itself I am going to
> > remove.
>
> I'm sorry but I'm not sure why you are so mad about this. That's
> certainly not my intention.
> I specifically said that I wasn't intended to submit anything to the
> kernel at this point (which I assume is the "crap" you are talking
> about) because I don't know what's acceptable and I want to understand
> why.
> For comparison to other M1 supporting code, I'm not talking about the
> perf counter driver specifically, but all of the other related
> drivers. I'm sure there is a lot of code that depends on what specific
> thing a register is doing. For this particular question I'd like to
> know why there's a difference between the two. The answer might be
> bluntly obvious to you but that is certainly not the case for me. (And
> FWIW, this very reason, that I think there might be some background
> knowledge that I'm lacking is why I asked on IRC first)
>
> > Feel free to write a document or something else. The only thing I care
> > about is in the kernel tree.
>
> That is fair, but my point is that this is literally the first time I
> heard about hardware event type being essentially deprecated. I'm
> certainly not qualified to write such a document myself (at least not
> right now) and I won't be unless someone could explain to me what is
> actually the expectation and why, or if there's existing document
> explaining all these so that I can contribute to the document of other
> projects.
>
> > > Currently, just by reading the document of the hardware event type, it
> > > seems that it should work if the hardware supports such counters.
> >
> > Such document would be the JSON file I mentioned. But since you have
> > stated that you don't intend to write anything that ends up in the
> > kernel, I guess that's a moot point.
>
> By document I meant that `perf_event_open(2)` doesn't say anything
> about, say the instruction hardware counter doesn't count all
> instructions even when you get a non-zero value.
>
> > > > That boat has sailed a long time ago, when the BL PMU support was
> > > > introduced, and all counters are treated equally: they are *NOT*
> > > > counted globally. Changing this would be an ABI break, and I seriously
> > > > doubt we want to go there.
> > >
> > > Sorry I'm not familiar with the names here. What's the "BL PMU"
> > > support? And what are the counters that are not counted globally?
> >
> > BL stands for Big-Little. Asymmetric support, if you want. None of the
> > counters are counted globally, only per PMU type. And this is an ABI
> > we cannot break.
>
> Are you talking about the dynamic PMU type or the hardware or raw type?
>
> > > > It would also mean that the kernel would need to know which counters
> > > > it can accumulate over the various CPU types (which is often more than
> > > > 2, these days). All of that to save userspace adding things? I doubt
> > > > this is worth it.
> > > >
> > > > > 4. There are other events that may not make as much sense to combine
> > > > > (cycles for example). However, I feel like a combined cycle count
> > > > > isn't going to be much tricker to use given that the cycle count on a
> > > > > single core is still affected by frequency scaling and it can still be
> > > > > used correctly by pinning the thread.
> > > >
> > > > I don't understand what frequency scaling has anything to do with this
> > > > (a cycle is still a cycle at any frequency).
> > >
> > > Exactly, a cycle is still a cycle, so I don't see why it's that big a
> > > problem to count it globally.
> >
> > Because you are going to walk the list of events generated during a
> > time slice, work out which ones are to be merged and which ones
> > aren't, and accumulate them into global, userspace visible counters? I
> > dread to imagine the effect on scheduling latency. All that to avoid
> > adding two values into userspace. Great.
>
> OK, if doing that will always incur a big overhead then I can take
> that. What I imagined was that this only needs to be done if the
> process is moved to a different CPU, and also I thought there should
> already be some logic in scheduling related to perf counters (I was
> imagining that's when the kernel decide to add/remove counters for
> other cases) which is why I thought adding such logic shouldn't make a
> big difference if no counters is used by the process. I can certainly
> be wrong about that.
>
> Also, see below.
>
> > > What I meant exactly was that if a code runs for 100 cycles at 1 GHz,
> > > it doesn't mean it'll also run (close to) 100 cycles at 3 GHz.
> > > Similarly, if it runs for 100 cycles on the E core, it doesn't mean
> > > it'll run for 100 cycles on the P core.
> >
> > And? What do you derive from this set of statements?
>
> And this is replying to the original argument you gave, saying that
> counting cycles across different core types doesn't make sense. What
> I'm saying here is that I don't believe counting across core types
> makes any more or less sense than counting cycles across different
> processor frequencies.
>
> > > We already allow the former case to count using the same counter
> > > everywhere, I don't see why the latter can't be allowed. (ABI change
> > > issue aside)
> >
> > *blink*. If you don't see a problem with changing the ABI, I'm at a
> > loss.
>
> Yes I do see the issue with changing ABIs. However, there are multiple
> arguments you brought up and I'd like to understand each of them
> individually. It's certainly possible that some of what I was asking
> about is impossible for some specific reason, but I'd like to
> understand all of the arguments you brought up to fully understand the
> issue. (also I intended to mean here that I get that there could be
> ABI issue, although I don't fully get it yet which is why I was asking
> above, however, I'd like to discuss this part without concerning the
> ABI issue, I didn't intend to mean that we can just ignore all the ABI
> issues and just change things. If that's not what I said actually
> implies, I'm sorry about that)
>
> > > I don't have hardware to test this but it also seems that on the new
> > > intel chips, the E core and the P core are counted together. (this is
> > > purely based on the lack of multiple counter support in rr to support
> > > the new chip...)
> >
> > Colour me uninterested on both count. x86 can do whatever they want.
>
> Again, this is just to show that counting globally on both E and P
> cores isn't something that makes as little sense as you originally
> said.
>
> > >
> > > > > The main reasons I'm asking about 3 and 4 is that
> > > > > 1. Right now, even to just count instructions without pinning the
> > > > > thread, I need to create two counters.
> > > >
> > > > How bad is that? I mean, the counters are per-CPU anyway, so there
> > > > *are* N counters (N being the number of CPUs). You only have to create
> > > > a counter per PMU.
> > > >
> > > > > 2. Even if the number isn't exactly accurate, it can still be useful
> > > > > as a general guideline. Right now, even if I just want to do a quick
> > > > > check, I still need to manually specify a dozen of events in `perf
> > > > > stat -e` rather than simply using `perf stat` (to make it worse, perf
> > > > > doesn't even provide any useful warning about it). It is also much
> > > > > harder to do things generically (which is at least partially because
> > > > > of the lack of documentation....).
> > > >
> > > > I see this as a potential perf-tool improvement. Being able to say
> > > > 'Count this event on all CPU PMUs'  would certainly be valuable to all
> > > > asymmetric systems.
> > >
> > > Short answer is not that bad if and only if there's a standard and
> > > documented way to do this, userspace or kernel.
> >
> > Feel free to improve the kernel documentation[1], which is admittedly
> > pretty sparse on the subject.
> >
> > The kernel gives you the tools to match PMUs and CPUs (just rummage in
> > sysfs). If userspace knows which counter is what, you're in business.
> > Do document your findings, by any mean.
>
> And as I said above, without understanding all the details I can't.
> And it also seems that I don't know the right way to get such
> information without putting up crap so I'll appreciate it if you could
> let me know how I can find out more detail about it without annoy more
> people.

And just to clarify, I didn't originally ask about these in order to
know how to fix/improve the documentation; I asked because I don't even
know what/where to fix, and apparently I knew even less than I thought
I did in this regard.

>
> >
> >         M.
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/arm64/perf.rst#n136
> >
> > --
> > Without deviation from the norm, progress is not possible.

