All of lore.kernel.org
 help / color / mirror / Atom feed
* perf-tools-next: Request for testing hybrid changes
@ 2023-06-22 18:57 Namhyung Kim
  2023-06-22 20:15 ` Ian Rogers
  0 siblings, 1 reply; 18+ messages in thread
From: Namhyung Kim @ 2023-06-22 18:57 UTC (permalink / raw)
  To: John Garry, Will Deacon, James Clark, Leo Yan, Mike Leach
  Cc: Ian Rogers, Kan Liang, Arnaldo Carvalho de Melo, Jiri Olsa,
	Adrian Hunter, Athira Rajeev, linux-perf-users, Thomas Richter,
	Ravi Bangoria

Hello,

In this development cycle, we've got a big change in how to handle
heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
but caused some troubles on others.  So I'd like to confirm it works
on major platforms without problems.  Especially on ARM64 as it has
the big.LITTLE CPUs.

Also we are transitioning from Arnaldo's tree to new repositories in
git.kernel.org:

 * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
 * kernel/git/perf/perf-tools-next (for new features for the next cycle)

There is a branch with the same name.  Please check out
perf-tools-next and play with it.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-22 18:57 perf-tools-next: Request for testing hybrid changes Namhyung Kim
@ 2023-06-22 20:15 ` Ian Rogers
  2023-06-23  8:42   ` James Clark
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Rogers @ 2023-06-22 20:15 UTC (permalink / raw)
  To: John Garry, Will Deacon, James Clark
  Cc: Leo Yan, Mike Leach, Kan Liang, Arnaldo Carvalho de Melo,
	Jiri Olsa, Adrian Hunter, Athira Rajeev, linux-perf-users,
	Thomas Richter, Ravi Bangoria, Namhyung Kim

On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
>
> Hello,
>
> In this development cycle, we've got a big change in how to handle
> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
> but caused some troubles on others.  So I'd like to confirm it works
> on major platforms without problems.  Especially on ARM64 as it has
> the big.LITTLE CPUs.


Thanks Namhyung. I believe there is a change on ARM64 that makes a
problem of one type into a slightly different problem. The perf parse
events logic handles events certain known hardware events using known
numeric encodings, for example, the cycles event has a type of 0
(PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286

The perf_event_open syscall can specify a CPU number of -1 for any. My
understanding is that -1 on ARM translates into choose the PMU of the
current running thread, this means that if a thread migrates from
between core/PMU types it won't be recorded on the alternate PMU type.

The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
this, but appears not to be set on ARM's PMU:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
Could an ARM64 related person fix this?

The perf tool is expecting it to be supported and so it tries to open
the event for each core PMU, so the cycles event will open for each of
a BIG.little PMU, however, as the extended type isn't supported the
same event will be opened twice and the migration problem still
exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
all PERF_TYPE_HW_CACHE events (42).

Assuming the capability issue is fixed, there is still an open issue
about how to properly support these 52 events on older kernels. It
would be possible to use the architecture standard events, so cycles
can map to cpu_cycles, instructions to sw_incr, cache-references to
l1d_cache (probably), etc. I'm not sure this will work for all 52
events and the plumbing in the perf tool isn't straightforward.
There's also the question of whether such a change of events should
happen when we know the perf_event_open will be given a CPU, should
that case not wildcard PMUs?

I'm happy to provide more context/details.

Thanks,
Ian

> Also we are transitioning from Arnaldo's tree to new repositories in
> git.kernel.org:
>
>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
>
> There is a branch with the same name.  Please check out
> perf-tools-next and play with it.
>
> Thanks,
> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-22 20:15 ` Ian Rogers
@ 2023-06-23  8:42   ` James Clark
  2023-06-23 14:30     ` Ian Rogers
  0 siblings, 1 reply; 18+ messages in thread
From: James Clark @ 2023-06-23  8:42 UTC (permalink / raw)
  To: Ian Rogers, John Garry, Will Deacon
  Cc: Leo Yan, Mike Leach, Kan Liang, Arnaldo Carvalho de Melo,
	Jiri Olsa, Adrian Hunter, Athira Rajeev, linux-perf-users,
	Thomas Richter, Ravi Bangoria, Namhyung Kim



On 22/06/2023 21:15, Ian Rogers wrote:
> On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
>>
>> Hello,
>>
>> In this development cycle, we've got a big change in how to handle
>> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
>> but caused some troubles on others.  So I'd like to confirm it works
>> on major platforms without problems.  Especially on ARM64 as it has
>> the big.LITTLE CPUs.
> 
> 
> Thanks Namhyung. I believe there is a change on ARM64 that makes a
> problem of one type into a slightly different problem. The perf parse
> events logic handles events certain known hardware events using known
> numeric encodings, for example, the cycles event has a type of 0
> (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
> 
> The perf_event_open syscall can specify a CPU number of -1 for any. My
> understanding is that -1 on ARM translates into choose the PMU of the
> current running thread, this means that if a thread migrates from
> between core/PMU types it won't be recorded on the alternate PMU type.
> 
> The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
> this, but appears not to be set on ARM's PMU:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
> Could an ARM64 related person fix this?
> 

Yes I was planning to look into this since this thread [1].
It seems like there are still some test failures on Juno which I'm
working through now, even after the fix in that thread. And I was
thinking that it should be possible to support
PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.

[1]: https://lore.kernel.org/all/ZIxza13x+AwApbQb@kernel.org/

> The perf tool is expecting it to be supported and so it tries to open
> the event for each core PMU, so the cycles event will open for each of
> a BIG.little PMU, however, as the extended type isn't supported the
> same event will be opened twice and the migration problem still
> exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
> all PERF_TYPE_HW_CACHE events (42).
> 
> Assuming the capability issue is fixed, there is still an open issue
> about how to properly support these 52 events on older kernels. It
> would be possible to use the architecture standard events, so cycles
> can map to cpu_cycles, instructions to sw_incr, cache-references to
> l1d_cache (probably), etc. I'm not sure this will work for all 52
> events and the plumbing in the perf tool isn't straightforward.
> There's also the question of whether such a change of events should
> happen when we know the perf_event_open will be given a CPU, should
> that case not wildcard PMUs?>
> I'm happy to provide more context/details.
> 
> Thanks,
> Ian
> 
>> Also we are transitioning from Arnaldo's tree to new repositories in
>> git.kernel.org:
>>
>>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
>>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
>>
>> There is a branch with the same name.  Please check out
>> perf-tools-next and play with it.
>>
>> Thanks,
>> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-23  8:42   ` James Clark
@ 2023-06-23 14:30     ` Ian Rogers
  2023-06-23 17:59       ` Ian Rogers
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Ian Rogers @ 2023-06-23 14:30 UTC (permalink / raw)
  To: James Clark
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim

On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
>
>
>
> On 22/06/2023 21:15, Ian Rogers wrote:
> > On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> In this development cycle, we've got a big change in how to handle
> >> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
> >> but caused some troubles on others.  So I'd like to confirm it works
> >> on major platforms without problems.  Especially on ARM64 as it has
> >> the big.LITTLE CPUs.
> >
> >
> > Thanks Namhyung. I believe there is a change on ARM64 that makes a
> > problem of one type into a slightly different problem. The perf parse
> > events logic handles events certain known hardware events using known
> > numeric encodings, for example, the cycles event has a type of 0
> > (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
> > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
> >
> > The perf_event_open syscall can specify a CPU number of -1 for any. My
> > understanding is that -1 on ARM translates into choose the PMU of the
> > current running thread, this means that if a thread migrates from
> > between core/PMU types it won't be recorded on the alternate PMU type.
> >
> > The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
> > this, but appears not to be set on ARM's PMU:
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
> > Could an ARM64 related person fix this?
> >
>
> Yes I was planning to look into this since this thread [1].
> It seems like there are still some test failures on Juno which I'm
> working through now, even after the fix in that thread. And I was
> thinking that it should be possible to support
> PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.

Thanks James, my memory of Juno boards was early bring up work and I
don't recall them having heterogeneous PMUs. Presumably this is
another issue, I'm happy to help if you can share more details.

It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
there is also the legacy kernel issue. Do you have a plan there or
should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
Linux 6.5 merge window will open soon I'm concerned that these fixes
may only make it into 6.6. Should we do anything short-term for 6.5?
For example, if extended types aren't supported then disable the
wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
and PERF_TYPE_HW_CACHE, cases?

Thanks,
Ian

> [1]: https://lore.kernel.org/all/ZIxza13x+AwApbQb@kernel.org/
>
> > The perf tool is expecting it to be supported and so it tries to open
> > the event for each core PMU, so the cycles event will open for each of
> > a BIG.little PMU, however, as the extended type isn't supported the
> > same event will be opened twice and the migration problem still
> > exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
> > all PERF_TYPE_HW_CACHE events (42).
> >
> > Assuming the capability issue is fixed, there is still an open issue
> > about how to properly support these 52 events on older kernels. It
> > would be possible to use the architecture standard events, so cycles
> > can map to cpu_cycles, instructions to sw_incr, cache-references to
> > l1d_cache (probably), etc. I'm not sure this will work for all 52
> > events and the plumbing in the perf tool isn't straightforward.
> > There's also the question of whether such a change of events should
> > happen when we know the perf_event_open will be given a CPU, should
> > that case not wildcard PMUs?>
> > I'm happy to provide more context/details.
> >
> > Thanks,
> > Ian
> >
> >> Also we are transitioning from Arnaldo's tree to new repositories in
> >> git.kernel.org:
> >>
> >>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
> >>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
> >>
> >> There is a branch with the same name.  Please check out
> >> perf-tools-next and play with it.
> >>
> >> Thanks,
> >> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-23 14:30     ` Ian Rogers
@ 2023-06-23 17:59       ` Ian Rogers
  2023-06-26 12:32       ` Will Deacon
  2023-06-29 11:13       ` James Clark
  2 siblings, 0 replies; 18+ messages in thread
From: Ian Rogers @ 2023-06-23 17:59 UTC (permalink / raw)
  To: James Clark
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim

On Fri, Jun 23, 2023 at 7:30 AM Ian Rogers <irogers@google.com> wrote:
>
> On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
> >
> >
> >
> > On 22/06/2023 21:15, Ian Rogers wrote:
> > > On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
> > >>
> > >> Hello,
> > >>
> > >> In this development cycle, we've got a big change in how to handle
> > >> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
> > >> but caused some troubles on others.  So I'd like to confirm it works
> > >> on major platforms without problems.  Especially on ARM64 as it has
> > >> the big.LITTLE CPUs.
> > >
> > >
> > > Thanks Namhyung. I believe there is a change on ARM64 that makes a
> > > problem of one type into a slightly different problem. The perf parse
> > > events logic handles events certain known hardware events using known
> > > numeric encodings, for example, the cycles event has a type of 0
> > > (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
> > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
> > >
> > > The perf_event_open syscall can specify a CPU number of -1 for any. My
> > > understanding is that -1 on ARM translates into choose the PMU of the
> > > current running thread, this means that if a thread migrates from
> > > between core/PMU types it won't be recorded on the alternate PMU type.
> > >
> > > The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
> > > this, but appears not to be set on ARM's PMU:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
> > > Could an ARM64 related person fix this?
> > >
> >
> > Yes I was planning to look into this since this thread [1].
> > It seems like there are still some test failures on Juno which I'm
> > working through now, even after the fix in that thread. And I was
> > thinking that it should be possible to support
> > PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.
>
> Thanks James, my memory of Juno boards was early bring up work and I
> don't recall them having heterogeneous PMUs. Presumably this is
> another issue, I'm happy to help if you can share more details.
>
> It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
> there is also the legacy kernel issue. Do you have a plan there or
> should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
> Linux 6.5 merge window will open soon I'm concerned that these fixes
> may only make it into 6.6. Should we do anything short-term for 6.5?
> For example, if extended types aren't supported then disable the
> wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
> and PERF_TYPE_HW_CACHE, cases?
>
> Thanks,
> Ian

Not a heterogeneous ARM problem, but a problem potentially added in
perf-tools-next is that we alter groups of events based on a "group
PMU name". This is necessary so we can correctly group events for
metrics where the events require groups. There's some special logic so
that software and aux events behave the correct way:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n2051

But ARM PMUs don't appear to set the auxtrace flag:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n78
which is done for Intel here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/arch/x86/util/pmu.c?h=perf-tools-next#n32

Thanks,
Ian

> > [1]: https://lore.kernel.org/all/ZIxza13x+AwApbQb@kernel.org/
> >
> > > The perf tool is expecting it to be supported and so it tries to open
> > > the event for each core PMU, so the cycles event will open for each of
> > > a BIG.little PMU, however, as the extended type isn't supported the
> > > same event will be opened twice and the migration problem still
> > > exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
> > > all PERF_TYPE_HW_CACHE events (42).
> > >
> > > Assuming the capability issue is fixed, there is still an open issue
> > > about how to properly support these 52 events on older kernels. It
> > > would be possible to use the architecture standard events, so cycles
> > > can map to cpu_cycles, instructions to sw_incr, cache-references to
> > > l1d_cache (probably), etc. I'm not sure this will work for all 52
> > > events and the plumbing in the perf tool isn't straightforward.
> > > There's also the question of whether such a change of events should
> > > happen when we know the perf_event_open will be given a CPU, should
> > > that case not wildcard PMUs?>
> > > I'm happy to provide more context/details.
> > >
> > > Thanks,
> > > Ian
> > >
> > >> Also we are transitioning from Arnaldo's tree to new repositories in
> > >> git.kernel.org:
> > >>
> > >>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
> > >>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
> > >>
> > >> There is a branch with the same name.  Please check out
> > >> perf-tools-next and play with it.
> > >>
> > >> Thanks,
> > >> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-23 14:30     ` Ian Rogers
  2023-06-23 17:59       ` Ian Rogers
@ 2023-06-26 12:32       ` Will Deacon
  2023-06-26 14:39         ` James Clark
  2023-06-29 11:13       ` James Clark
  2 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2023-06-26 12:32 UTC (permalink / raw)
  To: Ian Rogers
  Cc: James Clark, John Garry, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim, mark.rutland

On Fri, Jun 23, 2023 at 07:30:53AM -0700, Ian Rogers wrote:
> On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
> >
> >
> >
> > On 22/06/2023 21:15, Ian Rogers wrote:
> > > On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
> > >>
> > >> Hello,
> > >>
> > >> In this development cycle, we've got a big change in how to handle
> > >> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
> > >> but caused some troubles on others.  So I'd like to confirm it works
> > >> on major platforms without problems.  Especially on ARM64 as it has
> > >> the big.LITTLE CPUs.
> > >
> > >
> > > Thanks Namhyung. I believe there is a change on ARM64 that makes a
> > > problem of one type into a slightly different problem. The perf parse
> > > events logic handles events certain known hardware events using known
> > > numeric encodings, for example, the cycles event has a type of 0
> > > (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
> > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
> > >
> > > The perf_event_open syscall can specify a CPU number of -1 for any. My
> > > understanding is that -1 on ARM translates into choose the PMU of the
> > > current running thread, this means that if a thread migrates from
> > > between core/PMU types it won't be recorded on the alternate PMU type.
> > >
> > > The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
> > > this, but appears not to be set on ARM's PMU:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
> > > Could an ARM64 related person fix this?
> > >
> >
> > Yes I was planning to look into this since this thread [1].
> > It seems like there are still some test failures on Juno which I'm
> > working through now, even after the fix in that thread. And I was
> > thinking that it should be possible to support
> > PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.
> 
> Thanks James, my memory of Juno boards was early bring up work and I
> don't recall them having heterogeneous PMUs. Presumably this is
> another issue, I'm happy to help if you can share more details.
> 
> It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
> there is also the legacy kernel issue. Do you have a plan there or
> should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
> Linux 6.5 merge window will open soon I'm concerned that these fixes
> may only make it into 6.6. Should we do anything short-term for 6.5?
> For example, if extended types aren't supported then disable the
> wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
> and PERF_TYPE_HW_CACHE, cases?

What does PERF_PMU_CAP_EXTENDED_HW_TYPE buy us on Arm, given that we've
supported heterogeneous PMUs for years without issues? In other words,
what is the benefit of supporting two user ABIs here?

Will

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-26 12:32       ` Will Deacon
@ 2023-06-26 14:39         ` James Clark
  2023-06-26 15:09           ` Will Deacon
  0 siblings, 1 reply; 18+ messages in thread
From: James Clark @ 2023-06-26 14:39 UTC (permalink / raw)
  To: Will Deacon, Ian Rogers
  Cc: John Garry, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim, mark.rutland



On 26/06/2023 13:32, Will Deacon wrote:
> On Fri, Jun 23, 2023 at 07:30:53AM -0700, Ian Rogers wrote:
>> On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
>>>
>>>
>>>
>>> On 22/06/2023 21:15, Ian Rogers wrote:
>>>> On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> In this development cycle, we've got a big change in how to handle
>>>>> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
>>>>> but caused some troubles on others.  So I'd like to confirm it works
>>>>> on major platforms without problems.  Especially on ARM64 as it has
>>>>> the big.LITTLE CPUs.
>>>>
>>>>
>>>> Thanks Namhyung. I believe there is a change on ARM64 that makes a
>>>> problem of one type into a slightly different problem. The perf parse
>>>> events logic handles events certain known hardware events using known
>>>> numeric encodings, for example, the cycles event has a type of 0
>>>> (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
>>>>
>>>> The perf_event_open syscall can specify a CPU number of -1 for any. My
>>>> understanding is that -1 on ARM translates into choose the PMU of the
>>>> current running thread, this means that if a thread migrates from
>>>> between core/PMU types it won't be recorded on the alternate PMU type.
>>>>
>>>> The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
>>>> this, but appears not to be set on ARM's PMU:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
>>>> Could an ARM64 related person fix this?
>>>>
>>>
>>> Yes I was planning to look into this since this thread [1].
>>> It seems like there are still some test failures on Juno which I'm
>>> working through now, even after the fix in that thread. And I was
>>> thinking that it should be possible to support
>>> PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.
>>
>> Thanks James, my memory of Juno boards was early bring up work and I
>> don't recall them having heterogeneous PMUs. Presumably this is
>> another issue, I'm happy to help if you can share more details.
>>
>> It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
>> there is also the legacy kernel issue. Do you have a plan there or
>> should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
>> Linux 6.5 merge window will open soon I'm concerned that these fixes
>> may only make it into 6.6. Should we do anything short-term for 6.5?
>> For example, if extended types aren't supported then disable the
>> wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
>> and PERF_TYPE_HW_CACHE, cases?
> 
> What does PERF_PMU_CAP_EXTENDED_HW_TYPE buy us on Arm, given that we've
> supported heterogeneous PMUs for years without issues? In other words,
> what is the benefit of supporting two user ABIs here?
> 
> Will

Aren't there some issues around per-process perf sessions and choosing
which PMU to use?

For example on Juno if I run perf on one of the big cores (1 and 2),
then I only get an accurate cycle count if the process I'm interested in
also runs on a big core:

  taskset --cpu-list 1 perf stat -e cycles -- taskset \
   --cpu-list 1 stress -c 1 -t 1

  1087015046      cycles:u

If the stress process runs on a small core (4 for example) then I get a
much lower count, which I think is only accounting for the taskset
before it finishes pinning stress to the small core:

  taskset --cpu-list 1 perf stat -e cycles -- taskset \
    --cpu-list 4 stress -c 1 -t 1

  68134032      cycles:u

Then if perf starts on a small core it gets even worse, you never get
any cycle counts, even if stress runs on the same core:

  taskset --cpu-list 0 perf stat -e cycles -- taskset \
    --cpu-list 0 stress -c 1 -t 1

  <not counted>      cycles:u

I'm not sure how many of these are just the perf tool doing the wrong
thing and should be fixed in other ways than using
PERF_PMU_CAP_EXTENDED_HW_TYPE. But there is also something to be said
for consistency that if PERF_PMU_CAP_EXTENDED_HW_TYPE is already
understood by the tool as a way to use different PMU types then
supporting it on Arm might make sense.

James

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-26 14:39         ` James Clark
@ 2023-06-26 15:09           ` Will Deacon
  2023-06-26 16:11             ` Ian Rogers
  0 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2023-06-26 15:09 UTC (permalink / raw)
  To: James Clark
  Cc: Ian Rogers, John Garry, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim, mark.rutland

Hi James,

On Mon, Jun 26, 2023 at 03:39:00PM +0100, James Clark wrote:
> On 26/06/2023 13:32, Will Deacon wrote:
> > What does PERF_PMU_CAP_EXTENDED_HW_TYPE buy us on Arm, given that we've
> > supported heterogeneous PMUs for years without issues? In other words,
> > what is the benefit of supporting two user ABIs here?
> > 
> Aren't there some issues around per-process perf sessions and choosing
> which PMU to use?
> 
> For example on Juno if I run perf on one of the big cores (1 and 2),
> then I only get an accurate cycle count if the process I'm interested in
> also runs on a big core:
> 
>   taskset --cpu-list 1 perf stat -e cycles -- taskset \
>    --cpu-list 1 stress -c 1 -t 1
> 
>   1087015046      cycles:u
> 
> If the stress process runs on a small core (4 for example) then I get a
> much lower count, which I think is only accounting for the taskset
> before it finishes pinning stress to the small core:
> 
>   taskset --cpu-list 1 perf stat -e cycles -- taskset \
>     --cpu-list 4 stress -c 1 -t 1
> 
>   68134032      cycles:u
> 
> Then if perf starts on a small core it gets even worse, you never get
> any cycle counts, even if stress runs on the same core:
> 
>   taskset --cpu-list 0 perf stat -e cycles -- taskset \
>     --cpu-list 0 stress -c 1 -t 1
> 
>   <not counted>      cycles:u
> 
> I'm not sure how many of these are just the perf tool doing the wrong
> thing and should be fixed in other ways than using
> PERF_PMU_CAP_EXTENDED_HW_TYPE. But there is also something to be said
> for consistency that if PERF_PMU_CAP_EXTENDED_HW_TYPE is already
> understood by the tool as a way to use different PMU types then
> supporting it on Arm might make sense.

I think the above is just an example of how the ABIs differ and doesn't
reflect a limitation of either, no? In other words, if you drive it wrong,
then yes, you get the wrong result. That's the same with any heterogeneous
system. I agree that consistency is great, but we've been exposing
heterogeneous PMUs on arm for what, a decade now? It feels a little
disingenuous to say we should be consistent with something that's this
hot off the press.

Anyway, my main question is still largely unanswered: what does the new
ABI gain us over what we have? Is it documented somewhere? For example,
how does it handle things like:

  - Different number of hardware counters across CPU types
  - An event that might not be supported across different CPU types
  - Events that might collide across CPU types (e.g. raw events that have
    the same encoding but count something different)

From what I can tell, the main difference between the interfaces seems to
be the level of resourcing that Intel and Arm have each put into the
tooling, and I don't think that's a particularly compelling reason to
take on the burden of maintaining both in the kernel. However, if there
are functional benefits with PERF_PMU_CAP_EXTENDED_HW_TYPE over what we've
been doing them I'm keen to hear about them.

Will

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-26 15:09           ` Will Deacon
@ 2023-06-26 16:11             ` Ian Rogers
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Rogers @ 2023-06-26 16:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: James Clark, John Garry, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim, mark.rutland

On Mon, Jun 26, 2023 at 8:09 AM Will Deacon <will@kernel.org> wrote:
>
> Hi James,
>
> On Mon, Jun 26, 2023 at 03:39:00PM +0100, James Clark wrote:
> > On 26/06/2023 13:32, Will Deacon wrote:
> > > What does PERF_PMU_CAP_EXTENDED_HW_TYPE buy us on Arm, given that we've
> > > supported heterogeneous PMUs for years without issues? In other words,
> > > what is the benefit of supporting two user ABIs here?
> > >
> > Aren't there some issues around per-process perf sessions and choosing
> > which PMU to use?
> >
> > For example on Juno if I run perf on one of the big cores (1 and 2),
> > then I only get an accurate cycle count if the process I'm interested in
> > also runs on a big core:
> >
> >   taskset --cpu-list 1 perf stat -e cycles -- taskset \
> >    --cpu-list 1 stress -c 1 -t 1
> >
> >   1087015046      cycles:u
> >
> > If the stress process runs on a small core (4 for example) then I get a
> > much lower count, which I think is only accounting for the taskset
> > before it finishes pinning stress to the small core:
> >
> >   taskset --cpu-list 1 perf stat -e cycles -- taskset \
> >     --cpu-list 4 stress -c 1 -t 1
> >
> >   68134032      cycles:u
> >
> > Then if perf starts on a small core it gets even worse, you never get
> > any cycle counts, even if stress runs on the same core:
> >
> >   taskset --cpu-list 0 perf stat -e cycles -- taskset \
> >     --cpu-list 0 stress -c 1 -t 1
> >
> >   <not counted>      cycles:u
> >
> > I'm not sure how many of these are just the perf tool doing the wrong
> > thing and should be fixed in other ways than using
> > PERF_PMU_CAP_EXTENDED_HW_TYPE. But there is also something to be said
> > for consistency that if PERF_PMU_CAP_EXTENDED_HW_TYPE is already
> > understood by the tool as a way to use different PMU types then
> > supporting it on Arm might make sense.
>
> I think the above is just an example of how the ABIs differ and doesn't
> reflect a limitation of either, no? In other words, if you drive it wrong,
> then yes, you get the wrong result. That's the same with any heterogeneous
> system. I agree that consistency is great, but we've been exposing
> heterogeneous PMUs on arm for what, a decade now? It feels a little
> disingenuous to say we should be consistent with something that's this
> hot off the press.

55bcf6ef314a perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
Date:   Mon Apr 12 07:31:01 2021 -0700
https://lore.kernel.org/all/1618237865-33448-22-git-send-email-kan.liang@linux.intel.com/

> Anyway, my main question is still largely unanswered: what does the new
> ABI gain us over what we have? Is it documented somewhere?

I'm not sure why you see this as a new ABI? There are 32-bits in the
perf_event_attr config that now hold the extended type information
which were unused and erroneous if used previously. These bits just
clear up ambiguity for the legacy event types about which PMU is being
requested when perf events are associated with a task. There are IIRC
62 legacy hardware/hardware-cache events with hard coded PMU type
numbers. The perf tool will use these in preference to other events
and are used when no events/metrics are requested of the perf tool.
The extended type is described in the uapi perf_event.h, it probably
also makes sense to add man page documentation.

> For example,
> how does it handle things like:
>
>   - Different number of hardware counters across CPU types

This is unchanged. If you open a legacy say "cycles" event on CPU0 and
then use the raw/dynamic type to open other events, those events will
go into a perf hardware context and can multiplex when the counters
are all used.

>   - An event that might not be supported across different CPU types

This happens but is only detected for the hardware-cache events. The
ARM PMUs are already handling this, they just are unable to handle it
in per-thread mode.

>   - Events that might collide across CPU types (e.g. raw events that have
>     the same encoding but count something different)

This happens with homogeneous PMUs too. There's nothing special in the
kernel/tool to merge such events, if you use enough of them you'll get
multiplexing. It is a case of buyer beware.

> From what I can tell, the main difference between the interfaces seems to
> be the level of resourcing that Intel and Arm have each put into the
> tooling, and I don't think that's a particularly compelling reason to
> take on the burden of maintaining both in the kernel. However, if there
> are functional benefits with PERF_PMU_CAP_EXTENDED_HW_TYPE over what we've
> been doing them I'm keen to hear about them.

The benefits are that currently the perf tool is broken* for ARM
heterogeneous PMUs. Specifically, perf will open events "instructions"
or "cycles" for a benchmark like:

$ perf stat -e cycles benchmark

and the cycles event will be opened for the PMU of the current CPU the
perf tool happens to be running on. It won't be opened on the other
PMU. This means perf is only counting some fraction of the event.

At this point adding the capability to the PMU has been in the kernel
for over 2 years. On the user side Intel hardcoded support for their
CPUs and I recently tried to make this into something generic.
Unfortunately in making this generic I've discovered the extended type
bit was never added to ARM PMUs. This means that rather than open a
single event the perf tool is opening say the cycles event twice, but
because the extended type information causes errors on ARM we don't
encode that yielding the same event opened twice on the PMU of the CPU
that perf happens to be running on. I believe users with heterogeneous
ARM PMUs aware of the problem have been avoiding the legacy events, is
this something you are expecting the tool to do too? For example:

$ perf record benchmark

Will open the cycles event for the threads of benchmark. With extended
type information we open a cycles event on each PMU, but without it
we're only sampling on a fraction of the PMUs. Should the tool turn
cycles on ARM into an ARM architecture standard event? Which event and
who will write/test the mapping code? The problem is finite (there are
only 62 events) and there are no time machines, perhaps if there were
we wouldn't have the legacy events at all. The PMU driver change looks
about as small as it can be, adding a bit to the capabilities of the
PMU. I have no way to test such a change and it doesn't resolve the
problem for older kernels on the perf tool.

Thanks,
Ian

* I don't have an ARM system with heterogeneous PMUs to confirm this,
I'm inferring from the code and some tests Arnaldo ran on a system of
his.

> Will

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-23 14:30     ` Ian Rogers
  2023-06-23 17:59       ` Ian Rogers
  2023-06-26 12:32       ` Will Deacon
@ 2023-06-29 11:13       ` James Clark
  2023-06-30 15:30         ` Ian Rogers
  2 siblings, 1 reply; 18+ messages in thread
From: James Clark @ 2023-06-29 11:13 UTC (permalink / raw)
  To: Ian Rogers
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim



On 23/06/2023 15:30, Ian Rogers wrote:
> On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
>>
>>
>>
>> On 22/06/2023 21:15, Ian Rogers wrote:
>>> On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> In this development cycle, we've got a big change in how to handle
>>>> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
>>>> but caused some troubles on others.  So I'd like to confirm it works
>>>> on major platforms without problems.  Especially on ARM64 as it has
>>>> the big.LITTLE CPUs.
>>>
>>>
>>> Thanks Namhyung. I believe there is a change on ARM64 that makes a
>>> problem of one type into a slightly different problem. The perf parse
>>> events logic handles events certain known hardware events using known
>>> numeric encodings, for example, the cycles event has a type of 0
>>> (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
>>>
>>> The perf_event_open syscall can specify a CPU number of -1 for any. My
>>> understanding is that -1 on ARM translates into choose the PMU of the
>>> current running thread, this means that if a thread migrates from
>>> between core/PMU types it won't be recorded on the alternate PMU type.
>>>
>>> The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
>>> this, but appears not to be set on ARM's PMU:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
>>> Could an ARM64 related person fix this?
>>>
>>
>> Yes I was planning to look into this since this thread [1].
>> It seems like there are still some test failures on Juno which I'm
>> working through now, even after the fix in that thread. And I was
>> thinking that it should be possible to support
>> PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.
> 
> Thanks James, my memory of Juno boards was early bring up work and I
> don't recall them having heterogeneous PMUs. Presumably this is

Juno does seem to report two PMU types:

   ls /sys/bus/event_source/devices/
   ...
   armv8_cortex_a53  armv8_cortex_a57
   ...

> another issue, I'm happy to help if you can share more details.

So it looks like the failure starts at commit 8bc75f699c14 ("perf
parse-events: Support wildcards on raw events"). test__checkevent_raw()
is looking for PERF_TYPE_RAW at this point, but the commit has changed
the opened event to be whatever PMU it found (type 8 on n1sdp for
example) instead of PERF_TYPE_RAW (4) which causes the test to fail.

Then looking at the latest perf-tools-next (ad5f604e186a), now the event
opened goes back to PERF_TYPE_RAW again, but test__checkevent_raw() is
testing for a real PMU type. None of the PMUs have raw type so the test
still fails.

This is all on N1SDP which only has one PMU type so
perf_pmus__supports_extended_type() returns false.

On Juno it's similar, but perf_pmus__supports_extended_type() isn't
called for the raw test for some reason. It doesn't get hit until the
"instructions" and test__checkevent_symbolic_name() test later. I saw
this because I was looking into Arnaldo's suggestion to hard code a
return false for arm there. But it doesn't fix the test failures for
either platform.

The failures do look slightly different depending on whether there are
heterogeneous PMUs or not, so maybe there are two problems to fix. Here
are the full outputs of the test from both platforms. But I can't really
think what the fix should be (assuming we should first fix it without
looking at PERF_PMU_CAP_EXTENDED_HW_TYPE yet).

N1SDP (only the raw test fails):

 $ ls /sys/bus/event_source/devices/
 ...
 armv8_pmuv3_0
 ...

 $ ./perf test "Test event parsing" -vvv
     6: Parse event definition strings                                  :
    6.1: Test event parsing                                            :
  --- start ---
  test child forked, pid 21636
  running test 0 'syscalls:sys_enter_openat'
  Using CPUID 0x00000000410fd0c0
  running test 1 'syscalls:*'
  running test 2 'r1a'
  FAILED tests/parse-events.c:123 No PMU found for type
  Event test failure: test 2 'r1a'running test 3 '1:1'
  running test 4 'instructions'
  running test 5 'cycles/period=100000,config2/'
  running test 6 'faults'
  running test 7 'L1-dcache-load-miss'
  running test 8 'mem:0'
  running test 9 'mem:0:x'
  running test 10 'mem:0:r'
  running test 11 'mem:0:w'
  running test 12 'syscalls:sys_enter_openat:k'
  running test 13 'syscalls:*:u'
  running test 14 'r1a:kp'
  FAILED tests/parse-events.c:123 No PMU found for type
  Event test failure: test 14 'r1a:kp'running test 15 '1:1:hp'
  running test 16 'instructions:h'
  running test 17 'faults:u'
  running test 18 'L1-dcache-load-miss:kp'
  running test 19 'mem:0:u'
  running test 20 'mem:0:x:k'
  running test 21 'mem:0:r:hp'
  running test 22 'mem:0:w:up'
  running test 23 'r1,syscalls:sys_enter_openat:k,1:1:hp'
  running test 24 'instructions:G'
  running test 25 'instructions:H'
  running test 26 'mem:0:rw'
  running test 27 'mem:0:rw:kp'
  running test 28 '{instructions:k,cycles:upp}'
  running test 29 '{faults:k,cache-references}:u,cycles:k'
  running test 30
'group1{syscalls:sys_enter_openat:H,cycles:kppp},group2{cycles,1:3}:G,instructions:u'
  running test 31 '{cycles:u,instructions:kp}:p'
  running test 32 '{cycles,instructions}:G,{cycles:G,instructions:G},cycles'
  running test 33 '*:*'
  running test 34 '{cycles,cache-misses:G}:H'
  running test 35 '{cycles,cache-misses:H}:G'
  running test 36 '{cycles:G,cache-misses:H}:u'
  running test 37 '{cycles:G,cache-misses:H}:uG'
  running test 38 '{cycles,cache-misses,branch-misses}:S'
  running test 39 '{instructions,branch-misses}:Su'
  running test 40 'instructions:uDp'
  running test 41 '{cycles,cache-misses,branch-misses}:D'
  running test 42 'mem:0/1'
  running test 43 'mem:0/2:w'
  running test 44 'mem:0/4:rw:u'
  running test 45 'instructions:I'
  running test 46 'instructions:kIG'
  running test 47 'task-clock:P,cycles'
  running test 48 'instructions/name=insn/'
  running test 49 'r1234/name=rawpmu/'
  running test 50 '4:0x6530160/name=numpmu/'
  running test 51 'L1-dcache-misses/name=cachepmu/'
  running test 52 'intel_pt//u'
  ... SKIP
  running test 53
'cycles/name='COMPLEX_CYCLES_NAME:orig=cycles,desc=chip-clock-ticks'/Duk'
  running test 54 'cycles//u'
  running test 55 'cycles:k'
  running test 56 'instructions:uep'
  running test 57 '{cycles,cache-misses,branch-misses}:e'
  running test 58 'cycles/name=name/'
  running test 59 'cycles/name=l1d/'
  running test 60 'mem:0/name=breakpoint/'
  running test 61 'mem:0:x/name=breakpoint/'
  running test 62 'mem:0:r/name=breakpoint/'
  running test 63 'mem:0:w/name=breakpoint/'
  running test 64 'mem:0/name=breakpoint/u'
  running test 65 'mem:0:x/name=breakpoint/k'
  running test 66 'mem:0:r/name=breakpoint/hp'
  running test 67 'mem:0:w/name=breakpoint/up'
  running test 68 'mem:0:rw/name=breakpoint/'
  running test 69 'mem:0:rw/name=breakpoint/kp'
  running test 70 'mem:0/1/name=breakpoint/'
  running test 71 'mem:0/2:w/name=breakpoint/'
  running test 72 'mem:0/4:rw/name=breakpoint/u'
  running test 73 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'


Juno:

  (/sys/bus/event_source/devices/ output above)

      3: Parse event definition strings                                  :
    3.1: Test event parsing                                            :
  --- start ---
  running test 0 'r1a'
  Using CPUID 0x00000000410fd070
  FAILED tests/parse-events.c:123 No PMU found for type
  Event test failure: test 0 'r1a'running test 1 '1:1'
  running test 2 'instructions'
  Opening: unknown-hardware:HG
  ------------------------------------------------------------
  perf_event_attr:
    type                             0 (PERF_TYPE_HARDWARE)
    config                           0x800000000
    disabled                         1
  ------------------------------------------------------------
  sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
  sys_perf_event_open failed, error -2
  running test 3 'cycles/period=100000,config2/'
  running test 4 'faults'
  running test 5 'L1-dcache-load-miss'
  running test 6 'mem:0'
  running test 7 'mem:0:x'
  running test 8 'mem:0:r'
  running test 9 'mem:0:w'
  running test 10 'r1a:kp'
  FAILED tests/parse-events.c:123 No PMU found for type
  Event test failure: test 10 'r1a:kp'running test 11 '1:1:hp'
  running test 12 'instructions:h'
  FAILED tests/parse-events.c:329 wrong number of entries
  Event test failure: test 12 'instructions:h'running test 13 'faults:u'
  running test 14 'L1-dcache-load-miss:kp'
  running test 15 'mem:0:u'
  running test 16 'mem:0:x:k'
  running test 17 'mem:0:r:hp'
  running test 18 'mem:0:w:up'
  running test 19 'instructions:G'
  running test 20 'instructions:H'
  running test 21 'mem:0:rw'
  running test 22 'mem:0:rw:kp'
  running test 23 '{instructions:k,cycles:upp}'
  FAILED tests/parse-events.c:832 wrong number of entries
  Event test failure: test 23 '{instructions:k,cycles:upp}'running test
24 '{faults:k,cache-references}:u,cycles:k'
  FAILED tests/parse-events.c:875 wrong number of entries
  Event test failure: test 24
'{faults:k,cache-references}:u,cycles:k'running test 25
'{cycles:u,instructions:kp}:p'
  FAILED tests/parse-events.c:1041 wrong number of entries
  Event test failure: test 25 '{cycles:u,instructions:kp}:p'running test
26 '{cycles,instructions}:G,{cycles:G,instructions:G},cycles'
  FAILED tests/parse-events.c:1086 wrong number of entries
  Event test failure: test 26
'{cycles,instructions}:G,{cycles:G,instructions:G},cycles'running test
27 '{cycles,cache-misses:G}:H'
  FAILED tests/parse-events.c:1172 wrong number of entries
  Event test failure: test 27 '{cycles,cache-misses:G}:H'running test 28
'{cycles,cache-misses:H}:G'
  FAILED tests/parse-events.c:1213 wrong number of entries
  Event test failure: test 28 '{cycles,cache-misses:H}:G'running test 29
'{cycles:G,cache-misses:H}:u'
  FAILED tests/parse-events.c:1254 wrong number of entries
  Event test failure: test 29 '{cycles:G,cache-misses:H}:u'running test
30 '{cycles:G,cache-misses:H}:uG'
  FAILED tests/parse-events.c:1295 wrong number of entries
  Event test failure: test 30 '{cycles:G,cache-misses:H}:uG'running test
31 '{cycles,cache-misses,branch-misses}:S'
  FAILED tests/parse-events.c:1336 wrong number of entries
  Event test failure: test 31
'{cycles,cache-misses,branch-misses}:S'running test 32
'{instructions,branch-misses}:Su'
  FAILED tests/parse-events.c:1388 wrong number of entries
  Event test failure: test 32 '{instructions,branch-misses}:Su'running
test 33 'instructions:uDp'
  FAILED tests/parse-events.c:1427 wrong number of entries
  Event test failure: test 33 'instructions:uDp'running test 34
'{cycles,cache-misses,branch-misses}:D'
  FAILED tests/parse-events.c:1445 wrong number of entries
  Event test failure: test 34
'{cycles,cache-misses,branch-misses}:D'running test 35 'mem:0/1'
  running test 36 'mem:0/2:w'
  running test 37 'mem:0/4:rw:u'
  running test 38 'instructions:I'
  running test 39 'instructions:kIG'
  running test 40 'task-clock:P,cycles'
  FAILED tests/parse-events.c:1564 wrong number of entries
  Event test failure: test 40 'task-clock:P,cycles'running test 41
'instructions/name=insn/'
  running test 42 'r1234/name=rawpmu/'
  running test 43 '4:0x6530160/name=numpmu/'
  running test 44 'L1-dcache-misses/name=cachepmu/'
  running test 45 'intel_pt//u'
  ... SKIP
  running test 46
'cycles/name='COMPLEX_CYCLES_NAME:orig=cycles,desc=chip-clock-ticks'/Duk'
  running test 47 'cycles//u'
  running test 48 'cycles:k'
  running test 49 'instructions:uep'
  running test 50 '{cycles,cache-misses,branch-misses}:e'
  FAILED tests/parse-events.c:1490 wrong number of entries
  Event test failure: test 50
'{cycles,cache-misses,branch-misses}:e'running test 51 'cycles/name=name/'
  running test 52 'cycles/name=l1d/'
  running test 53 'mem:0/name=breakpoint/'
  running test 54 'mem:0:x/name=breakpoint/'
  running test 55 'mem:0:r/name=breakpoint/'
  running test 56 'mem:0:w/name=breakpoint/'
  running test 57 'mem:0/name=breakpoint/u'
  running test 58 'mem:0:x/name=breakpoint/k'
  running test 59 'mem:0:r/name=breakpoint/hp'
  running test 60 'mem:0:w/name=breakpoint/up'
  running test 61 'mem:0:rw/name=breakpoint/'
  running test 62 'mem:0:rw/name=breakpoint/kp'
  running test 63 'mem:0/1/name=breakpoint/'
  running test 64 'mem:0/2:w/name=breakpoint/'
  running test 65 'mem:0/4:rw/name=breakpoint/u'
  running test 66 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'


> 
> It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
> there is also the legacy kernel issue. Do you have a plan there or
> should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
> Linux 6.5 merge window will open soon I'm concerned that these fixes
> may only make it into 6.6. Should we do anything short-term for 6.5?
> For example, if extended types aren't supported then disable the
> wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
> and PERF_TYPE_HW_CACHE, cases?
> 
> Thanks,
> Ian
> 
>> [1]: https://lore.kernel.org/all/ZIxza13x+AwApbQb@kernel.org/
>>
>>> The perf tool is expecting it to be supported and so it tries to open
>>> the event for each core PMU, so the cycles event will open for each of
>>> a BIG.little PMU, however, as the extended type isn't supported the
>>> same event will be opened twice and the migration problem still
>>> exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
>>> all PERF_TYPE_HW_CACHE events (42).
>>>
>>> Assuming the capability issue is fixed, there is still an open issue
>>> about how to properly support these 52 events on older kernels. It
>>> would be possible to use the architecture standard events, so cycles
>>> can map to cpu_cycles, instructions to sw_incr, cache-references to
>>> l1d_cache (probably), etc. I'm not sure this will work for all 52
>>> events and the plumbing in the perf tool isn't straightforward.
>>> There's also the question of whether such a change of events should
>>> happen when we know the perf_event_open will be given a CPU, should
>>> that case not wildcard PMUs?>
>>> I'm happy to provide more context/details.
>>>
>>> Thanks,
>>> Ian
>>>
>>>> Also we are transitioning from Arnaldo's tree to new repositories in
>>>> git.kernel.org:
>>>>
>>>>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
>>>>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
>>>>
>>>> There is a branch with the same name.  Please check out
>>>> perf-tools-next and play with it.
>>>>
>>>> Thanks,
>>>> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-29 11:13       ` James Clark
@ 2023-06-30 15:30         ` Ian Rogers
  2023-06-30 16:14           ` James Clark
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Rogers @ 2023-06-30 15:30 UTC (permalink / raw)
  To: James Clark
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim

On Thu, Jun 29, 2023 at 4:14 AM James Clark <james.clark@arm.com> wrote:
>
>
>
> On 23/06/2023 15:30, Ian Rogers wrote:
> > On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
> >>
> >>
> >>
> >> On 22/06/2023 21:15, Ian Rogers wrote:
> >>> On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> In this development cycle, we've got a big change in how to handle
> >>>> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
> >>>> but caused some troubles on others.  So I'd like to confirm it works
> >>>> on major platforms without problems.  Especially on ARM64 as it has
> >>>> the big.LITTLE CPUs.
> >>>
> >>>
> >>> Thanks Namhyung. I believe there is a change on ARM64 that makes a
> >>> problem of one type into a slightly different problem. The perf parse
> >>> events logic handles events certain known hardware events using known
> >>> numeric encodings, for example, the cycles event has a type of 0
> >>> (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
> >>>
> >>> The perf_event_open syscall can specify a CPU number of -1 for any. My
> >>> understanding is that -1 on ARM translates into choose the PMU of the
> >>> current running thread, this means that if a thread migrates from
> >>> between core/PMU types it won't be recorded on the alternate PMU type.
> >>>
> >>> The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
> >>> this, but appears not to be set on ARM's PMU:
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
> >>> Could an ARM64 related person fix this?
> >>>
> >>
> >> Yes I was planning to look into this since this thread [1].
> >> It seems like there are still some test failures on Juno which I'm
> >> working through now, even after the fix in that thread. And I was
> >> thinking that it should be possible to support
> >> PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.
> >
> > Thanks James, my memory of Juno boards was early bring up work and I
> > don't recall them having heterogeneous PMUs. Presumably this is
>
> Juno does seem to report two PMU types:
>
>    ls /sys/bus/event_source/devices/
>    ...
>    armv8_cortex_a53  armv8_cortex_a57
>    ...
>
> > another issue, I'm happy to help if you can share more details.
>
> So it looks like the failure starts at commit 8bc75f699c14 ("perf
> parse-events: Support wildcards on raw events"). test__checkevent_raw()
> is looking for PERF_TYPE_RAW at this point, but the commit has changed
> the opened event to be whatever PMU it found (type 8 on n1sdp for
> example) instead of PERF_TYPE_RAW (4) which causes the test to fail.
>
> Then looking at the latest perf-tools-next (ad5f604e186a), now the event
> opened goes back to PERF_TYPE_RAW again, but test__checkevent_raw() is
> testing for a real PMU type. None of the PMUs have raw type so the test
> still fails.
>
> This is all on N1SDP which only has one PMU type so
> perf_pmus__supports_extended_type() returns false.
>
> On Juno it's similar, but perf_pmus__supports_extended_type() isn't
> called for the raw test for some reason. It doesn't get hit until the
> "instructions" and test__checkevent_symbolic_name() test later. I saw
> this because I was looking into Arnaldo's suggestion to hard code a
> return false for arm there. But it doesn't fix the test failures for
> either platform.
>
> The failures do look slightly different depending on whether there are
> heterogeneous PMUs or not, so maybe there are two problems to fix. Here
> are the full outputs of the test from both platforms. But I can't really
> think what the fix should be (assuming we should first fix it without
> looking at PERF_PMU_CAP_EXTENDED_HW_TYPE yet).
>
> N1SDP (only the raw test fails):
>
>  $ ls /sys/bus/event_source/devices/
>  ...
>  armv8_pmuv3_0
>  ...
>
>  $ ./perf test "Test event parsing" -vvv
>      6: Parse event definition strings                                  :
>     6.1: Test event parsing                                            :
>   --- start ---
>   test child forked, pid 21636
>   running test 0 'syscalls:sys_enter_openat'
>   Using CPUID 0x00000000410fd0c0
>   running test 1 'syscalls:*'
>   running test 2 'r1a'
>   FAILED tests/parse-events.c:123 No PMU found for type
>   Event test failure: test 2 'r1a'running test 3 '1:1'
>   running test 4 'instructions'
>   running test 5 'cycles/period=100000,config2/'
>   running test 6 'faults'
>   running test 7 'L1-dcache-load-miss'
>   running test 8 'mem:0'
>   running test 9 'mem:0:x'
>   running test 10 'mem:0:r'
>   running test 11 'mem:0:w'
>   running test 12 'syscalls:sys_enter_openat:k'
>   running test 13 'syscalls:*:u'
>   running test 14 'r1a:kp'
>   FAILED tests/parse-events.c:123 No PMU found for type
>   Event test failure: test 14 'r1a:kp'running test 15 '1:1:hp'
>   running test 16 'instructions:h'
>   running test 17 'faults:u'
>   running test 18 'L1-dcache-load-miss:kp'
>   running test 19 'mem:0:u'
>   running test 20 'mem:0:x:k'
>   running test 21 'mem:0:r:hp'
>   running test 22 'mem:0:w:up'
>   running test 23 'r1,syscalls:sys_enter_openat:k,1:1:hp'
>   running test 24 'instructions:G'
>   running test 25 'instructions:H'
>   running test 26 'mem:0:rw'
>   running test 27 'mem:0:rw:kp'
>   running test 28 '{instructions:k,cycles:upp}'
>   running test 29 '{faults:k,cache-references}:u,cycles:k'
>   running test 30
> 'group1{syscalls:sys_enter_openat:H,cycles:kppp},group2{cycles,1:3}:G,instructions:u'
>   running test 31 '{cycles:u,instructions:kp}:p'
>   running test 32 '{cycles,instructions}:G,{cycles:G,instructions:G},cycles'
>   running test 33 '*:*'
>   running test 34 '{cycles,cache-misses:G}:H'
>   running test 35 '{cycles,cache-misses:H}:G'
>   running test 36 '{cycles:G,cache-misses:H}:u'
>   running test 37 '{cycles:G,cache-misses:H}:uG'
>   running test 38 '{cycles,cache-misses,branch-misses}:S'
>   running test 39 '{instructions,branch-misses}:Su'
>   running test 40 'instructions:uDp'
>   running test 41 '{cycles,cache-misses,branch-misses}:D'
>   running test 42 'mem:0/1'
>   running test 43 'mem:0/2:w'
>   running test 44 'mem:0/4:rw:u'
>   running test 45 'instructions:I'
>   running test 46 'instructions:kIG'
>   running test 47 'task-clock:P,cycles'
>   running test 48 'instructions/name=insn/'
>   running test 49 'r1234/name=rawpmu/'
>   running test 50 '4:0x6530160/name=numpmu/'
>   running test 51 'L1-dcache-misses/name=cachepmu/'
>   running test 52 'intel_pt//u'
>   ... SKIP
>   running test 53
> 'cycles/name='COMPLEX_CYCLES_NAME:orig=cycles,desc=chip-clock-ticks'/Duk'
>   running test 54 'cycles//u'
>   running test 55 'cycles:k'
>   running test 56 'instructions:uep'
>   running test 57 '{cycles,cache-misses,branch-misses}:e'
>   running test 58 'cycles/name=name/'
>   running test 59 'cycles/name=l1d/'
>   running test 60 'mem:0/name=breakpoint/'
>   running test 61 'mem:0:x/name=breakpoint/'
>   running test 62 'mem:0:r/name=breakpoint/'
>   running test 63 'mem:0:w/name=breakpoint/'
>   running test 64 'mem:0/name=breakpoint/u'
>   running test 65 'mem:0:x/name=breakpoint/k'
>   running test 66 'mem:0:r/name=breakpoint/hp'
>   running test 67 'mem:0:w/name=breakpoint/up'
>   running test 68 'mem:0:rw/name=breakpoint/'
>   running test 69 'mem:0:rw/name=breakpoint/kp'
>   running test 70 'mem:0/1/name=breakpoint/'
>   running test 71 'mem:0/2:w/name=breakpoint/'
>   running test 72 'mem:0/4:rw/name=breakpoint/u'
>   running test 73 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'

I saw the raw encoding issue on my Raspberry Pi. I also saw a lot of
libtraceevent errors. This is concerning me and we need to get a fix
for 6.5.

> Juno:
>
>   (/sys/bus/event_source/devices/ output above)
>
>       3: Parse event definition strings                                  :
>     3.1: Test event parsing                                            :
>   --- start ---
>   running test 0 'r1a'
>   Using CPUID 0x00000000410fd070
>   FAILED tests/parse-events.c:123 No PMU found for type
>   Event test failure: test 0 'r1a'running test 1 '1:1'
>   running test 2 'instructions'
>   Opening: unknown-hardware:HG
>   ------------------------------------------------------------
>   perf_event_attr:
>     type                             0 (PERF_TYPE_HARDWARE)
>     config                           0x800000000
>     disabled                         1
>   ------------------------------------------------------------

Here the config has the extended type in it but because ARM PMU's
heterogeneous PMUs aren't  supporting that we get divergence. I think
this is a legitimate failure until ARM's PMU driver is fixed to enable
the extended type information. We can make the test robust to this,
but it is probably a good idea to red flag the ARM PMU being wrong.

>   sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
>   sys_perf_event_open failed, error -2
>   running test 3 'cycles/period=100000,config2/'
>   running test 4 'faults'
>   running test 5 'L1-dcache-load-miss'
>   running test 6 'mem:0'
>   running test 7 'mem:0:x'
>   running test 8 'mem:0:r'
>   running test 9 'mem:0:w'
>   running test 10 'r1a:kp'
>   FAILED tests/parse-events.c:123 No PMU found for type
>   Event test failure: test 10 'r1a:kp'running test 11 '1:1:hp'
>   running test 12 'instructions:h'
>   FAILED tests/parse-events.c:329 wrong number of entries
>   Event test failure: test 12 'instructions:h'running test 13 'faults:u'
>   running test 14 'L1-dcache-load-miss:kp'
>   running test 15 'mem:0:u'
>   running test 16 'mem:0:x:k'
>   running test 17 'mem:0:r:hp'
>   running test 18 'mem:0:w:up'
>   running test 19 'instructions:G'
>   running test 20 'instructions:H'
>   running test 21 'mem:0:rw'
>   running test 22 'mem:0:rw:kp'
>   running test 23 '{instructions:k,cycles:upp}'
>   FAILED tests/parse-events.c:832 wrong number of entries
>   Event test failure: test 23 '{instructions:k,cycles:upp}'running test

These failures will relate to not opening enough events, there should
be one instructions and cycles event/evsel (both legacy types) per
PMU. Similarly below. All these failures will relate to the extended
type divergence, could this be a priority for someone, ARM, Linaro..
to look at? I'm happy to send emails, be on a call, etc.

Thanks,
Ian

> 24 '{faults:k,cache-references}:u,cycles:k'
>   FAILED tests/parse-events.c:875 wrong number of entries
>   Event test failure: test 24
> '{faults:k,cache-references}:u,cycles:k'running test 25
> '{cycles:u,instructions:kp}:p'
>   FAILED tests/parse-events.c:1041 wrong number of entries
>   Event test failure: test 25 '{cycles:u,instructions:kp}:p'running test
> 26 '{cycles,instructions}:G,{cycles:G,instructions:G},cycles'
>   FAILED tests/parse-events.c:1086 wrong number of entries
>   Event test failure: test 26
> '{cycles,instructions}:G,{cycles:G,instructions:G},cycles'running test
> 27 '{cycles,cache-misses:G}:H'
>   FAILED tests/parse-events.c:1172 wrong number of entries
>   Event test failure: test 27 '{cycles,cache-misses:G}:H'running test 28
> '{cycles,cache-misses:H}:G'
>   FAILED tests/parse-events.c:1213 wrong number of entries
>   Event test failure: test 28 '{cycles,cache-misses:H}:G'running test 29
> '{cycles:G,cache-misses:H}:u'
>   FAILED tests/parse-events.c:1254 wrong number of entries
>   Event test failure: test 29 '{cycles:G,cache-misses:H}:u'running test
> 30 '{cycles:G,cache-misses:H}:uG'
>   FAILED tests/parse-events.c:1295 wrong number of entries
>   Event test failure: test 30 '{cycles:G,cache-misses:H}:uG'running test
> 31 '{cycles,cache-misses,branch-misses}:S'
>   FAILED tests/parse-events.c:1336 wrong number of entries
>   Event test failure: test 31
> '{cycles,cache-misses,branch-misses}:S'running test 32
> '{instructions,branch-misses}:Su'
>   FAILED tests/parse-events.c:1388 wrong number of entries
>   Event test failure: test 32 '{instructions,branch-misses}:Su'running
> test 33 'instructions:uDp'
>   FAILED tests/parse-events.c:1427 wrong number of entries
>   Event test failure: test 33 'instructions:uDp'running test 34
> '{cycles,cache-misses,branch-misses}:D'
>   FAILED tests/parse-events.c:1445 wrong number of entries
>   Event test failure: test 34
> '{cycles,cache-misses,branch-misses}:D'running test 35 'mem:0/1'
>   running test 36 'mem:0/2:w'
>   running test 37 'mem:0/4:rw:u'
>   running test 38 'instructions:I'
>   running test 39 'instructions:kIG'
>   running test 40 'task-clock:P,cycles'
>   FAILED tests/parse-events.c:1564 wrong number of entries
>   Event test failure: test 40 'task-clock:P,cycles'running test 41
> 'instructions/name=insn/'
>   running test 42 'r1234/name=rawpmu/'
>   running test 43 '4:0x6530160/name=numpmu/'
>   running test 44 'L1-dcache-misses/name=cachepmu/'
>   running test 45 'intel_pt//u'
>   ... SKIP
>   running test 46
> 'cycles/name='COMPLEX_CYCLES_NAME:orig=cycles,desc=chip-clock-ticks'/Duk'
>   running test 47 'cycles//u'
>   running test 48 'cycles:k'
>   running test 49 'instructions:uep'
>   running test 50 '{cycles,cache-misses,branch-misses}:e'
>   FAILED tests/parse-events.c:1490 wrong number of entries
>   Event test failure: test 50
> '{cycles,cache-misses,branch-misses}:e'running test 51 'cycles/name=name/'
>   running test 52 'cycles/name=l1d/'
>   running test 53 'mem:0/name=breakpoint/'
>   running test 54 'mem:0:x/name=breakpoint/'
>   running test 55 'mem:0:r/name=breakpoint/'
>   running test 56 'mem:0:w/name=breakpoint/'
>   running test 57 'mem:0/name=breakpoint/u'
>   running test 58 'mem:0:x/name=breakpoint/k'
>   running test 59 'mem:0:r/name=breakpoint/hp'
>   running test 60 'mem:0:w/name=breakpoint/up'
>   running test 61 'mem:0:rw/name=breakpoint/'
>   running test 62 'mem:0:rw/name=breakpoint/kp'
>   running test 63 'mem:0/1/name=breakpoint/'
>   running test 64 'mem:0/2:w/name=breakpoint/'
>   running test 65 'mem:0/4:rw/name=breakpoint/u'
>   running test 66 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'
>
>
> >
> > It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
> > there is also the legacy kernel issue. Do you have a plan there or
> > should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
> > Linux 6.5 merge window will open soon I'm concerned that these fixes
> > may only make it into 6.6. Should we do anything short-term for 6.5?
> > For example, if extended types aren't supported then disable the
> > wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
> > and PERF_TYPE_HW_CACHE, cases?
> >
> > Thanks,
> > Ian
> >
> >> [1]: https://lore.kernel.org/all/ZIxza13x+AwApbQb@kernel.org/
> >>
> >>> The perf tool is expecting it to be supported and so it tries to open
> >>> the event for each core PMU, so the cycles event will open for each of
> >>> a BIG.little PMU, however, as the extended type isn't supported the
> >>> same event will be opened twice and the migration problem still
> >>> exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
> >>> all PERF_TYPE_HW_CACHE events (42).
> >>>
> >>> Assuming the capability issue is fixed, there is still an open issue
> >>> about how to properly support these 52 events on older kernels. It
> >>> would be possible to use the architecture standard events, so cycles
> >>> can map to cpu_cycles, instructions to sw_incr, cache-references to
> >>> l1d_cache (probably), etc. I'm not sure this will work for all 52
> >>> events and the plumbing in the perf tool isn't straightforward.
> >>> There's also the question of whether such a change of events should
> >>> happen when we know the perf_event_open will be given a CPU, should
> >>> that case not wildcard PMUs?>
> >>> I'm happy to provide more context/details.
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>> Also we are transitioning from Arnaldo's tree to new repositories in
> >>>> git.kernel.org:
> >>>>
> >>>>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
> >>>>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
> >>>>
> >>>> There is a branch with the same name.  Please check out
> >>>> perf-tools-next and play with it.
> >>>>
> >>>> Thanks,
> >>>> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-30 15:30         ` Ian Rogers
@ 2023-06-30 16:14           ` James Clark
  2023-06-30 16:45             ` Ian Rogers
  0 siblings, 1 reply; 18+ messages in thread
From: James Clark @ 2023-06-30 16:14 UTC (permalink / raw)
  To: Ian Rogers
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim



On 30/06/2023 16:30, Ian Rogers wrote:
> On Thu, Jun 29, 2023 at 4:14 AM James Clark <james.clark@arm.com> wrote:
>>
>>
>>
>> On 23/06/2023 15:30, Ian Rogers wrote:
>>> On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 22/06/2023 21:15, Ian Rogers wrote:
>>>>> On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> In this development cycle, we've got a big change in how to handle
>>>>>> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
>>>>>> but caused some troubles on others.  So I'd like to confirm it works
>>>>>> on major platforms without problems.  Especially on ARM64 as it has
>>>>>> the big.LITTLE CPUs.
>>>>>
>>>>>
>>>>> Thanks Namhyung. I believe there is a change on ARM64 that makes a
>>>>> problem of one type into a slightly different problem. The perf parse
>>>>> events logic handles events certain known hardware events using known
>>>>> numeric encodings, for example, the cycles event has a type of 0
>>>>> (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
>>>>>
>>>>> The perf_event_open syscall can specify a CPU number of -1 for any. My
>>>>> understanding is that -1 on ARM translates into choose the PMU of the
>>>>> current running thread, this means that if a thread migrates from
>>>>> between core/PMU types it won't be recorded on the alternate PMU type.
>>>>>
>>>>> The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
>>>>> this, but appears not to be set on ARM's PMU:
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
>>>>> Could an ARM64 related person fix this?
>>>>>
>>>>
>>>> Yes I was planning to look into this since this thread [1].
>>>> It seems like there are still some test failures on Juno which I'm
>>>> working through now, even after the fix in that thread. And I was
>>>> thinking that it should be possible to support
>>>> PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.
>>>
>>> Thanks James, my memory of Juno boards was early bring up work and I
>>> don't recall them having heterogeneous PMUs. Presumably this is
>>
>> Juno does seem to report two PMU types:
>>
>>    ls /sys/bus/event_source/devices/
>>    ...
>>    armv8_cortex_a53  armv8_cortex_a57
>>    ...
>>
>>> another issue, I'm happy to help if you can share more details.
>>
>> So it looks like the failure starts at commit 8bc75f699c14 ("perf
>> parse-events: Support wildcards on raw events"). test__checkevent_raw()
>> is looking for PERF_TYPE_RAW at this point, but the commit has changed
>> the opened event to be whatever PMU it found (type 8 on n1sdp for
>> example) instead of PERF_TYPE_RAW (4) which causes the test to fail.
>>
>> Then looking at the latest perf-tools-next (ad5f604e186a), now the event
>> opened goes back to PERF_TYPE_RAW again, but test__checkevent_raw() is
>> testing for a real PMU type. None of the PMUs have raw type so the test
>> still fails.
>>
>> This is all on N1SDP which only has one PMU type so
>> perf_pmus__supports_extended_type() returns false.
>>
>> On Juno it's similar, but perf_pmus__supports_extended_type() isn't
>> called for the raw test for some reason. It doesn't get hit until the
>> "instructions" and test__checkevent_symbolic_name() test later. I saw
>> this because I was looking into Arnaldo's suggestion to hard code a
>> return false for arm there. But it doesn't fix the test failures for
>> either platform.
>>
>> The failures do look slightly different depending on whether there are
>> heterogeneous PMUs or not, so maybe there are two problems to fix. Here
>> are the full outputs of the test from both platforms. But I can't really
>> think what the fix should be (assuming we should first fix it without
>> looking at PERF_PMU_CAP_EXTENDED_HW_TYPE yet).
>>
>> N1SDP (only the raw test fails):
>>
>>  $ ls /sys/bus/event_source/devices/
>>  ...
>>  armv8_pmuv3_0
>>  ...
>>
>>  $ ./perf test "Test event parsing" -vvv
>>      6: Parse event definition strings                                  :
>>     6.1: Test event parsing                                            :
>>   --- start ---
>>   test child forked, pid 21636
>>   running test 0 'syscalls:sys_enter_openat'
>>   Using CPUID 0x00000000410fd0c0
>>   running test 1 'syscalls:*'
>>   running test 2 'r1a'
>>   FAILED tests/parse-events.c:123 No PMU found for type
>>   Event test failure: test 2 'r1a'running test 3 '1:1'
>>   running test 4 'instructions'
>>   running test 5 'cycles/period=100000,config2/'
>>   running test 6 'faults'
>>   running test 7 'L1-dcache-load-miss'
>>   running test 8 'mem:0'
>>   running test 9 'mem:0:x'
>>   running test 10 'mem:0:r'
>>   running test 11 'mem:0:w'
>>   running test 12 'syscalls:sys_enter_openat:k'
>>   running test 13 'syscalls:*:u'
>>   running test 14 'r1a:kp'
>>   FAILED tests/parse-events.c:123 No PMU found for type
>>   Event test failure: test 14 'r1a:kp'running test 15 '1:1:hp'
>>   running test 16 'instructions:h'
>>   running test 17 'faults:u'
>>   running test 18 'L1-dcache-load-miss:kp'
>>   running test 19 'mem:0:u'
>>   running test 20 'mem:0:x:k'
>>   running test 21 'mem:0:r:hp'
>>   running test 22 'mem:0:w:up'
>>   running test 23 'r1,syscalls:sys_enter_openat:k,1:1:hp'
>>   running test 24 'instructions:G'
>>   running test 25 'instructions:H'
>>   running test 26 'mem:0:rw'
>>   running test 27 'mem:0:rw:kp'
>>   running test 28 '{instructions:k,cycles:upp}'
>>   running test 29 '{faults:k,cache-references}:u,cycles:k'
>>   running test 30
>> 'group1{syscalls:sys_enter_openat:H,cycles:kppp},group2{cycles,1:3}:G,instructions:u'
>>   running test 31 '{cycles:u,instructions:kp}:p'
>>   running test 32 '{cycles,instructions}:G,{cycles:G,instructions:G},cycles'
>>   running test 33 '*:*'
>>   running test 34 '{cycles,cache-misses:G}:H'
>>   running test 35 '{cycles,cache-misses:H}:G'
>>   running test 36 '{cycles:G,cache-misses:H}:u'
>>   running test 37 '{cycles:G,cache-misses:H}:uG'
>>   running test 38 '{cycles,cache-misses,branch-misses}:S'
>>   running test 39 '{instructions,branch-misses}:Su'
>>   running test 40 'instructions:uDp'
>>   running test 41 '{cycles,cache-misses,branch-misses}:D'
>>   running test 42 'mem:0/1'
>>   running test 43 'mem:0/2:w'
>>   running test 44 'mem:0/4:rw:u'
>>   running test 45 'instructions:I'
>>   running test 46 'instructions:kIG'
>>   running test 47 'task-clock:P,cycles'
>>   running test 48 'instructions/name=insn/'
>>   running test 49 'r1234/name=rawpmu/'
>>   running test 50 '4:0x6530160/name=numpmu/'
>>   running test 51 'L1-dcache-misses/name=cachepmu/'
>>   running test 52 'intel_pt//u'
>>   ... SKIP
>>   running test 53
>> 'cycles/name='COMPLEX_CYCLES_NAME:orig=cycles,desc=chip-clock-ticks'/Duk'
>>   running test 54 'cycles//u'
>>   running test 55 'cycles:k'
>>   running test 56 'instructions:uep'
>>   running test 57 '{cycles,cache-misses,branch-misses}:e'
>>   running test 58 'cycles/name=name/'
>>   running test 59 'cycles/name=l1d/'
>>   running test 60 'mem:0/name=breakpoint/'
>>   running test 61 'mem:0:x/name=breakpoint/'
>>   running test 62 'mem:0:r/name=breakpoint/'
>>   running test 63 'mem:0:w/name=breakpoint/'
>>   running test 64 'mem:0/name=breakpoint/u'
>>   running test 65 'mem:0:x/name=breakpoint/k'
>>   running test 66 'mem:0:r/name=breakpoint/hp'
>>   running test 67 'mem:0:w/name=breakpoint/up'
>>   running test 68 'mem:0:rw/name=breakpoint/'
>>   running test 69 'mem:0:rw/name=breakpoint/kp'
>>   running test 70 'mem:0/1/name=breakpoint/'
>>   running test 71 'mem:0/2:w/name=breakpoint/'
>>   running test 72 'mem:0/4:rw/name=breakpoint/u'
>>   running test 73 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'
> 
> I saw the raw encoding issue on my Raspberry Pi. I also saw a lot of
> libtraceevent errors. This is concerning me and we need to get a fix
> for 6.5.
> 
>> Juno:
>>
>>   (/sys/bus/event_source/devices/ output above)
>>
>>       3: Parse event definition strings                                  :
>>     3.1: Test event parsing                                            :
>>   --- start ---
>>   running test 0 'r1a'
>>   Using CPUID 0x00000000410fd070
>>   FAILED tests/parse-events.c:123 No PMU found for type
>>   Event test failure: test 0 'r1a'running test 1 '1:1'
>>   running test 2 'instructions'
>>   Opening: unknown-hardware:HG
>>   ------------------------------------------------------------
>>   perf_event_attr:
>>     type                             0 (PERF_TYPE_HARDWARE)
>>     config                           0x800000000
>>     disabled                         1
>>   ------------------------------------------------------------
> 
> Here the config has the extended type in it but because ARM PMU's
> heterogeneous PMUs aren't  supporting that we get divergence. I think
> this is a legitimate failure until ARM's PMU driver is fixed to enable
> the extended type information. We can make the test robust to this,
> but it is probably a good idea to red flag the ARM PMU being wrong.

Shouldn't Perf just not request it if it isn't available? I thought the
capability stuff was supposed to help with knowing what the kernel will
support before opening the event. Even without that, Perf has the
fallback mechanism where it retries opening with fewer options if
something causes a failure.

> 
>>   sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
>>   sys_perf_event_open failed, error -2
>>   running test 3 'cycles/period=100000,config2/'
>>   running test 4 'faults'
>>   running test 5 'L1-dcache-load-miss'
>>   running test 6 'mem:0'
>>   running test 7 'mem:0:x'
>>   running test 8 'mem:0:r'
>>   running test 9 'mem:0:w'
>>   running test 10 'r1a:kp'
>>   FAILED tests/parse-events.c:123 No PMU found for type
>>   Event test failure: test 10 'r1a:kp'running test 11 '1:1:hp'
>>   running test 12 'instructions:h'
>>   FAILED tests/parse-events.c:329 wrong number of entries
>>   Event test failure: test 12 'instructions:h'running test 13 'faults:u'
>>   running test 14 'L1-dcache-load-miss:kp'
>>   running test 15 'mem:0:u'
>>   running test 16 'mem:0:x:k'
>>   running test 17 'mem:0:r:hp'
>>   running test 18 'mem:0:w:up'
>>   running test 19 'instructions:G'
>>   running test 20 'instructions:H'
>>   running test 21 'mem:0:rw'
>>   running test 22 'mem:0:rw:kp'
>>   running test 23 '{instructions:k,cycles:upp}'
>>   FAILED tests/parse-events.c:832 wrong number of entries
>>   Event test failure: test 23 '{instructions:k,cycles:upp}'running test
> 
> These failures will relate to not opening enough events, there should
> be one instructions and cycles event/evsel (both legacy types) per
> PMU. Similarly below. All these failures will relate to the extended
> type divergence, could this be a priority for someone, ARM, Linaro..
> to look at? I'm happy to send emails, be on a call, etc.
> 

Yes a call sounds good. I can do 8am to 6pm BST all next week pretty
much, if that works for you?

I think Will has a point that the Arm version without
PERF_PMU_CAP_EXTENDED_HW_TYPE has existed for a long time, and as far as
I know new Perf should be compatible with old kernels (including the
tests?) so we should make sure it works with both versions even if we do
add PERF_PMU_CAP_EXTENDED_HW_TYPE for Arm later.

I need to go back and re-check my previous message about the usability
(rather than test) issues on Juno and see which ones, if any, are
related to these changes. If something that a user could do before
doesn't work anymore maybe that adds another argument.

> Thanks,
> Ian
> 
>> 24 '{faults:k,cache-references}:u,cycles:k'
>>   FAILED tests/parse-events.c:875 wrong number of entries

[...]
>>   running test 65 'mem:0/4:rw/name=breakpoint/u'
>>   running test 66 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'
>>
>>
>>>
>>> It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
>>> there is also the legacy kernel issue. Do you have a plan there or
>>> should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
>>> Linux 6.5 merge window will open soon I'm concerned that these fixes
>>> may only make it into 6.6. Should we do anything short-term for 6.5?
>>> For example, if extended types aren't supported then disable the
>>> wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
>>> and PERF_TYPE_HW_CACHE, cases?

Sorry I missed this bit before, but I think that we need both cases to
work even if we add PERF_PMU_CAP_EXTENDED_HW_TYPE in the future. So yes
I think some short term tool fix would be needed. Unless it's just the
tests that are failing rather than any real use cases, then maybe
nothing needs to be done.

James

>>>
>>> Thanks,
>>> Ian
>>>
>>>> [1]: https://lore.kernel.org/all/ZIxza13x+AwApbQb@kernel.org/
>>>>
>>>>> The perf tool is expecting it to be supported and so it tries to open
>>>>> the event for each core PMU, so the cycles event will open for each of
>>>>> a BIG.little PMU, however, as the extended type isn't supported the
>>>>> same event will be opened twice and the migration problem still
>>>>> exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
>>>>> all PERF_TYPE_HW_CACHE events (42).
>>>>>
>>>>> Assuming the capability issue is fixed, there is still an open issue
>>>>> about how to properly support these 52 events on older kernels. It
>>>>> would be possible to use the architecture standard events, so cycles
>>>>> can map to cpu_cycles, instructions to sw_incr, cache-references to
>>>>> l1d_cache (probably), etc. I'm not sure this will work for all 52
>>>>> events and the plumbing in the perf tool isn't straightforward.
>>>>> There's also the question of whether such a change of events should
>>>>> happen when we know the perf_event_open will be given a CPU, should
>>>>> that case not wildcard PMUs?>
>>>>> I'm happy to provide more context/details.
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> Also we are transitioning from Arnaldo's tree to new repositories in
>>>>>> git.kernel.org:
>>>>>>
>>>>>>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
>>>>>>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
>>>>>>
>>>>>> There is a branch with the same name.  Please check out
>>>>>> perf-tools-next and play with it.
>>>>>>
>>>>>> Thanks,
>>>>>> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-30 16:14           ` James Clark
@ 2023-06-30 16:45             ` Ian Rogers
  2023-07-04 11:30               ` James Clark
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Rogers @ 2023-06-30 16:45 UTC (permalink / raw)
  To: James Clark
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim

On Fri, Jun 30, 2023 at 9:14 AM James Clark <james.clark@arm.com> wrote:
>
>
>
> On 30/06/2023 16:30, Ian Rogers wrote:
> > On Thu, Jun 29, 2023 at 4:14 AM James Clark <james.clark@arm.com> wrote:
> >>
> >>
> >>
> >> On 23/06/2023 15:30, Ian Rogers wrote:
> >>> On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 22/06/2023 21:15, Ian Rogers wrote:
> >>>>> On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> In this development cycle, we've got a big change in how to handle
> >>>>>> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
> >>>>>> but caused some troubles on others.  So I'd like to confirm it works
> >>>>>> on major platforms without problems.  Especially on ARM64 as it has
> >>>>>> the big.LITTLE CPUs.
> >>>>>
> >>>>>
> >>>>> Thanks Namhyung. I believe there is a change on ARM64 that makes a
> >>>>> problem of one type into a slightly different problem. The perf parse
> >>>>> events logic handles events certain known hardware events using known
> >>>>> numeric encodings, for example, the cycles event has a type of 0
> >>>>> (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
> >>>>>
> >>>>> The perf_event_open syscall can specify a CPU number of -1 for any. My
> >>>>> understanding is that -1 on ARM translates into choose the PMU of the
> >>>>> current running thread, this means that if a thread migrates from
> >>>>> between core/PMU types it won't be recorded on the alternate PMU type.
> >>>>>
> >>>>> The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
> >>>>> this, but appears not to be set on ARM's PMU:
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
> >>>>> Could an ARM64 related person fix this?
> >>>>>
> >>>>
> >>>> Yes I was planning to look into this since this thread [1].
> >>>> It seems like there are still some test failures on Juno which I'm
> >>>> working through now, even after the fix in that thread. And I was
> >>>> thinking that it should be possible to support
> >>>> PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.
> >>>
> >>> Thanks James, my memory of Juno boards was early bring up work and I
> >>> don't recall them having heterogeneous PMUs. Presumably this is
> >>
> >> Juno does seem to report two PMU types:
> >>
> >>    ls /sys/bus/event_source/devices/
> >>    ...
> >>    armv8_cortex_a53  armv8_cortex_a57
> >>    ...
> >>
> >>> another issue, I'm happy to help if you can share more details.
> >>
> >> So it looks like the failure starts at commit 8bc75f699c14 ("perf
> >> parse-events: Support wildcards on raw events"). test__checkevent_raw()
> >> is looking for PERF_TYPE_RAW at this point, but the commit has changed
> >> the opened event to be whatever PMU it found (type 8 on n1sdp for
> >> example) instead of PERF_TYPE_RAW (4) which causes the test to fail.
> >>
> >> Then looking at the latest perf-tools-next (ad5f604e186a), now the event
> >> opened goes back to PERF_TYPE_RAW again, but test__checkevent_raw() is
> >> testing for a real PMU type. None of the PMUs have raw type so the test
> >> still fails.
> >>
> >> This is all on N1SDP which only has one PMU type so
> >> perf_pmus__supports_extended_type() returns false.
> >>
> >> On Juno it's similar, but perf_pmus__supports_extended_type() isn't
> >> called for the raw test for some reason. It doesn't get hit until the
> >> "instructions" and test__checkevent_symbolic_name() test later. I saw
> >> this because I was looking into Arnaldo's suggestion to hard code a
> >> return false for arm there. But it doesn't fix the test failures for
> >> either platform.
> >>
> >> The failures do look slightly different depending on whether there are
> >> heterogeneous PMUs or not, so maybe there are two problems to fix. Here
> >> are the full outputs of the test from both platforms. But I can't really
> >> think what the fix should be (assuming we should first fix it without
> >> looking at PERF_PMU_CAP_EXTENDED_HW_TYPE yet).
> >>
> >> N1SDP (only the raw test fails):
> >>
> >>  $ ls /sys/bus/event_source/devices/
> >>  ...
> >>  armv8_pmuv3_0
> >>  ...
> >>
> >>  $ ./perf test "Test event parsing" -vvv
> >>      6: Parse event definition strings                                  :
> >>     6.1: Test event parsing                                            :
> >>   --- start ---
> >>   test child forked, pid 21636
> >>   running test 0 'syscalls:sys_enter_openat'
> >>   Using CPUID 0x00000000410fd0c0
> >>   running test 1 'syscalls:*'
> >>   running test 2 'r1a'
> >>   FAILED tests/parse-events.c:123 No PMU found for type
> >>   Event test failure: test 2 'r1a'running test 3 '1:1'
> >>   running test 4 'instructions'
> >>   running test 5 'cycles/period=100000,config2/'
> >>   running test 6 'faults'
> >>   running test 7 'L1-dcache-load-miss'
> >>   running test 8 'mem:0'
> >>   running test 9 'mem:0:x'
> >>   running test 10 'mem:0:r'
> >>   running test 11 'mem:0:w'
> >>   running test 12 'syscalls:sys_enter_openat:k'
> >>   running test 13 'syscalls:*:u'
> >>   running test 14 'r1a:kp'
> >>   FAILED tests/parse-events.c:123 No PMU found for type
> >>   Event test failure: test 14 'r1a:kp'running test 15 '1:1:hp'
> >>   running test 16 'instructions:h'
> >>   running test 17 'faults:u'
> >>   running test 18 'L1-dcache-load-miss:kp'
> >>   running test 19 'mem:0:u'
> >>   running test 20 'mem:0:x:k'
> >>   running test 21 'mem:0:r:hp'
> >>   running test 22 'mem:0:w:up'
> >>   running test 23 'r1,syscalls:sys_enter_openat:k,1:1:hp'
> >>   running test 24 'instructions:G'
> >>   running test 25 'instructions:H'
> >>   running test 26 'mem:0:rw'
> >>   running test 27 'mem:0:rw:kp'
> >>   running test 28 '{instructions:k,cycles:upp}'
> >>   running test 29 '{faults:k,cache-references}:u,cycles:k'
> >>   running test 30
> >> 'group1{syscalls:sys_enter_openat:H,cycles:kppp},group2{cycles,1:3}:G,instructions:u'
> >>   running test 31 '{cycles:u,instructions:kp}:p'
> >>   running test 32 '{cycles,instructions}:G,{cycles:G,instructions:G},cycles'
> >>   running test 33 '*:*'
> >>   running test 34 '{cycles,cache-misses:G}:H'
> >>   running test 35 '{cycles,cache-misses:H}:G'
> >>   running test 36 '{cycles:G,cache-misses:H}:u'
> >>   running test 37 '{cycles:G,cache-misses:H}:uG'
> >>   running test 38 '{cycles,cache-misses,branch-misses}:S'
> >>   running test 39 '{instructions,branch-misses}:Su'
> >>   running test 40 'instructions:uDp'
> >>   running test 41 '{cycles,cache-misses,branch-misses}:D'
> >>   running test 42 'mem:0/1'
> >>   running test 43 'mem:0/2:w'
> >>   running test 44 'mem:0/4:rw:u'
> >>   running test 45 'instructions:I'
> >>   running test 46 'instructions:kIG'
> >>   running test 47 'task-clock:P,cycles'
> >>   running test 48 'instructions/name=insn/'
> >>   running test 49 'r1234/name=rawpmu/'
> >>   running test 50 '4:0x6530160/name=numpmu/'
> >>   running test 51 'L1-dcache-misses/name=cachepmu/'
> >>   running test 52 'intel_pt//u'
> >>   ... SKIP
> >>   running test 53
> >> 'cycles/name='COMPLEX_CYCLES_NAME:orig=cycles,desc=chip-clock-ticks'/Duk'
> >>   running test 54 'cycles//u'
> >>   running test 55 'cycles:k'
> >>   running test 56 'instructions:uep'
> >>   running test 57 '{cycles,cache-misses,branch-misses}:e'
> >>   running test 58 'cycles/name=name/'
> >>   running test 59 'cycles/name=l1d/'
> >>   running test 60 'mem:0/name=breakpoint/'
> >>   running test 61 'mem:0:x/name=breakpoint/'
> >>   running test 62 'mem:0:r/name=breakpoint/'
> >>   running test 63 'mem:0:w/name=breakpoint/'
> >>   running test 64 'mem:0/name=breakpoint/u'
> >>   running test 65 'mem:0:x/name=breakpoint/k'
> >>   running test 66 'mem:0:r/name=breakpoint/hp'
> >>   running test 67 'mem:0:w/name=breakpoint/up'
> >>   running test 68 'mem:0:rw/name=breakpoint/'
> >>   running test 69 'mem:0:rw/name=breakpoint/kp'
> >>   running test 70 'mem:0/1/name=breakpoint/'
> >>   running test 71 'mem:0/2:w/name=breakpoint/'
> >>   running test 72 'mem:0/4:rw/name=breakpoint/u'
> >>   running test 73 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'
> >
> > I saw the raw encoding issue on my Raspberry Pi. I also saw a lot of
> > libtraceevent errors. This is concerning me and we need to get a fix
> > for 6.5.
> >
> >> Juno:
> >>
> >>   (/sys/bus/event_source/devices/ output above)
> >>
> >>       3: Parse event definition strings                                  :
> >>     3.1: Test event parsing                                            :
> >>   --- start ---
> >>   running test 0 'r1a'
> >>   Using CPUID 0x00000000410fd070
> >>   FAILED tests/parse-events.c:123 No PMU found for type
> >>   Event test failure: test 0 'r1a'running test 1 '1:1'
> >>   running test 2 'instructions'
> >>   Opening: unknown-hardware:HG
> >>   ------------------------------------------------------------
> >>   perf_event_attr:
> >>     type                             0 (PERF_TYPE_HARDWARE)
> >>     config                           0x800000000
> >>     disabled                         1
> >>   ------------------------------------------------------------
> >
> > Here the config has the extended type in it but because ARM PMU's
> > heterogeneous PMUs aren't  supporting that we get divergence. I think
> > this is a legitimate failure until ARM's PMU driver is fixed to enable
> > the extended type information. We can make the test robust to this,
> > but it is probably a good idea to red flag the ARM PMU being wrong.
>
> Shouldn't Perf just not request it if it isn't available? I thought the
> capability stuff was supposed to help with knowing what the kernel will
> support before opening the event.

So there are legacy events like instructions and cycles, the case here
is cycles. The type for the PMU is given as 0, no dynamically
registered PMU has this type, and the config is 0. For homogeneous
PMUs that'd be the end of the story:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n1510

For heterogeneous PMUs we need to open the event for each PMU. This is
because we want to track instructions or cycles on each PMU. The
encoding for this is to put the PMU's type into the extended type part
of the config. This is the 8 before all the 0s of the 0 config value.

In the kernel's core event opening handling is:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/kernel/events/core.c?h=perf/core#n11617
```
/*
* PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
* are often aliases for PERF_TYPE_RAW.
*/
type = event->attr.type;
if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE) {
type = event->attr.config >> PERF_PMU_TYPE_SHIFT;
if (!type) {
type = PERF_TYPE_RAW;
} else {
extended_type = true;
event->attr.config &= PERF_HW_EVENT_MASK;
}
}
```

So without the extended type PERF_TYPE_RAW is used for
PERF_TYPE_HARDWARE or PERF_TYPE_HW_CACHE. However, the capability
needs to be set on the PMU to support extended types otherwise the
code will fail:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/kernel/events/core.c?h=perf/core#n11637
```
if (event->attr.type != type && type != PERF_TYPE_RAW &&
   !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
goto fail;
```
For heterogeneous ARM PMUs I believe this is exactly the path taken to
cause the perf_event_open to fail. The resolution is to add the bit to
the PMU's capabilities. I believe doing it here will suffice:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm_pmu.c?h=perf/core#n877
So:
```
.capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS | PERF_PMU_CAP_EXTENDED_REGS,
```
becomes:
```
.capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS |
PERF_PMU_CAP_EXTENDED_REGS | PERF_PMU_CAP_EXTENDED_HW_TYPE,
```
Now strictly speaking you only need this in the case of heterogeneous
PMUs, but that's the case for other bits here and I don't see special
case logic...

> Even without that, Perf has the
> fallback mechanism where it retries opening with fewer options if
> something causes a failure.

The fallback with fewer options is a separate problem. The most common
case for that is that perf is trying to profile both user and kernel
code, but lacks permission. We retry just trying to profile user code.
The problem of permissions is a separate one to opening events on the
correct PMU.

> >
> >>   sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
> >>   sys_perf_event_open failed, error -2
> >>   running test 3 'cycles/period=100000,config2/'
> >>   running test 4 'faults'
> >>   running test 5 'L1-dcache-load-miss'
> >>   running test 6 'mem:0'
> >>   running test 7 'mem:0:x'
> >>   running test 8 'mem:0:r'
> >>   running test 9 'mem:0:w'
> >>   running test 10 'r1a:kp'
> >>   FAILED tests/parse-events.c:123 No PMU found for type
> >>   Event test failure: test 10 'r1a:kp'running test 11 '1:1:hp'
> >>   running test 12 'instructions:h'
> >>   FAILED tests/parse-events.c:329 wrong number of entries
> >>   Event test failure: test 12 'instructions:h'running test 13 'faults:u'
> >>   running test 14 'L1-dcache-load-miss:kp'
> >>   running test 15 'mem:0:u'
> >>   running test 16 'mem:0:x:k'
> >>   running test 17 'mem:0:r:hp'
> >>   running test 18 'mem:0:w:up'
> >>   running test 19 'instructions:G'
> >>   running test 20 'instructions:H'
> >>   running test 21 'mem:0:rw'
> >>   running test 22 'mem:0:rw:kp'
> >>   running test 23 '{instructions:k,cycles:upp}'
> >>   FAILED tests/parse-events.c:832 wrong number of entries
> >>   Event test failure: test 23 '{instructions:k,cycles:upp}'running test
> >
> > These failures will relate to not opening enough events, there should
> > be one instructions and cycles event/evsel (both legacy types) per
> > PMU. Similarly below. All these failures will relate to the extended
> > type divergence, could this be a priority for someone, ARM, Linaro..
> > to look at? I'm happy to send emails, be on a call, etc.
> >
>
> Yes a call sounds good. I can do 8am to 6pm BST all next week pretty
> much, if that works for you?

Let me check with my wife. Next week is funny for some odd
"independence" celebration here.

> I think Will has a point that the Arm version without
> PERF_PMU_CAP_EXTENDED_HW_TYPE has existed for a long time, and as far as
> I know new Perf should be compatible with old kernels (including the
> tests?) so we should make sure it works with both versions even if we do
> add PERF_PMU_CAP_EXTENDED_HW_TYPE for Arm later.

Agreed. It's the time machine problem, we can fix this but to fix it
for everyone requires a time machine. Fixing this problem on older
kernels is much more messy as event mapping logic is necessary. Given
a lack of complaints perhaps we just keep older kernels broken with a
failing test.

> I need to go back and re-check my previous message about the usability
> (rather than test) issues on Juno and see which ones, if any, are
> related to these changes. If something that a user could do before
> doesn't work anymore maybe that adds another argument.

Something that looks regressed in the new code is that we get >1
legacy cache event on heterogeneous ARM PMUs but the extended type
isn't set:
Iterating all core PMUs:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n464
The setting of the config will ignore not set the extended type bits:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n434
An example would be:
$ perf stat -e l1d-misses true
There should be two events on a heterogeneous system, but without the
extended type this will just induce unnecessary multiplexing. We
should probably just not iterate in the broken extended type case as
we are for PERF_TYPE_HARDWARE:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n1511

Thanks,
Ian

> > Thanks,
> > Ian
> >
> >> 24 '{faults:k,cache-references}:u,cycles:k'
> >>   FAILED tests/parse-events.c:875 wrong number of entries
>
> [...]
> >>   running test 65 'mem:0/4:rw/name=breakpoint/u'
> >>   running test 66 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'
> >>
> >>
> >>>
> >>> It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
> >>> there is also the legacy kernel issue. Do you have a plan there or
> >>> should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
> >>> Linux 6.5 merge window will open soon I'm concerned that these fixes
> >>> may only make it into 6.6. Should we do anything short-term for 6.5?
> >>> For example, if extended types aren't supported then disable the
> >>> wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
> >>> and PERF_TYPE_HW_CACHE, cases?
>
> Sorry I missed this bit before, but I think that we need both cases to
> work even if we add PERF_PMU_CAP_EXTENDED_HW_TYPE in the future. So yes
> I think some short term tool fix would be needed. Unless it's just the
> tests that are failing rather than any real use cases, then maybe
> nothing needs to be done.
>
> James
>
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>> [1]: https://lore.kernel.org/all/ZIxza13x+AwApbQb@kernel.org/
> >>>>
> >>>>> The perf tool is expecting it to be supported and so it tries to open
> >>>>> the event for each core PMU, so the cycles event will open for each of
> >>>>> a BIG.little PMU, however, as the extended type isn't supported the
> >>>>> same event will be opened twice and the migration problem still
> >>>>> exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
> >>>>> all PERF_TYPE_HW_CACHE events (42).
> >>>>>
> >>>>> Assuming the capability issue is fixed, there is still an open issue
> >>>>> about how to properly support these 52 events on older kernels. It
> >>>>> would be possible to use the architecture standard events, so cycles
> >>>>> can map to cpu_cycles, instructions to sw_incr, cache-references to
> >>>>> l1d_cache (probably), etc. I'm not sure this will work for all 52
> >>>>> events and the plumbing in the perf tool isn't straightforward.
> >>>>> There's also the question of whether such a change of events should
> >>>>> happen when we know the perf_event_open will be given a CPU, should
> >>>>> that case not wildcard PMUs?>
> >>>>> I'm happy to provide more context/details.
> >>>>>
> >>>>> Thanks,
> >>>>> Ian
> >>>>>
> >>>>>> Also we are transitioning from Arnaldo's tree to new repositories in
> >>>>>> git.kernel.org:
> >>>>>>
> >>>>>>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
> >>>>>>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
> >>>>>>
> >>>>>> There is a branch with the same name.  Please check out
> >>>>>> perf-tools-next and play with it.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-06-30 16:45             ` Ian Rogers
@ 2023-07-04 11:30               ` James Clark
  2023-07-04 12:33                 ` Ravi Bangoria
  0 siblings, 1 reply; 18+ messages in thread
From: James Clark @ 2023-07-04 11:30 UTC (permalink / raw)
  To: Ian Rogers
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Ravi Bangoria,
	Namhyung Kim



On 30/06/2023 17:45, Ian Rogers wrote:
> On Fri, Jun 30, 2023 at 9:14 AM James Clark <james.clark@arm.com> wrote:
>>
>>
>>
>> On 30/06/2023 16:30, Ian Rogers wrote:
>>> On Thu, Jun 29, 2023 at 4:14 AM James Clark <james.clark@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 23/06/2023 15:30, Ian Rogers wrote:
>>>>> On Fri, Jun 23, 2023 at 1:42 AM James Clark <james.clark@arm.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 22/06/2023 21:15, Ian Rogers wrote:
>>>>>>> On Thu, Jun 22, 2023 at 11:57 AM Namhyung Kim <namhyung@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> In this development cycle, we've got a big change in how to handle
>>>>>>>> heterogeneous core PMUs. It's mostly tested on Intel hybrid machines
>>>>>>>> but caused some troubles on others.  So I'd like to confirm it works
>>>>>>>> on major platforms without problems.  Especially on ARM64 as it has
>>>>>>>> the big.LITTLE CPUs.
>>>>>>>
>>>>>>>
>>>>>>> Thanks Namhyung. I believe there is a change on ARM64 that makes a
>>>>>>> problem of one type into a slightly different problem. The perf parse
>>>>>>> events logic handles events certain known hardware events using known
>>>>>>> numeric encodings, for example, the cycles event has a type of 0
>>>>>>> (PERF_TYPE_HARDWARE) and a config of 0 (PERF_COUNT_HW_CPU_CYCLES):
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n286
>>>>>>>
>>>>>>> The perf_event_open syscall can specify a CPU number of -1 for any. My
>>>>>>> understanding is that -1 on ARM translates into choose the PMU of the
>>>>>>> current running thread, this means that if a thread migrates from
>>>>>>> between core/PMU types it won't be recorded on the alternate PMU type.
>>>>>>>
>>>>>>> The extended type capability (PERF_PMU_CAP_EXTENDED_HW_TYPE) solves
>>>>>>> this, but appears not to be set on ARM's PMU:
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm-cmn.c#n2291
>>>>>>> Could an ARM64 related person fix this?
>>>>>>>
>>>>>>
>>>>>> Yes I was planning to look into this since this thread [1].
>>>>>> It seems like there are still some test failures on Juno which I'm
>>>>>> working through now, even after the fix in that thread. And I was
>>>>>> thinking that it should be possible to support
>>>>>> PERF_PMU_CAP_EXTENDED_HW_TYPE on Arm too.
>>>>>
>>>>> Thanks James, my memory of Juno boards was early bring up work and I
>>>>> don't recall them having heterogeneous PMUs. Presumably this is
>>>>
>>>> Juno does seem to report two PMU types:
>>>>
>>>>    ls /sys/bus/event_source/devices/
>>>>    ...
>>>>    armv8_cortex_a53  armv8_cortex_a57
>>>>    ...
>>>>
>>>>> another issue, I'm happy to help if you can share more details.
>>>>
>>>> So it looks like the failure starts at commit 8bc75f699c14 ("perf
>>>> parse-events: Support wildcards on raw events"). test__checkevent_raw()
>>>> is looking for PERF_TYPE_RAW at this point, but the commit has changed
>>>> the opened event to be whatever PMU it found (type 8 on n1sdp for
>>>> example) instead of PERF_TYPE_RAW (4) which causes the test to fail.
>>>>
>>>> Then looking at the latest perf-tools-next (ad5f604e186a), now the event
>>>> opened goes back to PERF_TYPE_RAW again, but test__checkevent_raw() is
>>>> testing for a real PMU type. None of the PMUs have raw type so the test
>>>> still fails.
>>>>
>>>> This is all on N1SDP which only has one PMU type so
>>>> perf_pmus__supports_extended_type() returns false.
>>>>
>>>> On Juno it's similar, but perf_pmus__supports_extended_type() isn't
>>>> called for the raw test for some reason. It doesn't get hit until the
>>>> "instructions" and test__checkevent_symbolic_name() test later. I saw
>>>> this because I was looking into Arnaldo's suggestion to hard code a
>>>> return false for arm there. But it doesn't fix the test failures for
>>>> either platform.
>>>>
>>>> The failures do look slightly different depending on whether there are
>>>> heterogeneous PMUs or not, so maybe there are two problems to fix. Here
>>>> are the full outputs of the test from both platforms. But I can't really
>>>> think what the fix should be (assuming we should first fix it without
>>>> looking at PERF_PMU_CAP_EXTENDED_HW_TYPE yet).
>>>>
>>>> N1SDP (only the raw test fails):
>>>>
>>>>  $ ls /sys/bus/event_source/devices/
>>>>  ...
>>>>  armv8_pmuv3_0
>>>>  ...
>>>>
>>>>  $ ./perf test "Test event parsing" -vvv
>>>>      6: Parse event definition strings                                  :
>>>>     6.1: Test event parsing                                            :
>>>>   --- start ---
>>>>   test child forked, pid 21636
>>>>   running test 0 'syscalls:sys_enter_openat'
>>>>   Using CPUID 0x00000000410fd0c0
>>>>   running test 1 'syscalls:*'
>>>>   running test 2 'r1a'
>>>>   FAILED tests/parse-events.c:123 No PMU found for type
>>>>   Event test failure: test 2 'r1a'running test 3 '1:1'
>>>>   running test 4 'instructions'
>>>>   running test 5 'cycles/period=100000,config2/'
>>>>   running test 6 'faults'
>>>>   running test 7 'L1-dcache-load-miss'
>>>>   running test 8 'mem:0'
>>>>   running test 9 'mem:0:x'
>>>>   running test 10 'mem:0:r'
>>>>   running test 11 'mem:0:w'
>>>>   running test 12 'syscalls:sys_enter_openat:k'
>>>>   running test 13 'syscalls:*:u'
>>>>   running test 14 'r1a:kp'
>>>>   FAILED tests/parse-events.c:123 No PMU found for type
>>>>   Event test failure: test 14 'r1a:kp'running test 15 '1:1:hp'
>>>>   running test 16 'instructions:h'
>>>>   running test 17 'faults:u'
>>>>   running test 18 'L1-dcache-load-miss:kp'
>>>>   running test 19 'mem:0:u'
>>>>   running test 20 'mem:0:x:k'
>>>>   running test 21 'mem:0:r:hp'
>>>>   running test 22 'mem:0:w:up'
>>>>   running test 23 'r1,syscalls:sys_enter_openat:k,1:1:hp'
>>>>   running test 24 'instructions:G'
>>>>   running test 25 'instructions:H'
>>>>   running test 26 'mem:0:rw'
>>>>   running test 27 'mem:0:rw:kp'
>>>>   running test 28 '{instructions:k,cycles:upp}'
>>>>   running test 29 '{faults:k,cache-references}:u,cycles:k'
>>>>   running test 30
>>>> 'group1{syscalls:sys_enter_openat:H,cycles:kppp},group2{cycles,1:3}:G,instructions:u'
>>>>   running test 31 '{cycles:u,instructions:kp}:p'
>>>>   running test 32 '{cycles,instructions}:G,{cycles:G,instructions:G},cycles'
>>>>   running test 33 '*:*'
>>>>   running test 34 '{cycles,cache-misses:G}:H'
>>>>   running test 35 '{cycles,cache-misses:H}:G'
>>>>   running test 36 '{cycles:G,cache-misses:H}:u'
>>>>   running test 37 '{cycles:G,cache-misses:H}:uG'
>>>>   running test 38 '{cycles,cache-misses,branch-misses}:S'
>>>>   running test 39 '{instructions,branch-misses}:Su'
>>>>   running test 40 'instructions:uDp'
>>>>   running test 41 '{cycles,cache-misses,branch-misses}:D'
>>>>   running test 42 'mem:0/1'
>>>>   running test 43 'mem:0/2:w'
>>>>   running test 44 'mem:0/4:rw:u'
>>>>   running test 45 'instructions:I'
>>>>   running test 46 'instructions:kIG'
>>>>   running test 47 'task-clock:P,cycles'
>>>>   running test 48 'instructions/name=insn/'
>>>>   running test 49 'r1234/name=rawpmu/'
>>>>   running test 50 '4:0x6530160/name=numpmu/'
>>>>   running test 51 'L1-dcache-misses/name=cachepmu/'
>>>>   running test 52 'intel_pt//u'
>>>>   ... SKIP
>>>>   running test 53
>>>> 'cycles/name='COMPLEX_CYCLES_NAME:orig=cycles,desc=chip-clock-ticks'/Duk'
>>>>   running test 54 'cycles//u'
>>>>   running test 55 'cycles:k'
>>>>   running test 56 'instructions:uep'
>>>>   running test 57 '{cycles,cache-misses,branch-misses}:e'
>>>>   running test 58 'cycles/name=name/'
>>>>   running test 59 'cycles/name=l1d/'
>>>>   running test 60 'mem:0/name=breakpoint/'
>>>>   running test 61 'mem:0:x/name=breakpoint/'
>>>>   running test 62 'mem:0:r/name=breakpoint/'
>>>>   running test 63 'mem:0:w/name=breakpoint/'
>>>>   running test 64 'mem:0/name=breakpoint/u'
>>>>   running test 65 'mem:0:x/name=breakpoint/k'
>>>>   running test 66 'mem:0:r/name=breakpoint/hp'
>>>>   running test 67 'mem:0:w/name=breakpoint/up'
>>>>   running test 68 'mem:0:rw/name=breakpoint/'
>>>>   running test 69 'mem:0:rw/name=breakpoint/kp'
>>>>   running test 70 'mem:0/1/name=breakpoint/'
>>>>   running test 71 'mem:0/2:w/name=breakpoint/'
>>>>   running test 72 'mem:0/4:rw/name=breakpoint/u'
>>>>   running test 73 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'
>>>
>>> I saw the raw encoding issue on my Raspberry Pi. I also saw a lot of
>>> libtraceevent errors. This is concerning me and we need to get a fix
>>> for 6.5.
>>>
>>>> Juno:
>>>>
>>>>   (/sys/bus/event_source/devices/ output above)
>>>>
>>>>       3: Parse event definition strings                                  :
>>>>     3.1: Test event parsing                                            :
>>>>   --- start ---
>>>>   running test 0 'r1a'
>>>>   Using CPUID 0x00000000410fd070
>>>>   FAILED tests/parse-events.c:123 No PMU found for type
>>>>   Event test failure: test 0 'r1a'running test 1 '1:1'
>>>>   running test 2 'instructions'
>>>>   Opening: unknown-hardware:HG
>>>>   ------------------------------------------------------------
>>>>   perf_event_attr:
>>>>     type                             0 (PERF_TYPE_HARDWARE)
>>>>     config                           0x800000000
>>>>     disabled                         1
>>>>   ------------------------------------------------------------
>>>
>>> Here the config has the extended type in it but because ARM PMU's
>>> heterogeneous PMUs aren't  supporting that we get divergence. I think
>>> this is a legitimate failure until ARM's PMU driver is fixed to enable
>>> the extended type information. We can make the test robust to this,
>>> but it is probably a good idea to red flag the ARM PMU being wrong.
>>
>> Shouldn't Perf just not request it if it isn't available? I thought the
>> capability stuff was supposed to help with knowing what the kernel will
>> support before opening the event.
> 
> So there are legacy events like instructions and cycles, the case here
> is cycles. The type for the PMU is given as 0, no dynamically
> registered PMU has this type, and the config is 0. For homogeneous
> PMUs that'd be the end of the story:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n1510
> 
> For heterogeneous PMUs we need to open the event for each PMU. This is
> because we want to track instructions or cycles on each PMU. The
> encoding for this is to put the PMU's type into the extended type part
> of the config. This is the 8 before all the 0s of the 0 config value.
> 
> In the kernel's core event opening handling is:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/kernel/events/core.c?h=perf/core#n11617
> ```
> /*
> * PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
> * are often aliases for PERF_TYPE_RAW.
> */
> type = event->attr.type;
> if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE) {
> type = event->attr.config >> PERF_PMU_TYPE_SHIFT;
> if (!type) {
> type = PERF_TYPE_RAW;
> } else {
> extended_type = true;
> event->attr.config &= PERF_HW_EVENT_MASK;
> }
> }
> ```
> 
> So without the extended type PERF_TYPE_RAW is used for
> PERF_TYPE_HARDWARE or PERF_TYPE_HW_CACHE. However, the capability
> needs to be set on the PMU to support extended types otherwise the
> code will fail:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/kernel/events/core.c?h=perf/core#n11637
> ```
> if (event->attr.type != type && type != PERF_TYPE_RAW &&
>    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
> goto fail;
> ```
> For heterogeneous ARM PMUs I believe this is exactly the path taken to
> cause the perf_event_open to fail. The resolution is to add the bit to
> the PMU's capabilities. I believe doing it here will suffice:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm_pmu.c?h=perf/core#n877
> So:
> ```
> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS | PERF_PMU_CAP_EXTENDED_REGS,
> ```
> becomes:
> ```
> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS |
> PERF_PMU_CAP_EXTENDED_REGS | PERF_PMU_CAP_EXTENDED_HW_TYPE,
> ```
> Now strictly speaking you only need this in the case of heterogeneous
> PMUs, but that's the case for other bits here and I don't see special
> case logic...
> 

Hi Ian,

I tested the above and it seems to work fine on Arm so I will send it
shortly. Some of the legacy perf stat use cases are working better with
it so it makes sense to me. I also don't see a reason to not always do
it even for homogeneous PMUs. To be honest I don't see a reason not to
remove it as a capability and have it as the default behaviour
everywhere? It all happens in the core code so I'm not sure adding it
per PMU device is necessary, but maybe there is a reason?

I've also got a fix for all of the test cases which works for both
scenarios where the kernel does and doesn't support
PERF_PMU_CAP_EXTENDED_HW_TYPE which I will also send shortly. So there's
probably no need for you to look into any of those failures on Arm anymore.

Apart from maybe the case with the cache events opening twice, but I can
also look into a fix for that if you like?

James

>> Even without that, Perf has the
>> fallback mechanism where it retries opening with fewer options if
>> something causes a failure.
> 
> The fallback with fewer options is a separate problem. The most common
> case for that is that perf is trying to profile both user and kernel
> code, but lacks permission. We retry just trying to profile user code.
> The problem of permissions is a separate one to opening events on the
> correct PMU.
> 
>>>
>>>>   sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
>>>>   sys_perf_event_open failed, error -2
>>>>   running test 3 'cycles/period=100000,config2/'
>>>>   running test 4 'faults'
>>>>   running test 5 'L1-dcache-load-miss'
>>>>   running test 6 'mem:0'
>>>>   running test 7 'mem:0:x'
>>>>   running test 8 'mem:0:r'
>>>>   running test 9 'mem:0:w'
>>>>   running test 10 'r1a:kp'
>>>>   FAILED tests/parse-events.c:123 No PMU found for type
>>>>   Event test failure: test 10 'r1a:kp'running test 11 '1:1:hp'
>>>>   running test 12 'instructions:h'
>>>>   FAILED tests/parse-events.c:329 wrong number of entries
>>>>   Event test failure: test 12 'instructions:h'running test 13 'faults:u'
>>>>   running test 14 'L1-dcache-load-miss:kp'
>>>>   running test 15 'mem:0:u'
>>>>   running test 16 'mem:0:x:k'
>>>>   running test 17 'mem:0:r:hp'
>>>>   running test 18 'mem:0:w:up'
>>>>   running test 19 'instructions:G'
>>>>   running test 20 'instructions:H'
>>>>   running test 21 'mem:0:rw'
>>>>   running test 22 'mem:0:rw:kp'
>>>>   running test 23 '{instructions:k,cycles:upp}'
>>>>   FAILED tests/parse-events.c:832 wrong number of entries
>>>>   Event test failure: test 23 '{instructions:k,cycles:upp}'running test
>>>
>>> These failures will relate to not opening enough events, there should
>>> be one instructions and cycles event/evsel (both legacy types) per
>>> PMU. Similarly below. All these failures will relate to the extended
>>> type divergence, could this be a priority for someone, ARM, Linaro..
>>> to look at? I'm happy to send emails, be on a call, etc.
>>>
>>
>> Yes a call sounds good. I can do 8am to 6pm BST all next week pretty
>> much, if that works for you?
> 
> Let me check with my wife. Next week is funny for some odd
> "independence" celebration here.
> 
>> I think Will has a point that the Arm version without
>> PERF_PMU_CAP_EXTENDED_HW_TYPE has existed for a long time, and as far as
>> I know new Perf should be compatible with old kernels (including the
>> tests?) so we should make sure it works with both versions even if we do
>> add PERF_PMU_CAP_EXTENDED_HW_TYPE for Arm later.
> 
> Agreed. It's the time machine problem, we can fix this but to fix it
> for everyone requires a time machine. Fixing this problem on older
> kernels is much more messy as event mapping logic is necessary. Given
> a lack of complaints perhaps we just keep older kernels broken with a
> failing test.
> 
>> I need to go back and re-check my previous message about the usability
>> (rather than test) issues on Juno and see which ones, if any, are
>> related to these changes. If something that a user could do before
>> doesn't work anymore maybe that adds another argument.
> 
> Something that looks regressed in the new code is that we get >1
> legacy cache event on heterogeneous ARM PMUs but the extended type
> isn't set:
> Iterating all core PMUs:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n464
> The setting of the config will ignore not set the extended type bits:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n434
> An example would be:
> $ perf stat -e l1d-misses true
> There should be two events on a heterogeneous system, but without the
> extended type this will just induce unnecessary multiplexing. We
> should probably just not iterate in the broken extended type case as
> we are for PERF_TYPE_HARDWARE:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n1511
> 
> Thanks,
> Ian
> 
>>> Thanks,
>>> Ian
>>>
>>>> 24 '{faults:k,cache-references}:u,cycles:k'
>>>>   FAILED tests/parse-events.c:875 wrong number of entries
>>
>> [...]
>>>>   running test 65 'mem:0/4:rw/name=breakpoint/u'
>>>>   running test 66 'mem:0/1/name=breakpoint1/,mem:0/4:rw/name=breakpoint2/'
>>>>
>>>>
>>>>>
>>>>> It is great that you'll look to add PERF_PMU_CAP_EXTENDED_HW_TYPE,
>>>>> there is also the legacy kernel issue. Do you have a plan there or
>>>>> should supporting PERF_PMU_CAP_EXTENDED_HW_TYPE be the plan? As the
>>>>> Linux 6.5 merge window will open soon I'm concerned that these fixes
>>>>> may only make it into 6.6. Should we do anything short-term for 6.5?
>>>>> For example, if extended types aren't supported then disable the
>>>>> wildcard matching of PMUs in the "numeric", really PERF_TYPE_HARDWARE
>>>>> and PERF_TYPE_HW_CACHE, cases?
>>
>> Sorry I missed this bit before, but I think that we need both cases to
>> work even if we add PERF_PMU_CAP_EXTENDED_HW_TYPE in the future. So yes
>> I think some short term tool fix would be needed. Unless it's just the
>> tests that are failing rather than any real use cases, then maybe
>> nothing needs to be done.
>>
>> James
>>
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> [1]: https://lore.kernel.org/all/ZIxza13x+AwApbQb@kernel.org/
>>>>>>
>>>>>>> The perf tool is expecting it to be supported and so it tries to open
>>>>>>> the event for each core PMU, so the cycles event will open for each of
>>>>>>> a BIG.little PMU, however, as the extended type isn't supported the
>>>>>>> same event will be opened twice and the migration problem still
>>>>>>> exists. This problem exists for all PERF_TYPE_HARDWARE events (10) and
>>>>>>> all PERF_TYPE_HW_CACHE events (42).
>>>>>>>
>>>>>>> Assuming the capability issue is fixed, there is still an open issue
>>>>>>> about how to properly support these 52 events on older kernels. It
>>>>>>> would be possible to use the architecture standard events, so cycles
>>>>>>> can map to cpu_cycles, instructions to sw_incr, cache-references to
>>>>>>> l1d_cache (probably), etc. I'm not sure this will work for all 52
>>>>>>> events and the plumbing in the perf tool isn't straightforward.
>>>>>>> There's also the question of whether such a change of events should
>>>>>>> happen when we know the perf_event_open will be given a CPU, should
>>>>>>> that case not wildcard PMUs?>
>>>>>>> I'm happy to provide more context/details.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ian
>>>>>>>
>>>>>>>> Also we are transitioning from Arnaldo's tree to new repositories in
>>>>>>>> git.kernel.org:
>>>>>>>>
>>>>>>>>  * kernel/git/perf/perf-tools (for the fixes on the current dev cycle)
>>>>>>>>  * kernel/git/perf/perf-tools-next (for new features for the next cycle)
>>>>>>>>
>>>>>>>> There is a branch with the same name.  Please check out
>>>>>>>> perf-tools-next and play with it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-07-04 11:30               ` James Clark
@ 2023-07-04 12:33                 ` Ravi Bangoria
  2023-07-04 12:44                   ` James Clark
  0 siblings, 1 reply; 18+ messages in thread
From: Ravi Bangoria @ 2023-07-04 12:33 UTC (permalink / raw)
  To: James Clark, Ian Rogers
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Namhyung Kim,
	Ravi Bangoria

>> For heterogeneous ARM PMUs I believe this is exactly the path taken to
>> cause the perf_event_open to fail. The resolution is to add the bit to
>> the PMU's capabilities. I believe doing it here will suffice:
>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm_pmu.c?h=perf/core#n877
>> So:
>> ```
>> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS | PERF_PMU_CAP_EXTENDED_REGS,
>> ```
>> becomes:
>> ```
>> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS |
>> PERF_PMU_CAP_EXTENDED_REGS | PERF_PMU_CAP_EXTENDED_HW_TYPE,
>> ```
>> Now strictly speaking you only need this in the case of heterogeneous
>> PMUs, but that's the case for other bits here and I don't see special
>> case logic...
>>
> 
> Hi Ian,
> 
> I tested the above and it seems to work fine on Arm so I will send it
> shortly. Some of the legacy perf stat use cases are working better with
> it so it makes sense to me. I also don't see a reason to not always do
> it even for homogeneous PMUs. To be honest I don't see a reason not to
> remove it as a capability and have it as the default behaviour
> everywhere? It all happens in the core code so I'm not sure adding it
> per PMU device is necessary, but maybe there is a reason?

EventSelect on AMD is 12 bits where upper nibble is passed via config[35:32].
i.e. it conflicts with the semantics of PERF_PMU_CAP_EXTENDED_HW_TYPE. So,
please don't make it default :)

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-07-04 12:33                 ` Ravi Bangoria
@ 2023-07-04 12:44                   ` James Clark
  2023-07-04 14:47                     ` Ravi Bangoria
  0 siblings, 1 reply; 18+ messages in thread
From: James Clark @ 2023-07-04 12:44 UTC (permalink / raw)
  To: Ravi Bangoria, Ian Rogers
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Namhyung Kim



On 04/07/2023 13:33, Ravi Bangoria wrote:
>>> For heterogeneous ARM PMUs I believe this is exactly the path taken to
>>> cause the perf_event_open to fail. The resolution is to add the bit to
>>> the PMU's capabilities. I believe doing it here will suffice:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm_pmu.c?h=perf/core#n877
>>> So:
>>> ```
>>> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS | PERF_PMU_CAP_EXTENDED_REGS,
>>> ```
>>> becomes:
>>> ```
>>> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS |
>>> PERF_PMU_CAP_EXTENDED_REGS | PERF_PMU_CAP_EXTENDED_HW_TYPE,
>>> ```
>>> Now strictly speaking you only need this in the case of heterogeneous
>>> PMUs, but that's the case for other bits here and I don't see special
>>> case logic...
>>>
>>
>> Hi Ian,
>>
>> I tested the above and it seems to work fine on Arm so I will send it
>> shortly. Some of the legacy perf stat use cases are working better with
>> it so it makes sense to me. I also don't see a reason to not always do
>> it even for homogeneous PMUs. To be honest I don't see a reason not to
>> remove it as a capability and have it as the default behaviour
>> everywhere? It all happens in the core code so I'm not sure adding it
>> per PMU device is necessary, but maybe there is a reason?
> 
> EventSelect on AMD is 12 bits where upper nibble is passed via config[35:32].
> i.e. it conflicts with the semantics of PERF_PMU_CAP_EXTENDED_HW_TYPE. So,
> please don't make it default :)
> 
> Thanks,
> Ravi

Ah, thanks for the answer. It's a bit late now but I wonder why it
wasn't added as a completely new argument to perf_event_open. That would
have avoided that. Is there a difference between adding a completely new
argument and adding a new argument by stuffing something into an
existing field.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-07-04 12:44                   ` James Clark
@ 2023-07-04 14:47                     ` Ravi Bangoria
  2023-07-04 17:59                       ` Ian Rogers
  0 siblings, 1 reply; 18+ messages in thread
From: Ravi Bangoria @ 2023-07-04 14:47 UTC (permalink / raw)
  To: James Clark, Ian Rogers
  Cc: John Garry, Will Deacon, Leo Yan, Mike Leach, Kan Liang,
	Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Namhyung Kim,
	Ravi Bangoria

On 04-Jul-23 6:14 PM, James Clark wrote:
> 
> 
> On 04/07/2023 13:33, Ravi Bangoria wrote:
>>>> For heterogeneous ARM PMUs I believe this is exactly the path taken to
>>>> cause the perf_event_open to fail. The resolution is to add the bit to
>>>> the PMU's capabilities. I believe doing it here will suffice:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm_pmu.c?h=perf/core#n877
>>>> So:
>>>> ```
>>>> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS | PERF_PMU_CAP_EXTENDED_REGS,
>>>> ```
>>>> becomes:
>>>> ```
>>>> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS |
>>>> PERF_PMU_CAP_EXTENDED_REGS | PERF_PMU_CAP_EXTENDED_HW_TYPE,
>>>> ```
>>>> Now strictly speaking you only need this in the case of heterogeneous
>>>> PMUs, but that's the case for other bits here and I don't see special
>>>> case logic...
>>>>
>>>
>>> Hi Ian,
>>>
>>> I tested the above and it seems to work fine on Arm so I will send it
>>> shortly. Some of the legacy perf stat use cases are working better with
>>> it so it makes sense to me. I also don't see a reason to not always do
>>> it even for homogeneous PMUs. To be honest I don't see a reason not to
>>> remove it as a capability and have it as the default behaviour
>>> everywhere? It all happens in the core code so I'm not sure adding it
>>> per PMU device is necessary, but maybe there is a reason?
>>
>> EventSelect on AMD is 12 bits where upper nibble is passed via config[35:32].
>> i.e. it conflicts with the semantics of PERF_PMU_CAP_EXTENDED_HW_TYPE. So,
>> please don't make it default :)
>>
>> Thanks,
>> Ravi
> 
> Ah, thanks for the answer. It's a bit late now but I wonder why it
> wasn't added as a completely new argument to perf_event_open. That would
> have avoided that. Is there a difference between adding a completely new
> argument and adding a new argument by stuffing something into an
> existing field.

My bad. let me take my words back. I forgot that perf uses generalized event
codes for PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE and does not use actual
raw config.

Anyway, I think there is still a bit of ambiguity with this code. For ex,

  type = PERF_TYPE_HARDWARE;
  config = 0xU00000002;

counts instructions for 0xU = 0x0 / 0x1 / 0x3 / 0x4 / invalid pmu `type`,
otherwise perf_event_open() fails. See below simple test:

$ cat test.c
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
               int cpu, int group_fd, unsigned long flags)
{
        int ret;

        ret = syscall(SYS_perf_event_open, hw_event, pid, cpu,
                      group_fd, flags);
        return ret;
}

void loop(unsigned long config_upper_32)
{
        struct perf_event_attr pe;
        int fd;

        memset(&pe, 0, sizeof(pe));
        pe.type = PERF_TYPE_HARDWARE;
        pe.size = sizeof(pe);
        pe.config = config_upper_32 | 0x2;
        pe.disabled = 1;
        pe.exclude_kernel = 1;
        pe.exclude_hv = 1;

        fd = perf_event_open(&pe, 0, -1, -1, 0);
        if (fd == -1) {
                printf("type: %d, config: 0x%llx: Error\n", pe.type, pe.config);
                return;
        }

        printf("type: %d, config: 0x%llx: Ok\n", pe.type, pe.config);
        close(fd);
}

int main(void)
{
        unsigned long i;

        for (i = 0; i < 0x20; i++)
                loop(i << 32);
        return 0;
}

$ gcc -Wall -g test.c

$ ./a.out
type: 0, config: 0x2: Ok
type: 0, config: 0x100000002: Ok
type: 0, config: 0x200000002: Error
type: 0, config: 0x300000002: Ok
type: 0, config: 0x400000002: Ok
type: 0, config: 0x500000002: Error
type: 0, config: 0x600000002: Error
type: 0, config: 0x700000002: Error
type: 0, config: 0x800000002: Error
type: 0, config: 0x900000002: Error
type: 0, config: 0xa00000002: Error
type: 0, config: 0xb00000002: Error
type: 0, config: 0xc00000002: Error
type: 0, config: 0xd00000002: Error
type: 0, config: 0xe00000002: Error
type: 0, config: 0xf00000002: Error
type: 0, config: 0x1000000002: Error
type: 0, config: 0x1100000002: Error
type: 0, config: 0x1200000002: Ok
type: 0, config: 0x1300000002: Ok
type: 0, config: 0x1400000002: Ok
type: 0, config: 0x1500000002: Ok
type: 0, config: 0x1600000002: Ok
type: 0, config: 0x1700000002: Ok
type: 0, config: 0x1800000002: Ok
type: 0, config: 0x1900000002: Ok
type: 0, config: 0x1a00000002: Ok
type: 0, config: 0x1b00000002: Ok
type: 0, config: 0x1c00000002: Ok
type: 0, config: 0x1d00000002: Ok
type: 0, config: 0x1e00000002: Ok
type: 0, config: 0x1f00000002: Ok
$

Max pmu `type` value on this machine is 17 (i.e. 0x11).

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: perf-tools-next: Request for testing hybrid changes
  2023-07-04 14:47                     ` Ravi Bangoria
@ 2023-07-04 17:59                       ` Ian Rogers
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Rogers @ 2023-07-04 17:59 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: James Clark, John Garry, Will Deacon, Leo Yan, Mike Leach,
	Kan Liang, Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Athira Rajeev, linux-perf-users, Thomas Richter, Namhyung Kim,
	Mark Rutland

On Tue, Jul 4, 2023 at 7:48 AM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>
> On 04-Jul-23 6:14 PM, James Clark wrote:
> >
> >
> > On 04/07/2023 13:33, Ravi Bangoria wrote:
> >>>> For heterogeneous ARM PMUs I believe this is exactly the path taken to
> >>>> cause the perf_event_open to fail. The resolution is to add the bit to
> >>>> the PMU's capabilities. I believe doing it here will suffice:
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/drivers/perf/arm_pmu.c?h=perf/core#n877
> >>>> So:
> >>>> ```
> >>>> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS | PERF_PMU_CAP_EXTENDED_REGS,
> >>>> ```
> >>>> becomes:
> >>>> ```
> >>>> .capabilities = PERF_PMU_CAP_HETEROGENEOUS_CPUS |
> >>>> PERF_PMU_CAP_EXTENDED_REGS | PERF_PMU_CAP_EXTENDED_HW_TYPE,
> >>>> ```
> >>>> Now strictly speaking you only need this in the case of heterogeneous
> >>>> PMUs, but that's the case for other bits here and I don't see special
> >>>> case logic...
> >>>>
> >>>
> >>> Hi Ian,
> >>>
> >>> I tested the above and it seems to work fine on Arm so I will send it
> >>> shortly. Some of the legacy perf stat use cases are working better with
> >>> it so it makes sense to me. I also don't see a reason to not always do
> >>> it even for homogeneous PMUs. To be honest I don't see a reason not to
> >>> remove it as a capability and have it as the default behaviour
> >>> everywhere? It all happens in the core code so I'm not sure adding it
> >>> per PMU device is necessary, but maybe there is a reason?

I don't see a reason. I suspect the original reason was to avoid
potential regressions. The only user of the bit is Intel's hybrid, so
I think it should be straightforward to get rid of it.

> >>
> >> EventSelect on AMD is 12 bits where upper nibble is passed via config[35:32].
> >> i.e. it conflicts with the semantics of PERF_PMU_CAP_EXTENDED_HW_TYPE. So,
> >> please don't make it default :)
> >>
> >> Thanks,
> >> Ravi
> >
> > Ah, thanks for the answer. It's a bit late now but I wonder why it
> > wasn't added as a completely new argument to perf_event_open. That would
> > have avoided that. Is there a difference between adding a completely new
> > argument and adding a new argument by stuffing something into an
> > existing field.
>
> My bad. let me take my words back. I forgot that perf uses generalized event
> codes for PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE and does not use actual
> raw config.

Yep. I suspect these type encodings were based on oprofile's behavior
a long time ago. So there are a bunch of legacy types and then after
those numbers come the dynamically registered PMUs that either need to
have json or sysfs event names:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n29

Unfortunately some of the more popular events, and default for things
like perf record, are legacy ones. You can see the hard coding of the
event names in the perf tool here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n341

> Anyway, I think there is still a bit of ambiguity with this code. For ex,
>
>   type = PERF_TYPE_HARDWARE;
>   config = 0xU00000002;
>
> counts instructions for 0xU = 0x0 / 0x1 / 0x3 / 0x4 / invalid pmu `type`,
> otherwise perf_event_open() fails. See below simple test:
>
> $ cat test.c
> #include <linux/perf_event.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <sys/syscall.h>
> #include <unistd.h>
>
> static long
> perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
>                int cpu, int group_fd, unsigned long flags)
> {
>         int ret;
>
>         ret = syscall(SYS_perf_event_open, hw_event, pid, cpu,
>                       group_fd, flags);
>         return ret;
> }
>
> void loop(unsigned long config_upper_32)
> {
>         struct perf_event_attr pe;
>         int fd;
>
>         memset(&pe, 0, sizeof(pe));
>         pe.type = PERF_TYPE_HARDWARE;
>         pe.size = sizeof(pe);
>         pe.config = config_upper_32 | 0x2;
>         pe.disabled = 1;
>         pe.exclude_kernel = 1;
>         pe.exclude_hv = 1;
>
>         fd = perf_event_open(&pe, 0, -1, -1, 0);
>         if (fd == -1) {
>                 printf("type: %d, config: 0x%llx: Error\n", pe.type, pe.config);
>                 return;
>         }
>
>         printf("type: %d, config: 0x%llx: Ok\n", pe.type, pe.config);
>         close(fd);
> }
>
> int main(void)
> {
>         unsigned long i;
>
>         for (i = 0; i < 0x20; i++)
>                 loop(i << 32);
>         return 0;
> }
>
> $ gcc -Wall -g test.c
>
> $ ./a.out
> type: 0, config: 0x2: Ok
> type: 0, config: 0x100000002: Ok
> type: 0, config: 0x200000002: Error
> type: 0, config: 0x300000002: Ok
> type: 0, config: 0x400000002: Ok
> type: 0, config: 0x500000002: Error
> type: 0, config: 0x600000002: Error
> type: 0, config: 0x700000002: Error
> type: 0, config: 0x800000002: Error
> type: 0, config: 0x900000002: Error
> type: 0, config: 0xa00000002: Error
> type: 0, config: 0xb00000002: Error
> type: 0, config: 0xc00000002: Error
> type: 0, config: 0xd00000002: Error
> type: 0, config: 0xe00000002: Error
> type: 0, config: 0xf00000002: Error
> type: 0, config: 0x1000000002: Error
> type: 0, config: 0x1100000002: Error
> type: 0, config: 0x1200000002: Ok
> type: 0, config: 0x1300000002: Ok
> type: 0, config: 0x1400000002: Ok
> type: 0, config: 0x1500000002: Ok
> type: 0, config: 0x1600000002: Ok
> type: 0, config: 0x1700000002: Ok
> type: 0, config: 0x1800000002: Ok
> type: 0, config: 0x1900000002: Ok
> type: 0, config: 0x1a00000002: Ok
> type: 0, config: 0x1b00000002: Ok
> type: 0, config: 0x1c00000002: Ok
> type: 0, config: 0x1d00000002: Ok
> type: 0, config: 0x1e00000002: Ok
> type: 0, config: 0x1f00000002: Ok
> $
>
> Max pmu `type` value on this machine is 17 (i.e. 0x11).

I tried this on an Alderlake with an older kernel. I increased the
loop to 64 values (0x40) as the max pmu type was 30:

$ uname -a
Linux alderlake 5.15.0-72-generic #79~20.04.1-Ubuntu SMP Thu Apr 20
22:12:07 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ cat /sys/bus/event_source/devices/*/type|sort -n
1
2
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
$ ./a.out
type: 0, config: 0x2: Ok
type: 0, config: 0x100000002: Ok
type: 0, config: 0x200000002: Error
type: 0, config: 0x300000002: Ok
type: 0, config: 0x400000002: Ok
type: 0, config: 0x500000002: Error
type: 0, config: 0x600000002: Error
type: 0, config: 0x700000002: Error
type: 0, config: 0x800000002: Ok
type: 0, config: 0x900000002: Error
type: 0, config: 0xa00000002: Error
type: 0, config: 0xb00000002: Error
type: 0, config: 0xc00000002: Error
type: 0, config: 0xd00000002: Error
type: 0, config: 0xe00000002: Error
type: 0, config: 0xf00000002: Error
type: 0, config: 0x1000000002: Error
type: 0, config: 0x1100000002: Error
type: 0, config: 0x1200000002: Error
type: 0, config: 0x1300000002: Error
type: 0, config: 0x1400000002: Error
type: 0, config: 0x1500000002: Error
type: 0, config: 0x1600000002: Error
type: 0, config: 0x1700000002: Error
type: 0, config: 0x1800000002: Error
type: 0, config: 0x1900000002: Error
type: 0, config: 0x1a00000002: Error
type: 0, config: 0x1b00000002: Error
type: 0, config: 0x1c00000002: Error
type: 0, config: 0x1d00000002: Error
type: 0, config: 0x1e00000002: Error
type: 0, config: 0x1f00000002: Ok
type: 0, config: 0x2000000002: Ok
type: 0, config: 0x2100000002: Ok
type: 0, config: 0x2200000002: Ok
type: 0, config: 0x2300000002: Ok
type: 0, config: 0x2400000002: Ok
type: 0, config: 0x2500000002: Ok
type: 0, config: 0x2600000002: Ok
type: 0, config: 0x2700000002: Ok
type: 0, config: 0x2800000002: Ok
type: 0, config: 0x2900000002: Ok
type: 0, config: 0x2a00000002: Ok
type: 0, config: 0x2b00000002: Ok
type: 0, config: 0x2c00000002: Ok
type: 0, config: 0x2d00000002: Ok
type: 0, config: 0x2e00000002: Ok
type: 0, config: 0x2f00000002: Ok
type: 0, config: 0x3000000002: Ok
type: 0, config: 0x3100000002: Ok
type: 0, config: 0x3200000002: Ok
type: 0, config: 0x3300000002: Ok
type: 0, config: 0x3400000002: Ok
type: 0, config: 0x3500000002: Ok
type: 0, config: 0x3600000002: Ok
type: 0, config: 0x3700000002: Ok
type: 0, config: 0x3800000002: Ok
type: 0, config: 0x3900000002: Ok
type: 0, config: 0x3a00000002: Ok
type: 0, config: 0x3b00000002: Ok
type: 0, config: 0x3c00000002: Ok
type: 0, config: 0x3d00000002: Ok
type: 0, config: 0x3e00000002: Ok
type: 0, config: 0x3f00000002: Ok


Digging into the numbers a little:
0 is PERF_TYPE_HARDWARE but also the legacy 0 value. I think it should
succeed. It is not clear which PMU you get without a CPU argument to
perf_event_open, on ARM Mark told me it depends on the order the PMU
devices register.
2 is PERF_TYPE_SOFTWARE and I think it should fail.
3 is PERF_TYPE_HW_CACHE and it would be weird to use
PERF_TYPE_HARDWARE with an extended type of PERF_TYPE_HW_CACHE. I
think it should fail.
4 is PERF_TYPE_RAW and "cpu_core", it should succeed but on systems
like ARM where PERF_TYPE_RAW isn't a PMU this should probably fail.
8 is cpu_atom and should succeed.
>=31 (0x1f) there are no dynamically added PMUs with these types, I think they should all fail.

Thanks,
Ian

> Thanks,
> Ravi

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-07-04 18:00 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-22 18:57 perf-tools-next: Request for testing hybrid changes Namhyung Kim
2023-06-22 20:15 ` Ian Rogers
2023-06-23  8:42   ` James Clark
2023-06-23 14:30     ` Ian Rogers
2023-06-23 17:59       ` Ian Rogers
2023-06-26 12:32       ` Will Deacon
2023-06-26 14:39         ` James Clark
2023-06-26 15:09           ` Will Deacon
2023-06-26 16:11             ` Ian Rogers
2023-06-29 11:13       ` James Clark
2023-06-30 15:30         ` Ian Rogers
2023-06-30 16:14           ` James Clark
2023-06-30 16:45             ` Ian Rogers
2023-07-04 11:30               ` James Clark
2023-07-04 12:33                 ` Ravi Bangoria
2023-07-04 12:44                   ` James Clark
2023-07-04 14:47                     ` Ravi Bangoria
2023-07-04 17:59                       ` Ian Rogers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.