All of lore.kernel.org
 help / color / mirror / Atom feed
* Event reordering regression for software events
@ 2023-07-13 20:37 Andi Kleen
  2023-07-13 20:48 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2023-07-13 20:37 UTC (permalink / raw)
  To: irogers, namhyung, acme, linux-perf-users

Hi Ian,

The current post 6.3 linux-perf-next causes some fairly drastic
reordering in output for some event combinations. I bisected it down to
your patch:

commit 347c2f0a0988c59c402148054ef54648853fa669
Author: Ian Rogers <irogers@google.com>
Date:   Sat Mar 11 18:15:40 2023 -0800

    perf parse-events: Sort and group parsed events

    This change is intended to be a no-op for most current cases, the
    default sort order is the order the events were parsed. Where it
    varies is in how groups are handled. Previously an uncore and core
    event that are grouped would most often cause the group to be removed:

I currently only have a large test case, but for the test case [1] below 
most of the "dummy"s and some other software events end up in a row and also in the
wrong place, instead of the expected order.

good:

...
986476553;ns;duration_time;986476553;100.00;;
0;;dummy;983952237;100.00;;
0;;cs:u;983952237;100.00;;
16532;;minor-faults:u;983952237;100.00;;
0;;major-faults:u;983952237;100.00;;
0;;migrations:u;983952237;100.00;;
3532509496;;cycles:u;23999092;2.00;;
....

bad:

1010119571;ns;duration_time;1010119571;100.00;;
0;;dummy [software];1007748753;100.00;;
0;;cs:u [software];1007748753;100.00;;
16496;;minor-faults:u [software];1007748753;100.00;;
0;;major-faults:u [software];1007748753;100.00;;
0;;migrations:u [software];1007748753;100.00;;
0;;dummy [software];1007748753;100.00;;
0;;dummy [software];1007748753;100.00;;
0;;dummy [software];1007748753;100.00;;
0;;dummy [software];1007748753;100.00;;
0;;emulation-faults [software];1007748753;100.00;;
0;;dummy [software];1007748753;100.00;;
0;;dummy [software];1007748753;100.00;;
0;;emulation-faults [software];1007748753;100.00;;
0;;emulation-faults [software];1007748753;100.00;;
3603193382;;cycles:u;23996241;2.00;;
2277091922;;cpu/event=0x0,umask=0x3/u;23996241;2.00;;
4182126406;;cpu/event=0xc2,umask=0x2/u;23996241;2.00;;
4364677170;;cpu/event=0xc0,umask=0x0/u;23996241;2.00;;


Unfortunately this totally breaks toplev. It needs to have the dummies
in the right location.`

Another problem is that it also now adds [software] to lots of software
events, which of course also breaks any parsing tools. I haven't bisected
that too, but it needs fixing too.

I must say using perf recently is frustrating with all the regressions!
Please be a bit more careful with compatibility.

-Andi



[1] 

$PERF stat -x\; --no-merge -o xxx -e '{cpu/event=0xc0,umask=0x0/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x0,umask=0x3/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u,cpu/event=0xb1,umask=0x1/u,cpu/event=0xb1,umask=0x2,cmask=1/u},msr/tsc/,duration_time,dummy,uncore_imc/event=0x4,umask=0x3/,uncore_imc/event=0x4,umask=0xc/,cs:u,minor-faults:u,major-faults:u,migrations:u,{cycles:u,cpu/event=0x0,umask=0x3/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xc0,umask=0x0/u,cpu/event=0xc4,umask=0x40/u,cpu/event=0xb1,umask=0x1/u,cpu/event=0xe,umask=0x1/u},{cpu/event=0xc2,umask=0x2/u,cpu/event=0xc2,umask=0x2,cmask=1/u,cpu/event=0x3c,umask=0x0/ku,cpu/event=0xc0,umask=0x0/ku},{cpu/event=0x79,umask=0x8/u,cpu/event=0xa8,umask=0x1/u,cpu/event=0x79,umask=0x4/u,cpu/event=0x79,umask=0x30/u,cpu/event=0xc0,umask=0x0/u,cpu/event=0x3c,umask=0x0/u},{cpu/event=0x10,umask=0x20/u,cpu/event=0x10,umask=0x80/u,cpu/event=0x10,umask=0x10/u,cpu/event=0x10,umask=0x40/u},{cpu/event=0x11,umask=0x2/u,cpu/event=0x11,umask=0x1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u},{cpu/event=0x10,umask=0x20/u,cpu/event=0x10,umask=0x80/u,cpu/event=0x10,umask=0x10/u,cpu/event=0x10,umask=0x40/u,cpu/event=0x3c,umask=0x0/u},{cpu/event=0x3c,umask=0x0/ku,cpu/event=0x3c,umask=0x1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xd,umask=0x3,cmask=1/u},{cpu/event=0x9c,umask=0x1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u,cpu/event=0xc2,umask=0x2/u},{cpu/event=0xe,umask=0x1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u},{cpu/event=0x9c,umask=0x1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u,cpu/event=0xe,umask=0x1/u},{cpu/event=0x3c,umask=0x0/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u,cpu/event=0x9c,umask=0x1/u,cpu/event=0xc0,umask=0x0/u},{cpu/event=0xc2,umask=0x2/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u},{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0x79,umask=0x30/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u},{cpu/event=0xe,umask=0x1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u},{cpu/event=0xc5,umask=0x0/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0x3c,umask=0x1/u},dummy,{cpu/event=0xe,umask=0x1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u},{cpu/event=0x9c,umask=0x1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u,cpu/event=0xe,umask=0x1/u},dummy,{cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xa3,umask=0x4,cmask=4/u,cpu/event=0xb1,umask=0x1,cmask=1/u},{cpu/event=0xb1,umask=0x1,cmask=3/u,cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xc0,umask=0x0/u,cpu/event=0x5e,umask=0x1/u,cpu/event=0x9c,umask=0x1,cmask=4/u},{cpu/event=0x9c,umask=0x1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u,cpu/event=0xe,umask=0x1/u},{cpu/event=0xc2,umask=0x2/u,cpu/event=0xd,umask=0x3,cmask=1/u},dummy,dummy,emulation-faults,dummy,{cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xa3,umask=0x4,cmask=4/u,cpu/event=0xb1,umask=0x1,cmask=1/u},{cpu/event=0xb1,umask=0x1,cmask=3/u,cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xc0,umask=0x0/u,cpu/event=0x5e,umask=0x1/u,cpu/event=0x9c,umask=0x1,cmask=4/u},{cpu/event=0x79,umask=0x30,edge=1,cmask=1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x87,umask=0x1/u,cpu/event=0xab,umask=0x2/u,cpu/event=0xa2,umask=0x8/u},{cpu/event=0x85,umask=0x10/u,cpu/event=0x85,umask=0x4/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x8,umask=0x10/u,cpu/event=0x8,umask=0x4/u},{cpu/event=0xc5,umask=0x0/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xe6,umask=0x1f/u,cpu/event=0x3c,umask=0x0/u},{cpu/event=0xd1,umask=0x4/u,cpu/event=0xd1,umask=0x20/u,cpu/event=0xa3,umask=0x5,cmask=5/u,cpu/event=0x3c,umask=0x0/u},{cpu/event=0x14,umask=0x1/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u,cpu/event=0x11,umask=0x2/u},{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0x79,umask=0x30/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x3c,umask=0x2/u},{cpu/event=0xc2,umask=0x2/u,cpu/event=0x10,umask=0x1/u,cpu/event=0xb1,umask=0x1/u,cpu/event=0x10,umask=0x20/u},{cpu/event=0x10,umask=0x80/u,cpu/event=0x10,umask=0x10/u,cpu/event=0x10,umask=0x40/u,cpu/event=0x11,umask=0x1/u},{cpu/event=0x3c,umask=0x0/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0x3c,umask=0x2/u,cpu/event=0x3c,umask=0x1/u},dummy,{cpu/event=0xa3,umask=0x4,cmask=4/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xb1,umask=0x1,cmask=3/u,cpu/event=0xb1,umask=0x1,cmask=2/u},{cpu/event=0xc0,umask=0x0/u,cpu/event=0x5e,umask=0x1/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x3c,umask=0x0/u,cpu/event=0x60,umask=0x8,cmask=6/u},{cpu/event=0x3c,umask=0x0/u,cpu/event=0x60,umask=0x8,cmask=1/u,cpu/event=0x60,umask=0x8,cmask=6/u},{cpu/event=0xc2,umask=0x2/u,cpu/event=0x10,umask=0x1/u,cpu/event=0xb1,umask=0x1/u},{cpu/event=0x10,umask=0x20/u,cpu/event=0x10,umask=0x80/u,cpu/event=0xb1,umask=0x1/u,cpu/event=0x10,umask=0x10/u},emulation-faults,{cpu/event=0x10,umask=0x10/u,cpu/event=0x10,umask=0x40/u,cpu/event=0x11,umask=0x1/u,cpu/event=0x11,umask=0x2/u},emulation-faults,{cpu/event=0x11,umask=0x2/u,cpu/event=0x11,umask=0x1/u,cpu/event=0xb1,umask=0x1/u}'




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-13 20:37 Event reordering regression for software events Andi Kleen
@ 2023-07-13 20:48 ` Arnaldo Carvalho de Melo
  2023-07-13 21:24   ` Ian Rogers
  2023-07-13 23:55   ` Andi Kleen
  0 siblings, 2 replies; 13+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-07-13 20:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: irogers, namhyung, linux-perf-users

Em Thu, Jul 13, 2023 at 01:37:00PM -0700, Andi Kleen escreveu:
> Hi Ian,
> 
> The current post 6.3 linux-perf-next causes some fairly drastic
> reordering in output for some event combinations. I bisected it down to
> your patch:
> 
> commit 347c2f0a0988c59c402148054ef54648853fa669
> Author: Ian Rogers <irogers@google.com>
> Date:   Sat Mar 11 18:15:40 2023 -0800
> 
>     perf parse-events: Sort and group parsed events
> 
>     This change is intended to be a no-op for most current cases, the
>     default sort order is the order the events were parsed. Where it
>     varies is in how groups are handled. Previously an uncore and core
>     event that are grouped would most often cause the group to be removed:
> 
> I currently only have a large test case, but for the test case [1] below 
> most of the "dummy"s and some other software events end up in a row and also in the
> wrong place, instead of the expected order.
> 
> good:
> 
> ...
> 986476553;ns;duration_time;986476553;100.00;;
> 0;;dummy;983952237;100.00;;
> 0;;cs:u;983952237;100.00;;
> 16532;;minor-faults:u;983952237;100.00;;
> 0;;major-faults:u;983952237;100.00;;
> 0;;migrations:u;983952237;100.00;;
> 3532509496;;cycles:u;23999092;2.00;;
> ....
> 
> bad:
> 
> 1010119571;ns;duration_time;1010119571;100.00;;
> 0;;dummy [software];1007748753;100.00;;
> 0;;cs:u [software];1007748753;100.00;;
> 16496;;minor-faults:u [software];1007748753;100.00;;
> 0;;major-faults:u [software];1007748753;100.00;;
> 0;;migrations:u [software];1007748753;100.00;;
> 0;;dummy [software];1007748753;100.00;;
> 0;;dummy [software];1007748753;100.00;;
> 0;;dummy [software];1007748753;100.00;;
> 0;;dummy [software];1007748753;100.00;;
> 0;;emulation-faults [software];1007748753;100.00;;
> 0;;dummy [software];1007748753;100.00;;
> 0;;dummy [software];1007748753;100.00;;
> 0;;emulation-faults [software];1007748753;100.00;;
> 0;;emulation-faults [software];1007748753;100.00;;
> 3603193382;;cycles:u;23996241;2.00;;
> 2277091922;;cpu/event=0x0,umask=0x3/u;23996241;2.00;;
> 4182126406;;cpu/event=0xc2,umask=0x2/u;23996241;2.00;;
> 4364677170;;cpu/event=0xc0,umask=0x0/u;23996241;2.00;;
> 
> 
> Unfortunately this totally breaks toplev. It needs to have the dummies
> in the right location.`
> 
> Another problem is that it also now adds [software] to lots of software
> events, which of course also breaks any parsing tools. I haven't bisected
> that too, but it needs fixing too.
> 
> I must say using perf recently is frustrating with all the regressions!
> Please be a bit more careful with compatibility.

That is really unfortunate and indeed perf passed thru a lot of changes
trying to properly support hybrid systems and verify reference counting
correctness, a necessary step to resume attempts at being multithreaded.

We have seen a growing number of 'perf test' entries being submitted,
with writing shell scripts that check if the output from various command
lines produce expected results.

It would really be great to have new entries that exercise what is
expected by toplev so that we really are careful with compatibility.

- Arnaldo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-13 20:48 ` Arnaldo Carvalho de Melo
@ 2023-07-13 21:24   ` Ian Rogers
  2023-07-13 23:51     ` Andi Kleen
  2023-07-13 23:55   ` Andi Kleen
  1 sibling, 1 reply; 13+ messages in thread
From: Ian Rogers @ 2023-07-13 21:24 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Andi Kleen, namhyung, linux-perf-users

On Thu, Jul 13, 2023 at 1:48 PM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> Em Thu, Jul 13, 2023 at 01:37:00PM -0700, Andi Kleen escreveu:
> > Hi Ian,
> >
> > The current post 6.3 linux-perf-next causes some fairly drastic
> > reordering in output for some event combinations. I bisected it down to
> > your patch:
> >
> > commit 347c2f0a0988c59c402148054ef54648853fa669
> > Author: Ian Rogers <irogers@google.com>
> > Date:   Sat Mar 11 18:15:40 2023 -0800
> >
> >     perf parse-events: Sort and group parsed events
> >
> >     This change is intended to be a no-op for most current cases, the
> >     default sort order is the order the events were parsed. Where it
> >     varies is in how groups are handled. Previously an uncore and core
> >     event that are grouped would most often cause the group to be removed:
> >

So this is a no-op for most current cases where event groups aren't
used. The default order for events is the order they were inserted in.
Unfortunately Intel added perf metric, aka topdown, events where we
need to add or regroup events and also multiple vendors did
BIG.little/hybrid. Previously we'd sort and regroup uncore events
like:

perf stat -e '{uncore_imc_free_running/data_read/,uncore_imc_free_running/data_write/}'
...

to be say two groups of:

{uncore_imc_free_running_0/data_read/,uncore_imc_free_running_1/data_write/}
{uncore_imc_free_running_1/data_read/,uncore_imc_free_running_1/data_write/}

Rather than how they appear after parsing of:

{uncore_imc_free_running_0/data_read/,uncore_imc_free_running_1/data_read/,uncore_imc_free_running_0/data_write/,uncore_imc_free_running_1/data_write/}

Presumably toplev already supports this. With hybrid, what is true for
uncore PMUs is now true for core PMUs, so:

perf stat -e '{instructions,cycles}' ...

becomes:

{cpu_atom/instructions/,cpu_atom/cycles/}
{cpu_core/instructions/,cpu_core/cycles/}

Perf metric events require being in a group with the slots event
first. There are assumptions in the perf code that sibling events are
adjacent in the evlist. To fix and maintain all the requirements on
the evlist we've gotten to where we are now. I've tried to make the
code as simple and transparent as possible, but still fulfilling the
requirements coming from perf metric events and hybrid. The logic is
in parse_events__sort_events_and_fix_groups:

https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n2139

where the compare function evlist__cmp and
evsel__compute_group_pmu_name are key. The "group" PMU name is what
attempts to place all events that belong in the same group together,
which include software and aux events.

> > I currently only have a large test case, but for the test case [1] below
> > most of the "dummy"s and some other software events end up in a row and also in the
> > wrong place, instead of the expected order.
> >
> > good:
> >
> > ...
> > 986476553;ns;duration_time;986476553;100.00;;
> > 0;;dummy;983952237;100.00;;
> > 0;;cs:u;983952237;100.00;;
> > 16532;;minor-faults:u;983952237;100.00;;
> > 0;;major-faults:u;983952237;100.00;;
> > 0;;migrations:u;983952237;100.00;;
> > 3532509496;;cycles:u;23999092;2.00;;
> > ....
> >
> > bad:
> >
> > 1010119571;ns;duration_time;1010119571;100.00;;
> > 0;;dummy [software];1007748753;100.00;;
> > 0;;cs:u [software];1007748753;100.00;;
> > 16496;;minor-faults:u [software];1007748753;100.00;;
> > 0;;major-faults:u [software];1007748753;100.00;;
> > 0;;migrations:u [software];1007748753;100.00;;
> > 0;;dummy [software];1007748753;100.00;;
> > 0;;dummy [software];1007748753;100.00;;
> > 0;;dummy [software];1007748753;100.00;;
> > 0;;dummy [software];1007748753;100.00;;
> > 0;;emulation-faults [software];1007748753;100.00;;
> > 0;;dummy [software];1007748753;100.00;;
> > 0;;dummy [software];1007748753;100.00;;
> > 0;;emulation-faults [software];1007748753;100.00;;
> > 0;;emulation-faults [software];1007748753;100.00;;
> > 3603193382;;cycles:u;23996241;2.00;;
> > 2277091922;;cpu/event=0x0,umask=0x3/u;23996241;2.00;;
> > 4182126406;;cpu/event=0xc2,umask=0x2/u;23996241;2.00;;
> > 4364677170;;cpu/event=0xc0,umask=0x0/u;23996241;2.00;;
> >
> >
> > Unfortunately this totally breaks toplev. It needs to have the dummies
> > in the right location.`
> >
> > Another problem is that it also now adds [software] to lots of software
> > events, which of course also breaks any parsing tools. I haven't bisected
> > that too, but it needs fixing too.

There's nothing to bisect, the behavior has changed. We can revert the
behavior, break Intel hybrid and Intel perf metric events, but that
would seem unfortunate. Tbh, the mistake here is the assumptions that
toplev is making.

> >
> > I must say using perf recently is frustrating with all the regressions!
> > Please be a bit more careful with compatibility.
>
> That is really unfortunate and indeed perf passed thru a lot of changes
> trying to properly support hybrid systems and verify reference counting
> correctness, a necessary step to resume attempts at being multithreaded.
>
> We have seen a growing number of 'perf test' entries being submitted,
> with writing shell scripts that check if the output from various command
> lines produce expected results.
>
> It would really be great to have new entries that exercise what is
> expected by toplev so that we really are careful with compatibility.

We are careful with compatibility, in fact we've written a number of
tests to ensure the json and CSV output doesn't regress. Kan recently
added some tests for text output, but relying on it is likely a bad
idea. I've heard >1 proposal to do nesting of metrics and the like,
Kan recently removed events and put metric names in the text output
instead, even hiding events that have been programmed.. basically
relying on text output is wrong and toplev is broken in this regard.

Fixing hybrid and perf metric events weren't problems of my invention,
nor were they done with a casual disregard for perf users and no care
about breaking them. That said, testing can always be better so why
not add toplev's tests to tools/perf/tests/shell ? This way I can only
break things if I ignore a test regression.

Thanks,
Ian

>
> - Arnaldo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-13 21:24   ` Ian Rogers
@ 2023-07-13 23:51     ` Andi Kleen
  2023-07-14  1:14       ` Ian Rogers
  0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2023-07-13 23:51 UTC (permalink / raw)
  To: Ian Rogers; +Cc: Arnaldo Carvalho de Melo, namhyung, linux-perf-users


On Thu, Jul 13, 2023 at 02:24:22PM -0700, Ian Rogers wrote:
> On Thu, Jul 13, 2023 at 1:48 PM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> >
> > Em Thu, Jul 13, 2023 at 01:37:00PM -0700, Andi Kleen escreveu:
> > > Hi Ian,
> > >
> > > The current post 6.3 linux-perf-next causes some fairly drastic
> > > reordering in output for some event combinations. I bisected it down to
> > > your patch:
> > >
> > > commit 347c2f0a0988c59c402148054ef54648853fa669
> > > Author: Ian Rogers <irogers@google.com>
> > > Date:   Sat Mar 11 18:15:40 2023 -0800
> > >
> > >     perf parse-events: Sort and group parsed events
> > >
> > >     This change is intended to be a no-op for most current cases, the
> > >     default sort order is the order the events were parsed. Where it
> > >     varies is in how groups are handled. Previously an uncore and core
> > >     event that are grouped would most often cause the group to be removed:
> > >
> 
> So this is a no-op for most current cases where event groups aren't
> used. The default order for events is the order they were inserted in.

This doesn't seem to be what happens because the patch reorders
top-level events outside group. (like all the top level "dummy"ies)

But this commit broke things that are completely unrelated to hybrid and uncore,
just plain software events not being in a group. There's no reason 
to reorder those.

Basically anything you do for hybrid and topdown shouldn't affect
anything else. That's the basic rule.


> Perf metric events require being in a group with the slots event
> first. There are assumptions in the perf code that sibling events are
> adjacent in the evlist. To fix and maintain all the requirements on
> the evlist we've gotten to where we are now. I've tried to make the
> code as simple and transparent as possible, but still fulfilling the
> requirements coming from perf metric events and hybrid. The logic is
> in parse_events__sort_events_and_fix_groups:

This random reordering approach is just not viable. I explained
this before. If you have a tool which generates the same event in
multiple instances (for which there are many valid reasons related
to multiplexing accuracy), then it absolutely cannot handle random
reordering. The tool doesn't know which instance of its event is where
if it's not in the same position.

If you find you need reordering for your purposes and you know
you don't rely on ordering make it some kind of opt-in.
But forcing it on all perf users doesn't work.


> > > wrong place, instead of the expected order.
> > >
> > > good:
> > >
> > > ...
> > > 986476553;ns;duration_time;986476553;100.00;;
> > > 0;;dummy;983952237;100.00;;
> > > 0;;cs:u;983952237;100.00;;
> > > 16532;;minor-faults:u;983952237;100.00;;
> > > 0;;major-faults:u;983952237;100.00;;
> > > 0;;migrations:u;983952237;100.00;;
> > > 3532509496;;cycles:u;23999092;2.00;;
> > > ....
> > >
> > > bad:
> > >
> > > 1010119571;ns;duration_time;1010119571;100.00;;
> > > 0;;dummy [software];1007748753;100.00;;
> > > 0;;cs:u [software];1007748753;100.00;;
> > > 16496;;minor-faults:u [software];1007748753;100.00;;
> > > 0;;major-faults:u [software];1007748753;100.00;;
> > > 0;;migrations:u [software];1007748753;100.00;;
> > > 0;;dummy [software];1007748753;100.00;;
> > > 0;;dummy [software];1007748753;100.00;;
> > > 0;;dummy [software];1007748753;100.00;;
> > > 0;;dummy [software];1007748753;100.00;;
> > > 0;;emulation-faults [software];1007748753;100.00;;
> > > 0;;dummy [software];1007748753;100.00;;
> > > 0;;dummy [software];1007748753;100.00;;
> > > 0;;emulation-faults [software];1007748753;100.00;;
> > > 0;;emulation-faults [software];1007748753;100.00;;

Please explain how this insane order can make sense even with hybrid
or metrics. 

> > > 3603193382;;cycles:u;23996241;2.00;;
> > > 2277091922;;cpu/event=0x0,umask=0x3/u;23996241;2.00;;
> > > 4182126406;;cpu/event=0xc2,umask=0x2/u;23996241;2.00;;
> > > 4364677170;;cpu/event=0xc0,umask=0x0/u;23996241;2.00;;
> > >
> > >
> > > Unfortunately this totally breaks toplev. It needs to have the dummies
> > > in the right location.`
> > >
> > > Another problem is that it also now adds [software] to lots of software
> > > events, which of course also breaks any parsing tools. I haven't bisected
> > > that too, but it needs fixing too.
> 
> There's nothing to bisect, the behavior has changed. We can revert the
> behavior, break Intel hybrid and Intel perf metric events, but that

I've been using perf metrics for many years without problems without this patch.

I know hybrid has some issues that need to be fixed, but that
absolutely cannot break basic properties like having a parseable
output format.

> would seem unfortunate. Tbh, the mistake here is the assumptions that
> toplev is making.

It's not an random assumption, it's the only way to parse perf output uniquely.

Please make it an option. But this cannot stay this way unconditionally 
if perf stat is supposed to stay a tool which can be used from other programs.

> > correctness, a necessary step to resume attempts at being multithreaded.
> >
> > We have seen a growing number of 'perf test' entries being submitted,
> > with writing shell scripts that check if the output from various command
> > lines produce expected results.
> >
> > It would really be great to have new entries that exercise what is
> > expected by toplev so that we really are careful with compatibility.
> 
> We are careful with compatibility, in fact we've written a number of
> tests to ensure the json and CSV output doesn't regress. Kan recently

The tests are super basic and don't seem to cover any even moderately
complex cases. Yes it's great that you wrote tests, but given
the poor coverage you really cannot rely on just a few basic unit
tests here.

Given that, and the wide usage of perf and the existence of Hyrum's law the
only safe way is to avoid changing default behaviors. If you want
some behavior change make it opt-in.

> Kan recently removed events and put metric names in the text output
> instead, even hiding events that have been programmed.. basically
> relying on text output is wrong and toplev is broken in this regard.

I missed your explanation how toplev is supposed to parse the output
then instead?

> Fixing hybrid and perf metric events weren't problems of my invention,
> nor were they done with a casual disregard for perf users and no care
> about breaking them. That said, testing can always be better so why
> not add toplev's tests to tools/perf/tests/shell ? This way I can only
> break things if I ignore a test regression.

toplev generates the perf command lines dynamically and they are highly
dependent on the current system.  I don't see a simple way to make
independent tests, unless you want a big case statement for existing systems,
or add most of toplev.

I guess could add some script that is the equivalent of

git clone https://github.com/andikleen/pmu-tools
cd pmu-tools
PERF=/point/to/new/perf/binary  ./tl-tester

if that helps? The toplev test suite has reasonably good coverage.


-Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-13 20:48 ` Arnaldo Carvalho de Melo
  2023-07-13 21:24   ` Ian Rogers
@ 2023-07-13 23:55   ` Andi Kleen
  1 sibling, 0 replies; 13+ messages in thread
From: Andi Kleen @ 2023-07-13 23:55 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: irogers, namhyung, linux-perf-users

> It would really be great to have new entries that exercise what is
> expected by toplev so that we really are careful with compatibility.

It is unfortunately difficult to do, see my reply to Ian.

But if you have automated checking tests perhays you could run my test suite too?

git clone https://github.com/andikleen/pmu-tools
cd pmu-tools
PERF=/new/perf ./tl-tester 

-Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-13 23:51     ` Andi Kleen
@ 2023-07-14  1:14       ` Ian Rogers
  2023-07-14  3:07         ` Andi Kleen
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Rogers @ 2023-07-14  1:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Arnaldo Carvalho de Melo, namhyung, linux-perf-users

On Thu, Jul 13, 2023 at 4:52 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> On Thu, Jul 13, 2023 at 02:24:22PM -0700, Ian Rogers wrote:
> > On Thu, Jul 13, 2023 at 1:48 PM Arnaldo Carvalho de Melo
> > <acme@kernel.org> wrote:
> > >
> > > Em Thu, Jul 13, 2023 at 01:37:00PM -0700, Andi Kleen escreveu:
> > > > Hi Ian,
> > > >
> > > > The current post 6.3 linux-perf-next causes some fairly drastic
> > > > reordering in output for some event combinations. I bisected it down to
> > > > your patch:
> > > >
> > > > commit 347c2f0a0988c59c402148054ef54648853fa669
> > > > Author: Ian Rogers <irogers@google.com>
> > > > Date:   Sat Mar 11 18:15:40 2023 -0800
> > > >
> > > >     perf parse-events: Sort and group parsed events
> > > >
> > > >     This change is intended to be a no-op for most current cases, the
> > > >     default sort order is the order the events were parsed. Where it
> > > >     varies is in how groups are handled. Previously an uncore and core
> > > >     event that are grouped would most often cause the group to be removed:
> > > >
> >
> > So this is a no-op for most current cases where event groups aren't
> > used. The default order for events is the order they were inserted in.
>
> This doesn't seem to be what happens because the patch reorders
> top-level events outside group. (like all the top level "dummy"ies)
>
> But this commit broke things that are completely unrelated to hybrid and uncore,
> just plain software events not being in a group. There's no reason
> to reorder those.
>
> Basically anything you do for hybrid and topdown shouldn't affect
> anything else. That's the basic rule.

It is not clear that modifying the order of events in the evlist
qualifies as a breakage, and it is necessary because of complex kernel
APIs that particularly in the case of perf metric events, Intel were
responsible for.

> > Perf metric events require being in a group with the slots event
> > first. There are assumptions in the perf code that sibling events are
> > adjacent in the evlist. To fix and maintain all the requirements on
> > the evlist we've gotten to where we are now. I've tried to make the
> > code as simple and transparent as possible, but still fulfilling the
> > requirements coming from perf metric events and hybrid. The logic is
> > in parse_events__sort_events_and_fix_groups:
>
> This random reordering approach is just not viable. I explained
> this before. If you have a tool which generates the same event in
> multiple instances (for which there are many valid reasons related
> to multiplexing accuracy), then it absolutely cannot handle random
> reordering. The tool doesn't know which instance of its event is where
> if it's not in the same position.
>
> If you find you need reordering for your purposes and you know
> you don't rely on ordering make it some kind of opt-in.
> But forcing it on all perf users doesn't work.

As I keep explaining, the necessity has come from Intel.

>
> > > > wrong place, instead of the expected order.
> > > >
> > > > good:
> > > >
> > > > ...
> > > > 986476553;ns;duration_time;986476553;100.00;;
> > > > 0;;dummy;983952237;100.00;;
> > > > 0;;cs:u;983952237;100.00;;
> > > > 16532;;minor-faults:u;983952237;100.00;;
> > > > 0;;major-faults:u;983952237;100.00;;
> > > > 0;;migrations:u;983952237;100.00;;
> > > > 3532509496;;cycles:u;23999092;2.00;;
> > > > ....
> > > >
> > > > bad:
> > > >
> > > > 1010119571;ns;duration_time;1010119571;100.00;;
> > > > 0;;dummy [software];1007748753;100.00;;
> > > > 0;;cs:u [software];1007748753;100.00;;
> > > > 16496;;minor-faults:u [software];1007748753;100.00;;
> > > > 0;;major-faults:u [software];1007748753;100.00;;
> > > > 0;;migrations:u [software];1007748753;100.00;;
> > > > 0;;dummy [software];1007748753;100.00;;
> > > > 0;;dummy [software];1007748753;100.00;;
> > > > 0;;dummy [software];1007748753;100.00;;
> > > > 0;;dummy [software];1007748753;100.00;;
> > > > 0;;emulation-faults [software];1007748753;100.00;;
> > > > 0;;dummy [software];1007748753;100.00;;
> > > > 0;;dummy [software];1007748753;100.00;;
> > > > 0;;emulation-faults [software];1007748753;100.00;;
> > > > 0;;emulation-faults [software];1007748753;100.00;;
>
> Please explain how this insane order can make sense even with hybrid
> or metrics.

Tbh, I'm not sure what you are doing. linux-next is on 6.5 but you
mention 6.3. You don't list a platform for testing on. I'm not even
sure if this is a sincere bug report or just a rant that things have
changed. I've hopefully motivated and explained why things changed,
pointing to the exact location in the code that the changes in the
evlist occur.

> > > > 3603193382;;cycles:u;23996241;2.00;;
> > > > 2277091922;;cpu/event=0x0,umask=0x3/u;23996241;2.00;;
> > > > 4182126406;;cpu/event=0xc2,umask=0x2/u;23996241;2.00;;
> > > > 4364677170;;cpu/event=0xc0,umask=0x0/u;23996241;2.00;;
> > > >
> > > >
> > > > Unfortunately this totally breaks toplev. It needs to have the dummies
> > > > in the right location.`
> > > >
> > > > Another problem is that it also now adds [software] to lots of software
> > > > events, which of course also breaks any parsing tools. I haven't bisected
> > > > that too, but it needs fixing too.
> >
> > There's nothing to bisect, the behavior has changed. We can revert the
> > behavior, break Intel hybrid and Intel perf metric events, but that
>
> I've been using perf metrics for many years without problems without this patch.
>
> I know hybrid has some issues that need to be fixed, but that
> absolutely cannot break basic properties like having a parseable
> output format.

I think the last count I had hybrid was broken in about a dozen
different regards in Linux perf 6.4. It was a substantial undertaking
fixing in for 6.5, where fixes were making the tool not segv, produce
meaningful counts), etc. "some issues" is a complete
misclassification, the perf tool on a hybrid system is barely usable
in Linux 6.4. Running perf test is sufficient to confirm this.

> > would seem unfortunate. Tbh, the mistake here is the assumptions that
> > toplev is making.
>
> It's not an random assumption, it's the only way to parse perf output uniquely.
>
> Please make it an option. But this cannot stay this way unconditionally
> if perf stat is supposed to stay a tool which can be used from other programs.

You should be using the json or CSV outputs. These are intended for
tool consumption, but the formats could be better. The text output, as
argued to me by your colleagues, is for human consumption and so
should read well in preference to legacy compatibility.

> > > correctness, a necessary step to resume attempts at being multithreaded.
> > >
> > > We have seen a growing number of 'perf test' entries being submitted,
> > > with writing shell scripts that check if the output from various command
> > > lines produce expected results.
> > >
> > > It would really be great to have new entries that exercise what is
> > > expected by toplev so that we really are careful with compatibility.
> >
> > We are careful with compatibility, in fact we've written a number of
> > tests to ensure the json and CSV output doesn't regress. Kan recently
>
> The tests are super basic and don't seem to cover any even moderately
> complex cases. Yes it's great that you wrote tests, but given
> the poor coverage you really cannot rely on just a few basic unit
> tests here.
>
> Given that, and the wide usage of perf and the existence of Hyrum's law the
> only safe way is to avoid changing default behaviors. If you want
> some behavior change make it opt-in.

Sure, can we reverse decisions relating to the perf metric events API
and its grouping requirements? No, so we're stuck with an imperfect
world where the tool has to do the best it can.

For hybrid the problem of regrouping events still isn't properly
solved. Consider:

perf stat -e '{cycles,faults}' ...

On a hybrid system we get the cycles event for the cpu_atom and
cpu_core PMUs, but faults as a software event will exist once and be
grouped on only 1 of the PMUs. This behavior is clearly silly, faults
should measure both on cpu_atom and cpu_core PMUs, as it would appear
on a non-hybrid system, and we should fix this. But what you're
arguing is that we can't and shouldn't as Hyrum's law tells us that
someone is depending on the 1 PMU behavior. I don't believe this, or
if they are depending on the behavior they are silly and should
explicitly list the groups and PMUs. We need to make the tool behave
properly for increasingly complex kernel PMU event requirements.
Changing the evlist to make things work for the complex kernel event
requirements isn't the same as saying we're willfully trying to break
users, we're trying to make users succeed even as the underlying CPUs,
PMUs, etc. become more complex.

This issue is often made substantially more complex as fixing Intel's
kernel PMU driver is pushed back upon - for example, we can't rely on
weak groups, or "perf stat -e 'slots,branches,slots' -I 1000" has the
branches event multiplexing for no reason.

> > Kan recently removed events and put metric names in the text output
> > instead, even hiding events that have been programmed.. basically
> > relying on text output is wrong and toplev is broken in this regard.
>
> I missed your explanation how toplev is supposed to parse the output
> then instead?

Use the json or CSV output formats, json should be more stable as
values are labelled.

> > Fixing hybrid and perf metric events weren't problems of my invention,
> > nor were they done with a casual disregard for perf users and no care
> > about breaking them. That said, testing can always be better so why
> > not add toplev's tests to tools/perf/tests/shell ? This way I can only
> > break things if I ignore a test regression.
>
> toplev generates the perf command lines dynamically and they are highly
> dependent on the current system.  I don't see a simple way to make
> independent tests, unless you want a big case statement for existing systems,
> or add most of toplev.

Some minimal tests could be:
 - tests that repeat examples from man pages
 - tests for bugs past and present
 - some form of maximal stress test
I don't know toplev and don't have time to set this up.

Thanks,
Ian

> I guess could add some script that is the equivalent of
>
> git clone https://github.com/andikleen/pmu-tools
> cd pmu-tools
> PERF=/point/to/new/perf/binary  ./tl-tester
>
> if that helps? The toplev test suite has reasonably good coverage.
>
>
> -Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-14  1:14       ` Ian Rogers
@ 2023-07-14  3:07         ` Andi Kleen
  2023-07-14  3:56           ` Ian Rogers
  0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2023-07-14  3:07 UTC (permalink / raw)
  To: Ian Rogers; +Cc: Arnaldo Carvalho de Melo, namhyung, linux-perf-users

On Thu, Jul 13, 2023 at 06:14:16PM -0700, Ian Rogers wrote:
> On Thu, Jul 13, 2023 at 4:52 PM Andi Kleen <ak@linux.intel.com> wrote:
> >
> >
> > On Thu, Jul 13, 2023 at 02:24:22PM -0700, Ian Rogers wrote:
> > > On Thu, Jul 13, 2023 at 1:48 PM Arnaldo Carvalho de Melo
> > > <acme@kernel.org> wrote:
> > > >
> > > > Em Thu, Jul 13, 2023 at 01:37:00PM -0700, Andi Kleen escreveu:
> > > > > Hi Ian,
> > > > >
> > > > > The current post 6.3 linux-perf-next causes some fairly drastic
> > > > > reordering in output for some event combinations. I bisected it down to
> > > > > your patch:
> > > > >
> > > > > commit 347c2f0a0988c59c402148054ef54648853fa669
> > > > > Author: Ian Rogers <irogers@google.com>
> > > > > Date:   Sat Mar 11 18:15:40 2023 -0800
> > > > >
> > > > >     perf parse-events: Sort and group parsed events
> > > > >
> > > > >     This change is intended to be a no-op for most current cases, the
> > > > >     default sort order is the order the events were parsed. Where it
> > > > >     varies is in how groups are handled. Previously an uncore and core
> > > > >     event that are grouped would most often cause the group to be removed:
> > > > >
> > >
> > > So this is a no-op for most current cases where event groups aren't
> > > used. The default order for events is the order they were inserted in.
> >
> > This doesn't seem to be what happens because the patch reorders
> > top-level events outside group. (like all the top level "dummy"ies)
> >
> > But this commit broke things that are completely unrelated to hybrid and uncore,
> > just plain software events not being in a group. There's no reason
> > to reorder those.
> >
> > Basically anything you do for hybrid and topdown shouldn't affect
> > anything else. That's the basic rule.
> 
> It is not clear that modifying the order of events in the evlist
> qualifies as a breakage, 

I don't know what to say here. 

> and it is necessary because of complex kernel
> APIs that particularly in the case of perf metric events, Intel were
> responsible for.

But my example has no perf metrics.

Also taking a step back: 

You're saying reordering is needed for perf metrics because slots
has to come first in the group. 

But the expectation here was always that the user would specify
the groups this way, otherwise it doesn't work.  Do you disagree with
that?

So why does the perf tool need to fix it? Just pass it to the kernel
and it won't schedule and the user has to fix it.

Perhaps perf can give warnings to make it clearer.

I understand that the metrics code has to make sure to not generate
such groups, but that's a different problem that shouldn't affect how
non metrics events are handled.

Also I reviewed the commit in question and at least the example
doesn't even have perf metrics ?!? It's trying to fix some uncore issue.

So it seems what we're talking about here has nothing to do with 
perf metrics.

How about replacing the sorting commit with a warning and ask 
the user to do the reordering? Would that work for you?

> > > > > wrong place, instead of the expected order.
> > > > >
> > > > > good:
> > > > >
> > > > > ...
> > > > > 986476553;ns;duration_time;986476553;100.00;;
> > > > > 0;;dummy;983952237;100.00;;
> > > > > 0;;cs:u;983952237;100.00;;
> > > > > 16532;;minor-faults:u;983952237;100.00;;
> > > > > 0;;major-faults:u;983952237;100.00;;
> > > > > 0;;migrations:u;983952237;100.00;;
> > > > > 3532509496;;cycles:u;23999092;2.00;;
> > > > > ....
> > > > >
> > > > > bad:
> > > > >
> > > > > 1010119571;ns;duration_time;1010119571;100.00;;
> > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > 0;;cs:u [software];1007748753;100.00;;
> > > > > 16496;;minor-faults:u [software];1007748753;100.00;;
> > > > > 0;;major-faults:u [software];1007748753;100.00;;
> > > > > 0;;migrations:u [software];1007748753;100.00;;
> > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > 0;;emulation-faults [software];1007748753;100.00;;
> > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > 0;;emulation-faults [software];1007748753;100.00;;
> > > > > 0;;emulation-faults [software];1007748753;100.00;;
> >
> > Please explain how this insane order can make sense even with hybrid
> > or metrics.
> 
> Tbh, I'm not sure what you are doing. linux-next is on 6.5 but you
> mention 6.3. You don't list a platform for testing on. I'm not even

6.3 was my original reference point, but since I bisected the issue
it's not 6.3 but exactly just this one commit that changes behavior.

The platform is SKX.

> 
> > > would seem unfortunate. Tbh, the mistake here is the assumptions that
> > > toplev is making.
> >
> > It's not an random assumption, it's the only way to parse perf output uniquely.
> >
> > Please make it an option. But this cannot stay this way unconditionally
> > if perf stat is supposed to stay a tool which can be used from other programs.
> 
> You should be using the json or CSV outputs. These are intended for

This is the CSV output that is reordered! (see the example you just
quoted)

I haven't tested JSON, but I assume that is reordered too.

> 
> For hybrid the problem of regrouping events still isn't properly
> solved. Consider:


Yes I understand that hybrid has issues. But the first step to solving
them is to not break existing platforms.

> 
> perf stat -e '{cycles,faults}' ...
> 
> On a hybrid system we get the cycles event for the cpu_atom and
> cpu_core PMUs, but faults as a software event will exist once and be
> grouped on only 1 of the PMUs. This behavior is clearly silly, faults
> should measure both on cpu_atom and cpu_core PMUs, as it would appear
> on a non-hybrid system, and we should fix this. But what you're
> arguing is that we can't and shouldn't as Hyrum's law tells us that
> someone is depending on the 1 PMU behavior.

TBH i don't have a strong opinion on keeping hybrid the same
As you say it is somewhat broken, so I guess there are less 
existing users and some change there is acceptable, as long
as it doesn't affect the more mature platforms.

As for your example it seems somewhat obscure and I would probably
consider it user error. The non PMU qualified support was really only intended to
keep existing perf record/stat default working, for everything else
it should be discouraged. Trying to fix every corner case
here automatically is unlikely to be productive.

So perhaps just print a warning in this case? 

> explicitly list the groups and PMUs. We need to make the tool behave
> properly for increasingly complex kernel PMU event requirements.

I agree, but reordering the complete event list with a huge impact
on non hybrid is not the way to do it.


> This issue is often made substantially more complex as fixing Intel's
> kernel PMU driver is pushed back upon - for example, we can't rely on
> weak groups,

I thought they still work most of the time?

-Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-14  3:07         ` Andi Kleen
@ 2023-07-14  3:56           ` Ian Rogers
  2023-07-17  2:00             ` Andi Kleen
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Rogers @ 2023-07-14  3:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Arnaldo Carvalho de Melo, namhyung, linux-perf-users

On Thu, Jul 13, 2023 at 8:07 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> On Thu, Jul 13, 2023 at 06:14:16PM -0700, Ian Rogers wrote:
> > On Thu, Jul 13, 2023 at 4:52 PM Andi Kleen <ak@linux.intel.com> wrote:
> > >
> > >
> > > On Thu, Jul 13, 2023 at 02:24:22PM -0700, Ian Rogers wrote:
> > > > On Thu, Jul 13, 2023 at 1:48 PM Arnaldo Carvalho de Melo
> > > > <acme@kernel.org> wrote:
> > > > >
> > > > > Em Thu, Jul 13, 2023 at 01:37:00PM -0700, Andi Kleen escreveu:
> > > > > > Hi Ian,
> > > > > >
> > > > > > The current post 6.3 linux-perf-next causes some fairly drastic
> > > > > > reordering in output for some event combinations. I bisected it down to
> > > > > > your patch:
> > > > > >
> > > > > > commit 347c2f0a0988c59c402148054ef54648853fa669
> > > > > > Author: Ian Rogers <irogers@google.com>
> > > > > > Date:   Sat Mar 11 18:15:40 2023 -0800
> > > > > >
> > > > > >     perf parse-events: Sort and group parsed events
> > > > > >
> > > > > >     This change is intended to be a no-op for most current cases, the
> > > > > >     default sort order is the order the events were parsed. Where it
> > > > > >     varies is in how groups are handled. Previously an uncore and core
> > > > > >     event that are grouped would most often cause the group to be removed:
> > > > > >
> > > >
> > > > So this is a no-op for most current cases where event groups aren't
> > > > used. The default order for events is the order they were inserted in.
> > >
> > > This doesn't seem to be what happens because the patch reorders
> > > top-level events outside group. (like all the top level "dummy"ies)
> > >
> > > But this commit broke things that are completely unrelated to hybrid and uncore,
> > > just plain software events not being in a group. There's no reason
> > > to reorder those.
> > >
> > > Basically anything you do for hybrid and topdown shouldn't affect
> > > anything else. That's the basic rule.
> >
> > It is not clear that modifying the order of events in the evlist
> > qualifies as a breakage,
>
> I don't know what to say here.

We do wildcard expansion of the PMUs, which leads to the evlist order
being unstable. Fwiw this will change with:
https://lore.kernel.org/lkml/20230711055859.1242497-1-irogers@google.com/

> > and it is necessary because of complex kernel
> > APIs that particularly in the case of perf metric events, Intel were
> > responsible for.
>
> But my example has no perf metrics.
>
> Also taking a step back:
>
> You're saying reordering is needed for perf metrics because slots
> has to come first in the group.
>
> But the expectation here was always that the user would specify
> the groups this way, otherwise it doesn't work.  Do you disagree with
> that?

Yes I disagree. We can't always force perf metric events in a group
because weak groups don't always work. When we place the perf metric
events outside of a group they are broken. So now we don't group the
events and fix the topdown events to be in a group, in arch specific
code. It is a complete pain brought about by the Intel core PMU
driver.

> So why does the perf tool need to fix it? Just pass it to the kernel
> and it won't schedule and the user has to fix it.

Because in some cases, and the cases I have happened to care about,
the groups of events come from metrics.

> Perhaps perf can give warnings to make it clearer.

We do warn when we reorder the events.

> I understand that the metrics code has to make sure to not generate
> such groups, but that's a different problem that shouldn't affect how
> non metrics events are handled.

But things like wildcard group expansion and regrouping is a metric
problem too. So why would there be a parse events for metrics and a
separate one for events? Doing it this way would make metric group
testing significantly more hard too, as we can't just try the events
without the metric.

> Also I reviewed the commit in question and at least the example
> doesn't even have perf metrics ?!? It's trying to fix some uncore issue.

The uncore issue has been long standing, however, the fix used to be
in a separate "fix uncore" function. Then there was a similar "fix
hybrid" function. What is different now is that there is a single fix
function that isn't hard coded to Intel PMU names, etc.

> So it seems what we're talking about here has nothing to do with
> perf metrics.
>
> How about replacing the sorting commit with a warning and ask
> the user to do the reordering? Would that work for you?

No.

> > > > > > wrong place, instead of the expected order.
> > > > > >
> > > > > > good:
> > > > > >
> > > > > > ...
> > > > > > 986476553;ns;duration_time;986476553;100.00;;
> > > > > > 0;;dummy;983952237;100.00;;
> > > > > > 0;;cs:u;983952237;100.00;;
> > > > > > 16532;;minor-faults:u;983952237;100.00;;
> > > > > > 0;;major-faults:u;983952237;100.00;;
> > > > > > 0;;migrations:u;983952237;100.00;;
> > > > > > 3532509496;;cycles:u;23999092;2.00;;
> > > > > > ....
> > > > > >
> > > > > > bad:
> > > > > >
> > > > > > 1010119571;ns;duration_time;1010119571;100.00;;
> > > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > > 0;;cs:u [software];1007748753;100.00;;
> > > > > > 16496;;minor-faults:u [software];1007748753;100.00;;
> > > > > > 0;;major-faults:u [software];1007748753;100.00;;
> > > > > > 0;;migrations:u [software];1007748753;100.00;;
> > > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > > 0;;emulation-faults [software];1007748753;100.00;;
> > > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > > 0;;dummy [software];1007748753;100.00;;
> > > > > > 0;;emulation-faults [software];1007748753;100.00;;
> > > > > > 0;;emulation-faults [software];1007748753;100.00;;
> > >
> > > Please explain how this insane order can make sense even with hybrid
> > > or metrics.
> >
> > Tbh, I'm not sure what you are doing. linux-next is on 6.5 but you
> > mention 6.3. You don't list a platform for testing on. I'm not even
>
> 6.3 was my original reference point, but since I bisected the issue
> it's not 6.3 but exactly just this one commit that changes behavior.
>
> The platform is SKX.
>
> >
> > > > would seem unfortunate. Tbh, the mistake here is the assumptions that
> > > > toplev is making.
> > >
> > > It's not an random assumption, it's the only way to parse perf output uniquely.
> > >
> > > Please make it an option. But this cannot stay this way unconditionally
> > > if perf stat is supposed to stay a tool which can be used from other programs.
> >
> > You should be using the json or CSV outputs. These are intended for
>
> This is the CSV output that is reordered! (see the example you just
> quoted)
>
> I haven't tested JSON, but I assume that is reordered too.

The output of the JSON is a dictionary and inherently unordered.

> >
> > For hybrid the problem of regrouping events still isn't properly
> > solved. Consider:
>
>
> Yes I understand that hybrid has issues. But the first step to solving
> them is to not break existing platforms.

This is a false assertion. Coverage testing is up about 10%, we have
more tests and more tests passing on more platforms. Objectively the
code base has improved. On hybrid the behavior is night and day -
crashes vs having the complete TopdownL1 working on both PMUs, testing
parity with non-hybrid. We also fixed all address and leak sanitizer
bugs. Perf in Linux 6.5 even without the awesome new features like
lock contention analysis, is the best perf tool we've ever had.

> >
> > perf stat -e '{cycles,faults}' ...
> >
> > On a hybrid system we get the cycles event for the cpu_atom and
> > cpu_core PMUs, but faults as a software event will exist once and be
> > grouped on only 1 of the PMUs. This behavior is clearly silly, faults
> > should measure both on cpu_atom and cpu_core PMUs, as it would appear
> > on a non-hybrid system, and we should fix this. But what you're
> > arguing is that we can't and shouldn't as Hyrum's law tells us that
> > someone is depending on the 1 PMU behavior.
>
> TBH i don't have a strong opinion on keeping hybrid the same
> As you say it is somewhat broken, so I guess there are less
> existing users and some change there is acceptable, as long
> as it doesn't affect the more mature platforms.
>
> As for your example it seems somewhat obscure and I would probably
> consider it user error. The non PMU qualified support was really only intended to
> keep existing perf record/stat default working, for everything else
> it should be discouraged. Trying to fix every corner case
> here automatically is unlikely to be productive.
>
> So perhaps just print a warning in this case?
>
> > explicitly list the groups and PMUs. We need to make the tool behave
> > properly for increasingly complex kernel PMU event requirements.
>
> I agree, but reordering the complete event list with a huge impact
> on non hybrid is not the way to do it.

Does SKX have uncore? Yes, so you need to reorder events and this
change didn't start it.

>
> > This issue is often made substantially more complex as fixing Intel's
> > kernel PMU driver is pushed back upon - for example, we can't rely on
> > weak groups,
>
> I thought they still work most of the time?

$ for i in `ls tools/perf/pmu-events/arch/x86/*/*-metrics.json`; do
echo -n "$i "; grep NO_GROUP_EVENTS $i|wc -l; done
tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json 21
tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json 2
tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json 20
tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json 20
tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json 23
tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json 32
tools/perf/pmu-events/arch/x86/elkhartlake/ehl-metrics.json 0
tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json 20
tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json 23
tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json 19
tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json 19
tools/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json 20
tools/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json 23
tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json 8
tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json 19
tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json 8
tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json 19
tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json 30
tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json 31
tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json 19

NO_GROUP_EVENTS is used to avoid a group being used, an important use
being avoiding PMU driver bugs.

Thanks,
Ian

> -Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-14  3:56           ` Ian Rogers
@ 2023-07-17  2:00             ` Andi Kleen
  2023-07-17 13:37               ` Arnaldo Carvalho de Melo
  2023-07-18 18:25               ` Ian Rogers
  0 siblings, 2 replies; 13+ messages in thread
From: Andi Kleen @ 2023-07-17  2:00 UTC (permalink / raw)
  To: Ian Rogers; +Cc: Arnaldo Carvalho de Melo, namhyung, linux-perf-users

On Thu, Jul 13, 2023 at 08:56:55PM -0700, Ian Rogers wrote:
> Yes I disagree. We can't always force perf metric events in a group
> because weak groups don't always work. When we place the perf metric
> events outside of a group they are broken. So now we don't group the
> events and fix the topdown events to be in a group, in arch specific
> code. It is a complete pain brought about by the Intel core PMU
> driver.

Was that the old topdown slots have to be first issue, or something
else?

If it's so painful I guess could revisit the kernel part, although
I suspect anything that the kernel could do the perf tool could do too.

> 
> > So why does the perf tool need to fix it? Just pass it to the kernel
> > and it won't schedule and the user has to fix it.
> 
> Because in some cases, and the cases I have happened to care about,
> the groups of events come from metrics.

That's fine, but can't you just fix it for the metrics only.

Also even for metrics if you freely reorder them they might become
unparseable.  But the events inside the metrics should be freely
reorderable (well technically someone could have a dependency on that
too, but at least in my mind that's much less likely than their
dependency just being on the metric) 

So what it really needs is a sort only insight the metric.

That shouldn't be too hard to implement, right?

Also I guess we could add some test cases on avoiding reordering,
although I'm optimistic that once it is established that reordering
should be minimized/avoided it is unlikely to be something 
that breask frequently.

Would it be possible to revert the global sort patch until that 
is sorted out? Otherwise I'm afraid I'll end up with perf releases
that are absolutely incompatible to toplev, and I'll get problems
with my user base who often cannot easily update perf.

Arnaldo, can you please comment on this, what is your policy 
on regressions here?

> But things like wildcard group expansion and regrouping is a metric
> problem too. So why would there be a parse events for metrics and a
> separate one for events? Doing it this way would make metric group
> testing significantly more hard too, as we can't just try the events
> without the metric.

Hmm, perhaps add some kind of debug flag to handle this case?
As long as it's not default I don't have any issues with that.

> > How about replacing the sorting commit with a warning and ask
> > the user to do the reordering? Would that work for you?
> 
> No.

Well I hope we can get to some compromise here. Right now 
this arbitary unpredictable reordering is a show stopper for me,
and I suspect it will be for any other perf stat users that do some kind
of post processing.

As long as we find a solution like above that works for the metrics
it should be ok right?

> > The platform is SKX.
> >
> > >
> > > > > would seem unfortunate. Tbh, the mistake here is the assumptions that
> > > > > toplev is making.
> > > >
> > > > It's not an random assumption, it's the only way to parse perf output uniquely.
> > > >
> > > > Please make it an option. But this cannot stay this way unconditionally
> > > > if perf stat is supposed to stay a tool which can be used from other programs.
> > >
> > > You should be using the json or CSV outputs. These are intended for
> >
> > This is the CSV output that is reordered! (see the example you just
> > quoted)
> >
> > I haven't tested JSON, but I assume that is reordered too.
> 
> The output of the JSON is a dictionary and inherently unordered.

I haven't tried to use it, but if you have something like
perf stat --json -e '{cycles,branches},{cycles,cache-misses}' how would you
distinguish the two different cycles without order? 

> crashes vs having the complete TopdownL1 working on both PMUs, testing
> parity with non-hybrid. We also fixed all address and leak sanitizer
> bugs. 

> Perf in Linux 6.5 even without the awesome new features like
> lock contention analysis, is the best perf tool we've ever had.

Sure and without arbitrary reordering and unparseable output it would be
even better ...

> Does SKX have uncore? Yes, so you need to reorder events and this
> change didn't start it.

Yes of course it has an uncore.

You only need to reorder if you create broken groups.  At least my tool
doesn't generate them.

> NO_GROUP_EVENTS is used to avoid a group being used, an important use
> being avoiding PMU driver bugs.
> 

Hmm, not sure what that is refering to, but ok.

-Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-17  2:00             ` Andi Kleen
@ 2023-07-17 13:37               ` Arnaldo Carvalho de Melo
  2023-07-26 15:05                 ` Andi Kleen
  2023-07-18 18:25               ` Ian Rogers
  1 sibling, 1 reply; 13+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-07-17 13:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ian Rogers, namhyung, linux-perf-users

Em Sun, Jul 16, 2023 at 07:00:44PM -0700, Andi Kleen escreveu:
> On Thu, Jul 13, 2023 at 08:56:55PM -0700, Ian Rogers wrote:
> > Yes I disagree. We can't always force perf metric events in a group
> > because weak groups don't always work. When we place the perf metric
> > events outside of a group they are broken. So now we don't group the
> > events and fix the topdown events to be in a group, in arch specific
> > code. It is a complete pain brought about by the Intel core PMU
> > driver.
> 
> Was that the old topdown slots have to be first issue, or something
> else?
> 
> If it's so painful I guess could revisit the kernel part, although
> I suspect anything that the kernel could do the perf tool could do too.
> 
> > 
> > > So why does the perf tool need to fix it? Just pass it to the kernel
> > > and it won't schedule and the user has to fix it.
> > 
> > Because in some cases, and the cases I have happened to care about,
> > the groups of events come from metrics.
> 
> That's fine, but can't you just fix it for the metrics only.
> 
> Also even for metrics if you freely reorder them they might become
> unparseable.  But the events inside the metrics should be freely

Don't we have a JSON output? My expectation is that after the work that
Ian has been doing we will settle down, but even then downstreamers
should try to use JSON as input as that would avoid the problems we are
having?

You mentioned that it was difficult to add entries to 'perf test' but
that we could try running toplev's regression test, right? I can add it
to my set of tests but perhaps the best thing would be to wire it up to
one of the CIs out there?

- Arnaldo

> reorderable (well technically someone could have a dependency on that
> too, but at least in my mind that's much less likely than their
> dependency just being on the metric) 
> 
> So what it really needs is a sort only insight the metric.
> 
> That shouldn't be too hard to implement, right?
> 
> Also I guess we could add some test cases on avoiding reordering,
> although I'm optimistic that once it is established that reordering
> should be minimized/avoided it is unlikely to be something 
> that breask frequently.
> 
> Would it be possible to revert the global sort patch until that 
> is sorted out? Otherwise I'm afraid I'll end up with perf releases
> that are absolutely incompatible to toplev, and I'll get problems
> with my user base who often cannot easily update perf.
> 
> Arnaldo, can you please comment on this, what is your policy 
> on regressions here?

 
> > But things like wildcard group expansion and regrouping is a metric
> > problem too. So why would there be a parse events for metrics and a
> > separate one for events? Doing it this way would make metric group
> > testing significantly more hard too, as we can't just try the events
> > without the metric.
> 
> Hmm, perhaps add some kind of debug flag to handle this case?
> As long as it's not default I don't have any issues with that.
> 
> > > How about replacing the sorting commit with a warning and ask
> > > the user to do the reordering? Would that work for you?
> > 
> > No.
> 
> Well I hope we can get to some compromise here. Right now 
> this arbitary unpredictable reordering is a show stopper for me,
> and I suspect it will be for any other perf stat users that do some kind
> of post processing.
> 
> As long as we find a solution like above that works for the metrics
> it should be ok right?
> 
> > > The platform is SKX.
> > >
> > > >
> > > > > > would seem unfortunate. Tbh, the mistake here is the assumptions that
> > > > > > toplev is making.
> > > > >
> > > > > It's not an random assumption, it's the only way to parse perf output uniquely.
> > > > >
> > > > > Please make it an option. But this cannot stay this way unconditionally
> > > > > if perf stat is supposed to stay a tool which can be used from other programs.
> > > >
> > > > You should be using the json or CSV outputs. These are intended for
> > >
> > > This is the CSV output that is reordered! (see the example you just
> > > quoted)
> > >
> > > I haven't tested JSON, but I assume that is reordered too.
> > 
> > The output of the JSON is a dictionary and inherently unordered.
> 
> I haven't tried to use it, but if you have something like
> perf stat --json -e '{cycles,branches},{cycles,cache-misses}' how would you
> distinguish the two different cycles without order? 
> 
> > crashes vs having the complete TopdownL1 working on both PMUs, testing
> > parity with non-hybrid. We also fixed all address and leak sanitizer
> > bugs. 
> 
> > Perf in Linux 6.5 even without the awesome new features like
> > lock contention analysis, is the best perf tool we've ever had.
> 
> Sure and without arbitrary reordering and unparseable output it would be
> even better ...
> 
> > Does SKX have uncore? Yes, so you need to reorder events and this
> > change didn't start it.
> 
> Yes of course it has an uncore.
> 
> You only need to reorder if you create broken groups.  At least my tool
> doesn't generate them.
> 
> > NO_GROUP_EVENTS is used to avoid a group being used, an important use
> > being avoiding PMU driver bugs.
> > 
> 
> Hmm, not sure what that is refering to, but ok.
> 
> -Andi

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-17  2:00             ` Andi Kleen
  2023-07-17 13:37               ` Arnaldo Carvalho de Melo
@ 2023-07-18 18:25               ` Ian Rogers
  2023-07-19  0:53                 ` Ian Rogers
  1 sibling, 1 reply; 13+ messages in thread
From: Ian Rogers @ 2023-07-18 18:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Arnaldo Carvalho de Melo, namhyung, linux-perf-users

On Sun, Jul 16, 2023 at 7:00 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> On Thu, Jul 13, 2023 at 08:56:55PM -0700, Ian Rogers wrote:
> > Yes I disagree. We can't always force perf metric events in a group
> > because weak groups don't always work. When we place the perf metric
> > events outside of a group they are broken. So now we don't group the
> > events and fix the topdown events to be in a group, in arch specific
> > code. It is a complete pain brought about by the Intel core PMU
> > driver.
>
> Was that the old topdown slots have to be first issue, or something
> else?

Yes, the current ordering is defined here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/arch/x86/util/evlist.c?h=perf-tools-next#n79

> If it's so painful I guess could revisit the kernel part, although
> I suspect anything that the kernel could do the perf tool could do too.
>
> >
> > > So why does the perf tool need to fix it? Just pass it to the kernel
> > > and it won't schedule and the user has to fix it.
> >
> > Because in some cases, and the cases I have happened to care about,
> > the groups of events come from metrics.
>
> That's fine, but can't you just fix it for the metrics only.
>
> Also even for metrics if you freely reorder them they might become
> unparseable.  But the events inside the metrics should be freely
> reorderable (well technically someone could have a dependency on that
> too, but at least in my mind that's much less likely than their
> dependency just being on the metric)
>
> So what it really needs is a sort only insight the metric.
>
> That shouldn't be too hard to implement, right?

I'll describe why this isn't a fix below.

> Also I guess we could add some test cases on avoiding reordering,
> although I'm optimistic that once it is established that reordering
> should be minimized/avoided it is unlikely to be something
> that breask frequently.
>
> Would it be possible to revert the global sort patch until that
> is sorted out? Otherwise I'm afraid I'll end up with perf releases
> that are absolutely incompatible to toplev, and I'll get problems
> with my user base who often cannot easily update perf.

This would break the majority of metrics from Icelake forward.

> Arnaldo, can you please comment on this, what is your policy
> on regressions here?
>
> > But things like wildcard group expansion and regrouping is a metric
> > problem too. So why would there be a parse events for metrics and a
> > separate one for events? Doing it this way would make metric group
> > testing significantly more hard too, as we can't just try the events
> > without the metric.
>
> Hmm, perhaps add some kind of debug flag to handle this case?
> As long as it's not default I don't have any issues with that.
>
> > > How about replacing the sorting commit with a warning and ask
> > > the user to do the reordering? Would that work for you?
> >
> > No.
>
> Well I hope we can get to some compromise here. Right now
> this arbitary unpredictable reordering is a show stopper for me,
> and I suspect it will be for any other perf stat users that do some kind
> of post processing.
>
> As long as we find a solution like above that works for the metrics
> it should be ok right?

So you've resisted working through what your issue with the sorting
is. To call the sorting arbitrary is of course wrong, but you keep
wanting to say everything the perf tool is doing is wrong. Moving
forward...

What is happening in your example is straightforward. The sorting
code's goal is to keep events in their place but what to do about:
perf stat -e 'cycles,{uncore_imc_free_running/data_read/,uncore_imc_free_running/data_write/}'
where there are 2 uncore_imc_free_running PMUs? After parsing the
evlist looks like:
cycles,{uncore_imc_free_running_0/data_read/,uncore_imc_free_running_1/data_read/,uncore_imc_free_running_0/data_write/,uncore_imc_free_running_1/data_write/}
We need to turn 1 group into 2, we need to reorder by the PMUs, this leads to:
cycles,{uncore_imc_free_running_0/data_read/,uncore_imc_free_running_0/data_write/},{uncore_imc_free_running_1/data_read/,uncore_imc_free_running_1/data_write/}
There is no regression here, it has always been perf's behavior to
reorder events and create groups globally.
When sorting groups the code uses the position in the original list of
the group's leader:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n2124
For events within a group it sorts them by group PMU name (which is
normally just the PMU name), if the PMU names are the same it sorts
the events by their original position in the list possibly corrected
for perf metric events.

Okay, so now imagine the lack of grouping brought about by perf metric
(topdown) events and broken Intel PMUs that fail to break weak groups.
For a hypothetical example we create a list of events like:
perf stat -e 'topdown-fe-bound,cycles,instructions,slots'
As with the group case we need to reorder the events to have slots and
topdown-fe-bound together and grouped. The code achieves:
{slots,topdown-fe-bound},cycles,instructions
As grouped events need to be adjacent in the evlist, we need to sort
all the events without a group to be together and place the perf
metric events first. They are then placed in a group.
So that we best respect the order of ungrouped and grouped events, we
place the ungrouped events as if they are a group with a leader with
the index of the first ungrouped event.
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n2101
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n2170

So this then comes to your example where you have ungrouped events,
grouped events in a giant list. Well on Icelake+ we could have perf
metric events in those ungrouped events, this means we need to sort
the ungrouped events together, potentially moving perf metric events
to the front and grouping them.

Hey hey hey, you say. I'm on SkylakeX and don't have perf metric
events. Why should the tool be trying to sort events to fix issues on
a PMU I don't have?
The problem is that Icelake+ architectures all have this issue and I
keep trying to make the code simpler and do the same thing for
everybody. I got rid of hardcoded "cpu_atom" and "cpu_core", the
separate hybrid event parsing, ... I'm trying to make it that perf
isn't a concatenated group of architecture specific PMU drivers and
there is commonality in how everything is done.
Well you say, I still want my ungrouped events to be scattered around
my list and you better not move them as that is the world's largest
regression, is crazy arbitrary rubbish and please go dump all your
changes in the garbage.

I disagree, the sorting is doing its best to only fix what is
necessary and otherwise to rely on the command line. That said, it can
only fix:
perf stat -e 'slots,{uncore_imc_free_running_0/data_read/,uncore_imc_free_running_1/data_read/},topdown-fe-bound'
If it is able to move slots and topdown-fe-bound to be adjacent.

So how can I make you happy:
1) throw everything away.. well that would be a challenge, break stuff
and I don't want to fix the resultant mess.
2) we can sort only for metrics. This leads to metric event parsing
and event parsing, duplication, testing and other issues. I'm trying
to make metrics sit on top of event parsing and not fork them. This is
more true after the hybrid changes, where things like setting up a
default cycles event is done through event parsing to avoid
duplication of the wildcard logic. Plus we still need event sorting
and grouping logic without metrics for at least uncore events.
3) we can sort only on Icelake+ architectures. Well isn't that the
majority of what Intel cares about now-a-days? It also leads to having
2 code paths and testing problems. It also won't work as uncore events
won't be sorted, regrouped and fixed.
4) we add a special toplev flag that disables sorting and grouping.
Well that will break sorting and grouping of uncore events and so I
don't think this will make you happy.
5)  we add a special toplev flag that on detecting a core PMUs causes
sorting and regrouping to be disabled. This isn't much better than 4
and still unlikely to make you happy.

So what can we do?
1) change how perf metric (topdown) events are implemented in the PMU.
This strategy works best with a time machine and we're duty bound, in
that case, to fix other problems in history.
2) stop toplev being a PITA? Wouldn't that be nice.
3) we do some kind of ungrouped event sorting hack. Rather than make
every ungrouped event look like having a common leader index, we only
do that for events that need to be forced into a group (perf metric
events). This should mean ungrouped events are sorted wrt groups in
the place they are originally in the list. This should fix the
pre-Icelake problems but on Icelake+ we can still get the "arbitrary"
reordering issue you so despise for perf metric events.

I'm going to give (3) a go and mail it today.

> > > The platform is SKX.
> > >
> > > >
> > > > > > would seem unfortunate. Tbh, the mistake here is the assumptions that
> > > > > > toplev is making.
> > > > >
> > > > > It's not an random assumption, it's the only way to parse perf output uniquely.
> > > > >
> > > > > Please make it an option. But this cannot stay this way unconditionally
> > > > > if perf stat is supposed to stay a tool which can be used from other programs.
> > > >
> > > > You should be using the json or CSV outputs. These are intended for
> > >
> > > This is the CSV output that is reordered! (see the example you just
> > > quoted)
> > >
> > > I haven't tested JSON, but I assume that is reordered too.
> >
> > The output of the JSON is a dictionary and inherently unordered.
>
> I haven't tried to use it, but if you have something like
> perf stat --json -e '{cycles,branches},{cycles,cache-misses}' how would you
> distinguish the two different cycles without order?
>
> > crashes vs having the complete TopdownL1 working on both PMUs, testing
> > parity with non-hybrid. We also fixed all address and leak sanitizer
> > bugs.
>
> > Perf in Linux 6.5 even without the awesome new features like
> > lock contention analysis, is the best perf tool we've ever had.
>
> Sure and without arbitrary reordering and unparseable output it would be
> even better ...

Ugh.. am I being motivated here to fix your toplev problem?

> > Does SKX have uncore? Yes, so you need to reorder events and this
> > change didn't start it.
>
> Yes of course it has an uncore.
>
> You only need to reorder if you create broken groups.  At least my tool
> doesn't generate them.

Zip-a-Dee-Doo-Dah.

> > NO_GROUP_EVENTS is used to avoid a group being used, an important use
> > being avoiding PMU driver bugs.
> >
>
> Hmm, not sure what that is refering to, but ok.

You can take the metrics and remove the NO_GROUP_EVENTS. The failure
mode is that the weak group fails to be broken. I believe the issues
to be some mix of errata, naivety in the fake PMU driver when
scheduling onto counters and not appreciating that not all counters
are created equal, possible bugs, etc. The issues may not just be on
the core PMU, it depends on the metric.

Given I'm fixing things for you, can you please fix:
$ perf stat -e 'slots,branches,slots' -a sleep 1
to not multiplex on Icelake+.

This can be achieved with:
```
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2607,7 +2607,7 @@ static int group_can_go_on(struct perf_event
*event, int can_add_hw)
        * Otherwise, try to add it if all previous groups were able
        * to go on.
        */
-       return can_add_hw;
+       return 1;
}

static void add_event_to_ctx(struct perf_event *event,
```
That means we'll consider adding every hardware event just as we
consider adding every software event. A more complex fix would be to
make can_add_hw some kind of bitmap that indicates different counter
types are available, and to only stop searching only when all counter
types are exhausted. I'm perfectly happy getting rid of the can_add_hw
optimization and not worrying about the potential extra overhead. That
overhead is theoretic and under user control. The unnecessary
multiplexing is just silly and unavoidable.

Thanks,
Ian

> -Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-18 18:25               ` Ian Rogers
@ 2023-07-19  0:53                 ` Ian Rogers
  0 siblings, 0 replies; 13+ messages in thread
From: Ian Rogers @ 2023-07-19  0:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Arnaldo Carvalho de Melo, namhyung, linux-perf-users

On Tue, Jul 18, 2023 at 11:25 AM Ian Rogers <irogers@google.com> wrote:
>
> On Sun, Jul 16, 2023 at 7:00 PM Andi Kleen <ak@linux.intel.com> wrote:
> >
> > On Thu, Jul 13, 2023 at 08:56:55PM -0700, Ian Rogers wrote:
> > > Yes I disagree. We can't always force perf metric events in a group
> > > because weak groups don't always work. When we place the perf metric
> > > events outside of a group they are broken. So now we don't group the
> > > events and fix the topdown events to be in a group, in arch specific
> > > code. It is a complete pain brought about by the Intel core PMU
> > > driver.
> >
> > Was that the old topdown slots have to be first issue, or something
> > else?
>
> Yes, the current ordering is defined here:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/arch/x86/util/evlist.c?h=perf-tools-next#n79
>
> > If it's so painful I guess could revisit the kernel part, although
> > I suspect anything that the kernel could do the perf tool could do too.
> >
> > >
> > > > So why does the perf tool need to fix it? Just pass it to the kernel
> > > > and it won't schedule and the user has to fix it.
> > >
> > > Because in some cases, and the cases I have happened to care about,
> > > the groups of events come from metrics.
> >
> > That's fine, but can't you just fix it for the metrics only.
> >
> > Also even for metrics if you freely reorder them they might become
> > unparseable.  But the events inside the metrics should be freely
> > reorderable (well technically someone could have a dependency on that
> > too, but at least in my mind that's much less likely than their
> > dependency just being on the metric)
> >
> > So what it really needs is a sort only insight the metric.
> >
> > That shouldn't be too hard to implement, right?
>
> I'll describe why this isn't a fix below.
>
> > Also I guess we could add some test cases on avoiding reordering,
> > although I'm optimistic that once it is established that reordering
> > should be minimized/avoided it is unlikely to be something
> > that breask frequently.
> >
> > Would it be possible to revert the global sort patch until that
> > is sorted out? Otherwise I'm afraid I'll end up with perf releases
> > that are absolutely incompatible to toplev, and I'll get problems
> > with my user base who often cannot easily update perf.
>
> This would break the majority of metrics from Icelake forward.
>
> > Arnaldo, can you please comment on this, what is your policy
> > on regressions here?
> >
> > > But things like wildcard group expansion and regrouping is a metric
> > > problem too. So why would there be a parse events for metrics and a
> > > separate one for events? Doing it this way would make metric group
> > > testing significantly more hard too, as we can't just try the events
> > > without the metric.
> >
> > Hmm, perhaps add some kind of debug flag to handle this case?
> > As long as it's not default I don't have any issues with that.
> >
> > > > How about replacing the sorting commit with a warning and ask
> > > > the user to do the reordering? Would that work for you?
> > >
> > > No.
> >
> > Well I hope we can get to some compromise here. Right now
> > this arbitary unpredictable reordering is a show stopper for me,
> > and I suspect it will be for any other perf stat users that do some kind
> > of post processing.
> >
> > As long as we find a solution like above that works for the metrics
> > it should be ok right?
>
> So you've resisted working through what your issue with the sorting
> is. To call the sorting arbitrary is of course wrong, but you keep
> wanting to say everything the perf tool is doing is wrong. Moving
> forward...
>
> What is happening in your example is straightforward. The sorting
> code's goal is to keep events in their place but what to do about:
> perf stat -e 'cycles,{uncore_imc_free_running/data_read/,uncore_imc_free_running/data_write/}'
> where there are 2 uncore_imc_free_running PMUs? After parsing the
> evlist looks like:
> cycles,{uncore_imc_free_running_0/data_read/,uncore_imc_free_running_1/data_read/,uncore_imc_free_running_0/data_write/,uncore_imc_free_running_1/data_write/}
> We need to turn 1 group into 2, we need to reorder by the PMUs, this leads to:
> cycles,{uncore_imc_free_running_0/data_read/,uncore_imc_free_running_0/data_write/},{uncore_imc_free_running_1/data_read/,uncore_imc_free_running_1/data_write/}
> There is no regression here, it has always been perf's behavior to
> reorder events and create groups globally.
> When sorting groups the code uses the position in the original list of
> the group's leader:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n2124
> For events within a group it sorts them by group PMU name (which is
> normally just the PMU name), if the PMU names are the same it sorts
> the events by their original position in the list possibly corrected
> for perf metric events.
>
> Okay, so now imagine the lack of grouping brought about by perf metric
> (topdown) events and broken Intel PMUs that fail to break weak groups.
> For a hypothetical example we create a list of events like:
> perf stat -e 'topdown-fe-bound,cycles,instructions,slots'
> As with the group case we need to reorder the events to have slots and
> topdown-fe-bound together and grouped. The code achieves:
> {slots,topdown-fe-bound},cycles,instructions
> As grouped events need to be adjacent in the evlist, we need to sort
> all the events without a group to be together and place the perf
> metric events first. They are then placed in a group.
> So that we best respect the order of ungrouped and grouped events, we
> place the ungrouped events as if they are a group with a leader with
> the index of the first ungrouped event.
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n2101
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n2170
>
> So this then comes to your example where you have ungrouped events,
> grouped events in a giant list. Well on Icelake+ we could have perf
> metric events in those ungrouped events, this means we need to sort
> the ungrouped events together, potentially moving perf metric events
> to the front and grouping them.
>
> Hey hey hey, you say. I'm on SkylakeX and don't have perf metric
> events. Why should the tool be trying to sort events to fix issues on
> a PMU I don't have?
> The problem is that Icelake+ architectures all have this issue and I
> keep trying to make the code simpler and do the same thing for
> everybody. I got rid of hardcoded "cpu_atom" and "cpu_core", the
> separate hybrid event parsing, ... I'm trying to make it that perf
> isn't a concatenated group of architecture specific PMU drivers and
> there is commonality in how everything is done.
> Well you say, I still want my ungrouped events to be scattered around
> my list and you better not move them as that is the world's largest
> regression, is crazy arbitrary rubbish and please go dump all your
> changes in the garbage.
>
> I disagree, the sorting is doing its best to only fix what is
> necessary and otherwise to rely on the command line. That said, it can
> only fix:
> perf stat -e 'slots,{uncore_imc_free_running_0/data_read/,uncore_imc_free_running_1/data_read/},topdown-fe-bound'
> If it is able to move slots and topdown-fe-bound to be adjacent.
>
> So how can I make you happy:
> 1) throw everything away.. well that would be a challenge, break stuff
> and I don't want to fix the resultant mess.
> 2) we can sort only for metrics. This leads to metric event parsing
> and event parsing, duplication, testing and other issues. I'm trying
> to make metrics sit on top of event parsing and not fork them. This is
> more true after the hybrid changes, where things like setting up a
> default cycles event is done through event parsing to avoid
> duplication of the wildcard logic. Plus we still need event sorting
> and grouping logic without metrics for at least uncore events.
> 3) we can sort only on Icelake+ architectures. Well isn't that the
> majority of what Intel cares about now-a-days? It also leads to having
> 2 code paths and testing problems. It also won't work as uncore events
> won't be sorted, regrouped and fixed.
> 4) we add a special toplev flag that disables sorting and grouping.
> Well that will break sorting and grouping of uncore events and so I
> don't think this will make you happy.
> 5)  we add a special toplev flag that on detecting a core PMUs causes
> sorting and regrouping to be disabled. This isn't much better than 4
> and still unlikely to make you happy.
>
> So what can we do?
> 1) change how perf metric (topdown) events are implemented in the PMU.
> This strategy works best with a time machine and we're duty bound, in
> that case, to fix other problems in history.
> 2) stop toplev being a PITA? Wouldn't that be nice.
> 3) we do some kind of ungrouped event sorting hack. Rather than make
> every ungrouped event look like having a common leader index, we only
> do that for events that need to be forced into a group (perf metric
> events). This should mean ungrouped events are sorted wrt groups in
> the place they are originally in the list. This should fix the
> pre-Icelake problems but on Icelake+ we can still get the "arbitrary"
> reordering issue you so despise for perf metric events.
>
> I'm going to give (3) a go and mail it today.

Testing:
https://lore.kernel.org/lkml/20230719001836.198363-1-irogers@google.com/
with the given events on SKX shows no event re-ordering compared to
say perf 6.1.

The event list contains wild carded events for:
uncore_imc/event=0x4,umask=0x3/,uncore_imc/event=0x4,umask=0xc/

where the order is arbitrary and I see:
uncore_imc_5/event=0x4,umask=0x3/
uncore_imc_3/event=0x4,umask=0x3/
uncore_imc_1/event=0x4,umask=0x3/
uncore_imc_4/event=0x4,umask=0x3/
uncore_imc_2/event=0x4,umask=0x3/
uncore_imc_0/event=0x4,umask=0x3/
uncore_imc_5/event=0x4,umask=0xc/
uncore_imc_3/event=0x4,umask=0xc/
uncore_imc_1/event=0x4,umask=0xc/
uncore_imc_4/event=0x4,umask=0xc/
uncore_imc_2/event=0x4,umask=0xc/
uncore_imc_0/event=0x4,umask=0xc/

This order will be broken by:
https://lore.kernel.org/lkml/20230711055859.1242497-2-irogers@google.com/
as the order will be numerically ascending.

As described, the sorting will still apply to perf metric events.
toplev will probably not be impacted as the perf metric events would
need to already be sorted to work on older versions of perf.

Thanks,
Ian

> > > > The platform is SKX.
> > > >
> > > > >
> > > > > > > would seem unfortunate. Tbh, the mistake here is the assumptions that
> > > > > > > toplev is making.
> > > > > >
> > > > > > It's not an random assumption, it's the only way to parse perf output uniquely.
> > > > > >
> > > > > > Please make it an option. But this cannot stay this way unconditionally
> > > > > > if perf stat is supposed to stay a tool which can be used from other programs.
> > > > >
> > > > > You should be using the json or CSV outputs. These are intended for
> > > >
> > > > This is the CSV output that is reordered! (see the example you just
> > > > quoted)
> > > >
> > > > I haven't tested JSON, but I assume that is reordered too.
> > >
> > > The output of the JSON is a dictionary and inherently unordered.
> >
> > I haven't tried to use it, but if you have something like
> > perf stat --json -e '{cycles,branches},{cycles,cache-misses}' how would you
> > distinguish the two different cycles without order?
> >
> > > crashes vs having the complete TopdownL1 working on both PMUs, testing
> > > parity with non-hybrid. We also fixed all address and leak sanitizer
> > > bugs.
> >
> > > Perf in Linux 6.5 even without the awesome new features like
> > > lock contention analysis, is the best perf tool we've ever had.
> >
> > Sure and without arbitrary reordering and unparseable output it would be
> > even better ...
>
> Ugh.. am I being motivated here to fix your toplev problem?
>
> > > Does SKX have uncore? Yes, so you need to reorder events and this
> > > change didn't start it.
> >
> > Yes of course it has an uncore.
> >
> > You only need to reorder if you create broken groups.  At least my tool
> > doesn't generate them.
>
> Zip-a-Dee-Doo-Dah.
>
> > > NO_GROUP_EVENTS is used to avoid a group being used, an important use
> > > being avoiding PMU driver bugs.
> > >
> >
> > Hmm, not sure what that is refering to, but ok.
>
> You can take the metrics and remove the NO_GROUP_EVENTS. The failure
> mode is that the weak group fails to be broken. I believe the issues
> to be some mix of errata, naivety in the fake PMU driver when
> scheduling onto counters and not appreciating that not all counters
> are created equal, possible bugs, etc. The issues may not just be on
> the core PMU, it depends on the metric.
>
> Given I'm fixing things for you, can you please fix:
> $ perf stat -e 'slots,branches,slots' -a sleep 1
> to not multiplex on Icelake+.
>
> This can be achieved with:
> ```
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2607,7 +2607,7 @@ static int group_can_go_on(struct perf_event
> *event, int can_add_hw)
>         * Otherwise, try to add it if all previous groups were able
>         * to go on.
>         */
> -       return can_add_hw;
> +       return 1;
> }
>
> static void add_event_to_ctx(struct perf_event *event,
> ```
> That means we'll consider adding every hardware event just as we
> consider adding every software event. A more complex fix would be to
> make can_add_hw some kind of bitmap that indicates different counter
> types are available, and to only stop searching only when all counter
> types are exhausted. I'm perfectly happy getting rid of the can_add_hw
> optimization and not worrying about the potential extra overhead. That
> overhead is theoretic and under user control. The unnecessary
> multiplexing is just silly and unavoidable.
>
> Thanks,
> Ian
>
> > -Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Event reordering regression for software events
  2023-07-17 13:37               ` Arnaldo Carvalho de Melo
@ 2023-07-26 15:05                 ` Andi Kleen
  0 siblings, 0 replies; 13+ messages in thread
From: Andi Kleen @ 2023-07-26 15:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Ian Rogers, namhyung, linux-perf-users

<sorry for the late reply I was on vacation>

On Mon, Jul 17, 2023 at 10:37:35AM -0300, Arnaldo Carvalho de Melo wrote:
> Don't we have a JSON output? My expectation is that after the work that
> Ian has been doing we will settle down, but even then downstreamers
> should try to use JSON as input as that would avoid the problems we are
> having?


In classic perf CSV was our machine readable output, and that is what got broken
here. I hope nobody is advocating for abolishing CSV output as a machine
readable format.

For JSON it got broken too because of:

> > I haven't tried to use it, but if you have something like
> > perf stat --json -e '{cycles,branches},{cycles,cache-misses}' how would you
> > distinguish it without order?

> 
> You mentioned that it was difficult to add entries to 'perf test' but

I can add some tests for basic reordering not occurring. That should
be possible.

> that we could try running toplev's regression test, right? I can add it
> to my set of tests but perhaps the best thing would be to wire it up to
> one of the CIs out there?

Yes that would be a good idea. Note that there are sometimes problems
which are on the toplev side, not perf side, but I guess can sort that
out when that happens.


-And

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-07-26 15:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-13 20:37 Event reordering regression for software events Andi Kleen
2023-07-13 20:48 ` Arnaldo Carvalho de Melo
2023-07-13 21:24   ` Ian Rogers
2023-07-13 23:51     ` Andi Kleen
2023-07-14  1:14       ` Ian Rogers
2023-07-14  3:07         ` Andi Kleen
2023-07-14  3:56           ` Ian Rogers
2023-07-17  2:00             ` Andi Kleen
2023-07-17 13:37               ` Arnaldo Carvalho de Melo
2023-07-26 15:05                 ` Andi Kleen
2023-07-18 18:25               ` Ian Rogers
2023-07-19  0:53                 ` Ian Rogers
2023-07-13 23:55   ` Andi Kleen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.