* perf measure for stalled cycles per instruction on newer Intel processors
@ 2020-10-15 14:53 Or Gerlitz
2020-10-15 18:33 ` Andi Kleen
0 siblings, 1 reply; 4+ messages in thread
From: Or Gerlitz @ 2020-10-15 14:53 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Brendan Gregg; +Cc: Linux Netdev List
Hi,
Earlier Intel processors (e.g. E5-2650) support the two more classical
stall events (for backend and frontend [1]), and perf then shows the
nice measure of stalled cycles per instruction - e.g. here, where we
have an IPC of 0.91 and a CSPI (see [2]) of 0.68:
     9,568,273,970      cycles                    #    2.679 GHz                      (53.30%)
     5,979,155,843      stalled-cycles-frontend   #   62.49% frontend cycles idle     (53.31%)
     4,874,774,413      stalled-cycles-backend    #   50.95% backend cycles idle      (53.31%)
     8,732,767,750      instructions              #    0.91  insn per cycle
                                                  #    0.68  stalled cycles per insn  (59.97%)
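As a sanity check, the derived ratios can be recomputed from the raw counters
in that output - a small sketch (values copied from above; dividing the larger
of the two stall counts by the instruction count appears to match perf's CSPI
figure here):

```shell
# Recompute IPC and CSPI from the raw counters in the perf stat output above.
cycles=9568273970
stalled_frontend=5979155843   # the larger of the two stall counts in this run
instructions=8732767750

awk -v c="$cycles" -v s="$stalled_frontend" -v i="$instructions" 'BEGIN {
    printf "insn per cycle: %.2f\n", i / c            # 0.91
    printf "stalled cycles per insn: %.2f\n", s / i   # 0.68
}'
```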
Running on a system with a newer processor (6254) I noted that there
are sort of a zillion (..) stall events [3], and perf -e $EVENT for
each of them does show its count.
However, perf stat no longer shows the "stalled cycles per insn"
computation.
Looking at the perf sources, it seems we do that only if the
backend/frontend events exist (the perf_stat__print_shadow_stats
function) - am I correct in my reading of the code?
If so, what's needed here to get this or a similar measure back?
If not, can you suggest how to get perf to emit this quantity?
Thanks,
Or.
[1] perf list | grep stalled-cycles
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
[2] http://www.brendangregg.com/perf.html#CPUstatistics
[3] perf list | grep stall -A 1 (manipulated, there are more..)
cycle_activity.stalls_l3_miss
[Execution stalls while L3 cache miss demand load is outstanding]
cycle_activity.stalls_l1d_miss
[Execution stalls while L1 cache miss demand load is outstanding]
cycle_activity.stalls_l2_miss
[Execution stalls while L2 cache miss demand load is outstanding]
cycle_activity.stalls_mem_any
[Execution stalls while memory subsystem has an outstanding load]
cycle_activity.stalls_total
[Total execution stalls]
ild_stall.lcp
[Core cycles the allocator was stalled due to recovery from earlier
partial_rat_stalls.scoreboard
[Cycles where the pipeline is stalled due to serializing operations]
resource_stalls.any
[Resource-related stall cycles]
resource_stalls.sb
[Cycles stalled due to no store buffers available. (not including
draining from sync)]
uops_executed.stall_cycles
[Counts number of cycles no uops were dispatched to be executed on this
uops_issued.stall_cycles
[Cycles when Resource Allocation Table (RAT) does not issue Uops to
uops_retired.stall_cycles
[Cycles without actually retired uops]
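These model-specific events can still be counted explicitly even though the
derived ratio is gone - an untested sketch (event names taken from the list
above; availability depends on the exact CPU model, hence the fallback):

```shell
# Count the stall sub-events directly on a CPU that exposes them; where perf
# or the events are unavailable, fall through to a message instead of failing.
perf stat -e cycles,instructions,cycle_activity.stalls_total,cycle_activity.stalls_mem_any \
    -- sleep 1 2>&1 || echo "perf or these events are unavailable here"
```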
* Re: perf measure for stalled cycles per instruction on newer Intel processors
From: Andi Kleen @ 2020-10-15 18:33 UTC (permalink / raw)
To: Or Gerlitz; +Cc: Peter Zijlstra, Ingo Molnar, Brendan Gregg, Linux Netdev List
On Thu, Oct 15, 2020 at 05:53:40PM +0300, Or Gerlitz wrote:
> Hi,
>
> Earlier Intel processors (e.g. E5-2650) support the two more classical
> stall events (for backend and frontend [1]), and perf then shows the
> nice measure of stalled cycles per instruction - e.g. here, where we
> have an IPC of 0.91 and a CSPI (see [2]) of 0.68:
Don't use it. It's misleading on an out-of-order CPU because you don't
know if it's actually limiting anything.
If you want useful bottleneck data use --topdown.
-Andi
* Re: perf measure for stalled cycles per instruction on newer Intel processors
From: Or Gerlitz @ 2020-10-18 17:42 UTC (permalink / raw)
To: Andi Kleen; +Cc: Peter Zijlstra, Ingo Molnar, Brendan Gregg, Linux Netdev List
On Thu, Oct 15, 2020 at 9:33 PM Andi Kleen <andi@firstfloor.org> wrote:
> On Thu, Oct 15, 2020 at 05:53:40PM +0300, Or Gerlitz wrote:
> > Earlier Intel processors (e.g. E5-2650) support the two more classical
> > stall events (for backend and frontend [1]), and perf then shows the
> > nice measure of stalled cycles per instruction - e.g. here, where we
> > have an IPC of 0.91 and a CSPI (see [2]) of 0.68:
>
> Don't use it. It's misleading on an out-of-order CPU because you don't
> know if it's actually limiting anything.
>
> If you want useful bottleneck data use --topdown.
So running again, this time with the below params, I got this output
where the whole rightmost column is colored red. I wonder what can be
said about the amount/ratio of stalls for this app - if you can, maybe
recommend some posts of yours to better understand that; I saw a comment
in the perf-stat man page and an lwn article but wasn't really able to
figure it out.
FWIW, the kernel is 5.5.7-100.fc30.x86_64 and the CPU is E5-2650 0
$ perf stat --topdown -a taskset -c 0 $APP
[...]
Performance counter stats for 'system wide':
                     retiring    bad speculation    frontend bound    backend bound
S0-D0-C0      1         24.9%               1.1%             16.1%            57.9%
S0-D0-C1      1         16.3%               1.3%             17.3%            65.1%
S0-D0-C2      1         17.0%               1.2%             15.3%            66.5%
S0-D0-C3      1         18.3%               0.8%              8.2%            72.8%
S0-D0-C4      1         18.1%               0.8%              8.5%            72.6%
S0-D0-C5      1         17.6%               0.8%             10.0%            71.6%
S0-D0-C6      1         18.3%               0.7%              7.4%            73.6%
S0-D0-C7      1         15.4%               1.4%             22.1%            61.2%
S1-D0-C0      1         15.9%               1.4%             16.4%            66.3%
S1-D0-C1      1         21.9%               2.6%             16.9%            58.5%
S1-D0-C2      1         20.8%               3.7%             17.1%            58.4%
S1-D0-C3      1         17.8%               1.0%              9.2%            72.1%
S1-D0-C4      1         17.8%               1.0%              9.0%            72.2%
S1-D0-C5      1         17.8%               1.0%              9.0%            72.2%
S1-D0-C6      1         17.4%               1.4%             12.8%            68.4%
S1-D0-C7      1         23.6%               4.3%             17.2%            55.0%
13.341823591 seconds time elapsed
while running with perf stat -d gives this:
$ perf stat -d taskset -c 0 $APP
Performance counter stats for 'taskset -c 0 ./main.gcc9.3.1':
         15,075.30 msec task-clock                #    0.900 CPUs utilized
               199      context-switches          #    0.013 K/sec
                 1      cpu-migrations            #    0.000 K/sec
           117,987      page-faults               #    0.008 M/sec
    40,907,365,540      cycles                    #    2.714 GHz
    26,431,604,986      stalled-cycles-frontend   #   64.61% frontend cycles idle
    21,734,615,045      stalled-cycles-backend    #   53.13% backend cycles idle
    35,339,765,469      instructions              #    0.86  insn per cycle
                                                  #    0.75  stalled cycles per insn
* Re: perf measure for stalled cycles per instruction on newer Intel processors
From: Andi Kleen @ 2020-10-19 1:00 UTC (permalink / raw)
To: Or Gerlitz
Cc: Andi Kleen, Peter Zijlstra, Ingo Molnar, Brendan Gregg,
Linux Netdev List
> > Don't use it. It's misleading on an out-of-order CPU because you don't
> > know if it's actually limiting anything.
> >
> > If you want useful bottleneck data use --topdown.
>
> So running again, this time with the below params, I got this output
> where the whole rightmost column is colored red. I wonder what can be
> said about the amount/ratio of stalls for this app - if you can, maybe
> recommend some posts of yours to better understand that; I saw a comment
> in the perf-stat man page and an lwn article but wasn't really able to
> figure it out.
TopDown determines what limits execution the most.
The application is mostly backend bound (55-72%). This can be either
memory issues (more common) or sometimes also execution issues. Standard
perf doesn't support a further breakdown beyond these high-level
categories, but there are alternative tools that do (e.g. mine is
"toplev" in https://github.com/andikleen/pmu-tools, or VTune).
Some references on TopDown:
https://github.com/andikleen/pmu-tools/wiki/toplev-manual
http://bit.ly/tma-ispass14
The tools above would also allow you to sample where the stalls
are occurring.
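For instance, a hypothetical toplev run against the workload ($APP and the
checkout path are placeholders; -l2 asks for the second level of the TopDown
hierarchy, and exact flags may differ between versions):

```shell
# Break Backend_Bound down one more level with toplev, if pmu-tools is
# checked out; otherwise report what is missing instead of failing.
if [ -x ./pmu-tools/toplev.py ]; then
    ./pmu-tools/toplev.py -l2 -- taskset -c 0 "$APP"
else
    echo "pmu-tools not found; clone https://github.com/andikleen/pmu-tools first"
fi
```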
-Andi