From: Milian Wolff <milian.wolff@kdab.com>
To: linux-perf-users@vger.kernel.org
Cc: Andi Kleen <andi@firstfloor.org>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	linux-kernel@vger.kernel.org
Subject: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]
Date: Mon, 28 Aug 2017 16:08:47 +0200	[thread overview]
Message-ID: <1503933130.rLYJWxPBaP@milian-kdab2> (raw)
In-Reply-To: <1654509.hqsMC4Q0uj@agathebauer>


On Sunday, June 18, 2017 9:53:05 PM CEST Milian Wolff wrote:
> On Sonntag, 18. Juni 2017 06:22:19 CEST Andi Kleen wrote:
> > Milian Wolff <milian.wolff@kdab.com> writes:
> > > But when I look at the naively calculated first derivative, to visualize
> > > CPU load, i.e. CPU clock rate in Hz, then things start to become
> > > somewhat
> > > confusing:
> > > 
> > > ~~~~
> > > perf script -F time,period | awk 'BEGIN {lastTime = -1;} { time = $1 +
> > > 0.0; if (lastTime != -1) {printf("%.6f\t%f\n", time, $2 / (time -
> > > lastTime));} lastTime = time; }' | gnuplot -p -e "plot '-' with
> > > linespoints"
> > > ~~~~
> > 
> > The perf time stamps approach the maximum precision of double (12 vs
> > 15 digits). Likely the division loses too many digits, which may cause
> > the bogus results. I've run into similar problems before.
> 
> I don't think so, just look at the raw values:
> 
> $ perf script -F time,period --ns
> 71789.438122347:          1
> 71789.438127160:          1
> 71789.438129599:          7
> 71789.438131844:         94
> 71789.438134282:       1391
> 71789.438139871:      19106
> 71789.438156426:     123336
> ...
> 
> $ qalc '123336/(71789.438156426s - 71789.438139871s) to Hz'
> 123336 / ((71789.438 * second) - (71789.438 * second)) = approx. 7.4500755E9 Hz
> 
> > One way around this is to normalize the time stamps first so that they
> > start at 0, but this only works for shorter traces.
> > Or use some bignum float library
> 
> I take the time delta between two samples, so a normalization of the
> individual times to 0 would not affect my calculations - the delta stays the
> same after all.
> 
> Also, using bignum in my calculations wouldn't change anything either. If
> perf tells me that 123336 cycles have been executed in 16.555 us, then it
> will always be larger than any expected value. At 3.2 GHz it should be at
> most 52976 cycles in such a short timeframe...
> 
> > Also at the beginning of frequency mode the periods are very small, and
> > the default us resolution will give big jumps for such a calculation.
> 
> OK, but who/what measures the large cycle values then? Is this a PMU
> limitation? Or is this an issue with the interaction with the kernel, when
> the algorithm tries to find a good frequency at the beginning?
> 
> > It's better to use the script --ns option then, but that makes the
> > double precision problem even worse.
> 
> See above, using `--ns` doesn't change anything. And qalc e.g. already uses
> bignum internally.
> 
> > In generally you get better results by avoiding frequency mode,
> > but always specify a fixed period.
> 
> This indeed removes the spikes at the beginning:
> 
> perf record --switch-events --call-graph dwarf -P -c 500000
> 
> The value is chosen to give a similar sample count to frequency mode.

Hey all,

I want to revive the above discussion as I'm still completely puzzled by the 
observation. The tl;dr for those who have not followed the previous 
discussion:

perf record in frequency mode (i.e. "record ~1000 samples per second") 
sometimes reports excessively large cycle counts.

In the previous mails I have outlined how to visualize this issue graphically 
with gnuplot, by drawing the first derivative of the cycles over time, which 
yields numbers directly comparable to one's CPU clock speed.
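For concreteness, here is a minimal Python sketch of that first-derivative calculation, equivalent to the awk one-liner quoted earlier; the parsing assumes the two-column `time: period` layout that `perf script -F time,period --ns` produces, and the inlined sample data is the short trace shown above:

```python
# Compute the instantaneous "clock rate" in Hz as the first derivative
# of cycle counts over time, from `perf script -F time,period --ns`
# style output (two columns: "seconds: period").
sample_output = """\
71789.438122347:          1
71789.438127160:          1
71789.438129599:          7
71789.438131844:         94
71789.438134282:       1391
71789.438139871:      19106
71789.438156426:     123336
"""

def clock_rates(text):
    last_time = None
    rates = []
    for line in text.strip().splitlines():
        time_str, period_str = line.split(":")
        time, period = float(time_str), int(period_str)
        if last_time is not None:
            # cycles executed per second of wall time since the last sample
            rates.append((time, period / (time - last_time)))
        last_time = time
    return rates

for time, hz in clock_rates(sample_output):
    print(f"{time:.9f}\t{hz:.3e} Hz")
# The final delta works out to ~7.45e9 Hz -- far above any plausible
# clock speed; double precision is not to blame, since the subtraction
# still carries plenty of significant digits at this magnitude.
```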

Sometimes this value goes up to 10 GHz and beyond. Sometimes the values are so 
broken (i.e. so high) that they render the analysis with perf report or 
similar useless, as they completely skew the total event count and thereby 
drastically distort the fractional costs reported by perf. E.g. just now I 
ran `perf record` on another application and got this result:

~~~~~
$ perf script | grep page_remove_rmap -C 10
 QXcbEventReader 23866 605019.879412:     128193 cycles:ppp:  ffffffffbb23fbf5 
__fget_light (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879469:          1 cycles:ppp:  ffffffffbb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
 QXcbEventReader 23866 605019.879471:    1810360 cycles:ppp:  ffffffffbb1cb4ec 
find_vma (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879472:          1 cycles:ppp:  ffffffffbb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879474:         10 cycles:ppp:  ffffffffbb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879475:        216 cycles:ppp:  ffffffffbb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879486:       5106 cycles:ppp:  ffffffffbb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879489:      19381 cycles:ppp:      7f85beae64ce 
QWaitCondition::wait (/usr/lib/libQt5Core.so.5.9.1)
  lab_mandelbrot 23865 605019.879495:     251346 cycles:ppp:      7f85bf2567c1 
QScreen::~QScreen (/usr/lib/libQt5Gui.so.5.9.1)
  lab_mandelbrot 23865 605019.880722:    3210571 cycles:ppp:      7f85bd96f850 
__cxa_finalize (/usr/lib/libc-2.25.so)
  lab_mandelbrot 23865 605019.881592: 21110358010774 cycles:ppp:  
ffffffffbb1d4218 page_remove_rmap (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
~~~~~

Note the last sample's cycle cost of 21110358010774. This is so large that it 
completely dominates the total event count, which lies at 21126914067278. This 
perf record run took about 4 s, on a single-threaded application, with kernel 
4.12.8-2-ARCH on an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz. That hardware 
should at most be capable of running ~1.36E10 cycles over 4 s. perf record, 
thanks to the last sample, measured ~2.11E13 cycles - clearly off the charts.
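The back-of-the-envelope numbers above can be checked in a few lines of Python (3.4 GHz is the nominal clock of the i7-4770; turbo boost would raise the bound somewhat, but nowhere near three orders of magnitude):

```python
# Upper bound on cycles a 3.40 GHz core can retire in the ~4 s run,
# versus the figures perf actually reported.
clock_hz = 3.4e9                     # nominal frequency, ignoring turbo
runtime_s = 4.0
max_cycles = clock_hz * runtime_s    # ~1.36e10

bogus_sample = 21110358010774        # cycle count of the last sample above
total_events = 21126914067278        # total event count reported by perf

print(f"upper bound:          {max_cycles:.2e} cycles")
print(f"bogus sample / total: {bogus_sample / total_events:.1%}")
print(f"bogus sample alone exceeds the bound "
      f"{bogus_sample / max_cycles:.0f}-fold")
```

That single sample accounts for over 99.9% of all counted events and exceeds the physical upper bound by a factor of roughly 1500.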

I have never seen this issue outside of perf's frequency mode. But then again, 
that mode is the default and quite useful. Can anyone explain what I'm seeing 
here?

Is it a bug in the kernel?
Is it a bug/limitation in the PMU?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


Thread overview: 11+ messages
2017-06-17 19:07 deducing CPU clock rate over time from cycle samples Milian Wolff
2017-06-18  4:22 ` Andi Kleen
2017-06-18 19:53   ` Milian Wolff
2017-08-28 14:08     ` Milian Wolff [this message]
2017-08-28 14:40       ` broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples] Milian Wolff
2017-08-28 17:28         ` Andi Kleen
2017-09-01 10:34           ` Milian Wolff
2017-09-01 16:48             ` Andi Kleen
2017-09-04 14:35               ` Milian Wolff
2017-09-05  3:40                 ` Andi Kleen
2017-09-05 12:26                   ` Milian Wolff
