* Failure to parallelize
@ 2016-08-17 13:55 Benjamin King
  2016-08-18  9:56 ` Milian Wolff
  0 siblings, 1 reply; 5+ messages in thread
From: Benjamin King @ 2016-08-17 13:55 UTC (permalink / raw)
  To: linux-perf-users

Hi,

I recently had a performance regression where the program mysteriously became
20% slower without executing more instructions or burning more cycles. It
turned out that a loop had lost an OpenMP pragma and was no longer parallel.
This was a tiny part of a larger diff and was missed during code review.

I was struggling to find this with perf. "perf record" showed me mostly
identical values. "perf stat" was also mostly the same, including "task-clock
(msec)".

Eventually, I noticed the lower number for "CPUs utilized", but I had no
idea where in my code this came from.
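
For reference, the metric shows up in a plain perf stat run of the sample
program below. As far as I understand it, "CPUs utilized" is simply task-clock
divided by wall-clock time, which would explain why task-clock alone looked
unchanged:

----- 8< -----
# "CPUs utilized" is printed next to task-clock in the summary of:
perf stat ./noppy
----- 8< -----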

In the following sample code, perf always reports ~10% for the function
bar(), regardless of whether I call it in parallel or not.

Is there some way to make the difference more visible in perf? 

Cheers,
  Benjamin King

----- 8< -----
// gcc -g -fopenmp noppy.c -o noppy; perf record ./noppy; perf report
#include <omp.h>
#include <stdio.h>

void foo() // ~90% of "work" is done here
{
  int i;
  for ( i = 0; i < 900; ++i )
    asm("nop;nop;nop;nop;");
}

void bar() // ~10% of "work" is done here
{
  int i;
  for ( i = 0; i < 100; ++i )
    asm("nop;nop;nop;nop;");
}

int main()
{
  int s;
  for ( s = 0; s < 1; ++s )
  {
    long i;
#pragma omp parallel for
    for ( i = 0; i < 1000000; ++i )
      foo();
    // Whoops, I accidentally deleted the following pragma
//#pragma omp parallel for
    for ( i = 0; i < 1000000; ++i )
      bar();
  }
}
----- 8< -----


* Re: Failure to parallelize
  2016-08-17 13:55 Failure to parallelize Benjamin King
@ 2016-08-18  9:56 ` Milian Wolff
  2016-08-18 18:50   ` Benjamin King
  0 siblings, 1 reply; 5+ messages in thread
From: Milian Wolff @ 2016-08-18  9:56 UTC (permalink / raw)
  To: benjaminking; +Cc: linux-perf-users


On Wednesday, August 17, 2016 3:55:28 PM CEST Benjamin King wrote:
> Hi,
> 
> I recently had a performance regression where the program mysteriously
> became 20% slower without executing more instructions or burning more
> cycles. It turned out that a loop had lost an OpenMP pragma and was no
> longer parallel. This was a tiny part of a larger diff and was missed
> during code review.
> 
> I was struggling to find this with perf. "perf record" showed me mostly
> identical values. "perf stat" was also mostly the same, including
> "task-clock (msec)".
> 
> Eventually, I noticed the lower number for "CPUs utilized", but I had no
> idea where in my code this came from.
> 
> In the following sample code, perf always reports ~10% for the function
> bar(), regardless of whether I call it in parallel or not.
> 
> Is there some way to make the difference more visible in perf?

I think that won't even be detected by perf's way of doing sleep time 
profiling, e.g.:

https://github.com/milianw/shell-helpers/blob/master/perf-sleep-record

Because there is no contention - it's simply Amdahl's law that's tripping you 
up. Having a look at the CPU utilization is very important when writing 
(supposedly) parallel code.
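
For reference, that kind of sleep-time profiling boils down to roughly the
following. This is only a sketch from memory; it assumes a kernel with
schedstats support and that the enabling step is run as root:

----- 8< -----
# sketch of sleep-time profiling with perf; needs CONFIG_SCHEDSTATS and root
echo 1 > /proc/sys/kernel/sched_schedstats
perf record -e sched:sched_stat_sleep -e sched:sched_switch \
            -e sched:sched_process_exit -g -o perf.data.raw ./noppy
# fold the recorded sleep times back into the samples
perf inject -v -s -i perf.data.raw -o perf.data
perf report --stdio --show-total-period -i perf.data
----- 8< -----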

But perf is currently lacking a good visualization of what is going on when - 
it only gives you aggregated data over the whole run. You could try `perf 
timechart` - it will at least tell you that only one thread is running at some 
point.
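
Something along these lines should give you a rough per-task picture of when
things run; it is a sketch, and the resulting SVG can get unwieldy for longer
runs:

----- 8< -----
perf timechart record ./noppy   # records scheduler events for the run
perf timechart                  # turns perf.data into output.svg, one row per task
----- 8< -----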

Figuring out what code is running at that point is still not easily possible, 
except by manually looking at the output of `perf script`. I think as a first 
step we should add some way to only analyze a certain time range in perf 
report. That, paired with the timechart, could show the issue.
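
Until something like that exists, you can at least dump per-sample timestamps
and thread ids and eyeball the phases by hand, e.g.:

----- 8< -----
# per-sample timestamp, thread id and symbol, straight from perf.data
perf script -F time,tid,ip,sym | less
----- 8< -----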

I also want to point out that the thread timeline in VTune would directly show 
you the issue, and the UI gives you the ability to zoom/filter in on the time 
range to figure out what code is running.

Cheers, HTH

-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts



* Re: Failure to parallelize
  2016-08-18  9:56 ` Milian Wolff
@ 2016-08-18 18:50   ` Benjamin King
  2016-08-22 21:14     ` Andi Kleen
  0 siblings, 1 reply; 5+ messages in thread
From: Benjamin King @ 2016-08-18 18:50 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users

On Thu, Aug 18, 2016 at 11:56:45AM +0200, Milian Wolff wrote:
>> Is there some way to make the difference [between threaded and
>> non-threaded code] more visible in perf?
>
>I think that won't even be detected by perf's way of doing sleep time
>profiling, e.g.:
>
>https://github.com/milianw/shell-helpers/blob/master/perf-sleep-record
>
>Because there is no contention - it's simply Amdahl's law that's tripping you
>up. Having a look at the CPU utilization is very important when writing
>(supposedly) parallel code.

Yes, I should have looked at CPU utilization more closely right away. But
even now that I know this, it seems awkward that I cannot demonstrate with my
trusty old profiler which part of my program now takes longer in terms of
wall-clock time. I don't have VTune at my disposal, but it could be that I am
using the wrong tool for the job.

Still, I dabbled a bit with "perf record -s ...; perf report -T", but I find the
output a little confusing. To wit:

-----8< noppy1.c
#include <omp.h>
void foo() { int i; for ( i = 0; i < 900; ++i ) asm("nop;nop;nop;nop;"); }
void bar() { int i; for ( i = 0; i < 100; ++i ) asm("nop;nop;nop;nop;"); }
int main() {
  long i;
#pragma omp parallel for
  for ( i = 0; i < 1000000; ++i ) foo();
  for ( i = 0; i < 1000000; ++i ) bar();
}
-----8< noppy2.c
#include <omp.h>
void foo() { int i; for ( i = 0; i < 900; ++i ) asm("nop;nop;nop;nop;"); }
void bar() { int i; for ( i = 0; i < 100; ++i ) asm("nop;nop;nop;nop;"); }
int main() {
  long i;
#pragma omp parallel for
  for ( i = 0; i < 1000000; ++i ) foo();
#pragma omp parallel for
  for ( i = 0; i < 1000000; ++i ) bar();
}
-----8< gcc noppy1.c -g -fopenmp -o noppy1;perf record -s ./noppy1;perf report -T
    92.15%  noppy1   noppy1              [.] foo
     7.08%  noppy1   noppy1              [.] bar
    ...
#  PID   TID  cycles:pp   cycles:pp  cycles:pp
  3853  3856          0  1492046281          0
  3853  3854          0    57482400          0
  3853  3855          0           0          0
-----8< gcc noppy2.c -g -fopenmp -o noppy2;perf record -s ./noppy2;perf report -T
    88.97%  noppy2   noppy2            [.] foo                   
    10.27%  noppy2   noppy2            [.] bar                   
    ...
#  PID   TID        cycles:pp  cycles:pp  cycles:pp
  3869  3870                0   56778112          0
  3869  3871       2180814133   57030240          0
  3869  3872  139866901929176          0          0
-----8<


So, there is some difference in cycles:pp, but I totally don't get what the
table at the end of perf report -T is supposed to mean. The large value for
TID 3872 looks broken.


I had more luck with 'perf record --per-thread':

-----8< perf record --per-thread ./noppy1;perf report
  68.90%  noppy1   noppy1            [.] foo
  29.79%  noppy1   noppy1            [.] bar
...
-----8< perf record --per-thread ./noppy2;perf report
  87.18%  noppy2   noppy2            [.] foo
  10.35%  noppy2   noppy2            [.] bar
...
-----8<

So, noppy1 looks different, and I can see that the effort has shifted between
foo and bar. Adding '--show-nr-samples' or '--show-total-period' then tells me
that the effort for foo() stays the same, but bar() gets more expensive.
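
For completeness, the exact flags were along these lines (from memory, so
treat it as a sketch); the extra columns appear next to the percentages:

-----8< perf record --per-thread ./noppy2;perf report --show-nr-samples --show-total-period
-----8<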

Unfortunately, I still do not understand what exactly '--per-thread' does.
The manpage is a little brief and I have not looked at the code yet.
But it's a start!

Cheers,
  Benjamin


* Re: Failure to parallelize
  2016-08-18 18:50   ` Benjamin King
@ 2016-08-22 21:14     ` Andi Kleen
  2016-08-23  6:10       ` Benjamin King
  0 siblings, 1 reply; 5+ messages in thread
From: Andi Kleen @ 2016-08-22 21:14 UTC (permalink / raw)
  To: Benjamin King; +Cc: Milian Wolff, linux-perf-users

Benjamin King <benjaminking@web.de> writes:
>
> Still, I dabbled a bit with "perf record -s ...; perf report -T", but I find the
> output a little confusing. To wit:

I would rather use perf report --sort cpu,sym

This reports all samples separated by CPUs, but sorted in the same view.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: Failure to parallelize
  2016-08-22 21:14     ` Andi Kleen
@ 2016-08-23  6:10       ` Benjamin King
  0 siblings, 0 replies; 5+ messages in thread
From: Benjamin King @ 2016-08-23  6:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Milian Wolff, linux-perf-users

Hi Andi,

On Mon, Aug 22, 2016 at 02:14:08PM -0700, Andi Kleen wrote:
>Benjamin King <benjaminking@web.de> writes:
>> Still, I dabbled a bit with "perf record -s ...; perf report -T", but I find the
>> output a little confusing. To wit:
>
>I would rather use perf report --sort cpu,sym
>This reports all samples separated by CPUs, but sorted in the same view.

Thanks! 'perf record ...; perf report --sort pid,sym' did the job for me.
Here I can clearly see, via the thread IDs, the difference between a function
that is called in parallel and one that is not.

With 'perf report --sort cpu,sym' I get '-001' for the CPU number. 'perf
script -Fcpu,ip' tells me that:

Samples for 'cycles:pp' event do not have CPU attribute set. Cannot print 'cpu' field.

This was with perf 4.4.0 on an i7-3537U. I stripped off the ':pp', tried
recording instructions rather than cycles, and also tested with perf 4.2.3 on
an i7-3770, but had no luck with the per-CPU display.
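
If I read the perf changelog correctly, newer versions of perf record gained a
'--sample-cpu' switch that stores the CPU with every sample. I could not
verify it on these machines, but something like this might make the cpu sort
key usable:

-----8<
perf record --sample-cpu ./noppy1
perf report --sort cpu,sym
-----8<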

Cheers,
  Benjamin

