* Failure to parallelize
@ 2016-08-17 13:55 Benjamin King
  2016-08-18  9:56 ` Milian Wolff
  0 siblings, 1 reply; 5+ messages in thread
From: Benjamin King @ 2016-08-17 13:55 UTC (permalink / raw)
  To: linux-perf-users

Hi,

I recently had a performance regression where the program mysteriously became
20% slower without executing more instructions or burning more cycles. It
turned out that a loop had lost an OpenMP pragma and was no longer parallel.
This was a tiny part of a larger diff and was missed during code review.

I was struggling to find this with perf. "perf record" showed me mostly
identical values. "perf stat" was also mostly the same, including "task-clock
(msec)".

Eventually, I noticed the lower number for "CPUs utilized", but I had no
idea where in my code this came from.
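
For reference, the metric shows up in a plain perf stat run of the sample
program below. As far as I understand it, "CPUs utilized" is simply task-clock
divided by wall-clock time, which would explain why task-clock alone looked
unchanged:

----- 8< -----
# "CPUs utilized" is printed next to task-clock in the summary of:
perf stat ./noppy
----- 8< -----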

In the following sample code, perf always reports ~10% for the function
bar(), regardless of whether I call it in parallel or not.

Is there some way to make the difference more visible in perf? 

Cheers,
  Benjamin King

----- 8< -----
// gcc -g -fopenmp noppy.c -o noppy; perf record ./noppy; perf report
#include <omp.h>
#include <stdio.h>

void foo() // ~90% of "work" is done here
{
  int i;
  for ( i = 0; i < 900; ++i )
    asm("nop;nop;nop;nop;");
}

void bar() // ~10% of "work" is done here
{
  int i;
  for ( i = 0; i < 100; ++i )
    asm("nop;nop;nop;nop;");
}

int main()
{
  int s;
  for ( s = 0; s < 1; ++s )
  {
    long i;
#pragma omp parallel for
    for ( i = 0; i < 1000000; ++i )
      foo();
    // Whoops, I accidentally deleted the following pragma
//#pragma omp parallel for
    for ( i = 0; i < 1000000; ++i )
      bar();
  }
}
----- 8< -----


* Re: Failure to parallelize
  2016-08-17 13:55 Failure to parallelize Benjamin King
@ 2016-08-18  9:56 ` Milian Wolff
  2016-08-18 18:50   ` Benjamin King
  0 siblings, 1 reply; 5+ messages in thread
From: Milian Wolff @ 2016-08-18  9:56 UTC (permalink / raw)
  To: benjaminking; +Cc: linux-perf-users


On Wednesday, August 17, 2016 3:55:28 PM CEST Benjamin King wrote:
> Hi,
> 
> I recently had a performance regression where the program mysteriously
> became 20% slower without executing more instructions or burning more
> cycles. It turned out that a loop had lost an OpenMP pragma and was no
> longer parallel. This was a tiny part of a larger diff and was missed
> during code review.
> 
> I was struggling to find this with perf. "perf record" showed me mostly
> identical values. "perf stat" was also mostly the same, including
> "task-clock (msec)".
> 
> Eventually, I noticed the lower number for "CPUs utilized", but I had no
> idea where in my code this came from.
> 
> In the following sample code, perf always reports ~10% for the function
> bar(), regardless of whether I call it in parallel or not.
> 
> Is there some way to make the difference more visible in perf?

I think that won't even be detected by perf's way of doing sleep time 
profiling, e.g.:

https://github.com/milianw/shell-helpers/blob/master/perf-sleep-record

Because there is no contention - it's simply Amdahl's law that's tripping you 
up. Having a look at the CPU utilization is very important when writing 
(supposedly) parallel code.
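
For reference, that kind of sleep-time profiling boils down to roughly the
following. This is only a sketch from memory; it assumes a kernel with
schedstats support and that the enabling step is run as root:

----- 8< -----
# sketch of sleep-time profiling with perf; needs CONFIG_SCHEDSTATS and root
echo 1 > /proc/sys/kernel/sched_schedstats
perf record -e sched:sched_stat_sleep -e sched:sched_switch \
            -e sched:sched_process_exit -g -o perf.data.raw ./noppy
# fold the recorded sleep times back into the samples
perf inject -v -s -i perf.data.raw -o perf.data
perf report --stdio --show-total-period -i perf.data
----- 8< -----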

But perf is currently lacking a good visualization of what is going on when - 
it only gives you aggregated data over the whole run. You could try `perf 
timechart` - it will at least tell you that only one thread is running at some 
point.
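
Something along these lines should give you a rough per-task picture of when
things run; it is a sketch, and the resulting SVG can get unwieldy for longer
runs:

----- 8< -----
perf timechart record ./noppy   # records scheduler events for the run
perf timechart                  # turns perf.data into output.svg, one row per task
----- 8< -----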

Figuring out what code is running at that point is still not easily possible, 
except by manually looking at the output of `perf script`. I think as a first 
step we should add some way to only analyze a certain time range in perf 
report. That, paired with the timechart, could show the issue.
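
Until something like that exists, you can at least dump per-sample timestamps
and thread ids and eyeball the phases by hand, e.g.:

----- 8< -----
# per-sample timestamp, thread id and symbol, straight from perf.data
perf script -F time,tid,ip,sym | less
----- 8< -----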

I also want to point out that the thread timeline in VTune would directly show 
you the issue, and the UI gives you the ability to zoom/filter in on the time 
range to figure out what code is running.

Cheers, HTH

-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts



* Re: Failure to parallelize
  2016-08-18  9:56 ` Milian Wolff
@ 2016-08-18 18:50   ` Benjamin King
  2016-08-22 21:14     ` Andi Kleen
  0 siblings, 1 reply; 5+ messages in thread
From: Benjamin King @ 2016-08-18 18:50 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users

On Thu, Aug 18, 2016 at 11:56:45AM +0200, Milian Wolff wrote:
>> Is there some way to make the difference [between threaded and
>> non-threaded code] more visible in perf?
>
>I think that won't even be detected by perf's way of doing sleep time
>profiling, e.g.:
>
>https://github.com/milianw/shell-helpers/blob/master/perf-sleep-record
>
>Because there is no contention - it's simply Amdahl's law that's tripping you
>up. Having a look at the CPU utilization is very important when writing
>(supposedly) parallel code.

Yes, I should have looked at CPU utilization more closely right away. But
even now that I know this, it seems awkward that I cannot demonstrate with my
trusty old profiler which part of my program now takes longer in terms of
wall-clock time. I don't have VTune at my disposal, but it could be that I am
using the wrong tool for the job.

Still, I dabbled a bit with "perf record -s ...; perf report -T", but I find the
output a little confusing. To wit:

-----8< noppy1.c
#include <omp.h>
void foo() { int i; for ( i = 0; i < 900; ++i ) asm("nop;nop;nop;nop;"); }
void bar() { int i; for ( i = 0; i < 100; ++i ) asm("nop;nop;nop;nop;"); }
int main() {
  long i;
#pragma omp parallel for
  for ( i = 0; i < 1000000; ++i ) foo();
  for ( i = 0; i < 1000000; ++i ) bar();
}
-----8< noppy2.c
#include <omp.h>
void foo() { int i; for ( i = 0; i < 900; ++i ) asm("nop;nop;nop;nop;"); }
void bar() { int i; for ( i = 0; i < 100; ++i ) asm("nop;nop;nop;nop;"); }
int main() {
  long i;
#pragma omp parallel for
  for ( i = 0; i < 1000000; ++i ) foo();
#pragma omp parallel for
  for ( i = 0; i < 1000000; ++i ) bar();
}
-----8< gcc noppy1.c -g -fopenmp -o noppy1;perf record -s ./noppy1;perf report -T
    92.15%  noppy1   noppy1              [.] foo
     7.08%  noppy1   noppy1              [.] bar
    ...
#  PID   TID  cycles:pp   cycles:pp  cycles:pp
  3853  3856          0  1492046281          0
  3853  3854          0    57482400          0
  3853  3855          0           0          0
-----8< gcc noppy2.c -g -fopenmp -o noppy2;perf record -s ./noppy2;perf report -T
    88.97%  noppy2   noppy2            [.] foo                   
    10.27%  noppy2   noppy2            [.] bar                   
    ...
#  PID   TID        cycles:pp  cycles:pp  cycles:pp
  3869  3870                0   56778112          0
  3869  3871       2180814133   57030240          0
  3869  3872  139866901929176          0          0
-----8<


So, there is some difference in cycles:pp, but I totally don't get what the
table at the end of perf report -T is supposed to mean. The large value for
TID 3872 looks broken.


I had more luck with 'perf record --per-thread':

-----8< perf record --per-thread ./noppy1;perf report
  68.90%  noppy1   noppy1            [.] foo
  29.79%  noppy1   noppy1            [.] bar
...
-----8< perf record --per-thread ./noppy2;perf report
  87.18%  noppy2   noppy2            [.] foo
  10.35%  noppy2   noppy2            [.] bar
...
-----8<

So, noppy1 looks different, and I can see that the effort has shifted between
foo and bar. Adding '--show-nr-samples' or '--show-total-period' then tells me
that the effort for foo() stays the same, but bar() gets more expensive.
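
For completeness, the exact flags were along these lines (from memory, so
treat it as a sketch); the extra columns appear next to the percentages:

-----8< perf record --per-thread ./noppy2;perf report --show-nr-samples --show-total-period
-----8<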

Unfortunately, I still do not understand what exactly '--per-thread' does.
The manpage is a little brief and I have not looked at the code yet.
But it's a start!

Cheers,
  Benjamin


* Re: Failure to parallelize
  2016-08-18 18:50   ` Benjamin King
@ 2016-08-22 21:14     ` Andi Kleen
  2016-08-23  6:10       ` Benjamin King
  0 siblings, 1 reply; 5+ messages in thread
From: Andi Kleen @ 2016-08-22 21:14 UTC (permalink / raw)
  To: Benjamin King; +Cc: Milian Wolff, linux-perf-users

Benjamin King <benjaminking@web.de> writes:
>
> Still, I dabbled a bit with "perf record -s ...; perf report -T", but I find the
> output a little confusing. To wit:

I would rather use perf report --sort cpu,sym

This reports all samples separated by CPUs, but sorted in the same view.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: Failure to parallelize
  2016-08-22 21:14     ` Andi Kleen
@ 2016-08-23  6:10       ` Benjamin King
  0 siblings, 0 replies; 5+ messages in thread
From: Benjamin King @ 2016-08-23  6:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Milian Wolff, linux-perf-users

Hi Andi,

On Mon, Aug 22, 2016 at 02:14:08PM -0700, Andi Kleen wrote:
>Benjamin King <benjaminking@web.de> writes:
>> Still, I dabbled a bit with "perf record -s ...; perf report -T", but I find the
>> output a little confusing. To wit:
>
>I would rather use perf report --sort cpu,sym
>This reports all samples separated by CPUs, but sorted in the same view.

Thanks! 'perf record ...; perf report --sort pid,sym' did the job for me.
Here I can clearly see, via the thread IDs, the difference between a function
that is called in parallel and one that is not.

With 'perf report --sort cpu,sym' I get '-001' for the CPU number. 'perf
script -Fcpu,ip' tells me that:

Samples for 'cycles:pp' event do not have CPU attribute set. Cannot print 'cpu' field.

This was with perf 4.4.0 on an i7-3537U. I stripped off the ':pp', tried
recording instructions rather than cycles, and also tested with perf 4.2.3 on
an i7-3770, but had no luck with the per-CPU display.
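
If I read the perf changelog correctly, newer versions of perf record gained a
'--sample-cpu' switch that stores the CPU with every sample. I could not
verify it on these machines, but something like this might make the cpu sort
key usable:

-----8<
perf record --sample-cpu ./noppy1
perf report --sort cpu,sym
-----8<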

Cheers,
  Benjamin

