* [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
@ 2009-11-26  8:05 ling.ma
  2009-11-26  9:49 ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: ling.ma @ 2009-11-26  8:05 UTC
  To: mingo; +Cc: hpa, tglx, linux-kernel, Ma Ling

From: Ma Ling <ling.ma@intel.com>

Hi All

In the current default kernel build options we prefer Os to O2. Os clearly
reduces the compiled kernel code size, while O2 favors performance over code
size. The usual expectation is that in a real environment O2 causes more
i-cache misses than Os, so overall performance should be lower.

On our test machine the kernel code size is 12M with Os and 14M with O2.
 
But we have two questions about this on the latest platform:
1. Even 10% of the current Os kernel code size (the CPU execution path)
   is far larger than the L1 i-cache, so the difference in i-cache miss
   counts between the two options should be small.
2. Our latest platform should have excellent prefetch capability by adjusting
   the predicted execution path.
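
The first point can be checked with a rough back-of-the-envelope sketch. It
assumes "12M" means 12 MiB and that the L1 i-cache is 32 KiB per core; the
32 KiB figure is an assumption about this CPU generation and is not stated
in this mail.

```python
# Rough footprint estimate for point 1 above.
KIB = 1024
MIB = 1024 * KIB

kernel_text_os = 12 * MIB   # Os kernel code size reported in this mail (assumed MiB)
hot_fraction = 0.10         # share of the code assumed to be on the execution path
l1_icache = 32 * KIB        # ASSUMPTION: 32 KiB L1 i-cache per core, not from this mail

hot_bytes = kernel_text_os * hot_fraction
ratio = hot_bytes / l1_icache

print(f"hot code: {hot_bytes / MIB:.1f} MiB, about {ratio:.1f}x the L1 i-cache")
```

Under these assumptions the hot path is dozens of times larger than the L1
i-cache with either option, which is the basis for expecting a small
difference in miss counts.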

For these reasons we recompiled the Linux kernel with the O2 option on the platform below.
CPU type: 2P quad-core Core i7 (2 sockets * 4 cores * 2 hyperthreads)
CPU frequency: 2670MHz
Memory: 6 x 1GB

We ran common, stable benchmarks twice; the results show that
O2 performs better than Os (Linux kernel version 2.6.32-rc8).

Benchmarks:                          improvement
volano                                8%
netperf                               6.7%
tbench                                6.45%
Kbuild                                5.5% (average of 3 runs)
specjbb2000                           2%
fio                                   2%
specjbb2005                           No change
cpu2000                               No change
aim7                                  No change
hackbench                             No change
oltp                                  No change

This patch enables the O2 option and disables the Os option.

Any comments are appreciated.

Thanks
Ling

---
 arch/x86/configs/x86_64_defconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index 6c86acd..d564b90 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -126,7 +126,7 @@ CONFIG_INITRAMFS_SOURCE=""
 CONFIG_RD_GZIP=y
 CONFIG_RD_BZIP2=y
 CONFIG_RD_LZMA=y
-CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+CONFIG_CC_OPTIMIZE_FOR_SIZE=n
 CONFIG_SYSCTL=y
 CONFIG_ANON_INODES=y
 # CONFIG_EMBEDDED is not set
-- 
1.6.2.5



* Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-11-26  8:05 [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform ling.ma
@ 2009-11-26  9:49 ` Ingo Molnar
  2009-12-01  8:54   ` Ma, Ling
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2009-11-26  9:49 UTC
  To: ling.ma, Arjan van de Ven, Dave Jones; +Cc: hpa, tglx, linux-kernel


* ling.ma@intel.com <ling.ma@intel.com> wrote:

> Benchmarks:                          improvement 
> volano                                8%
> netperf                               6.7% 
> tbench                                6.45%
> Kbuild                                5.5% (3 time test, average improvement)

that Kbuild result looks suspicious. A kbuild only spends 25% of its time
in the kernel (system time), so a 5.5% improvement means that system
utilization dropped from 25% to 19.5%, a 28% improvement in the kernel!
That looks rather unlikely.
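
The arithmetic behind that estimate can be sketched as follows. It assumes,
as the paragraph does, that the whole 5.5% gain comes out of the 25% system
time share; the function name is hypothetical and only for illustration.

```python
def implied_kernel_gain(system_share, total_gain):
    """Attribute a whole-benchmark gain entirely to kernel (system) time.

    Both arguments are fractions of the original total runtime.
    Returns (new system share, relative reduction, equivalent speedup).
    """
    new_share = system_share - total_gain
    if new_share <= 0:
        raise ValueError("gain exceeds the time spent in the kernel")
    reduction = total_gain / system_share       # share of kernel time removed
    speedup = system_share / new_share - 1.0    # how much faster the kernel runs
    return new_share, reduction, speedup


new_share, reduction, speedup = implied_kernel_gain(0.25, 0.055)
print(f"system time: 25.0% -> {new_share * 100:.1f}% of the original runtime")
print(f"kernel time reduced by {reduction * 100:.0f}%, "
      f"i.e. the kernel would be {speedup * 100:.0f}% faster")
```

Read as a speedup (25 / 19.5), this gives the 28% quoted above; read as a
reduction in kernel time it is 22%. Either way it is a large effect to
expect from a compiler flag alone.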

Could you please post before/after 'perf stat --repeat 3' results so 
that we can see the noise level?

Thanks,

	Ingo


* RE: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-11-26  9:49 ` Ingo Molnar
@ 2009-12-01  8:54   ` Ma, Ling
  2009-12-01 10:14     ` Arjan van de Ven
  2009-12-02  9:47     ` Ingo Molnar
  0 siblings, 2 replies; 10+ messages in thread
From: Ma, Ling @ 2009-12-01  8:54 UTC
  To: Ingo Molnar, Arjan van de Ven, Dave Jones; +Cc: hpa, tglx, linux-kernel

Hi Ingo

Thanks for your correction. We used perf stat --repeat 3 to test volano, tbench, and kbuild.
Because netperf has multiple test items, we may send those results out later.

volano_Os:

Performance counter stats for '/bm/bin/runs -t volano -r /bm/recipes/lkp-ne02.recipe' (3 runs):

 6386111.436735  task-clock-msecs         #     13.554 CPUs    ( +-   0.336% )
      914192633  context-switches         #      0.143 M/sec   ( +-   0.046% )
       49186605  CPU-migrations           #      0.008 M/sec   ( +-   0.962% )
         768344  page-faults              #      0.000 M/sec   ( +-   0.338% )
 18680627716893  cycles                   #   2925.196 M/sec   ( +-   0.339% )
  7247421283541  instructions             #      0.388 IPC     ( +-   0.124% )
   226838591574  cache-references         #     35.521 M/sec   ( +-   0.971% )
     9420427393  cache-misses             #      1.475 M/sec   ( +-   0.897% )

  471.172398867  seconds time elapsed   ( +-   1.292% )

volano_O2:

Performance counter stats for '/bm/bin/runs -t volano -r /bm/recipes/lkp-ne02.recipe' (3 runs):

 5873675.998422  task-clock-msecs         #     13.447 CPUs    ( +-   0.338% )
      916070728  context-switches         #      0.156 M/sec   ( +-   0.050% )
       48759104  CPU-migrations           #      0.008 M/sec   ( +-   0.614% )
         738964  page-faults              #      0.000 M/sec   ( +-   0.082% )
 17145170491943  cycles                   #   2918.985 M/sec   ( +-   0.288% )
  7324126478801  instructions             #      0.427 IPC     ( +-   0.090% )
   219064318074  cache-references         #     37.296 M/sec   ( +-   0.792% )
     9491237013  cache-misses             #      1.616 M/sec   ( +-   0.439% )

  436.806579899  seconds time elapsed   ( +-   0.392% )

O2 is better than Os for volano

tbench_Os:

Performance counter stats for '/bm/bin/runs -t tbench -r /bm/recipes/lkp-ne02.recipe' (3 runs):

 11630970.099215  task-clock-msecs         #     15.476 CPUs    ( +-   1.285% )
     1162148139  context-switches         #      0.100 M/sec   ( +-   0.372% )
          39772  CPU-migrations           #      0.000 M/sec   ( +-   0.502% )
        1536289  page-faults              #      0.000 M/sec   ( +-   0.020% )
 33408973681696  cycles                   #   2872.415 M/sec   ( +-   0.028% )
 14229765107716  instructions             #      0.426 IPC     ( +-   0.113% )
   290717607018  cache-references         #     24.995 M/sec   ( +-  10.425% )
     2525058529  cache-misses             #      0.217 M/sec   ( +-   1.798% )

  751.537009428  seconds time elapsed   ( +-   0.173% )

tbench_O2:

Performance counter stats for '/bm/bin/runs -t tbench -r /bm/recipes/lkp-ne02.recipe' (3 runs):

 12093825.537708  task-clock-msecs         #     16.084 CPUs    ( +-   6.363% )
     1235837814  context-switches         #      0.102 M/sec   ( +-   0.857% )
          42363  CPU-migrations           #      0.000 M/sec   ( +-   3.968% )
        1535481  page-faults              #      0.000 M/sec   ( +-   0.350% )
 33028312063911  cycles                   #   2731.006 M/sec   ( +-   0.908% )
 15535465986643  instructions             #      0.470 IPC     ( +-   0.058% )
   280118529329  cache-references         #     23.162 M/sec   ( +-   0.695% )
     2866275183  cache-misses             #      0.237 M/sec   ( +-   0.893% )

  751.921568581  seconds time elapsed   ( +-   0.182% )

O2 is no different from Os for tbench

kbuild_Os:

Performance counter stats for '/bm/bin/runs -t kbuild -r /bm/recipes/lkp-ne02.recipe' (3 runs):

  886426.102100  task-clock-msecs         #      1.053 CPUs    ( +-   1.712% )
         980944  context-switches         #      0.001 M/sec   ( +-   1.149% )
         285613  CPU-migrations           #      0.000 M/sec   ( +-   1.543% )
       81244856  page-faults              #      0.092 M/sec   ( +-   1.611% )
  2610381816839  cycles                   #   2944.839 M/sec   ( +-   1.696% )
  2907701964460  instructions             #      1.114 IPC     ( +-   1.726% )
    14758764510  cache-references         #     16.650 M/sec   ( +-   1.581% )
     3212068899  cache-misses             #      3.624 M/sec   ( +-   1.729% )

  841.492770793  seconds time elapsed   ( +-   0.209% )

kbuild_O2:

Performance counter stats for '/bm/bin/runs -t kbuild -r /bm/recipes/lkp-ne02.recipe' (3 runs):

  897281.428095  task-clock-msecs         #      1.062 CPUs    ( +-   0.524% )
         964812  context-switches         #      0.001 M/sec   ( +-   1.630% )
         287443  CPU-migrations           #      0.000 M/sec   ( +-   0.532% )
       82509345  page-faults              #      0.092 M/sec   ( +-   0.071% )
  2635837258275  cycles                   #   2937.581 M/sec   ( +-   0.150% )
  2955626723788  instructions             #      1.121 IPC     ( +-   0.117% )
    14939108242  cache-references         #     16.649 M/sec   ( +-   0.609% )
     3267365744  cache-misses             #      3.641 M/sec   ( +-   0.066% )

  844.891541856  seconds time elapsed   ( +-   0.468% )

O2 is no different from Os for kbuild
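
The three conclusions above can be cross-checked against the elapsed times
with a small sketch. The numbers are copied from the perf stat output; using
the sum of the two reported +- values as the noise threshold is a crude
assumption for illustration, not a proper statistical test.

```python
# (elapsed seconds, reported +- noise in percent), copied from the perf stat output
runs = {
    "volano": {"Os": (471.172398867, 1.292), "O2": (436.806579899, 0.392)},
    "tbench": {"Os": (751.537009428, 0.173), "O2": (751.921568581, 0.182)},
    "kbuild": {"Os": (841.492770793, 0.209), "O2": (844.891541856, 0.468)},
}


def compare(os_run, o2_run):
    """Return (percent time saved by O2, combined noise, exceeds noise?)."""
    os_time, os_noise = os_run
    o2_time, o2_noise = o2_run
    saved = (os_time - o2_time) / os_time * 100.0
    noise = os_noise + o2_noise   # crude threshold: sum of both noise figures
    return saved, noise, abs(saved) > noise


results = {name: compare(r["Os"], r["O2"]) for name, r in runs.items()}

for name, (saved, noise, significant) in results.items():
    verdict = "above noise" if significant else "within noise"
    print(f"{name}: O2 saves {saved:+.2f}% elapsed time "
          f"(noise ~{noise:.2f}%) -> {verdict}")
```

With this threshold only volano shows a difference larger than the
run-to-run variation; tbench and kbuild stay within it.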

Thanks
Ling



* Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-12-01  8:54   ` Ma, Ling
@ 2009-12-01 10:14     ` Arjan van de Ven
  2009-12-01 16:11       ` H. Peter Anvin
  2009-12-03 15:03       ` Ma, Ling
  2009-12-02  9:47     ` Ingo Molnar
  1 sibling, 2 replies; 10+ messages in thread
From: Arjan van de Ven @ 2009-12-01 10:14 UTC
  To: Ma, Ling; +Cc: Ingo Molnar, Dave Jones, hpa, tglx, linux-kernel

On Tue, 1 Dec 2009 16:54:04 +0800
"Ma, Ling" <ling.ma@intel.com> wrote:

> Hi Ingo
> 
> Thanks for your correction, so we use perf stat --repeat 3 to test
> volano, tbench, and kbuild, Because netperf has multiple items we may
> send out later.

a key question is.. how much more memory do you have free due to -Os?
(because memory is cache is performance on a system level as well)
and how much less icache pressure is there?


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-12-01 10:14     ` Arjan van de Ven
@ 2009-12-01 16:11       ` H. Peter Anvin
  2009-12-03 15:03       ` Ma, Ling
  1 sibling, 0 replies; 10+ messages in thread
From: H. Peter Anvin @ 2009-12-01 16:11 UTC
  To: Arjan van de Ven; +Cc: Ma, Ling, Ingo Molnar, Dave Jones, tglx, linux-kernel

On 12/01/2009 02:14 AM, Arjan van de Ven wrote:
> On Tue, 1 Dec 2009 16:54:04 +0800
> "Ma, Ling" <ling.ma@intel.com> wrote:
> 
>> Hi Ingo
>>
>> Thanks for your correction, so we use perf stat --repeat 3 to test
>> volano, tbench, and kbuild, Because netperf has multiple items we may
>> send out later.
> 
> a key question is.. how much more memory do you have free due to -Os?
> (because memory is cache is performance on a system level as well)
> and how much less icache pressure is there?
> 

From the re-run, it sounds like the only test that actually shows a
significant difference is volano.  From reading the numbers, it looks
like the improvements are almost exclusively in IPC i.e. better
scheduling -- all the other metrics are substantially worse; including a
10% increase in cache misses.

It would be interesting to see what functions are hot in volano.  It
might very well be that we could get a boost without significantly
bloating the kernel as a whole by picking out a couple of hot object
files and compiling those with -O2 or -O3.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


* Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-12-01  8:54   ` Ma, Ling
  2009-12-01 10:14     ` Arjan van de Ven
@ 2009-12-02  9:47     ` Ingo Molnar
  1 sibling, 0 replies; 10+ messages in thread
From: Ingo Molnar @ 2009-12-02  9:47 UTC
  To: Ma, Ling; +Cc: Arjan van de Ven, Dave Jones, hpa, tglx, linux-kernel


* Ma, Ling <ling.ma@intel.com> wrote:

> Hi Ingo
> 
> Thanks for your correction, so we use perf stat --repeat 3 to test 
> volano, tbench, and kbuild, Because netperf has multiple items we may 
> send out later.
> 
> volano_Os:

>  18680627716893  cycles                   #   2925.196 M/sec   ( +-   0.339% )
>   7247421283541  instructions             #      0.388 IPC     ( +-   0.124% )
>    226838591574  cache-references         #     35.521 M/sec   ( +-   0.971% )
>      9420427393  cache-misses             #      1.475 M/sec   ( +-   0.897% )

> volano_O2:

>  17145170491943  cycles                   #   2918.985 M/sec   ( +-   0.288% )
>   7324126478801  instructions             #      0.427 IPC     ( +-   0.090% )
>    219064318074  cache-references         #     37.296 M/sec   ( +-   0.792% )
>      9491237013  cache-misses             #      1.616 M/sec   ( +-   0.439% )

> O2 is better than Os for volano
> O2 is not different with Os for tbench
> O2 is not different with Os for kbuild 

Ok, this looks pretty credible, thanks for going through it.

For Volano, the difference is 8.9%, well above the 0.3% noise level, so 
it's significant.

Would it be possible to do a 'perf record' and 'perf report' comparison 
between two volano runs, to see where the nearly 10% overhead comes 
from? It might be one or two functions mis-optimized by GCC perhaps. Or 
it could be across-the-spectrum slowdown.

Note that the instruction count differs by only 1% (O2 actually executes
slightly more), but the overhead by 9%. So we might be hitting some nasty
corner case - or it might be some caching effect. (which does not seem to
be supported by the numbers though - the LLC cache-misses do not look
significantly higher in the Os case)
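
The ratios behind these observations can be recomputed from the quoted
volano counters. This is only a sketch using the numbers above; the variable
names are illustrative.

```python
# volano counters quoted above (3-run averages reported by perf stat)
os_run = {"cycles": 18680627716893, "instructions": 7247421283541,
          "cache_misses": 9420427393}
o2_run = {"cycles": 17145170491943, "instructions": 7324126478801,
          "cache_misses": 9491237013}


def pct_more(a, b):
    """How many percent larger a is than b."""
    return (a / b - 1.0) * 100.0


cycles_extra_os = pct_more(os_run["cycles"], o2_run["cycles"])
instr_extra_o2 = pct_more(o2_run["instructions"], os_run["instructions"])
misses_extra_o2 = pct_more(o2_run["cache_misses"], os_run["cache_misses"])
ipc_os = os_run["instructions"] / os_run["cycles"]
ipc_o2 = o2_run["instructions"] / o2_run["cycles"]

print(f"Os needs {cycles_extra_os:.2f}% more cycles than O2")
print(f"O2 executes {instr_extra_o2:.2f}% more instructions than Os")
print(f"O2 has {misses_extra_o2:.2f}% more cache misses than Os")
print(f"IPC: Os {ipc_os:.3f}, O2 {ipc_o2:.3f}")
```

The cycle gap is far larger than the instruction or cache-miss gap, so the
difference shows up almost entirely as IPC.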

'perf annotate fn_name' will also help you see where the overhead 
hot-spots are. If you build the vmlinux via CONFIG_DEBUG_INFO the perf 
annotate output will interleave assembly and source code output. 
(otherwise it will be assembly output only)

You probably want to use the latest version of 'perf' for all that 
analysis, from:

  http://people.redhat.com/mingo/tip.git/README

Thanks,

	Ingo


* RE: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-12-01 10:14     ` Arjan van de Ven
  2009-12-01 16:11       ` H. Peter Anvin
@ 2009-12-03 15:03       ` Ma, Ling
  2009-12-03 15:05         ` H. Peter Anvin
  1 sibling, 1 reply; 10+ messages in thread
From: Ma, Ling @ 2009-12-03 15:03 UTC
  To: Arjan van de Ven; +Cc: Ingo Molnar, Dave Jones, hpa, tglx, linux-kernel

> a key question is.. how much more memory do you have free due to -Os?
> (because memory is cache is performance on a system level as well)
The kernel code size is 12M with Os and 14M with O2.
> and how much less icache pressure is there?
From the perf stat report, cache references (unified cache) with O2 are almost the same as with Os.

Thanks
Ling 



* Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-12-03 15:03       ` Ma, Ling
@ 2009-12-03 15:05         ` H. Peter Anvin
  2009-12-03 15:31           ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: H. Peter Anvin @ 2009-12-03 15:05 UTC
  To: Ma, Ling; +Cc: Arjan van de Ven, Ingo Molnar, Dave Jones, tglx, linux-kernel

On 12/03/2009 07:03 AM, Ma, Ling wrote:
>> a key question is.. how much more memory do you have free due to -Os?
>> (because memory is cache is performance on a system level as well)
> The kernel code size from Os is 12M, that from O2 is 14M.
>> and how much less icache pressure is there?
> From perf stat report, cache reference(unified cache) from O2 is almost the same with Os.

The icache pressure was substantially higher (by ~10%) in the reports
that I saw.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-12-03 15:05         ` H. Peter Anvin
@ 2009-12-03 15:31           ` Ingo Molnar
  2009-12-03 15:46             ` H. Peter Anvin
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2009-12-03 15:31 UTC
  To: H. Peter Anvin; +Cc: Ma, Ling, Arjan van de Ven, Dave Jones, tglx, linux-kernel


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/03/2009 07:03 AM, Ma, Ling wrote:
> >> a key question is.. how much more memory do you have free due to -Os?
> >> (because memory is cache is performance on a system level as well)
> > The kernel code size from Os is 12M, that from O2 is 14M.
> >> and how much less icache pressure is there?
> > From perf stat report, cache reference(unified cache) from O2 is almost the same with Os.
> 
> The icache pressure was substantially higher (by ~10%) in the reports 
> that I saw.

hm, icache numbers are not included in perf stat runs by default. Are 
there some icache numbers i missed perhaps?

	Ingo


* Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform
  2009-12-03 15:31           ` Ingo Molnar
@ 2009-12-03 15:46             ` H. Peter Anvin
  0 siblings, 0 replies; 10+ messages in thread
From: H. Peter Anvin @ 2009-12-03 15:46 UTC
  To: Ingo Molnar; +Cc: Ma, Ling, Arjan van de Ven, Dave Jones, tglx, linux-kernel

On 12/03/2009 07:31 AM, Ingo Molnar wrote:
> 
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
>> On 12/03/2009 07:03 AM, Ma, Ling wrote:
>>>> a key question is.. how much more memory do you have free due to -Os?
>>>> (because memory is cache is performance on a system level as well)
>>> The kernel code size from Os is 12M, that from O2 is 14M.
>>>> and how much less icache pressure is there?
>>> From perf stat report, cache reference(unified cache) from O2 is almost the same with Os.
>>
>> The icache pressure was substantially higher (by ~10%) in the reports 
>> that I saw.
> 
> hm, icache numbers are not included in perf stat runs by default. Are 
> there some icache numbers i missed perhaps?
> 

Sorry, you're right; cache references and cache misses.  Furthermore,
I'm wrong, I was looking at references *per unit time*, which just show
that roughly the same number was squeezed into a shorter time.
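
The mix-up is easy to reproduce from the volano numbers posted earlier in the
thread. The sketch below assumes the M/sec column is the event count divided
by task-clock time, which is consistent with the posted figures; the helper
names are illustrative.

```python
# volano counters from the perf stat output posted earlier in this thread
volano = {
    "Os": {"task_clock_ms": 6386111.436735,
           "cache-references": 226838591574, "cache-misses": 9420427393},
    "O2": {"task_clock_ms": 5873675.998422,
           "cache-references": 219064318074, "cache-misses": 9491237013},
}


def pct_change(old, new):
    return (new - old) / old * 100.0


def compare(event):
    """Return (change in absolute count, change in per-ms rate), Os -> O2."""
    os_run, o2_run = volano["Os"], volano["O2"]
    absolute = pct_change(os_run[event], o2_run[event])
    rate = pct_change(os_run[event] / os_run["task_clock_ms"],
                      o2_run[event] / o2_run["task_clock_ms"])
    return absolute, rate


refs_abs, refs_rate = compare("cache-references")
miss_abs, miss_rate = compare("cache-misses")

print(f"cache-references: count {refs_abs:+.2f}%, rate {refs_rate:+.2f}%")
print(f"cache-misses:     count {miss_abs:+.2f}%, rate {miss_rate:+.2f}%")
```

The per-time miss rate rises by close to 10% with O2 while the absolute miss
count barely moves, because the same work finishes in less task-clock time.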

Never mind me... :-/

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

