linux-kernel.vger.kernel.org archive mirror
* [PATCH] [x86]: Compiler Option Os is better on latest x86
@ 2013-01-25 14:11 ling.ma.program
  2013-01-26 12:25 ` [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig tip-bot for Ma Ling
  2013-01-28 17:15 ` [PATCH] [x86]: Compiler Option Os is better on latest x86 Valdis.Kletnieks
  0 siblings, 2 replies; 12+ messages in thread
From: ling.ma.program @ 2013-01-25 14:11 UTC (permalink / raw)
  To: mingo; +Cc: tglx, hpa, linux-kernel, Ma Ling


From: Ma Ling <ling.ml@alipay.com>

  Currently we use -O2 as the compiler option for better performance,
although it enlarges code size. On modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetch reduce the
impact of instruction cache misses. Meanwhile, flags such as
-falign-functions, -falign-jumps, -falign-loops, and -falign-labels
help improve CPU front-end throughput, because the CPU fetches
instructions in 16-byte aligned code blocks per cycle.

  In order to save power and get higher performance, Sandy Bridge
introduced a decoded cache: instructions are kept in it after the
decode stage. When the CPU refetches an instruction, the decoded cache
can provide a 32-byte aligned instruction block instead of 16 bytes
from the I-cache, and the shorter pipeline lowers the branch-miss
penalty. This requires that as much hot code as possible fit into the
decoded cache. Sandy Bridge, Ivy Bridge, and Haswell all implement
this feature, so -Os (optimize for size) should be better than -O2 on
them.

Based on the above reasoning, we compiled Linux kernel 3.6.9 with -O2
and -Os respectively. The results show that -Os improves performance
by 4.8% for netperf and 2.7% for volano, as shown below.

O2 + netperf
Performance counter stats for 'netperf' (3 runs):

       5416.157986 task-clock                #    0.541 CPUs utilized            ( +-  0.19% )
           348,249 context-switches          #    0.064 M/sec                    ( +-  0.17% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               353 page-faults               #    0.000 M/sec                    ( +-  0.16% )
    13,166,254,384 cycles                    #    2.431 GHz                      ( +-  0.18% )
     8,827,499,807 stalled-cycles-frontend   #   67.05% frontend cycles idle     ( +-  0.29% )
     5,951,234,060 stalled-cycles-backend    #   45.20% backend  cycles idle     ( +-  0.44% )
     8,122,481,914 instructions              #    0.62  insns per cycle
                                             #    1.09  stalled cycles per insn  ( +-  0.17% )
     1,415,864,138 branches                  #  261.415 M/sec                    ( +-  0.17% )
        16,975,308 branch-misses             #    1.20% of all branches          ( +-  0.61% )

      10.007215371 seconds time elapsed                                          ( +-  0.03% )

Os + netperf

Performance counter stats for 'netperf' (3 runs):

       5395.386704 task-clock                #    0.539 CPUs utilized            ( +-  0.14% )
           345,880 context-switches          #    0.064 M/sec                    ( +-  0.25% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               354 page-faults               #    0.000 M/sec                    ( +-  0.00% )
    13,142,706,297 cycles                    #    2.436 GHz                      ( +-  0.23% )
     8,379,382,641 stalled-cycles-frontend   #   63.76% frontend cycles idle     ( +-  0.50% )
     5,513,722,219 stalled-cycles-backend    #   41.95% backend  cycles idle     ( +-  0.71% )
     8,554,202,795 instructions              #    0.65  insns per cycle
                                             #    0.98  stalled cycles per insn  ( +-  0.25% )
     1,530,020,505 branches                  #  283.579 M/sec                    ( +-  0.25% )
        17,710,406 branch-misses             #    1.16% of all branches          ( +-  1.00% )

      10.004859867 seconds time elapsed               

Over the same elapsed time (10.004859867 seconds), IPC with -Os is 0.65
and with -O2 is 0.62, so -Os improved performance by 4.8%.

O2 + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     210627.115313 task-clock                #    0.781 CPUs utilized            ( +-  0.92% )
        13,812,610 context-switches          #    0.066 M/sec                    ( +-  0.17% )
         2,352,755 CPU-migrations            #    0.011 M/sec                    ( +-  0.84% )
           208,333 page-faults               #    0.001 M/sec                    ( +-  1.58% )
   525,627,073,405 cycles                    #    2.496 GHz                      ( +-  0.96% )
   428,177,571,365 stalled-cycles-frontend   #   81.46% frontend cycles idle     ( +-  1.09% )
   370,885,224,739 stalled-cycles-backend    #   70.56% backend  cycles idle     ( +-  1.18% )
   187,662,577,544 instructions              #    0.36  insns per cycle
                                             #    2.28  stalled cycles per insn  ( +-  0.31% )
    35,684,976,425 branches                  #  169.423 M/sec                    ( +-  0.45% )
     1,062,086,942 branch-misses             #    2.98% of all branches          ( +-  0.08% )

     269.764578435 seconds time elapsed    
         
Os + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     209545.786941 task-clock                #    0.778 CPUs utilized            ( +-  0.66% )
        13,864,142 context-switches          #    0.066 M/sec                    ( +-  0.29% )
         2,326,826 CPU-migrations            #    0.011 M/sec                    ( +-  0.83% )
           205,575 page-faults               #    0.001 M/sec                    ( +-  2.63% )
   523,366,588,452 cycles                    #    2.498 GHz                      ( +-  0.75% )
   419,200,472,430 stalled-cycles-frontend   #   80.10% frontend cycles idle     ( +-  0.86% )
   362,044,374,737 stalled-cycles-backend    #   69.18% backend  cycles idle     ( +-  0.96% )
   193,274,857,837 instructions              #    0.37  insns per cycle
                                             #    2.17  stalled cycles per insn  ( +-  0.51% )
    37,657,832,686 branches                  #  179.712 M/sec                    ( +-  0.42% )
     1,061,005,300 branch-misses             #    2.82% of all branches          ( +-  0.86% )

     269.410275674 seconds time elapsed                                          ( +-  0.06% )

Over the same elapsed time (269.410275674 seconds), IPC with -Os is 0.37
and with -O2 is 0.36, so -Os improved performance by 2.7%.
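
The percentages quoted above come from the rounded "insns per cycle"
column. As a rough cross-check (a sketch, not part of the patch), the
same ratio can be recomputed from the raw instruction and cycle counters
in the perf stat output above. The helper names ipc() and
relative_gain() are illustrative only; the rounded figures come out
close to the quoted 4.8% and 2.7%, while the unrounded counters give a
somewhat larger ratio.

```python
# Counter values copied from the perf stat output above:
# (instructions, cycles) per build and workload.
RUNS = {
    "netperf": {
        "O2": (8_122_481_914, 13_166_254_384),
        "Os": (8_554_202_795, 13_142_706_297),
    },
    "volano": {
        "O2": (187_662_577_544, 525_627_073_405),
        "Os": (193_274_857_837, 523_366_588_452),
    },
}


def ipc(instructions, cycles):
    """Instructions retired per cycle."""
    return instructions / cycles


def relative_gain(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100.0


for workload, builds in RUNS.items():
    ipc_o2 = ipc(*builds["O2"])
    ipc_os = ipc(*builds["Os"])
    # Gain computed from the two-decimal IPC values that perf prints.
    rounded = relative_gain(round(ipc_os, 2), round(ipc_o2, 2))
    # Gain computed from the unrounded counters.
    exact = relative_gain(ipc_os, ipc_o2)
    print(f"{workload}: IPC O2={ipc_o2:.4f} Os={ipc_os:.4f} "
          f"gain rounded={rounded:.1f}% exact={exact:.1f}%")
```

Note that this only measures instructions per cycle; it says nothing
about how much work each instruction does.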

Signed-off-by: Ma Ling <ling.ml@alipay.com>
---
 arch/x86/configs/x86_64_defconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index 76eb290..813c30d 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -16,6 +16,7 @@ CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_BLK_DEV_INITRD=y
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
 # CONFIG_COMPAT_BRK is not set
 CONFIG_PROFILING=y
 CONFIG_KPROBES=y
-- 
1.7.1



* [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-25 14:11 [PATCH] [x86]: Compiler Option Os is better on latest x86 ling.ma.program
@ 2013-01-26 12:25 ` tip-bot for Ma Ling
  2013-01-26 12:52   ` Borislav Petkov
  2013-01-28 17:15 ` [PATCH] [x86]: Compiler Option Os is better on latest x86 Valdis.Kletnieks
  1 sibling, 1 reply; 12+ messages in thread
From: tip-bot for Ma Ling @ 2013-01-26 12:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, arjan, torvalds, jbeulich, ling.ml,
	akpm, rostedt, tglx

Commit-ID:  d94ffd677469ef729e9d6e968191872577a6119e
Gitweb:     http://git.kernel.org/tip/d94ffd677469ef729e9d6e968191872577a6119e
Author:     Ma Ling <ling.ml@alipay.com>
AuthorDate: Fri, 25 Jan 2013 09:11:01 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 26 Jan 2013 13:09:15 +0100

x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE=y in the 64-bit defconfig

Currently we use -O2 as the compiler option for better
performance, although it enlarges code size. On modern CPUs,
larger instruction and unified caches and sophisticated
instruction prefetch reduce the impact of instruction cache
misses. Meanwhile, flags such as -falign-functions,
-falign-jumps, -falign-loops, and -falign-labels help improve
CPU front-end throughput, because the CPU fetches instructions
in 16-byte aligned code blocks per cycle.

In order to save power and get higher performance, Sandy Bridge
introduced a decoded cache: instructions are kept in it after
the decode stage. When the CPU refetches an instruction, the
decoded cache can provide a 32-byte aligned instruction block
instead of 16 bytes from the I-cache, and the shorter pipeline
lowers the branch-miss penalty. This requires that as much hot
code as possible fit into the decoded cache. Sandy Bridge, Ivy
Bridge, and Haswell all implement this feature, so -Os
(optimize for size) should be better than -O2 on them.

Based on the above reasoning, we compiled Linux kernel 3.6.9
with -O2 and -Os respectively. The results show that -Os
improves performance by 4.8% for netperf and 2.7% for volano,
as shown below:

O2 + netperf
Performance counter stats for 'netperf' (3 runs):

       5416.157986 task-clock                #    0.541 CPUs utilized            ( +-  0.19% )
           348,249 context-switches          #    0.064 M/sec                    ( +-  0.17% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               353 page-faults               #    0.000 M/sec                    ( +-  0.16% )
    13,166,254,384 cycles                    #    2.431 GHz                      ( +-  0.18% )
     8,827,499,807 stalled-cycles-frontend   #   67.05% frontend cycles idle     ( +-  0.29% )
     5,951,234,060 stalled-cycles-backend    #   45.20% backend  cycles idle     ( +-  0.44% )
     8,122,481,914 instructions              #    0.62  insns per cycle
                                             #    1.09  stalled cycles per insn  ( +-  0.17% )
     1,415,864,138 branches                  #  261.415 M/sec                    ( +-  0.17% )
        16,975,308 branch-misses             #    1.20% of all branches          ( +-  0.61% )

      10.007215371 seconds time elapsed                                          ( +-  0.03% )

Os + netperf

Performance counter stats for 'netperf' (3 runs):

       5395.386704 task-clock                #    0.539 CPUs utilized            ( +-  0.14% )
           345,880 context-switches          #    0.064 M/sec                    ( +-  0.25% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               354 page-faults               #    0.000 M/sec                    ( +-  0.00% )
    13,142,706,297 cycles                    #    2.436 GHz                      ( +-  0.23% )
     8,379,382,641 stalled-cycles-frontend   #   63.76% frontend cycles idle     ( +-  0.50% )
     5,513,722,219 stalled-cycles-backend    #   41.95% backend  cycles idle     ( +-  0.71% )
     8,554,202,795 instructions              #    0.65  insns per cycle
                                             #    0.98  stalled cycles per insn  ( +-  0.25% )
     1,530,020,505 branches                  #  283.579 M/sec                    ( +-  0.25% )
        17,710,406 branch-misses             #    1.16% of all branches          ( +-  1.00% )

      10.004859867 seconds time elapsed

Over the same elapsed time (10.004859867 seconds), IPC with -Os
is 0.65 and with -O2 is 0.62, so -Os improved performance by 4.8%.

O2 + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     210627.115313 task-clock                #    0.781 CPUs utilized            ( +-  0.92% )
        13,812,610 context-switches          #    0.066 M/sec                    ( +-  0.17% )
         2,352,755 CPU-migrations            #    0.011 M/sec                    ( +-  0.84% )
           208,333 page-faults               #    0.001 M/sec                    ( +-  1.58% )
   525,627,073,405 cycles                    #    2.496 GHz                      ( +-  0.96% )
   428,177,571,365 stalled-cycles-frontend   #   81.46% frontend cycles idle     ( +-  1.09% )
   370,885,224,739 stalled-cycles-backend    #   70.56% backend  cycles idle     ( +-  1.18% )
   187,662,577,544 instructions              #    0.36  insns per cycle
                                             #    2.28  stalled cycles per insn  ( +-  0.31% )
    35,684,976,425 branches                  #  169.423 M/sec                    ( +-  0.45% )
     1,062,086,942 branch-misses             #    2.98% of all branches          ( +-  0.08% )

     269.764578435 seconds time elapsed

Os + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     209545.786941 task-clock                #    0.778 CPUs utilized            ( +-  0.66% )
        13,864,142 context-switches          #    0.066 M/sec                    ( +-  0.29% )
         2,326,826 CPU-migrations            #    0.011 M/sec                    ( +-  0.83% )
           205,575 page-faults               #    0.001 M/sec                    ( +-  2.63% )
   523,366,588,452 cycles                    #    2.498 GHz                      ( +-  0.75% )
   419,200,472,430 stalled-cycles-frontend   #   80.10% frontend cycles idle     ( +-  0.86% )
   362,044,374,737 stalled-cycles-backend    #   69.18% backend  cycles idle     ( +-  0.96% )
   193,274,857,837 instructions              #    0.37  insns per cycle
                                             #    2.17  stalled cycles per insn  ( +-  0.51% )
    37,657,832,686 branches                  #  179.712 M/sec                    ( +-  0.42% )
     1,061,005,300 branch-misses             #    2.82% of all branches          ( +-  0.86% )

     269.410275674 seconds time elapsed                                          ( +-  0.06% )

Over the same elapsed time (269.410275674 seconds), IPC with -Os
is 0.37 and with -O2 is 0.36, so -Os improved performance by 2.7%.

Signed-off-by: Ma Ling <ling.ml@alipay.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1359123061-6139-1-git-send-email-ling.ma@alipay.com
[ So, this is a bit symbolic as most people don't use the defconfig,
  but the measurements are useful nevertheless so let's commit this
  if there are no objections. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/configs/x86_64_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index 671524d..2fcde13 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -18,6 +18,7 @@ CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_BLK_DEV_INITRD=y
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
 # CONFIG_COMPAT_BRK is not set
 CONFIG_PROFILING=y
 CONFIG_KPROBES=y


* Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-26 12:25 ` [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig tip-bot for Ma Ling
@ 2013-01-26 12:52   ` Borislav Petkov
  2013-01-26 15:18     ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: Borislav Petkov @ 2013-01-26 12:52 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, torvalds, arjan, jbeulich, ling.ml,
	rostedt, akpm, tglx
  Cc: linux-tip-commits

On Sat, Jan 26, 2013 at 04:25:57AM -0800, tip-bot for Ma Ling wrote:
> During the same  time (269.410275674 seconds) IPC from Os is
> 0.37, O2 is 0.36, Os improved performance 2.7%.
> 
> Signed-off-by: Ma Ling <ling.ml@alipay.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Link: http://lkml.kernel.org/r/1359123061-6139-1-git-send-email-ling.ma@alipay.com
> [ So, this is a bit symbolic as most people don't use the defconfig,
>   but the measurements are useful nevertheless so let's commit this
>   if there are no objections. ]

What about

commit 3a55fb0d9fe8e2f4594329edd58c5fd6f35a99dd
Author: Kirill Smelkov <kirr@mns.spb.ru>
Date:   Fri Nov 2 15:41:01 2012 +0400

    Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE

?

Reportedly, -Os generates suboptimal code in certain cases and we advise
against it with this patch.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-26 12:52   ` Borislav Petkov
@ 2013-01-26 15:18     ` H. Peter Anvin
  2013-01-26 15:42       ` Borislav Petkov
  2013-01-26 19:43       ` Linus Torvalds
  0 siblings, 2 replies; 12+ messages in thread
From: H. Peter Anvin @ 2013-01-26 15:18 UTC (permalink / raw)
  To: Borislav Petkov, mingo, linux-kernel, torvalds, arjan, jbeulich,
	ling.ml, rostedt, akpm, tglx
  Cc: linux-tip-commits

On the CPUs Ling is testing on, the downsides of -Os probably matter less, in particular since rep movsb works well.

It is questionable as a generic default, though.

The whole -Ok discussion came from that.

Borislav Petkov <bp@alien8.de> wrote:

>On Sat, Jan 26, 2013 at 04:25:57AM -0800, tip-bot for Ma Ling wrote:
>> During the same  time (269.410275674 seconds) IPC from Os is
>> 0.37, O2 is 0.36, Os improved performance 2.7%.
>> 
>> Signed-off-by: Ma Ling <ling.ml@alipay.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Arjan van de Ven <arjan@linux.intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jan Beulich <jbeulich@suse.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Link:
>http://lkml.kernel.org/r/1359123061-6139-1-git-send-email-ling.ma@alipay.com
>> [ So, this is a bit symbolic as most people don't use the defconfig,
>>   but the measurements are useful nevertheless so let's commit this
>>   if there are no objections. ]
>
>What about
>
>commit 3a55fb0d9fe8e2f4594329edd58c5fd6f35a99dd
>Author: Kirill Smelkov <kirr@mns.spb.ru>
>Date:   Fri Nov 2 15:41:01 2012 +0400
>
>    Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE
>
>?
>
>Reportedly, -Os generates suboptimal code in certain cases and we
>advise
>against it with this patch.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


* Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-26 15:18     ` H. Peter Anvin
@ 2013-01-26 15:42       ` Borislav Petkov
  2013-01-26 19:43       ` Linus Torvalds
  1 sibling, 0 replies; 12+ messages in thread
From: Borislav Petkov @ 2013-01-26 15:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mingo, linux-kernel, torvalds, arjan, jbeulich, ling.ml, rostedt,
	akpm, tglx, linux-tip-commits

On Sat, Jan 26, 2013 at 07:18:26AM -0800, H. Peter Anvin wrote:
> On the CPUs Ling is testing on the downsides of -Os probably matter
>less, in particular since rep movsb works well.
>
> It is questionable as a generic default, though.
>
> The whole -Ok discussion came from that.

Hmm,

maybe this warrants some sort of a text addition to the config option
about when and where it is ok to select -Os. Also, is it so that
on those CPUs, -Os is always better or at least neutral so that it
doesn't cause any perf regressions when people enable it? I mean, I
see only netperf and volano runs in the commit message, maybe running
a comprehensive set of benchmarks would give us a much more detailed
picture...

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-26 15:18     ` H. Peter Anvin
  2013-01-26 15:42       ` Borislav Petkov
@ 2013-01-26 19:43       ` Linus Torvalds
  2013-01-26 21:02         ` Steven Rostedt
  2013-01-26 21:08         ` H. Peter Anvin
  1 sibling, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2013-01-26 19:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Jan Beulich, ling.ml, Steven Rostedt,
	Andrew Morton, Thomas Gleixner, linux-tip-commits

On Sat, Jan 26, 2013 at 7:18 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On the CPUs Ling is testing on the downsides of -Os probably matter less, in particular since rep movsb works well.
>
> It is questionable as a generic default, though.

So being the person who really pushed for -Os to begin with (I think
I$ and instruction decode bandwidth is one of the most fundamental
limits to CPU performance), I wouldn't mind it if we reintroduced it.

HOWEVER.

It wasn't just "rep movs". The thing that killed -Os for me was that
it makes it impossible to try to optimize hot code, because -Os seems
to throw out branch prediction information. So when you use "likely()"
etc to try to teach the compiler to lay out code a certain way so that
code that never really gets executed isn't even brought into the I$,
-Os then screws it up completely.

Of course, maybe newer versions of gcc might not suck so horribly with
-Os, I haven't actually tried in a while.

[ Just tested. Still does it ]

Also, I doubt Ling was testing a SB CPU. Because "rep movsb" still
sucks pretty bad on SB. What core *is* Ling testing? Haswell?

Ugh. We could make it depend on the optimization target. I'd also wish
there was some way to just tune gcc -Os to be closer to reasonable. Or
make -O2 not do some of the excessive crap it does (it aligns code
*much* too much, for example - who cares if you can do it with a
single instruction, if that instruction is so long that it uses up
half your decode bandwidth?)

The problem, of course, is that most -O2 code generation is done
assuming hot loops that don't show much if any I$ issues. And the -Os
thing is done *purely* for size, not taking any performance into
account at all. There's no balanced middle ground, which is what _we_
would want.

                  Linus


* Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-26 19:43       ` Linus Torvalds
@ 2013-01-26 21:02         ` Steven Rostedt
  2013-01-26 21:04           ` H. Peter Anvin
  2013-01-27 12:49           ` Ingo Molnar
  2013-01-26 21:08         ` H. Peter Anvin
  1 sibling, 2 replies; 12+ messages in thread
From: Steven Rostedt @ 2013-01-26 21:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Borislav Petkov, Ingo Molnar,
	Linux Kernel Mailing List, Arjan van de Ven, Jan Beulich,
	ling.ml, Andrew Morton, Thomas Gleixner, linux-tip-commits

On Sat, 2013-01-26 at 11:43 -0800, Linus Torvalds wrote:

> The problem, of course, is that most -O2 code generation is done
> assuming hot loops that don't show much if any I$ issues. And the -Os
> thing is done *purely* for size, not taking any performance into
> account at all. There's no balanced middle ground, which is what _we_
> would want.

Gcc needs to implement a -Olinus

;-)

-- Steve




* Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-26 21:02         ` Steven Rostedt
@ 2013-01-26 21:04           ` H. Peter Anvin
  2013-01-27 12:49           ` Ingo Molnar
  1 sibling, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2013-01-26 21:04 UTC (permalink / raw)
  To: Steven Rostedt, Linus Torvalds
  Cc: Borislav Petkov, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Jan Beulich, ling.ml, Andrew Morton,
	Thomas Gleixner, linux-tip-commits

We have discussed -Ok(ernel) with the gcc guys in earnest.  They are receptive but lack the round tuits.

Steven Rostedt <rostedt@goodmis.org> wrote:

>On Sat, 2013-01-26 at 11:43 -0800, Linus Torvalds wrote:
>
>> The problem, of course, is that most -O2 code generation is done
>> assuming hot loops that don't show much if any I$ issues. And the -Os
>> thing is done *purely* for size, not taking any performance into
>> account at all. There's no balanced middle ground, which is what _we_
>> would want.
>
>Gcc needs to implement a -Olinus
>
>;-)
>
>-- Steve

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


* Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-26 19:43       ` Linus Torvalds
  2013-01-26 21:02         ` Steven Rostedt
@ 2013-01-26 21:08         ` H. Peter Anvin
  1 sibling, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2013-01-26 21:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Jan Beulich, ling.ml, Steven Rostedt,
	Andrew Morton, Thomas Gleixner, linux-tip-commits

The fast rep movsb was introduced on Ivy Bridge, IIRC.

Linus Torvalds <torvalds@linux-foundation.org> wrote:

>On Sat, Jan 26, 2013 at 7:18 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On the CPUs Ling is testing on the downsides of -Os probably matter
>less, in particular since rep movsb works well.
>>
>> It is questionable as a generic default, though.
>
>So being the person who really pushed for -Os to begin with (I think
>I$ and instruction decode bandwidth is one of the most fundamental
>limits to CPU performance), I wouldn't mind it if we reintroduced it.
>
>HOWEVER.
>
>It wasn't just "rep movs". The thing that killed -Os for me was that
>it makes it impossible to try to optimize hot code, because -Os seems
>to throw out branch prediction information. So when you use "likely()"
>etc to try to teach the compiler to lay out code a certain way so that
>code that never really gets executed isn't even brought into the I$,
>-Os then screws it up completely.
>
>Of course, maybe newer versions of gcc might not suck so horribly with
>-Os, I haven't actually tried in a while.
>
>[ Just tested. Still does it ]
>
>Also, I doubt Ling was testing a SB CPU. Because "rep movb" still
>sucks pretty bad on SB. What core *is* Ling testing? Haswell?
>
>Ugh. We could make it depend on the optimization target. I'd also wish
>there was some way to just tune gcc -Os to be closer to reasonable. Or
>make -O2 not do some of the excessive crap it does (it aligns code
>*much* too much, for example - who cares if you can do it with a
>single instruction, if that instruction is so long that it uses up
>half your decode bandwidth?)
>
>The problem, of course, is that most -O2 code generation is done
>assuming hot loops that don't show much if any I$ issues. And the -Os
>thing is done *purely* for size, not taking any performance into
>account at all. There's no balanced middle ground, which is what _we_
>would want.
>
>                  Linus

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


* Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig
  2013-01-26 21:02         ` Steven Rostedt
  2013-01-26 21:04           ` H. Peter Anvin
@ 2013-01-27 12:49           ` Ingo Molnar
  1 sibling, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2013-01-27 12:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, H. Peter Anvin, Borislav Petkov,
	Linux Kernel Mailing List, Arjan van de Ven, Jan Beulich,
	ling.ml, Andrew Morton, Thomas Gleixner, linux-tip-commits


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sat, 2013-01-26 at 11:43 -0800, Linus Torvalds wrote:
> 
> > The problem, of course, is that most -O2 code generation is done
> > assuming hot loops that don't show much if any I$ issues. And the -Os
> > thing is done *purely* for size, not taking any performance into
> > account at all. There's no balanced middle ground, which is what _we_
> > would want.
> 
> Gcc needs to implement a -Olinus

What we really want is a sane default for 'library code' 
optimization:

 - cache-cold optimizations for run-through-once non-looping 
   code (-Os)

 - good loop optimizations for anything that arguably loops (-O2)

 - plus common-sense fixes to -Os like not throwing away 
   explicit branch hints we go to great pains to insert.

Possibly some time this decade.

Thanks,

	Ingo


* Re: [PATCH] [x86]: Compiler Option Os is better on latest x86
  2013-01-25 14:11 [PATCH] [x86]: Compiler Option Os is better on latest x86 ling.ma.program
  2013-01-26 12:25 ` [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE= y in the 64-bit defconfig tip-bot for Ma Ling
@ 2013-01-28 17:15 ` Valdis.Kletnieks
  2013-01-29  8:12   ` Ingo Molnar
  1 sibling, 1 reply; 12+ messages in thread
From: Valdis.Kletnieks @ 2013-01-28 17:15 UTC (permalink / raw)
  To: ling.ma.program; +Cc: mingo, tglx, hpa, linux-kernel, Ma Ling


On Fri, 25 Jan 2013 09:11:01 -0500, ling.ma.program@gmail.com said:

> Based on above reasons, we compiled linux kernel 3.6.9 with O2 and Os
> respectively. The results show Os improve performance netperf 4.8%,
> 2.7% for volano as below

Am I allowed to NAK this?  What the numbers given so far *actually*
show is 4.8% more instructions executed, *not* 4.8% better performance.

I'm having a *very* hard time convincing myself that what we're seeing isn't
simply the expected behavior of loops *not* being unrolled and similar
non-optimizations done by -Os, so more instructions get executed to do the same
amount of work.

Rather than "run for 10 seconds and count instructions", can we
"run for 50,000 syscalls and count clock time" or similar that shows
an *actual* improvement?
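
A fixed-work measurement of the kind suggested here could look roughly
like the sketch below: issue a set number of syscalls and report the
wall-clock time, so the result does not depend on how many instructions
the kernel needs per call. This is illustrative only. The choice of
os.getppid() as the syscall and the function name time_fixed_syscalls()
are assumptions, and interpreter overhead will dominate the timing in
Python, so a real test would use a compiled loop or an existing
benchmark.

```python
import os
import time


def time_fixed_syscalls(n_calls=50_000):
    """Return wall-clock seconds needed to issue n_calls cheap syscalls."""
    start = time.perf_counter()
    for _ in range(n_calls):
        os.getppid()  # assumed stand-in for "a syscall"; any cheap call would do
    return time.perf_counter() - start


elapsed = time_fixed_syscalls()
print(f"50,000 syscalls took {elapsed:.6f} s "
      f"({elapsed / 50_000 * 1e9:.0f} ns per call)")
```

Running the same script on the -O2 and -Os kernels and comparing the
elapsed times would compare time for equal work rather than instructions
for equal time.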




* Re: [PATCH] [x86]: Compiler Option Os is better on latest x86
  2013-01-28 17:15 ` [PATCH] [x86]: Compiler Option Os is better on latest x86 Valdis.Kletnieks
@ 2013-01-29  8:12   ` Ingo Molnar
  0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2013-01-29  8:12 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: ling.ma.program, mingo, tglx, hpa, linux-kernel, Ma Ling


* Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:

> On Fri, 25 Jan 2013 09:11:01 -0500, ling.ma.program@gmail.com said:
> 
> > Based on above reasons, we compiled linux kernel 3.6.9 with O2 and Os
> > respectively. The results show Os improve performance netperf 4.8%,
> > 2.7% for volano as below
> 
> Am I allowed to NAK this?  What the numbers given so far 
> *actually* show is 4.8% more instructions executed, *not* 4.8% 
> better performance.

Cycles and elapsed time are down in both tests - the speedup 
seems statistically a wash in the first test and significant for 
the second workload.

The instruction count might be an artifact of byte-wise versus 
word-wise REP; MOV.

> I'm having a *very* hard time convincing myself that what 
> we're seeing isn't simply the expected behavior of loops *not* 
> being unrolled and similar non-optimizations done by -Os, so 
> more instructions get executed to do the same amount of work.
> 
> Rather than "run for 10 seconds and count instructions", can 
> we "run for 50,000 syscalls and count clock time" or similar 
> that shows an *actual* improvement?

Look at the numbers, it counts a whole lot of other things as 
well beyond instructions - elapsed time being the most important 
one.

But more numbers never hurt.
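
One crude way to read those numbers (a sketch, not a proper statistical
test) is to compare the -O2 to -Os change in cycles and elapsed time
against the +- percentages that perf stat printed. The values below are
copied from the original posting; the helper names pct_change() and
exceeds_noise() are illustrative, and treating the larger of the two
reported +- figures as the noise threshold is an assumption made only
for this example.

```python
# Each entry: (O2 value, O2 +- %, Os value, Os +- %).
# None marks a line where perf stat printed no +- figure.
SAMPLES = {
    ("netperf", "cycles"): (13_166_254_384, 0.18, 13_142_706_297, 0.23),
    ("netperf", "elapsed"): (10.007215371, 0.03, 10.004859867, None),
    ("volano", "cycles"): (525_627_073_405, 0.96, 523_366_588_452, 0.75),
    ("volano", "elapsed"): (269.764578435, None, 269.410275674, 0.06),
}


def pct_change(old, new):
    """Percentage change from old to new (negative means new is lower)."""
    return (new - old) / old * 100.0


def exceeds_noise(old, old_noise, new, new_noise):
    """True if the change is larger than the biggest reported +- figure."""
    known = [n for n in (old_noise, new_noise) if n is not None]
    return abs(pct_change(old, new)) > max(known)


for (workload, metric), sample in SAMPLES.items():
    old, _, new, _ = sample
    label = "exceeds" if exceeds_noise(*sample) else "within"
    print(f"{workload:8s} {metric:8s} {pct_change(old, new):+.3f}% "
          f"({label} reported noise)")
```

With only three runs per configuration this is an indication at best,
which is consistent with the request for more measurements.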

Thanks,

	Ingo


