From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Jakub Jelinek <jakub@redhat.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Jason Low <jason.low2@hp.com>, Brian Gerst <brgerst@gmail.com>,
	Aswin Chandramouleeswaran <aswin@hp.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	"H. Peter Anvin" <hpa@zytor.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Richard Henderson <rth@twiddle.net>
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics
Date: Sun, 12 Apr 2015 09:41:24 +0200
Message-ID: <20150412074123.GB9062@gmail.com>
In-Reply-To: <CA+55aFxNbqEGouKAmQ72=skkm8NWBUu7jh92BshFsCwF3r294g@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > I think it's just the no-guess one:
> >
> >    text        data       dec  patch           reduction
> > 7563475     1781048  10302987
> > 7192973     1780024   9931461  no-guess            -4.8%
> > 7354819     1781048  10094331  align-1             -2.7%
> > 7192973     1780024   9931461  no-guess + align-1  -4.8%
> 
> Yeah, a 5% code expansion is a big deal. Sadly, it looks like
> 'no-guess' also disables our explicit likely/unlikely handling.
> 
> Damn. If it actually honored likely/unlikely, then we should just do
> it - and manually fix up any places where we really care.
> 
> But the fact that it apparently entirely disables not just the
> guesses, but our *explicit* likely/unlikely, means that we can't fix
> up the mistakes.
> 
> And in many of the hot codepaths that likely/unlikely really does
> matter. Some of our hottest paths have known "this basically never
> happens" situations that we do *not* want to break up our L1 I$ over.
> There's a number of functions that have been optimized to really
> generate good code, and "-fno-guess-branch-probability" disables those
> manual optimizations.
> 
> So we'd have no way to fix it for the cases that matter.
> 
> Sad.
> 
> It might be worth bringing this up with some gcc people. I added Jakub
> to the cc. Any other gcc people suggestions?

(Skip to the 'More numbers' section below to see more measurements.)

What would be nice to have is a GCC optimization option that disables 
all branch probability heuristics (like -fno-guess-branch-probability 
does), but keeps explicit __builtin_expect() hints (which 
-fno-guess-branch-probability unfortunately disables as well).

Something like a -fno-branch-probability-heuristics GCC option, or a 
"--param branch-probability-heuristics=0" option.

I found one related option:

  --param predictable-branch-outcome=N

           predictable-branch-outcome
               When branch is predicted to be taken with probability 
               lower than this threshold (in percent), then it is 
               considered well predictable. The default is 10.

I tried the values 1 and 0 on an -O2 kernel (i.e. with 
guess-branch-probability enabled), hoping that it would turn off all 
non-__builtin_expect() branch heuristics, but it had only a very minor 
size impact, in the 0.001% range.

More numbers:
-------------

I also measured the effect of our __builtin_expect() 
(likely()/unlikely()) branch annotations.

As an experiment I mapped both likely()/unlikely() to the four 
possible __builtin_expect() probability settings:

  likely(): __builtin_expect(, 1)    unlikely(): __builtin_expect(, 0)
  likely(): __builtin_expect(, 0)    unlikely(): __builtin_expect(, 1)
  likely(): __builtin_expect(, 0)    unlikely(): __builtin_expect(, 0)
  likely(): __builtin_expect(, 1)    unlikely(): __builtin_expect(, 1)

(the first mapping is the only one that makes sense, and it is the one 
that is used by the kernel currently.)
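
A mapping experiment like this can be sketched roughly as follows. The 
vanilla definitions are the ones from include/linux/compiler.h; the 
LIKELY_HINT, UNLIKELY_HINT and NO_EXPECT knobs and checked_div() are 
hypothetical names used only for illustration, not the patch that 
produced the numbers below:

```c
/*
 * Sketch only: LIKELY_HINT / UNLIKELY_HINT / NO_EXPECT are made-up
 * build-time knobs. The defaults (1, 0) give the vanilla '10' mapping;
 * -DLIKELY_HINT=0 -DUNLIKELY_HINT=1 gives the inverse '01' mapping.
 */
#ifndef LIKELY_HINT
# define LIKELY_HINT	1
#endif
#ifndef UNLIKELY_HINT
# define UNLIKELY_HINT	0
#endif

#ifdef NO_EXPECT
/* Hints removed: GCC's own heuristics decide every branch. */
# define likely(x)	(!!(x))
# define unlikely(x)	(!!(x))
#else
# define likely(x)	__builtin_expect(!!(x), LIKELY_HINT)
# define unlikely(x)	__builtin_expect(!!(x), UNLIKELY_HINT)
#endif

/* The hint changes code layout only, never the value of the condition. */
int checked_div(int a, int b, int *res)
{
	if (unlikely(b == 0))
		return -1;	/* "this basically never happens" path */

	*res = a / b;
	return 0;
}
```

Whatever mapping is selected, checked_div() returns the same results; 
only the placement of the error path in the generated code may differ.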

The goal of mixing up the probability mappings was to measure the full 
'range' of code size impact that the kernel's explicit hints are 
causing, versus the impact that GCC's own branch probability 
heuristics are causing:

     text      data     bss      dec  filename
 12566383   1617840 1089536 15273759  vmlinux.expect=10 [==vanilla]
 12460250   1617840 1089536 15167626  vmlinux.expect=01
 12563332   1617840 1089536 15270708  vmlinux.expect=00
 12463035   1617840 1089536 15170411  vmlinux.expect=11
 12533382   1617840 1089536 15240758  vmlinux.no-expect

 11923529   1617840 1089536 14630905  vmlinux.-fno-guess-branch-probability


[ This was done on a v4.0-rc7-ish vanilla -O2 kernel (no alignment 
  tweaks), using 'make defconfig' and 'make kvmconfig' on x86-64 to 
  turn it into a minimally bootable kernel. I used GCC 4.9.1. ]

The 'vmlinux.no-expect' kernel is a vanilla kernel with all 
__builtin_expect() hints removed: i.e. GCC heuristics are deciding all 
branch probabilities in the kernel.

So the code size 'range' that the kernel's own probability hints are 
moving in is around 0.4%:

  - the 'vanilla' mapping is the largest (not unexpectedly)

  - the 'inverse' mapping is the smallest, by 0.8%

  - the 00 and 11 mappings are about mid range, 0.4% away from the 
    extremes

  - the 'no-expect' mapping is also mid-range, which too is somewhat
    expected: GCC heuristics would pick only part of our hints.

But note how the 'no GCC heuristics at all' setting reduces size 
brutally, by 5.4%.

So if we were able to do that, while keeping __builtin_expect(), and 
used our own heuristics only via __builtin_expect(), then we could 
still possibly expect a total code size shrinkage of around 5.0%:

  -5.4% code size reduction that comes from removal of all hints,

  +0.4% code size increase between the 'no-expect' and the '10' 
        mappings in the experiment above.

Note that even if I apply all the align=1 patches, 
-fno-guess-branch-probability on top of that still gives me 2.2% code 
size savings:

     text      data     bss      dec  filename
 12566383   1617840 1089536 15273759  vmlinux.expect=10 [==vanilla]
 11923529   1617840 1089536 14630905  vmlinux.-fno-guess-branch-probability
 11903663   1617840 1089536 14611039  vmlinux.align=1
 11646102   1617840 1089536 14353478  vmlinux.align=1+fno-guess-branch-probability
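
To make the percentages easy to re-check, here is a minimal sketch 
that redoes the arithmetic on the text column of the table above. It 
assumes each delta is taken relative to the smaller image, which is 
what reproduces the 5.4% figure; pct_delta() and print_text_deltas() 
are made-up helper names:

```c
#include <stdio.h>

/* Size delta in percent, relative to the smaller of the two images. */
double pct_delta(long bigger, long smaller)
{
	return 100.0 * (double)(bigger - smaller) / (double)smaller;
}

/* 'text' sizes copied from the table above. */
static const long text_vanilla        = 12566383;
static const long text_no_guess       = 11923529;
static const long text_align1         = 11903663;
static const long text_align1_noguess = 11646102;

void print_text_deltas(void)
{
	/* prints 5.4% */
	printf("no-guess vs. vanilla:         %.1f%%\n",
	       pct_delta(text_vanilla, text_no_guess));
	/* prints 2.2% */
	printf("align=1+no-guess vs. align=1: %.1f%%\n",
	       pct_delta(text_align1, text_align1_noguess));
	/* prints 7.9% */
	printf("align=1+no-guess vs. vanilla: %.1f%%\n",
	       pct_delta(text_vanilla, text_align1_noguess));
}
```

Taking the delta relative to the larger image instead gives somewhat 
smaller percentages (about 5.1% for the first line).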

The smallest vmlinux has:

  - about    41,000 functions ('tT' symbols in System.map)
  - about   300,000 branch/jump instructions (all objdump -d asm mnemonics starting with 'j')
  - about   165,000 function calls (all objdump -d asm mnemonics matching 'call')
  - about 2,330,000 instructions (all objdump -d asm mnemonics)
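
The classification rule behind these counts can be sketched as below. 
This is an illustration only, not the script used for the numbers 
above: classify_mnemonic() is a made-up name, it expects just the 
mnemonic field of an objdump -d line, and treating 'matching call' as 
a prefix match is an assumption:

```c
#include <string.h>

enum insn_class { INSN_OTHER, INSN_JUMP, INSN_CALL };

/* Classify one mnemonic the way the counts above describe it. */
enum insn_class classify_mnemonic(const char *m)
{
	if (m[0] == 'j')
		return INSN_JUMP;	/* jmp, je, jne, ... */

	if (strncmp(m, "call", 4) == 0)
		return INSN_CALL;	/* call, callq */

	return INSN_OTHER;		/* everything else */
}
```

Tallying the three classes over every mnemonic in the disassembly 
would give counts of the kind listed above.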

With align=1, GCC's heuristics added about 1,200 new branches, 1,350 
new function calls and 76,900 instructions, altogether +257,561 bytes 
of code:

 # of x86 instructions
 2549742                     vmlinux.expect=10 [==vanilla]
 2391069                     vmlinux.-fno-guess-branch-probability
 2411568                     vmlinux.align=1
 2334595                     vmlinux.align=1+fno-guess-branch-probability

(For completeness, the patch generating the smallest kernel is 
attached below.)

The takeaway: the code size savings above are 7.9%. Even if we 
restored __builtin_expect() hints, we'd probably still see a combined 
7.5% code shrinkage.

That would be a rather significant I$ win, with very little cost that 
I can see!

Thanks,

	Ingo

---
 arch/x86/Makefile | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..a6d3feb90b97 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,10 +77,25 @@ else
         KBUILD_AFLAGS += -m64
         KBUILD_CFLAGS += -m64
 
+        # Pack jump targets tightly, don't align them to the default 16 bytes:
+        KBUILD_CFLAGS += -falign-jumps=1
+
+        # Pack functions tightly as well:
+        KBUILD_CFLAGS += -falign-functions=1
+
+        # Pack loops tightly as well:
+        KBUILD_CFLAGS += -falign-loops=1
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)
 
+        #
+        # Don't guess branch probabilities, follow the code and unlikely()/likely() hints,
+        # which reduces vmlinux size by about 5.4%:
+        #
+        KBUILD_CFLAGS += -fno-guess-branch-probability
+
 	# Use -mpreferred-stack-boundary=3 if supported.
 	KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3)
 