From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752341AbbDLVPL (ORCPT <rfc822;w@1wt.eu>);
	Sun, 12 Apr 2015 17:15:11 -0400
Received: from plane.gmane.org ([80.91.229.3]:58225 "EHLO plane.gmane.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751776AbbDLVPI (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Sun, 12 Apr 2015 17:15:08 -0400
X-Injected-Via-Gmane: http://gmane.org/
To: linux-kernel@vger.kernel.org
From: Jan Hubicka <hubicka@ucw.cz>
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics
Date: Sun, 12 Apr 2015 21:11:40 +0000 (UTC)
Message-ID: <loom.20150412T224219-68@post.gmane.org>
References: <20150409175652.GI6464@linux.vnet.ibm.com> <CA+55aFzXMDjQQ7jTjsPdh1RikXfgV7OCd-+13cz06MOmDBA33w@mail.gmail.com> <CA+55aFwZWi6ecDmVsMBQJTrgrW3GD2DaRtpiOspe=5amR1=dNg@mail.gmail.com> <20150409183926.GM6464@linux.vnet.ibm.com> <20150410090051.GA28549@gmail.com> <20150410091252.GA27630@gmail.com> <20150410092152.GA21332@gmail.com> <20150410111427.GA30477@gmail.com> <20150410112748.GB30477@gmail.com> <20150410120846.GA17101@gmail.com> <20150411092021.GA9478@gmail.com> <CA+55aFygq02Y9FATg9JOrLkM97a__DNi_LoMd+FmX9n3a-JoBA@mail.gmail.com> <alpine.DEB.2.11.1504112038590.3845@nanos>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Complaints-To: usenet@ger.gmane.org
X-Gmane-NNTP-Posting-Host: sea.gmane.org
User-Agent: Loom/3.14 (http://gmane.org/)
X-Loom-IP: 24.64.45.84 (Mozilla/5.0 (X11; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,
for me as GCC developer, this is definitely an intersting observation. Let
me briefly explain what happen here.

-fguess-branch-probability does a lot more than just BB reordering.  The way
GCC works is that it first guesses probability of every branch and then uses
the probabilities to estimate a profile (that is frequency at which every BB
executed). This profile is then used thorough the optimization queue to make
decisions that either bloat the code (such as loop header copying) or that
pessimize one code path and improve other code path (such spill code placement).

This is based on 
  T Ball, JR Larus: Branch prediction for free 
and 
  Y Wu, JR Larus: Static branch frequency and program profile analysis. 
In GCC it was implemented in 2000
http://www.ucw.cz/~hubicka/papers/proj/index.html and the earlier ad-hoc
heuristics (that, for example, placed spill code depending on loop depth of
the basic blocks) was slowly replaced by profile driven ones. (Most nowdays
compilers works this way.) This however also means that the compiler is
quite dependent on the profile.

-fno-guess-branch-probability will leave all frequencies to be 0 and all
edge probabilities 0%. This will make most of these heuristics to shut
itself off or possibly behave funny. Disabling all the heuristics except for
builtin_expect will likely result in quite funny profiles, too, where
frequencies will drop to 50% after every branch.

I am very interested in getting -O2 code size down (GCC 5 has quite few
improvements in this area thus most motivated by C++ code and LTO). We
should try to identify what optimization causes most of bloat and start from
these.

http://www.ucw.cz/~hubicka/papers/amd64/node4.html has 13 years old data one
performance and code size with this flag.  Back then branch prediction
improved performance by 4.1% on specint at the expense of 5.66% code size
cost. Very large portion of the bloat was the basic block reordering. I will
try to re-run the benchmarks.

In 2003 the largest portion of growth came from basic block reordering
(-freorder-blocks) which does some tail code duplication that helped the AMD
K7/K8 generation chips it was tested on because of decoder limitations. 
Basic block reordering pass was implemented in 2000 based on
http://dl.acm.org/citation.cfm?id=305178 and while it was improved in
meantime (primarily by Google adding special cases and hot/cold
partitioning), it is still the same software trace algorithm. So it may be
time to rewrite it which should not be too hard.

Would it be possible to identify functions that change most and look into these?

Honza