From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932090AbbETNJ5 (ORCPT <rfc822;w@1wt.eu>);
	Wed, 20 May 2015 09:09:57 -0400
Received: from mail-wg0-f53.google.com ([74.125.82.53]:36505 "EHLO
	mail-wg0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753660AbbETNJt (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 20 May 2015 09:09:49 -0400
Date: Wed, 20 May 2015 15:09:44 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>, Davidlohr Bueso <dave@stgolabs.net>,
        Peter Anvin <hpa@zytor.com>, Denys Vlasenko <dvlasenk@redhat.com>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Tim Chen <tim.c.chen@linux.intel.com>, Borislav Petkov <bp@alien8.de>,
        Peter Zijlstra <peterz@infradead.org>,
        "Chandramouleeswaran, Aswin" <aswin@hp.com>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>,
        Brian Gerst <brgerst@gmail.com>,
        Paul McKenney <paulmck@linux.vnet.ibm.com>,
        Thomas Gleixner <tglx@linutronix.de>, Jason Low <jason.low2@hp.com>,
        "linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>,
        Arjan van de Ven <arjan@infradead.org>,
        Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache
 footprint of kernel functions
Message-ID: <20150520130944.GA30424@gmail.com>
References: <20150410121808.GA19918@gmail.com>
 <tip-4874fe1eeb40b403a8c9d0ddeb4d166cab3f37ba@git.kernel.org>
 <CA+55aFywCXk083w78cQGbRKh-ERLtE8v9PZhqxaHcyHJxSNsFQ@mail.gmail.com>
 <20150517055551.GB17002@gmail.com>
 <20150519213820.GA31688@gmail.com>
 <CA+55aFwog9oCat-FQqW52SvVGyZohNZPeuWk92a2cvhjc6H38Q@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CA+55aFwog9oCat-FQqW52SvVGyZohNZPeuWk92a2cvhjc6H38Q@mail.gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> > The optimal I$ miss rate is at 64 bytes - which is 9% better than 
> > the default kernel's I$ miss rate at 16 bytes alignment.
> 
> Ok, these numbers looks reasonable (which is, of course, defined as 
> "meets Linus' expectations"), so I like it.
> 
> At the same time, I have to admit that I abhor a 64-byte function 
> alignment, when we have a fair number of functions that are (much) 
> smaller than that.
> 
> Is there some way to get gcc to take the size of the function into 
> account? Because aligning a 16-byte or 32-byte function on a 64-byte 
> alignment is just criminally nasty and wasteful.
> 
> From your numbers the 64-byte alignment definitely makes sense in 
> general, but I really think it would be much nicer if we could get 
> something like "align functions to their power-of-two size rounded 
> up, up to a maximum of 64 bytes"

I think the ideal strategy would be to minimize the number of cache 
line boundaries that cut across a function body, but otherwise pack as 
tightly as possible.

I.e. a good first approximation would be to pack functions tightly 
within a single cache line as long as the next function still fits - 
and go to the next cacheline if it doesn't.

This makes sure we use the cachelines to their max, while also making 
sure that functions are fragmented across more cachelines than 
necessary.

> Maybe I did something wrong, but doing this:
> 
>     export last=0
>     nm vmlinux | grep ' [tT] ' | sort | while read i t name
>     do
>         size=$((0x$i-$last)); last=0x$i; lastname=$name
>         [ $size -ge 16 ] && echo $size $lastname
>     done | sort -n | less -S
> 
> seems to say that we have a *lot* of small functions (don't do this 
> with a debug build that has a lot of odd things, do it with 
> something you'd actually boot and run).

Yeah, we do, and I ran your script and it looks similar to what I did 
a few days ago, so I think your observations are correct.

> The above assumes the default 16-byte alignment, and gets rid of the 
> the zero-sized ones (due to mainly system call aliases), and the 
> ones less than 16 bytes (obviously not aligned as-is). But you still 
> end up with a *lot* of functions.a lot of the really small ones are 
> silly setup functions etc, but there's actually a fair number of 
> 16-byte functions.

So if you build with -falign-functions=1 to get the true size of the 
functions then the numbers are even more convincing: about 8% of all 
functions in vmlinux on a typica distro config are 16 bytes or 
smaller, 20% are 32 bytes or smaller, 36% are 64 bytes or smaller.

> I seem to get ~30k functions in my defconfig vmlinux file, and about 
> half seem to be lless than 96 bytes (that's _with_ the 16-byte 
> alignment). In fact, there seems to be ~5500 functions that are 32 
> bytes or less, of which 1850 functions are 16 bytes or less.

Yes.

So given the prevalence of small functions I still find my result 
highly non-intuitive: packing them tightly _should_ have helped I$ 
footprint.

But I'm certainly not going to argue against numbers!

> Aligning a 16-byte function to 64 bytes really does sound wrong, and 
> there's a fair number of them.  Of course, it depends on what's 
> around it just how much memory it wastes, but it *definitely* 
> doesn't help I$ to round small functions up to the next cacheline 
> too.
> 
> I dunno. I might have screwed up the above shellscript badly and my 
> numbers may be pure garbage. But apart from the tail end that has 
> insane big sizes (due to section changes or intermixed data or 
> something, I suspect) it doesn't look obviously wrong. So I think it 
> might be a reasonable approximation.
> 
> We'd need toolchain help to do saner alignment.

So in theory we could use -ffunction-sections and then create a linker 
script on the fly with arbitrary alignment logic to our liking, but 
I'd guess it would be a bit slow and possibly also somewhat fragile, 
as linker scripts aren't the most robust pieces of GNU tooling.

Another advantage would be that we could reorder functions (within the 
same .o) to achieve better packing.

I'll try play with it a bit to see how feasible it is, and to see 
whether more performance is possible with better I$ packing.

Thanks,

	Ingo