Message-ID: <555C7C57.1070608@redhat.com>
Date: Wed, 20 May 2015 14:21:43 +0200
From: Denys Vlasenko
To: Linus Torvalds, Ingo Molnar
CC: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, "Chandramouleeswaran, Aswin", Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, "linux-tip-commits@vger.kernel.org", Arjan van de Ven, Andrew Morton
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions
References: <20150410121808.GA19918@gmail.com> <20150517055551.GB17002@gmail.com> <20150519213820.GA31688@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/20/2015 02:47 AM, Linus Torvalds wrote:
> On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar wrote:
>>
>> The optimal I$ miss rate is at 64 bytes - which is 9% better than the
>> default kernel's I$ miss rate at 16 bytes alignment.
>
> Ok, these numbers looks reasonable (which is, of course, defined as
> "meets Linus' expectations"), so I like it.
>
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
>
> Is there some way to get gcc to take the size of the function into
> account?
> Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.
>
> From your numbers the 64-byte alignment definitely makes sense in
> general, but I really think it would be much nicer if we could get
> something like "align functions to their power-of-two size rounded up,
> up to a maximum of 64 bytes"

Well, that would be a bit hard to implement for gcc, at least in its
traditional mode where it emits assembly source, not machine code.

However, not all is lost.

I was thinking about Ingo's AMD results:

linux-falign-functions=_64-bytes/res-amd.txt:   1.928409143 seconds time elapsed
linux-falign-functions=__8-bytes/res-amd.txt:   1.940703051 seconds time elapsed
linux-falign-functions=__1-bytes/res-amd.txt:   1.940744001 seconds time elapsed

AMD is almost perfect. Having no alignment at all still works
very well. Almost perfect. Where does the "almost" come from?

I bet it comes from the small fraction of functions which were
unlucky enough to have their first instruction split by a
64-byte boundary.

If we could avoid just this corner case, that would help a lot.

And GNU as has the means to do that!
See https://sourceware.org/binutils/docs/as/P2align.html

.p2align N1,FILL,N3

"The third expression is also absolute, and is also optional.
If it is present, it is the maximum number of bytes that should
be skipped by this alignment directive."

So what we need is to put something like ".p2align 6,,7"
before every function (the first operand is the log2 of the
alignment, so 6 means 64 bytes).

(
Why 7?

defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
6923 of them have 1st insn 5 or more bytes long,
5841 of them have 1st insn 6 or more bytes long,
5095 of them have 1st insn 7 or more bytes long,
786 of them have 1st insn 8 or more bytes long,
548 of them have 1st insn 9 or more bytes long,
375 of them have 1st insn 10 or more bytes long,
73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long:
    this "heroic" instruction is in local_touch_nmi()
      65 48 c7 05 44 3c 00 7f 00 00 00 00
      movq $0x0,%gs:0x7f003c44(%rip)

Thus ensuring that at least the first seven bytes do not cross
a 64-byte boundary would cover >98% of all functions.
)

gcc can't do that right now. With -falign-functions=N,
it emits ".p2align log2(next_power_of_2(N)),,N-1".

We need to make it just a tiny bit smarter.

> We'd need toolchain help to do saner alignment.

Yep.
I'm going to create a gcc BZ with a feature request,
unless you disagree with my musings above.

-- 
vda
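
P.S. To make the max-skip idea above concrete, here is a minimal Python sketch of how GNU as treats the third operand of .p2align, plus the ">98%" arithmetic from the function counts quoted above. It is only a model, not what gas or gcc actually run: the helper names (p2align_padding, straddles, coverage) are made up for illustration, and a 64-byte cache line is assumed. Note that the first operand of .p2align is the log2 of the alignment, so a 64-byte boundary is written as 6.

```python
# Model of GNU as ".p2align P,,MAX": pad up to the next 2**P boundary,
# but only if that takes at most MAX bytes; otherwise emit no padding.
# All names here are hypothetical helpers, not part of any toolchain.

def p2align_padding(offset, p2, max_skip):
    """Return the number of padding bytes emitted at the given offset."""
    align = 1 << p2
    need = -offset % align  # bytes needed to reach the next boundary
    return need if need <= max_skip else 0


def straddles(start, length, line=64):
    """True if bytes [start, start + length) span two cache lines."""
    return start // line != (start + length - 1) // line


# Function counts quoted in the mail (defconfig vmlinux, no FRAME_POINTER).
TOTAL_FUNCS = 42141
FIRST_INSN_8_OR_MORE = 786


def coverage():
    """Fraction of functions whose first insn is at most 7 bytes long."""
    return (TOTAL_FUNCS - FIRST_INSN_8_OR_MORE) / TOTAL_FUNCS


if __name__ == "__main__":
    for off in range(64):
        start = off + p2align_padding(off, 6, 7)
        # With a max skip of 7, every function start has at least
        # 8 bytes left before the next 64-byte boundary.
        assert not straddles(start, 8)
    print(f"first insn <= 7 bytes: {coverage():.2%} of functions")
```

In this model a function that would start 1 to 7 bytes before a 64-byte boundary is pushed onto the boundary, and every other function is left where it is, so the padding cost is at most 7 bytes per function and zero for most of them. The model actually protects the first 8 bytes, which is consistent with the "at least seven" wording above.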