Date: Thu, 21 May 2015 16:03:14 +0200
From: Ingo Molnar
To: Denys Vlasenko
Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Peter Anvin,
    Linux Kernel Mailing List, Tim Chen, Borislav Petkov,
    Peter Zijlstra, "Chandramouleeswaran, Aswin", Brian Gerst,
    Paul McKenney, Thomas Gleixner, Jason Low,
    "linux-tip-commits@vger.kernel.org", Arjan van de Ven,
    Andrew Morton
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

* Denys Vlasenko wrote:

> > The AMD system, with a starkly different x86 microarchitecture, is
> > showing similar characteristics:
> >
> >   linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> >   linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> >   linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)
> >   linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%)
> >   linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> >   linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> >   linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%)
> >   linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%)
> >   linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> >   linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> >   linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%)
> >
> > 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes
> > is much steeper than on Intel systems: cache associativity is likely
> > lower on this AMD CPU.
> >
> > Interestingly the 1-byte alignment result is still pretty good on AMD
> > systems - and I used the exact same kernel image on both systems, so
> > the layout of the functions is exactly the same.
> >
> > Elapsed time is noisier, but shows a similar trend:
> >
> >   linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
> >   linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% )
> >   linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
> >   linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )
> >   linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% )
> >   linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% )
> >   linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% )
> >   linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% )
> >   linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% )
> >   linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% )
> >   linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% )
> >
> > 64 bytes alignment is the sweet spot here as well: it's 3.7%
> > faster than the default 16 bytes alignment.
>
> On AMD, 64 bytes wins too, yes, but by a *very* small margin. 8-byte
> and 1-byte alignment have basically the same timings, and both take,
> what, +0.63% longer to run?

See my previous mail for why this is a misleading conclusion: the
timings have so much stddev that they are not really comparable. But
if you look at the cache misses you'll see the true difference:

  linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
  linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
  linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)

1-byte alignment isn't optimal on AMD either.

> > +	#
> > +	# Allocate a separate cacheline for every function,
> > +	# for optimal instruction cache packing:
> > +	#
> > +	KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
>
> How about -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1 instead?
>
> This avoids the pathological case where a function starting just a
> few bytes after a 64-byte boundary gets aligned to the next one,
> wasting ~60 bytes.

Well, that should be pretty similar to the 32-byte aligned numbers I
already posted, right? The 32-byte numbers are better than the default
16-byte alignment, but 64-byte alignment was even better.
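To make the '/2 + 1' arithmetic concrete: assuming gcc implements
-falign-functions=33 roughly as ".p2align 6,,32" - i.e. align to the
next 64-byte boundary only if that skips at most 32 padding bytes -
the placement rule would look like the toy userspace model below.
(This is illustrative only; place() and the offsets are made up, not
what gcc/gas literally do.)

  /* toy-falign.c: illustrative only */
  #include <stdio.h>

  /*
   * Place a function that would start at 'off', given a 64-byte
   * alignment request that may skip at most 'max_skip' padding
   * bytes (the assumed ".p2align 6,,max_skip" semantics):
   */
  static unsigned int place(unsigned int off, unsigned int max_skip)
  {
          unsigned int aligned = (off + 63) & ~63U;

          return (aligned - off) <= max_skip ? aligned : off;
  }

  int main(void)
  {
          unsigned int off = 66;  /* previous fn ended 2 bytes into a line */

          /* =64: always align, skipping up to 63 bytes: */
          printf("falign=64: starts at %u, %u bytes wasted\n",
                 place(off, 63), place(off, 63) - off);

          /* =33: align only if at most 32 bytes are skipped: */
          printf("falign=33: starts at %u, %u bytes wasted\n",
                 place(off, 32), place(off, 32) - off);

          return 0;
  }

With =64 the function is pushed to the next boundary, wasting 62
bytes; with =33 the padding limit is exceeded, so the function stays
put and nothing is wasted - which is exactly the pathological case
Denys describes being avoided.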
A quick build shows:

      text     data      bss       dec      hex  filename

  13534242  2579560  1634304  17748106  10ed08a  linux-falign-functions=__1-bytes/vmlinux
  13554530  2579560  1634304  17768394  10f1fca  linux-falign-functions=__2-bytes/vmlinux
  13590946  2579560  1634304  17804810  10fae0a  linux-falign-functions=__4-bytes/vmlinux
  13658786  2579560  1634304  17872650  110b70a  linux-falign-functions=__8-bytes/vmlinux
  13799602  2579560  1634304  18013466  112dd1a  linux-falign-functions=_16-bytes/vmlinux
  13943906  2579560  1634304  18157770  11510ca  linux-falign-functions=_33-bytes/vmlinux
  14075330  2579560  1634304  18289194  117122a  linux-falign-functions=_32-bytes/vmlinux
  14664898  2579560  1634304  18878762  120112a  linux-falign-functions=_64-bytes/vmlinux
  15980994  2579560  1634304  20194858  134262a  linux-falign-functions=128-bytes/vmlinux
  19038018  2591848  1634304  23264170  162fbaa  linux-falign-functions=256-bytes/vmlinux
  26391106  2591848  1634304  30617258  1d32eaa  linux-falign-functions=512-bytes/vmlinux

Interestingly the -falign-functions=33 kernel is marginally smaller
than the -falign-functions=32 kernel: its text size is about 0.9%
smaller. But the numbers aren't exceptional:

     671,702,272 L1-icache-load-misses  ( +- 0.06% ) (100.00%)
  11,892,913,320 instructions           ( +- 0.01% )
         300,030 context-switches       ( +- 0.00% )

     7.349312237 seconds time elapsed   ( +- 1.20% )

Compared to the others it ranks third:

  linux-falign-functions=_64-bytes/res.txt:       647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
  linux-falign-functions=128-bytes/res.txt:       669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%)
  linux-falign-functions=_33-bytes/res.txt: [NEW] 671,702,272 L1-icache-load-misses ( +- 0.06% ) (100.00%)
  linux-falign-functions=_32-bytes/res.txt:       685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%)
  linux-falign-functions=256-bytes/res.txt:       699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
  linux-falign-functions=512-bytes/res.txt:       699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
  linux-falign-functions=_16-bytes/res.txt:       706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%)
  linux-falign-functions=__1-bytes/res.txt:       724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)
  linux-falign-functions=__4-bytes/res.txt:       725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%)
  linux-falign-functions=__8-bytes/res.txt:       726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%)
  linux-falign-functions=__2-bytes/res.txt:       738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%)
  linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:       921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)

So I still think the best strategy would be a variant of what I
mentioned before (a rough code sketch follows the list):

 1) 'small functions': functions whose size is 1-63 bytes.

 2) 'large functions': functions whose size is 64 bytes or more.

 3) align all large functions to 64-byte cachelines.

 4) pack all small functions into the remaining large-function tail
    holes, without letting them spill over into the next cacheline.

 5) pack the remaining small functions with each other, tightly, as
    long as they don't spill over to the next cacheline.

 6) once optimization steps 1-5 are performed, for all cachelines that
    contain small functions and still have a hole near the end of the
    cacheline, allow functions to be aligned to 32/16/8/4/2 bytes, as
    long as they still fit into the cacheline.
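Here is a rough userspace sketch of packing rules 3)-5). It is not the
real tooling - struct fn, pack() and the example sizes are all made up
for illustration - and unlike the full optimization it keeps source
order instead of searching for a better function ordering:

  /* pack-sketch.c: illustrative only */
  #include <stdio.h>

  #define CACHELINE 64U

  struct fn {
          const char   *name;
          unsigned int  size;    /* function size in bytes */
          unsigned int  offset;  /* start offset chosen by pack() */
  };

  static unsigned int align_up(unsigned int off, unsigned int a)
  {
          return (off + a - 1) & ~(a - 1);
  }

  /* Bytes left in the cacheline that 'off' points into: */
  static unsigned int tail_room(unsigned int off)
  {
          return CACHELINE - (off % CACHELINE);
  }

  static void pack(struct fn *fns, int n)
  {
          unsigned int off = 0;

          for (int i = 0; i < n; i++) {
                  if (fns[i].size >= CACHELINE)
                          /* 3) large functions start on a cacheline: */
                          off = align_up(off, CACHELINE);
                  else if (fns[i].size > tail_room(off))
                          /* 4)-5) small functions may fill a tail hole,
                           * but must not straddle a cacheline boundary: */
                          off = align_up(off, CACHELINE);

                  fns[i].offset = off;
                  off += fns[i].size;
          }
  }

  int main(void)
  {
          struct fn fns[] = {
                  { "small-fn1",  10, 0 },
                  { "small-fn2",  12, 0 },
                  { "large-fn3", 100, 0 },
                  { "small-fn4",  40, 0 },
          };

          pack(fns, 4);

          for (int i = 0; i < 4; i++)
                  printf("%-9s: bytes %3u-%3u\n", fns[i].name,
                         fns[i].offset, fns[i].offset + fns[i].size - 1);
          return 0;
  }

With the example sizes above this produces the kind of step-5 layout
shown next: small-fn1 and small-fn2 share cacheline 1, large-fn3
starts on a cacheline of its own, and small-fn4 no longer straddles a
boundary.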
So for example, let's consider this layout with tight packing:

  [small-fn1][small-fn2][large-fn3...........................][small-fn4 ..... ][hole............]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Both 'fn3' and 'fn4' straddle two cachelines. After step 5 we'd get
such a packing:

  [small-fn1][small-fn2][hole....][large-fn3...........................][small-fn4 ..... ][hole..]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

fn1, fn2 and fn4 now each fit fully within a cacheline, while
large-fn3 necessarily has to spill into the next cacheline due to its
size. In particular 'fn4' is now on a single cacheline.

After step 6 we'd get some extra intra-cacheline alignment by
utilizing the holes:

  [small-fn1]..[small-fn2][hole..][large-fn3...........................]...[small-fn4 ..... ][hle]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Note how small-fn2 got moved forward by 2 bytes, to put it on an
8-byte boundary, and small-fn4 got moved forward by 3 bytes - at the
expense of the holes at the end of the cachelines.

This function alignment optimization basically minimizes the number of
cacheline boundaries intersecting individual functions, so it
minimizes the I$ footprint aggressively.

This is not a trivial computation, btw.: even if we do this only per
.o object file, we'd have to consider nr_fns! permutations of all
possible function orderings. We'd want to pick the ordering that
matches the above 1-6 constraints, _and_ which is as close to 'source
code ordering' as possible.

Such an optimization cannot be described via simple local alignment
rules like any variant of -falign-functions, as it requires the
reordering of functions.

Thanks,

	Ingo