Date: Wed, 20 May 2015 13:29:22 +0200
From: Denys Vlasenko
To: Ingo Molnar, Linus Torvalds
Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin,
 Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra,
 "Chandramouleeswaran, Aswin", Brian Gerst, Paul McKenney,
 Thomas Gleixner, Jason Low, linux-tip-commits@vger.kernel.org,
 Arjan van de Ven, Andrew Morton
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache
 footprint of kernel functions

On 05/19/2015 11:38 PM, Ingo Molnar wrote:
> Here's the result from the Intel system:
>
> linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
> linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%)
> linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%)
> linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
> linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
> linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%)
> linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)
> linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%)
> linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%)
> linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)
>
> The optimal I$ miss rate is at 64 bytes - which is 9% better than the
> default kernel's I$ miss rate at 16 bytes alignment.
>
> The 128/256/512 bytes numbers show an increasing amount of cache
> misses: probably due to the artificially reduced associativity of the
> caching.
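The packing effect is easy to see with a toy model (mine, not derived
from your measurements): count how many 64-byte I$ lines a function
spans depending on where it starts.

/* Toy model: effective I$ footprint of a function of `size` bytes
 * starting at byte offset `start`, with 64-byte cache lines. */
#include <stdio.h>

#define CACHE_LINE 64

static unsigned lines_touched(unsigned start, unsigned size)
{
	return (start + size - 1) / CACHE_LINE - start / CACHE_LINE + 1;
}

int main(void)
{
	/* A hypothetical 100-byte function: */
	printf("start at offset 48 (16-byte aligned): %u lines\n",
	       lines_touched(48, 100));		/* 3 lines: bytes 48..147 */
	printf("start at offset 64 (64-byte aligned): %u lines\n",
	       lines_touched(64, 100));		/* 2 lines: bytes 64..163 */
	return 0;
}

A function that is not line-aligned drags the tail of its
predecessor's cache line into its own footprint; at 64-byte alignment
every line it occupies holds only its own code.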
> Surprisingly there's a rather marked improvement in elapsed time as
> well:
>
> linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed ( +- 0.03% )
> linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed ( +- 0.12% )
> linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed ( +- 0.30% )
> linux-falign-functions=128-bytes/res.txt: 7.314226040 seconds time elapsed ( +- 0.29% )
> linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed [vanilla kernel] ( +- 0.48% )
> linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed ( +- 0.28% )
> linux-falign-functions=__4-bytes/res.txt: 7.371721930 seconds time elapsed ( +- 0.26% )
> linux-falign-functions=__2-bytes/res.txt: 7.410033936 seconds time elapsed ( +- 0.34% )
> linux-falign-functions=256-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
> linux-falign-functions=512-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 8.531418784 seconds time elapsed ( +- 0.19% )
>
> the workload got 2.5% faster - which is pretty nice! This result is 5+
> standard deviations above the noise of the measurement.
>
> Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y)
> performance is: markedly higher cache miss rate despite a 'smaller'
> kernel, and the workload is 16.3% slower (!).
>
> Part of the -Os picture is that the -Os kernel is executing many more
> instructions:
>
> linux-falign-functions=_64-bytes/res.txt: 11,851,763,357 instructions ( +- 0.01% )
> linux-falign-functions=__1-bytes/res.txt: 11,852,538,446 instructions ( +- 0.01% )
> linux-falign-functions=_16-bytes/res.txt: 11,854,159,736 instructions ( +- 0.01% )
> linux-falign-functions=__4-bytes/res.txt: 11,864,421,708 instructions ( +- 0.01% )
> linux-falign-functions=__8-bytes/res.txt: 11,865,947,941 instructions ( +- 0.01% )
> linux-falign-functions=_32-bytes/res.txt: 11,867,369,566 instructions ( +- 0.01% )
> linux-falign-functions=128-bytes/res.txt: 11,867,698,477 instructions ( +- 0.01% )
> linux-falign-functions=__2-bytes/res.txt: 11,870,853,247 instructions ( +- 0.01% )
> linux-falign-functions=256-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
> linux-falign-functions=512-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 14,318,175,358 instructions ( +- 0.01% )
>
> 21.2% more instructions executed ... that cannot go well.
>
> So this should be a reminder that it's the effective I$ footprint and
> the number of instructions executed that matter to performance, not
> kernel size alone. With current GCC, -Os should only be used on
> embedded systems where one is willing to make the kernel 10%+ slower
> in exchange for a 20% smaller kernel.

Can you post your .config for the test?

If you have CONFIG_OPTIMIZE_INLINING=y in your -Os test, consider
re-testing with it turned off. You may be seeing this:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
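For reference, this is roughly what that option controls, simplified
from include/linux/compiler-gcc.h of that era (the real code also adds
`notrace` and an arch capability check):

#ifdef CONFIG_OPTIMIZE_INLINING
/* `inline` stays a mere hint; under -Os GCC often ignores it and
 * emits an out-of-line copy plus call overhead instead. */
# define inline inline
#else
/* The hint becomes an order: GCC must inline these functions. */
# define inline inline __attribute__((always_inline))
#endif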
> The AMD system, with a starkly different x86 microarchitecture, is
> showing similar characteristics:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)
> linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%)
> linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%)
> linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%)
> linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%)
>
> 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes
> is much steeper than on Intel systems: cache associativity is likely
> lower on this AMD CPU.
>
> Interestingly the 1 byte alignment result is still pretty good on AMD
> systems - and I used the exact same kernel image on both systems, so
> the layout of the functions is exactly the same.
>
> Elapsed time is noisier, but shows a similar trend:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
> linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% )
> linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
> linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )
> linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% )
> linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% )
> linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% )
> linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% )
> linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% )
> linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% )
>
> 64 bytes alignment is the sweet spot here as well, it's 3.7% faster
> than the default 16 bytes alignment.

On AMD, 64 bytes wins too, yes, but by a *very* small margin. The
8-byte and 1-byte alignments have basically the same timings, and both
take only about 0.6% longer to run:

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed

I wouldn't say that it's the same as Intel. There the difference
between 64-byte alignment and no alignment at all is about five times
larger, at +3%:

linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed
linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed
linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed
linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed
linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed
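Spelled out (my arithmetic, plugging in the elapsed times quoted
above):

/* Relative slowdowns vs. the 64-byte builds, from the numbers above. */
#include <stdio.h>

int main(void)
{
	double amd_64   = 1.928409143, amd_8   = 1.940703051;
	double intel_64 = 7.154816369, intel_1 = 7.367139908;

	printf("AMD,   8-byte vs. 64-byte: +%.2f%%\n",
	       (amd_8 / amd_64 - 1.0) * 100.0);		/* +0.64% */
	printf("Intel, 1-byte vs. 64-byte: +%.2f%%\n",
	       (intel_1 / intel_64 - 1.0) * 100.0);	/* +2.97% */
	return 0;
}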
> So based on those measurements, I think we should do the exact
> opposite of my original patch that reduced alignment to 1 byte, and
> increase kernel function address alignment from 16 bytes to the
> natural cache line size (64 bytes on modern CPUs).
>
> +	#
> +	# Allocate a separate cacheline for every function,
> +	# for optimal instruction cache packing:
> +	#
> +	KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)

How about -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1
instead?

This avoids the pathological case where a function starting just a few
bytes after a 64-byte boundary gets aligned all the way to the next
one, wasting ~60 bytes of padding.
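The GCC manual describes -falign-functions=n as aligning to the next
power-of-two boundary >= n while skipping at most n-1 padding bytes.
A toy model of that rule (my reading of the manual, not GCC's actual
code) shows why n = 64/2 + 1 = 33 helps:

/* Toy model of -falign-functions=n: align the next function to the
 * next power-of-two boundary >= n, but only if that wastes fewer
 * than n bytes of padding. */
#include <stdio.h>

static unsigned next_fn_start(unsigned addr, unsigned n)
{
	unsigned boundary = 1, aligned;

	while (boundary < n)		/* next power of two >= n */
		boundary <<= 1;

	aligned = (addr + boundary - 1) & ~(boundary - 1);

	return (aligned - addr < n) ? aligned : addr;
}

int main(void)
{
	/* Previous function ends 4 bytes past a 64-byte boundary: */
	printf("n=64: next function at %u\n", next_fn_start(68, 64));	/* 128 (60 pad bytes) */
	printf("n=33: next function at %u\n", next_fn_start(68, 33));	/* 68  (no padding)   */
	return 0;
}

With n=33, a function sitting just past a boundary is left alone
instead of burning ~60 bytes of padding, while functions within 32
bytes of the next cache line still get pulled onto it.

-- 
vda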