Date: Thu, 21 May 2015 13:36:17 +0200
From: Ingo Molnar
To: Denys Vlasenko
Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Peter Anvin,
	Linux Kernel Mailing List, Tim Chen, Borislav Petkov,
	Peter Zijlstra, "Chandramouleeswaran, Aswin", Brian Gerst,
	Paul McKenney, Thomas Gleixner, Jason Low,
	linux-tip-commits@vger.kernel.org, Arjan van de Ven, Andrew Morton
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions
Message-ID: <20150521113617.GA7911@gmail.com>
References: <20150410121808.GA19918@gmail.com> <20150517055551.GB17002@gmail.com>
	<20150519213820.GA31688@gmail.com> <555C7C57.1070608@redhat.com>
In-Reply-To: <555C7C57.1070608@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

* Denys Vlasenko wrote:

> I was thinking about Ingo's AMD results:
>
>   linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
>   linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
>   linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed
>
> AMD is almost perfect. Having no alignment at all still works very
> well. [...]

Not quite. As I mentioned in my post, the 'time elapsed' numbers were
very noisy in the AMD case - and you've cut off the stddev column that
shows this.
Here is the full data:

   linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed  ( +- 2.74% )
   linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed  ( +- 1.84% )
   linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed  ( +- 2.15% )

A stddev of 2-3% against a ~0.6% difference in elapsed time is not
conclusive. What you should use instead are the cache-miss counts,
which are a good proxy for I$ footprint and a lot more stable
statistically:

   linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses  ( +- 0.10% )  (100.00%)
   linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses  ( +- 0.18% )  (100.00%)
   linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses  ( +- 0.17% )  (100.00%)

These show that 64 bytes alignment still generates a better I$ layout
than tight packing, resulting in 4.3% fewer I$ misses. On Intel it's
more pronounced:

   linux-falign-functions=_64-bytes/res.txt:    647,853,942 L1-icache-load-misses  ( +- 0.07% )  (100.00%)
   linux-falign-functions=__1-bytes/res.txt:    724,539,055 L1-icache-load-misses  ( +- 0.31% )  (100.00%)

a 12% difference. Note that the Intel workload runs off SSDs, which
makes the cache footprint several times larger, and the workload is
also more realistic than the AMD test, which ran in tmpfs. I think
it's a fair bet that the AMD system would show a similar difference
if it ran the same workload.

Allowing smaller functions to be cut in half by cacheline boundaries
looks like a losing strategy, especially with larger workloads. The
modified scheme I suggested - 64 bytes alignment plus intelligent
packing - might do even better than dumb 64 bytes alignment.

Thanks,

	Ingo
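[ To make the trade-off concrete, here is a toy model of how the three
  placement policies discussed above interact with 64-byte I$ lines. It
  is an illustration of the argument, not the actual linker or compiler
  logic, and the function sizes are made up for the example: ]

```python
# Toy model: count how many 64-byte I$ lines a set of functions touches
# under (a) tight 1-byte packing, (b) dumb 64-byte alignment, and
# (c) "intelligent packing": pack tightly, but move a function to the
# next line boundary when that avoids straddling extra cache lines.
# This is NOT real toolchain behavior, just the shape of the argument.

LINE = 64  # I$ line size in bytes on both CPUs discussed

def lines_spanned(start, size, line=LINE):
    """Number of cache lines the byte range [start, start+size) touches."""
    return (start + size - 1) // line - start // line + 1

def layout(sizes, align):
    """Place functions back to back, aligning each start to 'align' bytes."""
    offsets, pos = [], 0
    for size in sizes:
        pos = (pos + align - 1) // align * align
        offsets.append(pos)
        pos += size
    return offsets

def layout_packed(sizes, line=LINE):
    """Pack tightly, bumping a function to a fresh line only when a
    fresh start would touch fewer cache lines than the tight spot."""
    offsets, pos = [], 0
    for size in sizes:
        if lines_spanned(0, size, line) < lines_spanned(pos, size, line):
            pos = (pos // line + 1) * line
        offsets.append(pos)
        pos += size
    return offsets

def total_lines(sizes, offsets):
    return sum(lines_spanned(o, s) for o, s in zip(offsets, sizes))

# Hypothetical mix of small and mid-sized kernel functions, in bytes:
sizes = [40, 40, 40, 200, 40, 40]
for name, offs in [("1-byte packing ", layout(sizes, 1)),
                   ("64-byte align  ", layout(sizes, 64)),
                   ("64B + packing  ", layout_packed(sizes))]:
    print(name, total_lines(sizes, offs), "lines touched,",
          offs[-1] + sizes[-1], "bytes of address space")
```

In this made-up layout, tight packing touches the most cache lines (the
40-byte functions keep straddling line boundaries), dumb 64-byte
alignment touches fewer lines but wastes address space on padding, and
the packed variant matches the alignment variant's line count in a
smaller footprint - which is the intuition behind the "64 bytes
alignment + intelligent packing" suggestion.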