From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756501AbbDJRyW (ORCPT ); Fri, 10 Apr 2015 13:54:22 -0400
Received: from mail-wg0-f42.google.com ([74.125.82.42]:36001 "EHLO
	mail-wg0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1756248AbbDJRyN (ORCPT );
	Fri, 10 Apr 2015 13:54:13 -0400
Date: Fri, 10 Apr 2015 19:54:07 +0200
From: Ingo Molnar
To: "H. Peter Anvin"
Cc: Denys Vlasenko, "Paul E. McKenney", Linus Torvalds, Jason Low,
	Peter Zijlstra, Davidlohr Bueso, Tim Chen,
	Aswin Chandramouleeswaran, LKML, Borislav Petkov,
	Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries
Message-ID: <20150410175407.GB6563@gmail.com>
References: <20150409183926.GM6464@linux.vnet.ibm.com>
	<20150410090051.GA28549@gmail.com>
	<20150410091252.GA27630@gmail.com>
	<20150410092152.GA21332@gmail.com>
	<20150410111427.GA30477@gmail.com>
	<20150410112748.GB30477@gmail.com>
	<20150410120846.GA17101@gmail.com>
	<5527C700.3030405@redhat.com>
	<5527CD92.1080901@zytor.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5527CD92.1080901@zytor.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* H. Peter Anvin wrote:

> On 04/10/2015 05:50 AM, Denys Vlasenko wrote:
> >
> > However, I'm an -Os guy. Expect -O2 people to disagree :)
> >
>
> The problem with -Os is that the compiler will make *any* tradeoffs
> to save a byte. It is really designed to squeeze as much code into
> a fixed-size chunk, e.g. a ROM, as possible.
>
> We have asked for an -Okernel mode from the gcc folks forever. It
> basically would mean "-Os except when really dumb."

Yes, and it appears that not aligning to 16 bytes gives 5.5% size
savings already - which is a big chunk of the -Os win.
So we might be able to get a "poor man's -Okernel" by not aligning.
(I'm also looking at GCC options to make loop unrolls less aggressive -
that's another common source of bloat.)

I strongly suspect it's the silly 'use weird, wildly data-dependent
instructions just to save a single byte' games that are killing -Os
performance in practice.

> As far as the 16-byte alignment, my understanding is not that it is
> related to the I$ but rather is the decoder datum.

Yeah, but the decoder stops if the prefetch crosses a cache line? So it
appears to be an interaction of the 16 byte prefetch window and cache
line boundaries?

Btw., given that much of a real life kernel's instructions execute
cache-cold, a 5.5% reduction in kernel size could easily speed up
cache-cold execution by a couple of percent.

In the cache-cold case the prefetch window size is probably not
important at all: what determines execution speed is cache miss latency
and cache footprint.

[ At least in my simple mental picture of it, which might be wrong ;-) ]

Thanks,

	Ingo
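
[ To make the size-versus-fetch tradeoff above concrete, here is a toy
  model in Python. It is only a sketch under stated assumptions: the
  jump-target offsets are made up, the fetch unit is assumed to read
  aligned 16-byte blocks, and the knock-on effect of padding on later
  offsets is ignored. It says nothing about real kernel numbers such as
  the 5.5% figure. ]

```python
# Toy model only: offsets and the 16-byte aligned fetch block are
# assumptions for illustration, not measurements of any real CPU or kernel.

FETCH_BLOCK = 16  # assumed size of an aligned instruction fetch block, in bytes


def padding_bytes(offset, align):
    """Bytes of padding needed to move `offset` up to the next `align` boundary."""
    return (-offset) % align


def total_padding(targets, align):
    """Padding summed over all targets, each one padded independently.

    Ignores that padding one target shifts every later target, so this is
    a rough estimate of the bloat, not an exact layout.
    """
    return sum(padding_bytes(t, align) for t in targets)


def first_fetch_bytes(offset):
    """Useful bytes in the first fetch block when jumping to `offset`,
    assuming fetches are aligned FETCH_BLOCK-byte reads."""
    return FETCH_BLOCK - offset % FETCH_BLOCK


# Hypothetical jump-target offsets inside one function.
targets = [0x07, 0x1D, 0x35, 0x4A, 0x73]

print("padding with 16-byte alignment:", total_padding(targets, 16))  # 42
print("padding with 1-byte alignment: ", total_padding(targets, 1))   # 0
for t in targets:
    print(hex(t), "first fetch yields", first_fetch_bytes(t), "useful bytes")
```

[ In this model 16-byte alignment costs 42 bytes of padding for five
  targets, while 1-byte alignment costs none but leaves some targets
  with only a few useful bytes in their first fetch block. Which side
  wins in practice depends on how cache-cold the code is, which is
  exactly the open question above. ]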