From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 10 Apr 2015 17:25:10 +0200
From: Borislav Petkov
To: Denys Vlasenko
Cc: Ingo Molnar, "Paul E. McKenney", Linus Torvalds, Jason Low,
    Peter Zijlstra, Davidlohr Bueso, Tim Chen,
    Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst,
    "H. Peter Anvin", Thomas Gleixner
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries
Message-ID: <20150410152510.GK28074@pd.tnic>
References: <20150410090051.GA28549@gmail.com>
 <20150410091252.GA27630@gmail.com>
 <20150410092152.GA21332@gmail.com>
 <20150410111427.GA30477@gmail.com>
 <20150410112748.GB30477@gmail.com>
 <20150410120846.GA17101@gmail.com>
 <20150410131929.GE28074@pd.tnic>
 <5527D631.4090905@redhat.com>
 <20150410140141.GI28074@pd.tnic>
 <5527E3E9.7010608@redhat.com>
In-Reply-To: <5527E3E9.7010608@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 10, 2015 at 04:53:29PM +0200, Denys Vlasenko wrote:
> There are people who experimentally researched this.
> According to this guy:
>
> http://www.agner.org/optimize/microarchitecture.pdf
>
> Intel CPUs can decode only up to 16 bytes at a time
> (but they have loop buffers and some have a uop cache,
> which can skip decoding entirely).

Ok, so Intel doesn't need a 32-byte fetch window. Probably the uop
cache and other tricks make a larger fetch window not really important
in that case.
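As a back-of-the-envelope sketch (mine, not from the thread): count how
many aligned 16-byte fetch windows a code region touches. A loop body
that fits in one window when its entry is aligned can straddle two when
it is not, halving the effective fetch/decode bandwidth on cores where
the uop cache or loop buffer does not cover the loop.

```python
def fetch_windows(entry, length, window=16):
    """Aligned fetch windows touched by code at [entry, entry+length)."""
    first = entry // window
    last = (entry + length - 1) // window
    return last - first + 1

# A 16-byte loop body:
print(fetch_windows(0, 16))   # aligned entry: 1 window per iteration
print(fetch_windows(10, 16))  # misaligned entry: 2 windows per iteration
```

The same arithmetic applies to the 32-byte windows discussed below for
AMD; only the `window` argument changes.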
A larger fetch window also means more power, and having predecoded
stuff in a cache is a much better story no matter how you look at it.

> AMD CPUs can decode 21 bytes at best. With two cores active,
> only 16 bytes.
>
> """
> 10 Haswell pipeline
> ...
> 10.1 Pipeline
> The pipeline is similar to previous designs, but improved with more
> of everything.

It is "more of everything", yeah! :)

> 10.2 Instruction fetch and decoding
> The instruction fetch unit can fetch a maximum of 16 bytes of code
> per clock cycle in single threaded applications.

That uop cache is simply spot on, it seems.

> There are four decoders, which can handle instructions generating up
> to four μops per clock cycle in the way described on page 120 for
> Sandy Bridge.
> Instructions with any number of prefixes are decoded in a single
> clock cycle. There is no penalty for redundant prefixes.

That's nice.

> 15 AMD Bulldozer, Piledriver and Steamroller pipeline
> 15.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller
> ...
> 15.2 Instruction fetch
> The instruction fetcher is shared between the two cores of an
> execution unit. The instruction fetcher can fetch 32 aligned bytes
> of code per clock cycle from the level-1 code cache.

That's also understandable - you want to enlarge the fetch window for
the two cores of a compute unit as they share a front end.

> The measured fetch rate was up to 16 bytes per clock per core when
> two cores were active, and up to 21 bytes per clock in linear code
> when only one core was active. The fetch rate is lower than these
> maximum values when instructions are misaligned.
> Critical subroutine entries and loop entries should not start near
> the end of a 32-bytes block. You may align critical entries by 16 or
> at least make sure there is no 16-bytes boundary in the first four
> instructions after a critical label.
> """

All F15h models are Bulldozer uarch with improvements.
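Agner's rule of thumb quoted above can be sketched as a checker (my
rough reading of the rule, not code from his manual): a critical label
is fine if it is 16-byte aligned, or if no 16-byte boundary falls
inside its first four instructions.

```python
def entry_ok(entry, insn_lens, align=16):
    """entry: byte offset of the label; insn_lens: instruction lengths
    in bytes, in program order."""
    if entry % align == 0:
        return True            # aligned entry is always fine
    end = entry + sum(insn_lens[:4])   # bytes covered by first 4 insns
    # OK only if the first four instructions sit in one aligned block
    return entry // align == (end - 1) // align

print(entry_ok(32, [5, 3, 4, 2]))  # aligned entry -> True
print(entry_ok(12, [5, 3, 4, 2]))  # crosses the boundary at 16 -> False
print(entry_ok(17, [3, 3, 3, 3]))  # 17..28 stays within 16..31 -> True
```

Instruction lengths here are placeholders; in practice you would take
them from the assembler listing or a disassembler.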
For example, later F15h models have things like a loop buffer and a
loop predictor which can replay loops under certain conditions, thus
diminishing the importance of the fetch window size with respect to
loop performance.

And then there's the AMD F16h Software Optimization Guide (that's the
Jaguar uarch):

"...The processor can fetch 32 bytes per cycle and can scan two
16-byte instruction windows for up to two instruction decodes per
cycle.
...
2.7.2 Loop Alignment
For the Family 16h processor loop alignment is not usually a
significant issue. However, for hot loops, some further knowledge of
trade-offs can be helpful. Since the processor can read an aligned
32-byte fetch block every cycle, to achieve maximum fetch bandwidth
the loop start point should be aligned to 32 bytes. For very hot
loops, it may be useful to further consider branch placement. The
branch predictor can process the first two branches in a cache line in
a single cycle through the sparse predictor. For best performance, any
branches in the first cache line of the hot loop should be in the
sparse predictor. The simplest way to guarantee this for very hot
loops is to align the start point to a cache line (64-byte) boundary."

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--