From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S933718AbbDJPuA (ORCPT ); Fri, 10 Apr 2015 11:50:00 -0400
Received: from mx1.redhat.com ([209.132.183.28]:57451 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S933496AbbDJPt5 (ORCPT ); Fri, 10 Apr 2015 11:49:57 -0400
Message-ID: <5527F0E3.40306@redhat.com>
Date: Fri, 10 Apr 2015 17:48:51 +0200
From: Denys Vlasenko
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Borislav Petkov
CC: Ingo Molnar, "Paul E. McKenney", Linus Torvalds, Jason Low,
        Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran,
        LKML, Andy Lutomirski, Brian Gerst, "H. Peter Anvin",
        Thomas Gleixner, Peter Zijlstra
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries
References: <20150410090051.GA28549@gmail.com> <20150410091252.GA27630@gmail.com>
        <20150410092152.GA21332@gmail.com> <20150410111427.GA30477@gmail.com>
        <20150410112748.GB30477@gmail.com> <20150410120846.GA17101@gmail.com>
        <20150410131929.GE28074@pd.tnic> <5527D631.4090905@redhat.com>
        <20150410140141.GI28074@pd.tnic> <5527E3E9.7010608@redhat.com>
        <20150410152510.GK28074@pd.tnic>
In-Reply-To: <20150410152510.GK28074@pd.tnic>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/10/2015 05:25 PM, Borislav Petkov wrote:
>> measured fetch rate was up to 16 bytes per clock per core when two cores were active, and
>> up to 21 bytes per clock in linear code when only one core was active. The fetch rate is
>> lower than these maximum values when instructions are misaligned.
>> Critical subroutine entries and loop entries should not start near the end of a 32-bytes block.
>> You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in
>> the first four instructions after a critical label.
>> """
>
> All F15h models are Bulldozer uarch with improvements. For example,
> later F15h models have things like loop buffer and loop predictor
> which can replay loops under certain conditions, thus diminishing the
> importance of the fetch window size wrt to loops performance.
>
> And then there's AMD F16h Software Optimization Guide, that's the Jaguar
> uarch:
>
> "...The processor can fetch 32 bytes per cycle and can scan two 16-byte
> instruction windows for up to two instruction decodes per cycle.

As you know, manuals are not be-all, end-all documents.
They contain mistakes. They are also written before silicon
is finalized, and sometimes they advertise capabilities
which in the end had to be downscaled. It's hard to check
a 1000+ page document and correct all mistakes, especially
hard-to-quantify ones.

In the same document, Agner Fog says that he failed to confirm
32-byte fetch on Fam16h CPUs:

"""
16 AMD Bobcat and Jaguar pipeline
...
...
16.2 Instruction fetch
The instruction fetch rate is stated as "up to 32 bytes per cycle",
but this is not confirmed by my measurements which consistently show
a maximum of 16 bytes per clock cycle on average for both Bobcat and
Jaguar. Some reports say that the Jaguar has a loop buffer, but I cannot
detect any improvement in performance for tiny loops.
"""
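
For reference, the alignment rule quoted at the top of this mail ("no
16-bytes boundary in the first four instructions after a critical label")
can be checked mechanically. Below is a minimal sketch in C, not taken from
any patch in this thread. The helper names (pad_to_align, crosses_boundary,
first_insns_cross_16) are hypothetical, and the caller is assumed to already
know the label address and the instruction lengths, for example from a
disassembly:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Padding bytes needed to move addr up to the next multiple of align.
 * align must be a nonzero power of two. (Hypothetical helper.) */
static inline uint64_t pad_to_align(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (align - (addr & (align - 1))) & (align - 1);
}

/* True if the byte range [start, start + len) contains a multiple of
 * window strictly inside it, i.e. the bytes are split across two
 * fetch windows. (Hypothetical helper.) */
static inline bool crosses_boundary(uint64_t start, uint64_t len,
				    uint64_t window)
{
	if (len == 0)
		return false;
	return (start / window) != ((start + len - 1) / window);
}

/* Illustration of the quoted rule: do the first (up to) four
 * instructions after a label straddle a 16-byte boundary?
 * insn_len[] holds instruction lengths in bytes, supplied by the
 * caller. (Hypothetical helper.) */
static inline bool first_insns_cross_16(uint64_t label,
					const uint8_t *insn_len, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n && i < 4; i++)
		total += insn_len[i];
	return crosses_boundary(label, total, 16);
}
```

This only encodes the rule as stated in the quote. Whether obeying it buys
anything on a given CPU is exactly the thing that has to be measured.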