From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933718AbbDJPuA (ORCPT <rfc822;w@1wt.eu>);
	Fri, 10 Apr 2015 11:50:00 -0400
Received: from mx1.redhat.com ([209.132.183.28]:57451 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S933496AbbDJPt5 (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Fri, 10 Apr 2015 11:49:57 -0400
Message-ID: <5527F0E3.40306@redhat.com>
Date: Fri, 10 Apr 2015 17:48:51 +0200
From: Denys Vlasenko <dvlasenk@redhat.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Borislav Petkov <bp@alien8.de>
CC: Ingo Molnar <mingo@kernel.org>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Jason Low <jason.low2@hp.com>, Peter Zijlstra <peterz@infradead.org>,
        Davidlohr Bueso <dave@stgolabs.net>,
        Tim Chen <tim.c.chen@linux.intel.com>,
        Aswin Chandramouleeswaran <aswin@hp.com>,
        LKML <linux-kernel@vger.kernel.org>,
        Andy Lutomirski <luto@amacapital.net>, Brian Gerst <brgerst@gmail.com>,
        "H. Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries
References: <20150410090051.GA28549@gmail.com> <20150410091252.GA27630@gmail.com> <20150410092152.GA21332@gmail.com> <20150410111427.GA30477@gmail.com> <20150410112748.GB30477@gmail.com> <20150410120846.GA17101@gmail.com> <20150410131929.GE28074@pd.tnic> <5527D631.4090905@redhat.com> <20150410140141.GI28074@pd.tnic> <5527E3E9.7010608@redhat.com> <20150410152510.GK28074@pd.tnic>
In-Reply-To: <20150410152510.GK28074@pd.tnic>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/10/2015 05:25 PM, Borislav Petkov wrote:
>> measured fetch rate was up to 16 bytes per clock per core when two cores were active, and
>> up to 21 bytes per clock in linear code when only one core was active. The fetch rate is
>> lower than these maximum values when instructions are misaligned.
>> Critical subroutine entries and loop entries should not start near the end of a 32-bytes block.
>> You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in
>> the first four instructions after a critical label.
>> """
> 
> All F15h models are Bulldozer uarch with improvements. For example,
> later F15h models have things like loop buffer and loop predictor
> which can replay loops under certain conditions, thus diminishing the
> importance of the fetch window size wrt to loops performance.
> 
> And then there's AMD F16h Software Optimization Guide, that's the Jaguar
> uarch:
> 
> "...The processor can fetch 32 bytes per cycle and can scan two 16-byte
> instruction windows for up to two instruction decodes per cycle.

As you know, manuals are not be-all, end-all documents.
They contains mistakes. And they are written before silicon
is finalized, and sometimes they advertise capabilities
which in the end had to be downscaled. It's hard to check
a 1000+ pages document and correct all mistakes, especially
hard-to-quantify ones.

In the same document by Agner Fog, he says that he failed to confirm
32-byte fetch on Fam16h CPUs:

"""
16 AMD Bobcat and Jaguar pipeline
...
...
16.2 Instruction fetch
The instruction fetch rate is stated as "up to 32 bytes per cycle",
but this is not confirmed by my measurements which consistently show
a maximum of 16 bytes per clock cycle on average for both Bobcat and
Jaguar. Some reports say that the Jaguar has a loop buffer,
but I cannot detect any improvement in performance for tiny loops.
"""