Re: [PATCH] x86: Align jump targets to 1 byte boundaries

From: Borislav Petkov <bp@alien8.de>
To: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Jason Low <jason.low2@hp.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aswin Chandramouleeswaran <aswin@hp.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Brian Gerst <brgerst@gmail.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries
Date: Fri, 10 Apr 2015 17:25:10 +0200	[thread overview]
Message-ID: <20150410152510.GK28074@pd.tnic> (raw)
In-Reply-To: <5527E3E9.7010608@redhat.com>

On Fri, Apr 10, 2015 at 04:53:29PM +0200, Denys Vlasenko wrote:
> There are people who experimentally researched this.
> According to this guy:
> 
> http://www.agner.org/optimize/microarchitecture.pdf
> 
> Intel CPUs can decode only up to 16 bytes at a time
> (but the have loop buffers and some has uop cache,
> which can skip decoding entirely).

Ok, so Intel don't need a 32-byte fetch window. Probably the uop cache
and other tricks make a larger fetch window not really important in that
case. A larger fetch window means also more power and having predecoded
stuff in a cache is a much better story no matter from where you look at
it.

> AMD CPUs can decode 21 bytes at best. With two cores active,
> only 16 bytes.
> 
> 
> """
> 10 Haswell pipeline
> ...
> 10.1 Pipeline
> The pipeline is similar to previous designs, but improved with more of everything. It is

"more of everything", yeah! :)

> 10.2 Instruction fetch and decoding
> The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle in single
> threaded applications.

That uop cache is simply spot on, it seems.

> There are four decoders, which can handle instructions generating up to four μops per clock
> cycle in the way described on page 120 for Sandy Bridge.
> Instructions with any number of prefixes are decoded in a single clock cycle. There is no
> penalty for redundant prefixes.

That's nice.

> 15 AMD Bulldozer, Piledriver and Steamroller pipeline
> 15.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller
> ...
> 15.2 Instruction fetch
> The instruction fetcher is shared between the two cores of an execution unit. The instruction
> fetcher can fetch 32 aligned bytes of code per clock cycle from the level-1 code cache. The

That's also understandable - you want to enlarge the fetch window for
the two cores of a compute unit as they share a front end.

> measured fetch rate was up to 16 bytes per clock per core when two cores were active, and
> up to 21 bytes per clock in linear code when only one core was active. The fetch rate is
> lower than these maximum values when instructions are misaligned.
> Critical subroutine entries and loop entries should not start near the end of a 32-bytes block.
> You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in
> the first four instructions after a critical label.
> """

All F15h models are Bulldozer uarch with improvements. For example,
later F15h models have things like loop buffer and loop predictor
which can replay loops under certain conditions, thus diminishing the
importance of the fetch window size wrt to loops performance.

And then there's AMD F16h Software Optimization Guide, that's the Jaguar
uarch:

"...The processor can fetch 32 bytes per cycle and can scan two 16-byte
instruction windows for up to two instruction decodes per cycle.

...

2.7.2 Loop Alignment

For the Family 16h processor loop alignment is not usually a significant
issue. However, for hot loops, some further knowledge of trade-offs can
be helpful. Since the processor can read an aligned 32-byte fetch block
every cycle, to achieve maximum fetch bandwidth the loop start point
should be aligned to 32 bytes. For very hot loops, it may be useful to
further consider branch placement. The branch predictor can process the
first two branches in a cache line in a single cycle through the sparse
predictor. For best performance, any branches in the first cache line
of the hot loop should be in the sparse predictor. The simplest way to
guarantee this for very hot loops is to align the start point to a cache
line (64-byte) boundary."

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--