Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

From: Ingo Molnar <mingo@kernel.org>
To: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Brian Gerst <brgerst@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jason Low <jason.low2@hp.com>,
	"linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>,
	Arjan van de Ven <arjan@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions
Date: Thu, 21 May 2015 16:03:14 +0200	[thread overview]
Message-ID: <20150521140314.GB544@gmail.com> (raw)
In-Reply-To: <555C7012.3040806@redhat.com>

* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> > The AMD system, with a starkly different x86 microarchitecture, is 
> > showing similar characteristics:
> > 
> > linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
> > linux-falign-functions=_32-bytes/res-amd.txt:        110,433,214      L1-icache-load-misses                                         ( +-  0.15% )  (100.00%)
> > linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses                                         ( +-  0.17% )  (100.00%)
> > linux-falign-functions=128-bytes/res-amd.txt:        119,100,216      L1-icache-load-misses                                         ( +-  0.22% )  (100.00%)
> > linux-falign-functions=_16-bytes/res-amd.txt:        122,916,937      L1-icache-load-misses                                         ( +-  0.15% )  (100.00%)
> > linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
> > linux-falign-functions=__2-bytes/res-amd.txt:        124,337,908      L1-icache-load-misses                                         ( +-  0.71% )  (100.00%)
> > linux-falign-functions=__4-bytes/res-amd.txt:        125,221,805      L1-icache-load-misses                                         ( +-  0.09% )  (100.00%)
> > linux-falign-functions=256-bytes/res-amd.txt:        135,761,433      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
> > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        159,918,181      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
> > linux-falign-functions=512-bytes/res-amd.txt:        170,307,064      L1-icache-load-misses                                         ( +-  0.26% )  (100.00%)
> > 
> > 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes 
> > is much steeper than on Intel systems: cache associativity is likely 
> > lower on this AMD CPU.
> > 
> > Interestingly the 1 byte alignment result is still pretty good on AMD 
> > systems - and I used the exact same kernel image on both systems, so 
> > the layout of the functions is exactly the same.
> > 
> > Elapsed time is noisier, but shows a similar trend:
> > 
> > linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed                                          ( +-  2.74% )
> > linux-falign-functions=128-bytes/res-amd.txt:        1.932961745 seconds time elapsed                                          ( +-  2.18% )
> > linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed                                          ( +-  1.84% )
> > linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed                                          ( +-  2.15% )
> > linux-falign-functions=_32-bytes/res-amd.txt:        1.962074787 seconds time elapsed                                          ( +-  2.38% )
> > linux-falign-functions=_16-bytes/res-amd.txt:        2.000941789 seconds time elapsed                                          ( +-  1.18% )
> > linux-falign-functions=__4-bytes/res-amd.txt:        2.002305627 seconds time elapsed                                          ( +-  2.75% )
> > linux-falign-functions=256-bytes/res-amd.txt:        2.003218532 seconds time elapsed                                          ( +-  3.16% )
> > linux-falign-functions=__2-bytes/res-amd.txt:        2.031252839 seconds time elapsed                                          ( +-  1.77% )
> > linux-falign-functions=512-bytes/res-amd.txt:        2.080632439 seconds time elapsed                                          ( +-  1.06% )
> > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        2.346644318 seconds time elapsed                                          ( +-  2.19% )
> > 
> > 64 bytes alignment is the sweet spot here as well, it's 3.7% 
> > faster than the default 16 bytes alignment.
> 
> In AMD, 64 bytes win too, yes, but by a *very* small margin. 8 bytes 
> and 1 byte alignments have basically same timings, and both take 
> what, +0.63% of time longer to run?

See my previous mail why this is a misleading conclusion: the timings 
have so much stddev that they are not really comparable. But if you 
look at the cachemiss you'll see the true difference:

 linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
 linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
 linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses                                         ( +-  0.17% )  (100.00%)

1 byte alignment isn't optimal on AMD either.

> > +        #
> > +        # Allocate a separate cacheline for every function,
> > +        # for optimal instruction cache packing:
> > +        #
> > +        KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
> 
> How about  -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1  instead?
> 
> This avoids pathological cases where function starting just a few 
> bytes after 64-bytes boundary gets aligned to the next one, wasting 
> ~60 bytes.

Well, that should be pretty similar to the 32-byte aligned numbers I 
already posted, right?

The 32-byte numbers are better than the default 16-byte alignment, but 
64-byte alignment was even better.

A quick build shows:

   text    data     bss     dec     hex filename
13534242        2579560 1634304 17748106        10ed08a linux-falign-functions=__1-bytes/vmlinux
13554530        2579560 1634304 17768394        10f1fca linux-falign-functions=__2-bytes/vmlinux
13590946        2579560 1634304 17804810        10fae0a linux-falign-functions=__4-bytes/vmlinux
13658786        2579560 1634304 17872650        110b70a linux-falign-functions=__8-bytes/vmlinux
13799602        2579560 1634304 18013466        112dd1a linux-falign-functions=_16-bytes/vmlinux
13943906        2579560 1634304 18157770        11510ca linux-falign-functions=_33-bytes/vmlinux [1]
14075330        2579560 1634304 18289194        117122a linux-falign-functions=_32-bytes/vmlinux [2]
14664898        2579560 1634304 18878762        120112a linux-falign-functions=_64-bytes/vmlinux
15980994        2579560 1634304 20194858        134262a linux-falign-functions=128-bytes/vmlinux
19038018        2591848 1634304 23264170        162fbaa linux-falign-functions=256-bytes/vmlinux
26391106        2591848 1634304 30617258        1d32eaa linux-falign-functions=512-bytes/vmlinux

Interestingly the -falign-functions=33 kernel is marginally smaller 
than the -falign-functions=32 kernel, by 0.1%.

But the numbers aren't exceptional:

       671,702,272      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
    11,892,913,320      instructions                                                  ( +-  0.01% )
           300,030      context-switches                                              ( +-  0.00% )

       7.349312237 seconds time elapsed                                          ( +-  1.20% )

Compared to the others it's on rank 3:

linux-falign-functions=_64-bytes/res.txt:        647,853,942      L1-icache-load-misses                                         ( +-  0.07% )  (100.00%)
linux-falign-functions=128-bytes/res.txt:        669,401,612      L1-icache-load-misses                                         ( +-  0.08% )  (100.00%)
linux-falign-functions=_33-bytes/res.txt: [NEW]  671,702,272      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
linux-falign-functions=_32-bytes/res.txt:        685,969,043      L1-icache-load-misses                                         ( +-  0.08% )  (100.00%)
linux-falign-functions=256-bytes/res.txt:        699,130,207      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
linux-falign-functions=512-bytes/res.txt:        699,130,207      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
linux-falign-functions=_16-bytes/res.txt:        706,080,917      L1-icache-load-misses   [vanilla kernel]                      ( +-  0.05% )  (100.00%)
linux-falign-functions=__1-bytes/res.txt:        724,539,055      L1-icache-load-misses                                         ( +-  0.31% )  (100.00%)
linux-falign-functions=__4-bytes/res.txt:        725,707,848      L1-icache-load-misses                                         ( +-  0.12% )  (100.00%)
linux-falign-functions=__8-bytes/res.txt:        726,543,194      L1-icache-load-misses                                         ( +-  0.04% )  (100.00%)
linux-falign-functions=__2-bytes/res.txt:        738,946,179      L1-icache-load-misses                                         ( +-  0.12% )  (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:        921,910,808      L1-icache-load-misses                                         ( +-  0.05% )  (100.00%)

So I still think the best strategy would be a variant of what I 
mentioned before:

  1) 'small functions': whose size is 1-63 bytes.

  2) 'large functions': whose size is 64- bytes.

  3) align all large functions to 64 bytes cachelines.

  4) pack all small functions into remaining large function tail 
     holes, without them spilling over into the next cacheline

  5) pack the remaining small functions with each other, tightly, as 
     long as they don't spill over to the next cacheline.

  6) once optimization steps 1-5 are performed, for all cachelines 
     that contain small functions and still have a hole near the end 
     of the cacheline, allow functions to be aligned to 32/16/8/4/2 
     bytes as long as they still fit into the cacheline.

So for example lets consider this layout with tight packing:

  [small-fn1][small-fn2][large-fn3...........................][small-fn4 ..... ][hole............]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Both 'fn3' and 'fn4' are on two cachelines.

After step 5 we'd get such a packing:

  [small-fn1][small-fn2][hole....][large-fn3...........................][small-fn4 ..... ][hole..]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

fn1, fn2 and fn4 each have their own cacheline, while large-fn3 
necessarily has to spill into the next cacheline due to its size. 
'fn4' is now on a single cache line.

After step 6 we'd get some extra intra-cacheline alignment by 
utilizing the holes:

  [small-fn1]..[small-fn2][hole..][large-fn3...........................]...[small-fn4 ..... ][hle]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Note how small-fn2 got moved forward by 2 bytes, to move it to an 8 
bytes boundary and small-fn4 got moved forward by 3 bytes - at the 
expense of the holes at the end of the cachelines.

This function alignment optimization basically minimizes the number of 
cache line boundaries intersecting individual functions, so this 
minimizes the I$ footprint aggressively.

This is not a trivial computation btw.: because even if we do this 
only per .o object file we'd have to consider nr_fns! permutations of 
all possible function ordering. We'd want to pick the ordering that 
matches the above 1-6 constraints, _and_ which is as close to 'source 
code ordering' as possible.

Such an optimization cannot be described via simple local alignment 
rules like any variant of -falign-functions, as it requires reordering 
of functions.

Thanks,

	Ingo