[RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Brian Gerst <brgerst@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jason Low <jason.low2@hp.com>,
	"linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>,
	Arjan van de Ven <arjan@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions
Date: Tue, 19 May 2015 23:38:20 +0200	[thread overview]
Message-ID: <20150519213820.GA31688@gmail.com> (raw)
In-Reply-To: <20150517055551.GB17002@gmail.com>


* Ingo Molnar <mingo@kernel.org> wrote:

> This function packing argument fails:
> 
>   - for large functions that are physically fragmented
> 
>   - if less than half of all functions in a hot workload are
>     packed together. This might be the common case in fact.
> 
>   - even if functions are technically 'packed' next to each other,
>     this only works for small functions: larger functions typically 
>     are hotter near their heads, with unlikely codepaths being in 
>     their tails.
> 
> > Size matters, but size matters mainly from an I$ standpoint, not 
> > from some absolute 'big is bad" issue.
> 
> Absolutely.
> 
> > [...] Also, even when size matters, performance matters too. I do 
> > want performance numbers. Is this measurable?
> 
> Will try to measure this. I'm somewhat sceptical that I'll be able 
> to measure any signal: alignment effects are very hard to measure on 
> x86, especially on any realistic workload.

So I spent the better part of today trying to measure all this. As 
expected it's very hard to measure such alignment effects: anything 
that misses the cache and is a real workload tends to be rather noisy.

After a couple of fruitless attempts I wrote the test program below.

It tries to create a more or less meaningful macro-workload that 
executes a lot of diverse kernel functions, while still having 
relatively low data footprint:

/*
 * Copyright (C) 2015, Ingo Molnar <mingo@redhat.com>
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <stdlib.h>
#include <locale.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define BUG_ON(cond) assert(!(cond))

#define DEFAULT_LOOPS 100000

int main(int argc, char **argv)
{
	const char *file_name = "./tmp-test-creat.txt";
	struct stat stat_buf;
	long loops;
	long i;
	int ret;

	setlocale(LC_ALL, "");

	if (argc > 1)
		loops = atol(argv[1]);
	else
		loops = DEFAULT_LOOPS;

	printf("# VFS-mix: creat()+stat()+mmap()+munmap()+close()+unlink()ing a file in $(pwd) %'ld times ... ", loops);
	fflush(stdout);

	for (i = 0; i < loops; i++) {
		int fd;

		fd = creat(file_name, 00700);
		BUG_ON(fd < 0);

		ret = lstat(file_name, &stat_buf);
		BUG_ON(ret != 0);

		ret = lseek(fd, 4095, SEEK_SET);
		BUG_ON(ret != 4095);

		close(fd);

		fd = open(file_name, O_RDWR|O_CREAT|O_TRUNC);
		BUG_ON(fd < 0);

		{
			char c = 1;

			ret = write(fd, &c, 1);
			BUG_ON(ret != 1);
		}

		{
			char *mmap_buf = (char *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

			BUG_ON(mmap_buf == (void *)-1L);

			mmap_buf[0] = 1;

			ret = munmap(mmap_buf, 4096);
			BUG_ON(ret != 0);
		}

		close(fd);

		ret = unlink(file_name);
		BUG_ON(ret != 0);
	}

	printf("done.\n");

	return 0;
}

This is a pretty silly test in itself that re-creates the same file 
again and again and does some VFS and MM ops on it, then deletes it. 
But the nice thing about it is that in every iteration, if ran on a 
journalling filesystem such as ext4, it touches:

  - the VFS
  - various filesystem low level functions
  - various IO/block layer functions
  - low level IO driver
  - memory allocators (slub and page allocator)
  - IRQ handling
  - page fault handling
  - mmap() paths
  - syscall entry/exit path
  - scheduling
  - the RCU subsystem
  - the workqueue subsystem
  - the perf events subsystem

That's a ton of activity: running this on ext4 executes around 1,500 
unique kernel functions (!), around 118,000 instructions per iteration 
- with relatively small data footprint.

The profile is pretty flat, with almost 1,000 functions being beyond 
the 0.01% overhead threshold.

This test workload has a non-trivial I$ footprint: with 1,500 
functions and the median kernel function size of around 150 bytes, 
it's about 220k of text executed - well beyond L1 instruction cache 
sizes.

So this is a test that has a realistic instruction cache footprint, 
but has micro-benchmark data footprint properties - which reduces 
noise while still excercising the I$ effect we seek.

To further reduce measurement noise, I ran this on a system with SSD 
disks, so the IRQ and block IO workload is almost synchronous and the 
CPU is fully utilized.

Then I bound the workload to a single CPU, with prio SCHED_FIFO:90, 
and ran it the following way:

    taskset 1 chrt -f 99 perf stat -a -C 0 --pre sync --repeat 10 \
      -e L1-icache-load-misses \
      -e instructions          \
      -e context-switches      \
    ./test-vfs

Then I built 11 kernels, each with different function alignment 
settings:

fomalhaut:~/linux> size linux-*/vmlinux | sort -n
    text           data     bss      dec     filename
12150606        2565544 1634304 16350454     linux-____CC_OPTIMIZE_FOR_SIZE=y/vmlinux
13534242        2579560 1634304 17748106     linux-falign-functions=__1-bytes/vmlinux
13554530        2579560 1634304 17768394     linux-falign-functions=__2-bytes/vmlinux
13590946        2579560 1634304 17804810     linux-falign-functions=__4-bytes/vmlinux
13658786        2579560 1634304 17872650     linux-falign-functions=__8-bytes/vmlinux
13799602        2579560 1634304 18013466     linux-falign-functions=_16-bytes/vmlinux
14075330        2579560 1634304 18289194     linux-falign-functions=_32-bytes/vmlinux
14664898        2579560 1634304 18878762     linux-falign-functions=_64-bytes/vmlinux
15980994        2579560 1634304 20194858     linux-falign-functions=128-bytes/vmlinux
19038018        2591848 1634304 23264170     linux-falign-functions=256-bytes/vmlinux
26391106        2591848 1634304 30617258     linux-falign-functions=512-bytes/vmlinux

(I've added the -Os kernel as a comparison.)

A single run results in output like:

 Performance counter stats for 'system wide' (10 runs):

       649,398,972      L1-icache-load-misses                                         ( +-  0.09% )  (100.00%)
    11,875,234,916      instructions                                                  ( +-  0.00% )
           300,038      context-switches                                              ( +-  0.00% )

       7.198533444 seconds time elapsed                                          ( +-  0.28% )

The 'instructions' and 'context-switches' metrics can be used to 
cross-check that the run was undisturbed and that it executed exactly 
the same workload as other runs.

'time elapsed' is rather noisy, while 'L1-icache-load-misses' is the 
L1 instruction cache misses metric that is the most interesting one: 
it's a proxy metric for effective I$ footprint of the workload: if the 
footprint is larger then the number of misses goes up, if it's tigher 
then the number of misses goes down.

I ran this on two systems:

... an Intel system:

 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 62
 model name      : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
 stepping        : 7

 cache size      : 25600 KB
 cache_alignment : 64

... and an AMD system:

 vendor_id       : AuthenticAMD
 cpu family      : 21
 model           : 1
 model name      : AMD Opteron(tm) Processor 6278                 
 stepping        : 2

 cache size      : 2048 KB
 cache_alignment : 64

The CPU frequencies were set to the max on both systems, to reduce 
power throttling noise and run-to-run skew.

The AMD system did not have SSDs, so there I used tmpfs: this reduced 
the cache footprint to about 30% of that of the Intel system. The 
alignment effect was still borderline measurable.

I ran the test 10 times on the AMD system (so 100 runs), and 3 times 
on the Intel system (which took longer due to the file touching 
ext4/SSD) - i.e. 30 runs.

I picked the best result from the tests, to reduce noise from workload 
perturbations and from data cache layout effects. (If I take the 
average values then the effect is still visible, but noisier.)

Here's the result from the Intel system:

linux-falign-functions=_64-bytes/res.txt:        647,853,942      L1-icache-load-misses                                         ( +-  0.07% )  (100.00%)
linux-falign-functions=128-bytes/res.txt:        669,401,612      L1-icache-load-misses                                         ( +-  0.08% )  (100.00%)
linux-falign-functions=_32-bytes/res.txt:        685,969,043      L1-icache-load-misses                                         ( +-  0.08% )  (100.00%)
linux-falign-functions=256-bytes/res.txt:        699,130,207      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
linux-falign-functions=512-bytes/res.txt:        699,130,207      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
linux-falign-functions=_16-bytes/res.txt:        706,080,917      L1-icache-load-misses   [vanilla kernel]                      ( +-  0.05% )  (100.00%)
linux-falign-functions=__1-bytes/res.txt:        724,539,055      L1-icache-load-misses                                         ( +-  0.31% )  (100.00%)
linux-falign-functions=__4-bytes/res.txt:        725,707,848      L1-icache-load-misses                                         ( +-  0.12% )  (100.00%)
linux-falign-functions=__8-bytes/res.txt:        726,543,194      L1-icache-load-misses                                         ( +-  0.04% )  (100.00%)
linux-falign-functions=__2-bytes/res.txt:        738,946,179      L1-icache-load-misses                                         ( +-  0.12% )  (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:        921,910,808      L1-icache-load-misses                                         ( +-  0.05% )  (100.00%)

The optimal I$ miss rate is at 64 bytes - which is 9% better than the 
default kernel's I$ miss rate at 16 bytes alignment.

The 128/256/512 bytes numbers show an increasing amount of cache 
misses: probably due to the artificially reduced associativity of the 
caching.

Surprisingly there's a rather marked improvement in elapsed time as 
well:

linux-falign-functions=_64-bytes/res.txt:        7.154816369 seconds time elapsed                                          ( +-  0.03% )
linux-falign-functions=_32-bytes/res.txt:        7.231074263 seconds time elapsed                                          ( +-  0.12% )
linux-falign-functions=__8-bytes/res.txt:        7.292203002 seconds time elapsed                                          ( +-  0.30% )
linux-falign-functions=128-bytes/res.txt:        7.314226040 seconds time elapsed                                          ( +-  0.29% )
linux-falign-functions=_16-bytes/res.txt:        7.333597250 seconds time elapsed     [vanilla kernel]                     ( +-  0.48% )
linux-falign-functions=__1-bytes/res.txt:        7.367139908 seconds time elapsed                                          ( +-  0.28% )
linux-falign-functions=__4-bytes/res.txt:        7.371721930 seconds time elapsed                                          ( +-  0.26% )
linux-falign-functions=__2-bytes/res.txt:        7.410033936 seconds time elapsed                                          ( +-  0.34% )
linux-falign-functions=256-bytes/res.txt:        7.507029637 seconds time elapsed                                          ( +-  0.07% )
linux-falign-functions=512-bytes/res.txt:        7.507029637 seconds time elapsed                                          ( +-  0.07% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:        8.531418784 seconds time elapsed                                          ( +-  0.19% )

the workload got 2.5% faster - which is pretty nice! This result is 5+ 
standard deviations above the noise of the measurement.

Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y) 
performance is: markedly higher cache miss rate despite a 'smaller' 
kernel, and the workload is 16.3% slower (!).

Part of the -Os picture is that the -Os kernel is executing much more 
instructions:

linux-falign-functions=_64-bytes/res.txt:     11,851,763,357      instructions                                                  ( +-  0.01% )
linux-falign-functions=__1-bytes/res.txt:     11,852,538,446      instructions                                                  ( +-  0.01% )
linux-falign-functions=_16-bytes/res.txt:     11,854,159,736      instructions                                                  ( +-  0.01% )
linux-falign-functions=__4-bytes/res.txt:     11,864,421,708      instructions                                                  ( +-  0.01% )
linux-falign-functions=__8-bytes/res.txt:     11,865,947,941      instructions                                                  ( +-  0.01% )
linux-falign-functions=_32-bytes/res.txt:     11,867,369,566      instructions                                                  ( +-  0.01% )
linux-falign-functions=128-bytes/res.txt:     11,867,698,477      instructions                                                  ( +-  0.01% )
linux-falign-functions=__2-bytes/res.txt:     11,870,853,247      instructions                                                  ( +-  0.01% )
linux-falign-functions=256-bytes/res.txt:     11,876,281,686      instructions                                                  ( +-  0.01% )
linux-falign-functions=512-bytes/res.txt:     11,876,281,686      instructions                                                  ( +-  0.01% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:     14,318,175,358      instructions                                                  ( +-  0.01% )

21.2% more instructions executed ... that cannot go well.

So this should be a reminder that it's effective I$ footprint and 
number of instructions executed that matters to performance, not 
kernel size alone. With current GCC -Os should only be used on 
embedded systems where one is willing to make the kernel 10%+ slower, 
in exchange for a 20% smaller kernel.

The AMD system, with a starkly different x86 microarchitecture, is 
showing similar characteristics:

linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
linux-falign-functions=_32-bytes/res-amd.txt:        110,433,214      L1-icache-load-misses                                         ( +-  0.15% )  (100.00%)
linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses                                         ( +-  0.17% )  (100.00%)
linux-falign-functions=128-bytes/res-amd.txt:        119,100,216      L1-icache-load-misses                                         ( +-  0.22% )  (100.00%)
linux-falign-functions=_16-bytes/res-amd.txt:        122,916,937      L1-icache-load-misses                                         ( +-  0.15% )  (100.00%)
linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
linux-falign-functions=__2-bytes/res-amd.txt:        124,337,908      L1-icache-load-misses                                         ( +-  0.71% )  (100.00%)
linux-falign-functions=__4-bytes/res-amd.txt:        125,221,805      L1-icache-load-misses                                         ( +-  0.09% )  (100.00%)
linux-falign-functions=256-bytes/res-amd.txt:        135,761,433      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        159,918,181      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
linux-falign-functions=512-bytes/res-amd.txt:        170,307,064      L1-icache-load-misses                                         ( +-  0.26% )  (100.00%)

64 bytes is a similar sweet spot. Note that the penalty at 512 bytes 
is much steeper than on Intel systems: cache associativity is likely 
lower on this AMD CPU.

Interestingly the 1 byte alignment result is still pretty good on AMD 
systems - and I used the exact same kernel image on both systems, so 
the layout of the functions is exactly the same.

Elapsed time is noisier, but shows a similar trend:

linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed                                          ( +-  2.74% )
linux-falign-functions=128-bytes/res-amd.txt:        1.932961745 seconds time elapsed                                          ( +-  2.18% )
linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed                                          ( +-  1.84% )
linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed                                          ( +-  2.15% )
linux-falign-functions=_32-bytes/res-amd.txt:        1.962074787 seconds time elapsed                                          ( +-  2.38% )
linux-falign-functions=_16-bytes/res-amd.txt:        2.000941789 seconds time elapsed                                          ( +-  1.18% )
linux-falign-functions=__4-bytes/res-amd.txt:        2.002305627 seconds time elapsed                                          ( +-  2.75% )
linux-falign-functions=256-bytes/res-amd.txt:        2.003218532 seconds time elapsed                                          ( +-  3.16% )
linux-falign-functions=__2-bytes/res-amd.txt:        2.031252839 seconds time elapsed                                          ( +-  1.77% )
linux-falign-functions=512-bytes/res-amd.txt:        2.080632439 seconds time elapsed                                          ( +-  1.06% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        2.346644318 seconds time elapsed                                          ( +-  2.19% )

64 bytes alignment is the sweet spot here as well, it's 3.7% faster 
than the default 16 bytes alignment.

So based on those measurements, I think we should do the exact 
opposite of my original patch that reduced alignment to 1 bytes, and 
increase kernel function address alignment from 16 bytes to the 
natural cache line size (64 bytes on modern CPUs).

The cost is a 6.2% larger kernel image:

 13799602        2579560 1634304 18013466     linux-falign-functions=_16-bytes/vmlinux
 14664898        2579560 1634304 18878762     linux-falign-functions=_64-bytes/vmlinux

But this is basically just a RAM cost, it does not increase effective 
cache footprint (in fact it decreases it), so it's likely a win on 
everything but RAM starved embedded systems.

In the future we could perhaps still pack small functions of certain 
subsystems (such as hot locking functions).

The patch below implements the alignment optimization by the build 
system using X86_L1_CACHE_SIZE to align functions, limited to 64-bit 
systems for the time being.

Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig.cpu | 17 +++++++++++++++++
 arch/x86/Makefile    |  6 ++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 6983314c8b37..9eacb85efda9 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -304,6 +304,23 @@ config X86_L1_CACHE_SHIFT
 	default "4" if MELAN || M486 || MGEODEGX1
 	default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
 
+config X86_L1_CACHE_SIZE
+	int
+	default "16" if X86_L1_CACHE_SHIFT=4
+	default "32" if X86_L1_CACHE_SHIFT=5
+	default "64" if X86_L1_CACHE_SHIFT=6
+	default "128" if X86_L1_CACHE_SHIFT=7
+
+#
+# We optimize function alignment on 64-bit kernels,
+# to pack the instruction cache optimally:
+#
+config X86_FUNCTION_ALIGNMENT
+	int
+	default "16"			if !64BIT
+	default X86_L1_CACHE_SIZE	if 64BIT && !CC_OPTIMIZE_FOR_SIZE
+	default "1"			if 64BIT && CC_OPTIMIZE_FOR_SIZE
+
 config X86_PPRO_FENCE
 	bool "PentiumPro memory ordering errata workaround"
 	depends on M686 || M586MMX || M586TSC || M586 || M486 || MGEODEGX1
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 57996ee840dd..45ebf0bbe833 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -83,6 +83,12 @@ else
         # Pack loops tightly as well:
         KBUILD_CFLAGS += -falign-loops=1
 
+        #
+        # Allocate a separate cacheline for every function,
+        # for optimal instruction cache packing:
+        #
+        KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)