[PATCH] x86: Pack loops tightly as well

From: Ingo Molnar <mingo@kernel.org>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Jason Low <jason.low2@hp.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aswin Chandramouleeswaran <aswin@hp.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Borislav Petkov <bp@alien8.de>,
	Andy Lutomirski <luto@amacapital.net>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Brian Gerst <brgerst@gmail.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: [PATCH] x86: Pack loops tightly as well
Date: Fri, 10 Apr 2015 14:30:18 +0200
Message-ID: <20150410123017.GB19918@gmail.com>
In-Reply-To: <20150410121808.GA19918@gmail.com>


* Ingo Molnar <mingo@kernel.org> wrote:

> > I realize that x86 CPU manufacturers recommend 16-byte jump target 
> > alignments (it's in the Intel optimization manual), but the cost 
> > of that is very significant:
> > 
> >         text           data       bss         dec      filename
> >     12566391        1617840   1089536    15273767      vmlinux.align.16-byte
> >     12224951        1617840   1089536    14932327      vmlinux.align.1-byte
> > 
> > By using 1 byte jump target alignment (i.e. no alignment at all) 
> > we get an almost 3% reduction in kernel size (!) - and a probably 
> > similar reduction in I$ footprint.
> 
> Likewise we could pack functions tightly as well via the patch 
> below:
> 
>      text	   data	    bss	     dec	 filename
>  12566391	1617840	1089536	15273767	 vmlinux.align.16-byte
>  12224951	1617840	1089536	14932327	 vmlinux.align.1-byte
>  11976567	1617840	1089536	14683943	 vmlinux.align.1-byte.funcs-1-byte
> 
> Which brings another 2% reduction in the kernel's code size.
> 
> It would be interesting to see some benchmarks with these two 
> patches applied. Only lightly tested.

And the final patch below also packs loops tightly:

     text      data       bss        dec   filename
 12566391   1617840   1089536   15273767   vmlinux.align.16-byte
 12224951   1617840   1089536   14932327   vmlinux.align.1-byte
 11976567   1617840   1089536   14683943   vmlinux.align.1-byte.funcs-1-byte
 11903735   1617840   1089536   14611111   vmlinux.align.1-byte.funcs-1-byte.loops-1-byte

The total reduction in text size is about 5.3% relative to the 
16-byte-aligned baseline.
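
The percentages can be rechecked from the text column of the table. 
The snippet below is only a sketch for redoing that arithmetic; the 
helper name reduction_pct is made up for illustration and is not part 
of any kernel tooling:

```python
# Text-segment sizes in bytes, copied from the size(1) table above.
text_sizes = {
    "vmlinux.align.16-byte": 12566391,
    "vmlinux.align.1-byte": 12224951,
    "vmlinux.align.1-byte.funcs-1-byte": 11976567,
    "vmlinux.align.1-byte.funcs-1-byte.loops-1-byte": 11903735,
}


def reduction_pct(before, after):
    """Percentage by which `after` is smaller than `before` (hypothetical helper)."""
    return 100.0 * (before - after) / before


baseline = text_sizes["vmlinux.align.16-byte"]
previous = baseline
for name, size in text_sizes.items():
    step = reduction_pct(previous, size)        # saving versus the previous build
    cumulative = reduction_pct(baseline, size)  # saving versus the 16-byte build
    print(f"{name}: step {step:.2f}%, cumulative {cumulative:.2f}%")
    previous = size
```

Note that the cumulative figure depends on the denominator: measured 
against the 16-byte-aligned build it comes out at roughly 5.3%, 
measured against the fully packed build at roughly 5.6%.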

Now loop alignment is beneficial if:

 - a loop is cache-hot and its surroundings are not.

Loop alignment is harmful if:

 - a loop is cache-cold
 - a loop's surroundings are cache-hot as well
 - two cache-hot loops are close to each other

and I'd argue that the three harmful scenarios are much more 
common in the kernel. Similar arguments can be made for function 
alignment as well. (Jump target alignment is a bit different but I 
think the same conclusion holds.)
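
To make the trade-off above concrete, here is a rough sketch of what 
alignment buys and what it costs for a single loop. Everything in it 
is an illustrative assumption rather than a measurement: the 64-byte 
cache line size, the example offsets, and the helper names 
padding_bytes and cache_lines_touched are all made up for this note.

```python
def padding_bytes(offset, alignment):
    """Padding needed to move `offset` up to the next multiple of `alignment`."""
    return (-offset) % alignment


def cache_lines_touched(start, length, line_size=64):
    """Number of `line_size`-byte cache lines spanned by [start, start + length).

    line_size=64 is an assumption (typical for current x86 parts).
    """
    first = start // line_size
    last = (start + length - 1) // line_size
    return last - first + 1


# Hypothetical 20-byte loop body that the compiler would place at offset 61.
loop_start, loop_len = 61, 20

for alignment in (1, 16):
    pad = padding_bytes(loop_start, alignment)
    lines = cache_lines_touched(loop_start + pad, loop_len)
    print(f"align={alignment}: {pad} padding bytes, {lines} cache line(s) touched")
```

With these made-up numbers the packed loop straddles two cache lines, 
while 3 bytes of padding move it to offset 64 and into a single line. 
That helps only if the loop is hot and the neighbouring code is not; 
otherwise the padding is simply I$ space that useful instructions 
could have occupied.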

(I might have missed some CPU microarchitectural details, though, 
that would make such packing undesirable.)

Thanks,

	Ingo

=============================>
From cfc2ca24908cce66b9df1f711225d461f5d59b97 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Fri, 10 Apr 2015 14:20:30 +0200
Subject: [PATCH] x86: Pack loops tightly as well

Not-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 573d0c459f99..10989a73b986 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -83,6 +83,9 @@ else
         # Pack functions tightly as well:
         KBUILD_CFLAGS += -falign-functions=1
 
+        # Pack loops tightly as well:
+        KBUILD_CFLAGS += -falign-loops=1
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)