linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Brian Gerst <brgerst@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jason Low <jason.low2@hp.com>,
	"linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>,
	Arjan van de Ven <arjan@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions
Date: Thu, 21 May 2015 16:03:14 +0200	[thread overview]
Message-ID: <20150521140314.GB544@gmail.com> (raw)
In-Reply-To: <555C7012.3040806@redhat.com>


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> > The AMD system, with a starkly different x86 microarchitecture, is 
> > showing similar characteristics:
> > 
> > linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
> > linux-falign-functions=_32-bytes/res-amd.txt:        110,433,214      L1-icache-load-misses                                         ( +-  0.15% )  (100.00%)
> > linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses                                         ( +-  0.17% )  (100.00%)
> > linux-falign-functions=128-bytes/res-amd.txt:        119,100,216      L1-icache-load-misses                                         ( +-  0.22% )  (100.00%)
> > linux-falign-functions=_16-bytes/res-amd.txt:        122,916,937      L1-icache-load-misses                                         ( +-  0.15% )  (100.00%)
> > linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
> > linux-falign-functions=__2-bytes/res-amd.txt:        124,337,908      L1-icache-load-misses                                         ( +-  0.71% )  (100.00%)
> > linux-falign-functions=__4-bytes/res-amd.txt:        125,221,805      L1-icache-load-misses                                         ( +-  0.09% )  (100.00%)
> > linux-falign-functions=256-bytes/res-amd.txt:        135,761,433      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
> > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        159,918,181      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
> > linux-falign-functions=512-bytes/res-amd.txt:        170,307,064      L1-icache-load-misses                                         ( +-  0.26% )  (100.00%)
> > 
> > 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes 
> > is much steeper than on Intel systems: cache associativity is likely 
> > lower on this AMD CPU.
> > 
> > Interestingly the 1 byte alignment result is still pretty good on AMD 
> > systems - and I used the exact same kernel image on both systems, so 
> > the layout of the functions is exactly the same.
> > 
> > Elapsed time is noisier, but shows a similar trend:
> > 
> > linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed                                          ( +-  2.74% )
> > linux-falign-functions=128-bytes/res-amd.txt:        1.932961745 seconds time elapsed                                          ( +-  2.18% )
> > linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed                                          ( +-  1.84% )
> > linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed                                          ( +-  2.15% )
> > linux-falign-functions=_32-bytes/res-amd.txt:        1.962074787 seconds time elapsed                                          ( +-  2.38% )
> > linux-falign-functions=_16-bytes/res-amd.txt:        2.000941789 seconds time elapsed                                          ( +-  1.18% )
> > linux-falign-functions=__4-bytes/res-amd.txt:        2.002305627 seconds time elapsed                                          ( +-  2.75% )
> > linux-falign-functions=256-bytes/res-amd.txt:        2.003218532 seconds time elapsed                                          ( +-  3.16% )
> > linux-falign-functions=__2-bytes/res-amd.txt:        2.031252839 seconds time elapsed                                          ( +-  1.77% )
> > linux-falign-functions=512-bytes/res-amd.txt:        2.080632439 seconds time elapsed                                          ( +-  1.06% )
> > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        2.346644318 seconds time elapsed                                          ( +-  2.19% )
> > 
> > 64 bytes alignment is the sweet spot here as well, it's 3.7% 
> > faster than the default 16 bytes alignment.
> 
> In AMD, 64 bytes win too, yes, but by a *very* small margin. 8 bytes 
> and 1 byte alignments have basically same timings, and both take 
> what, +0.63% of time longer to run?

See my previous mail why this is a misleading conclusion: the timings 
have so much stddev that they are not really comparable. But if you 
look at the cachemiss you'll see the true difference:

 linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
 linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
 linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses                                         ( +-  0.17% )  (100.00%)

1 byte alignment isn't optimal on AMD either.

> > +        #
> > +        # Allocate a separate cacheline for every function,
> > +        # for optimal instruction cache packing:
> > +        #
> > +        KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
> 
> How about  -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1  instead?
> 
> This avoids pathological cases where function starting just a few 
> bytes after 64-bytes boundary gets aligned to the next one, wasting 
> ~60 bytes.

Well, that should be pretty similar to the 32-byte aligned numbers I 
already posted, right?

The 32-byte numbers are better than the default 16-byte alignment, but 
64-byte alignment was even better.

A quick build shows:

   text    data     bss     dec     hex filename
13534242        2579560 1634304 17748106        10ed08a linux-falign-functions=__1-bytes/vmlinux
13554530        2579560 1634304 17768394        10f1fca linux-falign-functions=__2-bytes/vmlinux
13590946        2579560 1634304 17804810        10fae0a linux-falign-functions=__4-bytes/vmlinux
13658786        2579560 1634304 17872650        110b70a linux-falign-functions=__8-bytes/vmlinux
13799602        2579560 1634304 18013466        112dd1a linux-falign-functions=_16-bytes/vmlinux
13943906        2579560 1634304 18157770        11510ca linux-falign-functions=_33-bytes/vmlinux [1]
14075330        2579560 1634304 18289194        117122a linux-falign-functions=_32-bytes/vmlinux [2]
14664898        2579560 1634304 18878762        120112a linux-falign-functions=_64-bytes/vmlinux
15980994        2579560 1634304 20194858        134262a linux-falign-functions=128-bytes/vmlinux
19038018        2591848 1634304 23264170        162fbaa linux-falign-functions=256-bytes/vmlinux
26391106        2591848 1634304 30617258        1d32eaa linux-falign-functions=512-bytes/vmlinux

Interestingly the -falign-functions=33 kernel is marginally smaller 
than the -falign-functions=32 kernel, by 0.1%.

But the numbers aren't exceptional:

       671,702,272      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
    11,892,913,320      instructions                                                  ( +-  0.01% )
           300,030      context-switches                                              ( +-  0.00% )

       7.349312237 seconds time elapsed                                          ( +-  1.20% )

Compared to the others it's on rank 3:

linux-falign-functions=_64-bytes/res.txt:        647,853,942      L1-icache-load-misses                                         ( +-  0.07% )  (100.00%)
linux-falign-functions=128-bytes/res.txt:        669,401,612      L1-icache-load-misses                                         ( +-  0.08% )  (100.00%)
linux-falign-functions=_33-bytes/res.txt: [NEW]  671,702,272      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
linux-falign-functions=_32-bytes/res.txt:        685,969,043      L1-icache-load-misses                                         ( +-  0.08% )  (100.00%)
linux-falign-functions=256-bytes/res.txt:        699,130,207      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
linux-falign-functions=512-bytes/res.txt:        699,130,207      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
linux-falign-functions=_16-bytes/res.txt:        706,080,917      L1-icache-load-misses   [vanilla kernel]                      ( +-  0.05% )  (100.00%)
linux-falign-functions=__1-bytes/res.txt:        724,539,055      L1-icache-load-misses                                         ( +-  0.31% )  (100.00%)
linux-falign-functions=__4-bytes/res.txt:        725,707,848      L1-icache-load-misses                                         ( +-  0.12% )  (100.00%)
linux-falign-functions=__8-bytes/res.txt:        726,543,194      L1-icache-load-misses                                         ( +-  0.04% )  (100.00%)
linux-falign-functions=__2-bytes/res.txt:        738,946,179      L1-icache-load-misses                                         ( +-  0.12% )  (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:        921,910,808      L1-icache-load-misses                                         ( +-  0.05% )  (100.00%)

So I still think the best strategy would be a variant of what I 
mentioned before:

  1) 'small functions': whose size is 1-63 bytes.

  2) 'large functions': whose size is 64- bytes.

  3) align all large functions to 64 bytes cachelines.

  4) pack all small functions into remaining large function tail 
     holes, without them spilling over into the next cacheline

  5) pack the remaining small functions with each other, tightly, as 
     long as they don't spill over to the next cacheline.

  6) once optimization steps 1-5 are performed, for all cachelines 
     that contain small functions and still have a hole near the end 
     of the cacheline, allow functions to be aligned to 32/16/8/4/2 
     bytes as long as they still fit into the cacheline.

So for example lets consider this layout with tight packing:
  
  [small-fn1][small-fn2][large-fn3...........................][small-fn4 ..... ][hole............]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Both 'fn3' and 'fn4' are on two cachelines.

After step 5 we'd get such a packing:

  [small-fn1][small-fn2][hole....][large-fn3...........................][small-fn4 ..... ][hole..]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

fn1, fn2 and fn4 each have their own cacheline, while large-fn3 
necessarily has to spill into the next cacheline due to its size. 
'fn4' is now on a single cache line.

After step 6 we'd get some extra intra-cacheline alignment by 
utilizing the holes:

  [small-fn1]..[small-fn2][hole..][large-fn3...........................]...[small-fn4 ..... ][hle]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Note how small-fn2 got moved forward by 2 bytes, to move it to an 8 
bytes boundary and small-fn4 got moved forward by 3 bytes - at the 
expense of the holes at the end of the cachelines.

This function alignment optimization basically minimizes the number of 
cache line boundaries intersecting individual functions, so this 
minimizes the I$ footprint aggressively.

This is not a trivial computation btw.: because even if we do this 
only per .o object file we'd have to consider nr_fns! permutations of 
all possible function ordering. We'd want to pick the ordering that 
matches the above 1-6 constraints, _and_ which is as close to 'source 
code ordering' as possible.

Such an optimization cannot be described via simple local alignment 
rules like any variant of -falign-functions, as it requires reordering 
of functions.

Thanks,

	Ingo

  parent reply	other threads:[~2015-05-21 14:03 UTC|newest]

Thread overview: 108+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low
2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low
2015-04-09  9:00   ` [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner() tip-bot for Jason Low
2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09  5:37   ` Ingo Molnar
2015-04-09  6:40     ` Jason Low
2015-04-09  7:53       ` Ingo Molnar
2015-04-09 16:47         ` Linus Torvalds
2015-04-09 17:56           ` Paul E. McKenney
2015-04-09 18:08             ` Linus Torvalds
2015-04-09 18:16               ` Linus Torvalds
2015-04-09 18:39                 ` Paul E. McKenney
2015-04-10  9:00                   ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar
2015-04-10  9:12                     ` Ingo Molnar
2015-04-10  9:21                       ` [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Ingo Molnar
2015-04-10 11:14                         ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar
2015-04-10 11:27                           ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar
2015-04-10 12:08                             ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar
2015-04-10 12:18                               ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar
2015-04-10 12:30                                 ` [PATCH] x86: Pack loops " Ingo Molnar
2015-04-10 13:46                                   ` Borislav Petkov
2015-05-15  9:40                                   ` [tip:x86/asm] " tip-bot for Ingo Molnar
2015-05-17  6:03                                   ` [tip:x86/apic] " tip-bot for Ingo Molnar
2015-05-15  9:39                                 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar
2015-05-15 18:36                                   ` Linus Torvalds
2015-05-15 20:52                                     ` Denys Vlasenko
2015-05-17  5:58                                     ` Ingo Molnar
2015-05-17  7:09                                       ` Ingo Molnar
2015-05-17  7:30                                         ` Ingo Molnar
2015-05-18  9:28                                       ` Denys Vlasenko
2015-05-19 21:38                                       ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar
2015-05-20  0:47                                         ` Linus Torvalds
2015-05-20 12:21                                           ` Denys Vlasenko
2015-05-21 11:36                                             ` Ingo Molnar
2015-05-21 11:38                                             ` Denys Vlasenko
2016-04-16 21:08                                               ` Denys Vlasenko
2015-05-20 13:09                                           ` Ingo Molnar
2015-05-20 11:29                                         ` Denys Vlasenko
2015-05-21 13:28                                           ` Ingo Molnar
2015-05-21 14:03                                           ` Ingo Molnar [this message]
2015-04-10 12:50                               ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-04-10 13:18                                 ` H. Peter Anvin
2015-04-10 17:54                                   ` Ingo Molnar
2015-04-10 18:32                                     ` H. Peter Anvin
2015-04-11 14:41                                   ` Markus Trippelsdorf
2015-04-12 10:14                                     ` Ingo Molnar
2015-04-13 16:23                                       ` Markus Trippelsdorf
2015-04-13 17:26                                         ` Markus Trippelsdorf
2015-04-13 18:31                                           ` Linus Torvalds
2015-04-13 19:09                                             ` Markus Trippelsdorf
2015-04-14  5:38                                               ` Ingo Molnar
2015-04-14  8:23                                                 ` Markus Trippelsdorf
2015-04-14  9:16                                                   ` Ingo Molnar
2015-04-14 11:17                                                     ` Markus Trippelsdorf
2015-04-14 12:09                                                       ` Ingo Molnar
2015-04-10 18:48                                 ` Linus Torvalds
2015-04-12 23:44                                   ` Maciej W. Rozycki
2015-04-10 19:23                                 ` Daniel Borkmann
2015-04-11 13:48                                 ` Markus Trippelsdorf
2015-04-10 13:19                               ` Borislav Petkov
2015-04-10 13:54                                 ` Denys Vlasenko
2015-04-10 14:01                                   ` Borislav Petkov
2015-04-10 14:53                                     ` Denys Vlasenko
2015-04-10 15:25                                       ` Borislav Petkov
2015-04-10 15:48                                         ` Denys Vlasenko
2015-04-10 15:54                                           ` Borislav Petkov
2015-04-10 21:44                                             ` Borislav Petkov
2015-04-10 18:54                                       ` Linus Torvalds
2015-04-10 14:10                               ` Paul E. McKenney
2015-04-11 14:28                                 ` Josh Triplett
2015-04-11  9:20                               ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar
2015-04-11 17:41                                 ` Linus Torvalds
2015-04-11 18:57                                   ` Thomas Gleixner
2015-04-11 19:35                                     ` Linus Torvalds
2015-04-12  5:47                                       ` Ingo Molnar
2015-04-12  6:20                                         ` Markus Trippelsdorf
2015-04-12 10:15                                           ` Ingo Molnar
2015-04-12  7:56                                         ` Mike Galbraith
2015-04-12  7:41                                       ` Ingo Molnar
2015-04-12  8:07                                     ` Ingo Molnar
2015-04-12 21:11                                     ` Jan Hubicka
2015-05-14 11:59                               ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-05-14 18:17                                 ` Ingo Molnar
2015-05-14 19:04                                   ` Denys Vlasenko
2015-05-14 19:44                                     ` Ingo Molnar
2015-05-15 15:45                                   ` Josh Triplett
2015-05-17  5:34                                     ` Ingo Molnar
2015-05-17 19:18                                       ` Josh Triplett
2015-05-18  6:48                                         ` Ingo Molnar
2015-05-15  9:39                               ` [tip:x86/asm] x86: Align jump targets to 1-byte boundaries tip-bot for Ingo Molnar
2015-04-10 11:34                           ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra
2015-04-10 18:04                             ` Ingo Molnar
2015-04-10 17:49                           ` Linus Torvalds
2015-04-10 18:04                             ` Ingo Molnar
2015-04-10 18:09                               ` Linus Torvalds
2015-04-10 14:20                     ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Paul E. McKenney
2015-04-10 17:44                       ` Ingo Molnar
2015-04-10 18:05                         ` Paul E. McKenney
2015-04-09 19:43                 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09 19:58                   ` Paul E. McKenney
2015-04-09 20:58                     ` Jason Low
2015-04-09 21:07                       ` Paul E. McKenney
2015-04-09 19:59                   ` Davidlohr Bueso
2015-04-09 20:36                 ` Jason Low
2015-04-10  2:43                   ` Andev
2015-04-10  9:04                   ` Ingo Molnar
2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso
2015-04-08 20:10   ` Jason Low

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150521140314.GB544@gmail.com \
    --to=mingo@kernel.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=arjan@infradead.org \
    --cc=aswin@hp.com \
    --cc=bp@alien8.de \
    --cc=brgerst@gmail.com \
    --cc=dave@stgolabs.net \
    --cc=dvlasenk@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jason.low2@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).