Re: [tip:x86/asm] x86: Pack function addresses tightly as well

From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Brian Gerst <brgerst@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jason Low <jason.low2@hp.com>,
	"linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well
Date: Sun, 17 May 2015 09:09:29 +0200	[thread overview]
Message-ID: <20150517070929.GA23344@gmail.com> (raw)
In-Reply-To: <20150517055551.GB17002@gmail.com>

* Ingo Molnar <mingo@kernel.org> wrote:

> > Put another way: I suspect this is more likely to hurt, and less 
> > likely to help than the others.
> 
> Yeah, indeed.
> 
> So my thinking was that it would help, because:
> 
>   - There's often locality of reference between functions: we often
>     have a handful of hot functions that are sitting next to each 
>     other and they could thus be packed closer to each other this way, 
>     creating a smaller net I$ footprint.
> 
>   - We have a handful of 'clusters' or small and often hot functions,
>     especially in the locking code:
> 
> 	ffffffff81893080 T _raw_spin_unlock_irqrestore
> 	ffffffff81893090 T _raw_read_unlock_irqrestore
> 	ffffffff818930a0 T _raw_write_unlock_irqrestore
> 	ffffffff818930b0 T _raw_spin_trylock_bh
> 	ffffffff81893110 T _raw_spin_unlock_bh
> 	ffffffff81893130 T _raw_read_unlock_bh
> 	ffffffff81893150 T _raw_write_unlock_bh
> 	ffffffff81893170 T _raw_read_trylock
> 	ffffffff818931a0 T _raw_write_trylock
> 	ffffffff818931d0 T _raw_read_lock_irqsave
> 	ffffffff81893200 T _raw_write_lock_irqsave
> 	ffffffff81893230 T _raw_spin_lock_bh
> 	ffffffff81893270 T _raw_spin_lock_irqsave
> 	ffffffff818932c0 T _raw_write_lock
> 	ffffffff818932e0 T _raw_write_lock_irq
> 	ffffffff81893310 T _raw_write_lock_bh
> 	ffffffff81893340 T _raw_spin_trylock
> 	ffffffff81893380 T _raw_read_lock
> 	ffffffff818933a0 T _raw_read_lock_irq
> 	ffffffff818933c0 T _raw_read_lock_bh
> 	ffffffff818933f0 T _raw_spin_lock
> 	ffffffff81893430 T _raw_spin_lock_irq
> 	ffffffff81893450
> 
>      That's 976 bytes total if 16 bytes aligned.
> 
>      With function packing, they compress into:
> 
> 	ffffffff817f2458 T _raw_spin_unlock_irqrestore
> 	ffffffff817f2463 T _raw_read_unlock_irqrestore
> 	ffffffff817f2472 T _raw_write_unlock_irqrestore
> 	ffffffff817f247d T _raw_read_unlock_bh
> 	ffffffff817f2498 T _raw_write_unlock_bh
> 	ffffffff817f24af T _raw_spin_unlock_bh
> 	ffffffff817f24c6 T _raw_read_trylock
> 	ffffffff817f24ef T _raw_write_trylock
> 	ffffffff817f250e T _raw_spin_lock_bh
> 	ffffffff817f2536 T _raw_read_lock_irqsave
> 	ffffffff817f255e T _raw_write_lock_irqsave
> 	ffffffff817f2588 T _raw_spin_lock_irqsave
> 	ffffffff817f25be T _raw_spin_trylock_bh
> 	ffffffff817f25f6 T _raw_spin_trylock
> 	ffffffff817f2615 T _raw_spin_lock
> 	ffffffff817f2632 T _raw_spin_lock_irq
> 	ffffffff817f2650 T _raw_write_lock
> 	ffffffff817f266b T _raw_write_lock_irq
> 	ffffffff817f2687 T _raw_write_lock_bh
> 	ffffffff817f26ad T _raw_read_lock
> 	ffffffff817f26c6 T _raw_read_lock_bh
> 	ffffffff817f26ea T _raw_read_lock_irq
> 	ffffffff817f2704
> 
>       That's 684 bytes - a very stark difference that will show up in 
>       better I$ footprint even if usage is sparse.
> 
>       OTOH, on the flip side, their ordering is far from ideal, so for 
>       example the rarely used 'trylock' variants are mixed into the 
>       middle, and the way we mix rwlock with spinlock ops isn't 
>       very pretty either.
> 
>       So we could reduce alignment for just the locking APIs, via per 
>       .o cflags in the Makefile, if packing otherwise hurts the common 
>       case.

Another datapoint: despite the widespread stereotype of bloated, 
kernel stack guzzling kernel functions, most kernel functions are 
pretty small and fit into a single (or at most two) cachelines.

Here's a histogram of x86-64 defconfig function sizes (in bytes, cut 
off at around 1200 bytes):

  600 ++----------------+------------------+-----------------+-----------------+------------------+----------------++
      +                 +                  +                 +               kernel function sizes histogram   A    +
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
  500 +AA                                                                                                          ++
      |                                                                                                             |
      A                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
  400 ++                                                                                                           ++
      |                                                                                                             |
      |A                                                                                                            |
      |                                                                                                             |
      |                                                                                                             |
  300 +AAA                                                                                                         ++
      | A                                                                                                           |
      |A                                                                                                            |
      |A                                                                                                            |
      |                                                                                                             |
      | AAA                                                                                                         |
  200 +AAAA                                                                                                        ++
      | AAAAA                                                                                                       |
      | AAAAAA                                                                                                      |
      | A AAAA A                                                                                                    |
      |A A AAAAAAA                                                                                                  |
      |       AAAAA                                                                                                 |
  100 +A        AAAA                                                                                               ++
      |         AAAAAAA                                                                                             |
      |            AAAAAAAAAA                                                                                       |
      |               AAAAAAAAAAAA                                                                                  |
      |A                AAAAAAAAAAAA AAAA                                                                           |
      +                 +       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AA A A A       +               A  +                 +
    0 AA----------------+------------------AA-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA--------------++
      0                200                400               600               800                1000              1200

Zoomed to just the first 200 bytes:

  600 ++--------------------------+--------------------------+---------------------------+-------------------------++
      +                           +                          +               kernel function sizes histogram   A    +
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
  500 ++    A   A                                                                                                  ++
      |                                                                                                             |
      A                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
  400 ++                                                                                                           ++
      |                                                                                                             |
      |      A                                                                                                      |
      |                                                                                                             |
      |                                                                                                             |
  300 ++       A  A     A                                                                                          ++
      |         A                                                                                                   |
      |  A                                                                                                          |
      |   A                                                                                                         |
      |                                                                                                             |
      |             A    A  A A                                                                                     |
  200 ++      A    A   A AAA A                                                                                     ++
      |          A   A    AA   A  A A     A                                                                         |
      |        A  A A AA      AAAA A A   A        A                                                                 |
      |            A         A  A  AA AAAA AAAAAA         A AA                                                      |
      |       A         A         A  A  A A  A A A AAAA A A   AAA AAA A                                             |
      |                                             AA A A AA AAAA   AAAAA AA                                       |
  100 ++     A                                                   A A A    AA  AAAAA AA A                           ++
      |                                                             A    AA AA    AA AAA AAAAA AA A  A              |
      |                                                                               A AA AAAA AAAAAAAAA AAAAAA A A|
      |                                                                                               A AAAAA  AA AAA
      |    AA                                                                                                    A  |
      +                           +                          +                           +                          +
    0 +AAAA-----------------------+--------------------------+---------------------------+-------------------------++
      0                           50                        100                         150                        200

The median function size is around 1 cacheline (64-byte one), ~80% 
fitting into two cachelines, with a big peak for very small functions 
that make up something like 20% of all functions - so packing makes 
sense unless the actual functions being executed are more fragmented 
than I'd estimate around 50%.

Which is still a significant hurdle.

Plus there are a truckload of other details, like how exactly the flow 
of execution goes within the function (which part is hot). For example 
locking APIs tend to return early after a relatively straight path of 
execution, making the effective hot function length even smaller than 
the median.

All in one: there's no clear winner, so there's no way around actually 
measuring it ... ;-)

Thanks,

	Ingo