Date: Sun, 17 May 2015 09:09:29 +0200
From: Ingo Molnar
To: Linus Torvalds
Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko,
	Linux Kernel Mailing List, Tim Chen, Borislav Petkov,
	Peter Zijlstra, "Chandramouleeswaran, Aswin", Brian Gerst,
	Paul McKenney, Thomas Gleixner, Jason Low,
	linux-tip-commits@vger.kernel.org
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well
Message-ID: <20150517070929.GA23344@gmail.com>
References: <20150410121808.GA19918@gmail.com> <20150517055551.GB17002@gmail.com>
In-Reply-To: <20150517055551.GB17002@gmail.com>

* Ingo Molnar wrote:

> > Put another way: I suspect this is more likely to hurt, and less
> > likely to help than the others.
>
> Yeah, indeed.
>
> So my thinking was that it would help, because:
>
> - There's often locality of reference between functions: we often
>   have a handful of hot functions that are sitting next to each
>   other, and they could thus be packed closer to each other this
>   way, creating a smaller net I$ footprint.
> - We have a handful of 'clusters' of small and often hot functions,
>   especially in the locking code:
>
> ffffffff81893080 T _raw_spin_unlock_irqrestore
> ffffffff81893090 T _raw_read_unlock_irqrestore
> ffffffff818930a0 T _raw_write_unlock_irqrestore
> ffffffff818930b0 T _raw_spin_trylock_bh
> ffffffff81893110 T _raw_spin_unlock_bh
> ffffffff81893130 T _raw_read_unlock_bh
> ffffffff81893150 T _raw_write_unlock_bh
> ffffffff81893170 T _raw_read_trylock
> ffffffff818931a0 T _raw_write_trylock
> ffffffff818931d0 T _raw_read_lock_irqsave
> ffffffff81893200 T _raw_write_lock_irqsave
> ffffffff81893230 T _raw_spin_lock_bh
> ffffffff81893270 T _raw_spin_lock_irqsave
> ffffffff818932c0 T _raw_write_lock
> ffffffff818932e0 T _raw_write_lock_irq
> ffffffff81893310 T _raw_write_lock_bh
> ffffffff81893340 T _raw_spin_trylock
> ffffffff81893380 T _raw_read_lock
> ffffffff818933a0 T _raw_read_lock_irq
> ffffffff818933c0 T _raw_read_lock_bh
> ffffffff818933f0 T _raw_spin_lock
> ffffffff81893430 T _raw_spin_lock_irq
> ffffffff81893450
>
> That's 976 bytes total if 16-byte aligned.
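The quoted total is easy to double-check: the cluster's footprint is simply the end marker minus the first symbol's address. A quick sketch in shell arithmetic, using the addresses quoted above:

```shell
# Span of the 16-byte-aligned locking cluster: end address minus start.
start=0xffffffff81893080   # _raw_spin_unlock_irqrestore
end=0xffffffff81893450     # end marker after _raw_spin_lock_irq
echo $(( end - start ))    # prints 976
```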
> With function packing, they compress into:
>
> ffffffff817f2458 T _raw_spin_unlock_irqrestore
> ffffffff817f2463 T _raw_read_unlock_irqrestore
> ffffffff817f2472 T _raw_write_unlock_irqrestore
> ffffffff817f247d T _raw_read_unlock_bh
> ffffffff817f2498 T _raw_write_unlock_bh
> ffffffff817f24af T _raw_spin_unlock_bh
> ffffffff817f24c6 T _raw_read_trylock
> ffffffff817f24ef T _raw_write_trylock
> ffffffff817f250e T _raw_spin_lock_bh
> ffffffff817f2536 T _raw_read_lock_irqsave
> ffffffff817f255e T _raw_write_lock_irqsave
> ffffffff817f2588 T _raw_spin_lock_irqsave
> ffffffff817f25be T _raw_spin_trylock_bh
> ffffffff817f25f6 T _raw_spin_trylock
> ffffffff817f2615 T _raw_spin_lock
> ffffffff817f2632 T _raw_spin_lock_irq
> ffffffff817f2650 T _raw_write_lock
> ffffffff817f266b T _raw_write_lock_irq
> ffffffff817f2687 T _raw_write_lock_bh
> ffffffff817f26ad T _raw_read_lock
> ffffffff817f26c6 T _raw_read_lock_bh
> ffffffff817f26ea T _raw_read_lock_irq
> ffffffff817f2704
>
> That's 684 bytes - a very stark difference that will show up as a
> better I$ footprint even if usage is sparse.
>
> On the flip side, their ordering is far from ideal: for example the
> rarely used 'trylock' variants are mixed into the middle, and the
> way we mix rwlock and spinlock ops isn't very pretty either.
>
> So we could reduce alignment for just the locking APIs, via per-.o
> cflags in the Makefile, if packing otherwise hurts the common case.

Another datapoint: despite the widespread stereotype of bloated,
stack-guzzling kernel functions, most kernel functions are pretty
small and fit into a single cacheline (or at most two).
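A size distribution like this can be pulled straight out of `nm`. Here's a minimal sketch that buckets text-symbol sizes into cacheline-sized bins; the here-doc stands in for real `nm -S --size-sort vmlinux` output (the symbol sizes match the packed listing above, `some_big_function` is a made-up filler entry), so the pipeline runs as-is:

```shell
#!/usr/bin/env bash
# Bucket text-symbol sizes into 64-byte (cacheline) bins. The here-doc
# below is a stand-in for `nm -S --size-sort vmlinux` output.
declare -A count
while read -r addr size type name; do
    case $type in [Tt]) ;; *) continue ;; esac   # text symbols only
    s=$(( 16#$size ))                 # hex size -> decimal bytes
    b=$(( s / 64 * 64 ))              # cacheline-sized bucket
    count[$b]=$(( ${count[$b]:-0} + 1 ))
done <<'EOF'
ffffffff817f2458 000000000000000b T _raw_spin_unlock_irqrestore
ffffffff817f2615 000000000000001d T _raw_spin_lock
ffffffff817f24c6 0000000000000029 T _raw_read_trylock
ffffffff81400000 0000000000000190 T some_big_function
EOF
for b in "${!count[@]}"; do
    printf '%4d-%4d bytes: %d function(s)\n' "$b" $(( b + 63 )) "${count[$b]}"
done | sort -n
```

With real `vmlinux` input, piping the bucket counts into gnuplot reproduces the kind of histogram shown below.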
Here's a histogram of x86-64 defconfig function sizes (in bytes, cut
off at around 1200 bytes):

  [ASCII histogram: kernel function sizes, 0-1200 bytes. A tall peak
   of roughly 500 functions per bucket at very small sizes, dropping
   off sharply, with a long sparse tail beyond ~400 bytes.]

Zoomed to just the first 200 bytes:

  [ASCII histogram: kernel function sizes, 0-200 bytes. Counts peak
   at roughly 500 per bucket for the smallest functions and decline
   steadily towards 200 bytes.]

The median function size is around 1 cacheline (a 64-byte one), with
~80% of functions fitting into two cachelines, plus a big peak of
very small functions that make up something like 20% of all functions
- so packing makes sense unless the set of functions actually being
executed is more fragmented than the ~50% I'd estimate. Which is
still a significant hurdle.

Plus there are a truckload of other details, like how exactly the
flow of execution goes within the function (which part is hot).
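The packing trade-off above boils down to simple rounding: with N-byte alignment, each function's start address is rounded up to the next N-byte boundary, and everything between the real end of the previous function and that boundary is padding. A small sketch of the arithmetic (the 90-byte size is just an illustrative value near the median):

```shell
# Round a size up to the next alignment boundary; the difference
# between the result and the real size is pure padding.
align_up () { echo $(( ($1 + $2 - 1) / $2 * $2 )); }

align_up 90 16   # 16-byte alignment: a 90-byte function occupies 96 bytes
align_up 90 64   # 64-byte alignment: the same function occupies 128 bytes
align_up 90 1    # fully packed: exactly 90 bytes
```

For uniformly distributed sizes, the expected padding per function is (align - 1) / 2 bytes, which is why shrinking alignment helps most when functions are small and densely executed.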
For example, locking APIs tend to return early after a relatively
straight path of execution, making the effective hot function length
even smaller than the median.

All in all: there's no clear winner, so there's no way around
actually measuring it ... ;-)

Thanks,

	Ingo