All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Brian Gerst <brgerst@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jason Low <jason.low2@hp.com>,
	"linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well
Date: Sun, 17 May 2015 09:09:29 +0200	[thread overview]
Message-ID: <20150517070929.GA23344@gmail.com> (raw)
In-Reply-To: <20150517055551.GB17002@gmail.com>


* Ingo Molnar <mingo@kernel.org> wrote:

> > Put another way: I suspect this is more likely to hurt, and less 
> > likely to help than the others.
> 
> Yeah, indeed.
> 
> So my thinking was that it would help, because:
> 
>   - There's often locality of reference between functions: we often
>     have a handful of hot functions that are sitting next to each 
>     other and they could thus be packed closer to each other this way, 
>     creating a smaller net I$ footprint.
> 
>   - We have a handful of 'clusters' or small and often hot functions,
>     especially in the locking code:
> 
> 	ffffffff81893080 T _raw_spin_unlock_irqrestore
> 	ffffffff81893090 T _raw_read_unlock_irqrestore
> 	ffffffff818930a0 T _raw_write_unlock_irqrestore
> 	ffffffff818930b0 T _raw_spin_trylock_bh
> 	ffffffff81893110 T _raw_spin_unlock_bh
> 	ffffffff81893130 T _raw_read_unlock_bh
> 	ffffffff81893150 T _raw_write_unlock_bh
> 	ffffffff81893170 T _raw_read_trylock
> 	ffffffff818931a0 T _raw_write_trylock
> 	ffffffff818931d0 T _raw_read_lock_irqsave
> 	ffffffff81893200 T _raw_write_lock_irqsave
> 	ffffffff81893230 T _raw_spin_lock_bh
> 	ffffffff81893270 T _raw_spin_lock_irqsave
> 	ffffffff818932c0 T _raw_write_lock
> 	ffffffff818932e0 T _raw_write_lock_irq
> 	ffffffff81893310 T _raw_write_lock_bh
> 	ffffffff81893340 T _raw_spin_trylock
> 	ffffffff81893380 T _raw_read_lock
> 	ffffffff818933a0 T _raw_read_lock_irq
> 	ffffffff818933c0 T _raw_read_lock_bh
> 	ffffffff818933f0 T _raw_spin_lock
> 	ffffffff81893430 T _raw_spin_lock_irq
> 	ffffffff81893450
> 
>      That's 976 bytes total if 16 bytes aligned.
> 
>      With function packing, they compress into:
> 
> 	ffffffff817f2458 T _raw_spin_unlock_irqrestore
> 	ffffffff817f2463 T _raw_read_unlock_irqrestore
> 	ffffffff817f2472 T _raw_write_unlock_irqrestore
> 	ffffffff817f247d T _raw_read_unlock_bh
> 	ffffffff817f2498 T _raw_write_unlock_bh
> 	ffffffff817f24af T _raw_spin_unlock_bh
> 	ffffffff817f24c6 T _raw_read_trylock
> 	ffffffff817f24ef T _raw_write_trylock
> 	ffffffff817f250e T _raw_spin_lock_bh
> 	ffffffff817f2536 T _raw_read_lock_irqsave
> 	ffffffff817f255e T _raw_write_lock_irqsave
> 	ffffffff817f2588 T _raw_spin_lock_irqsave
> 	ffffffff817f25be T _raw_spin_trylock_bh
> 	ffffffff817f25f6 T _raw_spin_trylock
> 	ffffffff817f2615 T _raw_spin_lock
> 	ffffffff817f2632 T _raw_spin_lock_irq
> 	ffffffff817f2650 T _raw_write_lock
> 	ffffffff817f266b T _raw_write_lock_irq
> 	ffffffff817f2687 T _raw_write_lock_bh
> 	ffffffff817f26ad T _raw_read_lock
> 	ffffffff817f26c6 T _raw_read_lock_bh
> 	ffffffff817f26ea T _raw_read_lock_irq
> 	ffffffff817f2704
> 
>       That's 684 bytes - a very stark difference that will show up in 
>       better I$ footprint even if usage is sparse.
> 
>       OTOH, on the flip side, their ordering is far from ideal, so for 
>       example the rarely used 'trylock' variants are mixed into the 
>       middle, and the way we mix rwlock with spinlock ops isn't 
>       very pretty either.
> 
>       So we could reduce alignment for just the locking APIs, via per 
>       .o cflags in the Makefile, if packing otherwise hurts the common 
>       case.

Another datapoint: despite the widespread stereotype of bloated, 
kernel stack guzzling kernel functions, most kernel functions are 
pretty small and fit into a single (or at most two) cachelines.

Here's a histogram of x86-64 defconfig function sizes (in bytes, cut 
off at around 1200 bytes):

  600 ++----------------+------------------+-----------------+-----------------+------------------+----------------++
      +                 +                  +                 +               kernel function sizes histogram   A    +
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
  500 +AA                                                                                                          ++
      |                                                                                                             |
      A                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
  400 ++                                                                                                           ++
      |                                                                                                             |
      |A                                                                                                            |
      |                                                                                                             |
      |                                                                                                             |
  300 +AAA                                                                                                         ++
      | A                                                                                                           |
      |A                                                                                                            |
      |A                                                                                                            |
      |                                                                                                             |
      | AAA                                                                                                         |
  200 +AAAA                                                                                                        ++
      | AAAAA                                                                                                       |
      | AAAAAA                                                                                                      |
      | A AAAA A                                                                                                    |
      |A A AAAAAAA                                                                                                  |
      |       AAAAA                                                                                                 |
  100 +A        AAAA                                                                                               ++
      |         AAAAAAA                                                                                             |
      |            AAAAAAAAAA                                                                                       |
      |               AAAAAAAAAAAA                                                                                  |
      |A                AAAAAAAAAAAA AAAA                                                                           |
      +                 +       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AA A A A       +               A  +                 +
    0 AA----------------+------------------AA-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA--------------++
      0                200                400               600               800                1000              1200


Zoomed to just the first 200 bytes:

  600 ++--------------------------+--------------------------+---------------------------+-------------------------++
      +                           +                          +               kernel function sizes histogram   A    +
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
  500 ++    A   A                                                                                                  ++
      |                                                                                                             |
      A                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
      |                                                                                                             |
  400 ++                                                                                                           ++
      |                                                                                                             |
      |      A                                                                                                      |
      |                                                                                                             |
      |                                                                                                             |
  300 ++       A  A     A                                                                                          ++
      |         A                                                                                                   |
      |  A                                                                                                          |
      |   A                                                                                                         |
      |                                                                                                             |
      |             A    A  A A                                                                                     |
  200 ++      A    A   A AAA A                                                                                     ++
      |          A   A    AA   A  A A     A                                                                         |
      |        A  A A AA      AAAA A A   A        A                                                                 |
      |            A         A  A  AA AAAA AAAAAA         A AA                                                      |
      |       A         A         A  A  A A  A A A AAAA A A   AAA AAA A                                             |
      |                                             AA A A AA AAAA   AAAAA AA                                       |
  100 ++     A                                                   A A A    AA  AAAAA AA A                           ++
      |                                                             A    AA AA    AA AAA AAAAA AA A  A              |
      |                                                                               A AA AAAA AAAAAAAAA AAAAAA A A|
      |                                                                                               A AAAAA  AA AAA
      |    AA                                                                                                    A  |
      +                           +                          +                           +                          +
    0 +AAAA-----------------------+--------------------------+---------------------------+-------------------------++
      0                           50                        100                         150                        200


The median function size is around 1 cacheline (64-byte one), ~80% 
fitting into two cachelines, with a big peak for very small functions 
that make up something like 20% of all functions - so packing makes 
sense unless the actual functions being executed are more fragmented 
than I'd estimate around 50%.

Which is still a significant hurdle.

Plus there are a truckload of other details, like how exactly the flow 
of execution goes within the function (which part is hot). For example 
locking APIs tend to return early after a relatively straight path of 
execution, making the effective hot function length even smaller than 
the median.

All in one: there's no clear winner, so there's no way around actually 
measuring it ... ;-)

Thanks,

	Ingo

  reply	other threads:[~2015-05-17  7:09 UTC|newest]

Thread overview: 108+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low
2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low
2015-04-09  9:00   ` [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner() tip-bot for Jason Low
2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09  5:37   ` Ingo Molnar
2015-04-09  6:40     ` Jason Low
2015-04-09  7:53       ` Ingo Molnar
2015-04-09 16:47         ` Linus Torvalds
2015-04-09 17:56           ` Paul E. McKenney
2015-04-09 18:08             ` Linus Torvalds
2015-04-09 18:16               ` Linus Torvalds
2015-04-09 18:39                 ` Paul E. McKenney
2015-04-10  9:00                   ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar
2015-04-10  9:12                     ` Ingo Molnar
2015-04-10  9:21                       ` [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Ingo Molnar
2015-04-10 11:14                         ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar
2015-04-10 11:27                           ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar
2015-04-10 12:08                             ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar
2015-04-10 12:18                               ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar
2015-04-10 12:30                                 ` [PATCH] x86: Pack loops " Ingo Molnar
2015-04-10 13:46                                   ` Borislav Petkov
2015-05-15  9:40                                   ` [tip:x86/asm] " tip-bot for Ingo Molnar
2015-05-17  6:03                                   ` [tip:x86/apic] " tip-bot for Ingo Molnar
2015-05-15  9:39                                 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar
2015-05-15 18:36                                   ` Linus Torvalds
2015-05-15 20:52                                     ` Denys Vlasenko
2015-05-17  5:58                                     ` Ingo Molnar
2015-05-17  7:09                                       ` Ingo Molnar [this message]
2015-05-17  7:30                                         ` Ingo Molnar
2015-05-18  9:28                                       ` Denys Vlasenko
2015-05-19 21:38                                       ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar
2015-05-20  0:47                                         ` Linus Torvalds
2015-05-20 12:21                                           ` Denys Vlasenko
2015-05-21 11:36                                             ` Ingo Molnar
2015-05-21 11:38                                             ` Denys Vlasenko
2016-04-16 21:08                                               ` Denys Vlasenko
2015-05-20 13:09                                           ` Ingo Molnar
2015-05-20 11:29                                         ` Denys Vlasenko
2015-05-21 13:28                                           ` Ingo Molnar
2015-05-21 14:03                                           ` Ingo Molnar
2015-04-10 12:50                               ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-04-10 13:18                                 ` H. Peter Anvin
2015-04-10 17:54                                   ` Ingo Molnar
2015-04-10 18:32                                     ` H. Peter Anvin
2015-04-11 14:41                                   ` Markus Trippelsdorf
2015-04-12 10:14                                     ` Ingo Molnar
2015-04-13 16:23                                       ` Markus Trippelsdorf
2015-04-13 17:26                                         ` Markus Trippelsdorf
2015-04-13 18:31                                           ` Linus Torvalds
2015-04-13 19:09                                             ` Markus Trippelsdorf
2015-04-14  5:38                                               ` Ingo Molnar
2015-04-14  8:23                                                 ` Markus Trippelsdorf
2015-04-14  9:16                                                   ` Ingo Molnar
2015-04-14 11:17                                                     ` Markus Trippelsdorf
2015-04-14 12:09                                                       ` Ingo Molnar
2015-04-10 18:48                                 ` Linus Torvalds
2015-04-12 23:44                                   ` Maciej W. Rozycki
2015-04-10 19:23                                 ` Daniel Borkmann
2015-04-11 13:48                                 ` Markus Trippelsdorf
2015-04-10 13:19                               ` Borislav Petkov
2015-04-10 13:54                                 ` Denys Vlasenko
2015-04-10 14:01                                   ` Borislav Petkov
2015-04-10 14:53                                     ` Denys Vlasenko
2015-04-10 15:25                                       ` Borislav Petkov
2015-04-10 15:48                                         ` Denys Vlasenko
2015-04-10 15:54                                           ` Borislav Petkov
2015-04-10 21:44                                             ` Borislav Petkov
2015-04-10 18:54                                       ` Linus Torvalds
2015-04-10 14:10                               ` Paul E. McKenney
2015-04-11 14:28                                 ` Josh Triplett
2015-04-11  9:20                               ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar
2015-04-11 17:41                                 ` Linus Torvalds
2015-04-11 18:57                                   ` Thomas Gleixner
2015-04-11 19:35                                     ` Linus Torvalds
2015-04-12  5:47                                       ` Ingo Molnar
2015-04-12  6:20                                         ` Markus Trippelsdorf
2015-04-12 10:15                                           ` Ingo Molnar
2015-04-12  7:56                                         ` Mike Galbraith
2015-04-12  7:41                                       ` Ingo Molnar
2015-04-12  8:07                                     ` Ingo Molnar
2015-04-12 21:11                                     ` Jan Hubicka
2015-05-14 11:59                               ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-05-14 18:17                                 ` Ingo Molnar
2015-05-14 19:04                                   ` Denys Vlasenko
2015-05-14 19:44                                     ` Ingo Molnar
2015-05-15 15:45                                   ` Josh Triplett
2015-05-17  5:34                                     ` Ingo Molnar
2015-05-17 19:18                                       ` Josh Triplett
2015-05-18  6:48                                         ` Ingo Molnar
2015-05-15  9:39                               ` [tip:x86/asm] x86: Align jump targets to 1-byte boundaries tip-bot for Ingo Molnar
2015-04-10 11:34                           ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra
2015-04-10 18:04                             ` Ingo Molnar
2015-04-10 17:49                           ` Linus Torvalds
2015-04-10 18:04                             ` Ingo Molnar
2015-04-10 18:09                               ` Linus Torvalds
2015-04-10 14:20                     ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Paul E. McKenney
2015-04-10 17:44                       ` Ingo Molnar
2015-04-10 18:05                         ` Paul E. McKenney
2015-04-09 19:43                 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09 19:58                   ` Paul E. McKenney
2015-04-09 20:58                     ` Jason Low
2015-04-09 21:07                       ` Paul E. McKenney
2015-04-09 19:59                   ` Davidlohr Bueso
2015-04-09 20:36                 ` Jason Low
2015-04-10  2:43                   ` Andev
2015-04-10  9:04                   ` Ingo Molnar
2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso
2015-04-08 20:10   ` Jason Low

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150517070929.GA23344@gmail.com \
    --to=mingo@kernel.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aswin@hp.com \
    --cc=bp@alien8.de \
    --cc=brgerst@gmail.com \
    --cc=dave@stgolabs.net \
    --cc=dvlasenk@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jason.low2@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.