All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Brian Gerst <brgerst@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jason Low <jason.low2@hp.com>,
	"linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well
Date: Sun, 17 May 2015 07:58:36 +0200	[thread overview]
Message-ID: <20150517055551.GB17002@gmail.com> (raw)
In-Reply-To: <CA+55aFywCXk083w78cQGbRKh-ERLtE8v9PZhqxaHcyHJxSNsFQ@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, May 15, 2015 at 2:39 AM, tip-bot for Ingo Molnar
> <tipbot@zytor.com> wrote:
> >
> > We can pack function addresses tightly as well:
> 
> So I really want to see performance numbers on a few 
> microarchitectures for this one in particular.
> 
> The kernel generally doesn't have loops (well, not the kinds of 
> high-rep loops that tend to be worth aligning), and I think the 
> general branch/loop alignment is likely fine. But the function 
> alignment doesn't tend to have the same kind of I$ advantages, it's 
> more lilely purely a size issue and not as interesting. Function 
> targets are also more likely to be not in the cache, I suspect, 
> since you don't have a loop priming it or a short forward jump that 
> just got the cacheline anyway. And then *not* aligning the function 
> would actually tend to make it *less* dense in the I$.
> 
> Put another way: I suspect this is more likely to hurt, and less 
> likely to help than the others.

Yeah, indeed.

So my thinking was that it would help, because:

  - There's often locality of reference between functions: we often
    have a handful of hot functions that are sitting next to each 
    other and they could thus be packed closer to each other this way, 
    creating a smaller net I$ footprint.

  - We have a handful of 'clusters' or small and often hot functions,
    especially in the locking code:

	ffffffff81893080 T _raw_spin_unlock_irqrestore
	ffffffff81893090 T _raw_read_unlock_irqrestore
	ffffffff818930a0 T _raw_write_unlock_irqrestore
	ffffffff818930b0 T _raw_spin_trylock_bh
	ffffffff81893110 T _raw_spin_unlock_bh
	ffffffff81893130 T _raw_read_unlock_bh
	ffffffff81893150 T _raw_write_unlock_bh
	ffffffff81893170 T _raw_read_trylock
	ffffffff818931a0 T _raw_write_trylock
	ffffffff818931d0 T _raw_read_lock_irqsave
	ffffffff81893200 T _raw_write_lock_irqsave
	ffffffff81893230 T _raw_spin_lock_bh
	ffffffff81893270 T _raw_spin_lock_irqsave
	ffffffff818932c0 T _raw_write_lock
	ffffffff818932e0 T _raw_write_lock_irq
	ffffffff81893310 T _raw_write_lock_bh
	ffffffff81893340 T _raw_spin_trylock
	ffffffff81893380 T _raw_read_lock
	ffffffff818933a0 T _raw_read_lock_irq
	ffffffff818933c0 T _raw_read_lock_bh
	ffffffff818933f0 T _raw_spin_lock
	ffffffff81893430 T _raw_spin_lock_irq
	ffffffff81893450

     That's 976 bytes total if 16 bytes aligned.

     With function packing, they compress into:

	ffffffff817f2458 T _raw_spin_unlock_irqrestore
	ffffffff817f2463 T _raw_read_unlock_irqrestore
	ffffffff817f2472 T _raw_write_unlock_irqrestore
	ffffffff817f247d T _raw_read_unlock_bh
	ffffffff817f2498 T _raw_write_unlock_bh
	ffffffff817f24af T _raw_spin_unlock_bh
	ffffffff817f24c6 T _raw_read_trylock
	ffffffff817f24ef T _raw_write_trylock
	ffffffff817f250e T _raw_spin_lock_bh
	ffffffff817f2536 T _raw_read_lock_irqsave
	ffffffff817f255e T _raw_write_lock_irqsave
	ffffffff817f2588 T _raw_spin_lock_irqsave
	ffffffff817f25be T _raw_spin_trylock_bh
	ffffffff817f25f6 T _raw_spin_trylock
	ffffffff817f2615 T _raw_spin_lock
	ffffffff817f2632 T _raw_spin_lock_irq
	ffffffff817f2650 T _raw_write_lock
	ffffffff817f266b T _raw_write_lock_irq
	ffffffff817f2687 T _raw_write_lock_bh
	ffffffff817f26ad T _raw_read_lock
	ffffffff817f26c6 T _raw_read_lock_bh
	ffffffff817f26ea T _raw_read_lock_irq
	ffffffff817f2704

      That's 684 bytes - a very stark difference that will show up in 
      better I$ footprint even if usage is sparse.

      OTOH, on the flip side, their ordering is far from ideal, so for 
      example the rarely used 'trylock' variants are mixed into the 
      middle, and the way we mix rwlock with spinlock ops isn't 
      very pretty either.

      So we could reduce alignment for just the locking APIs, via per 
      .o cflags in the Makefile, if packing otherwise hurts the common 
      case.

This function packing argument fails:

  - for large functions that are physically fragmented

  - if less than half of all functions in a hot workload are
    packed together. This might be the common case in fact.

  - even if functions are technically 'packed' next to each other, 
    this only works for small functions: larger functions typically 
    are hotter near their heads, with unlikely codepaths being in 
    their tails.

> Size matters, but size matters mainly from an I$ standpoint, not 
> from some absolute 'big is bad" issue.

Absolutely.

> [...] Also, even when size matters, performance matters too. I do 
> want performance numbers. Is this measurable?

Will try to measure this. I'm somewhat sceptical that I'll be able to 
measure any signal: alignment effects are very hard to measure on x86, 
especially on any realistic workload.

In any case, consider this function alignment patch shelved until it's 
properly measured.

Thanks,

	ingo

  parent reply	other threads:[~2015-05-17  5:58 UTC|newest]

Thread overview: 108+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low
2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low
2015-04-09  9:00   ` [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner() tip-bot for Jason Low
2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09  5:37   ` Ingo Molnar
2015-04-09  6:40     ` Jason Low
2015-04-09  7:53       ` Ingo Molnar
2015-04-09 16:47         ` Linus Torvalds
2015-04-09 17:56           ` Paul E. McKenney
2015-04-09 18:08             ` Linus Torvalds
2015-04-09 18:16               ` Linus Torvalds
2015-04-09 18:39                 ` Paul E. McKenney
2015-04-10  9:00                   ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar
2015-04-10  9:12                     ` Ingo Molnar
2015-04-10  9:21                       ` [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Ingo Molnar
2015-04-10 11:14                         ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar
2015-04-10 11:27                           ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar
2015-04-10 12:08                             ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar
2015-04-10 12:18                               ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar
2015-04-10 12:30                                 ` [PATCH] x86: Pack loops " Ingo Molnar
2015-04-10 13:46                                   ` Borislav Petkov
2015-05-15  9:40                                   ` [tip:x86/asm] " tip-bot for Ingo Molnar
2015-05-17  6:03                                   ` [tip:x86/apic] " tip-bot for Ingo Molnar
2015-05-15  9:39                                 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar
2015-05-15 18:36                                   ` Linus Torvalds
2015-05-15 20:52                                     ` Denys Vlasenko
2015-05-17  5:58                                     ` Ingo Molnar [this message]
2015-05-17  7:09                                       ` Ingo Molnar
2015-05-17  7:30                                         ` Ingo Molnar
2015-05-18  9:28                                       ` Denys Vlasenko
2015-05-19 21:38                                       ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar
2015-05-20  0:47                                         ` Linus Torvalds
2015-05-20 12:21                                           ` Denys Vlasenko
2015-05-21 11:36                                             ` Ingo Molnar
2015-05-21 11:38                                             ` Denys Vlasenko
2016-04-16 21:08                                               ` Denys Vlasenko
2015-05-20 13:09                                           ` Ingo Molnar
2015-05-20 11:29                                         ` Denys Vlasenko
2015-05-21 13:28                                           ` Ingo Molnar
2015-05-21 14:03                                           ` Ingo Molnar
2015-04-10 12:50                               ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-04-10 13:18                                 ` H. Peter Anvin
2015-04-10 17:54                                   ` Ingo Molnar
2015-04-10 18:32                                     ` H. Peter Anvin
2015-04-11 14:41                                   ` Markus Trippelsdorf
2015-04-12 10:14                                     ` Ingo Molnar
2015-04-13 16:23                                       ` Markus Trippelsdorf
2015-04-13 17:26                                         ` Markus Trippelsdorf
2015-04-13 18:31                                           ` Linus Torvalds
2015-04-13 19:09                                             ` Markus Trippelsdorf
2015-04-14  5:38                                               ` Ingo Molnar
2015-04-14  8:23                                                 ` Markus Trippelsdorf
2015-04-14  9:16                                                   ` Ingo Molnar
2015-04-14 11:17                                                     ` Markus Trippelsdorf
2015-04-14 12:09                                                       ` Ingo Molnar
2015-04-10 18:48                                 ` Linus Torvalds
2015-04-12 23:44                                   ` Maciej W. Rozycki
2015-04-10 19:23                                 ` Daniel Borkmann
2015-04-11 13:48                                 ` Markus Trippelsdorf
2015-04-10 13:19                               ` Borislav Petkov
2015-04-10 13:54                                 ` Denys Vlasenko
2015-04-10 14:01                                   ` Borislav Petkov
2015-04-10 14:53                                     ` Denys Vlasenko
2015-04-10 15:25                                       ` Borislav Petkov
2015-04-10 15:48                                         ` Denys Vlasenko
2015-04-10 15:54                                           ` Borislav Petkov
2015-04-10 21:44                                             ` Borislav Petkov
2015-04-10 18:54                                       ` Linus Torvalds
2015-04-10 14:10                               ` Paul E. McKenney
2015-04-11 14:28                                 ` Josh Triplett
2015-04-11  9:20                               ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar
2015-04-11 17:41                                 ` Linus Torvalds
2015-04-11 18:57                                   ` Thomas Gleixner
2015-04-11 19:35                                     ` Linus Torvalds
2015-04-12  5:47                                       ` Ingo Molnar
2015-04-12  6:20                                         ` Markus Trippelsdorf
2015-04-12 10:15                                           ` Ingo Molnar
2015-04-12  7:56                                         ` Mike Galbraith
2015-04-12  7:41                                       ` Ingo Molnar
2015-04-12  8:07                                     ` Ingo Molnar
2015-04-12 21:11                                     ` Jan Hubicka
2015-05-14 11:59                               ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-05-14 18:17                                 ` Ingo Molnar
2015-05-14 19:04                                   ` Denys Vlasenko
2015-05-14 19:44                                     ` Ingo Molnar
2015-05-15 15:45                                   ` Josh Triplett
2015-05-17  5:34                                     ` Ingo Molnar
2015-05-17 19:18                                       ` Josh Triplett
2015-05-18  6:48                                         ` Ingo Molnar
2015-05-15  9:39                               ` [tip:x86/asm] x86: Align jump targets to 1-byte boundaries tip-bot for Ingo Molnar
2015-04-10 11:34                           ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra
2015-04-10 18:04                             ` Ingo Molnar
2015-04-10 17:49                           ` Linus Torvalds
2015-04-10 18:04                             ` Ingo Molnar
2015-04-10 18:09                               ` Linus Torvalds
2015-04-10 14:20                     ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Paul E. McKenney
2015-04-10 17:44                       ` Ingo Molnar
2015-04-10 18:05                         ` Paul E. McKenney
2015-04-09 19:43                 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09 19:58                   ` Paul E. McKenney
2015-04-09 20:58                     ` Jason Low
2015-04-09 21:07                       ` Paul E. McKenney
2015-04-09 19:59                   ` Davidlohr Bueso
2015-04-09 20:36                 ` Jason Low
2015-04-10  2:43                   ` Andev
2015-04-10  9:04                   ` Ingo Molnar
2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso
2015-04-08 20:10   ` Jason Low

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150517055551.GB17002@gmail.com \
    --to=mingo@kernel.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aswin@hp.com \
    --cc=bp@alien8.de \
    --cc=brgerst@gmail.com \
    --cc=dave@stgolabs.net \
    --cc=dvlasenk@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jason.low2@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.