Date: Sun, 17 May 2015 09:09:29 +0200
From: Ingo Molnar
To: Linus Torvalds
Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko,
	Linux Kernel Mailing List, Tim Chen, Borislav Petkov,
	Peter Zijlstra, "Chandramouleeswaran, Aswin", Brian Gerst,
	Paul McKenney, Thomas Gleixner, Jason Low,
	linux-tip-commits@vger.kernel.org
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well
Message-ID: <20150517070929.GA23344@gmail.com>
References: <20150410121808.GA19918@gmail.com> <20150517055551.GB17002@gmail.com>
In-Reply-To: <20150517055551.GB17002@gmail.com>

* Ingo Molnar wrote:

> > Put another way: I suspect this is more likely to hurt, and less
> > likely to help than the others.
>
> Yeah, indeed.
>
> So my thinking was that it would help, because:
>
> - There's often locality of reference between functions: we often
>   have a handful of hot functions that are sitting next to each
>   other, and they could thus be packed closer to each other this
>   way, creating a smaller net I$ footprint.
> - We have a handful of 'clusters' of small and often hot functions,
>   especially in the locking code:
>
> ffffffff81893080 T _raw_spin_unlock_irqrestore
> ffffffff81893090 T _raw_read_unlock_irqrestore
> ffffffff818930a0 T _raw_write_unlock_irqrestore
> ffffffff818930b0 T _raw_spin_trylock_bh
> ffffffff81893110 T _raw_spin_unlock_bh
> ffffffff81893130 T _raw_read_unlock_bh
> ffffffff81893150 T _raw_write_unlock_bh
> ffffffff81893170 T _raw_read_trylock
> ffffffff818931a0 T _raw_write_trylock
> ffffffff818931d0 T _raw_read_lock_irqsave
> ffffffff81893200 T _raw_write_lock_irqsave
> ffffffff81893230 T _raw_spin_lock_bh
> ffffffff81893270 T _raw_spin_lock_irqsave
> ffffffff818932c0 T _raw_write_lock
> ffffffff818932e0 T _raw_write_lock_irq
> ffffffff81893310 T _raw_write_lock_bh
> ffffffff81893340 T _raw_spin_trylock
> ffffffff81893380 T _raw_read_lock
> ffffffff818933a0 T _raw_read_lock_irq
> ffffffff818933c0 T _raw_read_lock_bh
> ffffffff818933f0 T _raw_spin_lock
> ffffffff81893430 T _raw_spin_lock_irq
> ffffffff81893450
>
> That's 976 bytes total if 16-byte aligned.
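The quoted total is easy to double-check: the cluster's footprint is simply the end marker minus the first symbol's address. A quick sketch in shell arithmetic, using the addresses quoted above:

```shell
# Span of the 16-byte-aligned locking cluster: end address minus start.
start=0xffffffff81893080   # _raw_spin_unlock_irqrestore
end=0xffffffff81893450     # end marker after _raw_spin_lock_irq
echo $(( end - start ))    # prints 976
```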
> With function packing, they compress into:
>
> ffffffff817f2458 T _raw_spin_unlock_irqrestore
> ffffffff817f2463 T _raw_read_unlock_irqrestore
> ffffffff817f2472 T _raw_write_unlock_irqrestore
> ffffffff817f247d T _raw_read_unlock_bh
> ffffffff817f2498 T _raw_write_unlock_bh
> ffffffff817f24af T _raw_spin_unlock_bh
> ffffffff817f24c6 T _raw_read_trylock
> ffffffff817f24ef T _raw_write_trylock
> ffffffff817f250e T _raw_spin_lock_bh
> ffffffff817f2536 T _raw_read_lock_irqsave
> ffffffff817f255e T _raw_write_lock_irqsave
> ffffffff817f2588 T _raw_spin_lock_irqsave
> ffffffff817f25be T _raw_spin_trylock_bh
> ffffffff817f25f6 T _raw_spin_trylock
> ffffffff817f2615 T _raw_spin_lock
> ffffffff817f2632 T _raw_spin_lock_irq
> ffffffff817f2650 T _raw_write_lock
> ffffffff817f266b T _raw_write_lock_irq
> ffffffff817f2687 T _raw_write_lock_bh
> ffffffff817f26ad T _raw_read_lock
> ffffffff817f26c6 T _raw_read_lock_bh
> ffffffff817f26ea T _raw_read_lock_irq
> ffffffff817f2704
>
> That's 684 bytes - a very stark difference that will show up as a
> better I$ footprint even if usage is sparse.
>
> On the flip side, their ordering is far from ideal: for example the
> rarely used 'trylock' variants are mixed into the middle, and the
> way we mix rwlock and spinlock ops isn't very pretty either.
>
> So we could reduce alignment for just the locking APIs, via per-.o
> cflags in the Makefile, if packing otherwise hurts the common case.

Another datapoint: despite the widespread stereotype of bloated,
stack-guzzling kernel functions, most kernel functions are pretty
small and fit into a single cacheline (or at most two).
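A size distribution like this can be pulled straight out of `nm`. Here's a minimal sketch that buckets text-symbol sizes into cacheline-sized bins; the here-doc stands in for real `nm -S --size-sort vmlinux` output (the symbol sizes match the packed listing above, `some_big_function` is a made-up filler entry), so the pipeline runs as-is:

```shell
#!/usr/bin/env bash
# Bucket text-symbol sizes into 64-byte (cacheline) bins. The here-doc
# below is a stand-in for `nm -S --size-sort vmlinux` output.
declare -A count
while read -r addr size type name; do
    case $type in [Tt]) ;; *) continue ;; esac   # text symbols only
    s=$(( 16#$size ))                 # hex size -> decimal bytes
    b=$(( s / 64 * 64 ))              # cacheline-sized bucket
    count[$b]=$(( ${count[$b]:-0} + 1 ))
done <<'EOF'
ffffffff817f2458 000000000000000b T _raw_spin_unlock_irqrestore
ffffffff817f2615 000000000000001d T _raw_spin_lock
ffffffff817f24c6 0000000000000029 T _raw_read_trylock
ffffffff81400000 0000000000000190 T some_big_function
EOF
for b in "${!count[@]}"; do
    printf '%4d-%4d bytes: %d function(s)\n' "$b" $(( b + 63 )) "${count[$b]}"
done | sort -n
```

With real `vmlinux` input, piping the bucket counts into gnuplot reproduces the kind of histogram shown below.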
Here's a histogram of x86-64 defconfig function sizes (in bytes, cut
off at around 1200 bytes):

  [ASCII histogram: kernel function sizes, 0-1200 bytes. A tall peak
   of roughly 500 functions per bucket at very small sizes, dropping
   off sharply, with a long sparse tail beyond ~400 bytes.]

Zoomed to just the first 200 bytes:

  [ASCII histogram: kernel function sizes, 0-200 bytes. Counts peak
   at roughly 500 per bucket for the smallest functions and decline
   steadily towards 200 bytes.]

The median function size is around 1 cacheline (a 64-byte one), with
~80% of functions fitting into two cachelines, plus a big peak of
very small functions that make up something like 20% of all functions
- so packing makes sense unless the set of functions actually being
executed is more fragmented than the ~50% I'd estimate. Which is
still a significant hurdle.

Plus there are a truckload of other details, like how exactly the
flow of execution goes within the function (which part is hot).
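The packing trade-off above boils down to simple rounding: with N-byte alignment, each function's start address is rounded up to the next N-byte boundary, and everything between the real end of the previous function and that boundary is padding. A small sketch of the arithmetic (the 90-byte size is just an illustrative value near the median):

```shell
# Round a size up to the next alignment boundary; the difference
# between the result and the real size is pure padding.
align_up () { echo $(( ($1 + $2 - 1) / $2 * $2 )); }

align_up 90 16   # 16-byte alignment: a 90-byte function occupies 96 bytes
align_up 90 64   # 64-byte alignment: the same function occupies 128 bytes
align_up 90 1    # fully packed: exactly 90 bytes
```

For uniformly distributed sizes, the expected padding per function is (align - 1) / 2 bytes, which is why shrinking alignment helps most when functions are small and densely executed.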
For example, locking APIs tend to return early after a relatively
straight path of execution, making the effective hot function length
even smaller than the median.

All in all: there's no clear winner, so there's no way around
actually measuring it ... ;-)

Thanks,

	Ingo