From: Ingo Molnar <mingo@kernel.org>
To: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	the arch/x86 maintainers <x86@kernel.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andi Kleen <ak@linux.intel.com>,
	Andrew Lutomirski <luto@kernel.org>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
	Brian Gerst <brgerst@gmail.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [RFC v2 PATCH 6/7] x86/entry: get rid of ALLOC_PT_GPREGS_ON_STACK and SAVE_AND_CLEAR_REGS
Date: Thu, 8 Feb 2018 10:47:10 +0100
Message-ID: <20180208094710.qnjixhm6hybebdv7@gmail.com>
In-Reply-To: <20180207212915.GA4633@isilmar-4.linta.de>


* Dominik Brodowski <linux@dominikbrodowski.net> wrote:

> On Wed, Feb 07, 2018 at 12:44:41PM -0800, Linus Torvalds wrote:
> > On Wed, Feb 7, 2018 at 12:15 PM, Dominik Brodowski
> > <linux@dominikbrodowski.net> wrote:
> > >
> > > Note: The testb $3, CS(%rsp) instruction in idtentry() does not need
> > > modification. Previously %rsp was manually decreased by 15*8; with
> > > this patch, %rsp is decreased by 15 pushq instructions. Moreover,
> > > error_entry did and does the exact same test (with offset=8) after
> > > the registers have been moved/pushed and cleared.
> > 
> > So this has the problem that now those save/clear instructions will
> > all be done in that idtentry macro.
> > 
> > So now that code will be duplicated for all the users of that macro.
> > 
> > The old code did the saving in the common error_entry and
> > paranoid_entry routines, in order to be able to share all the code
> > and to make the duplicated stub functions generated by the idtentry
> > macro smaller.
> > 
> > Now, admittedly the new push sequence is much smaller than the old
> > movq sequence, so the duplication doesn't hurt as much, but it's still
> > likely quite noticeable.
> > 
> > So this removes lines of asm code, but it adds a lot of instructions
> > to the end result thanks to the macro, I think.
> 
> Indeed, that is the case (see below). However, if we want to switch to
> PUSH instructions and do this in a routine which is call'ed and which
> ret'urns, %rsp needs to be moved around even more often than with the
> old ALLOC_PT_GPREGS_ON_STACK macro (which you wanted to get rid of,
> IIUYC). Or am I missing something?
> 
>    text	   data	    bss	    dec	    hex	filename
>   19500	      0	      0	  19500	   4c2c	arch/x86/entry/entry_64.o-orig
>   19510	      0	      0	  19510	   4c36	arch/x86/entry/entry_64.o-3_of_7
>   21105	      0	      0	  21105	   5271	arch/x86/entry/entry_64.o-5_of_7
>   24307	      0	      0	  24307	   5ef3	arch/x86/entry/entry_64.o-7_of_7
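
(To make the PUSH vs MOV trade-off concrete, here's a much simplified
sketch of the two approaches - abbreviated register lists and offsets,
not the literal entry_64.S code:)

	/*
	 * Old approach: reserve the whole pt_regs area with a single
	 * %rsp adjustment, then store registers via MOVs. Because the
	 * MOVs can use an offset to skip over the return address that
	 * 'call' pushed, this could live in shared, call'ed helpers
	 * such as error_entry.
	 */
	.macro ALLOC_PT_GPREGS_ON_STACK
		addq	$-(15*8), %rsp
	.endm

	.macro SAVE_AND_CLEAR_REGS offset=0
		movq	%rdi, 14*8+\offset(%rsp)	/* pt_regs->di */
		movq	%rsi, 13*8+\offset(%rsp)	/* pt_regs->si */
		/* ... same pattern for the remaining 13 GP registers ... */
		xorq	%r8, %r8	/* clear regs to limit speculation */
		/* ... */
	.endm

	/*
	 * New approach: PUSH each register, interleaving the XORs that
	 * clear them. Each PUSH encodes in 1-2 bytes vs ~4-5 bytes per
	 * MOV, so the sequence itself is much smaller - but it moves
	 * %rsp fifteen times, so it cannot simply sit behind a call/ret
	 * and is instead emitted inline, duplicated into every idtentry
	 * stub.
	 */
	.macro PUSH_AND_CLEAR_REGS
		pushq	%rdi		/* pt_regs->di */
		pushq	%rsi		/* pt_regs->si */
		pushq	%rdx		/* pt_regs->dx */
		pushq	%rcx		/* pt_regs->cx */
		pushq	%rax		/* pt_regs->ax */
		pushq	%r8		/* pt_regs->r8 */
		xorq	%r8, %r8	/* clear to limit speculation */
		pushq	%r9		/* pt_regs->r9 */
		xorq	%r9, %r9
		/* ... likewise for r10, r11, rbx, rbp, r12-r15 ... */
	.endm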

So while this shows a +~5K increase in text size, these numbers _very_
significantly over-represent the extra footprint. The assumptions that led
us to compress the IRQ entry code have changed substantially with the new
x86 IRQ vector allocation code introduced over the last year:

- IRQ vectors are usually populated in tightly clustered groups.

  With our new vector allocator code, the typical per-CPU allocation on
  x86 systems is ~3 device vectors and ~10 fixed vectors out of ~220
  available vectors - i.e. a very low ~6% utilization (!). This means the
  _real_ (cache-resident) text footprint increase is probably closer to
  0.1K of text, because only the entry stubs of vectors that are actually
  allocated and firing ever get pulled into the cache...

  This is a very typical picture on an average desktop system:

  /sys/kernel/debug/irq/domains/VECTORS:

   Online bitmaps:       40
   Global available:   8007
   Global reserved:      24
   Total allocated:      73
   System: 42: 0-19,32,50,128,237-255

 | CPU | avl | man | act | vectors
     0   199     0     3  33-34,48
     1   199     0     3  33-35
     2   199     0     3  33-35
     3   199     0     3  33-35
     4   199     0     3  33-35
     5   199     0     3  33-35
     6   199     0     3  33-35
     7   199     0     3  33-35
     8   199     0     3  33-35
     9   199     0     3  33-35
    10   201     0     1  33
    11   201     0     1  33
    12   201     0     1  33
    13   201     0     1  33
    14   201     0     1  33
    15   201     0     1  33
    16   201     0     1  33
    17   201     0     1  33
    18   201     0     1  33
    19   201     0     1  33
    20   199     0     3  33-35
    21   199     0     3  33-35
    22   199     0     3  33-35
    23   199     0     3  33-35
    24   199     0     3  33-35
    25   199     0     3  33-35
    26   199     0     3  33-35
    27   199     0     3  33-35
    28   199     0     3  33-35
    29   199     0     3  33-35
    30   201     0     1  33
    31   201     0     1  33
    32   201     0     1  33
    33   202     0     0  
    34   202     0     0  
    35   202     0     0  
    36   202     0     0  
    37   202     0     0  
    38   202     0     0  
    39   202     0     0  

  I.e. the average number of (device) IRQ vectors per CPU is around 2
  (73 allocated vectors across 40 CPUs), with the max being 3.

  The days when we allocated a lot of vectors on every CPU, and when
  compressing the IRQ entry text therefore mattered, are over.

- Another issue is that only a small minority of vectors is frequent enough
  to actually matter to cache utilization: 3-4 key IPIs and 1-2 device IRQs
  at most - and those vectors tend to be tightly clustered as well, into
  about two groups that probably already fit on 2-3 cache lines in practice.

  For the common case of 'cache cold' IRQs it's the depth of the call chain
  and the fragmentation of the resulting I$ footprint that should be the
  main performance limit - not its overall size.

- The CPU-side cost of IRQ delivery is still very high even in the best,
  fully cached case - on the order of a thousand cycles or more. So much
  work is done during delivery that contemporary x86 microcode may well
  already prefetch the IDT entry and its expected call target address.

So for those reasons I'm really tempted by the all-around simplification
offered by this series:

   2 files changed, 62 insertions(+), 160 deletions(-)

Basically, in this specific case I'd like to turn the argument around: use
this simpler design, and put the burden of proof on any patches that want
to _complicate it_ beyond this simple, straightforward model. Show us the
numbers proving that it truly makes a difference: it's not hard to spray a
CPU with IRQs by using IPIs, and the cache miss rates and overall perf
counter stats should tell a clear story about whether it makes sense to
cache-compress (and complicate) the IRQ entry sequences.
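
Something along these lines should do it (a hypothetical recipe - exact
perf event names vary by CPU model, and 'irq_storm_target' stands in for
whatever workload you use to flood the target CPU with IPIs/IRQs):

	# Keep CPU 3 busy and flood it with IRQs from the other CPUs,
	# then watch the instruction-cache behavior during the storm:
	taskset -c 3 ./irq_storm_target &

	perf stat -a -C 3 \
		-e cycles,instructions \
		-e L1-icache-load-misses \
		-e iTLB-load-misses \
		sleep 10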

Thanks,

	Ingo

Thread overview: 15+ messages
2018-02-07 20:15 [RFC v2 PATCH 0/7] x86/entry: simplify and unify SAVE/POP_REGS Dominik Brodowski
2018-02-07 20:15 ` [RFC v2 PATCH 1/7] x86/entry: merge SAVE_C_REGS and SAVE_EXTRA_REGS, remove unused extensions Dominik Brodowski
2018-02-07 20:15 ` [RFC v2 PATCH 2/7] x86/entry: merge POP_C_REGS and POP_EXTRA_REGS Dominik Brodowski
2018-02-07 20:15 ` [RFC v2 PATCH 3/7] x86/entry: interleave XOR register clearing with PUSH instructions Dominik Brodowski
2018-02-07 20:15 ` [RFC v2 PATCH 4/7] x86/entry: introduce PUSH_AND_CLEAN_REGS Dominik Brodowski
2018-02-07 20:15 ` [RFC v2 PATCH 5/7] x86/entry: use PUSH_AND_CLEAN_REGS in more cases Dominik Brodowski
2018-02-07 20:15 ` [RFC v2 PATCH 6/7] x86/entry: get rid of ALLOC_PT_GPREGS_ON_STACK and SAVE_AND_CLEAR_REGS Dominik Brodowski
2018-02-07 20:44   ` Linus Torvalds
2018-02-07 21:29     ` Dominik Brodowski
2018-02-07 21:58       ` Linus Torvalds
2018-02-08  7:20         ` Dominik Brodowski
2018-02-08  9:47       ` Ingo Molnar [this message]
2018-02-08 17:39         ` Linus Torvalds
2018-02-08 18:35       ` Josh Poimboeuf
2018-02-07 20:15 ` [RFC v2 PATCH 7/7] x86/entry: indent PUSH_AND_CLEAR_REGS and POP_REGS properly Dominik Brodowski
