Re: [PATCH] KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps

From: Sean Christopherson <seanjc@google.com>
To: Vipin Sharma <vipinsh@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	David Matlack <dmatlack@google.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	kvm list <kvm@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Date: Mon, 28 Mar 2022 20:29:05 +0000
Message-ID: <YkIakXAxkyJiO7iF@google.com>
In-Reply-To: <CAHVum0cynwp5Phx=v2LV33Hsa8viq0jpVLh0Q_ZtpUZVy6Lm9w@mail.gmail.com>

On Mon, Mar 28, 2022, Vipin Sharma wrote:
> Thank you David and Paolo, for checking this patch carefully. With
> hindsight, I should have explicitly mentioned adding "noinline" in my
> patch email.
> 
> On Sun, Mar 27, 2022 at 3:41 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 3/26/22 01:31, Vipin Sharma wrote:
> > >>> -static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
> > >>> +static noinline void
> > >>
> > >> What is the reason to add noinline?
> > >
> > > My understanding is that since this method is called from
> > > __always_inline methods, noinline will avoid gcc inlining the
> > > slot_rmap_walk_next in those functions and generate smaller code.
> > >
> >
> > Iterators are written in such a way that it's way more beneficial to
> > inline them.  After inlining, compilers replace the aggregates (in this
> > case, struct slot_rmap_walk_iterator) with one variable per field and
> > that in turn enables a lot of optimizations, so the iterators should
> > actually be always_inline if anything.
> >
> > For the same reason I'd guess the effect on the generated code should be
> > small (next time please include the output of "size mmu.o"), but should
> > still be there.  I'll do a quick check of the generated code and apply
> > the patch.
> 
> Yeah, I should have added the "size mmu.o" output. Here is what I have found:
> 
> size arch/x86/kvm/mmu/mmu.o
> 
> Without noinline:
>    text    data     bss     dec     hex filename
>   89938   15793      72  105803   19d4b arch/x86/kvm/mmu/mmu.o
> 
> With noinline:
>    text    data     bss     dec     hex filename
>   90058   15793      72  105923   19dc3 arch/x86/kvm/mmu/mmu.o
> 
> With noinline, increase in size = 120
> 
> Curiously, I also checked file size with "ls -l" command
> File size:
>         Without noinline: 1394272 bytes
>         With noinline: 1381216 bytes
> 
> With noinline, decrease in size = 13056 bytes
> 
> I also disassembled mmu.o via "objdump -d" and found following
> Total lines in the generated assembly:
>         Without noinline: 23438
>         With noinline: 23393
> 
> With noinline, decrease in assembly code = 45
> 
> I can see in assembly code that there are multiple "call" operations
> in the "with noinline" object file, which is expected and has less
> lines of code compared to "without noinline". I am not sure why the
> size command is showing an increase in text segment for "with
> noinline" and what to infer with all of this data.

The most common takeaway from these types of exercises is that trying to be smarter
than the compiler is usually a fool's errand.  A smaller code footprint doesn't
necessarily equate to better runtime performance.  And conversely, inlining may
not always be a win, which is why tagging static helpers (not in headers) with
"inline" is generally discouraged.

IMO, unless there's an explicit side effect we want (or want to avoid), we should
never use "noinline".  E.g. the VMX <insn>_error() handlers use noinline so that
KVM WARNs only once per type of failed instruction, and fxregs_fixup() uses it
to keep the stack size manageable.
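
To illustrate the "WARN once per instruction type" side effect, here is a
minimal userspace sketch.  It is an analogy only, not KVM code:
WARN_ONCE_SITE(), insn_error(), warnings_emitted and the caller_*() functions
are hypothetical names invented for this example.  It assumes a warn-once
macro whose "already warned" state belongs to the place where the macro body
ends up, so funnelling every caller through one out-of-line helper leaves a
single flag, while expanding the macro at each caller gives every caller its
own.  The noinline attribute mirrors the kernel annotation; in plain C the
contrast below comes from macro expansion alone.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how many warnings were actually printed (for illustration only). */
static int warnings_emitted;

/* Hypothetical warn-once macro: every expansion gets its own static flag. */
#define WARN_ONCE_SITE(msg)                                      \
	do {                                                     \
		static bool warned; /* one flag per expansion */ \
		if (!warned) {                                   \
			warned = true;                           \
			warnings_emitted++;                      \
			fprintf(stderr, "WARN: %s\n", msg);      \
		}                                                \
	} while (0)

/* Out-of-line helper: one expansion, so one flag shared by all callers. */
static __attribute__((noinline)) void insn_error(void)
{
	WARN_ONCE_SITE("instruction failed");
}

/* Both callers go through the helper and therefore share its flag. */
static void caller_a(void) { insn_error(); }
static void caller_b(void) { insn_error(); }

/* Expanding the macro directly gives each caller a separate flag. */
static void caller_c(void) { WARN_ONCE_SITE("instruction failed"); }
static void caller_d(void) { WARN_ONCE_SITE("instruction failed"); }
```

Calling caller_a() and caller_b() any number of times prints one warning in
total, whereas caller_c() and caller_d() each print their own.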