Re: [PATCH v1 3/3] KVM: arm64: Add histogram stats for handling time of arch specific exit reasons

From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>,
	Jing Zhang <jingzhangos@google.com>, KVM <kvm@vger.kernel.org>,
	KVMARM <kvmarm@lists.cs.columbia.edu>,
	Will Deacon <will@kernel.org>,
	David Matlack <dmatlack@google.com>,
	Peter Shier <pshier@google.com>, Oliver Upton <oupton@google.com>,
	Jim Mattson <jmattson@google.com>,
	Ben Gardon <bgardon@google.com>,
	Aaron Lewis <aaronlewis@google.com>,
	Venkatesh Srinivas <venkateshs@google.com>
Subject: Re: [PATCH v1 3/3] KVM: arm64: Add histogram stats for handling time of arch specific exit reasons
Date: Wed, 22 Sep 2021 18:13:40 +0000	[thread overview]
Message-ID: <YUtyVEpMBityBBNl@google.com> (raw)
In-Reply-To: <d16ecbd2-2bc9-2691-a21d-aef4e6f007b9@redhat.com>

+Google folks

On Wed, Sep 22, 2021, Paolo Bonzini wrote:
> On 22/09/21 13:22, Marc Zyngier wrote:
> > Frankly, this is a job for BPF and the tracing subsystem, not for some
> > hardcoded syndrome accounting. It would allow to extract meaningful
> > information, prevent bloat, and crucially make it optional. Even empty
> > trace points like the ones used in the scheduler would be infinitely
> > better than this (load your own module that hooks into these trace
> > points, expose the data you want, any way you want).
> 
> I agree.  I had left out for later the similar series you had for x86, but I
> felt the same as Marc; even just counting the number of occurrences of each
> exit reason is a nontrivial amount of memory to spend on each vCPU.

That depends on the use case, environment, etc...  E.g. if the VM is assigned a
_minimum_ of 4gb per vCPU, then burning even tens of kilobytes of memory per vCPU
is trivial, or at least completely acceptable.

I do 100% agree this should be optional, be it through an ioctl(), module/kernel
param, Kconfig, whatever.  The changelogs are also sorely lacking the motivation
for having dedicated stats; we'll do our best to remedy that for future work.

Stepping back a bit, this is one piece of the larger issue of how to modernize
KVM for hyperscale usage.  BPF and tracing are great when the debugger has root
access to the machine and can rerun the failing workload at will.  They're useless
for identifying trends across large numbers of machines, triaging failures after
the fact, debugging performance issues with workloads that the debugger doesn't
have direct access to, etc...

Logging is a similar story, e.g. using _ratelimited() printk to aid debug works
well when there are a very limited number of VMs and there is a human that can
react to arbitrary kernel messages, but it's basically useless when there are 10s
or 100s of VMs and taking action on a kernel message requires a prior knowledge
of the message.

I'm certainly not expecting other people to solve our challenges, and I fully
appreciate that there are many KVM users that don't care at all about scalability,
but I'm hoping we can get the community at large, and especially maintainers and
reviewers, to also consider at-scale use cases when designing, implementing,
reviewing, etc...