Re: [PATCH 11/15] KVM: x86/MMU: Refactor vmx_get_mt_mask

From: Sean Christopherson <seanjc@google.com>
To: Ben Gardon <bgardon@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Peter Xu <peterx@redhat.com>, Peter Shier <pshier@google.com>,
	David Matlack <dmatlack@google.com>,
	Mingwei Zhang <mizhang@google.com>,
	Yulei Zhang <yulei.kernel@gmail.com>,
	Wanpeng Li <kernellwp@gmail.com>,
	Xiao Guangrong <xiaoguangrong.eric@gmail.com>,
	Kai Huang <kai.huang@intel.com>,
	Keqian Zhu <zhukeqian1@huawei.com>,
	David Hildenbrand <david@redhat.com>
Subject: Re: [PATCH 11/15] KVM: x86/MMU: Refactor vmx_get_mt_mask
Date: Mon, 22 Nov 2021 18:46:58 +0000
Message-ID: <YZvloswO5g/o02V6@google.com>
In-Reply-To: <CANgfPd-_7tR9tSJg85-0wAG72454qeedovhBvbX6OS1YNRxvMw@mail.gmail.com>

On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 1:03 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 11/18/21 16:30, Sean Christopherson wrote:
> > > If we really want to make this state per-vCPU, KVM would need to incorporate the
> > > CR0.CD and MTRR settings in kvm_mmu_page_role.  For MTRRs in particular, the worst
> > > case scenario is that every vCPU has different MTRR settings, which means that
> > > kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
> > > possible vcpu_idx (currently capped at 1024).
> >
> > Yes, that's insanity.  I was also a bit skeptical about Ben's try_get_mt_mask callback,
> > but this would be much much worse.
> 
> Yeah, the implementation of that felt a bit kludgy to me too, but
> refactoring the handling of all those CR bits was way more complex
> than I wanted to handle in this patch set.
> I'd love to see some of those CR0 / MTRR settings be set on a VM basis
> and enforced as uniform across vCPUs.

Architecturally, we can't do that.  Even a perfectly well-behaved guest will have
(small) periods where the BSP has different settings than APs.  And it's technically
legal to have non-uniform MTRR and CR0.CD/NW configurations, even though no modern
BIOS/kernel does that.  Except for non-coherent DMA, it's a moot point because KVM
can simply ignore guest cacheability settings.

> Looking up vCPU 0 and basing things on that feels extra hacky though,
> especially if we're still not asserting uniformity of settings across
> vCPUs.

IMO, it's marginally less hacky than what KVM has today as it allows KVM's behavior
to be clearly and sanely stated, e.g. KVM uses vCPU0's cacheability settings when
mapping non-coherent DMA.  Compare that with today's behavior where the cacheability
settings depend on which vCPU first faulted in the address for a given MMU role and
instance of the associated root, and whether other vCPUs share an MMU role/root.
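
To make the "use vCPU0's settings" policy concrete, here is a rough, standalone
sketch.  It is a userspace toy model, not KVM code: toy_vm, toy_vcpu and
toy_get_mt_mask() are made-up names, the per-gfn MTRR lookup is collapsed into a
single mtrr_type field, and the CD/NW quirk handling is omitted.  The memory type
encodings and the EPT bit positions follow the x86 definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory type encodings, as used by x86 MTRRs and EPT. */
#define MT_UC 0u /* uncacheable */
#define MT_WB 6u /* write-back */

#define EPT_MT_SHIFT 3            /* EPT memory type lives in bits 5:3 */
#define EPT_IPAT_BIT (1ull << 6)  /* "ignore guest PAT" */

/* Hypothetical, stripped-down models of a vCPU and a VM. */
struct toy_vcpu {
	bool cr0_cd;        /* guest CR0.CD */
	uint8_t mtrr_type;  /* stand-in for a per-gfn MTRR lookup */
};

struct toy_vm {
	bool noncoherent_dma;
	struct toy_vcpu vcpus[4];
};

/*
 * The result depends only on VM-wide state and on vCPU0, never on which
 * vCPU happened to take the fault.
 */
static uint64_t toy_get_mt_mask(const struct toy_vm *vm, bool is_mmio)
{
	const struct toy_vcpu *vcpu0 = &vm->vcpus[0];

	if (is_mmio)
		return (uint64_t)MT_UC << EPT_MT_SHIFT;

	/* Without non-coherent DMA, guest cacheability can be ignored. */
	if (!vm->noncoherent_dma)
		return ((uint64_t)MT_WB << EPT_MT_SHIFT) | EPT_IPAT_BIT;

	/* Simplification: treat CR0.CD=1 as UC, ignoring the CD/NW quirk. */
	if (vcpu0->cr0_cd)
		return ((uint64_t)MT_UC << EPT_MT_SHIFT) | EPT_IPAT_BIT;

	return (uint64_t)vcpu0->mtrr_type << EPT_MT_SHIFT;
}
```

In this model, a VM whose vCPU1 has a different mtrr_type or CR0.CD than vCPU0
still gets a single, predictable memory type for a given mapping, which is the
property described above.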

> If we need to track that state to accurately virtualize the hardware
> though, that would be unfortunate.