Re: [PATCH v4 11/12] KVM: x86/svm/pmu: Add AMD PerfMonV2 support

From: Like Xu <like.xu.linux@gmail.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Sandipan Das <sandipan.das@amd.com>
Subject: Re: [PATCH v4 11/12] KVM: x86/svm/pmu: Add AMD PerfMonV2 support
Date: Mon, 10 Apr 2023 19:34:12 +0800
Message-ID: <dd7fa8c8-a141-d036-d3e2-826f90eb97a9@gmail.com>
In-Reply-To: <ZDAsaXvx85x+n71S@google.com>

On 7/4/2023 10:44 pm, Sean Christopherson wrote:
> On Fri, Apr 07, 2023, Like Xu wrote:
>> On 7/4/2023 9:35 am, Sean Christopherson wrote:
>>> On Tue, Feb 14, 2023, Like Xu wrote:
>>>> +	case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS:
>>>> +		if (!msr_info->host_initiated)
>>>> +			return 0; /* Writes are ignored */
>>>
>>> Where is the "writes ignored" behavior documented?  I can't find anything in the
>>> APM that defines write behavior.
>>
>> KVM follows the real hardware behavior when the specifications are silent
>> on the details or keep them undisclosed.
> 
> So is that a "this isn't actually documented anywhere" answer?  It's not your
> responsibility to get AMD to document their CPUs, but I want to clearly document
> when KVM's behavior is based solely off of observed hardware behavior, versus an
> actual specification.

Indeed, you draw a clearer line than APM or PPR.

Spec-defined:

	RO: Read-only. Readable; writes are ignored (Per PPR "AccessType Definitions")
	WO: Writable. Reads are undefined. (Per PPR "AccessType Definitions")

For behaviour that the specs leave undefined (or undisclosed), the vPMU will
follow what is observed on real HW. More comments in the new version should
help make that distinction. Please check.
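
To make the RO semantics above concrete, here is a toy model (not KVM code) of
a guest write to a read-only PMU MSR. The names pmu_vendor, ro_msr,
write_ro_msr and the WRITE_* codes are made up for illustration; the only
facts taken from this thread are that AMD ignores such writes, that Intel
generates #GP, and that host-initiated writes are let through:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical vendor tag, for illustration only (not a KVM type). */
enum pmu_vendor { VENDOR_AMD, VENDOR_INTEL };

/* Hypothetical result codes: 0 = write completes, 1 = inject #GP. */
enum { WRITE_OK = 0, WRITE_GP = 1 };

/* Hypothetical stand-in for a read-only status MSR. */
struct ro_msr {
	uint64_t value;
};

static int write_ro_msr(struct ro_msr *msr, enum pmu_vendor vendor,
			bool host_initiated, uint64_t data)
{
	/* Host (userspace) writes are accepted, e.g. for save/restore. */
	if (host_initiated) {
		msr->value = data;
		return WRITE_OK;
	}

	/* AMD: the guest write is silently dropped, the value is unchanged. */
	if (vendor == VENDOR_AMD)
		return WRITE_OK;

	/* Intel: the guest write to an RO MSR faults. */
	return WRITE_GP;
}
```

In both guest cases the stored value is untouched; the vendors differ only in
whether the guest sees a fault.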

> 
>> How about this:
>>
>> 	/*
>> 	 * Note, AMD ignores writes to reserved bits and read-only PMU MSRs,
>> 	 * whereas Intel generates #GP on attempts to write reserved/RO MSRs.
>> 	 */
> 
> Looks good.
> 
>>>> +		pmu->nr_arch_gp_counters = min_t(unsigned int,
>>>> +						 ebx.split.num_core_pmc,
>>>> +						 kvm_pmu_cap.num_counters_gp);
>>>> +	} else if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
>>>>    		pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS_CORE;
>>>
>>> This needs to be sanitized, no?  E.g. if KVM only has access to 4 counters, but
>>> userspace sets X86_FEATURE_PERFCTR_CORE anyways.  Hrm, unless I'm missing something,
>>> that's a pre-existing bug.
>>
>> Now your point is that if user space sets more capability than KVM can
>> support, KVM should constrain it.
>> Your previous preference was that user space can set capabilities even if
>> KVM doesn't support them, as long as that doesn't break KVM or the host, and
>> the guest bears the consequences.
> 
> Letting userspace define a "bad" configuration is perfectly ok, but KVM needs to
> be careful not to endanger itself by consuming the bad state.  A good example is
> the handling of nested SVM features in svm_vcpu_after_set_cpuid().  KVM lets
> userspace define anything and everything, but KVM only actually tries to utilize
> a feature if the feature is actually supported in hardware.
> 
> In this case, it's not clear to me that putting a bogus value into "nr_arch_gp_counters"
> is safe (for KVM).  And AIUI, the guest can't actually use more than
> kvm_pmu_cap.num_counters_gp counters, i.e. KVM isn't arbitrarily restricting the
> setup.

AFAIK, when a guest has more counters (N) than the host (M) and all of them
are enabled, KVM creates an equal number (N) of perf_events, and the host perf
scheduler multiplexes those events onto the M real hardware counters in a
round-robin fashion.

From the point of view of a vCPU, each of its virtual counters occupies a
hardware counter for only part of the time, so it counts only part of the
guest workload, which affects the accuracy. However, from the host security
point of view, too many counters only results in too many perf_events created
by KVM, which is normal usage of the perf subsystem, called perf counter
multiplexing. It seems to be safe (KVM only uses the perf API).
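
A deliberately simplified sketch of that time-sharing, to show why accuracy
drops: N events take turns on M counters, M at a time. This is not how the
perf scheduler is implemented (real rotation and constraints are more
involved); rr_multiplex and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy round-robin multiplexer: n_events share m_counters over `slices`
 * time slices. On return, on_hw[i] holds the number of slices that event i
 * spent on a hardware counter.
 */
static void rr_multiplex(size_t n_events, size_t m_counters,
			 unsigned int slices, unsigned int *on_hw)
{
	size_t head = 0, i;
	unsigned int t;

	assert(n_events > 0 && m_counters > 0);

	for (i = 0; i < n_events; i++)
		on_hw[i] = 0;

	/* With enough counters, every event is always scheduled. */
	if (m_counters > n_events)
		m_counters = n_events;

	for (t = 0; t < slices; t++) {
		/* Schedule the next m_counters events, wrapping around. */
		for (i = 0; i < m_counters; i++)
			on_hw[(head + i) % n_events]++;
		head = (head + m_counters) % n_events;
	}
}
```

With 6 events on 4 counters over 6 slices, every event ends up on hardware for
4 of the 6 slices, i.e. each virtual counter only observes about M/N of the
guest's run time.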

But scheduling too many perf_events also adds performance overhead, so it can
be seen as a performance attack on the scheduling of vCPU processes on the
host.

Back to the diff itself: the intel_pmu code does a similar sanity check, so
here we just let the AMD PMU follow the same decision pattern. Please refer to
the latest version.
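
For reference, the sanity check boils down to taking the smaller of what
userspace asked for and what the host can back, the same idea as the min_t()
in the PerfMonV2 branch quoted above. A standalone sketch, where
clamp_nr_gp_counters is a hypothetical helper and not a KVM function:

```c
/*
 * Hypothetical helper, for illustration only: never expose more GP
 * counters to the guest than the host can actually back.
 */
static unsigned int clamp_nr_gp_counters(unsigned int guest_requested,
					 unsigned int host_supported)
{
	return guest_requested < host_supported ? guest_requested
						: host_supported;
}
```

E.g. with 6 counters requested via CPUID but only 4 available to KVM, the
vPMU would end up with 4.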