From: Joao Martins <joao.m.martins@oracle.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>,
	Xen-devel <xen-devel@lists.xen.org>
Subject: Re: DESIGN v2: CPUID part 3
Date: Wed, 2 Aug 2017 11:34:05 +0100	[thread overview]
Message-ID: <d537291c-8de2-1597-bb5f-78338dfbf703@oracle.com> (raw)
In-Reply-To: <89da4ab7-a0e1-52c6-3003-7b68e7b6eedb@citrix.com>

On 08/01/2017 07:34 PM, Andrew Cooper wrote:
> On 31/07/2017 20:49, Konrad Rzeszutek Wilk wrote:
>> On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
>>> On 07/05/2017 12:16 PM, Andrew Cooper wrote:
>>>> On 05/07/17 10:46, Joao Martins wrote:
>>>>> Hey Andrew,
>>>>>
>>>>> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>>>>>
>>>>>> (RFC: Decide exactly where to fit this.  XEN_DOMCTL_max_vcpus perhaps?)
>>>>>> The toolstack shall also have a mechanism to explicitly select topology
>>>>>> configuration for the guest, which primarily affects the virtual APIC ID
>>>>>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>>>>>> Xen's auditing shall ensure that guests observe values consistent with the
>>>>>> guarantees made by the vendor manuals.
>>>>>>
>>>>> Why choose max_vcpus domctl?
>>>> Despite its name, the max_vcpus hypercall is the one which allocates all
>>>> the vcpus in the hypervisor.  I don't want there to be any opportunity
>>>> for vcpus to exist but no topology information to have been provided.
>>>>
>>> /nods
>>>
>>> So then doing this at vcpus allocation we would need to pass an additional CPU
>>> topology argument on the max_vcpus hypercall? Otherwise it's sort of guess work
>>> wrt sockets, cores, threads ... no?
>> Andrew, thoughts on this and the one below?
> 
> Urgh sorry.  I've been distracted with some high priority interrupts (of
> the non-maskable variety).
> 
> So, bad news is that the CPUID and MSR policy handling has become
> substantially more complicated and entwined than I had first planned.  A
> change in either of the data alters the auditing of the other, so I am
> leaning towards implementing everything with a single set hypercall (as
> this is the only way to get a plausibly-consistent set of data).
> 
> The good news is that I don't think we actually need any changes to the
> XEN_DOMCTL_max_vcpus.  I now think there is sufficient expressibility in
> the static cpuid policy to work.
> 
Awesome!

>>> There could be other uses too on passing this info to Xen, say e.g. the
>>> scheduler knowing the guest CPU topology it would allow better selection of
>>> core+sibling pair such that it could match cache/cpu topology passed on the
>>> guest (for unpinned SMT guests).
> 
> I remain to be convinced (i.e. with some real performance numbers) that
> the added complexity in the scheduler for that logic is a benefit in the
> general case.
> 
The suggestion above was a simple extension to struct domain (e.g. a
cores/threads or struct cpu_topology field) - nothing too disruptive, I think.

But I cannot really argue the case, as this was just an idea I found
interesting (I have no numbers to support it). We just happened to see it
under-perform when a simple range of cpus was used for affinity, and IIRC some
vcpus ended up being scheduled on the same core+sibling pair; hence I (perhaps
naively) imagined there could be value in further scheduler enlightenment,
e.g. "gang scheduling", where a core+sibling pair is always scheduled
together. I spoke to Dario (CC'ed) at the summit about whether CPU topology
could have value for the scheduler - there might be, but it remains to be
explored once we're able to pass a CPU topology to the guest. (In the past he
seemed enthusiastic about the topology idea[0], hence I assumed it was in the
context of schedulers.)

[0] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg03850.html

> In practice, customers are either running very specific and dedicated
> workloads (at which point pinning is used and there is no
> oversubscription, and exposing the actual SMT topology is a good thing),
>
/nods

> or customers are running general workloads with no pinning (or perhaps
> cpupool-numa-split) with a moderate amount of oversubscription (at which
> point exposing SMT is a bad move).
> 
Given the scale you folks invest in over-subscription (1000 VMs), I wonder
what moderate means here :P

> Counterintuitively, exposing NUMA in general oversubscribed scenarios is
> terrible for net system performance.  What happens in practice is that
> VMs which see NUMA spend their idle cycles trying to balance their own
> userspace processes, rather than yielding to the hypervisor so another
> guest can get a go.
> 
Interesting to know - perhaps vNUMA is better suited to performance cases
where I/O topology and/or memory locality matter, or when going for bigger
guests - provided that the corresponding CPU topology is also exposed.

Joao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Thread overview: 19+ messages
2017-06-08 13:12 DESIGN: CPUID part 3 Andrew Cooper
2017-06-08 13:47 ` Jan Beulich
2017-06-12 13:07   ` Andrew Cooper
2017-06-12 13:29     ` Jan Beulich
2017-06-12 13:36       ` Andrew Cooper
2017-06-12 13:42         ` Jan Beulich
2017-06-12 14:02           ` Andrew Cooper
2017-06-12 14:18             ` Jan Beulich
2017-06-09 12:24 ` Anshul Makkar
2017-06-12 13:21   ` Andrew Cooper
2017-07-04 14:55 ` DESIGN v2: " Andrew Cooper
2017-07-05  9:46   ` Joao Martins
2017-07-05 10:32     ` Joao Martins
2017-07-05 11:16     ` Andrew Cooper
2017-07-05 13:22       ` Joao Martins
2017-07-31 19:49         ` Konrad Rzeszutek Wilk
2017-08-01 18:34           ` Andrew Cooper
2017-08-02 10:34             ` Joao Martins [this message]
2017-08-03  2:55               ` Dario Faggioli
