On Wed, 2017-08-02 at 11:34 +0100, Joao Martins wrote:
> On 08/01/2017 07:34 PM, Andrew Cooper wrote:
> > > On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
> > > > There could be other uses too on passing this info to Xen: say,
> > > > e.g., the scheduler knowing the guest CPU topology would allow
> > > > better selection of core+sibling pairs, such that it could match
> > > > the cache/CPU topology passed on to the guest (for unpinned SMT
> > > > guests).
> >
> > I remain to be convinced (i.e. with some real performance numbers)
> > that the added complexity in the scheduler for that logic is a
> > benefit in the general case.
>
> The suggestion above was a simple extension to struct domain (e.g. a
> cores/threads or struct cpu_topology field) - nothing too disruptive,
> I think.
>
> But I cannot really argue on this, as this was just an idea that I
> found interesting (no numbers to support it entirely). We just
> happened to see it under-perform when a simple range of CPUs was used
> for affinity, and some vCPUs ended up being scheduled on the same
> core+sibling pair, IIRC; hence I (perhaps naively) imagined that there
> could be value in further scheduler enlightenment, e.g.
> "gang-scheduling", where we always schedule core+sibling together. I
> was speaking to Dario (CC'ed) at the summit about whether CPU topology
> could have value - and there might be, but it remains to be explored
> once we're able to pass a CPU topology to the guest. (In the past
> there seemed to be enthusiasm about the idea of the topology[0], hence
> I assumed it was in the context of schedulers.)
>
> [0] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg03850.html
>
> > In practice, customers are either running very specific and
> > dedicated workloads (at which point pinning is used and there is no
> > oversubscription, and exposing the actual SMT topology is a good
> > thing),
>
> /nods

I am enthusiastic about there being a way to specify the CPU topology
of a guest explicitly.

The way we can take advantage of this, at least as a first step, is:
when the guest is pinned, and two of its vCPUs are pinned to two host
hyperthreads, make the two vCPUs hyperthreads as well, from the guest's
point of view. Then it will be the guest's (e.g., Linux's) scheduler
that does something clever with this information, so there is no need
for adding complexity anywhere (well, in theory, in the guest
scheduler, but in practice that code is there already!).

Or, on the other hand, if pinning is *not* used, then I'd use this
mechanism to tell the guest that there is no relationship between its
vCPUs whatsoever. In fact, currently --sticking to SMT as the example--
by not specifying the topology explicitly, there may be cases where the
guest scheduler comes to think that two vCPUs are SMT siblings, while
they either are not, or (if no pinning is in place) may or may not be,
depending on which pCPUs the two vCPUs are executing on at any given
time. This means the guest scheduler's SMT optimization logic will
trigger, when it would probably have been better if it hadn't.

These are the first two use cases that, as the "scheduler guy", I'm
interested in using this feature for. Then there is indeed the chance
of using the guest topology to affect the decisions of Xen's scheduler,
e.g., to implement some form of gang scheduling, or to force two vCPUs
to be executed on pCPUs that respect such topology... But this is all
still in the "wild ideas" camp, for now. :-D
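Just to make the "first step" above a bit more concrete, here is a
rough, standalone sketch (purely illustrative: struct cpu_topology,
host_core_of and the pinning array are made-up names, and this is not
actual Xen code) of a per-domain cores/threads record and of deriving
guest SMT siblings from per-vCPU hard affinity:

    /* Illustrative only -- not Xen code.  A hypothetical per-domain
     * topology record (the "cores/threads" idea above) and a loop that,
     * given each vCPU's pinned pCPU and the host's threads-per-core
     * layout, pairs up vCPUs sitting on sibling hyperthreads so they
     * could be reported to the guest as SMT siblings. */
    #include <stdio.h>

    #define NR_VCPUS              4
    #define HOST_THREADS_PER_CORE 2

    struct cpu_topology {
        unsigned int cores;
        unsigned int threads_per_core;
    };

    /* pCPU -> host core: with 2 threads/core, pCPUs 0,1 share core 0, etc. */
    static unsigned int host_core_of(unsigned int pcpu)
    {
        return pcpu / HOST_THREADS_PER_CORE;
    }

    int main(void)
    {
        /* Hard affinity chosen by the admin: vCPU i is pinned to pinned_pcpu[i]. */
        unsigned int pinned_pcpu[NR_VCPUS] = { 0, 1, 4, 5 };
        struct cpu_topology topo = { .cores = NR_VCPUS, .threads_per_core = 1 };

        for (unsigned int i = 0; i < NR_VCPUS; i++)
            for (unsigned int j = i + 1; j < NR_VCPUS; j++)
                if (host_core_of(pinned_pcpu[i]) == host_core_of(pinned_pcpu[j]))
                {
                    /* These two vCPUs live on sibling hyperthreads, so the
                     * guest could be told they are SMT siblings too. */
                    printf("vCPU%u and vCPU%u -> guest siblings\n", i, j);
                    topo.threads_per_core = HOST_THREADS_PER_CORE;
                }

        topo.cores = NR_VCPUS / topo.threads_per_core;
        printf("guest topology: %u core(s), %u thread(s)/core\n",
               topo.cores, topo.threads_per_core);
        return 0;
    }

With the { 0, 1, 4, 5 } pinning and 2 threads per host core, this pairs
vCPU0/vCPU1 and vCPU2/vCPU3 as guest siblings, which is the pinned-guest
case described above; with no pinning one would instead report
threads_per_core = 1, i.e. "no relationship between the vCPUs".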
> > or customers are running general workloads with no pinning (or
> > perhaps cpupool-numa-split) with a moderate amount of
> > oversubscription (at which point exposing SMT is a bad move).
>
> Given the scale you folks invest in over-subscription (1000 VMs), I
> wonder what "moderate" means here :P
>
> > Counterintuitively, exposing NUMA in general oversubscribed
> > scenarios is terrible for net system performance.  What happens in
> > practice is that VMs which see NUMA spend their idle cycles trying
> > to balance their own userspace processes, rather than yielding to
> > the hypervisor so another guest can get a go.
>
For NUMA-aware workloads running in guests, the guests themselves doing
something sane with both the placement and the balancing of tasks and
memory is a good thing, and it will improve the performance of the
workload itself. Provided the (topology) information used for doing
this placement and balancing is accurate... and vNUMA is what makes it
accurate. So, IMO, for big guests running NUMA-aware workloads, vNUMA
will improve things most of the time.

I totally don't get the part where a vCPU becoming idle is what Xen
needs for running other guests' vCPUs... Xen does not at all rely on a
vCPU periodically blocking or yielding in order to let another vCPU
run, neither in undersubscribed nor in oversubscribed scenarios. It's
preemptive multitasking, not cooperative multitasking, that we do,
i.e., running vCPUs are preempted when it's the turn of some other vCPU
to execute. The only thing we gain from the fact that vCPUs go idle
from time to time is that we may be able to let the actual pCPUs sleep
a bit, and hence save power and produce less heat, but that's mostly
the case in undersubscribed scenarios, not in oversubscribed ones.

> Interesting to know - vNUMA perhaps is only better placed for
> performance cases where both (or either) I/O topology and memory
> locality matter - or when going for bigger guests. Provided that the
> corresponding CPU topology is provided.
>
Exactly: it matters if the guest is big enough and/or "NUMA enough"
(e.g., as you say, both its memory and I/O accesses are sensitive and
suffer from having to take the long route), and if the workload is also
NUMA-aware.

And yes, vNUMA needs the topology information to be accurate and
consistent (the small sketch appended below gives one example of what
"consistent" can mean).

Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
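P.S. Since I mentioned "consistent": here is a tiny, standalone toy
check (all names made up, nothing taken from Xen or libxl) that, given
the SMT topology a guest is told about and a vCPU-to-virtual-node
assignment, flags vNUMA layouts that split sibling vCPUs across virtual
NUMA nodes -- one of the ways the two pieces of topology information
can end up contradicting each other:

    /* Illustrative only.  "Consistent" here means: vCPUs that the guest
     * believes to be SMT siblings should not be assigned to different
     * virtual NUMA nodes. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_VCPUS          8
    #define THREADS_PER_CORE  2   /* guest SMT topology: siblings are {0,1}, {2,3}, ... */

    static bool vnuma_consistent_with_smt(const unsigned int vnode_of[NR_VCPUS])
    {
        for (unsigned int v = 0; v < NR_VCPUS; v += THREADS_PER_CORE)
            for (unsigned int t = 1; t < THREADS_PER_CORE; t++)
                if (vnode_of[v] != vnode_of[v + t])
                    return false;   /* siblings ended up on different vnodes */
        return true;
    }

    int main(void)
    {
        unsigned int good[NR_VCPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
        unsigned int bad[NR_VCPUS]  = { 0, 0, 0, 1, 1, 1, 0, 1 }; /* splits {2,3} and {6,7} */

        printf("good layout consistent: %s\n",
               vnuma_consistent_with_smt(good) ? "yes" : "no");
        printf("bad layout consistent:  %s\n",
               vnuma_consistent_with_smt(bad) ? "yes" : "no");
        return 0;
    }

Whatever tool ends up generating both the vCPU topology and the vNUMA
layout for a guest would want to enforce something along these lines,
so that the two never disagree.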