From: Elena Ufimtseva <ufimtseva@gmail.com>
To: Wei Liu <wei.liu2@citrix.com>
Cc: Keir Fraser <keir@xen.org>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Matt Wilson <msw@linux.com>,
	Dario Faggioli <dario.faggioli@citrix.com>,
	Li Yechen <lccycc123@gmail.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH v6 00/10] vnuma introduction
Date: Sun, 20 Jul 2014 10:57:44 -0400	[thread overview]
Message-ID: <CAEr7rXgtv=3xP3CRgKr_7z-HjKMj4Bb0PQ+kd9JAnAMXa1XDsw@mail.gmail.com> (raw)
In-Reply-To: <20140718114834.GI7142@zion.uk.xensource.com>



On Fri, Jul 18, 2014 at 7:48 AM, Wei Liu <wei.liu2@citrix.com> wrote:

> On Fri, Jul 18, 2014 at 12:13:36PM +0200, Dario Faggioli wrote:
> > On ven, 2014-07-18 at 10:53 +0100, Wei Liu wrote:
> > > Hi! Another new series!
> > >
> > :-)
> >
> > > On Fri, Jul 18, 2014 at 01:49:59AM -0400, Elena Ufimtseva wrote:
> >
> > > > The workaround is to specify cpuid in config file and not use SMT.
> > > > But soon I will come up with some other acceptable solution.
> > > >
> > >
> > For Elena, workaround like what?
>

In the workaround I used, I adjusted the vcpus' cache-related cpuid settings in
the config file (as we have HT/SMT turned on on the host), so the guest does not
see the shared-cache topology.
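
Roughly, such an override has this shape in the xl config file (a sketch of the
libxl cpuid syntax only, to show the idea; it is not the exact set of leaves and
bits I tweaked):

    # illustration: start from the host's CPUID values and hide
    # hyperthreading by clearing the HTT feature bit (leaf 1, EDX[28])
    cpuid = "host,htt=0"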



>  >
> > > I've also encountered this. I suspect that even if you disable SMT with
> > > cpuid in the config file, the cpu topology in the guest might still be wrong.
> > >
> > Can I ask why?
> >
>
> Because for a PV guest (currently) the guest kernel sees the real "ID"s
> for a cpu. See those "ID"s I change in my hacky patch.
>

Yep, that's what I see as well.

>
> > > What do hwloc-ls and lscpu show? Do you see any weird topology like one
> > > core belongs to one node while three belong to another?
> > >
> > Yep, that would be interesting to see.
> >
> > >  (I suspect not
> > > because your vcpus are already pinned to a specific node)
> > >
> > Sorry, I'm not sure I follow here... Are you saying that things probably
> > work ok, but (only) because of pinning?
>
> Yes, given that you derive numa memory allocation from cpu pinning, or
> use a combination of cpu pinning, vcpu-to-vnode map and vnode-to-pnode
> map; in those cases those IDs might reflect the right topology.
>
> >
> > I may be missing something here, but would it be possible to at least
> > try to make sure that the virtual topology and the topology-related
> > content of CPUID actually agree? And I mean doing it automatically (if
>
> This is what I'm doing in my hack. :-)
>
> > only one of the two is specified) and to either error or warn if that is
> > not possible (if both are specified and they disagree)?
> >
> > I admit I'm not a CPUID expert, but I always thought this could be a
> > good solution...
> >
> > > What I did was to manipulate various "id"s in the Linux kernel, so that I
> > > create a topology with a 1 core : 1 cpu : 1 socket mapping.
> > >
> > And how does this topology map/interact with the virtual topology we want
> > the guest to have?
> >
>
> Say you have a two-node guest with 4 vcpus: you now have two sockets
> per node, each socket has one cpu, and each cpu has one core.
>
> Node 0:
>   Socket 0:
>     CPU0:
>       Core 0
>   Socket 1:
>     CPU 1:
>       Core 1
> Node 1:
>   Socket 2:
>     CPU 2:
>       Core 2
>   Socket 3:
>     CPU 3:
>       Core 3
>
> > > In that case the
> > > guest scheduler won't be able to make any assumptions about individual CPUs
> > > sharing caches with each other.
> > >
> > And, apart from SMT, what topology does the guest see then?
> >
>
> See above.
>
> > In any case, if this only alters SMT-ness (where "alter"="disable"), I
> > think that is fine too. What I'm failing to see is whether and why
> > this approach is more powerful than manipulating CPUID from the config file.
> >
> > I'm insisting because, if they were equivalent in terms of results, I
> > think it's easier, cleaner and more correct to deal with CPUID in xl and
> > libxl (automatically or semi-automatically).
> >
>
> SMT is just one aspect of the story that easily surfaces.
>
> In my opinion, if we don't manually create some kind of topology for the
> guest, the guest might end up with something weird. For example, if you
> have a 2-node, 4-socket, 8-cpu, 8-core system, you might have
>
> Node 0:
>   Socket 0
>     CPU0
>   Socket 1
>     CPU1
> Node 1:
>   Socket 2
>     CPU 3
>     CPU 4
>
> which all stems from the guest having knowledge of the real CPU "ID"s.
>
> And this topology is just wrong; it might only be valid at guest
> creation time. Xen is free to schedule vcpus on different pcpus, so the guest
> scheduler will make wrong decisions based on erroneous information.
>
> That's why I chose a 1 core : 1 cpu : 1 socket mapping, so that the
> guest makes no assumptions about cache sharing etc. It's suboptimal but
> should provide predictable average performance. What do you think?
>

Running lstopo with vNUMA enabled in a guest with 4 vnodes and 8 vcpus:
root@heatpipe:~# lstopo

Machine (7806MB) + L3 L#0 (7806MB 10MB) + L2 L#0 (7806MB 256KB) + L1d L#0 (7806MB 32KB) + L1i L#0 (7806MB 32KB)
  NUMANode L#0 (P#0 1933MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#4)
  NUMANode L#1 (P#1 1967MB) + Socket L#1
    Core L#2 + PU L#2 (P#1)
    Core L#3 + PU L#3 (P#5)
  NUMANode L#2 (P#2 1969MB) + Socket L#2
    Core L#4 + PU L#4 (P#2)
    Core L#5 + PU L#5 (P#6)
  NUMANode L#3 (P#3 1936MB) + Socket L#3
    Core L#6 + PU L#6 (P#3)
    Core L#7 + PU L#7 (P#7)

Basically, L2 and L1 are shared between nodes :)

I have manipulated the cache-sharing options in cpuid before, but I agree with
Wei that it is just part of the problem.
Along with the number of logical processors (if HT is enabled), I guess we need
to construct APIC IDs (if that is not done yet; I could not find it), and the
cache-sharing CPUID leaves may be needed as well, taking pinning into account
if it is set.

As it's described here:
https://software.intel.com/en-us/articles/methods-to-utilize-intels-hyper-threading-technology-with-linux

"The Initial APIC ID is composed of the physical processor's ID and the
logical processor's ID within the physical processor. The least significant
bits of the APIC ID are used to identify the logical processors within a
single physical processor."
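
So, as a rough illustration of that scheme (the bit widths here are just an
example; the real ones come from CPUID leaves 1 and 4), with 2 logical
processors per package the initial APIC IDs would be composed roughly as:

    APIC ID = (package_id << 1) | logical_id

    package 0, logical 0  ->  0
    package 0, logical 1  ->  1
    package 1, logical 0  ->  2
    package 1, logical 1  ->  3

and that is the kind of thing we would need to construct consistently with the
virtual topology (and with pinning, if it is set).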




>
> Wei.
>



-- 
Elena




