On 27/07/2015 18:42, Dario Faggioli wrote:
> On Mon, 2015-07-27 at 17:33 +0100, Andrew Cooper wrote:
>> On 27/07/15 17:31, David Vrabel wrote:
>>>
>>>> Yeah, indeed. That's the downside of Juergen's "Linux scheduler
>>>> approach". But the issue is there, even without taking vNUMA into
>>>> account, and I think something like that would really help (only for
>>>> Dom0, and Linux guests, of course).
>>> I disagree. Whether we're using vNUMA or not, Xen should still ensure
>>> that the guest kernel and userspace see a consistent and correct
>>> topology using the native mechanisms.
>>
>> +1
>>
> +1 from me as well. In fact, a mechanism for making exactly that
> happen was what I was after when starting the thread.
>
> Then it came up that CPUID needs to be used for at least two different
> and potentially conflicting purposes, that we want to support both,
> and that, whichever purpose it is being used for, Linux configures its
> scheduler based on it, potentially resulting in rather pathological
> setups.
I don't see what the problem is here. Fundamentally, "NUMA
optimise" vs "comply with licence" is a user/admin decision at boot
time, and we need not cater to both halves at the same time.
Supporting either, as chosen by the admin, is worthwhile.
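For reference on the "NUMA optimise" half: the topology Linux feeds
its scheduler comes, on recent Intel hardware, from CPUID leaf 0xB
(with leaves 1 and 4 as a fallback). A minimal sketch of that
enumeration, readable from guest userspace with GCC's <cpuid.h>; the
decoding follows the Intel SDM and nothing here is Xen-specific:

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx, level;

        /* Check that leaf 0xB exists before poking at it. */
        __cpuid(0, eax, ebx, ecx, edx);
        if (eax < 0xb) {
            fprintf(stderr, "CPUID leaf 0xB not available\n");
            return 1;
        }

        /* Leaf 0xB, extended topology enumeration: each sub-leaf
         * describes one level.  EAX[4:0] is the number of x2APIC ID
         * bits to shift away to reach the next level; ECX[15:8] is
         * the level type (1 = SMT, 2 = Core, 0 = end of list). */
        for (level = 0; ; level++) {
            unsigned int type;

            __cpuid_count(0xb, level, eax, ebx, ecx, edx);
            type = (ecx >> 8) & 0xff;
            if (type == 0)
                break;
            printf("level %u: type %s, shift %u, x2APIC ID %u\n",
                   level, type == 1 ? "SMT" : type == 2 ? "Core" : "?",
                   eax & 0x1f, edx);
        }
        return 0;
    }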
>
>
> It's at that point that some decoupling started to appear
> interesting... :-P
>
> Also, are we really being consistent? If my methodology is correct
> (which it might not be; please double-check, and sorry for that), I'm
> seeing quite a lot of inconsistency:
>
> HOST:
> root@Zhaman:~# xl info -n
> ...
> cpu_topology          :
> cpu:    core    socket     node
>   0:       0         1        0
>   1:       0         1        0
>   2:       1         1        0
>   3:       1         1        0
>   4:       9         1        0
>   5:       9         1        0
>   6:      10         1        0
>   7:      10         1        0
>   8:       0         0        1
>   9:       0         0        1
>  10:       1         0        1
>  11:       1         0        1
>  12:       9         0        1
>  13:       9         0        1
>  14:      10         0        1
>  15:      10         0        1
o_O
What kind of system results in this layout? Can you dump the ACPI
tables and make them available?
>
> ...
> root@Zhaman:~# xl vcpu-list test
> Name            ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
> test             2     0     0   r--       1.5  0 / all
> test             2     1     1   r--       0.2  1 / all
> test             2     2     8   -b-       2.2  8 / all
> test             2     3     9   -b-       2.0  9 / all
>
> GUEST (HVM, 4 vcpus):
> root@test:~# cpuid|grep CORE_ID
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=1
> (APIC synth): PKG_ID=0 CORE_ID=0 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=0 SMT_ID=1
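For anyone double-checking the methodology: the "(APIC synth)" lines
are the cpuid tool decomposing the initial APIC ID from leaf 1 using
the topology widths derived from leaves 1 and 4. A minimal sketch of
that standard decomposition, per the Intel SDM; this is illustrative,
not the tool's actual code:

    #include <stdio.h>
    #include <cpuid.h>

    /* Smallest number of bits able to hold n distinct values. */
    static unsigned int width(unsigned int n)
    {
        unsigned int w = 0;

        while ((1u << w) < n)
            w++;
        return w;
    }

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        unsigned int apicid, logical, cores, smt_w, core_w;

        /* Leaf 1: EBX[31:24] = initial APIC ID of this (v)CPU,
         * EBX[23:16] = max logical processors per package (only
         * meaningful if the HTT flag, EDX bit 28, is set). */
        __cpuid(1, eax, ebx, ecx, edx);
        apicid  = ebx >> 24;
        logical = (ebx >> 16) & 0xff;

        /* Leaf 4, sub-leaf 0 (Intel): EAX[31:26] + 1 = max cores
         * per package. */
        __cpuid_count(4, 0, eax, ebx, ecx, edx);
        cores = (eax >> 26) + 1;

        smt_w  = width(logical / cores);  /* bits of the SMT ID  */
        core_w = width(cores);            /* bits of the core ID */

        printf("PKG_ID=%u CORE_ID=%u SMT_ID=%u\n",
               apicid >> (smt_w + core_w),
               (apicid >> smt_w) & ((1u << core_w) - 1),
               apicid & ((1u << smt_w) - 1));
        return 0;
    }

Running that pinned to each vCPU in turn should reproduce the triples
above (or fail to, which is rather the point).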
>
> HOST:
> root@Zhaman:~# xl vcpu-pin 2 all 0
> root@Zhaman:~# xl vcpu-list 2
> Name            ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
> test             2     0     0   -b-      43.7  0 / all
> test             2     1     0   -b-      38.4  0 / all
> test             2     2     0   -b-      36.9  0 / all
> test             2     3     0   -b-      38.8  0 / all
>
> GUEST:
> root@test:~# cpuid|grep CORE_ID
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
>
> HOST:
> root@Zhaman:~# xl vcpu-pin 2 0 7
> root@Zhaman:~# xl vcpu-pin 2 1 7
> root@Zhaman:~# xl vcpu-pin 2 2 15
> root@Zhaman:~# xl vcpu-pin 2 3 15
> root@Zhaman:~# xl vcpu-list 2
> Name            ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
> test             2     0     7   -b-      44.3  7 / all
> test             2     1     7   -b-      38.9  7 / all
> test             2     2    15   -b-      37.3 15 / all
> test             2     3    15   -b-      39.2 15 / all
>
> GUEST:
> root@test:~# cpuid|grep CORE_ID
> (APIC synth): PKG_ID=0 CORE_ID=26 SMT_ID=1
> (APIC synth): PKG_ID=0 CORE_ID=26 SMT_ID=1
> (APIC synth): PKG_ID=0 CORE_ID=10 SMT_ID=1
> (APIC synth): PKG_ID=0 CORE_ID=10 SMT_ID=1
>
> So, it looks to me that:
> 1) any application using CPUID for either licensing or
>    placement/performance optimization will get (potentially) random
>    results;
> 2) whatever set of values the kernel used during guest boot to build
>    up its internal scheduling data structures has no guarantee of
>    being related to any value returned by CPUID at a later point.
>
> Hence, I think I'm seeing inconsistency between kernel and userspace
> (and between userspace and itself, over time) already... Am I
> overlooking something?
All current CPUID values presented to guests are about as reliable
as being picked from /dev/urandom. (This isn't strictly true; the
feature flags will be in the right ballpark if the VM has not
migrated yet.)
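An easy way to watch this from inside a guest is to sample the
initial APIC ID over time; on a setup like the one shown above it can
change whenever the vCPU moves between pCPUs. A minimal sketch,
assuming nothing beyond GCC's <cpuid.h>:

    #include <stdio.h>
    #include <unistd.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        unsigned int prev = ~0u;
        int i;

        /* Sample the initial APIC ID (leaf 1, EBX[31:24]) once per
         * second; if the hypervisor is letting the underlying pCPU's
         * value through, it changes as the vCPU is migrated or
         * re-pinned. */
        for (i = 0; i < 30; i++) {
            unsigned int apicid;

            __cpuid(1, eax, ebx, ecx, edx);
            apicid = ebx >> 24;
            if (apicid != prev) {
                printf("t=%2ds: initial APIC ID now %u\n", i, apicid);
                prev = apicid;
            }
            sleep(1);
        }
        return 0;
    }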
Fixing this (as described in my feature levelling design document)
is sufficiently non-trivial that it has been deferred to
post-feature-levelling work.
~Andrew