On 27/07/2015 18:42, Dario Faggioli wrote:
> On Mon, 2015-07-27 at 17:33 +0100, Andrew Cooper wrote:
>> On 27/07/15 17:31, David Vrabel wrote:
>>>
>>>> Yeah, indeed. That's the downside of Juergen's "Linux scheduler
>>>> approach". But the issue is there, even without taking vNUMA into
>>>> account, and I think something like that would really help (only for
>>>> Dom0, and Linux guests, of course).
>>> I disagree. Whether we're using vNUMA or not, Xen should still ensure
>>> that the guest kernel and userspace see a consistent and correct
>>> topology using the native mechanisms.
>>
>> +1
>>
> +1 from me as well. In fact, a mechanism for making exactly that
> happen was what I was after when starting the thread.
>
> Then it came up that CPUID needs to be used for at least two different
> and potentially conflicting purposes, that we want to support both,
> and that, whichever purpose it is being used for, Linux configures its
> scheduler based on it, potentially resulting in rather pathological
> setups.
I don't see what the problem is here. Fundamentally, "NUMA
optimise" vs "comply with licence" is a user/admin decision at boot
time, and we need not cater to both halves at the same time.
Supporting either, as chosen by the admin, is worthwhile.
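For reference on the "NUMA optimise" half: the topology Linux feeds
its scheduler comes, on recent Intel hardware, from CPUID leaf 0xB
(with leaves 1 and 4 as a fallback). A minimal sketch of that
enumeration, readable from guest userspace with GCC's <cpuid.h>; the
decoding follows the Intel SDM and nothing here is Xen-specific:

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx, level;

        /* Check that leaf 0xB exists before poking at it. */
        __cpuid(0, eax, ebx, ecx, edx);
        if (eax < 0xb) {
            fprintf(stderr, "CPUID leaf 0xB not available\n");
            return 1;
        }

        /* Leaf 0xB, extended topology enumeration: each sub-leaf
         * describes one level.  EAX[4:0] is the number of x2APIC ID
         * bits to shift away to reach the next level; ECX[15:8] is
         * the level type (1 = SMT, 2 = Core, 0 = end of list). */
        for (level = 0; ; level++) {
            unsigned int type;

            __cpuid_count(0xb, level, eax, ebx, ecx, edx);
            type = (ecx >> 8) & 0xff;
            if (type == 0)
                break;
            printf("level %u: type %s, shift %u, x2APIC ID %u\n",
                   level, type == 1 ? "SMT" : type == 2 ? "Core" : "?",
                   eax & 0x1f, edx);
        }
        return 0;
    }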
>
>
> It's at that point that some decoupling started to appear
> interesting... :-P
>
> Also, are we really being consistent? If my methodology is correct
> (which it might not be; please double-check, and sorry for that), I'm
> seeing quite a lot of inconsistency:
>
> HOST:
> root@Zhaman:~# xl info -n
> ...
> cpu_topology          :
> cpu:    core    socket     node
>   0:       0         1        0
>   1:       0         1        0
>   2:       1         1        0
>   3:       1         1        0
>   4:       9         1        0
>   5:       9         1        0
>   6:      10         1        0
>   7:      10         1        0
>   8:       0         0        1
>   9:       0         0        1
>  10:       1         0        1
>  11:       1         0        1
>  12:       9         0        1
>  13:       9         0        1
>  14:      10         0        1
>  15:      10         0        1
o_O
What kind of system results in this layout? Can you dump the ACPI
tables and make them available?
>
> ...
> root@Zhaman:~# xl vcpu-list test
> Name            ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
> test             2     0     0   r--       1.5  0 / all
> test             2     1     1   r--       0.2  1 / all
> test             2     2     8   -b-       2.2  8 / all
> test             2     3     9   -b-       2.0  9 / all
>
> GUEST (HVM, 4 vcpus):
> root@test:~# cpuid|grep CORE_ID
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=1
> (APIC synth): PKG_ID=0 CORE_ID=0 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=0 SMT_ID=1
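For anyone double-checking the methodology: the "(APIC synth)" lines
are the cpuid tool decomposing the initial APIC ID from leaf 1 using
the topology widths derived from leaves 1 and 4. A minimal sketch of
that standard decomposition, per the Intel SDM; this is illustrative,
not the tool's actual code:

    #include <stdio.h>
    #include <cpuid.h>

    /* Smallest number of bits able to hold n distinct values. */
    static unsigned int width(unsigned int n)
    {
        unsigned int w = 0;

        while ((1u << w) < n)
            w++;
        return w;
    }

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        unsigned int apicid, logical, cores, smt_w, core_w;

        /* Leaf 1: EBX[31:24] = initial APIC ID of this (v)CPU,
         * EBX[23:16] = max logical processors per package (only
         * meaningful if the HTT flag, EDX bit 28, is set). */
        __cpuid(1, eax, ebx, ecx, edx);
        apicid  = ebx >> 24;
        logical = (ebx >> 16) & 0xff;

        /* Leaf 4, sub-leaf 0 (Intel): EAX[31:26] + 1 = max cores
         * per package. */
        __cpuid_count(4, 0, eax, ebx, ecx, edx);
        cores = (eax >> 26) + 1;

        smt_w  = width(logical / cores);  /* bits of the SMT ID  */
        core_w = width(cores);            /* bits of the core ID */

        printf("PKG_ID=%u CORE_ID=%u SMT_ID=%u\n",
               apicid >> (smt_w + core_w),
               (apicid >> smt_w) & ((1u << core_w) - 1),
               apicid & ((1u << smt_w) - 1));
        return 0;
    }

Running that pinned to each vCPU in turn should reproduce the triples
above (or fail to, which is rather the point).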
>
> HOST:
> root@Zhaman:~# xl vcpu-pin 2 all 0
> root@Zhaman:~# xl vcpu-list 2
> Name            ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
> test             2     0     0   -b-      43.7  0 / all
> test             2     1     0   -b-      38.4  0 / all
> test             2     2     0   -b-      36.9  0 / all
> test             2     3     0   -b-      38.8  0 / all
>
> GUEST:
> root@test:~# cpuid|grep CORE_ID
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
> (APIC synth): PKG_ID=0 CORE_ID=16 SMT_ID=0
>
> HOST:
> root@Zhaman:~# xl vcpu-pin 2 0 7
> root@Zhaman:~# xl vcpu-pin 2 1 7
> root@Zhaman:~# xl vcpu-pin 2 2 15
> root@Zhaman:~# xl vcpu-pin 2 3 15
> root@Zhaman:~# xl vcpu-list 2
> Name            ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
> test             2     0     7   -b-      44.3  7 / all
> test             2     1     7   -b-      38.9  7 / all
> test             2     2    15   -b-      37.3 15 / all
> test             2     3    15   -b-      39.2 15 / all
>
> GUEST:
> root@test:~# cpuid|grep CORE_ID
> (APIC synth): PKG_ID=0 CORE_ID=26 SMT_ID=1
> (APIC synth): PKG_ID=0 CORE_ID=26 SMT_ID=1
> (APIC synth): PKG_ID=0 CORE_ID=10 SMT_ID=1
> (APIC synth): PKG_ID=0 CORE_ID=10 SMT_ID=1
>
> So, it looks to me that:
> 1) any application using CPUID for either licensing or
>    placement/performance optimization will get (potentially) random
>    results;
> 2) whatever set of values the kernel used during guest boot to build
>    up its internal scheduling data structures has no guarantee of
>    being related to any value returned by CPUID at a later point.
>
> Hence, I think I'm seeing inconsistency between kernel and userspace
> (and between userspace and itself, over time) already... Am I
> overlooking something?
All current CPUID values presented to guests are about as reliable
as being picked from /dev/urandom. (This isn't strictly true; the
feature flags will be in the right ballpark if the VM has not
migrated yet.)
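An easy way to watch this from inside a guest is to sample the
initial APIC ID over time; on a setup like the one shown above it can
change whenever the vCPU moves between pCPUs. A minimal sketch,
assuming nothing beyond GCC's <cpuid.h>:

    #include <stdio.h>
    #include <unistd.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        unsigned int prev = ~0u;
        int i;

        /* Sample the initial APIC ID (leaf 1, EBX[31:24]) once per
         * second; if the hypervisor is letting the underlying pCPU's
         * value through, it changes as the vCPU is migrated or
         * re-pinned. */
        for (i = 0; i < 30; i++) {
            unsigned int apicid;

            __cpuid(1, eax, ebx, ecx, edx);
            apicid = ebx >> 24;
            if (apicid != prev) {
                printf("t=%2ds: initial APIC ID now %u\n", i, apicid);
                prev = apicid;
            }
            sleep(1);
        }
        return 0;
    }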
Fixing this (as described in my feature levelling design document)
is sufficiently non-trivial that it has been deferred to
post-feature-levelling work.
~Andrew