From: Juergen Gross <jgross@suse.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	Wei Liu <wei.liu2@citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Jan Beulich <JBeulich@suse.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>
Subject: Re: PV-vNUMA issue: topology is misinterpreted by the guest
Date: Fri, 24 Jul 2015 12:28:52 +0200
Message-ID: <55B21364.5040906@suse.com>
In-Reply-To: <1437660433.5036.96.camel@citrix.com>

On 07/23/2015 04:07 PM, Dario Faggioli wrote:
> On Thu, 2015-07-23 at 06:43 +0200, Juergen Gross wrote:
>> On 07/22/2015 04:44 PM, Boris Ostrovsky wrote:
>>> On 07/22/2015 10:09 AM, Juergen Gross wrote:
>
>>>>>> I think we have 2 possible solutions:
>>>>>>
>>>>>> 1. Try to handle this all in the hypervisor via CPUID mangling.
>>>>>>
>>>>>> 2. Add PV-topology support to the guest and indicate this capability
>>>>>>    via elfnote; only enable PV-numa if this note is present.
>>>>>>
>>>>>> I'd prefer the second solution. If you are okay with this, I'd try
>>>>>> to do some patches for the pvops kernel.
>>>
>>> Why do you think that kernel patches are preferable to CPUID management?
>>> This would be all in tools, I'd think. (Well, one problem that I can
>>> think of is that AMD sometimes pokes at MSRs and/or Northbridge's PCI
>>> registers to figure out nodeID --- that we may need to have to address
>>> in the hypervisor)
>>
>> Doing it via CPUID is more HW specific. Trying to fake a topology for
>> the guest from outside might lead to weird decisions in the guest e.g.
>> regarding licenses based on socket counts.
>>
> I do see the value of this, I think...
>
>> If you are doing it in the guest itself you are able to address the
>> different problems (scheduling, licensing) in different ways.
>>
> ... but, at least in the case of vNUMA for instance, there still need to
> be a correlation between the vNUMA topology, and the "CPUID topology",
> and vNUMA topology is decided by toolstack.
>
> Then, if you mean, within all the possible solutions that matches (i.e.,
> that does not cause problems to!) the vNUMA setup we've been given,
> let's pick up one that also is best for this xxx other purpose, then I
> agree.
>
> What I'm not sure I see, although, is how you would be specifying the
> other purpose, e.g., in this case, are you thinking to another parameter
> saying that we want to minimize the socket count?
>
>> Depending on the licensing model
>> playing with CPUID is either good or bad. I can even imagine the CPUID
>> configuration capabilities in xl are in use today for exactly this
>> purpose. Using them for pv-NUMA as well will make this feature unusable
>> for those users.
>>
> Yeah, well... So, you want a VM with only one socket, because of
> whatever reasons (say licensing), and you're using libxl's CPUID
> fiddling capability to do that. Now, if you specify, for such a VM, a
> vNUMA layout with 2 vnodes, well, I'd call this asking for troubles. I
> know, strictly speaking, socket != (v)NUMA node. Still, I think this
> will be a corner case, way less common than just a user specifying a
> vNUMA topology, and getting only a fraction of the vcpus being
> used/usable! :-/
>
> In summary, I probably know too few of CPUID handling to have a clear
> view on whether something like 'making it match the topology' --which
> also means, if no vNUMA, CPUID should say flat, for some definition of
> flat-- should leave in tools or in kernel... I just know that we need to
> do something *consistent*.
>
> FWIW, I was thinking that the kernel were a better place, as Juergen is
> saying, while now I'm more convinced that tools would be more
> appropriate, as Boris is saying.

I've collected some information from the Linux kernel sources as a basis
for the discussion:

The complete NUMA information (cpu->node and memory->node relations) is
taken from the ACPI tables (SRAT, plus SLIT for the "distances").

The topology information is obtained via:
- Intel:
   + CPUID leaf 0xb with subleafs, leaf 4
   + CPUID leaf 2 and/or leaf 1 if leaf 0xb and/or leaf 4 isn't available
- AMD:
   + CPUID leaf 0x8000001e, leaf 0x8000001d, leaf 4
   + MSR 0xc001100c
   + CPUID leaf 2 and/or leaf 1 if leaf 0xb and/or leaf 4 isn't available
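
To illustrate why the CPUID values matter to the guest, here is a minimal
Python sketch (not kernel code) of how an x2APIC ID is split into package,
core and thread IDs using the shift widths reported in EAX[4:0] of CPUID
leaf 0xb, subleaf 0 (SMT level) and subleaf 1 (core level). The function
name and the sample values are made up for illustration; the register
values are passed in as plain integers rather than read from hardware.

```python
def decode_x2apic_id(x2apic_id, smt_shift, core_shift):
    """Split an x2APIC ID into (package_id, core_id, thread_id).

    smt_shift  -- EAX[4:0] of CPUID leaf 0xb, subleaf 0 (SMT level)
    core_shift -- EAX[4:0] of CPUID leaf 0xb, subleaf 1 (core level)
    Both are assumed to be supplied by the caller (hypothetical inputs).
    """
    # Lowest smt_shift bits select the thread within a core.
    thread_id = x2apic_id & ((1 << smt_shift) - 1)
    # The next (core_shift - smt_shift) bits select the core in the package.
    core_id = (x2apic_id >> smt_shift) & ((1 << (core_shift - smt_shift)) - 1)
    # Everything above core_shift identifies the package (socket).
    package_id = x2apic_id >> core_shift
    return package_id, core_id, thread_id


if __name__ == "__main__":
    # Sample values only: 1 bit for SMT, 4 bits for SMT + core.
    print(decode_x2apic_id(0x13, smt_shift=1, core_shift=4))
```

If the guest sees values like these that describe the host rather than the
virtual layout, the derived package/core/thread IDs need not agree with the
vNUMA information it got elsewhere.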

The scheduler is aware of:
- SMT siblings (from topology)
- last-level-cache (LLC) siblings (from topology)
- node siblings (from NUMA information)
In particular, it prefers to move a task from one CPU to another first
between SMT siblings, second between LLC siblings, third between node
siblings, and only last across all CPUs.
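
The preference order above can be sketched as follows. This is a heavily
simplified Python model, not how the scheduler is implemented (the real
code works with scheduling domains and load statistics); the function name
and the CPU numbering in the example are assumptions for illustration.

```python
def migration_candidates(cpu, smt, llc, node, all_cpus):
    """Return the other CPUs, nearest topology level first.

    smt, llc, node and all_cpus are sets of CPU numbers that contain
    `cpu` itself: its SMT siblings, LLC siblings, node siblings and
    every CPU in the system.
    """
    order = []
    for level in (smt, llc, node, all_cpus):
        for candidate in sorted(level):
            # Skip the source CPU and anything found at a closer level.
            if candidate != cpu and candidate not in order:
                order.append(candidate)
    return order


if __name__ == "__main__":
    # Hypothetical 16-CPU box: CPU n and n+8 are hyperthreads.
    print(migration_candidates(
        0,
        smt={0, 8},
        llc={0, 1, 8, 9},
        node={0, 1, 2, 3, 8, 9, 10, 11},
        all_cpus=set(range(16)),
    ))
```

If the sibling sets handed to the scheduler are wrong, every level of this
search is wrong with them.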

Memory management does NUMA-node-aware memory allocation.

Topology and NUMA information are made available to user space through
the /sys and /proc filesystems.
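
As an example of what user space can see, the following Python sketch reads
the per-CPU topology files under /sys/devices/system/cpu/cpuN/topology/.
The helper names and the sysfs_root parameter (handy for testing against a
copied tree) are my own; only the sysfs file names come from the kernel.

```python
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8-11" into a set."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def read_cpu_topology(cpu, sysfs_root="/sys"):
    """Return the topology the kernel exports for one CPU."""
    base = os.path.join(sysfs_root, "devices/system/cpu",
                        "cpu%d" % cpu, "topology")

    def read(name):
        with open(os.path.join(base, name)) as f:
            return f.read()

    return {
        "package": int(read("physical_package_id")),
        "core": int(read("core_id")),
        "thread_siblings": parse_cpulist(read("thread_siblings_list")),
        "core_siblings": parse_cpulist(read("core_siblings_list")),
    }


if __name__ == "__main__":
    print(read_cpu_topology(0))
```

Comparing this output with /sys/devices/system/node/nodeN/cpulist inside a
guest is a quick way to see whether topology and NUMA information agree.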

The CPUID instruction is available in user mode as well.


Anything I have missed?


Juergen
