From: Babu Moger <babu.moger@amd.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>,
	"rth@twiddle.net" <rth@twiddle.net>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"ehabkost@redhat.com" <ehabkost@redhat.com>
Subject: RE: [PATCH v2 1/3] hw/i386: Initialize topo_ids from CpuInstanceProperties
Date: Mon, 27 Jul 2020 18:59:42 -0500	[thread overview]
Message-ID: <5df170bd-ea91-8347-a2cf-7ac234248197@amd.com> (raw)
In-Reply-To: <20200727191416.2bf6e34a@redhat.com>



> -----Original Message-----
> From: Igor Mammedov <imammedo@redhat.com>
> Sent: Monday, July 27, 2020 12:14 PM
> To: Moger, Babu <Babu.Moger@amd.com>
> Cc: qemu-devel@nongnu.org; pbonzini@redhat.com; ehabkost@redhat.com;
> rth@twiddle.net
> Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> CpuInstanceProperties
> 
> On Mon, 27 Jul 2020 10:49:08 -0500
> Babu Moger <babu.moger@amd.com> wrote:
> 
> > > -----Original Message-----
> > > From: Igor Mammedov <imammedo@redhat.com>
> > > Sent: Friday, July 24, 2020 12:05 PM
> > > To: Moger, Babu <Babu.Moger@amd.com>
> > > Cc: qemu-devel@nongnu.org; pbonzini@redhat.com; ehabkost@redhat.com;
> > > rth@twiddle.net
> > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > > CpuInstanceProperties
> > >
> > > On Mon, 13 Jul 2020 14:30:29 -0500
> > > Babu Moger <babu.moger@amd.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Igor Mammedov <imammedo@redhat.com>
> > > > > Sent: Monday, July 13, 2020 12:32 PM
> > > > > To: Moger, Babu <Babu.Moger@amd.com>
> > > > > Cc: pbonzini@redhat.com; rth@twiddle.net; ehabkost@redhat.com;
> > > > > qemu-devel@nongnu.org
> > > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > > > > CpuInstanceProperties
> > > > >
> > > > > On Mon, 13 Jul 2020 11:43:33 -0500 Babu Moger
> > > > > <babu.moger@amd.com> wrote:
> > > > >
> > > > > > On 7/13/20 11:17 AM, Igor Mammedov wrote:
> > > > > > > On Mon, 13 Jul 2020 10:02:22 -0500 Babu Moger
> > > > > > > <babu.moger@amd.com> wrote:
> > > > > > >
> > > > > > >>> -----Original Message-----
> > > > > > >>> From: Igor Mammedov <imammedo@redhat.com>
> > > > > > >>> Sent: Monday, July 13, 2020 4:08 AM
> > > > > > >>> To: Moger, Babu <Babu.Moger@amd.com>
> > > > > > >>> Cc: pbonzini@redhat.com; rth@twiddle.net;
> > > > > > >>> ehabkost@redhat.com;
> > > > > > >>> qemu-devel@nongnu.org
> > > > > > >>> Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids
> > > > > > >>> from CpuInstanceProperties
> > > > > > > [...]
> > > > > > >>>> +
> > > > > > >>>> +/*
> > > > > > >>>> + * Initialize topo_ids from CpuInstanceProperties
> > > > > > >>>> + * node_id in CpuInstanceProperties (or in CPU device) is a sequential
> > > > > > >>>> + * number, but while building the topology we need to separate it for
> > > > > > >>>> + * each socket (mod nodes_per_pkg).
> > > > > > >>> could you clarify a bit more on why this is necessary?
> > > > > > >>
> > > > > > >> If you have two sockets and 4 numa nodes, node_id in
> > > > > > >> CpuInstanceProperties will be numbered sequentially as 0, 1, 2, 3.
> > > > > > >> But in the EPYC topology, it will be 0, 1, 0, 1 (basically
> > > > > > >> node_id mod number of nodes per socket).
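
As an aside, a tiny standalone illustration of that mod mapping, using a
hypothetical 2-socket/4-node config (this is an illustration, not QEMU code):

  #include <stdio.h>

  int main(void)
  {
      unsigned nodes_per_socket = 2;  /* assumed: 4 numa nodes / 2 sockets */
      for (unsigned node_id = 0; node_id < 4; node_id++) {
          /* prints 0->0, 1->1, 2->0, 3->1 */
          printf("sequential node_id %u -> per-socket node_id %u\n",
                 node_id, node_id % nodes_per_socket);
      }
      return 0;
  }
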
> > > > > > >
> > > > > > > I'm confused; let's suppose we have 2 EPYC sockets with 2
> > > > > > > nodes per socket, so the APIC id would be composed like:
> > > > > > >
> > > > > > >  1st socket
> > > > > > >    pkg_id(0) | node_id(0)
> > > > > > >    pkg_id(0) | node_id(1)
> > > > > > >
> > > > > > >  2nd socket
> > > > > > >    pkg_id(1) | node_id(0)
> > > > > > >    pkg_id(1) | node_id(1)
> > > > > > >
> > > > > > > if that's the case, then EPYC's node_id here doesn't look
> > > > > > > like a NUMA node in the sense it's usually used (the above
> > > > > > > config would have 4 different memory controllers => 4
> > > > > > > conventional NUMA nodes).
> > > > > >
> > > > > > The EPYC model uses a combination of socket id and node id to
> > > > > > identify the numa nodes. So, it internally uses all the information.
> > > > >
> > > > > well, with the above values, EPYC's node_id doesn't look like it's
> > > > > specifying a machine numa node, but rather a node index within a
> > > > > single socket. In that case, it doesn't make much sense to call it
> > > > > NUMA node_id; it's rather some index within a socket. (It starts
> > > > > looking like the terminology is all mixed up.)
> > > > >
> > > > > If you have access to a multi-socket EPYC machine, can you dump
> > > > > and post its apic ids here, please?
> > > >
> > > > Here is the output from my EPYC machine with 2 sockets and a total
> > > > of 8 nodes (SMT disabled). CPUs 0-31 are in socket 0 and CPUs 32-63
> > > > are in socket 1.
> > > >
> > > > # lscpu
> > > > Architecture:        x86_64
> > > > CPU op-mode(s):      32-bit, 64-bit
> > > > Byte Order:          Little Endian
> > > > CPU(s):              64
> > > > On-line CPU(s) list: 0-63
> > > > Thread(s) per core:  1
> > > > Core(s) per socket:  32
> > > > Socket(s):           2
> > > > NUMA node(s):        8
> > > > Vendor ID:           AuthenticAMD
> > > > CPU family:          23
> > > > Model:               1
> > > > Model name:          AMD Eng Sample: 1S1901A4VIHF5_30/19_N
> > > > Stepping:            2
> > > > CPU MHz:             2379.233
> > > > CPU max MHz:         1900.0000
> > > > CPU min MHz:         1200.0000
> > > > BogoMIPS:            3792.81
> > > > Virtualization:      AMD-V
> > > > L1d cache:           32K
> > > > L1i cache:           64K
> > > > L2 cache:            512K
> > > > L3 cache:            8192K
> > > > NUMA node0 CPU(s):   0-7
> > > > NUMA node1 CPU(s):   8-15
> > > > NUMA node2 CPU(s):   16-23
> > > > NUMA node3 CPU(s):   24-31
> > > > NUMA node4 CPU(s):   32-39
> > > > NUMA node5 CPU(s):   40-47
> > > > NUMA node6 CPU(s):   48-55
> > > > NUMA node7 CPU(s):   56-63
> > > >
> > > > Here is the output of # cpuid -l 0x8000001e -r
> > >
> > >
> > > (1)
> > > > You may want to refer to
> > > > https://www.amd.com/system/files/TechDocs/54945_3.03_ppr_ZP_B2_pub.zip
> > > > (section 2.1.12.2.1.3 ApicId Enumeration Requirements).
> > > > Note that this is a general guideline. We tried to generalize it in
> > > > QEMU as much as possible. It is a bit complex.
> > >
> > >
> > >
> > > > CPU 0:
> > > >    0x8000001e 0x00: eax=0x00000000 ebx=0x00000100 ecx=0x00000300
> > > > edx=0x00000000
> > > [...]
> > > > CPU 63:
> > > >    0x8000001e 0x00: eax=0x0000007e ebx=0x0000011f ecx=0x00000307
> > > > edx=0x00000000
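
For reference, a hedged sketch of how to decode those fields on a live
machine; the bit positions follow the PPR cited above and should be
double-checked against it (this is an illustration, not QEMU code):

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
      unsigned eax, ebx, ecx, edx;

      if (!__get_cpuid(0x8000001e, &eax, &ebx, &ecx, &edx)) {
          return 1;  /* leaf not supported */
      }
      printf("ExtendedApicId:        0x%x\n", eax);
      printf("CoreId:                %u\n", ebx & 0xff);
      printf("NodeId:                %u\n", ecx & 0xff);  /* 7 for CPU 63 above */
      printf("NodesPerProcessor - 1: %u\n", (ecx >> 8) & 0x7);
      return 0;
  }
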
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > I wonder if the linux guest actually uses the node_id encoded
> > > > > > > in the apic id for configuring/checking numa structures, or if
> > > > > > > it just uses whatever the ACPI SRAT table provides.
> > > > > > >
> > > > > > >>>> + */
> > > > > > >>>> +static inline void x86_init_topo_ids(X86CPUTopoInfo *topo_info,
> > > > > > >>>> +                                     CpuInstanceProperties props,
> > > > > > >>>> +                                     X86CPUTopoIDs *topo_ids) {
> > > > > > >>>> +    topo_ids->smt_id = props.has_thread_id ? props.thread_id : 0;
> > > > > > >>>> +    topo_ids->core_id = props.has_core_id ? props.core_id : 0;
> > > > > > >>>> +    topo_ids->die_id = props.has_die_id ? props.die_id : 0;
> > > > > > >>>> +    topo_ids->node_id = props.has_node_id ?
> > > > > > >>>> +                        props.node_id % MAX(topo_info->nodes_per_pkg, 1) : 0;
> > >
> > > It looks like I was wrong pushing a system-wide NUMA node-id into the
> > > APIC ID (the chosen naming is a bit confusing); per [1] mentioned
> > > above, EPYC's node-id is:
> > >
> > > * ApicId[6]   = Socket ID.
> > > * ApicId[5:4] = Node ID.
> > > * ApicId[3]   = Logical CCX L3 complex ID
> > > * ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}
> > >
> > > which can hold only values 0-3, and is defined as:
> > >
> > > "A node, is an integrated circuit device that includes one to 8
> > > cores (one or two Core Complexes)."
> > >
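
Reading that layout back, a minimal sketch of how such an APIC ID would be
composed; the field widths are taken from the excerpt above and are
assumptions, not QEMU's actual implementation:

  #include <stdbool.h>
  #include <stdint.h>

  /* ApicId[6] = socket, ApicId[5:4] = node, ApicId[3] = CCX,
   * ApicId[2:0] = SMT ? {core[1:0], thread} : {1'b0, core[1:0]}. */
  static uint32_t epyc_apic_id(unsigned socket_id, unsigned node_id,
                               unsigned ccx_id, unsigned core_id,
                               unsigned thread_id, bool smt)
  {
      uint32_t low = smt ? (((core_id & 0x3) << 1) | (thread_id & 0x1))
                         : (core_id & 0x3);
      return ((socket_id & 0x1) << 6) | ((node_id & 0x3) << 4) |
             ((ccx_id & 0x1) << 3) | low;
  }
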
> > > The spec also mentions it indirectly as die-id if one looks at
> > > CPUID_Fn8000001E_EBX [Core Identifiers] (Core::X86::Cpuid::CoreId) ...
> > >   CoreId = ({2'b0, DieId[1:0], LogicalComplexId[0], LogicalThreadId[2:0]} >> SMT
> > >
> > > and in
> > > (2)
> > > CPUID_Fn8000001E_ECX [Node Identifiers] (Core::X86::Cpuid::NodeId) ...
> > >   {5'b00000,1'b[SOCKET_ID],2'b[DIE_ID]}
> > >
> > > The question is why we did not reuse topo_ids->die_id instead of
> > > adding the confusing topo_ids->node_id in the first place?
> >
> > Initially, I thought about it. But Intel uses die_id differently than AMD,
> > so I did not want to complicate things.
> > If we take that route then we need to re-arrange the numa code, as we
> > need the numa information to calculate the die id. So, I did not want to
> > mix things up.
> >
> > >
> > > Also, looking at the APIC ID and SRAT table provided here,
> > > CPUID_Fn8000001E_ECX corresponds to the NUMA node id (i.e. what -numa
> > > in QEMU is used for) and the Node ID embedded into ApicId[5:4] is
> > > basically a die-id.
> > >
> > > The difference between the die-id implemented in QEMU and EPYC's die id
> > > (topo_ids->node_id) is that the former doesn't require a numa config
> > > (maybe it should, but that ship's already sailed) and gets the number
> > > of dies from the '-smp dies=X' CLI option, while for EPYC we calculate
> > > topo_ids->node_id implicitly from the number of numa nodes and sockets,
> > > which implicitly requires that the machine 'must' be configured with
> > > -numa options.
> > >
> > > Maybe we should drop this implicit calculation along with
> > > topo_ids->node_id and reuse '-smp dies=X' plus extra checks for EPYC
> > > to ask for -numa if there is more than 1 die; and if we need to be
> > > really strict, add checks for the limit of dies/cores within a
> > > socket/die according to spec[2] so the encoded APIC ID and
> > > CPUID_8000001E match the spec.
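
A minimal sketch of the extra check suggested there; the identifiers and
error message are hypothetical, not QEMU's actual names:

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  static void epyc_check_topology(bool is_epyc, unsigned smp_dies,
                                  unsigned nr_numa_nodes)
  {
      /* EPYC encodes a per-socket node id into the APIC ID, which can
       * only be derived from a NUMA config; insist on one. */
      if (is_epyc && smp_dies > 1 && nr_numa_nodes == 0) {
          fprintf(stderr, "EPYC with %u dies requires a -numa config\n",
                  smp_dies);
          exit(EXIT_FAILURE);
      }
  }
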
> >
> > There will be complications when user configures with both die_id and
> > numa_id. It will complicate things further. I will have to look
> > closely at the code if it is feasible.
> 
> It's worth a try.
> Conceptually, die_id in Intel/AMD is the same. Most likely Intel has a
> dedicated memory controller on each die, so it should still form a NUMA
> node. But that aspect was probably ignored while implementing it in QEMU,
> so the onus of configuring QEMU right is on the user's shoulders (there
> are no checks whatsoever that a cpu belonging to a specific die is on the
> right NUMA node).

So you are suggesting we use die_id to build the topology for EPYC, and
initialize die_id based on the numa information; re-arrange the numa code
to make sure we have all the information before we build the topology; and
then remove node_id from X86CPUTopoIDs. Is that the plan? A rough strawman
of that derivation is sketched below.
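
  /* Strawman only: derive die_id from the NUMA config instead of carrying
   * a separate node_id. All identifiers here are hypothetical stand-ins
   * for QEMU's, and the mapping assumes numa nodes are assigned to
   * sockets in order. */
  #define MAX(a, b) ((a) > (b) ? (a) : (b))

  static unsigned epyc_die_id_from_numa(unsigned numa_node_id,
                                        unsigned nr_numa_nodes,
                                        unsigned nr_sockets)
  {
      unsigned nodes_per_socket =
          MAX(nr_sockets ? nr_numa_nodes / nr_sockets : 1u, 1u);
      /* e.g. 8 numa nodes on 2 sockets: node 5 -> die 1 in socket 1 */
      return numa_node_id % nodes_per_socket;
  }
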
> 
> What AMD has implemented on top of that in CPU hw is to expose the NUMA
> node id in CPUID_8000001E. I don't know why it was done, as it's usually
> the ACPI tables that describe relations between nodes, so for the OS this
> info is almost useless (I'd guess it's faster to use CPUID instead of
> fetching a per-cpu variable, but that's pretty much it from the OS point
> of view).
> 
> >
> > >
> > >
> > >
> > > > > > >>>> +    topo_ids->pkg_id = props.has_socket_id ? props.socket_id : 0;
> > > > > > >>>> +}
> > > > > > >>>>  /*
> > > > > > >>>>   * Make APIC ID for the CPU 'cpu_index'
> > > > > > >>>>   *
> > > > > > >>>>
> > > > > > >>
> > > > > > >
> > > > > >
> > > >
> > > >
> >


