Re: [Qemu-devel] [RFC 2 PATCH 07/16] hw/386: Add new epyc mode topology decoding functions

From: Babu Moger <babu.moger@amd.com>
To: Eduardo Habkost <ehabkost@redhat.com>
Cc: "mst@redhat.com" <mst@redhat.com>,
	"armbru@redhat.com" <armbru@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"imammedo@redhat.com" <imammedo@redhat.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"rth@twiddle.net" <rth@twiddle.net>
Subject: Re: [Qemu-devel] [RFC 2 PATCH 07/16] hw/386: Add new epyc mode topology decoding functions
Date: Mon, 2 Dec 2019 14:38:03 -0600	[thread overview]
Message-ID: <d0b3badc-69d9-7538-0f47-e206c2123e6f@amd.com> (raw)
In-Reply-To: <20191011032953.GD29387@habkost.net>

On 10/10/19 10:29 PM, Eduardo Habkost wrote:
> On Fri, Sep 06, 2019 at 07:12:25PM +0000, Moger, Babu wrote:
>> These functions add support for building new epyc mode topology
>> given smp details like numa nodes, cores, threads and sockets.
>> Subsequent patches will use these functions to build the topology.
>>
>> The topology details are available in Processor Programming Reference (PPR)
>> for AMD Family 17h Model 01h, Revision B1 Processors.
>> It is available at https://www.amd.com/en/support/tech-docs
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>>  include/hw/i386/topology.h |  174 ++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 174 insertions(+)
>>
>> diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
>> index 5a61d53f05..6fd4184f07 100644
>> --- a/include/hw/i386/topology.h
>> +++ b/include/hw/i386/topology.h
>> @@ -62,6 +62,22 @@ typedef struct X86CPUTopoInfo {
>>      unsigned nr_threads;
>>  } X86CPUTopoInfo;
>>  
>> +/*
>> + * Definitions used for building CPUID Leaf 0x8000001D and 0x8000001E
>> + * Please refer to the AMD64 Architecture Programmer’s Manual Volume 3.
>> + * Define the constants to build the cpu topology. Right now, TOPOEXT
>> + * feature is enabled only on EPYC. So, these constants are based on
>> + * EPYC supported configurations. We may need to handle the cases if
>> + * these values change in future.
>> + */
>> +
>> +/* Maximum core complexes in a node */
>> +#define MAX_CCX                  2
> 
> I suggest a more explicit name like MAX_CCX_IN_NODE.
> 
>> +/* Maximum cores in a core complex */
>> +#define MAX_CORES_IN_CCX         4
>> +/* Maximum cores in a node */
>> +#define MAX_CORES_IN_NODE        8
>> +
> 
> Having limits isn't necessarily bad, but if we look at the code
> that use those constants below:
> 
> [...]
>> +/*
>> + * Figure out the number of nodes required to build this config.
>> + * Max cores in a nodes is 8
>> + */
>> +static inline int nodes_in_pkg(X86CPUTopoInfo *topo_info)
>> +{
>> +    /*
>> +     * Create a config with user given (nr_nodes > 1) numa node config,
>> +     * else go with a standard configuration
>> +     */
>> +    if (topo_info->numa_nodes > 1) {
>> +        return DIV_ROUND_UP(topo_info->numa_nodes, topo_info->nr_sockets);
>> +    } else {
>> +        return DIV_ROUND_UP(topo_info->nr_cores, MAX_CORES_IN_NODE);
>> +    }
>> +}
> 
> Here we are trying to choose a default value for the number of
> nodes, but it's better to let the CPU or machine code choose it.
> Otherwise, this will be very painful to maintain if new CPUs with
> different limits appear.  I would just let the number of nodes
> per package to be configurable in the command-line.
> 
>> +
>> +/*
>> + * Decide the number of cores in a core complex with the given nr_cores using
>> + * following set constants MAX_CCX, MAX_CORES_IN_CCX, MAX_CORES_IN_DIE and
>> + * MAX_NODES_PER_SOCKET. Maintain symmetry as much as possible
>> + * L3 cache is shared across all cores in a core complex. So, this will also
>> + * tell us how many cores are sharing the L3 cache.
>> + */
>> +static inline int cores_in_ccx(X86CPUTopoInfo *topo_info)
>> +{
>> +    int nodes;
>> +
>> +    /* Get the number of nodes required to build this config */
>> +    nodes = nodes_in_pkg(topo_info);
>> +
>> +    /*
>> +     * Divide the cores accros all the core complexes
>> +     * Return rounded up value
>> +     */
>> +    return DIV_ROUND_UP(topo_info->nr_cores, nodes * MAX_CCX);
>> +}
> 
> Same here.  This will become painful to maintain if CPUs with a
> different CCX-per-node limit appear.  I suggest just letting the
> number of cores per CCX to be configurable in the command-line.
> The "cores" option is already defined as "cores per die" when
> nr_dies > 1.  We can easily define it to mean "cores per CCX"
> when nr_ccxs > 1.
> 
> If you still want to impose limits to some of those parameters, I
> suggest placing those limits in the CPU model table, not in
> global constants.
> 
> 
>> +
>> +/*
>> + * Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
>> + *
>> + * The caller must make sure core_id < nr_cores and smt_id < nr_threads.
>> + */
>> +static inline apic_id_t x86_apicid_from_topo_ids_epyc(X86CPUTopoInfo *topo_info,
>> +                                                  const X86CPUTopoIDs *topo_ids)
>> +{
>> +    unsigned nr_ccxs = MAX_CCX;
> 
> I suggest adding nr_ccxs to X86CPUTopoInfo, instead of making it
> fixed.
> 
>> +    unsigned nr_nodes = nodes_in_pkg(topo_info);
> 
> Same here: isn't it better to have a nr_nodes_per_pkg field in
> X86CPUTopoInfo, and make it configurable in the command-line?
> 
>> +    unsigned nr_cores = MAX_CORES_IN_CCX;
> 
> nr_cores is already in X86CPUTopoInfo, why aren't you using it?
> 
> Similar questions at x86_topo_ids_from_apicid_epyc() [1].
> 
>> +    unsigned nr_threads = topo_info->nr_threads;
>> +
>> +    return (topo_ids->pkg_id  << apicid_pkg_offset_epyc(nr_nodes, nr_ccxs,
>> +                                                        nr_cores, nr_threads)) |
>> +           (topo_ids->node_id  << apicid_node_offset(nr_ccxs, nr_cores,
>> +                                                     nr_threads)) |
>> +           (topo_ids->ccx_id  << apicid_ccx_offset(nr_cores, nr_threads)) |
>> +           (topo_ids->core_id << apicid_core_offset(nr_threads)) |
>> +           topo_ids->smt_id;
>> +}
>> +
>> +static inline void x86_topo_ids_from_idx_epyc(X86CPUTopoInfo *topo_info,
>> +                                              unsigned cpu_index,
>> +                                              X86CPUTopoIDs *topo_ids)
>> +{
>> +    unsigned nr_cores = topo_info->nr_cores;
> 
> Is nr_cores cores-per-ccx?  cores-per-node?  cores-per-pkg?
> 
> 
>> +    unsigned nr_threads = topo_info->nr_threads;
>> +    unsigned core_index = cpu_index / nr_threads % nr_cores;
> 
> It's hard to understand what this variable means.  Is it a unique
> identifier for a core inside the whole system?  Is it unique
> inside a package?  Unique inside a node?  Unique inside a ccx?
> 
> 
>> +    unsigned ccx_cores = cores_in_ccx(topo_info);
>> +
>> +    topo_ids->smt_id = cpu_index % nr_threads;
>> +    topo_ids->core_id = core_index % ccx_cores; /* core id inside the ccx */
>> +    topo_ids->ccx_id = (core_index % (ccx_cores * MAX_CCX)) / ccx_cores;
>> +    topo_ids->node_id = core_index / (ccx_cores * MAX_CCX);
> 
> It looks like we have one extra constraint we must ensure
> elsewhere in the code: we need to make sure the node topology in
> topo->node_id matches the node topology defined using -numa,
> don't we?
> 
> 
>> +    topo_ids->pkg_id = cpu_index / (nr_cores * nr_threads);
> 
> Now, these calculations are over my head.  It's hard to
> understand what "nr_cores" mean, or what "ccx_cores * MAX_CCX" is
> calculating.
> 
> I will try to compare with the existing x86_topo_ids_from_idx()
> function, to see if we can make the hierarchy clearer:
> 
> Existing code is:
> 
>     topo->pkg_id = cpu_index / (nr_dies * nr_cores * nr_threads);
>     topo->die_id = cpu_index / (nr_cores * nr_threads) % nr_dies;
>     topo->core_id = cpu_index / nr_threads % nr_cores;
>     topo->smt_id = cpu_index % nr_threads;
> 
> The code is hierarchical, but not 100% clear.  Let's try to make
> it clearer:
> 
>     unsigned threads_per_core = topo->nr_threads;
>     unsigned   cores_per_die  = topo->nr_cores;
>     unsigned    dies_per_pkg  = topo->nr_dies;
> 
>     unsigned threads_per_die  = threads_per_core * cores_per_die;
>     unsigned threads_per_pkg  = threads_per_die  *  dies_per_pkg;
> 
>     unsigned thread_index = cpu_index;
>     unsigned   core_index = cpu_index / threads_per_core;
>     unsigned    die_index = cpu_index / threads_per_die;
>     unsigned    pkg_index = cpu_index / threads_per_pkg;
> 
>     topo-> smt_id = thread_index % threads_per_core;
>     topo->core_id =   core_index %   cores_per_die;
>     topo-> die_id =    die_index %    dies_per_pkg;
>     topo-> pkg_id =    pkg_index;
> 
> The logic above should be equivalent to the existing
> x86_topo_ids_from_idx() function.
> 
> Can we make it better for the smt/core/ccx/node/pkg case too?
> Let's try:
> 
>     unsigned threads_per_core = topo->nr_threads;
>     unsigned   cores_per_ccx  = topo->nr_cores;
>     unsigned    ccxs_per_node = topo->nr_ccxs;
>     unsigned   nodes_per_pkg  = topo->nodes_per_pkg;
> 
>     unsigned threads_per_ccx  = threads_per_core * cores_per_ccx;
>     unsigned threads_per_node = threads_per_ccx  *  ccxs_per_node;
>     unsigned threads_per_pkg  = threads_per_node * nodes_per_pkg;
> 
>     unsigned thread_index = cpu_index;
>     unsigned   core_index = cpu_index / threads_per_core;
>     unsigned    ccx_index = cpu_index / threads_per_ccx;
>     unsigned   node_index = cpu_index / threads_per_node;
>     unsigned    pkg_index = cpu_index / threads_per_pkg;
> 
>     topo-> smt_id = thread_index % threads_per_core;
>     topo->core_id =   core_index %   cores_per_ccx;
>     topo-> ccx_id =    ccx_index %    ccxs_per_node;
>     topo->node_id =   node_index %   nodes_per_pkg;
>     topo-> pkg_id =    pkg_index;
> 
> The logic above probably isn't equivalent to the code you wrote.
> The point here is that each variable is very clearly defined with
> a "per_*" suffix to avoid confusion, and all parameters are
> coming from X86CPUTopoInfo.
> 
>> +}
>> +
>> +/*
>> + * Calculate thread/core/package IDs for a specific topology,
>> + * based on APIC ID
>> + */
>> +static inline void x86_topo_ids_from_apicid_epyc(apic_id_t apicid,
>> +                                            X86CPUTopoInfo *topo_info,
>> +                                            X86CPUTopoIDs *topo_ids)
>> +{
>> +    unsigned nr_nodes = nodes_in_pkg(topo_info);
>> +    unsigned nr_cores = MAX_CORES_IN_CCX;
>> +    unsigned nr_threads = topo_info->nr_threads;
>> +    unsigned nr_ccxs = MAX_CCX;
> 
> Same questions as in x86_apicid_from_topo_ids_epyc() [1]:
> shouldn't nr_cores, nr_ccxs and nr_nodes come from
> X86CPUTopoInfo?
> 
>> +
>> +    topo_ids->smt_id = apicid &
>> +                   ~(0xFFFFFFFFUL << apicid_smt_width(nr_threads));
>> +
>> +    topo_ids->core_id = (apicid >> apicid_core_offset(nr_threads)) &
>> +                   ~(0xFFFFFFFFUL << apicid_core_width(nr_cores));
>> +
>> +    topo_ids->ccx_id = (apicid >> apicid_ccx_offset(nr_cores, nr_threads)) &
>> +                   ~(0xFFFFFFFFUL << apicid_ccx_width(nr_ccxs));
>> +
>> +    topo_ids->node_id = (apicid >> apicid_node_offset(nr_ccxs, nr_cores,
>> +                                                      nr_threads)) &
>> +                   ~(0xFFFFFFFFUL << apicid_node_width(nr_nodes));
>> +
>> +    topo_ids->pkg_id = apicid >> apicid_pkg_offset_epyc(nr_nodes, nr_ccxs,
>> +                                                        nr_cores, nr_threads);
>> +}
>> +
>> +/*
>> + * Make APIC ID for the CPU 'cpu_index'
>> + *
>> + * 'cpu_index' is a sequential, contiguous ID for the CPU.
>> + */
>> +static inline apic_id_t x86_apicid_from_cpu_idx_epyc(X86CPUTopoInfo *topo_info,
>> +                                                     unsigned cpu_index)
>> +{
>> +    X86CPUTopoIDs topo_ids;
>> +    x86_topo_ids_from_idx_epyc(topo_info, cpu_index, &topo_ids);
>> +    return x86_apicid_from_topo_ids_epyc(topo_info, &topo_ids);
>> +}
> 
> If we make nodes_per_pkg==1 by default, treat nr_ccxs and nr_dies
> as synonyms, and make all parameters come from X86CPUTopoInfo,
> can't we reuse exactly the same topology code for both EPYC and
> non-EPYC?  It would be better than having two separate sets of
> functions.
> 
> I have one additional request: please add unit tests for the
> functions above to test-x86-cpuid.c.  There may be opportunities
> for refactoring this code later, and unit tests will be important
> to make sure we don't break anything.
> 
>> +
>>  /* Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
>>   *
>>   * The caller must make sure core_id < nr_cores and smt_id < nr_threads.
>>
> 

I have removed all the hardcoded values. All we need to know is the node
id if numa configured. All other fields just works fine as it is now.
Please look at v3 for more information.