Re: [PATCH v3 00/18] APIC ID fixes for AMD EPYC CPU models

From: Babu Moger <babu.moger@amd.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: ehabkost@redhat.com, mst@redhat.com, armbru@redhat.com,
	qemu-devel@nongnu.org, pbonzini@redhat.com, rth@twiddle.net
Subject: Re: [PATCH v3 00/18] APIC ID fixes for AMD EPYC CPU models
Date: Thu, 6 Feb 2020 09:32:12 -0600
Message-ID: <63d0f63c-bd84-5e42-0a30-531a54fb8af3@amd.com>
In-Reply-To: <20200206140839.378ea544@redhat.com>

On 2/6/20 7:08 AM, Igor Mammedov wrote:
> On Wed, 5 Feb 2020 13:07:31 -0600
> Babu Moger <babu.moger@amd.com> wrote:
> 
>> On 2/5/20 10:56 AM, Igor Mammedov wrote:
>>> On Wed, 5 Feb 2020 10:10:06 -0600
>>> Babu Moger <babu.moger@amd.com> wrote:
>>>   
>>>> On 2/5/20 3:38 AM, Igor Mammedov wrote:  
>>>>> On Tue, 4 Feb 2020 13:08:58 -0600
>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>     
>>>>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:    
>>>>>>> On Mon, 3 Feb 2020 13:31:29 -0600
>>>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>>>       
>>>>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:      
>>>>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
>>>>>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>>>>>         
>>>>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
>>>>>>>>>>
>>>>>>>>>> Currently, the APIC ID is decoded based on the sequence
>>>>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and other
>>>>>>>>>> vendors' configurations, but this decoding sequence does not follow that of
>>>>>>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
>>>>>>>>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
>>>>>>>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
>>>>>>>>>>
>>>>>>>>>> To fix the problem we need to build the topology as per the Processor
>>>>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>>>>>>>>>> Processors. It is available at https://www.amd.com/system/files/TechDocs/55570-B1_PUB.zip
>>>>>>>>>>
>>>>>>>>>> Here is the text from the PPR.
>>>>>>>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
>>>>>>>>>> number of least significant bits in the Initial APIC ID that indicate core ID
>>>>>>>>>> within a processor, in constructing per-core CPUID masks.
>>>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
>>>>>>>>>> (MNC) that the processor could theoretically support, not the actual number of
>>>>>>>>>> cores that are actually implemented or enabled on the processor, as indicated
>>>>>>>>>> by Core::X86::Cpuid::SizeId[NC].
>>>>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>>>>>>>>>> • ApicId[6] = Socket ID.
>>>>>>>>>> • ApicId[5:4] = Node ID.
>>>>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
>>>>>>>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}        
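For illustration only, the bit layout quoted from the PPR above can be sketched in C as follows. The helper name epyc_apic_id() and its parameters are hypothetical and are not taken from the series or from QEMU; the field widths are the ones listed for Family 17h Model 01h, so this is a sketch under those assumptions rather than the actual encoding code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not QEMU code): pack topology fields into an APIC ID
 * using the Family 17h Model 01h layout quoted from the PPR:
 *   bit 6     socket ID
 *   bits 5:4  node ID
 *   bit 3     logical CCX (L3 complex) ID
 *   bits 2:0  {core[1:0], thread} with SMT, {0, core[1:0]} without
 */
static uint32_t epyc_apic_id(unsigned socket, unsigned node, unsigned ccx,
                             unsigned core, unsigned thread, int smt)
{
    uint32_t low;

    /* Reject values that do not fit the fixed field widths. */
    assert(socket < 2 && node < 4 && ccx < 2 && core < 4);

    if (smt) {
        assert(thread < 2);
        low = (core << 1) | thread;   /* {LogicalCoreID[1:0], ThreadId} */
    } else {
        assert(thread == 0);
        low = core;                   /* {1'b0, LogicalCoreID[1:0]} */
    }

    return ((uint32_t)socket << 6) | (node << 4) | (ccx << 3) | low;
}
```

With SMT enabled, socket 1, node 2, CCX 1, core 3, thread 1 packs to 0x6f. Without SMT, socket 0, node 1, CCX 0, core 2 packs to 0x12.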
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After checking out all the patches and some pondering, the approach
>>>>>>>>> used here looks too intrusive to me for the task at hand, especially
>>>>>>>>> where it comes to generic code.
>>>>>>>>>
>>>>>>>>> (Skip ahead to ==== to see the suggestion on how to simplify, without
>>>>>>>>> reading the reasoning behind it first)
>>>>>>>>>
>>>>>>>>> Let's look for a way to simplify it a little bit.
>>>>>>>>>
>>>>>>>>> So the problems we are trying to solve are:
>>>>>>>>>  1: calculate APIC IDs based on cpu type (to be more specific: for EPYC based CPUs)
>>>>>>>>>  2: it depends on knowing the total number of NUMA nodes.
>>>>>>>>>
>>>>>>>>> Externally, the workflow looks like the following:
>>>>>>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
>>>>>>>>>       that's used by possible_cpu_arch_ids() singleton to build list of
>>>>>>>>>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
>>>>>>>>>
>>>>>>>>>       The hook could be called very early, when possible_cpus data might
>>>>>>>>>       not be complete. It builds a list of possible CPUs which the user
>>>>>>>>>       could modify later.
>>>>>>>>>
>>>>>>>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
>>>>>>>>>       options to assign cpus to nodes, which one way or another calls
>>>>>>>>>       machine_set_cpu_numa_node(). The latter updates the 'possible_cpus' list
>>>>>>>>>       with node information. It happens early, when the total number of nodes
>>>>>>>>>       is not available.
>>>>>>>>>
>>>>>>>>>   2.2 user does not provide explicit node mappings for CPUs.
>>>>>>>>>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
>>>>>>>>>       (using the same machine_set_cpu_numa_node()) right before calling the board's
>>>>>>>>>       specific machine init(). At that time the total number of nodes is known.
>>>>>>>>>
>>>>>>>>> In cases 1 -- 2.1, 'arch_id' in the 'possible_cpus' list doesn't have to be defined before
>>>>>>>>> the board's init() is run.
>>>>>>
>>>>>> In case of 2.1, we need to have the arch_id already generated. This is
>>>>>> done inside possible_cpu_arch_ids. The arch_id is used by
>>>>>> machine_set_cpu_numa_node to assign the cpus to correct numa node.    
>>>>>
>>>>> I might have missed something but I don't see arch_id itself being used in
>>>>> machine_set_cpu_numa_node(). It only uses props part of possible_cpus    
>>>>
>>>> Before calling machine_set_cpu_numa_node, we call
>>>> cpu_index_to_instance_props -> x86_cpu_index_to_props->
>>>> possible_cpu_arch_ids->x86_possible_cpu_arch_ids.
>>>>
>>>> This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
>>>> the available cpus. Based on the arch_id, it also sets up the props.  
>>>
>>>
>>> x86_possible_cpu_arch_ids()
>>>    arch_id = x86_cpu_apic_id_from_index(x86ms, i)
>>>    x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,  ms->smp.threads, &topo);
>>>    // assign socket/die/core/thread from topo
>>>
>>> So currently it uses an indirect way to convert an index in possible_cpus->cpus[]
>>> to socket/die/core/thread ids.
>>> But essentially it takes the '-smp' options and a [0..max_cpus) number as original data,
>>> converts them into an intermediate apic_id and then reverse-engineers it back to
>>> topo info.
>>>
>>> Why not use x86_topo_ids_from_idx() directly to get rid of 'props' dependency on apic_id?  
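To make the suggestion concrete, here is a minimal standalone sketch of deriving socket/die/core/thread ids straight from a CPU index, without going through an intermediate apic_id. The struct topo_ids type and the topo_ids_from_index() name are hypothetical stand-ins and this is not the QEMU implementation; it assumes the index enumerates threads fastest, then cores, then dies, then sockets.

```c
#include <assert.h>

/* Hypothetical container for the decoded ids (not a QEMU type). */
struct topo_ids {
    unsigned pkg_id;
    unsigned die_id;
    unsigned core_id;
    unsigned smt_id;
};

/*
 * Split a linear CPU index into topology ids using only the -smp style
 * counts, assuming threads vary fastest and sockets slowest.
 */
static struct topo_ids topo_ids_from_index(unsigned nr_dies, unsigned nr_cores,
                                           unsigned nr_threads,
                                           unsigned cpu_index)
{
    struct topo_ids t;

    assert(nr_dies > 0 && nr_cores > 0 && nr_threads > 0);

    t.pkg_id = cpu_index / (nr_dies * nr_cores * nr_threads);
    t.die_id = cpu_index / (nr_cores * nr_threads) % nr_dies;
    t.core_id = cpu_index / nr_threads % nr_cores;
    t.smt_id = cpu_index % nr_threads;
    return t;
}
```

For example, with 2 dies, 4 cores per die and 2 threads per core, index 13 maps to socket 0, die 1, core 2, thread 1, and index 16 is the first thread of socket 1.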
>>
>> It might work. But this feels like a work-around that delays the problem
>> for later. By just re-arranging the numa code a little bit we can address this.
> 
> The idea behind possible_cpus is to allow users to query the topo information
> the board generates (based on -smp) at configuration time (or later), so users
> could know what -numa cpu,topo_options [and -device foo-cpu,topo_options]
> to use. Initializing apic_id on the first access is secondary, and I did
> it only because I could do it without additional data.
> 
> But the main purpose of possible_cpus is to keep topology information.
> That includes numa node mapping, which should be stored in possible_cpus
> along with the rest of cpu topology.
> 
> Looking at the [12/18] numa patch, it makes the legacy -numa node,cpus option
> reintroduce data duplication, by storing the mapping elsewhere and
> then putting that mapping into possible_cpus at numa complete time
> (that's what I dislike, and I don't see a valid reason to do so).
> 
> That also won't work if the user queries hotpluggable-cpus before that time,
> and it also doesn't work if the user uses the preferable -numa cpu,topo_options,
> as both would initialize possible_cpus on the first access.
> 
> So if you need to do some board-specific post-processing on the topo
> information when it's complete and recalculate apic_id, do it at board
> init time as was suggested before (x86_cpu_new() looks like a good
> place to do it).

Ok. Sure. Will start working on it. Thanks