From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.6 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 98B05C35247 for ; Wed, 5 Feb 2020 16:57:29 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 4AC8120674 for ; Wed, 5 Feb 2020 16:57:29 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="cnX+Pcow" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4AC8120674 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:52762 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1izNzU-000228-ED for qemu-devel@archiver.kernel.org; Wed, 05 Feb 2020 11:57:28 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:54602) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1izNyh-0001UB-Op for qemu-devel@nongnu.org; Wed, 05 Feb 2020 11:56:42 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1izNyc-0006Ul-W9 for qemu-devel@nongnu.org; Wed, 05 Feb 2020 11:56:37 -0500 Received: from us-smtp-delivery-1.mimecast.com ([207.211.31.120]:21485 helo=us-smtp-1.mimecast.com) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1izNyc-0006P7-Fu for qemu-devel@nongnu.org; Wed, 05 Feb 2020 11:56:34 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1580921793; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=J6GFBznRN/dDHkJH0T6y3jrYZYRl+/Qq4YLWmZXaaWo=; b=cnX+Pcow9BsXwQ2c9EZQi5LuqxfkYx7El54Oraz35YqkY2Wd/cVHl3ZIQxx9jINQ1SvJ6i bhreZXf2cwbq3YQurrDoZKXQfiPRYatC7iOfxNTA1z9nqJGookVOEwmjdSbHDHJR8D+J9B O+l7V41KXUe5GuCIrFcceyWLKEcrekI= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-333-HQo25MXGPzeyuK8Mp-EUbQ-1; Wed, 05 Feb 2020 11:56:30 -0500 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 742FA800D54; Wed, 5 Feb 2020 16:56:28 +0000 (UTC) Received: from localhost (unknown [10.43.2.114]) by smtp.corp.redhat.com (Postfix) with ESMTP id A575860BF4; Wed, 5 Feb 2020 16:56:21 +0000 (UTC) Date: Wed, 5 Feb 2020 17:56:19 +0100 From: Igor Mammedov To: Babu Moger Subject: Re: [PATCH v3 00/18] APIC ID fixes for AMD EPYC CPU models Message-ID: <20200205175619.51be0cc8@redhat.com> In-Reply-To: <3b7e02cb-4da0-d5f0-6cd0-c9b8045bbe17@amd.com> References: <157541968844.46157.17994918142533791313.stgit@naples-babu.amd.com> <20200203155921.658d7765@redhat.com> <20200204090230.28f31a87@redhat.com> <20200205103838.424b40a9@redhat.com> <3b7e02cb-4da0-d5f0-6cd0-c9b8045bbe17@amd.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 X-MC-Unique: HQo25MXGPzeyuK8Mp-EUbQ-1 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 207.211.31.120 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: ehabkost@redhat.com, mst@redhat.com, armbru@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com, rth@twiddle.net Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" On Wed, 5 Feb 2020 10:10:06 -0600 Babu Moger wrote: > On 2/5/20 3:38 AM, Igor Mammedov wrote: > > On Tue, 4 Feb 2020 13:08:58 -0600 > > Babu Moger wrote: > > =20 > >> On 2/4/20 2:02 AM, Igor Mammedov wrote: =20 > >>> On Mon, 3 Feb 2020 13:31:29 -0600 > >>> Babu Moger wrote: > >>> =20 > >>>> On 2/3/20 8:59 AM, Igor Mammedov wrote: =20 > >>>>> On Tue, 03 Dec 2019 18:36:54 -0600 > >>>>> Babu Moger wrote: > >>>>> =20 > >>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs. > >>>>>> https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%= 2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&data=3D02%7C01%7Cba= bu.moger%40amd.com%7Cdbfd059a060a4851aad908d7aa1f3532%7C3dd8961fe4884e608e1= 1a82d994e183d%7C0%7C0%7C637164923333568238&sdata=3DP0I547X5r0s9emWu3ptI= cm1U%2FhCMZmnMQOQ0IgLPzzQ%3D&reserved=3D0 > >>>>>> > >>>>>> Currently, the APIC ID is decoded based on the sequence > >>>>>> sockets->dies->cores->threads. This works for most standard AMD an= d other > >>>>>> vendors' configurations, but this decoding sequence does not follo= w that of > >>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause C= PU topology > >>>>>> inconsistency. When booting a guest VM, the kernel tries to valid= ate the > >>>>>> topology, and finds it inconsistent with the enumeration of EPYC c= pu models. > >>>>>> > >>>>>> To fix the problem we need to build the topology as per the Proces= sor > >>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision= B1 > >>>>>> Processors. It is available at https://nam11.safelinks.protection.= outlook.com/?url=3Dhttps%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F= 55570-B1_PUB.zip&data=3D02%7C01%7Cbabu.moger%40amd.com%7Cdbfd059a060a48= 51aad908d7aa1f3532%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C63716492333= 3568238&sdata=3DAO6m%2FEI17iLoAa3gNnRSJKJAdvBRKh0Dmbr7bCVA0us%3D&re= served=3D0 > >>>>>> > >>>>>> Here is the text from the PPR. > >>>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[Api= cIdSize], the > >>>>>> number of least significant bits in the Initial APIC ID that indic= ate core ID > >>>>>> within a processor, in constructing per-core CPUID masks. > >>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number= of cores > >>>>>> (MNC) that the processor could theoretically support, not the actu= al number of > >>>>>> cores that are actually implemented or enabled on the processor, a= s indicated > >>>>>> by Core::X86::Cpuid::SizeId[NC]. > >>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows= : > >>>>>> =E2=80=A2 ApicId[6] =3D Socket ID. > >>>>>> =E2=80=A2 ApicId[5:4] =3D Node ID. > >>>>>> =E2=80=A2 ApicId[3] =3D Logical CCX L3 complex ID > >>>>>> =E2=80=A2 ApicId[2:0]=3D (SMT) ? {LogicalCoreID[1:0],ThreadId} : {= 1'b0,LogicalCoreID[1:0]} =20 > >>>>> > >>>>> > >>>>> After checking out all patches and some pondering, used here approa= ch > >>>>> looks to me too intrusive for the task at hand especially where it > >>>>> comes to generic code. > >>>>> > >>>>> (Ignore till =3D=3D=3D=3D to see suggestion how to simplify without= reading > >>>>> reasoning behind it first) > >>>>> > >>>>> Lets look for a way to simplify it a little bit. > >>>>> > >>>>> So problem we are trying to solve, > >>>>> 1: calculate APIC IDs based on cpu type (to e more specific: for E= PYC based CPUs) > >>>>> 2: it depends on knowing total number of numa nodes. > >>>>> > >>>>> Externally workflow looks like following: > >>>>> 1. user provides -smp x,sockets,cores,...,maxcpus > >>>>> that's used by possible_cpu_arch_ids() singleton to build lis= t of > >>>>> possible CPUs (which is available to user via command 'hotplu= ggable-cpus') > >>>>> > >>>>> Hook could be called very early and possible_cpus data might = be > >>>>> not complete. It builds a list of possible CPUs which user co= uld > >>>>> modify later. > >>>>> > >>>>> 2.1 user uses "-numa cpu,node-id=3Dx,..." or legacy "-numa node,n= ode_id=3Dx,cpus=3D" > >>>>> options to assign cpus to nodes, which is one way or another = calling > >>>>> machine_set_cpu_numa_node(). The later updates 'possible_cpus= ' list > >>>>> with node information. It happens early when total number of = nodes > >>>>> is not available. > >>>>> > >>>>> 2.2 user does not provide explicit node mappings for CPUs. > >>>>> QEMU steps in and assigns possible cpus to nodes in machine_n= uma_finish_cpu_init() > >>>>> (using the same machine_set_cpu_numa_node()) right before cal= ling boards > >>>>> specific machine init(). At that time total number of nodes i= s known. > >>>>> > >>>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have t= o be defined before > >>>>> boards init() is run. =20 > >> > >> In case of 2.1, we need to have the arch_id already generated. This is > >> done inside possible_cpu_arch_ids. The arch_id is used by > >> machine_set_cpu_numa_node to assign the cpus to correct numa node. =20 > >=20 > > I might have missed something but I don't see arch_id itself being used= in > > machine_set_cpu_numa_node(). It only uses props part of possible_cpus = =20 >=20 > Before calling machine_set_cpu_numa_node, we call > cpu_index_to_instance_props -> x86_cpu_index_to_props-> > possible_cpu_arch_ids->x86_possible_cpu_arch_ids. >=20 > This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all > the available cpus. Based on the arch_id, it also sets up the props. x86_possible_cpu_arch_ids() arch_id =3D x86_cpu_apic_id_from_index(x86ms, i) x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores, ms->s= mp.threads, &topo); // assign socket/die/core/thread from topo so currently it uses indirect way to convert index in possible_cpus->cpus[] to socket/die/core/thread ids. But essentially it take '-smp' options and [0..max_cpus) number as original= data converts it into intermediate apic_id and then reverse engineer it back to topo info. Why not use x86_topo_ids_from_idx() directly to get rid of 'props' dependen= cy on apic_id? > And these props values are used to assign the nodes in > machine_set_cpu_numa_node. >=20 > At this point we are still parsing the numa nodes and so we don't know th= e > total number of numa nodes. Without that information, the arch_id > generated here will not be correct for EPYC models. >=20 > This is the reason for changing the generic numa code(patch #12-Split the > numa initialization). >=20 > >=20 > > =20 > >> If we want to move the arch_id generation into board init(), then we n= eed > >> to save the cpu indexes belonging to each node somewhere. =20 > >=20 > > when cpus are assigned explicitly, decision what cpus go to what nodes = is > > up to user and user configured mapping is stored in MachineState::possi= ble_cpus > > which is accessed by via possible_cpu_arch_ids() callback. > > Hence I don see any reason to touch cpu indexes. =20 >=20 > Please see my reasoning above. >=20 > > =20 > >> =20 > >>>>> > >>>>> In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_c= pu_node_id() > >>>>> which uses arch_id calculate numa node. > >>>>> But then question is: does it have to use APIC id or could it infer= 'pkg_id', > >>>>> it's after, from ms->possible_cpus->cpus[i].props data? =20 > >>>> > >>>> Not sure if I got the question right. In this case because the numa > >>>> information is not provided all the cpus are assigned to only one no= de. > >>>> The apic id is used here to get the correct pkg_id. =20 > >>> > >>> apicid was composed from socket/core/thread[/die] tuple which cpus[i]= .props is. > >>> > >>> Question is if we can compose only pkg_id based on the same data with= out > >>> converting it to apicid and then "reverse engineering" it back > >>> original data? =20 > >> > >> Yes. It is possible. > >> =20 > >>> > >>> Or more direct question: is socket-id the same as pkg_id? =20 > >> > >> Yes. Socket_id and pkg_id is same. > >> =20 > >>> > >>> =20 > >>>> =20 > >>>>> =20 > >>>>> With that out of the way APIC ID will be used only during board's i= nit(), > >>>>> so board could update possible_cpus with valid APIC IDs at the star= t of > >>>>> x86_cpus_init(). > >>>>> > >>>>> =3D=3D=3D=3D > >>>>> in nutshell it would be much easier to do following: > >>>>> > >>>>> 1. make x86_get_default_cpu_node_id() APIC ID in-depended or > >>>>> if impossible as alternative recompute APIC IDs there if cpu > >>>>> type is EPYC based (since number of nodes is already known) > >>>>> 2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based > >>>>> > >>>>> this way one doesn't need to touch generic numa code, introduce > >>>>> x86 specific init_apicid_fn() hook into generic code and keep > >>>>> x86/EPYC nuances contained within x86 code only. =20 > >>>> > >>>> I was kind of already working in the similar direction in v4. > >>>> 1. We already have split the numa initialization in patch #12(Split = the > >>>> numa initialization). This way we know exactly how many numa nodes a= re > >>>> there before hand. =20 > >>> > >>> I suggest to drop that patch, It's the one that touches generic numa > >>> code and adding more legacy based extensions like cpu_indexes. > >>> Which I'd like to get rid of to begin with, so only -numa cpu is left= . > >>> > >>> I think it's not necessary to touch numa code at all for apicid gener= ation > >>> purpose, as I tried to explain above. We should be able to keep > >>> this x86 only business. =20 > >> > >> This is going to be difficult without touching the generic numa code.p= atch #12(Split the =20 > >>>> numa initialization) =20 > >=20 > > Looking at current code I don't see why one would touch numa code. > > Care to explain in more details why you'd have to touch it? =20 >=20 > Please see the reasoning above. > > =20 > >>>> 2. Planning to remove init_apicid_fn > >>>> 3. Insert the handlers inside X86CPUDefinition. =20 > >>> what handlers do you mean? =20 > >> > >> Apicid generation logic can be separated into 3 types of handlers. > >> x86_apicid_from_cpu_idx: Generate apicid from cpu index. > >> x86_topo_ids_from_apicid: Generate topo ids from apic id. > >> x86_apicid_from_topo_ids: Generate apicid from topo ids. > >> > >> We should be able to generate one id from other(you can see topology.h= ). > >> > >> X86CPUDefinition will have the handlers specific to each model like th= e > >> way we have features now. The above 3 handlers will be used as default > >> handler. =20 > >=20 > > it probably shouldn't be a part of X86CPUDefinition, > > as it's machines responsibility to generate and set APIC ID. > >=20 > > What you are doing with this topo functions in this version > > looks more that enough to me. =20 >=20 > It is all the exact same topo functions. Only making these functions as > the handlers inside the X86CPUDefinition. >=20 > > =20 > >> The EPYC model will have its corresponding handlers. > >> > >> x86_apicid_from_cpu_idx_epyc > >> x86_topo_ids_from_apicid_epyc > >> x86_apicid_from_topo_ids_epyc. =20 > >=20 > > CPU might use call backs, but does it have to? > > I see cpu_x86_cpuid() uses these functions to decode apic_id back to to= po > > info and then compose various leaves based on it. > > Within CPU code I'd just use > > if (i_am_epyc) > > x86_topo_ids_from_apicid_epyc() > > else > > x86_topo_ids_from_apicid() > > it's easier to read and one doesn't have to go figure > > indirection chain to figure out what code is called. =20 >=20 > Eduardo already commented on this idea. Anything specific to cpu models > should be part of the X86CPUDefinition. We should not compare the specifi= c > model here. Comparing the specific model does not scale. We are achieving > this by loading the model definition(similar to what we do in > x86_cpu_load_model). ok >=20 > > =20 > >>>> 4. EPYC model will have its own apid id handlers. Everything else wi= ll be > >>>> initialized with a default handlers(current default handler). > >>>> 5. The function pc_possible_cpu_arch_ids will load the model definit= ion > >>>> and initialize the PCMachineState data structure with the model spec= ific > >>>> handlers. =20 > >>> I'm not sure what do you mean here. =20 > >> > >> PCMachineState will have the function pointers to the above handlers. > >> I was going to load the correct handler based on the mode type. =20 > >=20 > > Could be done like this, but considering that within machine we need > > to calculate apic_id only once, the same 'if' trick would be simpler > >=20 > > x86_cpus_init() { > >=20 > > if (cpu =3D=3D epic) { > > make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms)) > > } =20 >=20 > Once again, this does not scale. Please see my response above. >=20 > >=20 > > // go on with creating cpus ... > > } > > =20 > >>>> Does that sound similar to what you are thinking. Thoughts? =20 > >>> If you have something to share and can push it on github, > >>> I can look at, whether it has design issues to spare you a round trip= on a list. > >>> (it won't be proper review but at least I can help to pinpoint most p= roblematic parts) > >>> =20 > >> My code for the current approach is kind of ready(yet to be tested). I= can > >> send it as v3.1 if you want to look. Or we can wait for our discussion= to > >> settle. I will post it after our discussion. =20 > > ok, lets wait till we finish this discussion =20 >=20 > I can post my draft patch to give you more idea about what i am talking > about now. Let me know. >=20 > > =20 > >> There is one more problem we need to address. I was going to address l= ater > >> in v4 or v5. > >> > >> This works > >> -numa node,nodeid=3D0,cpus=3D0-3 -numa node,nodeid=3D1,cpus=3D4-7 > >> > >> This does not work > >> -numa node,nodeid=3D0,cpus=3D0-5 -numa node,nodeid=3D1,cpus=3D6-7 =20 > > Is it supposed to work (i.e. can real hardware do such topology)? =20 >=20 > Hardware does not support this configuration. That is why I did not think > it is serious enough to fix this problem right now. >=20 > > =20 > >> This requires the generic code to pass the node information to the x86 > >> code which requires some handler changes. I was thinking my code will > >> simplify the changes to address this issue. =20 > >=20 > > without more information, it's hard to comment on issue and whether > > extra complexity of callbacks is justificated.=20 > >=20 > > There could be 2 ways here, add fixes to this series so we could see th= e reason > > or make this series simple to solve apic_id problem only and then on to= p of > > it send the second series that solves another issue. > >=20 > > Considering that this series is already big/complicated enough, > > personally I'd go for 2nd option. As it's easier to describe what patch= es are > > doing and easier to review =3D> should result in faster reaching consen= sus and merging. > > [...] > > =20 >=20