Subject: Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
To: Cédric Le Goater, David Gibson
Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman, qemu-devel@nongnu.org, groug@kaod.org, qemu-ppc@nongnu.org, Igor Mammedov
From: Daniel Henrique Barboza
Date: Wed, 31 Mar 2021 14:29:12 -0300
Message-ID: <61876812-c915-6489-3058-b463967b0679@gmail.com>
In-Reply-To: <1e16fe5e-f20a-f882-d18a-113cf48c934c@kaod.org>
References: <20210319183453.4466-1-danielhb413@gmail.com> <20210319183453.4466-2-danielhb413@gmail.com> <2025f26f-5883-4e86-02af-5b83a8d52465@gmail.com> <9870aaba-9921-5c5d-113c-5be6cd098cf2@kaod.org> <91e406bf-c9c6-0734-1f69-081d3633332b@gmail.com> <1e16fe5e-f20a-f882-d18a-113cf48c934c@kaod.org>

On 3/31/21 12:18 PM, Cédric Le Goater
wrote:
> On 3/31/21 2:57 AM, David Gibson wrote:
>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>>
>>>
>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>>> topology_core_cpumask") caused a regression in the pseries machine when
>>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>>>>> cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble with
>>>>>>>>>> large VMs with lots of CPUs and was replaced by cpu_cpu_mask because, as
>>>>>>>>>> far as the kernel's understanding of SMP topologies goes, both masks are
>>>>>>>>>> equivalent.
>>>>>>>>>>
>>>>>>>>>> Further discussions on the kernel mailing list [2] showed that the
>>>>>>>>>> powerpc kernel always considered the number of sockets to be equal
>>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>>>>>> for Power hardware at least, to have 2+ sockets in the same NUMA node. The
>>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine was
>>>>>>>>>> supplying to the kernel with more than one socket in the same NUMA node,
>>>>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>>>>> accident during all these years.
>>>>>>>>>>
>>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>>> hardware constraints, allowing maximum flexibility to users.
>>>>>>>>>> At the same
>>>>>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>>>>>
>>>>>>>>>> Starting in 6.0.0, all sockets must match a unique NUMA node in the
>>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>>> condition.
>>>>>>>>>
>>>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work
>>>>>>>>> for users to match these up, for no real gain. I'm also concerned
>>>>>>>>> that this will require follow-on changes in libvirt to not make this a
>>>>>>>>> really cryptic and irritating point of failure.
>>>>>>>>
>>>>>>>> I haven't thought about the required Libvirt changes, although I can say that
>>>>>>>> some will have to be made and it will probably annoy existing users
>>>>>>>> (everyone that has a multiple socket per NUMA node topology).
>>>>>>>>
>>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>>>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>>>>>> see if there is a way to keep our use case untouched.
>>>>>>>
>>>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more
>>>>>>> replies from Cédric to my objections and mpe's. Even with sockets
>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>>> NUMA nodes makes sense.
>>>>>>
>>>>>> I did a couple of replies in different email threads but maybe not
>>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>>
>>>>> I think I saw some of those, but maybe not all.
>>>>>
>>>>>> Shouldn't we get rid of the socket concept, die also, under pseries
>>>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
>>>>>> threads AFAICT.
>>>>>
>>>>> Theoretically, yes.
>>>>> I'm not sure it's really practical, though, since
>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though
>>>>> not dies) built into generic code.
>>>>
>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
>>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>>
>>>>> It does mean that one possible approach here - maybe the best one - is
>>>>> to simply declare that sockets are meaningless under PAPR, so we simply
>>>>> don't expect what the guest kernel reports to match what's given to
>>>>> qemu.
>>>>>
>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>>> but it is likely to surprise and confuse people.
>>>>>
>>>>>> Should we diverge from PAPR and add extra DT properties "qemu,..." ?
>>>>>> There are a couple of places where Linux checks for the underlying
>>>>>> hypervisor already.
>>>>>>
>>>>>>>> This also means that
>>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>>>>> we convey cores-per-socket information to the kernel.
>>>>>>>
>>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>>> information. I haven't given up hope for that yet.
>>>>>>
>>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>>> today.
>>>>>
>>>>> Fixing what value, exactly?
>>>>
>>>> The value of the "ibm,chip-id" since we are keeping the property under
>>>> QEMU.
>>>
>>> David, I believe this has to do with the discussion we had last Friday.
>>>
>>> I mentioned that the ibm,chip-id property is being calculated in a way that
>>> results in the same ibm,chip-id for CPUs that belong to different NUMA nodes,
>>> e.g.:
>>>
>>> -smp 4,cores=4,maxcpus=8,threads=1 \
>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>
>>>
>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>>>         ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>>>         ibm,pft-size = <0x00 0x19>;
>>>         ibm,chip-id = <0x00>;
>>> --
>>>         ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>>>         ibm,pft-size = <0x00 0x19>;
>>>         ibm,chip-id = <0x00>;
>>> --
>>>         ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>>>         ibm,pft-size = <0x00 0x19>;
>>>         ibm,chip-id = <0x00>;
>>> --
>>>         ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>>>         ibm,pft-size = <0x00 0x19>;
>>>         ibm,chip-id = <0x00>;
>>
>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>>> different NUMA node than 0-1. This would mean that the same socket
>>> would belong to different NUMA nodes at the same time.
>>
>> Right... and I'm still not seeing why that's a problem. AFAICT that's
>> a possible, if unexpected, situation under real hardware - though
>> maybe not for POWER9 specifically.
> The ibm,chip-id property does not exist under PAPR. PAPR only has
> NUMA nodes, no sockets nor chips.
>
> And the property value is simply broken under QEMU. Try this :
>
>    -smp 4,cores=1,maxcpus=8 \
>    -object memory-backend-ram,id=ram-node0,size=2G \
>    -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>    -object memory-backend-ram,id=ram-node1,size=2G \
>    -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>
> # dmesg | grep numa
> [ 0.013106] numa: Node 0 CPUs: 0-1
> [ 0.013136] numa: Node 1 CPUs: 2-3
>
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>         ibm,chip-id = <0x01>;
>         ibm,chip-id = <0x02>;
>         ibm,chip-id = <0x00>;
>         ibm,chip-id = <0x03>;

These values are not wrong.
When you do:

    -smp 4,cores=1,maxcpus=8 (....)

you leave threads and sockets unspecified. QEMU's default is to use sockets
to fill in the missing information, up to the maxcpus value. This means that
what you did is equivalent to:

    -smp 4,threads=1,cores=1,sockets=8,maxcpus=8 (....)

That is a configuration with 1 thread per core, 1 core per socket and 8
sockets. Each possible CPU sits in its own core and its own socket, and
therefore has its own ibm,chip-id. So:

    "-numa node,nodeid=0,cpus=0-1"

is in fact allocating sockets 0 and 1 to NUMA node 0.


Thanks,

DHB

>
>>> I believe this is what Cedric wants to be addressed. Given that the
>>> property is named after the OPAL property ibm,chip-id, the kernel
>>> expects that the property will have the same semantics as in OPAL.
>
>> Even on powernv, I'm not clear why chip-id is tied into the NUMA
>> configuration, rather than getting all the NUMA info from
>> associativity properties.
>
> It is the case.
>
> The associativity properties are built from chip-id in OPAL though.
>
> The chip-id property is only used in low level PowerNV drivers, VAS,
> XSCOM, LPC, etc.
>
> It's also badly used in the common part of the XIVE driver, which I am
> trying to fix to introduce an IPI per node on all platforms.
>
> C.
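
For readers following the two -smp examples in this message, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not QEMU code: the function names (expand_smp, chip_id, sockets_spanning_nodes) are hypothetical, the default-filling rule is a simplified reading of "QEMU default is to prioritize sockets" from the reply above, and the chip-id formula (CPU index divided by cores * threads) is an assumption that happens to reproduce both device tree dumps quoted in the thread.

```python
from collections import defaultdict


def expand_smp(cpus, maxcpus=None, sockets=None, cores=None, threads=None):
    """Fill in unspecified -smp fields, giving the remainder to sockets.

    Simplified assumption: cores and threads default to 1, and sockets
    absorbs whatever is needed to reach maxcpus.
    """
    maxcpus = cpus if maxcpus is None else maxcpus
    cores = 1 if cores is None else cores
    threads = 1 if threads is None else threads
    if sockets is None:
        sockets = maxcpus // (cores * threads)
    if sockets * cores * threads != maxcpus:
        raise ValueError("sockets * cores * threads must equal maxcpus")
    return {"cpus": cpus, "maxcpus": maxcpus, "sockets": sockets,
            "cores": cores, "threads": threads}


def chip_id(cpu_index, topo):
    """Assumed derivation: one chip-id per socket's worth of vCPUs."""
    return cpu_index // (topo["cores"] * topo["threads"])


def sockets_spanning_nodes(topo, numa):
    """Return {socket: [nodes]} for sockets whose CPUs sit in 2+ NUMA nodes."""
    nodes_of_socket = defaultdict(set)
    for node, cpu_list in numa.items():
        for cpu in cpu_list:
            nodes_of_socket[chip_id(cpu, topo)].add(node)
    return {s: sorted(n) for s, n in nodes_of_socket.items() if len(n) > 1}


# -numa node,nodeid=0,cpus=0-1,cpus=4-5 / nodeid=1,cpus=2-3,cpus=6-7
NUMA = {0: [0, 1, 4, 5], 1: [2, 3, 6, 7]}

# -smp 4,cores=4,maxcpus=8,threads=1: two sockets, each split over both nodes
daniel = expand_smp(4, maxcpus=8, cores=4, threads=1)
print(daniel["sockets"], sockets_spanning_nodes(daniel, NUMA))
# prints: 2 {0: [0, 1], 1: [0, 1]}

# -smp 4,cores=1,maxcpus=8: eight single-CPU sockets, none split
cedric = expand_smp(4, maxcpus=8, cores=1)
print(cedric["sockets"], sockets_spanning_nodes(cedric, NUMA))
# prints: 8 {}
```

Under these assumptions the first command line is the case the patch would reject (a socket shared by two NUMA nodes), while the second one yields chip-ids 0 to 3 for the four present CPUs, matching the dtc output quoted above.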