From: Heiko Sieger <1856335@bugs.launchpad.net>
To: qemu-devel@nongnu.org
Subject: [Bug 1856335] Re: Cache Layout wrong on many Zen Arch CPUs
Date: Sun, 10 May 2020 20:01:51 -0000 [thread overview]
Message-ID: <158914091142.4693.6888270013870332292.malone@soybean.canonical.com> (raw)
In-Reply-To: 157625616239.22064.10423897892496347105.malonedeb@gac.canonical.com
I upgraded to QEMU emulator version 5.0.50 and am now using the pc-q35-5.1 machine type (the latest), with the following libvirt configuration:
<memory unit="KiB">50331648</memory>
<currentMemory unit="KiB">50331648</currentMemory>
<memoryBacking>
<hugepages/>
</memoryBacking>
<vcpu placement="static">24</vcpu>
<cputune>
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="12"/>
<vcpupin vcpu="2" cpuset="1"/>
<vcpupin vcpu="3" cpuset="13"/>
<vcpupin vcpu="4" cpuset="2"/>
<vcpupin vcpu="5" cpuset="14"/>
<vcpupin vcpu="6" cpuset="3"/>
<vcpupin vcpu="7" cpuset="15"/>
<vcpupin vcpu="8" cpuset="4"/>
<vcpupin vcpu="9" cpuset="16"/>
<vcpupin vcpu="10" cpuset="5"/>
<vcpupin vcpu="11" cpuset="17"/>
<vcpupin vcpu="12" cpuset="6"/>
<vcpupin vcpu="13" cpuset="18"/>
<vcpupin vcpu="14" cpuset="7"/>
<vcpupin vcpu="15" cpuset="19"/>
<vcpupin vcpu="16" cpuset="8"/>
<vcpupin vcpu="17" cpuset="20"/>
<vcpupin vcpu="18" cpuset="9"/>
<vcpupin vcpu="19" cpuset="21"/>
<vcpupin vcpu="20" cpuset="10"/>
<vcpupin vcpu="21" cpuset="22"/>
<vcpupin vcpu="22" cpuset="11"/>
<vcpupin vcpu="23" cpuset="23"/>
</cputune>
<os>
<type arch="x86_64" machine="pc-q35-5.1">hvm</type>
<loader readonly="yes" type="pflash">/usr/share/OVMF/x64/OVMF_CODE.fd</loader>
<nvram>/var/lib/libvirt/qemu/nvram/win10_VARS.fd</nvram>
<boot dev="hd"/>
<bootmenu enable="no"/>
</os>
<features>
<acpi/>
<apic/>
<hyperv>
<relaxed state="on"/>
<vapic state="on"/>
<spinlocks state="on" retries="8191"/>
<vpindex state="on"/>
<synic state="on"/>
<stimer state="on"/>
<vendor_id state="on" value="AuthenticAMD"/>
<frequencies state="on"/>
</hyperv>
<kvm>
<hidden state="on"/>
</kvm>
<vmport state="off"/>
<ioapic driver="kvm"/>
</features>
<cpu mode="host-passthrough" check="none">
<topology sockets="1" cores="12" threads="2"/>
<cache mode="passthrough"/>
<feature policy="require" name="invtsc"/>
<feature policy="require" name="hypervisor"/>
<feature policy="require" name="topoext"/>
<numa>
<cell id="0" cpus="0-2,12-14" memory="12582912" unit="KiB"/>
<cell id="1" cpus="3-5,15-17" memory="12582912" unit="KiB"/>
<cell id="2" cpus="6-8,18-20" memory="12582912" unit="KiB"/>
<cell id="3" cpus="9-11,21-23" memory="12582912" unit="KiB"/>
</numa>
</cpu>
...
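Note that the <cputune> pinning above assumes that host CPUs n and n+12 are SMT siblings of the same physical core, which is the usual enumeration on a 12-core/24-thread Zen part. This is not part of the report itself, but it can be double-checked on the host with:

  # print the SMT sibling pair for each of the first 12 host cores
  for c in $(seq 0 11); do
      echo -n "cpu$c: "
      cat /sys/devices/system/cpu/cpu$c/topology/thread_siblings_list
  done

If the pinning matches the host, this should print pairs such as "cpu0: 0,12", "cpu1: 1,13", and so on.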
/var/log/libvirt/qemu/win10.log:
-machine pc-q35-5.1,accel=kvm,usb=off,vmport=off,dump-guest-core=off,kernel_irqchip=on,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format \
-cpu host,invtsc=on,hypervisor=on,topoext=on,hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,hv-vpindex,hv-synic,hv-stimer,hv-vendor-id=AuthenticAMD,hv-frequencies,hv-crash,kvm=off,host-cache-info=on,l3-cache=off \
-m 49152 \
-overcommit mem-lock=off \
-smp 24,sockets=1,cores=12,threads=2 \
-mem-prealloc \
-mem-path /dev/hugepages/libvirt/qemu/3-win10 \
-numa node,nodeid=0,cpus=0-2,cpus=12-14,mem=12288 \
-numa node,nodeid=1,cpus=3-5,cpus=15-17,mem=12288 \
-numa node,nodeid=2,cpus=6-8,cpus=18-20,mem=12288 \
-numa node,nodeid=3,cpus=9-11,cpus=21-23,mem=12288 \
...
For some reason I always get l3-cache=off.
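As far as I can tell, the l3-cache=off comes from the <cache mode="passthrough"/> element above: libvirt appears to translate passthrough into host-cache-info=on,l3-cache=off, and only enables the emulated guest L3 when an emulate-mode cache element is used. A minimal variant (untested here) would be:

  <cpu mode="host-passthrough" check="none">
    <topology sockets="1" cores="12" threads="2"/>
    <!-- emulate a single guest-visible L3 instead of passing host cache info through -->
    <cache level="3" mode="emulate"/>
    ...
  </cpu>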
CoreInfo.exe in Windows 10 then produces the following report
(shortened):
Logical to Physical Processor Map:
**---------------------- Physical Processor 0 (Hyperthreaded)
--*--------------------- Physical Processor 1
---*-------------------- Physical Processor 2
----**------------------ Physical Processor 3 (Hyperthreaded)
------**---------------- Physical Processor 4 (Hyperthreaded)
--------*--------------- Physical Processor 5
---------*-------------- Physical Processor 6
----------**------------ Physical Processor 7 (Hyperthreaded)
------------**---------- Physical Processor 8 (Hyperthreaded)
--------------*--------- Physical Processor 9
---------------*-------- Physical Processor 10
----------------**------ Physical Processor 11 (Hyperthreaded)
------------------**---- Physical Processor 12 (Hyperthreaded)
--------------------*--- Physical Processor 13
---------------------*-- Physical Processor 14
----------------------** Physical Processor 15 (Hyperthreaded)
Logical Processor to Socket Map:
************************ Socket 0
Logical Processor to NUMA Node Map:
***---------***--------- NUMA Node 0
---***---------***------ NUMA Node 1
------***---------***--- NUMA Node 2
---------***---------*** NUMA Node 3
Approximate Cross-NUMA Node Access Cost (relative to fastest):
00 01 02 03
00: 1.4 1.2 1.1 1.2
01: 1.1 1.1 1.3 1.1
02: 1.0 1.1 1.0 1.2
03: 1.1 1.2 1.2 1.2
Logical Processor to Cache Map:
**---------------------- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**---------------------- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**---------------------- Unified Cache 0, Level 2, 512 KB, Assoc 8, LineSize 64
***--------------------- Unified Cache 1, Level 3, 16 MB, Assoc 16, LineSize 64
--*--------------------- Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
--*--------------------- Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
--*--------------------- Unified Cache 2, Level 2, 512 KB, Assoc 8, LineSize 64
---*-------------------- Data Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
---*-------------------- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
---*-------------------- Unified Cache 3, Level 2, 512 KB, Assoc 8, LineSize 64
---***------------------ Unified Cache 4, Level 3, 16 MB, Assoc 16, LineSize 64
----**------------------ Data Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
----**------------------ Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
----**------------------ Unified Cache 5, Level 2, 512 KB, Assoc 8, LineSize 64
------**---------------- Data Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
------**---------------- Instruction Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
------**---------------- Unified Cache 6, Level 2, 512 KB, Assoc 8, LineSize 64
------**---------------- Unified Cache 7, Level 3, 16 MB, Assoc 16, LineSize 64
--------*--------------- Data Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
--------*--------------- Instruction Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
--------*--------------- Unified Cache 8, Level 2, 512 KB, Assoc 8, LineSize 64
--------*--------------- Unified Cache 9, Level 3, 16 MB, Assoc 16, LineSize 64
---------*-------------- Data Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
---------*-------------- Instruction Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
---------*-------------- Unified Cache 10, Level 2, 512 KB, Assoc 8, LineSize 64
---------***------------ Unified Cache 11, Level 3, 16 MB, Assoc 16, LineSize 64
----------**------------ Data Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
----------**------------ Instruction Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
----------**------------ Unified Cache 12, Level 2, 512 KB, Assoc 8, LineSize 64
------------**---------- Data Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
------------**---------- Instruction Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
------------**---------- Unified Cache 13, Level 2, 512 KB, Assoc 8, LineSize 64
------------***--------- Unified Cache 14, Level 3, 16 MB, Assoc 16, LineSize 64
--------------*--------- Data Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
--------------*--------- Instruction Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
--------------*--------- Unified Cache 15, Level 2, 512 KB, Assoc 8, LineSize 64
---------------*-------- Data Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
---------------*-------- Instruction Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
---------------*-------- Unified Cache 16, Level 2, 512 KB, Assoc 8, LineSize 64
---------------*-------- Unified Cache 17, Level 3, 16 MB, Assoc 16, LineSize 64
----------------**------ Data Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
----------------**------ Instruction Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
----------------**------ Unified Cache 18, Level 2, 512 KB, Assoc 8, LineSize 64
----------------**------ Unified Cache 19, Level 3, 16 MB, Assoc 16, LineSize 64
------------------**---- Data Cache 12, Level 1, 32 KB, Assoc 8, LineSize 64
------------------**---- Instruction Cache 12, Level 1, 32 KB, Assoc 8, LineSize 64
------------------**---- Unified Cache 20, Level 2, 512 KB, Assoc 8, LineSize 64
------------------***--- Unified Cache 21, Level 3, 16 MB, Assoc 16, LineSize 64
--------------------*--- Data Cache 13, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------*--- Instruction Cache 13, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------*--- Unified Cache 22, Level 2, 512 KB, Assoc 8, LineSize 64
---------------------*-- Data Cache 14, Level 1, 32 KB, Assoc 8, LineSize 64
---------------------*-- Instruction Cache 14, Level 1, 32 KB, Assoc 8, LineSize 64
---------------------*-- Unified Cache 23, Level 2, 512 KB, Assoc 8, LineSize 64
---------------------*** Unified Cache 24, Level 3, 16 MB, Assoc 16, LineSize 64
----------------------** Data Cache 15, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------** Instruction Cache 15, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------** Unified Cache 25, Level 2, 512 KB, Assoc 8, LineSize 64
Logical Processor to Group Map:
************************ Group 0
The above result is even further away from the actual L3 cache configuration.
So numatune doesn't produce the expected outcome.
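For comparison, the host's real L3 (CCX) boundaries can be dumped from sysfs; this is a generic check, not taken from the original report:

  # one line per logical host CPU: which CPUs share its L3 cache (i.e. the CCX)
  for c in /sys/devices/system/cpu/cpu[0-9]*; do
      echo "$(basename $c): L3 shared with $(cat $c/cache/index3/shared_cpu_list)"
  done

With 3 cores per CCX and the pinning above, the host should report groups like 0-2,12-14, which is exactly the layout the guest's Coreinfo map fails to reproduce.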
https://bugs.launchpad.net/bugs/1856335
Title:
Cache Layout wrong on many Zen Arch CPUs
Status in QEMU:
New
Bug description:
AMD CPUs have an L3 cache per 2, 3 or 4 cores. Currently, TOPOEXT seems
to always map the cache as if it were a 4-core-per-CCX CPU, which is
incorrect and costs upwards of 30% performance (more realistically 10%)
in L3-cache-layout-aware applications.
Example on a 4-CCX CPU (1950X with 8 cores and no SMT):
<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>EPYC-IBPB</model>
<vendor>AMD</vendor>
<topology sockets='1' cores='8' threads='1'/>
In Windows, Coreinfo reports correctly:
****---- Unified Cache 1, Level 3, 8 MB, Assoc 16, LineSize 64
----**** Unified Cache 6, Level 3, 8 MB, Assoc 16, LineSize 64
On a 3-CCX CPU (3960X with 6 cores and no SMT):
<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>EPYC-IBPB</model>
<vendor>AMD</vendor>
<topology sockets='1' cores='6' threads='1'/>
In Windows, Coreinfo reports incorrectly:
****-- Unified Cache 1, Level 3, 8 MB, Assoc 16, LineSize 64
----** Unified Cache 6, Level 3, 8 MB, Assoc 16, LineSize 64
Validated against qemu-kvm versions 3.0, 3.1, 4.1, and 4.2.
With newer QEMU there is a fix (that behaves correctly) using the dies parameter:
<qemu:arg value='cores=3,threads=1,dies=2,sockets=1'/>
The problem is that the dies are exposed differently from how AMD does
it natively: they are exposed to Windows as sockets. This means that if
you are not a business user, you can never have a machine with more
than two CCXs (6 cores), because consumer versions of Windows only
support two sockets. (Should this be reported as a separate bug?)
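For completeness, that qemu:arg presumably sits inside a <qemu:commandline> block, with the qemu XML namespace declared on the <domain> element; the wrapper below is only a sketch and is not taken from the original report:

  <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
    ...
    <qemu:commandline>
      <qemu:arg value='-smp'/>
      <qemu:arg value='cores=3,threads=1,dies=2,sockets=1'/>
    </qemu:commandline>
  </domain>

Newer libvirt releases should also accept dies directly in the topology element, e.g. <topology sockets='1' dies='2' cores='3' threads='1'/>, which avoids the raw command-line passthrough.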