From: Jan Klos <1856335@bugs.launchpad.net>
To: qemu-devel@nongnu.org
Subject: [Bug 1856335] Re: Cache Layout wrong on many Zen Arch CPUs
Date: Sun, 17 May 2020 11:15:50 -0000	[thread overview]
Message-ID: <158971415072.22897.16303582021252899317.malone@gac.canonical.com> (raw)
In-Reply-To: 157625616239.22064.10423897892496347105.malonedeb@gac.canonical.com

No, creating artificial NUMA nodes is, simply put, never a good solution
for CPUs that operate as a single NUMA node - which is the case for all
Zen2 CPUs (except maybe EPYCs? not sure about those).

You may work around the L3 issue that way, but you will hit many new
bugs/problems by introducing multiple NUMA nodes, _especially_ on
Windows VMs, because that OS has crappy NUMA handling and a multitude
of bugs related to it - which was one of the major reasons why even
Zen2 Threadrippers are now a single NUMA node (e.g.
https://www.servethehome.com/wp-content/uploads/2019/11/AMD-Ryzen-Threadripper-3960X-Topology.png ).
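
For reference, the artificial-NUMA workaround being argued against here
is usually done with the libvirt <numa> element; a sketch only - the
cell sizes and CPU ranges below are made-up examples, not a
recommendation:

  <cpu mode='host-passthrough'>
    <numa>
      <!-- fake cells: split 12 vCPUs and 16 GiB across two NUMA nodes -->
      <cell id='0' cpus='0-5' memory='8' unit='GiB'/>
      <cell id='1' cpus='6-11' memory='8' unit='GiB'/>
    </numa>
  </cpu>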

The host CPU architecture should be replicated as closely as possible
in the VM, and for Zen2 CPUs with 4 cores per CCX, _this already works
perfectly_ - there are no problems on
3300X/3700(X)/3800X/3950X/3970X/3990X.

There is, unfortunately, no way to customize/specify the "disabled" CPU
cores in QEMU, and therefore no way to emulate 1 NUMA node + L3 cache
per 2/3 cores - the only option is to pass the cache config through
from the host, which is unfortunately not done correctly for CPUs with
disabled cores (but again, it works perfectly for CPUs with all 4 cores
per CCX enabled).
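
The cache passthrough mentioned above is the standard libvirt <cache>
element; a minimal sketch for a 3900X-like guest - the topology numbers
here are illustrative assumptions, not taken from this thread:

  <cpu mode='host-passthrough'>
    <!-- expose the host's cache topology to the guest -->
    <cache mode='passthrough'/>
    <!-- assumed guest topology mirroring a 12-core/24-thread host -->
    <topology sockets='1' cores='12' threads='2'/>
  </cpu>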

lscpu:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 9 3900X 12-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2972.127
CPU max MHz:                     3800.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        7602.55
Virtualization:                  AMD-V
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        6 MiB
L3 cache:                        64 MiB
NUMA node0 CPU(s):               0-23
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca


But the important thing has already been posted here in previous comments - notice the skipped core IDs belonging to the disabled cores:

virsh capabilities | grep "cpu id":
<cpu id='0' socket_id='0' core_id='0' siblings='0,12'/>
<cpu id='1' socket_id='0' core_id='1' siblings='1,13'/>
<cpu id='2' socket_id='0' core_id='2' siblings='2,14'/>
<cpu id='3' socket_id='0' core_id='4' siblings='3,15'/>
<cpu id='4' socket_id='0' core_id='5' siblings='4,16'/>
<cpu id='5' socket_id='0' core_id='6' siblings='5,17'/>
<cpu id='6' socket_id='0' core_id='8' siblings='6,18'/>
<cpu id='7' socket_id='0' core_id='9' siblings='7,19'/>
<cpu id='8' socket_id='0' core_id='10' siblings='8,20'/>
<cpu id='9' socket_id='0' core_id='12' siblings='9,21'/>
<cpu id='10' socket_id='0' core_id='13' siblings='10,22'/>
<cpu id='11' socket_id='0' core_id='14' siblings='11,23'/>
<cpu id='12' socket_id='0' core_id='0' siblings='0,12'/>
<cpu id='13' socket_id='0' core_id='1' siblings='1,13'/>
<cpu id='14' socket_id='0' core_id='2' siblings='2,14'/>
<cpu id='15' socket_id='0' core_id='4' siblings='3,15'/>
<cpu id='16' socket_id='0' core_id='5' siblings='4,16'/>
<cpu id='17' socket_id='0' core_id='6' siblings='5,17'/>
<cpu id='18' socket_id='0' core_id='8' siblings='6,18'/>
<cpu id='19' socket_id='0' core_id='9' siblings='7,19'/>
<cpu id='20' socket_id='0' core_id='10' siblings='8,20'/>
<cpu id='21' socket_id='0' core_id='12' siblings='9,21'/>
<cpu id='22' socket_id='0' core_id='13' siblings='10,22'/>
<cpu id='23' socket_id='0' core_id='14' siblings='11,23'/>
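
For anyone replicating this with pinning, the usual approach is to give
the guest one core (two SMT threads) per host core while respecting the
sibling pairs above; a sketch covering only the first CCX - the
vcpu-to-cpuset mapping is an assumed example, not from this thread:

  <cputune>
    <!-- guest vCPUs 0+1 = one core, pinned to host core 0 (threads 0,12) -->
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='12'/>
    <vcpupin vcpu='2' cpuset='1'/>
    <vcpupin vcpu='3' cpuset='13'/>
    <vcpupin vcpu='4' cpuset='2'/>
    <vcpupin vcpu='5' cpuset='14'/>
  </cputune>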

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1856335

Title:
  Cache Layout wrong on many Zen Arch CPUs

Status in QEMU:
  New

Bug description:
  AMD CPUs have L3 cache per 2, 3 or 4 cores. Currently, TOPOEXT seems
  to always map the cache as if it were a 4-core-per-CCX CPU, which is
  incorrect and costs upwards of 30% performance (more realistically
  10%) in L3-cache-layout-aware applications.
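
  (Inside a Linux guest, the L3 grouping the guest actually sees can be
  checked from sysfs; assuming index3 is the L3 cache, as it is on
  these CPUs:

    grep . /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list

  each group of CPUs reported as sharing one L3 should correspond to
  one CCX.)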

  Example on a 4-CCX CPU (1950X w/ 8 cores and no SMT):

    <cpu mode='custom' match='exact' check='full'>
      <model fallback='forbid'>EPYC-IBPB</model>
      <vendor>AMD</vendor>
      <topology sockets='1' cores='8' threads='1'/>

  In Windows, coreinfo reports correctly:

  ****----  Unified Cache 1, Level 3,    8 MB, Assoc  16, LineSize  64
  ----****  Unified Cache 6, Level 3,    8 MB, Assoc  16, LineSize  64

  On a 3-CCX CPU (3960X w/ 6 cores and no SMT):

    <cpu mode='custom' match='exact' check='full'>
      <model fallback='forbid'>EPYC-IBPB</model>
      <vendor>AMD</vendor>
      <topology sockets='1' cores='6' threads='1'/>

  In Windows, coreinfo reports incorrectly:

  ****--  Unified Cache  1, Level 3,    8 MB, Assoc  16, LineSize  64
  ----**  Unified Cache  6, Level 3,    8 MB, Assoc  16, LineSize  64

  Validated against 3.0, 3.1, 4.1 and 4.2 versions of qemu-kvm.

  With newer QEMU there is a fix (which does behave correctly) using the dies parameter:
   <qemu:arg value='cores=3,threads=1,dies=2,sockets=1'/>
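
  For context, that override only takes effect inside a
  <qemu:commandline> block of a domain that declares the QEMU XML
  namespace; a sketch - the vCPU count of 6 is an assumption matching
  the 3-CCX example above:

   <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
     ...
     <qemu:commandline>
       <qemu:arg value='-smp'/>
       <qemu:arg value='6,cores=3,threads=1,dies=2,sockets=1'/>
     </qemu:commandline>
   </domain>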

  The problem is that the dies are exposed differently than how AMD does
  it natively: they are exposed to Windows as sockets, which means that
  if you are not a business user, you can't ever have a machine with
  more than two CCXs (6 cores), as consumer versions of Windows only
  support two sockets. (Should this be reported as a separate bug?)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1856335/+subscriptions


Thread overview: 42+ messages
2019-12-13 16:56 [Bug 1856335] [NEW] Cache Layout wrong on many Zen Arch CPUs Damir
2019-12-16 10:06 ` [Bug 1856335] " Damir
2019-12-22 10:09 ` Damir
2019-12-22 10:10 ` Damir
2019-12-23 15:41 ` Babu Moger
2020-04-15 20:46 ` Heiko Sieger
2020-04-15 21:34 ` Babu Moger
2020-04-20 22:58 ` Babu Moger
2020-04-26 10:43 ` Heiko Sieger
2020-05-03 18:32 ` Heiko Sieger
2020-05-03 18:38 ` Heiko Sieger
2020-05-05 22:18 ` Babu Moger
2020-05-07 12:06 ` Heiko Sieger
2020-05-07 14:38 ` Babu Moger
2020-05-10 17:47 ` Damir
2020-05-10 20:01 ` Heiko Sieger
2020-05-14 23:31 ` Jan Klos
2020-05-15  2:41 ` Jan Klos
2020-05-15 13:04 ` Jan Klos
2020-05-15 13:41 ` Damir
2020-05-15 17:34 ` Babu Moger
2020-05-17 11:15 ` Jan Klos [this message]
2020-05-17 11:25 ` Jan Klos
2020-05-18 17:32 ` Heiko Sieger
2020-05-18 18:21 ` Babu Moger
2020-05-18 19:19 ` Heiko Sieger
2020-05-19  9:34 ` Jan Klos
2020-05-19 20:35 ` Heiko Sieger
2020-05-20 21:47 ` Heiko Sieger
2020-05-20 23:28 ` Heiko Sieger
2020-05-21 12:45 ` Jan Klos
2020-05-24 10:34 ` Heiko Sieger
2020-05-29  6:31 ` Heiko Sieger
2020-06-12  8:53 ` Jan Klos
2020-07-10 14:41 ` Heiko Sieger
2020-07-10 19:54 ` Jan Klos
2020-07-26 15:11 ` Sanjay Basu
2020-07-26 17:30 ` Heiko Sieger
2020-07-26 22:20 ` Sanjay Basu
2020-07-29  2:23 ` Heiko Sieger
2021-05-02 18:14 ` Thomas Huth
2021-07-02  4:17 ` Launchpad Bug Tracker
