linux-kernel.vger.kernel.org archive mirror
* Performance Regression in Linux Kernel 5.19
@ 2022-09-09 11:46 Manikandan Jagatheesan
  2022-09-09 13:18 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Manikandan Jagatheesan @ 2022-09-09 11:46 UTC (permalink / raw)
  To: peterz, bp, jpoimboe
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, linux-kernel, srivatsa,
	Peter Jonasson, Yiu Cho Lau, Rajender M, Abdul Anshad Azeez,
	Kodeswaran Kumarasamy, Rahul Gopakumar

As part of VMware's performance regression testing for Linux
Kernel upstream releases, we have evaluated the performance
of Linux kernel 5.19 against the 5.18 release and we have 
noticed performance regressions in Linux VMs on ESXi as shown 
below.
- Compute(up to -70%)
- Networking(up to -30%)
- Storage(up to -13%) 
 
After performing the bisect between kernel 5.18 and 5.19, we 
identified the root cause to be the enablement of IBRS mitigation 
for spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs: 
Report Intel retbleed vulnerability").
 
To confirm this, we disabled the above security mitigation
through the kernel boot parameter (spectre_v2=off) in 5.19, re-ran
our tests and confirmed that the performance was on par with
the 5.18 release.
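Which mitigations a guest kernel has enabled can be confirmed from the
standard sysfs vulnerability files; a minimal sketch (the default path
is the standard Linux location, and the directory argument exists only
so the helper can be exercised against test data):

```python
import os

def read_vulnerabilities(path="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability_name: mitigation_status} from sysfs."""
    status = {}
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name)) as f:
            status[name] = f.read().strip()
    return status
```

For example, read_vulnerabilities()["spectre_v2"] reports whether IBRS
(or another mitigation) is active in the running guest.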
 
Performance data and workload details:
=========================
Used Linux VM on ESXi host: Ubuntu 20.04.3
 
ESXi Compute workloads:
----------------------------
Server configs: 112 threads, 4 sockets Skylake with 2TB memory
1. Boot-halt test:
- Configs: Single VM with different CPU and Memory configurations
                 (1vCPU_32gb, 28vCPU_256gb, 56vCPU_512gb, 84vCPU_1024gb
                 & 112vCPU_1433gb)
- Test-desc: Measures the time taken by the Guest to boot up and 
                   shut down itself. We have "shutdown -h now" in 
                   rc.local for Linux. Boothalt time is calculated 
                   using timestamps of the following patterns in vmware.log.
                   * Begin Pattern - " PowerOn"
                   * End Pattern - "VMX exit"
- Boothalt time = Timestamp(End Pattern) - Timestamp(Begin Pattern)
- Highly affected case: lower vCPU configs (1vCPU_32gb, up to -12%)
- Metric: Secs
- Performance data:
      * Immediate before commit: 14.844 secs
      * Intel retbleed/IBRS commit: 16.29 secs (absolute diff ~1.5 secs)
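The boothalt computation described above can be sketched as follows;
the ISO-8601 timestamp prefix and the "|" field separator are
assumptions about the vmware.log layout, not its documented format:

```python
from datetime import datetime

def boothalt_secs(log_lines, begin_pat=" PowerOn", end_pat="VMX exit"):
    """Boothalt time = Timestamp(End Pattern) - Timestamp(Begin Pattern)."""
    def stamp(pat):
        for line in log_lines:
            if pat in line:
                # Assumed line prefix: "2022-09-09T11:46:08.000Z| vmx| ..."
                return datetime.strptime(line.split("|")[0].rstrip("Z"),
                                         "%Y-%m-%dT%H:%M:%S.%f")
        raise ValueError("pattern %r not found" % pat)
    return (stamp(end_pat) - stamp(begin_pat)).total_seconds()
```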
 
2. Kernel Compile test:
- Configs: Single VM with different CPU and Memory configurations
                 (1vCPU_4gb, 28vCPU_64gb, 56vCPU_64gb, 84vCPU_64gb,
                 112vCPU_64gb & 126vCPU_64gb)
- Test-desc: A CPU intensive benchmark. Measures time taken to compile 
                   Linux kernel source (4.9.24).
- Highly affected case: Higher vCPU configs - 112vCPU_64gb (up to -10%)
- Command: make -j 2x$VCPU (i.e. twice the vCPU count). This uses all 
                     the available CPU threads to achieve 100% CPU utilization.
                     Timestamp is recorded in the vmware.log before and after 
                     compiling the source.
                     * Begin Pattern - "VMQARESULT BEGIN"
                     * End Pattern - "VMQARESULT END"
- Metric: Secs
- Performance data:
      * Immediate before commit: 21.316 secs
      * Intel retbleed/IBRS commit: 23.824 secs (absolute diff ~2.5 secs)
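The job count for the compile command above ("2x$VCPU") can be derived
programmatically; a small sketch, where falling back to os.cpu_count()
is an assumption for when the vCPU count is not given explicitly:

```python
import os

def make_command(vcpus=None):
    """Build the kernel-compile command with twice as many jobs as vCPUs."""
    vcpus = vcpus or os.cpu_count()
    return ["make", "-j%d" % (2 * vcpus)]
```

For the 112vCPU_64gb config this yields make -j224, enough jobs to keep
every CPU thread busy.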
 
3. OSbench test:
- Configs: Single VM with 1vCPU_4gb config
- Test-desc: A publicly available collection of benchmarks that aims 
                   to measure the performance of operating system 
                   primitives, such as process and thread creation.
                   (https://www.bitsnbites.eu/benchmarking-os-primitives)
                   git- https://github.com/mbitsnbites/osbench#readme
                   To build the benchmarks, we need a C compiler, meson 
                   and ninja.
- Highly affected case: 1vCPU_4gb (up to -70%)
- Command: To run - ./create_threads 
- Metric: Milliseconds
- Performance data:
   i) create_threads 
      * Immediate before commit: 16.46 msecs
      * Intel retbleed/IBRS commit: 27.97 msecs (absolute diff ~11 msecs)
   ii) create_processes
      * Immediate before commit: 69.03 msecs
      * Intel retbleed/IBRS commit: 83.20 msecs (absolute diff ~14 msecs)
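The headline regression percentages follow directly from the raw
before/after timings; a one-line sketch (higher time means a
regression):

```python
def regression_pct(before, after):
    """Percent increase of `after` over `before` (e.g. elapsed time)."""
    return (after - before) / before * 100.0
```

For create_threads, regression_pct(16.46, 27.97) is roughly 69.9%,
which lines up with the reported up-to -70% compute regression.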
 
ESXi Networking workloads:
------------------------------
- Server config: 56 threads 2 sockets Skylake with 192G memory
- Benchmark: Netperf 2.7.0
- Topology: A Linux VM on an ESXi host is connected to a Bare Metal 
                   Linux client using a back-to-back direct connection without 
                   involving a physical switch.
- Test-Desc: We measure bulk data transfer and request/response
                    performance using TCP and UDP protocols.
- Highly affected case: Single VM on 8vCPU with TCP_STREAM RECV
                                     Large packets(256K Socket & 16K Message size) 
                                     up to -30%
- Netperf command: (TCP_STREAM_RECV large packets)
netperf -l 60 -H DestinationIP -p port -t TCP_STREAM -- -s 256K 
-S 256K -m 16K -M 16K
Linux VM on the ESXi host acts as the RECEIVER and the Bare Metal 
Linux host acts as the SENDER. 
We initiate netperf from Bare Metal Client Linux host and start 
netserver from Linux VM on the ESXi host with 16 parallel netperf 
streams. 
- Metrics: TCP_STREAM(Cpu/Gbits, Gbps), UDP_STREAM(Kilo packets per
                second), TCP_RR(ResponseTime in microseconds)
TCP_STREAM_Throughput - Capture Throughput from netperf output file.
TCP_STREAM_CPU - Capture CPU/Gbits: the total CPU time spent across
                                 all threads in the given duration, divided by the
                                 respective throughput in Gbps.
UDP_STREAM Msgs - Capture from netstats & netperf out files.
TCP_RR RespTimeMean - Capture output from netperf out file.
- NIC Model used: Intel(R) Ethernet Controller XL710 for 40GbE QSFP+
- Performance data:
      * Immediate before commit: 11.932 Gbps
      * Intel retbleed/IBRS commit: 8.56 Gbps (~3.5 Gbps of throughput drop)
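The 16 parallel streams described above can be expressed as a list of
netperf invocations; this is a sketch, and the per-stream port
numbering (base_port + i) is an assumption about how the parallel
streams were separated on the test bed:

```python
def netperf_cmds(dest_ip, base_port, streams=16):
    """Build the 16 parallel TCP_STREAM netperf command lines."""
    return [
        ["netperf", "-l", "60", "-H", dest_ip, "-p", str(base_port + i),
         "-t", "TCP_STREAM", "--",
         "-s", "256K", "-S", "256K", "-m", "16K", "-M", "16K"]
        for i in range(streams)
    ]
```

Each command carries the 256K socket and 16K message sizes from the
highly affected large-packet case.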
 
ESXi Storage workloads:
--------------------------
- Server config: 56 threads 2 sockets Skylake with 192G memory
- Benchmark: FIO v3.20
- Test-Desc: We measure how many read/write I/O operations can be
                    performed in a given period of time, the average time it
                    takes to complete the I/O and the total CPU cycles
                    spent.
- I/O  Block size: 4KiB, 64KiB & 256KiB
- Read write Ratio: 100% read, 100% write & 70/30 mixed readwrite
- Access Patterns: Random & Sequential
- # of VMs: Single VM (1VM_8vCPU) & Multi VMs(16VM_4vCPU)
- Devices under test: Local device and SAN
- Local device: Local NVMe (Intel Corporation DC P3700 SSD)
- SAN connected: QLogic QLE2692 FC-16G (connected to DELL EMC
                             PowerStore 5000T array)
- Highly affected case: 1VM-cpucost_64K_seq_7030readwrite (up to -13%)
- Throughput and latency tests are not affected.
- Command: fio --name=fio-test --ioengine=libaio --iodepth=16 --rw=rw 
         --rwmixread=70 --rwmixwrite=30 --bs=65536 --thread --direct=1 
         --numjobs=8 --group_reporting=1 --time_based --runtime=180 
         --filename=/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:
         /dev/sdg:/dev/sdh:/dev/sdi --significant_figures=10
- Metrics: Throughput (IOPS), Latency (milliseconds) and Cpucost
                (CPIO - cycles per I/O).
                The CPIO tool (internal) is implemented as a simple
                Python script that uses a processor's performance counters
                to arrive at the CPU cycles used in a given duration.
- Command: python3 /usr/lib/vmware/cpio/cpio.pyc -i 25 -n 5 -D all 
                    -v -d -o outputDir
                     here, 25 is the interval of collection,
                     5 is the number of intervals, and
                     "all" selects the devices for which we intend to collect data.
- Topology: A standalone server(ESXi image) with local NVMe disks and
                   FC-16G HBA is connected to a “DELL EMC PowerStore 5000T”
                   array for Storage I/O performance measurements.
- Performance data:
      * Immediate before commit: 269928 cycles/io
      * Intel retbleed/IBRS commit: 303937 cycles/io (absolute 
                                                      diff 34009 cycles/io)
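The CPIO metric above (CPU cycles accumulated over the collection
intervals divided by the I/Os completed) can be sketched as follows;
the per-interval (cycles, ios) sample format is an assumption for
illustration, not how the internal cpio.pyc actually works:

```python
def cycles_per_io(samples):
    """samples: iterable of (cycles, ios) pairs, one per interval."""
    total_cycles = sum(c for c, _ in samples)
    total_ios = sum(n for _, n in samples)
    return total_cycles / total_ios
```

Aggregating across intervals before dividing avoids biasing the metric
toward intervals with few I/Os.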
 
We believe these findings will be useful to the Linux community and
wanted to document them here.


Manikandan Jagatheesan
Performance Engineering
VMware, Inc.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Performance Regression in Linux Kernel 5.19
  2022-09-09 11:46 Performance Regression in Linux Kernel 5.19 Manikandan Jagatheesan
@ 2022-09-09 13:18 ` Peter Zijlstra
  2022-09-09 21:22 ` David Laight
  2022-09-10  7:52 ` Borislav Petkov
  2 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2022-09-09 13:18 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: bp, jpoimboe, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

On Fri, Sep 09, 2022 at 11:46:08AM +0000, Manikandan Jagatheesan wrote:
> As part of VMware's performance regression testing for Linux
> Kernel upstream releases, we have evaluated the performance
> of Linux kernel 5.19 against the 5.18 release and we have 
> noticed performance regressions in Linux VMs on ESXi as shown 
> below.
> - Compute(up to -70%)
> - Networking(up to -30%)
> - Storage(up to -13%) 
>  
> After performing the bisect between kernel 5.18 and 5.19, we 
> identified the root cause to be the enablement of IBRS mitigation 
> for spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs: 
> Report Intel retbleed vulnerability").
>  
> To confirm this, we have disabled the above security mitigation
> through kernel boot parameter(spectre_v2=off) in 5.19 and re-ran
> our tests & confirmed that the performance was on-par with 
> 5.18 release. 

Well, duh.. :-)


* RE: Performance Regression in Linux Kernel 5.19
  2022-09-09 11:46 Performance Regression in Linux Kernel 5.19 Manikandan Jagatheesan
  2022-09-09 13:18 ` Peter Zijlstra
@ 2022-09-09 21:22 ` David Laight
  2022-09-10  7:52 ` Borislav Petkov
  2 siblings, 0 replies; 8+ messages in thread
From: David Laight @ 2022-09-09 21:22 UTC (permalink / raw)
  To: 'Manikandan Jagatheesan', peterz, bp, jpoimboe
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, linux-kernel, srivatsa,
	Peter Jonasson, Yiu Cho Lau, Rajender M, Abdul Anshad Azeez,
	Kodeswaran Kumarasamy, Rahul Gopakumar

From: Manikandan Jagatheesan
> Sent: 09 September 2022 12:46
> 
> As part of VMware's performance regression testing for Linux
> Kernel upstream releases, we have evaluated the performance
> of Linux kernel 5.19 against the 5.18 release and we have
> noticed performance regressions in Linux VMs on ESXi as shown
> below.
> - Compute(up to -70%)
> - Networking(up to -30%)
> - Storage(up to -13%)
> 
> After performing the bisect between kernel 5.18 and 5.19, we
> identified the root cause to be the enablement of IBRS mitigation
> for spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs:
> Report Intel retbleed vulnerability").

As a matter of interest how much faster does it go if you
boot with all mitigations disabled and compile without
retpolines and without page table separation?

There are plenty of semi-embedded systems (even running on x86)
where there are a limited set of binaries, it is difficult to
add new binaries, and everything basically runs as root.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: Performance Regression in Linux Kernel 5.19
  2022-09-09 11:46 Performance Regression in Linux Kernel 5.19 Manikandan Jagatheesan
  2022-09-09 13:18 ` Peter Zijlstra
  2022-09-09 21:22 ` David Laight
@ 2022-09-10  7:52 ` Borislav Petkov
  2022-09-12 10:58   ` Borislav Petkov
  2 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2022-09-10  7:52 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: peterz, jpoimboe, tglx, mingo, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

On Fri, Sep 09, 2022 at 11:46:08AM +0000, Manikandan Jagatheesan wrote:
> After performing the bisect between kernel 5.18 and 5.19, we 
> identified the root cause to be the enablement of IBRS mitigation 
> for spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs: 
> Report Intel retbleed vulnerability").

What I'm wondering about is why does the guest enable IBRS when booting
on your HV?

I'm guessing you're exposing SPEC_CTRL and all the feature flags so that
the detection in spectre_v2_select_mitigation(), the SPECTRE_V2_CMD_AUTO
case, hits.

But then, why are you emulating a CPU which is vulnerable to retbleed?

Because as far as the guest is concerned, filling the RSB on VMEXIT
should be good enough and the guest doesn't have to do anything else.

IOW, X86_BUG_RETBLEED should not be set on the guest booting on your HV.

Hmmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: Performance Regression in Linux Kernel 5.19
  2022-09-10  7:52 ` Borislav Petkov
@ 2022-09-12 10:58   ` Borislav Petkov
  2022-09-13  8:40     ` Manikandan Jagatheesan
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2022-09-12 10:58 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: peterz, jpoimboe, tglx, mingo, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

A couple more notes after talking to tglx:

So this works as expected. The threat model where the guest needs
to protect itself from malicious userspace is there so if the guest
emulates a CPU which is affected by retbleed and the hypervisor exposes
SPEC_CTRL, then the guest *should* enable IBRS to flush the RSB.

It is a lot nastier if the guest emulates a CPU which is *not* affected
by retbleed but the host uarch is - then the guest will be vulnerable
and it would not even warn about it! So people should be careful what
they do there.

In addition, if the guest trusts its userspace, it might disable IBRS
in order not to suffer the penalty but that's left to the guest owner.
The default setting has to be secure.

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: Performance Regression in Linux Kernel 5.19
  2022-09-12 10:58   ` Borislav Petkov
@ 2022-09-13  8:40     ` Manikandan Jagatheesan
  2022-09-13 10:27       ` Boris Petkov
  2022-09-13 11:20       ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Manikandan Jagatheesan @ 2022-09-13  8:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: peterz, jpoimboe, tglx, mingo, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

Thank you for the responses,

The underlying host CPU architecture is Skylake, and we are using
default settings at both Hypervisor and Guest (kernel) level.

Are there any ongoing activities to optimize the performance?

We would be happy to validate the patch (if any) from a performance 
point of view and share the results.

Regards,
Manikandan




* Re: Performance Regression in Linux Kernel 5.19
  2022-09-13  8:40     ` Manikandan Jagatheesan
@ 2022-09-13 10:27       ` Boris Petkov
  2022-09-13 11:20       ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Boris Petkov @ 2022-09-13 10:27 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: peterz, jpoimboe, tglx, mingo, Kodeswaran Kumarasamy,
	dave.hansen, linux-kernel, Abdul Anshad Azeez, srivatsa,
	Peter Jonasson, hpa, x86, Yiu Cho Lau, Rajender M,
	Rahul Gopakumar

On September 13, 2022 8:40:49 AM UTC, Manikandan Jagatheesan <mjagatheesan@vmware.com> wrote:
>Are there are any ongoing activities to optimize the performance?

What do you think is there to optimize here? Might wanna read my reply again...

-- 
Sent from a small device: formatting sux and brevity is inevitable.


* Re: Performance Regression in Linux Kernel 5.19
  2022-09-13  8:40     ` Manikandan Jagatheesan
  2022-09-13 10:27       ` Boris Petkov
@ 2022-09-13 11:20       ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2022-09-13 11:20 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: Borislav Petkov, jpoimboe, tglx, mingo, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

On Tue, Sep 13, 2022 at 08:40:49AM +0000, Manikandan Jagatheesan wrote:

> Are there are any ongoing activities to optimize the performance?

https://lkml.kernel.org/r/20220902130625.217071627@infradead.org

I should post a new version soonish:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git call-depth-tracking

boot with:

  spectre_v2=retpoline retbleed=stuff


end of thread, other threads:[~2022-09-13 11:21 UTC | newest]

