linux-kernel.vger.kernel.org archive mirror
* Performance Regression in Linux Kernel 5.19
@ 2022-09-09 11:46 Manikandan Jagatheesan
  2022-09-09 13:18 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Manikandan Jagatheesan @ 2022-09-09 11:46 UTC (permalink / raw)
  To: peterz, bp, jpoimboe
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, linux-kernel, srivatsa,
	Peter Jonasson, Yiu Cho Lau, Rajender M, Abdul Anshad Azeez,
	Kodeswaran Kumarasamy, Rahul Gopakumar

As part of VMware's performance regression testing for Linux
Kernel upstream releases, we have evaluated the performance
of Linux kernel 5.19 against the 5.18 release and we have 
noticed performance regressions in Linux VMs on ESXi as shown 
below.
- Compute(up to -70%)
- Networking(up to -30%)
- Storage(up to -13%) 
 
After performing the bisect between kernel 5.18 and 5.19, we 
identified the root cause to be the enablement of IBRS mitigation 
for spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs: 
Report Intel retbleed vulnerability").
 
To confirm this, we disabled the above security mitigation
through the kernel boot parameter (spectre_v2=off) in 5.19, re-ran
our tests and confirmed that the performance was on par with
the 5.18 release.
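Which mitigations a guest kernel has enabled can be confirmed from the
standard sysfs vulnerability files; a minimal sketch (the default path
is the standard Linux location, and the directory argument exists only
so the helper can be exercised against test data):

```python
import os

def read_vulnerabilities(path="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability_name: mitigation_status} from sysfs."""
    status = {}
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name)) as f:
            status[name] = f.read().strip()
    return status
```

For example, read_vulnerabilities()["spectre_v2"] reports whether IBRS
(or another mitigation) is active in the running guest.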
 
Performance data and workload details:
=========================
Used Linux VM on ESXi host: Ubuntu 20.04.3
 
ESXi Compute workloads:
----------------------------
Server configs: 112 threads, 4 sockets Skylake with 2TB memory
1. Boot-halt test:
- Configs: Single VM with different CPU and Memory configurations
                 (1vCPU_32gb, 28vCPU_256gb, 56vCPU_512gb, 84vCPU_1024gb
                 & 112vCPU_1433gb)
- Test-desc: Measures the time taken by the Guest to boot up and 
                   shut down itself. We have "shutdown -h now" in 
                   rc.local for Linux. Boothalt time is calculated 
                   using timestamps of the following patterns in vmware.log.
                   * Begin Pattern - " PowerOn"
                   * End Pattern - "VMX exit"
- Boothalt time = Timestamp(End Pattern) - Timestamp(Begin Pattern)
- Highly affected case: lower vCPU configs (1vCPU_32gb, up to -12%)
- Metric: Secs
- Performance data:
      * Immediate before commit: 14.844 secs
      * Intel retbleed/IBRS commit: 16.29 secs (absolute diff ~1.5 secs)
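The boothalt computation described above can be sketched as follows;
the ISO-8601 timestamp prefix and the "|" field separator are
assumptions about the vmware.log layout, not its documented format:

```python
from datetime import datetime

def boothalt_secs(log_lines, begin_pat=" PowerOn", end_pat="VMX exit"):
    """Boothalt time = Timestamp(End Pattern) - Timestamp(Begin Pattern)."""
    def stamp(pat):
        for line in log_lines:
            if pat in line:
                # Assumed line prefix: "2022-09-09T11:46:08.000Z| vmx| ..."
                return datetime.strptime(line.split("|")[0].rstrip("Z"),
                                         "%Y-%m-%dT%H:%M:%S.%f")
        raise ValueError("pattern %r not found" % pat)
    return (stamp(end_pat) - stamp(begin_pat)).total_seconds()
```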
 
2. Kernel Compile test:
- Configs: Single VM with different CPU and Memory configurations
                 (1vCPU_4gb, 28vCPU_64gb, 56vCPU_64gb, 84vCPU_64gb,
                 112vCPU_64gb & 126vCPU_64gb)
- Test-desc: A CPU intensive benchmark. Measures time taken to compile 
                   Linux kernel source (4.9.24).
- Highly affected case: Higher vCPU configs - 112vCPU_64gb (up to -10%)
- Command: make -j 2x$VCPU (i.e. twice the vCPU count). This uses all 
                     the available CPU threads to achieve 100% CPU utilization.
                     Timestamp is recorded in the vmware.log before and after 
                     compiling the source.
                     * Begin Pattern - "VMQARESULT BEGIN"
                     * End Pattern - "VMQARESULT END"
- Metric: Secs
- Performance data:
      * Immediate before commit: 21.316 secs
      * Intel retbleed/IBRS commit: 23.824 secs (absolute diff ~2.5 secs)
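The job count for the compile command above ("2x$VCPU") can be derived
programmatically; a small sketch, where falling back to os.cpu_count()
is an assumption for when the vCPU count is not given explicitly:

```python
import os

def make_command(vcpus=None):
    """Build the kernel-compile command with twice as many jobs as vCPUs."""
    vcpus = vcpus or os.cpu_count()
    return ["make", "-j%d" % (2 * vcpus)]
```

For the 112vCPU_64gb config this yields make -j224, enough jobs to keep
every CPU thread busy.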
 
3. OSbench test:
- Configs: Single VM with 1vCPU_4gb config
- Test-desc: A publicly available collection of benchmarks that aims 
                   to measure the performance of operating system 
                   primitives, such as process and thread creation.
                   (https://www.bitsnbites.eu/benchmarking-os-primitives)
                   git- https://github.com/mbitsnbites/osbench#readme
                   To build the benchmarks, we need a C compiler, meson 
                   and ninja.
- Highly affected case: 1vCPU_4gb (up to -70%)
- Command: To run - ./create_threads 
- Metric: Milliseconds
- Performance data:
   i) create_threads 
      * Immediate before commit: 16.46 msecs
      * Intel retbleed/IBRS commit: 27.97 msecs (absolute diff ~11 msecs)
   ii) create_processes
      * Immediate before commit: 69.03 msecs
      * Intel retbleed/IBRS commit: 83.20 msecs (absolute diff ~14 msecs)
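The headline regression percentages follow directly from the raw
before/after timings; a one-line sketch (higher time means a
regression):

```python
def regression_pct(before, after):
    """Percent increase of `after` over `before` (e.g. elapsed time)."""
    return (after - before) / before * 100.0
```

For create_threads, regression_pct(16.46, 27.97) is roughly 69.9%,
which lines up with the reported up-to -70% compute regression.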
 
ESXi Networking workloads:
------------------------------
- Server config: 56 threads 2 sockets Skylake with 192G memory
- Benchmark: Netperf 2.7.0
- Topology: A Linux VM on an ESXi host is connected to a Bare Metal 
                   Linux client using a back-to-back direct connection without 
                   involving a physical switch.
- Test-Desc: We measure bulk data transfer and request/response
                    performance using TCP and UDP protocols.
- Highly affected case: Single VM on 8vCPU with TCP_STREAM RECV
                                     Large packets(256K Socket & 16K Message size) 
                                     up to -30%
- Netperf command: (TCP_STREAM_RECV large packets)
netperf -l 60 -H DestinationIP -p port -t TCP_STREAM -- -s 256K 
-S 256K -m 16K -M 16K
Linux VM on the ESXi host acts as the RECEIVER and the Bare Metal 
Linux host acts as the SENDER. 
We initiate netperf from Bare Metal Client Linux host and start 
netserver from Linux VM on the ESXi host with 16 parallel netperf 
streams. 
- Metrics: TCP_STREAM(Cpu/Gbits, Gbps), UDP_STREAM(Kilo packets per
                second), TCP_RR(ResponseTime in microseconds)
TCP_STREAM_Throughput - Capture Throughput from netperf output file.
TCP_STREAM_CPU - Capture CPU/Gbits: the total CPU time spent across
                                 all threads in the given duration, divided by the
                                 respective throughput in Gbps.
UDP_STREAM Msgs - Capture from netstats & netperf out files.
TCP_RR RespTimeMean - Capture output from netperf out file.
- NIC Model used: Intel(R) Ethernet Controller XL710 for 40GbE QSFP+
- Performance data:
      * Immediate before commit: 11.932 Gbps
      * Intel retbleed/IBRS commit: 8.56 Gbps (~3.5 Gbps of throughput drop)
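The 16 parallel streams described above can be expressed as a list of
netperf invocations; this is a sketch, and the per-stream port
numbering (base_port + i) is an assumption about how the parallel
streams were separated on the test bed:

```python
def netperf_cmds(dest_ip, base_port, streams=16):
    """Build the 16 parallel TCP_STREAM netperf command lines."""
    return [
        ["netperf", "-l", "60", "-H", dest_ip, "-p", str(base_port + i),
         "-t", "TCP_STREAM", "--",
         "-s", "256K", "-S", "256K", "-m", "16K", "-M", "16K"]
        for i in range(streams)
    ]
```

Each command carries the 256K socket and 16K message sizes from the
highly affected large-packet case.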
 
ESXi Storage workloads:
--------------------------
- Server config: 56 threads 2 sockets Skylake with 192G memory
- Benchmark: FIO v3.20
- Test-Desc: We measure how many read/write I/O operations can be
                    performed in a given period of time, the average time it
                    takes to complete the I/O and the total CPU cycles
                    spent.
- I/O  Block size: 4KiB, 64KiB & 256KiB
- Read write Ratio: 100% read, 100% write & 70/30 mixed readwrite
- Access Patterns: Random & Sequential
- # of VMs: Single VM (1VM_8vCPU) & Multi VMs(16VM_4vCPU)
- Devices under test: Local device and SAN
- Local device: Local NVMe (Intel Corporation DC P3700 SSD)
- SAN connected: QLogic QLE2692 FC-16G (connected to DELL EMC
                             PowerStore 5000T array)
- Highly affected case: 1VM-cpucost_64K_seq_7030readwrite (up to -13%)
- Throughput and latency tests are not affected.
- Command: fio --name=fio-test --ioengine=libaio --iodepth=16 --rw=rw 
         --rwmixread=70 --rwmixwrite=30 --bs=65536 --thread --direct=1 
         --numjobs=8 --group_reporting=1 --time_based --runtime=180 
         --filename=/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:
         /dev/sdg:/dev/sdh:/dev/sdi --significant_figures=10
- Metrics: Throughput (IOPS), Latency (milliseconds) and Cpucost
                (CPIO - cycles per I/O).
                The CPIO tool (internal) is implemented as a simple
                Python script that uses a processor's performance counters
                to arrive at the CPU cycles used in a given duration.
- Command: python3 /usr/lib/vmware/cpio/cpio.pyc -i 25 -n 5 -D all 
                    -v -d -o outputDir
                     here, 25 is the interval of collection,
                     5 is the number of intervals, and
                     "all" selects the devices for which we intend to collect data.
- Topology: A standalone server(ESXi image) with local NVMe disks and
                   FC-16G HBA is connected to a “DELL EMC PowerStore 5000T”
                   array for Storage I/O performance measurements.
- Performance data:
      * Immediate before commit: 269928 cycles/io
      * Intel retbleed/IBRS commit: 303937 cycles/io (absolute 
                                                      diff 34009 cycles/io)
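The CPIO metric above (CPU cycles accumulated over the collection
intervals divided by the I/Os completed) can be sketched as follows;
the per-interval (cycles, ios) sample format is an assumption for
illustration, not how the internal cpio.pyc actually works:

```python
def cycles_per_io(samples):
    """samples: iterable of (cycles, ios) pairs, one per interval."""
    total_cycles = sum(c for c, _ in samples)
    total_ios = sum(n for _, n in samples)
    return total_cycles / total_ios
```

Aggregating across intervals before dividing avoids biasing the metric
toward intervals with few I/Os.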
 
We believe these findings will be useful to the Linux community and
wanted to document them here.


Manikandan Jagatheesan
Performance Engineering
VMware, Inc.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Performance Regression in Linux Kernel 5.19
  2022-09-09 11:46 Performance Regression in Linux Kernel 5.19 Manikandan Jagatheesan
@ 2022-09-09 13:18 ` Peter Zijlstra
  2022-09-09 21:22 ` David Laight
  2022-09-10  7:52 ` Borislav Petkov
  2 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2022-09-09 13:18 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: bp, jpoimboe, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

On Fri, Sep 09, 2022 at 11:46:08AM +0000, Manikandan Jagatheesan wrote:
> As part of VMware's performance regression testing for Linux
> Kernel upstream releases, we have evaluated the performance
> of Linux kernel 5.19 against the 5.18 release and we have 
> noticed performance regressions in Linux VMs on ESXi as shown 
> below.
> - Compute(up to -70%)
> - Networking(up to -30%)
> - Storage(up to -13%) 
>  
> After performing the bisect between kernel 5.18 and 5.19, we 
> identified the root cause to be the enablement of IBRS mitigation 
> for spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs: 
> Report Intel retbleed vulnerability").
>  
> To confirm this, we have disabled the above security mitigation
> through kernel boot parameter(spectre_v2=off) in 5.19 and re-ran
> our tests & confirmed that the performance was on-par with 
> 5.18 release. 

Well, duh.. :-)


* RE: Performance Regression in Linux Kernel 5.19
  2022-09-09 11:46 Performance Regression in Linux Kernel 5.19 Manikandan Jagatheesan
  2022-09-09 13:18 ` Peter Zijlstra
@ 2022-09-09 21:22 ` David Laight
  2022-09-10  7:52 ` Borislav Petkov
  2 siblings, 0 replies; 8+ messages in thread
From: David Laight @ 2022-09-09 21:22 UTC (permalink / raw)
  To: 'Manikandan Jagatheesan', peterz, bp, jpoimboe
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, linux-kernel, srivatsa,
	Peter Jonasson, Yiu Cho Lau, Rajender M, Abdul Anshad Azeez,
	Kodeswaran Kumarasamy, Rahul Gopakumar

From: Manikandan Jagatheesan
> Sent: 09 September 2022 12:46
> 
> As part of VMware's performance regression testing for Linux
> Kernel upstream releases, we have evaluated the performance
> of Linux kernel 5.19 against the 5.18 release and we have
> noticed performance regressions in Linux VMs on ESXi as shown
> below.
> - Compute(up to -70%)
> - Networking(up to -30%)
> - Storage(up to -13%)
> 
> After performing the bisect between kernel 5.18 and 5.19, we
> identified the root cause to be the enablement of IBRS mitigation
> for spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs:
> Report Intel retbleed vulnerability").

As a matter of interest how much faster does it go if you
boot with all mitigations disabled and compile without
retpolines and without page table separation?

There are plenty of semi-embedded systems (even running on x86)
where there are a limited set of binaries, it is difficult to
add new binaries, and everything basically runs as root.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: Performance Regression in Linux Kernel 5.19
  2022-09-09 11:46 Performance Regression in Linux Kernel 5.19 Manikandan Jagatheesan
  2022-09-09 13:18 ` Peter Zijlstra
  2022-09-09 21:22 ` David Laight
@ 2022-09-10  7:52 ` Borislav Petkov
  2022-09-12 10:58   ` Borislav Petkov
  2 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2022-09-10  7:52 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: peterz, jpoimboe, tglx, mingo, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

On Fri, Sep 09, 2022 at 11:46:08AM +0000, Manikandan Jagatheesan wrote:
> After performing the bisect between kernel 5.18 and 5.19, we 
> identified the root cause to be the enablement of IBRS mitigation 
> for spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs: 
> Report Intel retbleed vulnerability").

What I'm wondering about is why does the guest enable IBRS when booting
on your HV?

I'm guessing you're exposing SPEC_CTRL and all the feature flags so that
the detection in spectre_v2_select_mitigation(), the SPECTRE_V2_CMD_AUTO
case, hits.

But then, why are you emulating a CPU which is vulnerable to retbleed?

Because as far as the guest is concerned, filling the RSB on VMEXIT
should be good enough and the guest doesn't have to do anything else.

IOW, X86_BUG_RETBLEED should not be set on the guest booting on your HV.

Hmmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: Performance Regression in Linux Kernel 5.19
  2022-09-10  7:52 ` Borislav Petkov
@ 2022-09-12 10:58   ` Borislav Petkov
  2022-09-13  8:40     ` Manikandan Jagatheesan
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2022-09-12 10:58 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: peterz, jpoimboe, tglx, mingo, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

A couple more notes after talking to tglx:

So this works as expected. The threat model where the guest needs
to protect itself from malicious userspace is there so if the guest
emulates a CPU which is affected by retbleed and the hypervisor exposes
SPEC_CTRL, then the guest *should* enable IBRS to flush the RSB.

It is a lot nastier if the guest emulates a CPU which is *not* affected
by retbleed but the host uarch is - then the guest will be vulnerable
and it would not even warn about it! So people should be careful what
they do there.

In addition, if the guest trusts its userspace, it might disable IBRS
in order not to suffer the penalty but that's left to the guest owner.
The default setting has to be secure.

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: Performance Regression in Linux Kernel 5.19
  2022-09-12 10:58   ` Borislav Petkov
@ 2022-09-13  8:40     ` Manikandan Jagatheesan
  2022-09-13 10:27       ` Boris Petkov
  2022-09-13 11:20       ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Manikandan Jagatheesan @ 2022-09-13  8:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: peterz, jpoimboe, tglx, mingo, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

Thank you for the responses,

The underlying host CPU architecture is Skylake, and we are using
default settings at both Hypervisor and Guest (kernel) level.

Are there any ongoing activities to optimize the performance?

We would be happy to validate the patch (if any) from a performance 
point of view and share the results.

Regards,
Manikandan




* Re: Performance Regression in Linux Kernel 5.19
  2022-09-13  8:40     ` Manikandan Jagatheesan
@ 2022-09-13 10:27       ` Boris Petkov
  2022-09-13 11:20       ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Boris Petkov @ 2022-09-13 10:27 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: peterz, jpoimboe, tglx, mingo, Kodeswaran Kumarasamy,
	dave.hansen, linux-kernel, Abdul Anshad Azeez, srivatsa,
	Peter Jonasson, hpa, x86, Yiu Cho Lau, Rajender M,
	Rahul Gopakumar

On September 13, 2022 8:40:49 AM UTC, Manikandan Jagatheesan <mjagatheesan@vmware.com> wrote:
>Are there are any ongoing activities to optimize the performance?

What do you think is there to optimize here? Might wanna read my reply again...

-- 
Sent from a small device: formatting sux and brevity is inevitable.


* Re: Performance Regression in Linux Kernel 5.19
  2022-09-13  8:40     ` Manikandan Jagatheesan
  2022-09-13 10:27       ` Boris Petkov
@ 2022-09-13 11:20       ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2022-09-13 11:20 UTC (permalink / raw)
  To: Manikandan Jagatheesan
  Cc: Borislav Petkov, jpoimboe, tglx, mingo, dave.hansen, x86, hpa,
	linux-kernel, srivatsa, Peter Jonasson, Yiu Cho Lau, Rajender M,
	Abdul Anshad Azeez, Kodeswaran Kumarasamy, Rahul Gopakumar

On Tue, Sep 13, 2022 at 08:40:49AM +0000, Manikandan Jagatheesan wrote:

> Are there are any ongoing activities to optimize the performance?

https://lkml.kernel.org/r/20220902130625.217071627@infradead.org

I should post a new version soonish:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git call-depth-tracking

boot with:

  spectre_v2=retpoline retbleed=stuff


end of thread, other threads:[~2022-09-13 11:21 UTC | newest]

