QEMU-Devel Archive on lore.kernel.org
From: "Emilio G. Cota" <cota@braap.org>
To: Robert Foley <robert.foley@linaro.org>
Cc: "Richard Henderson" <richard.henderson@linaro.org>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"QEMU Developers" <qemu-devel@nongnu.org>,
	"Peter Puhov" <peter.puhov@linaro.org>
Subject: Re: [PATCH v8 74/74] cputlb: queue async flush jobs without the BQL
Date: Wed, 20 May 2020 00:46:13 -0400
Message-ID: <20200520044613.GA359481@sff> (raw)
In-Reply-To: <CAEyhzFuiDWYvu3FZNYy5M0FQ91Cs=-4=kV80xQZHEWX+ejhyTw@mail.gmail.com>

On Mon, May 18, 2020 at 09:46:36 -0400, Robert Foley wrote:
> We re-ran the numbers with the latest re-based series.
> 
> We used an aarch64 ubuntu VM image with a host CPU:
> Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 2 CPUs, 10 cores/CPU,
> 20 Threads/CPU.  40 cores total.
> 
> For the bare hardware and kvm tests (first chart) the host CPU was:
> HiSilicon 1620 CPU 2600 Mhz,  2 CPUs, 64 Cores per CPU, 128 CPUs total.
> 
> First, we ran a test of building the kernel in the VM.
> We did not see any major improvements nor major regressions.
> We show the results of the Speedup of building the kernel
> on bare hardware compared with kvm and QEMU (both the baseline and cpu locks).
> 
> 
>                    Speedup vs a single thread for kernel build
> 
>   40 +----------------------------------------------------------------------+
>      |         +         +         +          +         +         +  **     |
>      |                                                bare hardwar********* |
>      |                                                          kvm ####### |
>   35 |-+                                                   baseline $$$$$$$-|
>      |                                                    *cpu lock %%%%%%% |
>      |                                                 ***                  |
>      |                                               **                     |
>   30 |-+                                          ***                     +-|
>      |                                         ***                          |
>      |                                      ***                             |
>      |                                    **                                |
>   25 |-+                               ***                                +-|
>      |                              ***                                     |
>      |                            **                                        |
>      |                          **                                          |
>   20 |-+                      **                                          +-|
>      |                      **                                #########     |
>      |                    **                  ################              |
>      |                  **          ##########                              |
>      |                **         ###                                        |
>   15 |-+             *       ####                                         +-|
>      |             **     ###                                               |
>      |            *    ###                                                  |
>      |           *  ###                                                     |
>   10 |-+       **###                                                      +-|
>      |        *##                                                           |
>      |       ##  $$$$$$$$$$$$$$$$                                           |
>      |     #$$$$$%%%%%%%%%%%%%%%%%%%%                                       |
>    5 |-+  $%%%%%%                    %%%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%    +-|
>      |   %%                                                           %     |
>      | %%                                                                   |
>      |%        +         +         +          +         +         +         |
>    0 +----------------------------------------------------------------------+
>      0         10        20        30         40        50        60        70
>                                    Guest vCPUs
> 
> 
> After seeing these results and the scaling limits inherent in the build itself,
> we decided to run a test which might show the scaling improvements clearer.

Thanks for doing these tests. I know from experience that benchmarking
is hard and incredibly time-consuming, so please do not be discouraged by
my comments below.

A couple of points:

1. I am not familiar with aarch64 KVM but I'd expect it to scale almost
like the native run. Are you assigning enough RAM to the guest? Also,
it can help to run the kernel build in a ramfs in the guest.

2. The build itself does not seem to impose a scaling limit, since
it scales very well when run natively (I presume aarch64 TCG is still
slower per thread than native, even when TCG runs on a faster x86 machine).
The limit here is probably aarch64 TCG. In particular, last time I
checked, aarch64 TCG had room for improvement scalability-wise in handling
interrupts and some TLB operations; this is likely to explain why we
see no benefit with per-CPU locks, i.e. the bottleneck is elsewhere.
This can be confirmed with the sync profiler.
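
As a rough sketch of what such a check could look like: the helper below
only prints the HMP commands for one profiling window (reset, enable, wait,
dump, disable). It assumes a QEMU monitor is reachable; the function name
qsp_window and the socket path in the usage comment are made up for
illustration, and the script appended below starts QEMU with -nodefaults
and no monitor, so a -monitor option would have to be added first.

```shell
#!/bin/bash
# Hypothetical helper: print the HMP commands for one sync-profiler window.
#   $1: length of the window in seconds
#   $2: maximum number of entries to report (defaults to 10)
qsp_window () {
    local window=$1 max=${2:-10}
    echo "sync-profile reset"
    echo "sync-profile on"
    sleep "$window"
    echo "info sync-profile $max"
    echo "sync-profile off"
}

# Possible usage, assuming QEMU was started with
#   -monitor unix:/tmp/qemu-mon.sock,server,nowait
# (the socket path is an assumption) and that socat is installed:
#   qsp_window 60 | socat - UNIX-CONNECT:/tmp/qemu-mon.sock
```

The entries with the largest total wait time in the "info sync-profile"
output would then point at the locks worth looking into.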

IIRC I originally used ppc64 for this test because ppc64 TCG does not
have any other big bottlenecks scalability-wise. I just checked but
unfortunately I can't find the ppc64 image I used :( What I can offer
is the script I used to run these benchmarks; it is appended below.

Thanks,
		Emilio

---
#!/bin/bash

set -eu

# path to host files
MYHOME=/local/home/cota/src

# QEMU installations (one subdirectory per TAG, see the loop below)
QEMU_INST_PATH=$MYHOME/qemu-inst
# guest image
IMG=$MYHOME/qemu/img/ppc64/ubuntu.qcow2

ARCH=ppc64
COMMON_ARGS="-M pseries -nodefaults \
		-hda $IMG -nographic -serial stdio \
		-net nic -net user,hostfwd=tcp::2222-:22 \
		-m 48G"

# path to this script's directory, where .txt output will be copied
# from the guest.
QELT=$MYHOME/qelt
HOST_PATH=$QELT/fig/kcomp

# The guest must be able to SSH to the HOST without entering a password.
# I set this up by creating a passwordless SSH key for the guest's root
# user and copying its public key to the host.
# I used the root user because, on boot-up, the guest runs (as root) a
# script that scp's run-guest.sh (see below) from the host and then executes it.
# This is done via a tiny script in the guest, invoked from systemd once
# boot-up has completed.
HOST=foo@bar.edu

# This is a script in the host that generates an appropriate cpumask,
# so that cores in the same socket are used if possible.
# See https://github.com/cota/cputopology-perl
CPUTOPO=$MYHOME/cputopology-perl

# For each run we create this file, which the guest will then scp
# and execute. It is a quick and dirty way of passing arguments to the guest.
create_file () {
    TAG=$1
    CORES=$2
    NAME=$ARCH.$TAG-$CORES.txt

    echo '#!/bin/bash' > run-guest.sh
    echo 'cp -r /home/cota/linux-4.18-rc7 /tmp2/linux' >> run-guest.sh
    echo "cd /tmp2/linux" >> run-guest.sh
    echo "{ time make -j $CORES vmlinux >/dev/null; } 2>>/home/cota/$NAME" >> run-guest.sh
    # The output with the execution time is then sent to the host over SSH.
    echo "ssh $HOST 'cat >> $HOST_PATH/$NAME' < /home/cota/$NAME" >> run-guest.sh
    echo "poweroff" >> run-guest.sh
}

# Adjust THREADS here, as well as the TAGs that point to different QEMU installations.
for THREADS in 64 32 16; do
    for TAG in cpu-exclusive-work cputlb-no-bql per-cpu-lock cpu-has-work baseline; do
        QEMU=$QEMU_INST_PATH/$TAG/bin/qemu-system-$ARCH
        CPUMASK=$($CPUTOPO/list.pl --policy=compact-smt $THREADS)

        create_file $TAG $THREADS
        # COMMON_ARGS is left unquoted on purpose so that it splits into separate arguments.
        time taskset -c "$CPUMASK" "$QEMU" $COMMON_ARGS -smp $THREADS
    done
done
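
To turn the collected .txt files into speedup numbers like the ones in the
chart, something along these lines might work. This is a minimal sketch, not
part of the script above: the function names real_seconds and speedup are made
up, and it assumes the files contain the "real XmY.YYYs" lines that bash's
time keyword prints. Note that the loop above only runs 64, 32 and 16 threads,
so a single-thread run (for example by adding 1 to the THREADS list) would be
needed as the reference.

```shell
#!/bin/bash
# Hypothetical helpers for post-processing the timing files.

# Read bash `time` output on stdin and print each "real" entry in seconds.
# Other lines (user/sys times, compiler warnings) are ignored.
real_seconds () {
    awk '$1 == "real" { split($2, t, /[ms]/); printf "%.3f\n", t[1] * 60 + t[2] }'
}

# Print the speedup of a run relative to a reference run.
#   $1: reference time in seconds (e.g. the single-thread build)
#   $2: time in seconds of the run being compared
speedup () {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.2f\n", b / r }'
}

# Possible usage; the file names follow the $ARCH.$TAG-$CORES.txt pattern,
# and the -1 file assumes a single-thread run was added:
#   base=$(real_seconds < "$HOST_PATH/ppc64.baseline-1.txt" | tail -n 1)
#   run=$(real_seconds < "$HOST_PATH/ppc64.baseline-64.txt" | tail -n 1)
#   speedup "$base" "$run"
```

Since the guest appends to the .txt files, tail -n 1 picks the most recent
run when a file holds more than one.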


