From: "Emilio G. Cota" <cota@braap.org>
To: qemu-devel@nongnu.org
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Peter Crosthwaite <crosthwaite.peter@gmail.com>,
Richard Henderson <rth@twiddle.net>
Subject: [Qemu-devel] [RFC v3 56/56] cputlb: queue async flush jobs without the BQL
Date: Thu, 18 Oct 2018 21:06:25 -0400
Message-ID: <20181019010625.25294-57-cota@braap.org>
In-Reply-To: <20181019010625.25294-1-cota@braap.org>
This yields sizable scalability improvements, as the results below show.
Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)
Workload: Ubuntu 18.04 ppc64 compiling the linux kernel with
"make -j N", where N is the number of cores in the guest.
Speedup vs a single thread (higher is better):
14 +---------------------------------------------------------------+
| + + + + + + $$$$$$ + |
| $$$$$ |
| $$$$$$ |
12 |-+ $A$$ +-|
| $$ |
| $$$ |
10 |-+ $$ ##D#####################D +-|
| $$$ #####**B**************** |
| $$####***** ***** |
| A$#***** B |
8 |-+ $$B** +-|
| $$** |
| $** |
6 |-+ $$* +-|
| A** |
| $B |
| $ |
4 |-+ $* +-|
| $ |
| $ |
2 |-+ $ +-|
| $ +cputlb-no-bql $$A$$ |
| A +per-cpu-lock ##D## |
| + + + + + + baseline **B** |
0 +---------------------------------------------------------------+
1 4 8 12 16 20 24 28
Guest vCPUs
png: https://imgur.com/zZRvS7q
Some notes:
- baseline corresponds to the commit before this series.
- per-cpu-lock is the commit that converts the CPU loop to per-cpu locks.
- cputlb-no-bql is this commit.
- I'm using taskset to assign cores to threads, favouring locality whenever
possible but not using SMT. When N=1, I'm using a single host core, which
leads to superlinear speedups (since with more cores the I/O thread can execute
while vCPU threads sleep). In the future I might use N+1 host cores for N
guest cores to avoid this, or perhaps pin guest threads to cores one-by-one.
- Scalability is not good at 64 cores, where the BQL for handling
interrupts dominates. I got this result from another machine (a 64-core one)
that unfortunately is much slower than this 28-core one, so I don't have
the numbers for 1-16 cores. The plot is normalized to the 16-core baseline
performance, and is therefore very ugly :-) https://imgur.com/XyKGkAw
See below for an example of the *huge* amount of waiting on the BQL:
(qemu) info sync-profile
Type Object Call site Wait Time (s) Count Average (us)
----------------------------------------------------------------------------------------------------------
BQL mutex 0x55ba286c9800 accel/tcg/cpu-exec.c:545 2868.85676 14872596 192.90
BQL mutex 0x55ba286c9800 hw/ppc/ppc.c:70 539.58924 3666820 147.15
BQL mutex 0x55ba286c9800 target/ppc/helper_regs.h:105 323.49283 2544959 127.11
mutex [ 2] util/qemu-timer.c:426 181.38420 3666839 49.47
condvar [ 61] cpus.c:1327 136.50872 15379 8876.31
BQL mutex 0x55ba286c9800 accel/tcg/cpu-exec.c:516 86.14785 946301 91.04
condvar 0x55ba286eb6a0 cpus-common.c:196 78.41010 126 622302.35
BQL mutex 0x55ba286c9800 util/main-loop.c:236 28.14795 272940 103.13
mutex [ 64] include/qom/cpu.h:514 17.87662 75139413 0.24
BQL mutex 0x55ba286c9800 target/ppc/translate_init.inc.c:8665 7.04738 36528 192.93
----------------------------------------------------------------------------------------------------------
Single-threaded performance is only very lightly affected. Results
below for debian aarch64 bootup+test for the entire series
on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:
- Before:
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
7269.033478 task-clock (msec) # 0.998 CPUs utilized ( +- 0.06% )
30,659,870,302 cycles # 4.218 GHz ( +- 0.06% )
54,790,540,051 instructions # 1.79 insns per cycle ( +- 0.05% )
9,796,441,380 branches # 1347.695 M/sec ( +- 0.05% )
165,132,201 branch-misses # 1.69% of all branches ( +- 0.12% )
7.287011656 seconds time elapsed ( +- 0.10% )
- After:
7375.924053 task-clock (msec) # 0.998 CPUs utilized ( +- 0.13% )
31,107,548,846 cycles # 4.217 GHz ( +- 0.12% )
55,355,668,947 instructions # 1.78 insns per cycle ( +- 0.05% )
9,929,917,664 branches # 1346.261 M/sec ( +- 0.04% )
166,547,442 branch-misses # 1.68% of all branches ( +- 0.09% )
7.389068145 seconds time elapsed ( +- 0.13% )
That is, a ~1.4% slowdown in elapsed time.
Cc: Peter Crosthwaite <crosthwaite.peter@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
accel/tcg/cputlb.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 353d76d6a5..e3582f2f1d 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -212,7 +212,7 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
CPU_FOREACH(cpu) {
if (cpu != src) {
- async_run_on_cpu(cpu, fn, d);
+ async_run_on_cpu_no_bql(cpu, fn, d);
}
}
}
@@ -280,8 +280,8 @@ void tlb_flush(CPUState *cpu)
if (cpu->created && !qemu_cpu_is_self(cpu)) {
if (atomic_mb_read(&cpu->pending_tlb_flush) != ALL_MMUIDX_BITS) {
atomic_mb_set(&cpu->pending_tlb_flush, ALL_MMUIDX_BITS);
- async_run_on_cpu(cpu, tlb_flush_global_async_work,
- RUN_ON_CPU_NULL);
+ async_run_on_cpu_no_bql(cpu, tlb_flush_global_async_work,
+ RUN_ON_CPU_NULL);
}
} else {
tlb_flush_nocheck(cpu);
@@ -341,8 +341,8 @@ void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
tlb_debug("reduced mmu_idx: 0x%" PRIx16 "\n", pending_flushes);
atomic_or(&cpu->pending_tlb_flush, pending_flushes);
- async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
- RUN_ON_CPU_HOST_INT(pending_flushes));
+ async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work,
+ RUN_ON_CPU_HOST_INT(pending_flushes));
}
} else {
tlb_flush_by_mmuidx_async_work(cpu,
@@ -442,8 +442,8 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
tlb_debug("page :" TARGET_FMT_lx "\n", addr);
if (!qemu_cpu_is_self(cpu)) {
- async_run_on_cpu(cpu, tlb_flush_page_async_work,
- RUN_ON_CPU_TARGET_PTR(addr));
+ async_run_on_cpu_no_bql(cpu, tlb_flush_page_async_work,
+ RUN_ON_CPU_TARGET_PTR(addr));
} else {
tlb_flush_page_async_work(cpu, RUN_ON_CPU_TARGET_PTR(addr));
}
@@ -514,8 +514,9 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
addr_and_mmu_idx |= idxmap;
if (!qemu_cpu_is_self(cpu)) {
- async_run_on_cpu(cpu, tlb_check_page_and_flush_by_mmuidx_async_work,
- RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
+ async_run_on_cpu_no_bql(cpu,
+ tlb_check_page_and_flush_by_mmuidx_async_work,
+ RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
} else {
tlb_check_page_and_flush_by_mmuidx_async_work(
cpu, RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
--
2.17.1