From: "Emilio G. Cota" <cota@braap.org>
To: qemu-devel@nongnu.org
Cc: "Richard Henderson" <richard.henderson@linaro.org>,
	"Alex Bennée" <alex.bennee@linaro.org>
Subject: [Qemu-devel] [PATCH v7 73/73] cputlb: queue async flush jobs without the BQL
Date: Mon,  4 Mar 2019 13:18:13 -0500
Message-ID: <20190304181813.8075-74-cota@braap.org>
In-Reply-To: <20190304181813.8075-1-cota@braap.org>

This yields sizable scalability improvements, as the results below show.
TLB flush jobs touch only the target vCPU's TLB, which is protected by the
TLB's own lock rather than the BQL, so they can be queued and run without
holding the BQL.

Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)

Workload: Ubuntu 18.04 ppc64 guest compiling the Linux kernel with
"make -j N", where N is the number of vCPUs in the guest.

                      Speedup vs a single thread (higher is better):

         14 +---------------------------------------------------------------+
            |       +    +       +      +       +      +      $$$$$$  +     |
            |                                            $$$$$              |
            |                                      $$$$$$                   |
         12 |-+                                $A$$                       +-|
            |                                $$                             |
            |                             $$$                               |
         10 |-+                         $$    ##D#####################D   +-|
            |                        $$$ #####**B****************           |
            |                      $$####*****                   *****      |
            |                    A$#*****                             B     |
          8 |-+                $$B**                                      +-|
            |                $$**                                           |
            |               $**                                             |
          6 |-+           $$*                                             +-|
            |            A**                                                |
            |           $B                                                  |
            |           $                                                   |
          4 |-+        $*                                                 +-|
            |          $                                                    |
            |         $                                                     |
          2 |-+      $                                                    +-|
            |        $                                 +cputlb-no-bql $$A$$ |
            |       A                                   +per-cpu-lock ##D## |
            |       +    +       +      +       +      +     baseline **B** |
          0 +---------------------------------------------------------------+
                    1    4       8      12      16     20      24     28
                                       Guest vCPUs
  png: https://imgur.com/zZRvS7q

Some notes:
- baseline corresponds to the commit before this series.

- per-cpu-lock is the commit that converts the CPU loop to per-cpu locks.

- cputlb-no-bql is this commit.

- I'm using taskset to assign cores to threads, favouring locality whenever
  possible but not using SMT. When N=1, the single host core runs both the
  vCPU thread and the I/O thread, whereas at higher N the I/O thread can
  execute while vCPU threads sleep; this is what makes the speedups above
  superlinear. In the future I might use N+1 host cores for N guest cores
  to avoid this, or perhaps pin guest threads to cores one-by-one (a minimal
  pinning sketch follows these notes).

Single-threaded performance is only lightly affected. Results below are for
a Debian aarch64 bootup+test of the entire series on an
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:

- Before:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7269.033478      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.06% )
    30,659,870,302      cycles                    #    4.218 GHz                      ( +-  0.06% )
    54,790,540,051      instructions              #    1.79  insns per cycle          ( +-  0.05% )
     9,796,441,380      branches                  # 1347.695 M/sec                    ( +-  0.05% )
       165,132,201      branch-misses             #    1.69% of all branches          ( +-  0.12% )

       7.287011656 seconds time elapsed                                          ( +-  0.10% )

- After:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7375.924053      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.13% )
    31,107,548,846      cycles                    #    4.217 GHz                      ( +-  0.12% )
    55,355,668,947      instructions              #    1.78  insns per cycle          ( +-  0.05% )
     9,929,917,664      branches                  # 1346.261 M/sec                    ( +-  0.04% )
       166,547,442      branch-misses             #    1.68% of all branches          ( +-  0.09% )

       7.389068145 seconds time elapsed                                          ( +-  0.13% )

That is, a ~1.4% slowdown in wall-clock time (7.389 s vs. 7.287 s elapsed).
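
For context, patch 72/73 ("cpu: add async_run_on_cpu_no_bql") introduces the
helper used below. The following is a minimal sketch of the dispatch pattern
it enables: queueing work on a remote vCPU under that vCPU's lock instead of
the BQL. The helper and field names (cpu_mutex_lock/unlock, cpu->work_list,
struct qemu_work_item) are taken from earlier patches in this series, but
the body is illustrative, not the series' verbatim implementation:

    /* Illustrative sketch, not verbatim series code. */
    #include "qemu/osdep.h"
    #include "qom/cpu.h"      /* CPUState, run_on_cpu_func (2019-era path) */

    static void sketch_async_run_on_cpu_no_bql(CPUState *cpu,
                                               run_on_cpu_func func,
                                               run_on_cpu_data data)
    {
        /* work-item layout as in cpus-common.c (assumed) */
        struct qemu_work_item *wi = g_new0(struct qemu_work_item, 1);

        wi->func = func;
        wi->data = data;
        wi->free = true;      /* the vCPU frees the item after running it */

        cpu_mutex_lock(cpu);  /* per-CPU lock (patch 03), not the BQL */
        QSIMPLEQ_INSERT_TAIL(&cpu->work_list, wi, node);
        cpu_mutex_unlock(cpu);

        qemu_cpu_kick(cpu);   /* wake the vCPU so it drains its queue */
    }

This is safe for TLB flush jobs because tlb_flush_*_async_work only touches
the target vCPU's TLB, which is protected by the TLB's own lock; nothing in
those handlers relies on BQL protection.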

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cputlb.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 88cc8389e9..d9e0814b5c 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -260,7 +260,7 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
 
     CPU_FOREACH(cpu) {
         if (cpu != src) {
-            async_run_on_cpu(cpu, fn, d);
+            async_run_on_cpu_no_bql(cpu, fn, d);
         }
     }
 }
@@ -336,8 +336,8 @@ void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
     tlb_debug("mmu_idx: 0x%" PRIx16 "\n", idxmap);
 
     if (cpu->created && !qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
-                         RUN_ON_CPU_HOST_INT(idxmap));
+        async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work,
+                                RUN_ON_CPU_HOST_INT(idxmap));
     } else {
         tlb_flush_by_mmuidx_async_work(cpu, RUN_ON_CPU_HOST_INT(idxmap));
     }
@@ -481,8 +481,8 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
     addr_and_mmu_idx |= idxmap;
 
     if (!qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_work,
-                         RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_work,
+                                RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
     } else {
         tlb_flush_page_by_mmuidx_async_work(
             cpu, RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
-- 
2.17.1

Thread overview: 78+ messages
2019-03-04 18:17 [Qemu-devel] [PATCH v7 00/73] per-CPU locks Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 01/73] cpu: convert queued work to a QSIMPLEQ Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 02/73] cpu: rename cpu->work_mutex to cpu->lock Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 03/73] cpu: introduce cpu_mutex_lock/unlock Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 04/73] cpu: make qemu_work_cond per-cpu Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 05/73] cpu: move run_on_cpu to cpus-common Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 06/73] cpu: introduce process_queued_cpu_work_locked Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 07/73] cpu: make per-CPU locks an alias of the BQL in TCG rr mode Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 08/73] tcg-runtime: define helper_cpu_halted_set Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 09/73] ppc: convert to helper_cpu_halted_set Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 10/73] cris: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 11/73] hppa: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 12/73] m68k: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 13/73] alpha: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 14/73] microblaze: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 15/73] cpu: define cpu_halted helpers Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 16/73] tcg-runtime: convert to cpu_halted_set Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 17/73] arm: convert to cpu_halted Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 18/73] ppc: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 19/73] sh4: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 20/73] i386: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 21/73] lm32: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 22/73] m68k: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 23/73] mips: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 24/73] riscv: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 25/73] s390x: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 26/73] sparc: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 27/73] xtensa: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 28/73] gdbstub: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 29/73] openrisc: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 30/73] cpu-exec: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 31/73] cpu: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 32/73] cpu: define cpu_interrupt_request helpers Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 33/73] ppc: use cpu_reset_interrupt Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 34/73] exec: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 35/73] i386: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 36/73] s390x: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 37/73] openrisc: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 38/73] arm: convert to cpu_interrupt_request Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 39/73] i386: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 40/73] i386/kvm: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 41/73] i386/hax-all: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 42/73] i386/whpx-all: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 43/73] i386/hvf: convert to cpu_request_interrupt Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 44/73] ppc: convert to cpu_interrupt_request Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 45/73] sh4: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 46/73] cris: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 47/73] hppa: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 48/73] lm32: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 49/73] m68k: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 50/73] mips: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 51/73] nios: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 52/73] s390x: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 53/73] alpha: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 54/73] moxie: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 55/73] sparc: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 56/73] openrisc: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 57/73] unicore32: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 58/73] microblaze: " Emilio G. Cota
2019-03-04 18:17 ` [Qemu-devel] [PATCH v7 59/73] accel/tcg: " Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 60/73] cpu: convert to interrupt_request Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 61/73] cpu: call .cpu_has_work with the CPU lock held Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 62/73] cpu: introduce cpu_has_work_with_iothread_lock Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 63/73] ppc: convert to cpu_has_work_with_iothread_lock Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 64/73] mips: " Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 65/73] s390x: " Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 66/73] riscv: " Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 67/73] sparc: " Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 68/73] xtensa: " Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 69/73] cpu: rename all_cpu_threads_idle to qemu_tcg_rr_all_cpu_threads_idle Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 70/73] cpu: protect CPU state with cpu->lock instead of the BQL Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 71/73] cpus-common: release BQL earlier in run_on_cpu Emilio G. Cota
2019-03-04 18:18 ` [Qemu-devel] [PATCH v7 72/73] cpu: add async_run_on_cpu_no_bql Emilio G. Cota
2019-03-04 18:18 ` Emilio G. Cota [this message]
2019-03-05  9:16 ` [Qemu-devel] [PATCH v7 00/73] per-CPU locks Alex Bennée
2019-03-06 19:40 ` Richard Henderson
2019-03-06 19:57   ` Peter Maydell
2019-03-06 23:49   ` Emilio G. Cota
