From: "Emilio G. Cota" <cota@braap.org>
To: qemu-devel@nongnu.org
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Peter Crosthwaite <crosthwaite.peter@gmail.com>,
	Richard Henderson <rth@twiddle.net>
Subject: [Qemu-devel] [RFC v3 56/56] cputlb: queue async flush jobs without the BQL
Date: Thu, 18 Oct 2018 21:06:25 -0400
Message-ID: <20181019010625.25294-57-cota@braap.org>
In-Reply-To: <20181019010625.25294-1-cota@braap.org>

This yields sizable scalability improvements, as the results below show.

Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)

Workload: Ubuntu 18.04 ppc64 compiling the Linux kernel with
"make -j N", where N is the number of cores in the guest.

                      Speedup vs a single thread (higher is better):

         14 +---------------------------------------------------------------+
            |       +    +       +      +       +      +      $$$$$$  +     |
            |                                            $$$$$              |
            |                                      $$$$$$                   |
         12 |-+                                $A$$                       +-|
            |                                $$                             |
            |                             $$$                               |
         10 |-+                         $$    ##D#####################D   +-|
            |                        $$$ #####**B****************           |
            |                      $$####*****                   *****      |
            |                    A$#*****                             B     |
          8 |-+                $$B**                                      +-|
            |                $$**                                           |
            |               $**                                             |
          6 |-+           $$*                                             +-|
            |            A**                                                |
            |           $B                                                  |
            |           $                                                   |
          4 |-+        $*                                                 +-|
            |          $                                                    |
            |         $                                                     |
          2 |-+      $                                                    +-|
            |        $                                 +cputlb-no-bql $$A$$ |
            |       A                                   +per-cpu-lock ##D## |
            |       +    +       +      +       +      +     baseline **B** |
          0 +---------------------------------------------------------------+
                    1    4       8      12      16     20      24     28
                                       Guest vCPUs
  png: https://imgur.com/zZRvS7q

Some notes:
- baseline corresponds to the commit before this series.

- per-cpu-lock is the commit that converts the CPU loop to per-cpu locks.

- cputlb-no-bql is this commit.

- I'm using taskset to assign host cores to QEMU threads, favouring locality
  whenever possible but not using SMT siblings. When N=1 everything runs on a
  single host core, which leads to superlinear speedups at larger N (with more
  host cores the I/O thread can run while vCPU threads sleep, whereas at N=1
  it competes with the lone vCPU thread for the single core). In the future I
  might use N+1 host cores for N guest vCPUs to avoid this, or pin guest
  threads to host cores one by one.

- Scalability is not good at 64 cores, where waiting on the BQL to handle
  interrupts dominates. I measured this on another machine (a 64-core one)
  that unfortunately is much slower than this 28-core one, so I don't have
  numbers for 1-16 cores there. That plot is normalized to the 16-core
  baseline performance, and is therefore rather ugly :-) https://imgur.com/XyKGkAw
  See below for an example of the *huge* amount of waiting on the BQL:

(qemu) info sync-profile
Type               Object  Call site                             Wait Time (s)         Count  Average (us)
----------------------------------------------------------------------------------------------------------
BQL mutex  0x55ba286c9800  accel/tcg/cpu-exec.c:545                 2868.85676      14872596        192.90
BQL mutex  0x55ba286c9800  hw/ppc/ppc.c:70                           539.58924       3666820        147.15
BQL mutex  0x55ba286c9800  target/ppc/helper_regs.h:105              323.49283       2544959        127.11
mutex      [           2]  util/qemu-timer.c:426                     181.38420       3666839         49.47
condvar    [          61]  cpus.c:1327                               136.50872         15379       8876.31
BQL mutex  0x55ba286c9800  accel/tcg/cpu-exec.c:516                   86.14785        946301         91.04
condvar    0x55ba286eb6a0  cpus-common.c:196                          78.41010           126     622302.35
BQL mutex  0x55ba286c9800  util/main-loop.c:236                       28.14795        272940        103.13
mutex      [          64]  include/qom/cpu.h:514                      17.87662      75139413          0.24
BQL mutex  0x55ba286c9800  target/ppc/translate_init.inc.c:8665        7.04738         36528        192.93
----------------------------------------------------------------------------------------------------------
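
For reference, the table above comes from QEMU's synchronization profiler
(util/qsp.c). With a build that has the profiler available, something like
the following HMP sequence produces it (a sketch of the commands, not exact
syntax):

    (qemu) sync-profile on
      ... run the workload ...
    (qemu) info sync-profile
    (qemu) sync-profile off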

Single-threaded performance is only lightly affected. Results below are for
a Debian aarch64 bootup+test of the entire series on an
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:

- Before:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7269.033478      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.06% )
    30,659,870,302      cycles                    #    4.218 GHz                      ( +-  0.06% )
    54,790,540,051      instructions              #    1.79  insns per cycle          ( +-  0.05% )
     9,796,441,380      branches                  # 1347.695 M/sec                    ( +-  0.05% )
       165,132,201      branch-misses             #    1.69% of all branches          ( +-  0.12% )

       7.287011656 seconds time elapsed                                          ( +-  0.10% )

- After:

       7375.924053      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.13% )
    31,107,548,846      cycles                    #    4.217 GHz                      ( +-  0.12% )
    55,355,668,947      instructions              #    1.78  insns per cycle          ( +-  0.05% )
     9,929,917,664      branches                  # 1346.261 M/sec                    ( +-  0.04% )
       166,547,442      branch-misses             #    1.68% of all branches          ( +-  0.09% )

       7.389068145 seconds time elapsed                                          ( +-  0.13% )

That is, about a 1.4% slowdown in wall-clock time.
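
For context, a minimal sketch of what the no-BQL variant added in the
previous patch boils down to -- illustrative only; the field and helper
names (cpu->work_list, wi->bql, cpu_mutex_lock/unlock) are assumptions
based on earlier patches in this series rather than the exact code:

    void async_run_on_cpu_no_bql(CPUState *cpu, run_on_cpu_func func,
                                 run_on_cpu_data data)
    {
        struct qemu_work_item *wi = g_new0(struct qemu_work_item, 1);

        wi->func = func;
        wi->data = data;
        wi->free = true;
        wi->bql = false;    /* dispatcher must not grab the BQL for this item */

        cpu_mutex_lock(cpu);                   /* per-CPU lock, not the BQL */
        QSIMPLEQ_INSERT_TAIL(&cpu->work_list, wi, node);
        cpu_mutex_unlock(cpu);
        qemu_cpu_kick(cpu);                    /* let the vCPU drain its queue */
    }

The point of this patch is then simply that queuing cross-vCPU TLB flush
work no longer requires the BQL: each work item is protected by the
destination vCPU's lock, and the flush runs on that vCPU without the BQL.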

Cc: Peter Crosthwaite <crosthwaite.peter@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cputlb.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 353d76d6a5..e3582f2f1d 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -212,7 +212,7 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
 
     CPU_FOREACH(cpu) {
         if (cpu != src) {
-            async_run_on_cpu(cpu, fn, d);
+            async_run_on_cpu_no_bql(cpu, fn, d);
         }
     }
 }
@@ -280,8 +280,8 @@ void tlb_flush(CPUState *cpu)
     if (cpu->created && !qemu_cpu_is_self(cpu)) {
         if (atomic_mb_read(&cpu->pending_tlb_flush) != ALL_MMUIDX_BITS) {
             atomic_mb_set(&cpu->pending_tlb_flush, ALL_MMUIDX_BITS);
-            async_run_on_cpu(cpu, tlb_flush_global_async_work,
-                             RUN_ON_CPU_NULL);
+            async_run_on_cpu_no_bql(cpu, tlb_flush_global_async_work,
+                                    RUN_ON_CPU_NULL);
         }
     } else {
         tlb_flush_nocheck(cpu);
@@ -341,8 +341,8 @@ void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
             tlb_debug("reduced mmu_idx: 0x%" PRIx16 "\n", pending_flushes);
 
             atomic_or(&cpu->pending_tlb_flush, pending_flushes);
-            async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
-                             RUN_ON_CPU_HOST_INT(pending_flushes));
+            async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work,
+                                    RUN_ON_CPU_HOST_INT(pending_flushes));
         }
     } else {
         tlb_flush_by_mmuidx_async_work(cpu,
@@ -442,8 +442,8 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
     tlb_debug("page :" TARGET_FMT_lx "\n", addr);
 
     if (!qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_flush_page_async_work,
-                         RUN_ON_CPU_TARGET_PTR(addr));
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_async_work,
+                                RUN_ON_CPU_TARGET_PTR(addr));
     } else {
         tlb_flush_page_async_work(cpu, RUN_ON_CPU_TARGET_PTR(addr));
     }
@@ -514,8 +514,9 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
     addr_and_mmu_idx |= idxmap;
 
     if (!qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_check_page_and_flush_by_mmuidx_async_work,
-                         RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
+        async_run_on_cpu_no_bql(cpu,
+                                tlb_check_page_and_flush_by_mmuidx_async_work,
+                                RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
     } else {
         tlb_check_page_and_flush_by_mmuidx_async_work(
             cpu, RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
-- 
2.17.1
