QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
From: Robert Foley <robert.foley@linaro.org>
To: qemu-devel@nongnu.org
Cc: richard.henderson@linaro.org, "Emilio G. Cota" <cota@braap.org>,
	alex.bennee@linaro.org, robert.foley@linaro.org,
	peter.puhov@linaro.org
Subject: [PATCH v8 74/74] cputlb: queue async flush jobs without the BQL
Date: Thu, 26 Mar 2020 15:31:56 -0400
Message-ID: <20200326193156.4322-75-robert.foley@linaro.org> (raw)
In-Reply-To: <20200326193156.4322-1-robert.foley@linaro.org>

From: "Emilio G. Cota" <cota@braap.org>

This yields sizable scalability improvements, as the below results show.

Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)

Workload: Ubuntu 18.04 ppc64 compiling the linux kernel with
"make -j N", where N is the number of cores in the guest.

                      Speedup vs a single thread (higher is better):

         14 +---------------------------------------------------------------+
            |       +    +       +      +       +      +      $$$$$$  +     |
            |                                            $$$$$              |
            |                                      $$$$$$                   |
         12 |-+                                $A$$                       +-|
            |                                $$                             |
            |                             $$$                               |
         10 |-+                         $$    ##D#####################D   +-|
            |                        $$$ #####**B****************           |
            |                      $$####*****                   *****      |
            |                    A$#*****                             B     |
          8 |-+                $$B**                                      +-|
            |                $$**                                           |
            |               $**                                             |
          6 |-+           $$*                                             +-|
            |            A**                                                |
            |           $B                                                  |
            |           $                                                   |
          4 |-+        $*                                                 +-|
            |          $                                                    |
            |         $                                                     |
          2 |-+      $                                                    +-|
            |        $                                 +cputlb-no-bql $$A$$ |
            |       A                                   +per-cpu-lock ##D## |
            |       +    +       +      +       +      +     baseline **B** |
          0 +---------------------------------------------------------------+
                    1    4       8      12      16     20      24     28
                                       Guest vCPUs
  png: https://imgur.com/zZRvS7q

Some notes:
- baseline corresponds to the commit before this series

- per-cpu-lock is the commit that converts the CPU loop to per-cpu locks.

- cputlb-no-bql is this commit.

- I'm using taskset to assign cores to threads, favouring locality whenever
  possible but not using SMT. When N=1, I'm using a single host core, which
  leads to superlinear speedups (since with more cores the I/O thread can execute
  while vCPU threads sleep). In the future I might use N+1 host cores for N
  guest cores to avoid this, or perhaps pin guest threads to cores one-by-one.

Single-threaded performance is affected very lightly. Results
below for debian aarch64 bootup+test for the entire series
on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:

- Before:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7269.033478      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.06% )
    30,659,870,302      cycles                    #    4.218 GHz                      ( +-  0.06% )
    54,790,540,051      instructions              #    1.79  insns per cycle          ( +-  0.05% )
     9,796,441,380      branches                  # 1347.695 M/sec                    ( +-  0.05% )
       165,132,201      branch-misses             #    1.69% of all branches          ( +-  0.12% )

       7.287011656 seconds time elapsed                                          ( +-  0.10% )

- After:

       7375.924053      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.13% )
    31,107,548,846      cycles                    #    4.217 GHz                      ( +-  0.12% )
    55,355,668,947      instructions              #    1.78  insns per cycle          ( +-  0.05% )
     9,929,917,664      branches                  # 1346.261 M/sec                    ( +-  0.04% )
       166,547,442      branch-misses             #    1.68% of all branches          ( +-  0.09% )

       7.389068145 seconds time elapsed                                          ( +-  0.13% )

That is, a 1.37% slowdown.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Robert Foley <robert.foley@linaro.org>
---
 accel/tcg/cputlb.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index e3b5750c3b..d13feaf3a3 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -284,7 +284,7 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
 
     CPU_FOREACH(cpu) {
         if (cpu != src) {
-            async_run_on_cpu(cpu, fn, d);
+            async_run_on_cpu_no_bql(cpu, fn, d);
         }
     }
 }
@@ -352,8 +352,8 @@ void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
     tlb_debug("mmu_idx: 0x%" PRIx16 "\n", idxmap);
 
     if (cpu->created && !qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
-                         RUN_ON_CPU_HOST_INT(idxmap));
+        async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work,
+                                RUN_ON_CPU_HOST_INT(idxmap));
     } else {
         tlb_flush_by_mmuidx_async_work(cpu, RUN_ON_CPU_HOST_INT(idxmap));
     }
@@ -547,7 +547,7 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
          * we can stuff idxmap into the low TARGET_PAGE_BITS, avoid
          * allocating memory for this operation.
          */
-        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_1,
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_1,
                          RUN_ON_CPU_TARGET_PTR(addr | idxmap));
     } else {
         TLBFlushPageByMMUIdxData *d = g_new(TLBFlushPageByMMUIdxData, 1);
@@ -555,7 +555,7 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
         /* Otherwise allocate a structure, freed by the worker.  */
         d->addr = addr;
         d->idxmap = idxmap;
-        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_2,
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_2,
                          RUN_ON_CPU_HOST_PTR(d));
     }
 }
-- 
2.17.1



  parent reply index

Thread overview: 100+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-26 19:30 [PATCH v8 00/74] per-CPU locks Robert Foley
2020-03-26 19:30 ` [PATCH v8 01/74] cpu: convert queued work to a QSIMPLEQ Robert Foley
2020-03-26 19:30 ` [PATCH v8 02/74] cpu: rename cpu->work_mutex to cpu->lock Robert Foley
2020-05-11 14:48   ` Alex Bennée
2020-05-11 16:33     ` Robert Foley
2020-03-26 19:30 ` [PATCH v8 03/74] cpu: introduce cpu_mutex_lock/unlock Robert Foley
2020-05-11 10:24   ` Alex Bennée
2020-05-11 16:09     ` Robert Foley
2020-03-26 19:30 ` [PATCH v8 04/74] cpu: make qemu_work_cond per-cpu Robert Foley
2020-03-26 19:30 ` [PATCH v8 05/74] cpu: move run_on_cpu to cpus-common Robert Foley
2020-03-26 19:30 ` [PATCH v8 06/74] cpu: introduce process_queued_cpu_work_locked Robert Foley
2020-03-26 19:30 ` [PATCH v8 07/74] cpu: make per-CPU locks an alias of the BQL in TCG rr mode Robert Foley
2020-03-26 19:30 ` [PATCH v8 08/74] tcg-runtime: define helper_cpu_halted_set Robert Foley
2020-03-26 19:30 ` [PATCH v8 09/74] ppc: convert to helper_cpu_halted_set Robert Foley
2020-03-26 19:30 ` [PATCH v8 10/74] cris: " Robert Foley
2020-03-26 19:30 ` [PATCH v8 11/74] hppa: " Robert Foley
2020-03-26 19:30 ` [PATCH v8 12/74] m68k: " Robert Foley
2020-03-26 19:30 ` [PATCH v8 13/74] alpha: " Robert Foley
2020-03-26 19:30 ` [PATCH v8 14/74] microblaze: " Robert Foley
2020-03-26 19:30 ` [PATCH v8 15/74] cpu: define cpu_halted helpers Robert Foley
2020-03-26 19:30 ` [PATCH v8 16/74] tcg-runtime: convert to cpu_halted_set Robert Foley
2020-03-26 19:30 ` [PATCH v8 17/74] hw/semihosting: " Robert Foley
2020-05-11 10:25   ` Alex Bennée
2020-03-26 19:31 ` [PATCH v8 18/74] arm: convert to cpu_halted Robert Foley
2020-03-26 19:31 ` [PATCH v8 19/74] ppc: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 20/74] sh4: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 21/74] i386: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 22/74] lm32: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 23/74] m68k: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 24/74] mips: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 25/74] riscv: " Robert Foley
2020-05-11 10:40   ` Alex Bennée
2020-05-11 16:13     ` Robert Foley
2020-03-26 19:31 ` [PATCH v8 26/74] s390x: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 27/74] sparc: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 28/74] xtensa: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 29/74] gdbstub: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 30/74] openrisc: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 31/74] cpu-exec: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 32/74] cpu: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 33/74] cpu: define cpu_interrupt_request helpers Robert Foley
2020-03-26 19:31 ` [PATCH v8 34/74] ppc: use cpu_reset_interrupt Robert Foley
2020-03-26 19:31 ` [PATCH v8 35/74] exec: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 36/74] i386: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 37/74] s390x: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 38/74] openrisc: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 39/74] arm: convert to cpu_interrupt_request Robert Foley
2020-03-26 19:31 ` [PATCH v8 40/74] i386: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 41/74] i386/kvm: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 42/74] i386/hax-all: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 43/74] i386/whpx-all: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 44/74] i386/hvf: convert to cpu_request_interrupt Robert Foley
2020-03-26 19:31 ` [PATCH v8 45/74] ppc: convert to cpu_interrupt_request Robert Foley
2020-03-26 19:31 ` [PATCH v8 46/74] sh4: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 47/74] cris: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 48/74] hppa: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 49/74] lm32: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 50/74] m68k: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 51/74] mips: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 52/74] nios: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 53/74] s390x: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 54/74] alpha: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 55/74] moxie: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 56/74] sparc: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 57/74] openrisc: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 58/74] unicore32: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 59/74] microblaze: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 60/74] accel/tcg: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 61/74] cpu: convert to interrupt_request Robert Foley
2020-03-26 19:31 ` [PATCH v8 62/74] cpu: call .cpu_has_work with the CPU lock held Robert Foley
2020-03-26 19:31 ` [PATCH v8 63/74] cpu: introduce cpu_has_work_with_iothread_lock Robert Foley
2020-03-26 19:31 ` [PATCH v8 64/74] ppc: convert to cpu_has_work_with_iothread_lock Robert Foley
2020-03-26 19:31 ` [PATCH v8 65/74] mips: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 66/74] s390x: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 67/74] riscv: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 68/74] sparc: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 69/74] xtensa: " Robert Foley
2020-03-26 19:31 ` [PATCH v8 70/74] cpu: rename all_cpu_threads_idle to qemu_tcg_rr_all_cpu_threads_idle Robert Foley
2020-03-26 19:31 ` [PATCH v8 71/74] cpu: protect CPU state with cpu->lock instead of the BQL Robert Foley
2020-03-26 19:31 ` [PATCH v8 72/74] cpus-common: release BQL earlier in run_on_cpu Robert Foley
2020-03-26 19:31 ` [PATCH v8 73/74] cpu: add async_run_on_cpu_no_bql Robert Foley
2020-03-26 19:31 ` Robert Foley [this message]
2020-05-12 16:27   ` [PATCH v8 74/74] cputlb: queue async flush jobs without the BQL Alex Bennée
2020-05-12 19:26     ` Robert Foley
2020-05-18 13:46       ` Robert Foley
2020-05-20  4:46         ` Emilio G. Cota
2020-05-20 15:01           ` Robert Foley
2020-05-21 14:17             ` Robert Foley
2020-05-12 18:38   ` Alex Bennée
2020-03-26 22:58 ` [PATCH v8 00/74] per-CPU locks Aleksandar Markovic
2020-03-27  9:39   ` Alex Bennée
2020-03-27  9:50     ` Aleksandar Markovic
2020-03-27 10:24       ` Aleksandar Markovic
2020-03-27 17:21         ` Robert Foley
2020-03-27  5:14 ` Emilio G. Cota
2020-03-27 10:59   ` Philippe Mathieu-Daudé
2020-03-30  8:57     ` Stefan Hajnoczi
2020-03-27 18:23   ` Alex Bennée
2020-03-27 18:30   ` Robert Foley
2020-05-12 16:29 ` Alex Bennée

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200326193156.4322-75-robert.foley@linaro.org \
    --to=robert.foley@linaro.org \
    --cc=alex.bennee@linaro.org \
    --cc=cota@braap.org \
    --cc=peter.puhov@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git
	git clone --mirror https://lore.kernel.org/qemu-devel/2 qemu-devel/git/2.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
		qemu-devel@nongnu.org
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git