All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Philippe Mathieu-Daudé" <f4bug@amsat.org>
To: "Emilio G. Cota" <cota@braap.org>, qemu-devel@nongnu.org
Cc: "Alex Bennée" <alex.bennee@linaro.org>,
	"Pranith Kumar" <bobby.prani@gmail.com>,
	"Richard Henderson" <richard.henderson@linaro.org>
Subject: Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
Date: Sun, 7 Oct 2018 19:37:50 +0200	[thread overview]
Message-ID: <197b7649-0bc7-4a9a-021b-a308a1755cf1@amsat.org> (raw)
In-Reply-To: <20181006214508.5331-7-cota@braap.org>

Hi Emilio,

On 10/6/18 11:45 PM, Emilio G. Cota wrote:
> Perform the resizing only on flushes, otherwise we'd
> have to take a perf hit by either rehashing the array
> or unnecessarily flushing it.
> 
> We grow the array aggressively, and reduce the size more
> slowly. This accommodates mixed workloads, where some
> processes might be memory-heavy while others are not.
> 
> As the following experiments show, this a net perf gain,
> particularly for memory-heavy workloads. Experiments
> are run on an Intel i7-6700K CPU @ 4.00GHz.
> 
> 1. System boot + shudown, debian aarch64:
> 
> - Before (tb-lock-v3):
>  Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
> 
>        7469.363393      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.07% )
>     31,507,707,190      cycles                    #    4.218 GHz                      ( +-  0.07% )
>     57,101,577,452      instructions              #    1.81  insns per cycle          ( +-  0.08% )
>     10,265,531,804      branches                  # 1374.352 M/sec                    ( +-  0.07% )
>        173,020,681      branch-misses             #    1.69% of all branches          ( +-  0.10% )
> 
>        7.483359063 seconds time elapsed                                          ( +-  0.08% )
> 
> - After:
>  Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
> 
>        7185.036730      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.11% )
>     30,303,501,143      cycles                    #    4.218 GHz                      ( +-  0.11% )
>     54,198,386,487      instructions              #    1.79  insns per cycle          ( +-  0.08% )
>      9,726,518,945      branches                  # 1353.719 M/sec                    ( +-  0.08% )
>        167,082,307      branch-misses             #    1.72% of all branches          ( +-  0.08% )
> 
>        7.195597842 seconds time elapsed                                          ( +-  0.11% )
> 
> That is, a 3.8% improvement.
> 
> 2. System boot + shutdown, ubuntu 18.04 x86_64:

You can also run the VM tests to build QEMU:

$ make vm-test
vm-test: Test QEMU in preconfigured virtual machines

  vm-build-ubuntu.i386            - Build QEMU in ubuntu i386 VM
  vm-build-freebsd                - Build QEMU in FreeBSD VM
  vm-build-netbsd                 - Build QEMU in NetBSD VM
  vm-build-openbsd                - Build QEMU in OpenBSD VM
  vm-build-centos                 - Build QEMU in CentOS VM, with Docker

> 
> - Before (tb-lock-v3):
> Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
> 
>       49971.036482      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.62% )
>    210,766,077,140      cycles                    #    4.218 GHz                      ( +-  1.63% )
>    428,829,830,790      instructions              #    2.03  insns per cycle          ( +-  0.75% )
>     77,313,384,038      branches                  # 1547.164 M/sec                    ( +-  0.54% )
>        835,610,706      branch-misses             #    1.08% of all branches          ( +-  2.97% )
> 
>       50.003855102 seconds time elapsed                                          ( +-  1.61% )
> 
> - After:
>  Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
> 
>       50118.124477      task-clock (msec)         #    0.999 CPUs utilized            ( +-  4.30% )
>            132,396      context-switches          #    0.003 M/sec                    ( +-  1.20% )
>                  0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
>            167,754      page-faults               #    0.003 M/sec                    ( +-  0.06% )
>    211,414,701,601      cycles                    #    4.218 GHz                      ( +-  4.30% )
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>    431,618,818,597      instructions              #    2.04  insns per cycle          ( +-  6.40% )
>     80,197,256,524      branches                  # 1600.165 M/sec                    ( +-  8.59% )
>        794,830,352      branch-misses             #    0.99% of all branches          ( +-  2.05% )
> 
>       50.177077175 seconds time elapsed                                          ( +-  4.23% )
> 
> No improvement (within noise range).
> 
> 3. x86_64 SPEC06int:
>                               SPEC06int (test set)
>                          [ Y axis: speedup over master ]
>   8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+
>     |                                                                       |
>     |                                                   tlb-lock-v3         |
>   7 +-+..................$$$...........................+indirection       +-+
>     |                    $ $                              +resizing         |
>     |                    $ $                                                |
>   6 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |                    $ $                                                |
>   5 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |                    $ $                                                |
>   4 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |          +++       $ $                                                |
>   3 +-+........$$+.......$.$..............................................+-+
>     |          $$        $ $                                                |
>     |          $$        $ $                                 $$$            |
>   2 +-+........$$........$.$.................................$.$..........+-+
>     |          $$        $ $                                 $ $       +$$  |
>     |          $$   $$+  $ $  $$$       +$$                  $ $  $$$   $$  |
>   1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+
>     |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
>     |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
>   0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+
>      401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean

This description line is hard to read ;)

> png: https://imgur.com/a/b1wn3wc
> 
> That is, a 1.53x average speedup over master, with a max speedup of 7.13x.
> 
> Note that "indirection" (i.e. the first patch in this series) incurs
> no overhead, on average.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/exec/cpu-defs.h |  1 +
>  accel/tcg/cputlb.c      | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 27b9433976..4d1d6b2b8b 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -145,6 +145,7 @@ typedef struct CPUTLBDesc {
>      size_t size;
>      size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
>      size_t used;
> +    size_t n_flushes_low_rate;
>  } CPUTLBDesc;
>  
>  #define CPU_COMMON_TLB  \
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 1ca71ecfc4..afb61e9c2b 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -85,6 +85,7 @@ void tlb_init(CPUState *cpu)
>          desc->size = MIN_CPU_TLB_SIZE;
>          desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
>          desc->used = 0;
> +        desc->n_flushes_low_rate = 0;
>          env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
>          env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
>      }
> @@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
>      return count;
>  }
>  
> +/* Call with tlb_lock held */
> +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
> +{
> +    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
> +    size_t rate = desc->used * 100 / desc->size;
> +    size_t new_size = desc->size;
> +
> +    if (rate == 100) {
> +        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate > 70) {
> +        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate < 30) {

I wonder if those thresholds might be per TCG_TARGET.

Btw the paper used 40% here, did you tried it too?

Regards,

Phil.

> +        desc->n_flushes_low_rate++;
> +        if (desc->n_flushes_low_rate == 100) {
> +            new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
> +            desc->n_flushes_low_rate = 0;
> +        }
> +    }
> +
> +    if (new_size == desc->size) {
> +        return;
> +    }
> +
> +    g_free(env->tlb_table[mmu_idx]);
> +    g_free(env->iotlb[mmu_idx]);
> +
> +    desc->size = new_size;
> +    desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
> +    desc->n_flushes_low_rate = 0;
> +    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, desc->size);
> +    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, desc->size);
> +}
> +
>  /* This is OK because CPU architectures generally permit an
>   * implementation to drop entries from the TLB at any time, so
>   * flushing more entries than required is only an efficiency issue,
> @@ -151,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
>       */
>      qemu_spin_lock(&env->tlb_lock);
>      for (i = 0; i < NB_MMU_MODES; i++) {
> +        tlb_mmu_resize_locked(env, i);
>          memset(env->tlb_table[i], -1,
>                 env->tlb_desc[i].size * sizeof(CPUTLBEntry));
>          env->tlb_desc[i].used = 0;
> @@ -215,6 +250,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
>          if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
>              tlb_debug("%d\n", mmu_idx);
>  
> +            tlb_mmu_resize_locked(env, mmu_idx);
>              memset(env->tlb_table[mmu_idx], -1,
>                     env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
>              memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
> 

  reply	other threads:[~2018-10-07 17:38 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-06 21:45 [Qemu-devel] [RFC 0/6] Dynamic TLB sizing Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 1/6] (XXX) cputlb: separate MMU allocation + run-time sizing Emilio G. Cota
2018-10-08  1:47   ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 2/6] cputlb: do not evict invalid entries to the vtlb Emilio G. Cota
2018-10-08  2:09   ` Richard Henderson
2018-10-08 14:42     ` Emilio G. Cota
2018-10-08 19:46       ` Richard Henderson
2018-10-08 20:23         ` Emilio G. Cota
2018-10-06 21:45 ` [Qemu-devel] [RFC 3/6] cputlb: track TLB use rates Emilio G. Cota
2018-10-08  2:54   ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 4/6] tcg: define TCG_TARGET_TLB_MAX_INDEX_BITS Emilio G. Cota
2018-10-08  2:56   ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 5/6] cpu-defs: define MIN_CPU_TLB_SIZE Emilio G. Cota
2018-10-08  3:01   ` Richard Henderson
2018-10-06 21:45 ` [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate Emilio G. Cota
2018-10-07 17:37   ` Philippe Mathieu-Daudé [this message]
2018-10-08  1:48     ` Emilio G. Cota
2018-10-08 13:46       ` Emilio G. Cota
2018-10-08  3:21   ` Richard Henderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=197b7649-0bc7-4a9a-021b-a308a1755cf1@amsat.org \
    --to=f4bug@amsat.org \
    --cc=alex.bennee@linaro.org \
    --cc=bobby.prani@gmail.com \
    --cc=cota@braap.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.