[Qemu-devel] [PATCH v3 0/3] Live migration optimization for Thunderx platform

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v3 0/3] Live migration optimization for Thunderx platform
@ 2016-10-24  5:55 vijay.kilari
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 1/3] cutils: Set __builtin_prefetch optional parameters vijay.kilari
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: vijay.kilari @ 2016-10-24  5:55 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini, rth
  Cc: qemu-devel, vijay.kilari, Vijaya Kumar K

From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>

The CPU MIDR_EL1 register is exposed to userspace for arm64
with the below patch.
https://lkml.org/lkml/2016/7/8/467

Thunderx platform requires explicit prefetch instruction to
provide prefetch hint. Using MIDR_EL1 information, provided
by above kernel patch, prefetch is executed if the platform
is Thunderx.

The results of live migration time improvement is provided
in commit message of patch 2.

Note: Check for size of while prefetching beyond page is
not added. Making this check is counter productive on
performance of live migration.

v2 => v3:
   - Rebased on top of richard's patches.
   - Consider cache line size and line number to prefetch
   - Passed optional parameters to __builtin_prefetch
v1 => v2:
   - Rename util/cpuinfo.c as util/aarch64-cpuid.c
   - Introduced header file include/qemu/aarch64-cpuid.h
   - Place all arch specific code under define __aarch64__ and
     CONFIG_LINUX.
   - Used builtin_prefetch() to add prefetch instruction.
   - Moved arch specific changes out of generic code
   - Dropped prefetching 5th cache line.

Vijaya Kumar K (3):
  cutils: Set __builtin_prefetch optional parameters
  utils: Add helper to read arm MIDR_EL1 register
  utils: Add prefetch for Thunderx platform

 include/qemu/aarch64-cpuid.h |  9 +++++
 util/Makefile.objs           |  1 +
 util/aarch64-cpuid.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
 util/bufferiszero.c          | 45 ++++++++++++++++++++---
 4 files changed, 137 insertions(+), 5 deletions(-)
 create mode 100644 include/qemu/aarch64-cpuid.h
 create mode 100644 util/aarch64-cpuid.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 1/3] cutils: Set __builtin_prefetch optional parameters
  2016-10-24  5:55 [Qemu-devel] [PATCH v3 0/3] Live migration optimization for Thunderx platform vijay.kilari
@ 2016-10-24  5:55 ` vijay.kilari
  2016-10-24 15:43   ` Richard Henderson
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: vijay.kilari @ 2016-10-24  5:55 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini, rth
  Cc: qemu-devel, vijay.kilari, Vijaya Kumar K

From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>

Optional parameters of __builtin_prefetch() which specifies
rw and locality to 0's. For checking buffer is zero, set rw as read
and temporal locality to 0.

On arm64, __builtin_prefetch(addr) generates 'prfm    pldl1keep'
where __builtin_prefetch(addr, 0, 0) generates 'prfm pldl1strm'
instruction which is optimal for this use case

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
---
 util/bufferiszero.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index eb974b7..421d945 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -49,7 +49,7 @@ buffer_zero_int(const void *buf, size_t len)
         const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
 
         for (; p + 8 <= e; p += 8) {
-            __builtin_prefetch(p + 8);
+            __builtin_prefetch(p + 8, 0, 0);
             if (t) {
                 return false;
             }
@@ -86,7 +86,7 @@ buffer_zero_sse2(const void *buf, size_t len)
 
     /* Loop over 16-byte aligned blocks of 64.  */
     while (likely(p <= e)) {
-        __builtin_prefetch(p);
+        __builtin_prefetch(p, 0, 0);
         t = _mm_cmpeq_epi8(t, zero);
         if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
             return false;
@@ -127,7 +127,7 @@ buffer_zero_sse4(const void *buf, size_t len)
 
     /* Loop over 16-byte aligned blocks of 64.  */
     while (likely(p <= e)) {
-        __builtin_prefetch(p);
+        __builtin_prefetch(p, 0, 0);
         if (unlikely(!_mm_testz_si128(t, t))) {
             return false;
         }
@@ -162,7 +162,7 @@ buffer_zero_avx2(const void *buf, size_t len)
     if (likely(p <= e)) {
         /* Loop over 32-byte aligned blocks of 128.  */
         do {
-            __builtin_prefetch(p);
+            __builtin_prefetch(p, 0, 0);
             if (unlikely(!_mm256_testz_si256(t, t))) {
                 return false;
             }
@@ -303,7 +303,7 @@ bool buffer_is_zero(const void *buf, size_t len)
     }
 
     /* Fetch the beginning of the buffer while we select the accelerator.  */
-    __builtin_prefetch(buf);
+    __builtin_prefetch(buf, 0, 0);
 
     /* Use an optimized zero check if possible.  Note that this also
        includes a check for an unrolled loop over 64-bit integers.  */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register
  2016-10-24  5:55 [Qemu-devel] [PATCH v3 0/3] Live migration optimization for Thunderx platform vijay.kilari
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 1/3] cutils: Set __builtin_prefetch optional parameters vijay.kilari
@ 2016-10-24  5:55 ` vijay.kilari
  2016-10-24  9:39   ` Dr. David Alan Gilbert
  2016-10-24 15:47   ` Richard Henderson
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 3/3] utils: Add prefetch for Thunderx platform vijay.kilari
  2016-10-24 15:47 ` [Qemu-devel] [PATCH v3 0/3] Live migration optimization " Richard Henderson
  3 siblings, 2 replies; 12+ messages in thread
From: vijay.kilari @ 2016-10-24  5:55 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini, rth
  Cc: qemu-devel, vijay.kilari, Vijaya Kumar K

From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>

Add helper API to read MIDR_EL1 registers to fetch
cpu identification information. This helps in
adding errata's and architecture specific features.

This is implemented only for arm architecture.

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
---
 include/qemu/aarch64-cpuid.h |  9 +++++
 util/Makefile.objs           |  1 +
 util/aarch64-cpuid.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+)

diff --git a/include/qemu/aarch64-cpuid.h b/include/qemu/aarch64-cpuid.h
new file mode 100644
index 0000000..dbcb5ff
--- /dev/null
+++ b/include/qemu/aarch64-cpuid.h
@@ -0,0 +1,9 @@
+#ifndef QEMU_AARCH64_CPUID_H
+#define QEMU_AARCH64_CPUID_H
+
+#if defined(__aarch64__)
+uint64_t get_aarch64_cpu_id(void);
+bool is_thunderx_pass2_cpu(void);
+#endif
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 36c7dcc..d14a455 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -37,3 +37,4 @@ util-obj-y += log.o
 util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
+util-obj-y += aarch64-cpuid.o
diff --git a/util/aarch64-cpuid.c b/util/aarch64-cpuid.c
new file mode 100644
index 0000000..a6352ad
--- /dev/null
+++ b/util/aarch64-cpuid.c
@@ -0,0 +1,87 @@
+/*
+ * Dealing with arm cpu identification information.
+ *
+ * Copyright (C) 2016 Cavium, Inc.
+ *
+ * Authors:
+ *  Vijaya Kumar K <Vijaya.Kumar@cavium.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1
+ * or later.  See the COPYING.LIB file in the top-level directory.
+ */
+
+#include <math.h>
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/cutils.h"
+#include "qemu/aarch64-cpuid.h"
+
+#if defined(__aarch64__)
+#define MIDR_IMPLEMENTER_SHIFT  24
+#define MIDR_IMPLEMENTER_MASK   (0xffULL << MIDR_IMPLEMENTER_SHIFT)
+#define MIDR_ARCHITECTURE_SHIFT 16
+#define MIDR_ARCHITECTURE_MASK  (0xf << MIDR_ARCHITECTURE_SHIFT)
+#define MIDR_PARTNUM_SHIFT      4
+#define MIDR_PARTNUM_MASK       (0xfff << MIDR_PARTNUM_SHIFT)
+
+#define MIDR_CPU_PART(imp, partnum) \
+        (((imp)                 << MIDR_IMPLEMENTER_SHIFT)  | \
+        (0xf                    << MIDR_ARCHITECTURE_SHIFT) | \
+        ((partnum)              << MIDR_PARTNUM_SHIFT))
+
+#define ARM_CPU_IMP_CAVIUM        0x43
+#define CAVIUM_CPU_PART_THUNDERX  0x0A1
+
+#define MIDR_THUNDERX_PASS2  \
+               MIDR_CPU_PART(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
+#define CPU_MODEL_MASK  (MIDR_IMPLEMENTER_MASK | MIDR_ARCHITECTURE_MASK | \
+                         MIDR_PARTNUM_MASK)
+
+static uint64_t qemu_read_aarch64_midr_el1(void)
+{
+#ifdef CONFIG_LINUX
+    const char *file = "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1";
+    char *buf;
+    uint64_t midr = 0;
+
+#define BUF_SIZE 32
+    buf = g_malloc0(BUF_SIZE);
+    if (!buf) {
+        return 0;
+    }
+
+    if (!g_file_get_contents(file, &buf, 0, NULL)) {
+        goto out;
+    }
+
+    if (qemu_strtoull(buf, NULL, 0, &midr) < 0) {
+        goto out;
+    }
+
+out:
+    g_free(buf);
+
+    return midr;
+#else
+    return 0;
+#endif
+}
+
+static uint64_t aarch64_midr_val;
+uint64_t get_aarch64_cpu_id(void)
+{
+#ifdef CONFIG_LINUX
+    aarch64_midr_val = qemu_read_aarch64_midr_el1();
+    aarch64_midr_val &= CPU_MODEL_MASK;
+
+    return aarch64_midr_val;
+#else
+    return 0;
+#endif
+}
+
+bool is_thunderx_pass2_cpu(void)
+{
+    return aarch64_midr_val == MIDR_THUNDERX_PASS2;
+}
+#endif
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 3/3] utils: Add prefetch for Thunderx platform
  2016-10-24  5:55 [Qemu-devel] [PATCH v3 0/3] Live migration optimization for Thunderx platform vijay.kilari
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 1/3] cutils: Set __builtin_prefetch optional parameters vijay.kilari
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
@ 2016-10-24  5:55 ` vijay.kilari
  2016-10-24 11:25   ` Paolo Bonzini
  2016-10-24 15:47 ` [Qemu-devel] [PATCH v3 0/3] Live migration optimization " Richard Henderson
  3 siblings, 1 reply; 12+ messages in thread
From: vijay.kilari @ 2016-10-24  5:55 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini, rth
  Cc: qemu-devel, vijay.kilari, Vijaya Kumar K

From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>

Thunderx pass2 chip requires explicit prefetch
instruction to give prefetch hint.

To speed up live migration on Thunderx platform,
prefetch instruction is added in zero buffer check
function.The below results show live migration time improvement
with prefetch instruction. VM with 4 VCPUs, 8GB RAM is migrated.

Without prefetch total migration time is ~13 seconds
adding prefetch total migration time is 9.5 seconds

Code for decoding cache size is taken from Richard's
patch

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
---
 util/bufferiszero.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 421d945..f50b8df 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -25,6 +25,10 @@
 #include "qemu-common.h"
 #include "qemu/cutils.h"
 #include "qemu/bswap.h"
+#include <math.h>
+
+static uint32_t cache_line_factor = 1;
+static uint32_t prefetch_line_dist = 1;
 
 static bool
 buffer_zero_int(const void *buf, size_t len)
@@ -49,7 +53,8 @@ buffer_zero_int(const void *buf, size_t len)
         const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
 
         for (; p + 8 <= e; p += 8) {
-            __builtin_prefetch(p + 8, 0, 0);
+            __builtin_prefetch(p +
+               (8 * cache_line_factor * prefetch_line_dist), 0, 0);
             if (t) {
                 return false;
             }
@@ -293,6 +298,30 @@ bool test_buffer_is_zero_next_accel(void)
 }
 #endif
 
+#if defined(__aarch64__)
+#include "qemu/aarch64-cpuid.h"
+
+static void __attribute__((constructor)) aarch64_init_cache_size(void)
+{
+    uint64_t t;
+
+    /* Use the DZP block size as a proxy for the cacheline size,
+       since the later is not available to userspace.  This seems
+       to work in practice for existing implementations.  */
+    asm("mrs %0, dczid_el0" : "=r"(t));
+    if (pow(2, (t & 0xf)) * 4 >= 128) {
+        cache_line_factor = 2;
+    } else {
+        cache_line_factor = 1;
+    }
+
+    get_aarch64_cpu_id();
+    if (is_thunderx_pass2_cpu()) {
+        prefetch_line_dist = 3;
+    }
+}
+#endif
+
 /*
  * Checks if a buffer is all zeroes
  */
@@ -305,6 +334,12 @@ bool buffer_is_zero(const void *buf, size_t len)
     /* Fetch the beginning of the buffer while we select the accelerator.  */
     __builtin_prefetch(buf, 0, 0);
 
+#if defined(__aarch64__)
+    if (is_thunderx_pass2_cpu()) {
+        __builtin_prefetch(buf + 16, 0, 0);
+        __builtin_prefetch(buf + 32, 0, 0);
+    }
+#endif
     /* Use an optimized zero check if possible.  Note that this also
        includes a check for an unrolled loop over 64-bit integers.  */
     return select_accel_fn(buf, len);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
@ 2016-10-24  9:39   ` Dr. David Alan Gilbert
  2016-10-24 10:20     ` Vijay Kilari
  2016-10-24 11:26     ` Paolo Bonzini
  2016-10-24 15:47   ` Richard Henderson
  1 sibling, 2 replies; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2016-10-24  9:39 UTC (permalink / raw)
  To: vijay.kilari
  Cc: qemu-arm, peter.maydell, pbonzini, rth, qemu-devel, Vijaya Kumar K

* vijay.kilari@gmail.com (vijay.kilari@gmail.com) wrote:
> From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> 
> Add helper API to read MIDR_EL1 registers to fetch
> cpu identification information. This helps in
> adding errata's and architecture specific features.
> 
> This is implemented only for arm architecture.
> 
> Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> ---
>  include/qemu/aarch64-cpuid.h |  9 +++++
>  util/Makefile.objs           |  1 +
>  util/aarch64-cpuid.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++

It feels like there should be somewhere else to put this very ARM specific thing
that in util/ - not sure where though.

>  3 files changed, 97 insertions(+)
> 
> diff --git a/include/qemu/aarch64-cpuid.h b/include/qemu/aarch64-cpuid.h

> new file mode 100644
> index 0000000..dbcb5ff
> --- /dev/null
> +++ b/include/qemu/aarch64-cpuid.h
> @@ -0,0 +1,9 @@
> +#ifndef QEMU_AARCH64_CPUID_H
> +#define QEMU_AARCH64_CPUID_H
> +
> +#if defined(__aarch64__)
> +uint64_t get_aarch64_cpu_id(void);
> +bool is_thunderx_pass2_cpu(void);
> +#endif
> +
> +#endif
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 36c7dcc..d14a455 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -37,3 +37,4 @@ util-obj-y += log.o
>  util-obj-y += qdist.o
>  util-obj-y += qht.o
>  util-obj-y += range.o
> +util-obj-y += aarch64-cpuid.o
> diff --git a/util/aarch64-cpuid.c b/util/aarch64-cpuid.c
> new file mode 100644
> index 0000000..a6352ad
> --- /dev/null
> +++ b/util/aarch64-cpuid.c
> @@ -0,0 +1,87 @@
> +/*
> + * Dealing with arm cpu identification information.
> + *
> + * Copyright (C) 2016 Cavium, Inc.
> + *
> + * Authors:
> + *  Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1
> + * or later.  See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#include <math.h>
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/cutils.h"
> +#include "qemu/aarch64-cpuid.h"
> +
> +#if defined(__aarch64__)
> +#define MIDR_IMPLEMENTER_SHIFT  24
> +#define MIDR_IMPLEMENTER_MASK   (0xffULL << MIDR_IMPLEMENTER_SHIFT)
> +#define MIDR_ARCHITECTURE_SHIFT 16
> +#define MIDR_ARCHITECTURE_MASK  (0xf << MIDR_ARCHITECTURE_SHIFT)
> +#define MIDR_PARTNUM_SHIFT      4
> +#define MIDR_PARTNUM_MASK       (0xfff << MIDR_PARTNUM_SHIFT)
> +
> +#define MIDR_CPU_PART(imp, partnum) \
> +        (((imp)                 << MIDR_IMPLEMENTER_SHIFT)  | \
> +        (0xf                    << MIDR_ARCHITECTURE_SHIFT) | \
> +        ((partnum)              << MIDR_PARTNUM_SHIFT))
> +
> +#define ARM_CPU_IMP_CAVIUM        0x43
> +#define CAVIUM_CPU_PART_THUNDERX  0x0A1
> +
> +#define MIDR_THUNDERX_PASS2  \
> +               MIDR_CPU_PART(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
> +#define CPU_MODEL_MASK  (MIDR_IMPLEMENTER_MASK | MIDR_ARCHITECTURE_MASK | \
> +                         MIDR_PARTNUM_MASK)
> +
> +static uint64_t qemu_read_aarch64_midr_el1(void)
> +{
> +#ifdef CONFIG_LINUX
> +    const char *file = "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1";
> +    char *buf;
> +    uint64_t midr = 0;
> +
> +#define BUF_SIZE 32
> +    buf = g_malloc0(BUF_SIZE);
> +    if (!buf) {
> +        return 0;
> +    }

Do you need to do that? Isn't g_file_get_contents doing the allocation?

Dave

> +    if (!g_file_get_contents(file, &buf, 0, NULL)) {
> +        goto out;
> +    }
> +
> +    if (qemu_strtoull(buf, NULL, 0, &midr) < 0) {
> +        goto out;
> +    }
> +
> +out:
> +    g_free(buf);
> +
> +    return midr;
> +#else
> +    return 0;
> +#endif
> +}
> +
> +static uint64_t aarch64_midr_val;
> +uint64_t get_aarch64_cpu_id(void)
> +{
> +#ifdef CONFIG_LINUX
> +    aarch64_midr_val = qemu_read_aarch64_midr_el1();
> +    aarch64_midr_val &= CPU_MODEL_MASK;
> +
> +    return aarch64_midr_val;
> +#else
> +    return 0;
> +#endif
> +}
> +
> +bool is_thunderx_pass2_cpu(void)
> +{
> +    return aarch64_midr_val == MIDR_THUNDERX_PASS2;
> +}
> +#endif
> -- 
> 1.9.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register
  2016-10-24  9:39   ` Dr. David Alan Gilbert
@ 2016-10-24 10:20     ` Vijay Kilari
  2016-10-24 11:26     ` Paolo Bonzini
  1 sibling, 0 replies; 12+ messages in thread
From: Vijay Kilari @ 2016-10-24 10:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-arm, Peter Maydell, Paolo Bonzini, Richard Henderson,
	QEMU Developers, Vijaya Kumar K

On Mon, Oct 24, 2016 at 3:09 PM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * vijay.kilari@gmail.com (vijay.kilari@gmail.com) wrote:
>> From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
>>
>> Add helper API to read MIDR_EL1 registers to fetch
>> cpu identification information. This helps in
>> adding errata's and architecture specific features.
>>
>> This is implemented only for arm architecture.
>>
>> Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
>> ---
>>  include/qemu/aarch64-cpuid.h |  9 +++++
>>  util/Makefile.objs           |  1 +
>>  util/aarch64-cpuid.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
>
> It feels like there should be somewhere else to put this very ARM specific thing
> that in util/ - not sure where though.

  IRC, I tried it. But libutil is built before arch code compilation.
So cannot put
outside of util folder

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] utils: Add prefetch for Thunderx platform
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 3/3] utils: Add prefetch for Thunderx platform vijay.kilari
@ 2016-10-24 11:25   ` Paolo Bonzini
  2016-10-24 15:51     ` Richard Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2016-10-24 11:25 UTC (permalink / raw)
  To: vijay.kilari, qemu-arm, peter.maydell, rth; +Cc: qemu-devel, Vijaya Kumar K



On 24/10/2016 07:55, vijay.kilari@gmail.com wrote:
> From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> 
> Thunderx pass2 chip requires explicit prefetch
> instruction to give prefetch hint.
> 
> To speed up live migration on Thunderx platform,
> prefetch instruction is added in zero buffer check
> function.The below results show live migration time improvement
> with prefetch instruction. VM with 4 VCPUs, 8GB RAM is migrated.
> 
> Without prefetch total migration time is ~13 seconds
> adding prefetch total migration time is 9.5 seconds
> 
> Code for decoding cache size is taken from Richard's
> patch
> 
> Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> ---
>  util/bufferiszero.c | 37 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
> index 421d945..f50b8df 100644
> --- a/util/bufferiszero.c
> +++ b/util/bufferiszero.c
> @@ -25,6 +25,10 @@
>  #include "qemu-common.h"
>  #include "qemu/cutils.h"
>  #include "qemu/bswap.h"
> +#include <math.h>
> +
> +static uint32_t cache_line_factor = 1;

Let's express this in bytes, with a default value of 64 (so rename
cache_line_factor->cache_line_size).

> +static uint32_t prefetch_line_dist = 1;
>  
>  static bool
>  buffer_zero_int(const void *buf, size_t len)
> @@ -49,7 +53,8 @@ buffer_zero_int(const void *buf, size_t len)
>          const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
>  
>          for (; p + 8 <= e; p += 8) {
> -            __builtin_prefetch(p + 8, 0, 0);
> +            __builtin_prefetch(p +
> +               (8 * cache_line_factor * prefetch_line_dist), 0, 0);

You should precompute cache_line_bytes * prefetch_line_dist /
sizeof(uint64_t) in a single variable, prefetch_distance.  This saves
the effort of loading global variables repeatedly.  Then you can do

    __builtin_prefetch(p + prefetch_distance, 0, 0);

>              if (t) {
>                  return false;
>              }
> @@ -293,6 +298,30 @@ bool test_buffer_is_zero_next_accel(void)
>  }
>  #endif
>  
> +#if defined(__aarch64__)
> +#include "qemu/aarch64-cpuid.h"
> +
> +static void __attribute__((constructor)) aarch64_init_cache_size(void)
> +{
> +    uint64_t t;
> +
> +    /* Use the DZP block size as a proxy for the cacheline size,
> +       since the later is not available to userspace.  This seems
> +       to work in practice for existing implementations.  */
> +    asm("mrs %0, dczid_el0" : "=r"(t));
> +    if (pow(2, (t & 0xf)) * 4 >= 128) {
> +        cache_line_factor = 2;
> +    } else {
> +        cache_line_factor = 1;
> +    }
> +
> +    get_aarch64_cpu_id();
> +    if (is_thunderx_pass2_cpu()) {
> +        prefetch_line_dist = 3;
> +    }
> +}
> +#endif
> +
>  /*
>   * Checks if a buffer is all zeroes
>   */
> @@ -305,6 +334,12 @@ bool buffer_is_zero(const void *buf, size_t len)
>      /* Fetch the beginning of the buffer while we select the accelerator.  */
>      __builtin_prefetch(buf, 0, 0);
>  
> +#if defined(__aarch64__)
> +    if (is_thunderx_pass2_cpu()) {
> +        __builtin_prefetch(buf + 16, 0, 0);
> +        __builtin_prefetch(buf + 32, 0, 0);

This should not be ThunderX or aarch64 specific; it should be a loop like

    prefetch_distance_bytes = prefetch_line_dist * cache_line_size;
    for (i = 0; i < prefetch_distance_bytes; i += cache_line_size)
         __builtin_prefetch(buf + i, 0, 0);

In the default case, cache_line_bytes == prefetch_distance_bytes (both
are 64) and you will get the same behavior as the existing

    __builtin_prefetch(buf, 0, 0);

Thanks,

Paolo

> +    }
> +#endif
>      /* Use an optimized zero check if possible.  Note that this also
>         includes a check for an unrolled loop over 64-bit integers.  */
>      return select_accel_fn(buf, len);
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register
  2016-10-24  9:39   ` Dr. David Alan Gilbert
  2016-10-24 10:20     ` Vijay Kilari
@ 2016-10-24 11:26     ` Paolo Bonzini
  1 sibling, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2016-10-24 11:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, vijay.kilari
  Cc: qemu-arm, peter.maydell, rth, qemu-devel, Vijaya Kumar K



On 24/10/2016 11:39, Dr. David Alan Gilbert wrote:
> * vijay.kilari@gmail.com (vijay.kilari@gmail.com) wrote:
>> From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
>>
>> Add helper API to read MIDR_EL1 registers to fetch
>> cpu identification information. This helps in
>> adding errata's and architecture specific features.
>>
>> This is implemented only for arm architecture.
>>
>> Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
>> ---
>>  include/qemu/aarch64-cpuid.h |  9 +++++
>>  util/Makefile.objs           |  1 +
>>  util/aarch64-cpuid.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
> 
> It feels like there should be somewhere else to put this very ARM specific thing
> that in util/ - not sure where though.

It's okay I guess, the name is pretty clear.  What's important is a
clear split of arch-specific and generic code in bufferiszero.c.

Paolo

>>  3 files changed, 97 insertions(+)
>>
>> diff --git a/include/qemu/aarch64-cpuid.h b/include/qemu/aarch64-cpuid.h
> 
>> new file mode 100644
>> index 0000000..dbcb5ff
>> --- /dev/null
>> +++ b/include/qemu/aarch64-cpuid.h
>> @@ -0,0 +1,9 @@
>> +#ifndef QEMU_AARCH64_CPUID_H
>> +#define QEMU_AARCH64_CPUID_H
>> +
>> +#if defined(__aarch64__)
>> +uint64_t get_aarch64_cpu_id(void);
>> +bool is_thunderx_pass2_cpu(void);
>> +#endif
>> +
>> +#endif
>> diff --git a/util/Makefile.objs b/util/Makefile.objs
>> index 36c7dcc..d14a455 100644
>> --- a/util/Makefile.objs
>> +++ b/util/Makefile.objs
>> @@ -37,3 +37,4 @@ util-obj-y += log.o
>>  util-obj-y += qdist.o
>>  util-obj-y += qht.o
>>  util-obj-y += range.o
>> +util-obj-y += aarch64-cpuid.o
>> diff --git a/util/aarch64-cpuid.c b/util/aarch64-cpuid.c
>> new file mode 100644
>> index 0000000..a6352ad
>> --- /dev/null
>> +++ b/util/aarch64-cpuid.c
>> @@ -0,0 +1,87 @@
>> +/*
>> + * Dealing with arm cpu identification information.
>> + *
>> + * Copyright (C) 2016 Cavium, Inc.
>> + *
>> + * Authors:
>> + *  Vijaya Kumar K <Vijaya.Kumar@cavium.com>
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2.1
>> + * or later.  See the COPYING.LIB file in the top-level directory.
>> + */
>> +
>> +#include <math.h>
>> +#include "qemu/osdep.h"
>> +#include "qemu-common.h"
>> +#include "qemu/cutils.h"
>> +#include "qemu/aarch64-cpuid.h"
>> +
>> +#if defined(__aarch64__)
>> +#define MIDR_IMPLEMENTER_SHIFT  24
>> +#define MIDR_IMPLEMENTER_MASK   (0xffULL << MIDR_IMPLEMENTER_SHIFT)
>> +#define MIDR_ARCHITECTURE_SHIFT 16
>> +#define MIDR_ARCHITECTURE_MASK  (0xf << MIDR_ARCHITECTURE_SHIFT)
>> +#define MIDR_PARTNUM_SHIFT      4
>> +#define MIDR_PARTNUM_MASK       (0xfff << MIDR_PARTNUM_SHIFT)
>> +
>> +#define MIDR_CPU_PART(imp, partnum) \
>> +        (((imp)                 << MIDR_IMPLEMENTER_SHIFT)  | \
>> +        (0xf                    << MIDR_ARCHITECTURE_SHIFT) | \
>> +        ((partnum)              << MIDR_PARTNUM_SHIFT))
>> +
>> +#define ARM_CPU_IMP_CAVIUM        0x43
>> +#define CAVIUM_CPU_PART_THUNDERX  0x0A1
>> +
>> +#define MIDR_THUNDERX_PASS2  \
>> +               MIDR_CPU_PART(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
>> +#define CPU_MODEL_MASK  (MIDR_IMPLEMENTER_MASK | MIDR_ARCHITECTURE_MASK | \
>> +                         MIDR_PARTNUM_MASK)
>> +
>> +static uint64_t qemu_read_aarch64_midr_el1(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    const char *file = "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1";
>> +    char *buf;
>> +    uint64_t midr = 0;
>> +
>> +#define BUF_SIZE 32
>> +    buf = g_malloc0(BUF_SIZE);
>> +    if (!buf) {
>> +        return 0;
>> +    }
> 
> Do you need to do that? Isn't g_file_get_contents doing the allocation?
> 
> Dave
> 
>> +    if (!g_file_get_contents(file, &buf, 0, NULL)) {
>> +        goto out;
>> +    }
>> +
>> +    if (qemu_strtoull(buf, NULL, 0, &midr) < 0) {
>> +        goto out;
>> +    }
>> +
>> +out:
>> +    g_free(buf);
>> +
>> +    return midr;
>> +#else
>> +    return 0;
>> +#endif
>> +}
>> +
>> +static uint64_t aarch64_midr_val;
>> +uint64_t get_aarch64_cpu_id(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    aarch64_midr_val = qemu_read_aarch64_midr_el1();
>> +    aarch64_midr_val &= CPU_MODEL_MASK;
>> +
>> +    return aarch64_midr_val;
>> +#else
>> +    return 0;
>> +#endif
>> +}
>> +
>> +bool is_thunderx_pass2_cpu(void)
>> +{
>> +    return aarch64_midr_val == MIDR_THUNDERX_PASS2;
>> +}
>> +#endif
>> -- 
>> 1.9.1
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] cutils: Set __builtin_prefetch optional parameters
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 1/3] cutils: Set __builtin_prefetch optional parameters vijay.kilari
@ 2016-10-24 15:43   ` Richard Henderson
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Henderson @ 2016-10-24 15:43 UTC (permalink / raw)
  To: vijay.kilari, qemu-arm, peter.maydell, pbonzini
  Cc: qemu-devel, Vijaya Kumar K

On 10/23/2016 10:55 PM, vijay.kilari@gmail.com wrote:
> From: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> 
> Optional parameters of __builtin_prefetch() which specifies
> rw and locality to 0's. For checking buffer is zero, set rw as read
> and temporal locality to 0.
> 
> On arm64, __builtin_prefetch(addr) generates 'prfm    pldl1keep'
> where __builtin_prefetch(addr, 0, 0) generates 'prfm pldl1strm'
> instruction which is optimal for this use case
> 
> Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
> ---

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
  2016-10-24  9:39   ` Dr. David Alan Gilbert
@ 2016-10-24 15:47   ` Richard Henderson
  1 sibling, 0 replies; 12+ messages in thread
From: Richard Henderson @ 2016-10-24 15:47 UTC (permalink / raw)
  To: vijay.kilari, qemu-arm, peter.maydell, pbonzini
  Cc: qemu-devel, Vijaya Kumar K

On 10/23/2016 10:55 PM, vijay.kilari@gmail.com wrote:
> +static uint64_t aarch64_midr_val;
> +uint64_t get_aarch64_cpu_id(void)
> +{
> +#ifdef CONFIG_LINUX
> +    aarch64_midr_val = qemu_read_aarch64_midr_el1();
> +    aarch64_midr_val &= CPU_MODEL_MASK;
> +
> +    return aarch64_midr_val;
> +#else
> +    return 0;
> +#endif
> +}
> +
> +bool is_thunderx_pass2_cpu(void)
> +{
> +    return aarch64_midr_val == MIDR_THUNDERX_PASS2;
> +}

Any particular reason why you want to keep midr_val and MIDR_THUNDERX private
to this file?  Seems like it would be cheaper to export those in the header.


r~

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/3] Live migration optimization for Thunderx platform
  2016-10-24  5:55 [Qemu-devel] [PATCH v3 0/3] Live migration optimization for Thunderx platform vijay.kilari
                   ` (2 preceding siblings ...)
  2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 3/3] utils: Add prefetch for Thunderx platform vijay.kilari
@ 2016-10-24 15:47 ` Richard Henderson
  3 siblings, 0 replies; 12+ messages in thread
From: Richard Henderson @ 2016-10-24 15:47 UTC (permalink / raw)
  To: vijay.kilari, qemu-arm, peter.maydell, pbonzini
  Cc: qemu-devel, Vijaya Kumar K

On 10/23/2016 10:55 PM, vijay.kilari@gmail.com wrote:
> The results of live migration time improvement is provided
> in commit message of patch 2.

It's no longer there?


r~

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] utils: Add prefetch for Thunderx platform
  2016-10-24 11:25   ` Paolo Bonzini
@ 2016-10-24 15:51     ` Richard Henderson
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Henderson @ 2016-10-24 15:51 UTC (permalink / raw)
  To: Paolo Bonzini, vijay.kilari, qemu-arm, peter.maydell
  Cc: qemu-devel, Vijaya Kumar K

On 10/24/2016 04:25 AM, Paolo Bonzini wrote:
>> >          for (; p + 8 <= e; p += 8) {
>> > -            __builtin_prefetch(p + 8, 0, 0);
>> > +            __builtin_prefetch(p +
>> > +               (8 * cache_line_factor * prefetch_line_dist), 0, 0);
> You should precompute cache_line_bytes * prefetch_line_dist /
> sizeof(uint64_t) in a single variable, prefetch_distance.  This saves
> the effort of loading global variables repeatedly.  Then you can do
> 
>     __builtin_prefetch(p + prefetch_distance, 0, 0);
> 

Let's not complicate things by dividing by sizeof(uint64_t).
It's less complicated to avoid both that and the implied multiply.

  __builtin_prefetch((char *)p + prefetch_distance, 0, 0)


r~

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-10-24 16:06 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-24  5:55 [Qemu-devel] [PATCH v3 0/3] Live migration optimization for Thunderx platform vijay.kilari
2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 1/3] cutils: Set __builtin_prefetch optional parameters vijay.kilari
2016-10-24 15:43   ` Richard Henderson
2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 2/3] utils: Add helper to read arm MIDR_EL1 register vijay.kilari
2016-10-24  9:39   ` Dr. David Alan Gilbert
2016-10-24 10:20     ` Vijay Kilari
2016-10-24 11:26     ` Paolo Bonzini
2016-10-24 15:47   ` Richard Henderson
2016-10-24  5:55 ` [Qemu-devel] [PATCH v3 3/3] utils: Add prefetch for Thunderx platform vijay.kilari
2016-10-24 11:25   ` Paolo Bonzini
2016-10-24 15:51     ` Richard Henderson
2016-10-24 15:47 ` [Qemu-devel] [PATCH v3 0/3] Live migration optimization " Richard Henderson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.