* [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
@ 2015-05-06 15:38 Alvise Rigo
  2015-05-06 15:38 ` [Qemu-devel] [RFC 1/5] exec: Add new exclusive bitmap to ram_list Alvise Rigo
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Alvise Rigo @ 2015-05-06 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

This patch series provides an infrastructure for implementing atomic
instructions in QEMU, paving the way for TCG multi-threading.
The adopted design does not rely on host atomic
instructions; it is intended as a 'legacy' solution for
translating guest atomic instructions.

The underlying idea is to provide new TCG instructions that guarantee
atomicity for certain memory accesses or, more generally, a way to define
memory transactions. More specifically, a new pair of TCG instructions is
implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
LoadLink and StoreConditional primitives (only the 32 bit variant is
implemented).  In order to achieve this, a new bitmap is added to the
ram_list structure (always unique) which flags all memory pages that
cannot be accessed directly through the fast-path, due to previous
exclusive operations. This new bitmap is coupled with a new TLB flag
which forces the slow-path execution. Any store made by another vCPU to
the same memory page between an LL and the corresponding SC will make
the subsequent StoreConditional fail.

In theory, the provided implementation of TCG LoadLink/StoreConditional
can be used to properly handle atomic instructions on any architecture.

The new slow-path is implemented such that:
- the LoadLink behaves as a normal load slow-path, except that it clears
  the dirty flag in the bitmap. The TLB entries created from now on will
  force the slow-path. To ensure this, we flush the TLB cache of the
  other vCPUs
- the StoreConditional behaves as a normal store slow-path, except that
  it checks the state of the dirty bitmap and returns 0 if the
  StoreConditional succeeded (no other vCPU has touched the same memory
  in the meantime) and 1 otherwise.

All the write accesses that are forced to follow the 'legacy'
slow-path will mark the accessed memory page as dirty.

In this series only the ARM ldrex/strex instructions are implemented.
The code was tested with bare-metal test cases and with Linux, using
upstream QEMU.

This work has been sponsored by Huawei Technologies Dusseldorf GmbH.

Alvise Rigo (5):
  exec: Add new exclusive bitmap to ram_list
  Add new TLB_EXCL flag
  softmmu: Add helpers for a new slow-path
  tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
  target-arm: translate: implement qemu_ldlink and qemu_stcond ops

 cputlb.c                |  11 ++-
 include/exec/cpu-all.h  |   1 +
 include/exec/cpu-defs.h |   2 +
 include/exec/memory.h   |   3 +-
 include/exec/ram_addr.h |  19 +++-
 softmmu_llsc_template.h | 233 ++++++++++++++++++++++++++++++++++++++++++++++++
 softmmu_template.h      |  52 ++++++++++-
 target-arm/translate.c  |  94 ++++++++++++++++++-
 tcg/arm/tcg-target.c    | 105 ++++++++++++++++------
 tcg/tcg-be-ldst.h       |   2 +
 tcg/tcg-op.c            |  20 +++++
 tcg/tcg-op.h            |   3 +
 tcg/tcg-opc.h           |   4 +
 tcg/tcg.c               |   2 +
 tcg/tcg.h               |  20 +++++
 15 files changed, 538 insertions(+), 33 deletions(-)
 create mode 100644 softmmu_llsc_template.h

-- 
2.4.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC 1/5] exec: Add new exclusive bitmap to ram_list
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
@ 2015-05-06 15:38 ` Alvise Rigo
  2015-05-07 17:12   ` Richard Henderson
  2015-05-06 15:38 ` [Qemu-devel] [RFC 2/5] Add new TLB_EXCL flag Alvise Rigo
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Alvise Rigo @ 2015-05-06 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

The purpose of this new bitmap is to flag the memory pages that are in
the middle of LL/SC operations (after an LL, before an SC).
For all these pages, the corresponding TLB entries will be generated
in such a way as to force the slow-path.

The accessors to this bitmap are currently not atomic, but they will
have to be made atomic for real multi-threaded TCG.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 include/exec/cpu-defs.h |  2 ++
 include/exec/memory.h   |  3 ++-
 include/exec/ram_addr.h | 19 ++++++++++++++++++-
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 0ca6f0b..d12cb4c 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -123,5 +123,7 @@ QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
 #define CPU_COMMON                                                      \
     /* soft mmu support */                                              \
     CPU_COMMON_TLB                                                      \
+    /* true if in the middle of a LoadLink/StoreConditional */          \
+    bool ll_sc_context;                                                 \
 
 #endif
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 06ffa1d..aadd2cc 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@
 #define DIRTY_MEMORY_VGA       0
 #define DIRTY_MEMORY_CODE      1
 #define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM       4        /* num of dirty bits */
 
 #include <stdint.h>
 #include <stdbool.h>
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index ff558a4..7a448ea 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
 
 #ifndef CONFIG_USER_ONLY
 #include "hw/xen/xen.h"
+#include "qemu/bitmap.h"
 
 ram_addr_t qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
                                     bool share, const char *mem_path,
@@ -199,9 +200,25 @@ static inline void cpu_physical_memory_clear_dirty_range(ram_addr_t start,
     cpu_physical_memory_clear_dirty_range_type(start, length, DIRTY_MEMORY_CODE);
 }
 
-
 void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t length,
                                      unsigned client);
 
+static inline void cpu_physical_memory_set_excl_dirty(ram_addr_t addr)
+{
+    clear_bit(addr >> TARGET_PAGE_BITS,
+              ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr)
+{
+    return !test_bit(addr >> TARGET_PAGE_BITS,
+                     ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+static inline void cpu_physical_memory_clear_excl_dirty(ram_addr_t addr)
+{
+    set_bit(addr >> TARGET_PAGE_BITS,
+            ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
 #endif
 #endif
-- 
2.4.0


* [Qemu-devel]  [RFC 2/5] Add new TLB_EXCL flag
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
  2015-05-06 15:38 ` [Qemu-devel] [RFC 1/5] exec: Add new exclusive bitmap to ram_list Alvise Rigo
@ 2015-05-06 15:38 ` Alvise Rigo
  2015-05-07 17:25   ` Richard Henderson
  2015-05-06 15:38 ` [Qemu-devel] [RFC 3/5] softmmu: Add helpers for a new slow-path Alvise Rigo
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Alvise Rigo @ 2015-05-06 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

Add a new flag for the TLB entries to force all the accesses made to a
page to follow the slow-path.

Mark the accessed page as dirty to invalidate any pending LL/SC
operation.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c               |  7 ++++++-
 include/exec/cpu-all.h |  1 +
 softmmu_template.h     | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 38f2151..3e4ccba 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -324,7 +324,12 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
                                                    + xlat)) {
             te->addr_write = address | TLB_NOTDIRTY;
         } else {
-            te->addr_write = address;
+            if (!cpu_physical_memory_excl_is_dirty(section->mr->ram_addr
+                                                   + xlat)) {
+                te->addr_write = address | TLB_EXCL;
+            } else {
+                te->addr_write = address;
+            }
         }
     } else {
         te->addr_write = -1;
diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index ac06c67..bd19a94 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -311,6 +311,7 @@ extern RAMList ram_list;
 #define TLB_NOTDIRTY    (1 << 4)
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO        (1 << 5)
+#define TLB_EXCL        (1 << 6)
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
diff --git a/softmmu_template.h b/softmmu_template.h
index 0e3dd35..1ac99da 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -242,6 +242,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
 #endif
 
     haddr = addr + env->tlb_table[mmu_idx][index].addend;
+
 #if DATA_SIZE == 1
     res = glue(glue(ld, LSUFFIX), _p)((uint8_t *)haddr);
 #else
@@ -262,7 +263,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
     uintptr_t haddr;
     DATA_TYPE res;
 
-    /* Adjust the given return address.  */
+        /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
 
     /* If the TLB entry is for a different page, reload and try again.  */
@@ -387,6 +388,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     uintptr_t haddr;
+    bool to_excl_mem = false;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -406,6 +408,14 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
+    if (unlikely(tlb_addr & TLB_EXCL &&
+        !(tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)))) {
+        /* The slow-path has been forced since we are writing to a page
+         * used for a load-link operation. */
+        to_excl_mem = true;
+        goto skip_io;
+    }
+
     /* Handle an IO access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
         hwaddr ioaddr;
@@ -445,6 +455,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         return;
     }
 
+skip_io:
     /* Handle aligned access or unaligned access in the same page.  */
 #ifdef ALIGNED_ONLY
     if ((addr & (DATA_SIZE - 1)) != 0) {
@@ -453,6 +464,17 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     }
 #endif
 
+    /* If we are performing a store to exclusive-protected memory,
+     * set the bit to dirty. */
+    if (unlikely(to_excl_mem)) {
+        MemoryRegionSection *section;
+        hwaddr xlat, sz;
+
+        section = address_space_translate_for_iotlb(ENV_GET_CPU(env), tlb_addr,
+                                                    &xlat, &sz);
+        cpu_physical_memory_set_excl_dirty(section->mr->ram_addr + xlat);
+    }
+
     haddr = addr + env->tlb_table[mmu_idx][index].addend;
 #if DATA_SIZE == 1
     glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
@@ -468,6 +490,14 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     uintptr_t haddr;
+    bool to_excl_mem = false;
+
+    if (unlikely(tlb_addr & TLB_EXCL &&
+            !(tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)))) {
+        /* The slow-path has been forced since we are writing to a page
+         * used for a load-link operation. */
+        to_excl_mem = true;
+    }
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -487,6 +517,10 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
+    if (to_excl_mem) {
+        goto skip_io;
+    }
+
     /* Handle an IO access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
         hwaddr ioaddr;
@@ -526,6 +560,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         return;
     }
 
+skip_io:
     /* Handle aligned access or unaligned access in the same page.  */
 #ifdef ALIGNED_ONLY
     if ((addr & (DATA_SIZE - 1)) != 0) {
@@ -534,6 +569,17 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     }
 #endif
 
+    /* If we are performing a store to exclusive-protected memory,
+     * set the bit to dirty. */
+    if (unlikely(to_excl_mem)) {
+        MemoryRegionSection *section;
+        hwaddr xlat, sz;
+
+        section = address_space_translate_for_iotlb(ENV_GET_CPU(env), tlb_addr,
+                                                    &xlat, &sz);
+        cpu_physical_memory_set_excl_dirty(section->mr->ram_addr + xlat);
+    }
+
     haddr = addr + env->tlb_table[mmu_idx][index].addend;
     glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
 }
-- 
2.4.0


* [Qemu-devel]  [RFC 3/5] softmmu: Add helpers for a new slow-path
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
  2015-05-06 15:38 ` [Qemu-devel] [RFC 1/5] exec: Add new exclusive bitmap to ram_list Alvise Rigo
  2015-05-06 15:38 ` [Qemu-devel] [RFC 2/5] Add new TLB_EXCL flag Alvise Rigo
@ 2015-05-06 15:38 ` Alvise Rigo
  2015-05-07 17:56   ` Richard Henderson
  2015-05-06 15:38 ` [Qemu-devel] [RFC 4/5] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions Alvise Rigo
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Alvise Rigo @ 2015-05-06 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

The new helpers rely on the legacy ones to perform the actual read/write.

The StoreConditional helper (helper_le_stcond_name) returns 1 if the
store has to fail due to a concurrent access to the same page by
another vCPU.  A 'concurrent access' can be a store made by *any* vCPU
(although, some implementations allow stores made by the CPU that issued
the LoadLink).

These helpers also update the TLB entry of the page involved in the
LL/SC, so that all the following accesses made by any vCPU will follow
the slow path.
In real multi-threading, these helpers will need to temporarily pause
the execution of the other vCPUs in order to update (flush) the TLB
cache accordingly.

Some corner cases still have to be properly handled, for example when a
vCPU doesn't pair a LoadLink with a StoreConditional.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                |   4 +
 softmmu_llsc_template.h | 233 ++++++++++++++++++++++++++++++++++++++++++++++++
 softmmu_template.h      |   4 +
 tcg/tcg-be-ldst.h       |   2 +
 tcg/tcg.h               |  20 +++++
 5 files changed, 263 insertions(+)
 create mode 100644 softmmu_llsc_template.h

diff --git a/cputlb.c b/cputlb.c
index 3e4ccba..415196d 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -378,8 +378,12 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
 #define SHIFT 1
 #include "softmmu_template.h"
 
+/* Generates LoadLink/StoreConditional helpers (for the time being only 4 bytes) */
+#define GEN_EXCLUSIVE_HELPERS
 #define SHIFT 2
 #include "softmmu_template.h"
+#include "softmmu_llsc_template.h"
+#undef GEN_EXCLUSIVE_HELPERS
 
 #define SHIFT 3
 #include "softmmu_template.h"
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
new file mode 100644
index 0000000..9f25db4
--- /dev/null
+++ b/softmmu_llsc_template.h
@@ -0,0 +1,233 @@
+/*
+ *  Software MMU support (exclusive load/store operations)
+ *
+ * Generate helpers used by TCG for qemu_ldlink/stcond ops and code load
+ * functions.
+ *
+ * Included from target op helpers and exec.c.
+ *
+ *  Copyright (c) 2003 Fabrice Bellard
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define DATA_SIZE (1 << SHIFT)
+
+#if DATA_SIZE == 8
+#define SUFFIX q
+#define LSUFFIX q
+#define SDATA_TYPE  int64_t
+#define DATA_TYPE  uint64_t
+#elif DATA_SIZE == 4
+#define SUFFIX l
+#define LSUFFIX l
+#define SDATA_TYPE  int32_t
+#define DATA_TYPE  uint32_t
+#elif DATA_SIZE == 2
+#define SUFFIX w
+#define LSUFFIX uw
+#define SDATA_TYPE  int16_t
+#define DATA_TYPE  uint16_t
+#elif DATA_SIZE == 1
+#define SUFFIX b
+#define LSUFFIX ub
+#define SDATA_TYPE  int8_t
+#define DATA_TYPE  uint8_t
+#else
+#error unsupported data size
+#endif
+
+/* For the benefit of TCG generated code, we want to avoid the complication
+   of ABI-specific return type promotion and always return a value extended
+   to the register size of the host.  This is tcg_target_long, except in the
+   case of a 32-bit host and 64-bit data, and for that we always have
+   uint64_t.  Don't bother with this widened value for SOFTMMU_CODE_ACCESS.  */
+#if DATA_SIZE == 8
+# define WORD_TYPE  DATA_TYPE
+# define USUFFIX    SUFFIX
+#else
+# define WORD_TYPE  tcg_target_ulong
+# define USUFFIX    glue(u, SUFFIX)
+# define SSUFFIX    glue(s, SUFFIX)
+#endif
+
+
+
+#if DATA_SIZE == 1
+# define helper_le_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
+# define helper_be_ldlink_name  helper_le_ldlink_name
+# define helper_le_ldlinks_name glue(glue(helper_ret_ldlink, SSUFFIX), MMUSUFFIX)
+# define helper_be_ldlinks_name helper_le_ldlinks_name
+# define helper_le_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
+# define helper_be_stcond_name  helper_le_stcond_name
+#else
+# define helper_le_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
+# define helper_be_ldlink_name  glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
+# define helper_le_ldlinks_name glue(glue(helper_le_ldlink, SSUFFIX), MMUSUFFIX)
+# define helper_be_ldlinks_name glue(glue(helper_be_ldlink, SSUFFIX), MMUSUFFIX)
+# define helper_le_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
+# define helper_be_stcond_name  glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
+#endif
+
+#ifdef TARGET_WORDS_BIGENDIAN
+# define helper_te_ldlink_name  helper_be_ldlink_name
+# define helper_te_stcond_name  helper_be_stcond_name
+#else
+# define helper_te_ldlink_name  helper_le_ldlink_name
+# define helper_te_stcond_name  helper_le_stcond_name
+#endif
+
+/* helpers from cpu_ldst.h, byte-order independent versions */
+#if DATA_SIZE == 1
+#define helper_ld_legacy glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
+#define helper_st_legacy glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
+#else
+#define helper_ld_legacy glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
+#define helper_st_legacy glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
+#endif
+
+#define is_write_tlb_entry_set(env, page, index)                             \
+({                                                                           \
+    (addr & TARGET_PAGE_MASK)                                                \
+         == ((env->tlb_table[mmu_idx][index].addr_write) &                   \
+                 (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
+})                                                                           \
+
+WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr, int mmu_idx,
+                                uintptr_t retaddr)
+{
+    WORD_TYPE ret;
+    int index;
+    CPUState *cpu;
+    target_ulong tlb_addr;
+    hwaddr xlat, sz;
+    MemoryRegionSection *section;
+
+    env->ll_sc_context = 1;
+
+    /* Use the proper load helper from cpu_ldst.h */
+    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
+
+    /* The last legacy access ensures that the TLB entry for 'addr' has been
+     * created. */
+    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    tlb_addr = env->tlb_table[mmu_idx][index].addr_read;
+
+    cpu = ENV_GET_CPU(env);
+    section = address_space_translate_for_iotlb(cpu, tlb_addr, &xlat, &sz);
+    cpu_physical_memory_clear_excl_dirty(section->mr->ram_addr + xlat);
+
+    /* Flush the TLB entry to force slow-path for following accesses. */
+    tlb_flush_page(cpu, tlb_addr);
+
+    /* Invalidate the TLB entry for the other processors. The next TLB entries
+     * for this page will have the TLB_EXCL flag set. */
+    CPU_FOREACH(cpu) {
+        if (cpu != current_cpu) {
+            tlb_flush(cpu, 1);
+        }
+    }
+
+    return ret;
+}
+
+uint32_t helper_le_stcond_name(CPUArchState *env, target_ulong addr,
+                               DATA_TYPE val, int mmu_idx, uintptr_t retaddr)
+{
+    CPUState *cpu;
+    uint32_t ret;
+    int index;
+    target_ulong tlb_addr;
+    hwaddr xlat, sz;
+    MemoryRegionSection *section;
+    ram_addr_t ram_addr;
+
+    /* If the TLB entry is not the right one, create it. */
+    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    if (!is_write_tlb_entry_set(env, addr, index)) {
+        tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
+    }
+
+    tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+
+    cpu = ENV_GET_CPU(env);
+    section = address_space_translate_for_iotlb(cpu, tlb_addr, &xlat, &sz);
+    ram_addr = section->mr->ram_addr + xlat;
+    if (cpu_physical_memory_excl_is_dirty(ram_addr) || !env->ll_sc_context) {
+        /* Another vCPU has accessed the memory after the LoadLink, or no
+         * link has previously been set. */
+        ret = 1;
+
+        goto out;
+    }
+
+    helper_st_legacy(env, addr, val, mmu_idx, retaddr);
+
+    /* Set the page as dirty to avoid the creation of TLB entries with the
+     * TLB_EXCL bit set. */
+    cpu_physical_memory_set_excl_dirty(ram_addr);
+
+    /* The StoreConditional succeeded */
+    ret = 0;
+
+out:
+    /* Flush the page since it is no longer protected. */
+    tlb_flush_page(cpu, tlb_addr);
+
+    CPU_FOREACH(cpu) {
+        if (cpu != current_cpu) {
+            tlb_flush(cpu, 1);
+        }
+    }
+    env->ll_sc_context = 0;
+
+    return ret;
+}
+
+#undef helper_le_ldlink_name
+#undef helper_be_ldlink_name
+#undef helper_le_ldlinks_name
+#undef helper_be_ldlinks_name
+#undef helper_le_stcond_name
+#undef helper_be_stcond_name
+#undef helper_te_ldlink_name
+#undef helper_te_stcond_name
+#undef helper_ld_legacy
+#undef helper_st_legacy
+
+/* undef softmmu_template macros */
+#undef READ_ACCESS_TYPE
+#undef SHIFT
+#undef DATA_TYPE
+#undef SUFFIX
+#undef LSUFFIX
+#undef DATA_SIZE
+#undef ADDR_READ
+#undef WORD_TYPE
+#undef SDATA_TYPE
+#undef USUFFIX
+#undef SSUFFIX
+#undef BSWAP
+#undef TGT_BE
+#undef TGT_LE
+#undef CPU_BE
+#undef CPU_LE
+#undef helper_le_ld_name
+#undef helper_be_ld_name
+#undef helper_le_lds_name
+#undef helper_be_lds_name
+#undef helper_le_st_name
+#undef helper_be_st_name
+#undef helper_te_ld_name
+#undef helper_te_st_name
diff --git a/softmmu_template.h b/softmmu_template.h
index 1ac99da..b15b2c1 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -594,6 +594,9 @@ glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
 
 #endif /* !defined(SOFTMMU_CODE_ACCESS) */
 
+#ifdef GEN_EXCLUSIVE_HELPERS
+/* All the defined macros will be undefined in softmmu_llsc_template.h */
+#else
 #undef READ_ACCESS_TYPE
 #undef SHIFT
 #undef DATA_TYPE
@@ -618,3 +621,4 @@ glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
 #undef helper_be_st_name
 #undef helper_te_ld_name
 #undef helper_te_st_name
+#endif
diff --git a/tcg/tcg-be-ldst.h b/tcg/tcg-be-ldst.h
index 4a45102..7cba87b 100644
--- a/tcg/tcg-be-ldst.h
+++ b/tcg/tcg-be-ldst.h
@@ -24,6 +24,8 @@
 
 typedef struct TCGLabelQemuLdst {
     bool is_ld;             /* qemu_ld: true, qemu_st: false */
+    bool is_llsc;           /* true if LoadLink/StoreConditional exclusive */
+    TCGReg llsc_success;    /* true if the StoreConditional succeeded */
     TCGMemOp opc;
     TCGType type;           /* result type of a load */
     TCGReg addrlo_reg;      /* reg index for low word of guest virtual addr */
diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..0607d07 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -898,6 +898,15 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
                                     int mmu_idx, uintptr_t retaddr);
 uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
                            int mmu_idx, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
+                                       int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+                                      int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+                                      int mmu_idx, uintptr_t retaddr);
+uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+                             int mmu_idx, uintptr_t retaddr);
 
 /* Value sign-extended to tcg register size.  */
 tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
@@ -925,6 +934,17 @@ void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                        int mmu_idx, uintptr_t retaddr);
 void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                        int mmu_idx, uintptr_t retaddr);
+/* Exclusive variants */
+uint32_t helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
+                              int mmu_idx, uintptr_t retaddr);
+uint32_t helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
+                             int mmu_idx, uintptr_t retaddr);
+uint32_t helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                             int mmu_idx, uintptr_t retaddr);
+uint32_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                             int mmu_idx, uintptr_t retaddr);
+
+
 
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
-- 
2.4.0


* [Qemu-devel] [RFC 4/5] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
                   ` (2 preceding siblings ...)
  2015-05-06 15:38 ` [Qemu-devel] [RFC 3/5] softmmu: Add helpers for a new slow-path Alvise Rigo
@ 2015-05-06 15:38 ` Alvise Rigo
  2015-05-07 17:58   ` Richard Henderson
  2015-05-06 15:38 ` [Qemu-devel] [RFC 5/5] target-arm: translate: implement qemu_ldlink and qemu_stcond ops Alvise Rigo
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Alvise Rigo @ 2015-05-06 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

Create a new pair of instructions that implement a LoadLink/StoreConditional
mechanism.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 tcg/tcg-op.c  | 20 ++++++++++++++++++++
 tcg/tcg-op.h  |  3 +++
 tcg/tcg-opc.h |  4 ++++
 tcg/tcg.c     |  2 ++
 4 files changed, 29 insertions(+)

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 2b6be75..7c3e85b 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -1886,6 +1886,13 @@ static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr,
 #endif
 }
 
+/* An output operand to return the StoreConditional result */
+static void gen_stcond_i32(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i32 val,
+                         TCGv addr, TCGMemOp memop, TCGArg idx)
+{
+    tcg_gen_op5ii_i32(opc, is_dirty, val, addr, memop, idx);
+}
+
 static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, TCGv addr,
                          TCGMemOp memop, TCGArg idx)
 {
@@ -1913,12 +1920,25 @@ void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
     gen_ldst_i32(INDEX_op_qemu_ld_i32, val, addr, memop, idx);
 }
 
+void tcg_gen_qemu_ldlink_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
+{
+    memop = tcg_canonicalize_memop(memop, 0, 0);
+    gen_ldst_i32(INDEX_op_qemu_ldlink_i32, val, addr, memop, idx);
+}
+
 void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     memop = tcg_canonicalize_memop(memop, 0, 1);
     gen_ldst_i32(INDEX_op_qemu_st_i32, val, addr, memop, idx);
 }
 
+void tcg_gen_qemu_stcond_i32(TCGv_i32 is_dirty, TCGv_i32 val, TCGv addr,
+                           TCGArg idx, TCGMemOp memop)
+{
+    memop = tcg_canonicalize_memop(memop, 0, 1);
+    gen_stcond_i32(INDEX_op_qemu_stcond_i32, is_dirty, val, addr, memop, idx);
+}
+
 void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index d1d763f..f183169 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -754,6 +754,9 @@ void tcg_gen_qemu_st_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_ld_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_st_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
 
+void tcg_gen_qemu_ldlink_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
+void tcg_gen_qemu_stcond_i32(TCGv_i32, TCGv_i32, TCGv, TCGArg, TCGMemOp);
+
 static inline void tcg_gen_qemu_ld8u(TCGv ret, TCGv addr, int mem_index)
 {
     tcg_gen_qemu_ld_tl(ret, addr, mem_index, MO_UB);
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 42d0cfe..c88411c 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -183,6 +183,10 @@ DEF(qemu_ld_i32, 1, TLADDR_ARGS, 2,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
 DEF(qemu_st_i32, 0, TLADDR_ARGS + 1, 2,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
+DEF(qemu_ldlink_i32, 1, TLADDR_ARGS, 2,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
+DEF(qemu_stcond_i32, 1, TLADDR_ARGS + 1, 2,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
 DEF(qemu_ld_i64, DATA64_ARGS, TLADDR_ARGS, 2,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
 DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 2,
diff --git a/tcg/tcg.c b/tcg/tcg.c
index f1558b7..368f6ae 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1068,6 +1068,8 @@ void tcg_dump_ops(TCGContext *s)
                 i = 1;
                 break;
             case INDEX_op_qemu_ld_i32:
+            case INDEX_op_qemu_ldlink_i32:
+            case INDEX_op_qemu_stcond_i32:
             case INDEX_op_qemu_st_i32:
             case INDEX_op_qemu_ld_i64:
             case INDEX_op_qemu_st_i64:
-- 
2.4.0
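The calling convention these opcodes add (qemu_ldlink_i32 opens an exclusive access, and qemu_stcond_i32 reports through its is_dirty output whether the store completed) can be illustrated with a small single-threaded C model. All names and state below are invented for the sketch and are not QEMU code:

```c
#include <assert.h>
#include <stdint.h>

/* Single-threaded model of the LL/SC contract (illustrative only).
 * One guest word stands in for memory; 'page_clean' stands in for the
 * per-page exclusive state the series keeps in the ram_list bitmap. */
static uint32_t mem_word;
static int page_clean;          /* set by ldlink, cleared by other stores */

static uint32_t model_ldlink_i32(void)
{
    page_clean = 1;             /* LoadLink: start monitoring the page */
    return mem_word;
}

/* Returns the is_dirty flag: 0 if the store succeeded, 1 if it failed. */
static int model_stcond_i32(uint32_t val)
{
    if (!page_clean) {
        return 1;               /* page touched since the LoadLink: fail */
    }
    mem_word = val;
    page_clean = 0;             /* a successful SC also ends the monitor */
    return 0;
}

static void model_plain_store_i32(uint32_t val)
{
    mem_word = val;
    page_clean = 0;             /* any other store breaks the link */
}
```

A front end consumes the is_dirty result with a branch, much as patch 5/5 does for strex.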


* [Qemu-devel] [RFC 5/5] target-arm: translate: implement qemu_ldlink and qemu_stcond ops
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
                   ` (3 preceding siblings ...)
  2015-05-06 15:38 ` [Qemu-devel] [RFC 4/5] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions Alvise Rigo
@ 2015-05-06 15:38 ` Alvise Rigo
  2015-05-06 15:51 ` [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Paolo Bonzini
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Alvise Rigo @ 2015-05-06 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

Implement the strex and ldrex instructions relying on TCG's new qemu_ldlink
and qemu_stcond ops.  For the time being, only the 32-bit variants are
supported.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 target-arm/translate.c |  94 ++++++++++++++++++++++++++++++++++++++++++-
 tcg/arm/tcg-target.c   | 105 ++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 170 insertions(+), 29 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 9116529..3b209a5 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -956,6 +956,18 @@ DO_GEN_ST(8, MO_UB)
 DO_GEN_ST(16, MO_TEUW)
 DO_GEN_ST(32, MO_TEUL)
 
+/* Load/Store exclusive generators (always unsigned) */
+static inline void gen_aa32_ldex32(TCGv_i32 val, TCGv_i32 addr, int index)
+{
+    tcg_gen_qemu_ldlink_i32(val, addr, index, MO_TEUL);
+}
+
+static inline void gen_aa32_stex32(TCGv_i32 is_excl, TCGv_i32 val,
+                                   TCGv_i32 addr, int index)
+{
+    tcg_gen_qemu_stcond_i32(is_excl, val, addr, index, MO_TEUL);
+}
+
 static inline void gen_set_pc_im(DisasContext *s, target_ulong val)
 {
     tcg_gen_movi_i32(cpu_R[15], val);
@@ -7420,6 +7432,26 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
 }
 
+static void gen_load_exclusive_multi(DisasContext *s, int rt, int rt2,
+                                     TCGv_i32 addr, int size)
+{
+    TCGv_i32 tmp = tcg_temp_new_i32();
+
+    switch (size) {
+    case 0:
+    case 1:
+        abort();
+    case 2:
+        gen_aa32_ldex32(tmp, addr, get_mem_index(s));
+        break;
+    case 3:
+    default:
+        abort();
+    }
+
+    store_reg(s, rt, tmp);
+}
+
 static void gen_clrex(DisasContext *s)
 {
     tcg_gen_movi_i64(cpu_exclusive_addr, -1);
@@ -7518,6 +7550,63 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     gen_set_label(done_label);
     tcg_gen_movi_i64(cpu_exclusive_addr, -1);
 }
+
+static void gen_store_exclusive_multi(DisasContext *s, int rd, int rt, int rt2,
+                                      TCGv_i32 addr, int size)
+{
+    TCGv_i32 tmp;
+    TCGv_i32 is_dirty;
+    TCGv_i32 success;
+    TCGLabel *done_label;
+    TCGLabel *fail_label;
+
+    fail_label = gen_new_label();
+    done_label = gen_new_label();
+
+    tmp = tcg_temp_new_i32();
+    is_dirty = tcg_temp_new_i32();
+    success = tcg_temp_new_i32();
+    switch (size) {
+    case 0:
+    case 1:
+        abort();
+        break;
+    case 2:
+        gen_aa32_stex32(is_dirty, tmp, addr, get_mem_index(s));
+        break;
+    case 3:
+    default:
+        abort();
+    }
+
+    tcg_temp_free_i32(tmp);
+
+    /* Check if the store conditional has to fail */
+    tcg_gen_movi_i32(success, 1);
+    tcg_gen_brcond_i32(TCG_COND_EQ, is_dirty, success, fail_label);
+    tcg_temp_free_i32(success);
+    tcg_temp_free_i32(is_dirty);
+
+    tmp = load_reg(s, rt);
+    switch (size) {
+    case 0:
+    case 1:
+        abort();
+    case 2:
+        gen_aa32_st32(tmp, addr, get_mem_index(s));
+        break;
+    case 3:
+    default:
+        abort();
+    }
+    tcg_temp_free_i32(tmp);
+
+    tcg_gen_movi_i32(cpu_R[rd], 0); // is_dirty = 0
+    tcg_gen_br(done_label);
+    gen_set_label(fail_label);
+    tcg_gen_movi_i32(cpu_R[rd], 1); // is_dirty = 1
+    gen_set_label(done_label);
+}
 #endif
 
 /* gen_srs:
@@ -8365,7 +8454,7 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
                         } else if (insn & (1 << 20)) {
                             switch (op1) {
                             case 0: /* ldrex */
-                                gen_load_exclusive(s, rd, 15, addr, 2);
+                                gen_load_exclusive_multi(s, rd, 15, addr, 2);
                                 break;
                             case 1: /* ldrexd */
                                 gen_load_exclusive(s, rd, rd + 1, addr, 3);
@@ -8383,7 +8472,8 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
                             rm = insn & 0xf;
                             switch (op1) {
                             case 0:  /*  strex */
-                                gen_store_exclusive(s, rd, rm, 15, addr, 2);
+                                gen_store_exclusive_multi(s, rd, rm, 15,
+                                                          addr, 2);
                                 break;
                             case 1: /*  strexd */
                                 gen_store_exclusive(s, rd, rm, rm + 1, addr, 3);
diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 01e6fbf..1a301be 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -1069,6 +1069,11 @@ static void * const qemu_ld_helpers[16] = {
     [MO_BESL] = helper_be_ldul_mmu,
 };
 
+/* LoadLink helpers, only unsigned */
+static void * const qemu_ldex_helpers[16] = {
+    [MO_LEUL] = helper_le_ldlinkul_mmu,
+};
+
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
  *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
  */
@@ -1082,6 +1087,11 @@ static void * const qemu_st_helpers[16] = {
     [MO_BEQ]  = helper_be_stq_mmu,
 };
 
+/* StoreConditional helpers, only unsigned */
+static void * const qemu_stex_helpers[16] = {
+    [MO_LEUL] = helper_le_stcondl_mmu,
+};
+
 /* Helper routines for marshalling helper function arguments into
  * the correct registers and stack.
  * argreg is where we want to put this argument, arg is the argument itself.
@@ -1222,6 +1232,7 @@ static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
    path for a load or store, so that we can later generate the correct
    helper code.  */
 static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOp opc,
+                                bool is_excl, TCGReg llsc_success,
                                 TCGReg datalo, TCGReg datahi, TCGReg addrlo,
                                 TCGReg addrhi, int mem_index,
                                 tcg_insn_unit *raddr, tcg_insn_unit *label_ptr)
@@ -1229,6 +1240,8 @@ static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOp opc,
     TCGLabelQemuLdst *label = new_ldst_label(s);
 
     label->is_ld = is_ld;
+    label->is_llsc = is_excl;
+    label->llsc_success = llsc_success;
     label->opc = opc;
     label->datalo_reg = datalo;
     label->datahi_reg = datahi;
@@ -1259,12 +1272,16 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     /* For armv6 we can use the canonical unsigned helpers and minimize
        icache usage.  For pre-armv6, use the signed helpers since we do
        not have a single insn sign-extend.  */
-    if (use_armv6_instructions) {
-        func = qemu_ld_helpers[opc & ~MO_SIGN];
+    if (lb->is_llsc) {
+        func = qemu_ldex_helpers[opc];
     } else {
-        func = qemu_ld_helpers[opc];
-        if (opc & MO_SIGN) {
-            opc = MO_UL;
+        if (use_armv6_instructions) {
+            func = qemu_ld_helpers[opc & ~MO_SIGN];
+        } else {
+            func = qemu_ld_helpers[opc];
+            if (opc & MO_SIGN) {
+                opc = MO_UL;
+            }
         }
     }
     tcg_out_call(s, func);
@@ -1336,7 +1353,14 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     argreg = tcg_out_arg_reg32(s, argreg, TCG_REG_R14);
 
     /* Tail-call to the helper, which will return to the fast path.  */
-    tcg_out_goto(s, COND_AL, qemu_st_helpers[opc]);
+    if (lb->is_llsc) {
+        tcg_out_call(s, qemu_stex_helpers[opc]);
+        /* Save the output of the StoreConditional */
+        tcg_out_mov_reg(s, COND_AL, lb->llsc_success, TCG_REG_R0);
+        tcg_out_goto(s, COND_AL, lb->raddr);
+    } else {
+        tcg_out_goto(s, COND_AL, qemu_st_helpers[opc]);
+    }
 }
 #endif /* SOFTMMU */
 
@@ -1460,7 +1484,8 @@ static inline void tcg_out_qemu_ld_direct(TCGContext *s, TCGMemOp opc,
     }
 }
 
-static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
+static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64,
+                            bool isLoadLink)
 {
     TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
     TCGMemOp opc;
@@ -1478,17 +1503,23 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
 
 #ifdef CONFIG_SOFTMMU
     mem_index = *args;
-    addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 1);
 
     /* This a conditional BL only to load a pointer within this opcode into LR
-       for the slow path.  We will not be using the value for a tail call.  */
-    label_ptr = s->code_ptr;
-    tcg_out_bl_noaddr(s, COND_NE);
-
-    tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
+       for the slow path.  We will not be using the value for a tail call.
+       In the context of a LoadLink instruction, we don't check the TLB but we
+       always follow the slow path.  */
+    if (isLoadLink) {
+        label_ptr = s->code_ptr;
+        tcg_out_bl_noaddr(s, COND_AL);
+    } else {
+        addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 1);
+        label_ptr = s->code_ptr;
+        tcg_out_bl_noaddr(s, COND_NE);
+        tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
+    }
 
-    add_qemu_ldst_label(s, true, opc, datalo, datahi, addrlo, addrhi,
-                        mem_index, s->code_ptr, label_ptr);
+    add_qemu_ldst_label(s, true, opc, isLoadLink, 0, datalo, datahi, addrlo,
+                        addrhi, mem_index, s->code_ptr, label_ptr);
 #else /* !CONFIG_SOFTMMU */
     if (GUEST_BASE) {
         tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP, GUEST_BASE);
@@ -1589,9 +1620,11 @@ static inline void tcg_out_qemu_st_direct(TCGContext *s, TCGMemOp opc,
     }
 }
 
-static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
+static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
+                            bool isStoreCond)
 {
     TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
+    TCGReg llsc_success;
     TCGMemOp opc;
 #ifdef CONFIG_SOFTMMU
     int mem_index;
@@ -1599,6 +1632,8 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
     tcg_insn_unit *label_ptr;
 #endif
 
+    llsc_success = (isStoreCond ? *args++ : 0);
+
     datalo = *args++;
     datahi = (is64 ? *args++ : 0);
     addrlo = *args++;
@@ -1607,16 +1642,24 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
 
 #ifdef CONFIG_SOFTMMU
     mem_index = *args;
-    addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 0);
 
-    tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
+    if (isStoreCond) {
+        /* Always follow the slow-path for an exclusive access */
+        label_ptr = s->code_ptr;
+        tcg_out_bl_noaddr(s, COND_AL);
+    } else {
+        addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 0);
 
-    /* The conditional call must come last, as we're going to return here.  */
-    label_ptr = s->code_ptr;
-    tcg_out_bl_noaddr(s, COND_NE);
+        tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
 
-    add_qemu_ldst_label(s, false, opc, datalo, datahi, addrlo, addrhi,
-                        mem_index, s->code_ptr, label_ptr);
+        /* The conditional call must come last, as we're going to return here.  */
+        label_ptr = s->code_ptr;
+        tcg_out_bl_noaddr(s, COND_NE);
+    }
+
+    add_qemu_ldst_label(s, false, opc, isStoreCond, llsc_success, datalo,
+                        datahi, addrlo, addrhi, mem_index, s->code_ptr,
+                        label_ptr);
 #else /* !CONFIG_SOFTMMU */
     if (GUEST_BASE) {
         tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP, GUEST_BASE);
@@ -1859,16 +1902,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
     case INDEX_op_qemu_ld_i32:
-        tcg_out_qemu_ld(s, args, 0);
+        tcg_out_qemu_ld(s, args, 0, 0);
+        break;
+    case INDEX_op_qemu_ldlink_i32:
+        tcg_out_qemu_ld(s, args, 0, 1); // LoadLink
         break;
     case INDEX_op_qemu_ld_i64:
-        tcg_out_qemu_ld(s, args, 1);
+        tcg_out_qemu_ld(s, args, 1, 0);
         break;
     case INDEX_op_qemu_st_i32:
-        tcg_out_qemu_st(s, args, 0);
+        tcg_out_qemu_st(s, args, 0, 0);
+        break;
+    case INDEX_op_qemu_stcond_i32:
+        tcg_out_qemu_st(s, args, 0, 1); // StoreConditional
         break;
     case INDEX_op_qemu_st_i64:
-        tcg_out_qemu_st(s, args, 1);
+        tcg_out_qemu_st(s, args, 1, 0);
         break;
 
     case INDEX_op_bswap16_i32:
@@ -1952,8 +2001,10 @@ static const TCGTargetOpDef arm_op_defs[] = {
 
 #if TARGET_LONG_BITS == 32
     { INDEX_op_qemu_ld_i32, { "r", "l" } },
+    { INDEX_op_qemu_ldlink_i32, { "r", "l" } },
     { INDEX_op_qemu_ld_i64, { "r", "r", "l" } },
     { INDEX_op_qemu_st_i32, { "s", "s" } },
+    { INDEX_op_qemu_stcond_i32, { "r", "s", "s" } },
     { INDEX_op_qemu_st_i64, { "s", "s", "s" } },
 #else
     { INDEX_op_qemu_ld_i32, { "r", "l", "l" } },
-- 
2.4.0
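gen_store_exclusive_multi above writes 0 to rd on success and 1 on failure, which is the architectural strex convention guests rely on for retry loops. A hedged C model of that guest-side usage (invented names, not QEMU or guest code):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the ldrex/strex retry pattern (illustrative only). */
static uint32_t word;
static int link_valid;

static uint32_t model_ldrex(void)
{
    link_valid = 1;             /* open the exclusive monitor */
    return word;
}

/* Returns the value strex would place in rd: 0 success, 1 failure. */
static int model_strex(uint32_t val)
{
    if (!link_valid) {
        return 1;
    }
    word = val;
    link_valid = 0;
    return 0;
}

/* Typical guest idiom: retry until the conditional store succeeds. */
static uint32_t model_atomic_inc(void)
{
    uint32_t old;
    do {
        old = model_ldrex();
    } while (model_strex(old + 1) != 0);
    return old;                 /* value before the increment */
}
```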


* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
                   ` (4 preceding siblings ...)
  2015-05-06 15:38 ` [Qemu-devel] [RFC 5/5] target-arm: translate: implement qemu_ldlink and qemu_stcond ops Alvise Rigo
@ 2015-05-06 15:51 ` Paolo Bonzini
  2015-05-06 16:00   ` Mark Burton
  2015-05-06 15:55 ` Mark Burton
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2015-05-06 15:51 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

On 06/05/2015 17:38, Alvise Rigo wrote:
> This patch series provides an infrastructure for atomic
> instruction implementation in QEMU, paving the way for TCG multi-threading.
> The adopted design does not rely on host atomic
> instructions and is intended to propose a 'legacy' solution for
> translating guest atomic instructions.
> 
> The underlying idea is to provide new TCG instructions that guarantee
> atomicity to some memory accesses or in general a way to define memory
> transactions. More specifically, a new pair of TCG instructions are
> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
> LoadLink and StoreConditional primitives (only 32 bit variant
> implemented).  In order to achieve this, a new bitmap is added to the
> ram_list structure (always unique) which flags all memory pages that
> could not be accessed directly through the fast-path, due to previous
> exclusive operations. This new bitmap is coupled with a new TLB flag
> which forces the slow-path execution. Any store performed by another
> vCPU to the same memory page between an LL/SC pair will make the
> subsequent StoreConditional fail.
> 
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.
> 
> The new slow-path is implemented such that:
> - the LoadLink behaves as a normal load slow-path, except for cleaning
>   the dirty flag in the bitmap. The TLB entries created from now on will
>   force the slow-path. To ensure it, we flush the TLB cache for the
>   other vCPUs
> - the StoreConditional behaves as a normal store slow-path, except for
>   checking the state of the dirty bitmap and returning 0 or 1 whether or
>   not the StoreConditional succeeded (0 when no vCPU has touched the
>   same memory in the mean time).
> 
> All those write accesses that are forced to follow the 'legacy'
> slow-path will set the accessed memory page to dirty.
> 
> In this series only the ARM ldrex/strex instructions are implemented.
> The code was tested with bare-metal test cases and with Linux, using
> upstream QEMU.
> 
> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
> 
> Alvise Rigo (5):
>   exec: Add new exclusive bitmap to ram_list
>   Add new TLB_EXCL flag
>   softmmu: Add helpers for a new slow-path
>   tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>   target-arm: translate: implement qemu_ldlink and qemu_stcond ops

That's pretty cool.

Paolo
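The TLB flag described in the cover letter can be pictured with a minimal sketch. The constants and the simplified comparison below are hypothetical; they only show why setting a flag bit in a TLB entry's address field makes the fast-path check fail and diverts stores to the slow path:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_MASK (~(uintptr_t)0xfff)   /* 4 KiB pages, hypothetical */
#define MODEL_TLB_EXCL  ((uintptr_t)0x8)      /* hypothetical flag bit */

typedef struct {
    uintptr_t addr_write;       /* page address, possibly with flag bits */
} ModelTLBEntry;

/* Fast-path check: a store may go direct only if the masked address
 * matches exactly; any flag bit (the exclusive flag included) breaks
 * the match, so exclusive pages always fall through to the slow path. */
static int store_needs_slow_path(const ModelTLBEntry *e, uintptr_t vaddr)
{
    return e->addr_write != (vaddr & MODEL_PAGE_MASK);
}
```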


* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
                   ` (5 preceding siblings ...)
  2015-05-06 15:51 ` [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Paolo Bonzini
@ 2015-05-06 15:55 ` Mark Burton
  2015-05-06 16:19   ` alvise rigo
  2015-05-08 15:22 ` Alex Bennée
  2015-05-08 18:29 ` Emilio G. Cota
  8 siblings, 1 reply; 27+ messages in thread
From: Mark Burton @ 2015-05-06 15:55 UTC (permalink / raw)
  To: Alvise Rigo; +Cc: mttcg, Jani Kokkonen, tech, claudio.fontana, QEMU Developers

A massive thank you for doing this work Alvise,

On our side, the patch we suggested is only applicable for ARM, though the mechanism would work for any CPU, 
	- BUT
It doesn’t force atomic instructions out through the slow path. This is either a very good thing (it’s much faster), or a very bad thing (it doesn’t allow you to treat them in the IO space), depending on your point of view.

Depending on what the rest of the community thinks, it seems to me we should apply both patches: ARM’s existing atomic instructions would run much faster and, above all, more ‘accurately’ with the patch we’ve provided, the same mechanism could be applied to all other architectures, and we could somehow swap in this more ‘controllable’ implementation when e.g. the mutex is located in IO space.

Cheers

Mark.

> On 6 May 2015, at 17:38, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
> 
> This patch series provides an infrastructure for atomic
> instruction implementation in QEMU, paving the way for TCG multi-threading.
> [rest of the cover letter and diffstat trimmed]


	 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

	+33 (0)603762104
	mark.burton


* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-06 15:51 ` [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Paolo Bonzini
@ 2015-05-06 16:00   ` Mark Burton
  2015-05-06 16:21     ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Mark Burton @ 2015-05-06 16:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, claudio.fontana, Alvise Rigo, QEMU Developers,
	Jani Kokkonen, tech

By the way - would it help to send the atomic patch we did separately from the whole MTTCG patch set?
Or have you already taken a look at that - it’s pretty short.

Cheers

Mark.


> On 6 May 2015, at 17:51, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 06/05/2015 17:38, Alvise Rigo wrote:
>> [cover letter trimmed]
> 
> That's pretty cool.
> 
> Paolo




* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-06 15:55 ` Mark Burton
@ 2015-05-06 16:19   ` alvise rigo
  2015-05-06 16:20     ` Mark Burton
  0 siblings, 1 reply; 27+ messages in thread
From: alvise rigo @ 2015-05-06 16:19 UTC (permalink / raw)
  To: Mark Burton
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

Hi Mark,

Firstly, thank you for your feedback.

On Wed, May 6, 2015 at 5:55 PM, Mark Burton <mark.burton@greensocs.com> wrote:
> A massive thank you for doing this work Alvise,
>
> On our side, the patch we suggested is only applicable for ARM, though the mechanism would work for any CPU,
>         - BUT
> It doesn’t force atomic instructions out through the slow path. This is either a very good thing (it’s much faster), or a very bad thing (it doesn’t allow you to treat them in the IO space), depending on your point of view.

Indeed, this is for sure a more invasive approach, but it's made on
purpose to have control over those non-atomic stores that might modify
the 'linked' memory.

>
> Depending on what the rest of the community thinks, it seems to me we should apply both patches so that e.g. ARM’s existing atomic instructions run much faster and above all more ‘accurately’ - (with the patch we’ve provided),  and the same mechanism can be applied to all other architectures - but we can - somehow - swap for this more ‘controllable’ implementation when e.g. the mutex is located in IO space….

Yes, this makes sense.

Thank you,
alvise

>
> Cheers
>
> Mark.
>
>> On 6 May 2015, at 17:38, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>> [cover letter, diffstat, and signature trimmed]


* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-06 16:19   ` alvise rigo
@ 2015-05-06 16:20     ` Mark Burton
  0 siblings, 0 replies; 27+ messages in thread
From: Mark Burton @ 2015-05-06 16:20 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers


> On 6 May 2015, at 18:19, alvise rigo <a.rigo@virtualopensystems.com> wrote:
> 
> Hi Mark,
> 
> Firstly, thank you for your feedback.
> 
> On Wed, May 6, 2015 at 5:55 PM, Mark Burton <mark.burton@greensocs.com> wrote:
>> A massive thank you for doing this work Alvise,
>> 
>> On our side, the patch we suggested is only applicable for ARM, though the mechanism would work for any CPU,
>>        - BUT
>> It doesn’t force atomic instructions out through the slow path. This is either a very good thing (it’s much faster), or a very bad thing (it doesn’t allow you to treat them in the IO space), depending on your point of view.
> 
> Indeed, this is for sure a more invasive approach, but it's made on
> purpose to have control over those non-atomic stores that might modify
> the 'linked' memory.
> 

exactly

:-)
Cheers
Mark.

>> 
>> Depending on what the rest of the community thinks, it seems to me we should apply both patches so that e.g. ARM’s existing atomic instructions run much faster and above all more ‘accurately’ - (with the patch we’ve provided),  and the same mechanism can be applied to all other architectures - but we can - somehow - swap for this more ‘controllable’ implementation when e.g. the mutex is located in IO space….
> 
> Yes, this makes sense.
> 
> Thank you,
> alvise
> 
>> 
>> Cheers
>> 
>> Mark.
>> 
>>> On 6 May 2015, at 17:38, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>>> 
>>> This patch series provides an infrastructure for atomic
>>> instruction implementation in QEMU, paving the way for TCG multi-threading.
>>> The adopted design does not rely on host atomic
>>> instructions and is intended to propose a 'legacy' solution for
>>> translating guest atomic instructions.
>>> 
>>> The underlying idea is to provide new TCG instructions that guarantee
>>> atomicity to some memory accesses or in general a way to define memory
>>> transactions. More specifically, a new pair of TCG instructions are
>>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>>> LoadLink and StoreConditional primitives (only 32 bit variant
>>> implemented).  In order to achieve this, a new bitmap is added to the
>>> ram_list structure (always unique) which flags all memory pages that
>>> could not be accessed directly through the fast-path, due to previous
>>> exclusive operations. This new bitmap is coupled with a new TLB flag
>>> which forces the slow-path execution. Any store performed by another
>>> vCPU to the same memory page between an LL and its SC will make the
>>> subsequent StoreConditional fail.
>>> 
>>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>>> can be used to properly handle atomic instructions on any architecture.
>>> 
>>> The new slow-path is implemented such that:
>>> - the LoadLink behaves as a normal load slow-path, except for cleaning
>>> the dirty flag in the bitmap. The TLB entries created from now on will
>>> force the slow-path. To ensure it, we flush the TLB cache for the
>>> other vCPUs
>>> - the StoreConditional behaves as a normal store slow-path, except for
>>> checking the state of the dirty bitmap and returning 0 or 1 depending
>>> on whether the StoreConditional succeeded (0 when no other vCPU has
>>> touched the same memory in the meantime).
>>> 
>>> All those write accesses that are forced to follow the 'legacy'
>>> slow-path will set the accessed memory page to dirty.
>>> 
>>> In this series only the ARM ldrex/strex instructions are implemented.
>>> The code was tested with bare-metal test cases and with Linux, using
>>> upstream QEMU.
>>> 
>>> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
>>> 
>>> Alvise Rigo (5):
>>> exec: Add new exclusive bitmap to ram_list
>>> Add new TLB_EXCL flag
>>> softmmu: Add helpers for a new slow-path
>>> tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>> target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>>> 
>>> cputlb.c                |  11 ++-
>>> include/exec/cpu-all.h  |   1 +
>>> include/exec/cpu-defs.h |   2 +
>>> include/exec/memory.h   |   3 +-
>>> include/exec/ram_addr.h |  19 +++-
>>> softmmu_llsc_template.h | 233 ++++++++++++++++++++++++++++++++++++++++++++++++
>>> softmmu_template.h      |  52 ++++++++++-
>>> target-arm/translate.c  |  94 ++++++++++++++++++-
>>> tcg/arm/tcg-target.c    | 105 ++++++++++++++++------
>>> tcg/tcg-be-ldst.h       |   2 +
>>> tcg/tcg-op.c            |  20 +++++
>>> tcg/tcg-op.h            |   3 +
>>> tcg/tcg-opc.h           |   4 +
>>> tcg/tcg.c               |   2 +
>>> tcg/tcg.h               |  20 +++++
>>> 15 files changed, 538 insertions(+), 33 deletions(-)
>>> create mode 100644 softmmu_llsc_template.h
>>> 
>>> --
>>> 2.4.0
>>> 
>> 
>> 



* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-06 16:00   ` Mark Burton
@ 2015-05-06 16:21     ` alvise rigo
  0 siblings, 0 replies; 27+ messages in thread
From: alvise rigo @ 2015-05-06 16:21 UTC (permalink / raw)
  To: Mark Burton
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

On Wed, May 6, 2015 at 6:00 PM, Mark Burton <mark.burton@greensocs.com> wrote:
> By the way - would it help to send the atomic patch we did separately from the whole MTTCG patch set?

I don't think you should spend time on this. As you said it's short, I
can do it by myself when necessary.

Thank you,
alvise

> Or have you already taken a look at that - it’s pretty short.
>
> Cheers
>
> Mark.
>
>
>> On 6 May 2015, at 17:51, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 06/05/2015 17:38, Alvise Rigo wrote:
>>> This patch series provides an infrastructure for atomic
>>> instruction implementation in QEMU, paving the way for TCG multi-threading.
>>> The adopted design does not rely on host atomic
>>> instructions and is intended to propose a 'legacy' solution for
>>> translating guest atomic instructions.
>>>
>>> The underlying idea is to provide new TCG instructions that guarantee
>>> atomicity to some memory accesses or in general a way to define memory
>>> transactions. More specifically, a new pair of TCG instructions are
>>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>>> LoadLink and StoreConditional primitives (only 32 bit variant
>>> implemented).  In order to achieve this, a new bitmap is added to the
>>> ram_list structure (always unique) which flags all memory pages that
>>> could not be accessed directly through the fast-path, due to previous
>>> exclusive operations. This new bitmap is coupled with a new TLB flag
>>> which forces the slow-path execution. Any store performed by another
>>> vCPU to the same memory page between an LL and its SC will make the
>>> subsequent StoreConditional fail.
>>>
>>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>>> can be used to properly handle atomic instructions on any architecture.
>>>
>>> The new slow-path is implemented such that:
>>> - the LoadLink behaves as a normal load slow-path, except for cleaning
>>>  the dirty flag in the bitmap. The TLB entries created from now on will
>>>  force the slow-path. To ensure it, we flush the TLB cache for the
>>>  other vCPUs
>>> - the StoreConditional behaves as a normal store slow-path, except for
>>>  checking the state of the dirty bitmap and returning 0 or 1 depending
>>>  on whether the StoreConditional succeeded (0 when no other vCPU has
>>>  touched the same memory in the meantime).
>>>
>>> All those write accesses that are forced to follow the 'legacy'
>>> slow-path will set the accessed memory page to dirty.
>>>
>>> In this series only the ARM ldrex/strex instructions are implemented.
>>> The code was tested with bare-metal test cases and with Linux, using
>>> upstream QEMU.
>>>
>>> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
>>>
>>> Alvise Rigo (5):
>>>  exec: Add new exclusive bitmap to ram_list
>>>  Add new TLB_EXCL flag
>>>  softmmu: Add helpers for a new slow-path
>>>  tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>>  target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>>
>> That's pretty cool.
>>
>> Paolo
>
>


* Re: [Qemu-devel] [RFC 1/5] exec: Add new exclusive bitmap to ram_list
  2015-05-06 15:38 ` [Qemu-devel] [RFC 1/5] exec: Add new exclusive bitmap to ram_list Alvise Rigo
@ 2015-05-07 17:12   ` Richard Henderson
  2015-05-11  7:48     ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2015-05-07 17:12 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

On 05/06/2015 08:38 AM, Alvise Rigo wrote:
> The purpose of this new bitmap is to flag the memory pages that are in
> the middle of LL/SC operations (after a LL, before a SC).
> For all these pages, the corresponding TLB entries will be generated
> in such a way to force the slow-path.
> 
> The accessors to this bitmap are currently not atomic, but they have to
> be so in a real multi-threading TCG.
> 
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  include/exec/cpu-defs.h |  2 ++
>  include/exec/memory.h   |  3 ++-
>  include/exec/ram_addr.h | 19 ++++++++++++++++++-
>  3 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 0ca6f0b..d12cb4c 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -123,5 +123,7 @@ QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
>  #define CPU_COMMON                                                      \
>      /* soft mmu support */                                              \
>      CPU_COMMON_TLB                                                      \
> +    /* true if in the middle of a LoadLink/StoreConditional */          \
> +    bool ll_sc_context;                                                 \

Belongs to a different patch?


r~


* Re: [Qemu-devel] [RFC 2/5] Add new TLB_EXCL flag
  2015-05-06 15:38 ` [Qemu-devel] [RFC 2/5] Add new TLB_EXCL flag Alvise Rigo
@ 2015-05-07 17:25   ` Richard Henderson
  2015-05-11  7:47     ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2015-05-07 17:25 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

On 05/06/2015 08:38 AM, Alvise Rigo wrote:
> Add a new flag for the TLB entries to force all the accesses made to a
> page to follow the slow-path.
> 
> Mark the accessed page as dirty to invalidate any pending operation of
> LL/SC.
> 
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cputlb.c               |  7 ++++++-
>  include/exec/cpu-all.h |  1 +
>  softmmu_template.h     | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 54 insertions(+), 2 deletions(-)
> 
> diff --git a/cputlb.c b/cputlb.c
> index 38f2151..3e4ccba 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -324,7 +324,12 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
>                                                     + xlat)) {
>              te->addr_write = address | TLB_NOTDIRTY;
>          } else {
> -            te->addr_write = address;
> +            if (!cpu_physical_memory_excl_is_dirty(section->mr->ram_addr
> +                                                   + xlat)) {
> +                te->addr_write = address | TLB_EXCL;
> +            } else {
> +                te->addr_write = address;
> +            }

I don't see that you initialize this new bitmap to all ones?  That would
generate exclusive accesses on all of memory until each unit (page?) is touched.

Perhaps you should invert the sense of your use of the dirty bitmap, such that
a set bit means that some vcpu is monitoring the unit, and no vcpu has written
to the unit?

Your nesting above would seem to fail when TLB_NOTDIRTY needs to be set as well.

> @@ -262,7 +263,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
>      uintptr_t haddr;
>      DATA_TYPE res;
>  
> -    /* Adjust the given return address.  */
> +        /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;

Careful...

> @@ -387,6 +388,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      uintptr_t haddr;
> +    bool to_excl_mem = false;
>  
>      /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;
> @@ -406,6 +408,14 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      }
>  
> +    if (unlikely(tlb_addr & TLB_EXCL &&
> +        !(tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)))) {

This is (tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL, but...

> +        /* The slow-path has been forced since we are reading a page used for a
> +         * load-link operation. */
> +        to_excl_mem = true;
> +        goto skip_io;

I'm not fond of either the boolean or the goto.

> +    }
> +
>      /* Handle an IO access.  */
>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {

Nesting your unlikely code under one unlikely test seems better.  And would be
more likely to work in the case you need to handle both TLB_NOTDIRTY and TLB_EXCL.


r~


* Re: [Qemu-devel] [RFC 3/5] softmmu: Add helpers for a new slow-path
  2015-05-06 15:38 ` [Qemu-devel] [RFC 3/5] softmmu: Add helpers for a new slow-path Alvise Rigo
@ 2015-05-07 17:56   ` Richard Henderson
  2015-05-11  8:07     ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2015-05-07 17:56 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

On 05/06/2015 08:38 AM, Alvise Rigo wrote:
> +#define DATA_SIZE (1 << SHIFT)
> +
> +#if DATA_SIZE == 8
> +#define SUFFIX q
> +#define LSUFFIX q
> +#define SDATA_TYPE  int64_t
> +#define DATA_TYPE  uint64_t

Duplicating all of the stuff from softmmu_template.h is Just Wrong.

> +/* For the benefit of TCG generated code, we want to avoid the complication
> +   of ABI-specific return type promotion and always return a value extended
> +   to the register size of the host.  This is tcg_target_long, except in the
> +   case of a 32-bit host and 64-bit data, and for that we always have
> +   uint64_t.  Don't bother with this widened value for SOFTMMU_CODE_ACCESS.  */

...

> +uint32_t helper_le_stcond_name(CPUArchState *env, target_ulong addr,
> +                               DATA_TYPE val, int mmu_idx, uintptr_t retaddr)

You didn't even read the comment above re the return type.

> +        /* Another vCPU has accessed the memory after the LoadLink, or no
> +         * link has been previously set. */
> +        ret = 1;
> +        goto out;
> +    }
> +
> +    helper_st_legacy(env, addr, val, mmu_idx, retaddr);
> +
> +    /* Set the page as dirty to avoid the creation of TLB entries with the
> +     * TLB_EXCL bit set. */
> +    cpu_physical_memory_set_excl_dirty(ram_addr);
> +
> +    /* The StoreConditional succeeded */
> +    ret = 0;

How did you come up with 0 == success, 1 = failure?   Is this an arm-ism?


r~


* Re: [Qemu-devel] [RFC 4/5] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
  2015-05-06 15:38 ` [Qemu-devel] [RFC 4/5] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions Alvise Rigo
@ 2015-05-07 17:58   ` Richard Henderson
  2015-05-11  8:12     ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2015-05-07 17:58 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana

On 05/06/2015 08:38 AM, Alvise Rigo wrote:
> +/* An output operand to return the StoreConditional result */
> +static void gen_stcond_i32(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i32 val,
> +                         TCGv addr, TCGMemOp memop, TCGArg idx)
> +{
> +    tcg_gen_op5ii_i32(opc, is_dirty, val, addr, memop, idx);
> +}

This is the wrong way to go about this.  I think you should merely add an EXCL
bit to TCGMemOp, and add no new opcodes at all.


r~


* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
                   ` (6 preceding siblings ...)
  2015-05-06 15:55 ` Mark Burton
@ 2015-05-08 15:22 ` Alex Bennée
  2015-05-11  9:08   ` alvise rigo
  2015-05-08 18:29 ` Emilio G. Cota
  8 siblings, 1 reply; 27+ messages in thread
From: Alex Bennée @ 2015-05-08 15:22 UTC (permalink / raw)
  To: Alvise Rigo; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana, qemu-devel


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> This patch series provides an infrastructure for atomic
> instruction implementation in QEMU, paving the way for TCG multi-threading.
> The adopted design does not rely on host atomic
> instructions and is intended to propose a 'legacy' solution for
> translating guest atomic instructions.

Thanks for posting this. 

> The underlying idea is to provide new TCG instructions that guarantee
> atomicity to some memory accesses or in general a way to define memory
> transactions. More specifically, a new pair of TCG instructions are
> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
> LoadLink and StoreConditional primitives (only 32 bit variant
> implemented).  In order to achieve this, a new bitmap is added to the
> ram_list structure (always unique) which flags all memory pages that
> could not be accessed directly through the fast-path, due to previous
> exclusive operations. This new bitmap is coupled with a new TLB flag
> which forces the slow-path execution. Any store performed by another
> vCPU to the same memory page between an LL and its SC will make the
> subsequent StoreConditional fail.

Do you have any figures for contention with these page aligned exclusive
locks? On ARMv8 the global monitor reservation granule is IMDEF but
generally smaller than the page size. If a large number of exclusively
accessed variables end up in the same page there could be a lot of
failed exclusive ops compare to the real world.

>
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.
>
> The new slow-path is implemented such that:
> - the LoadLink behaves as a normal load slow-path, except for cleaning
>   the dirty flag in the bitmap. The TLB entries created from now on will
>   force the slow-path. To ensure it, we flush the TLB cache for the
>   other vCPUs
> - the StoreConditional behaves as a normal store slow-path, except for
>   checking the state of the dirty bitmap and returning 0 or 1 depending
>   on whether the StoreConditional succeeded (0 when no other vCPU has
>   touched the same memory in the meantime).
>
> All those write accesses that are forced to follow the 'legacy'
> slow-path will set the accessed memory page to dirty.
>
> In this series only the ARM ldrex/strex instructions are implemented.
> The code was tested with bare-metal test cases and with Linux, using
> upstream QEMU.

Have you developed any specific test cases to exercise the logic?

>
> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
>
> Alvise Rigo (5):
>   exec: Add new exclusive bitmap to ram_list
>   Add new TLB_EXCL flag
>   softmmu: Add helpers for a new slow-path
>   tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>   target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>
>  cputlb.c                |  11 ++-
>  include/exec/cpu-all.h  |   1 +
>  include/exec/cpu-defs.h |   2 +
>  include/exec/memory.h   |   3 +-
>  include/exec/ram_addr.h |  19 +++-
>  softmmu_llsc_template.h | 233 ++++++++++++++++++++++++++++++++++++++++++++++++
>  softmmu_template.h      |  52 ++++++++++-
>  target-arm/translate.c  |  94 ++++++++++++++++++-
>  tcg/arm/tcg-target.c    | 105 ++++++++++++++++------
>  tcg/tcg-be-ldst.h       |   2 +
>  tcg/tcg-op.c            |  20 +++++
>  tcg/tcg-op.h            |   3 +
>  tcg/tcg-opc.h           |   4 +
>  tcg/tcg.c               |   2 +
>  tcg/tcg.h               |  20 +++++
>  15 files changed, 538 insertions(+), 33 deletions(-)
>  create mode 100644 softmmu_llsc_template.h

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
                   ` (7 preceding siblings ...)
  2015-05-08 15:22 ` Alex Bennée
@ 2015-05-08 18:29 ` Emilio G. Cota
  2015-05-11  9:10   ` alvise rigo
  8 siblings, 1 reply; 27+ messages in thread
From: Emilio G. Cota @ 2015-05-08 18:29 UTC (permalink / raw)
  To: Alvise Rigo; +Cc: mttcg, jani.kokkonen, tech, claudio.fontana, qemu-devel

On Wed, May 06, 2015 at 17:38:02 +0200, Alvise Rigo wrote:
> This patch series provides an infrastructure for atomic
> instruction implementation in QEMU, paving the way for TCG multi-threading.
> The adopted design does not rely on host atomic
> instructions and is intended to propose a 'legacy' solution for
> translating guest atomic instructions.

Patch 2 doesn't apply to current master. I fixed the conflict manually
and get a segfault before it boots:

(gdb) bt
#0  0x0000555555657b04 in test_bit (addr=<optimized out>, nr=<optimized out>)
    at /local/home/cota/src/qemu/include/qemu/bitops.h:119
#1  cpu_physical_memory_excl_is_dirty (addr=18446744073709551615)
    at /local/home/cota/src/qemu/include/exec/ram_addr.h:214
#2  tlb_set_page_with_attrs (cpu=<optimized out>, vaddr=<optimized out>, 
    paddr=503316480, attrs=..., prot=<optimized out>, mmu_idx=3, size=1024)
    at /local/home/cota/src/qemu/cputlb.c:328
#3  0x0000555555714c68 in arm_cpu_handle_mmu_fault (cs=0x555556334500, 
    address=<optimized out>, access_type=0, mmu_idx=3)
    at /local/home/cota/src/qemu/target-arm/helper.c:5813
#4  0x00005555557077b0 in tlb_fill (cs=0x555556334500, addr=<optimized out>, 
    is_write=<optimized out>, mmu_idx=<optimized out>, retaddr=140737065132893)
    at /local/home/cota/src/qemu/target-arm/op_helper.c:69
#5  0x000055555565939f in helper_le_ldul_mmu (env=0x55555633c750, 
    addr=503316484, mmu_idx=3, retaddr=<optimized out>)
    at /local/home/cota/src/qemu/softmmu_template.h:192
#6  0x00007fffe6c623db in code_gen_buffer ()
#7  0x00005555556156ea in cpu_tb_exec (
    tb_ptr=0x7fffe6c62320 "A\213n\374\205\355\017\205\207", cpu=0x555556334500)
    at /local/home/cota/src/qemu/cpu-exec.c:199
#8  cpu_arm_exec (env=0x55555633c750)
    at /local/home/cota/src/qemu/cpu-exec.c:519
#9  0x000055555563c340 in tcg_cpu_exec (env=0x55555633c750)
    at /local/home/cota/src/qemu/cpus.c:1354
#10 tcg_exec_all () at /local/home/cota/src/qemu/cpus.c:1387
#11 qemu_tcg_cpu_thread_fn (arg=<optimized out>)
    at /local/home/cota/src/qemu/cpus.c:1032
#12 0x00007ffff40dfe9a in start_thread (arg=0x7fffe4a45700)
    at pthread_create.c:308
#13 0x00007ffff3e0d38d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()

It could be that my manual fix of the conflicts was wrong. What commit
are your patches based on?

Thanks,

		Emilio


* Re: [Qemu-devel] [RFC 2/5] Add new TLB_EXCL flag
  2015-05-07 17:25   ` Richard Henderson
@ 2015-05-11  7:47     ` alvise rigo
  0 siblings, 0 replies; 27+ messages in thread
From: alvise rigo @ 2015-05-11  7:47 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

Hi Richard,

Thank you for looking at this.
Some comments below.

On Thu, May 7, 2015 at 7:25 PM, Richard Henderson <rth@twiddle.net> wrote:
>
> On 05/06/2015 08:38 AM, Alvise Rigo wrote:
> > Add a new flag for the TLB entries to force all the accesses made to a
> > page to follow the slow-path.
> >
> > Mark the accessed page as dirty to invalidate any pending operation of
> > LL/SC.
> >
> > Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> > Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> > Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> > ---
> >  cputlb.c               |  7 ++++++-
> >  include/exec/cpu-all.h |  1 +
> >  softmmu_template.h     | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 54 insertions(+), 2 deletions(-)
> >
> > diff --git a/cputlb.c b/cputlb.c
> > index 38f2151..3e4ccba 100644
> > --- a/cputlb.c
> > +++ b/cputlb.c
> > @@ -324,7 +324,12 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
> >                                                     + xlat)) {
> >              te->addr_write = address | TLB_NOTDIRTY;
> >          } else {
> > -            te->addr_write = address;
> > +            if (!cpu_physical_memory_excl_is_dirty(section->mr->ram_addr
> > +                                                   + xlat)) {
> > +                te->addr_write = address | TLB_EXCL;
> > +            } else {
> > +                te->addr_write = address;
> > +            }
>
> I don't see that you initialize this new bitmap to all ones?  That would
> generate exclusive accesses on all of memory until each unit (page?) is touched.
>
>
> Perhaps you should invert the sense of your use of the dirty bitmap, such that
> a set bit means that some vcpu is monitoring the unit, and no vcpu has written
> to the unit?


Actually, the sense of the bitmap is already inverted (0 means dirty);
this way I didn't have to touch the code that initializes the bitmap
in exec.c.
In the next iteration I will make this more clear by naming it
differently or by initializing it with all ones.

>
>
> Your nesting above would seem to fail when TLB_NOTDIRTY needs to be set as well.
>
> > @@ -262,7 +263,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
> >      uintptr_t haddr;
> >      DATA_TYPE res;
> >
> > -    /* Adjust the given return address.  */
> > +        /* Adjust the given return address.  */
> >      retaddr -= GETPC_ADJ;
>
> Careful...

I missed this..

>
> > @@ -387,6 +388,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
> >      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> >      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> >      uintptr_t haddr;
> > +    bool to_excl_mem = false;
> >
> >      /* Adjust the given return address.  */
> >      retaddr -= GETPC_ADJ;
> > @@ -406,6 +408,14 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
> >          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> >      }
> >
> > +    if (unlikely(tlb_addr & TLB_EXCL &&
> > +        !(tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)))) {
>
> This is (tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL, but...
>
> > +        /* The slow-path has been forced since we are reading a page used for a
> > +         * load-link operation. */
> > +        to_excl_mem = true;
> > +        goto skip_io;
>
> I'm not fond of either the boolean or the goto.

In this case the goto seems to make the code a bit cleaner, but if
it's necessary I will remove them.

>
> > +    }
> > +
> >      /* Handle an IO access.  */
> >      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>
> Nesting your unlikely code under one unlikely test seems better.  And would be
> more likely to work in the case you need to handle both TLB_NOTDIRTY and TLB_EXCL.

OK, I will keep this in mind for the next version.

Thank you,
alvise

>
>
> r~


* Re: [Qemu-devel] [RFC 1/5] exec: Add new exclusive bitmap to ram_list
  2015-05-07 17:12   ` Richard Henderson
@ 2015-05-11  7:48     ` alvise rigo
  0 siblings, 0 replies; 27+ messages in thread
From: alvise rigo @ 2015-05-11  7:48 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

On Thu, May 7, 2015 at 7:12 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 05/06/2015 08:38 AM, Alvise Rigo wrote:
>> The purpose of this new bitmap is to flag the memory pages that are in
>> the middle of LL/SC operations (after a LL, before a SC).
>> For all these pages, the corresponding TLB entries will be generated
>> in such a way to force the slow-path.
>>
>> The accessors to this bitmap are currently not atomic, but they have to
>> be so in a real multi-threading TCG.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  include/exec/cpu-defs.h |  2 ++
>>  include/exec/memory.h   |  3 ++-
>>  include/exec/ram_addr.h | 19 ++++++++++++++++++-
>>  3 files changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
>> index 0ca6f0b..d12cb4c 100644
>> --- a/include/exec/cpu-defs.h
>> +++ b/include/exec/cpu-defs.h
>> @@ -123,5 +123,7 @@ QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
>>  #define CPU_COMMON                                                      \
>>      /* soft mmu support */                                              \
>>      CPU_COMMON_TLB                                                      \
>> +    /* true if in the middle of a LoadLink/StoreConditional */          \
>> +    bool ll_sc_context;                                                 \
>
> Belongs to a different patch?

Yes, it shouldn't be here.

Thank you,
alvise

>
>
> r~


* Re: [Qemu-devel] [RFC 3/5] softmmu: Add helpers for a new slow-path
  2015-05-07 17:56   ` Richard Henderson
@ 2015-05-11  8:07     ` alvise rigo
  0 siblings, 0 replies; 27+ messages in thread
From: alvise rigo @ 2015-05-11  8:07 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

On Thu, May 7, 2015 at 7:56 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 05/06/2015 08:38 AM, Alvise Rigo wrote:
>> +#define DATA_SIZE (1 << SHIFT)
>> +
>> +#if DATA_SIZE == 8
>> +#define SUFFIX q
>> +#define LSUFFIX q
>> +#define SDATA_TYPE  int64_t
>> +#define DATA_TYPE  uint64_t
>
> Duplicating all of the stuff from softmmu_template.h is Just Wrong.
>
>> +/* For the benefit of TCG generated code, we want to avoid the complication
>> +   of ABI-specific return type promotion and always return a value extended
>> +   to the register size of the host.  This is tcg_target_long, except in the
>> +   case of a 32-bit host and 64-bit data, and for that we always have
>> +   uint64_t.  Don't bother with this widened value for SOFTMMU_CODE_ACCESS.  */
>
> ...
>
>> +uint32_t helper_le_stcond_name(CPUArchState *env, target_ulong addr,
>> +                               DATA_TYPE val, int mmu_idx, uintptr_t retaddr)
>
> You didn't even read the comment above re the return type.

This part has to be reworked quite a bit. Thank you for pointing this out.

>
>> +        /* Another vCPU has accessed the memory after the LoadLink, or no
>> +         * link has been previously set. */
>> +        ret = 1;
>> +        goto out;
>> +    }
>> +
>> +    helper_st_legacy(env, addr, val, mmu_idx, retaddr);
>> +
>> +    /* Set the page as dirty to avoid the creation of TLB entries with the
>> +     * TLB_EXCL bit set. */
>> +    cpu_physical_memory_set_excl_dirty(ram_addr);
>> +
>> +    /* The StoreConditional succeeded */
>> +    ret = 0;
>
> How did you come up with 0 == success, 1 = failure?   Is this an arm-ism?

Yes, but it made sense to me to keep it that way.
I will add a comment on top of the method explaining the return values.
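As a hedged illustration, the return convention to be documented might look like the sketch below: following ARM's STREX, 0 means the store succeeded and 1 means it failed. All names, and the single excl-dirty flag standing in for the RFC's bitmap, are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* true when the location has been written since the last LoadLink
 * (or was never linked), so a StoreConditional must fail. */
static bool excl_dirty = true;
static uint32_t memory_word;   /* the guest location being updated */

static void helper_ldlink_sketch(uint32_t *out)
{
    *out = memory_word;
    excl_dirty = false;        /* start monitoring */
}

/* Returns 0 on success, 1 on failure (ARM STREX convention). */
static uintptr_t helper_stcond_sketch(uint32_t val)
{
    if (excl_dirty) {
        /* Someone touched the memory after the LoadLink, or no link
         * was ever established. */
        return 1;
    }
    memory_word = val;
    excl_dirty = true;         /* the reservation is consumed */
    return 0;
}
```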

Thank you,
alvise

>
>
> r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC 4/5] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
  2015-05-07 17:58   ` Richard Henderson
@ 2015-05-11  8:12     ` alvise rigo
  0 siblings, 0 replies; 27+ messages in thread
From: alvise rigo @ 2015-05-11  8:12 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

On Thu, May 7, 2015 at 7:58 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 05/06/2015 08:38 AM, Alvise Rigo wrote:
>> +/* An output operand to return the StoreConditional result */
>> +static void gen_stcond_i32(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i32 val,
>> +                         TCGv addr, TCGMemOp memop, TCGArg idx)
>> +{
>> +    tcg_gen_op5ii_i32(opc, is_dirty, val, addr, memop, idx);
>> +}
>
> This is the wrong way to go about this.  I think you should merely add an EXCL
> bit to TCGMemOp, and add no new opcodes at all.

You are right, there is no need for a new opcode here.

Thank you,
alvise

>
>
> r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-08 15:22 ` Alex Bennée
@ 2015-05-11  9:08   ` alvise rigo
  0 siblings, 0 replies; 27+ messages in thread
From: alvise rigo @ 2015-05-11  9:08 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

Hi,

On Fri, May 8, 2015 at 5:22 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> This patch series provides an infrastructure for atomic
>> instruction implementation in QEMU, paving the way for TCG multi-threading.
>> The adopted design does not rely on host atomic
>> instructions and is intended to propose a 'legacy' solution for
>> translating guest atomic instructions.
>
> Thanks for posting this.
>
>> The underlying idea is to provide new TCG instructions that guarantee
>> atomicity to some memory accesses or in general a way to define memory
>> transactions. More specifically, a new pair of TCG instructions are
>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>> LoadLink and StoreConditional primitives (only 32 bit variant
>> implemented).  In order to achieve this, a new bitmap is added to the
>> ram_list structure (always unique) which flags all memory pages that
>> could not be accessed directly through the fast-path, due to previous
>> exclusive operations. This new bitmap is coupled with a new TLB flag
>> which forces the slow-path execution. All stores performed by other
>> vCPUs in the same memory page between an LL/SC pair will make the
>> subsequent StoreConditional fail.
>
> Do you have any figures for contention with these page aligned exclusive
> locks? On ARMv8 the global monitor reservation granule is IMDEF but
> generally smaller than the page size. If a large number of exclusively
> accessed variables end up in the same page there could be a lot of
> failed exclusive ops compare to the real world.

The reservation granule is indeed much bigger here than in an average
ARM implementation.
However, this could be improved somewhat by tracking one exact 'linked'
address per memory page.
In this way, all accesses made to the page would still follow the
slow-path, but only those writing to the linked address would make the
page dirty.
In essence, the following limitations would remain:
- the slow-path is forced at page granularity
- only one linked address per page (which might not match the real
  hardware behaviour)
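The contention difference between the two granularities can be sketched as below. All names are hypothetical: with a page-granule monitor, any store into the reserved page breaks the link and the SC fails spuriously, while tracking one exact linked address per page lets unrelated stores in the same page through.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_MASK (~((uintptr_t)4095))

struct excl_state {
    bool      linked;
    uintptr_t linked_addr;  /* only consulted by the exact-address scheme */
    bool      per_addr;     /* false: page granule; true: exact address */
};

static void ll(struct excl_state *s, uintptr_t addr)
{
    s->linked = true;
    s->linked_addr = addr;
}

/* A store by another vCPU: may or may not break the reservation. */
static void remote_store(struct excl_state *s, uintptr_t addr)
{
    if (!s->linked) {
        return;
    }
    if (s->per_addr) {
        if (addr == s->linked_addr) {
            s->linked = false;
        }
    } else if ((addr & PAGE_MASK) == (s->linked_addr & PAGE_MASK)) {
        s->linked = false;
    }
}

/* Returns 0 on success, 1 on failure (the ARM convention the RFC uses). */
static int sc(struct excl_state *s)
{
    int ret = s->linked ? 0 : 1;
    s->linked = false;
    return ret;
}
```

A store to a different word of the same page fails the page-granule SC but not the exact-address one, which is precisely the false-failure rate Alex is asking about.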

>
>>
>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>> can be used to properly handle atomic instructions on any architecture.
>>
>> The new slow-path is implemented such that:
>> - the LoadLink behaves as a normal load slow-path, except for cleaning
>>   the dirty flag in the bitmap. The TLB entries created from now on will
>>   force the slow-path. To ensure this, we flush the TLB cache of the
>>   other vCPUs
>> - the StoreConditional behaves as a normal store slow-path, except for
>>   checking the state of the dirty bitmap and returning 0 or 1 depending
>>   on whether the StoreConditional succeeded (0 when no other vCPU has
>>   touched the same memory in the meantime).
>>
>> All those write accesses that are forced to follow the 'legacy'
>> slow-path will set the accessed memory page to dirty.
>>
>> In this series only the ARM ldrex/strex instructions are implemented.
>> The code was tested with bare-metal test cases and with Linux, using
>> upstream QEMU.
>
> Have you developed any specific test cases to exercise the logic?

Yes, some trivial bare-metal tests where I was writing to the linked
address in the middle of an LL/SC sequence.
I think the next mandatory test is to run it with real multi-threading
to exercise the logic even more.

Thank you,
alvise

>
>>
>> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
>>
>> Alvise Rigo (5):
>>   exec: Add new exclusive bitmap to ram_list
>>   Add new TLB_EXCL flag
>>   softmmu: Add helpers for a new slow-path
>>   tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>   target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>>
>>  cputlb.c                |  11 ++-
>>  include/exec/cpu-all.h  |   1 +
>>  include/exec/cpu-defs.h |   2 +
>>  include/exec/memory.h   |   3 +-
>>  include/exec/ram_addr.h |  19 +++-
>>  softmmu_llsc_template.h | 233 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  softmmu_template.h      |  52 ++++++++++-
>>  target-arm/translate.c  |  94 ++++++++++++++++++-
>>  tcg/arm/tcg-target.c    | 105 ++++++++++++++++------
>>  tcg/tcg-be-ldst.h       |   2 +
>>  tcg/tcg-op.c            |  20 +++++
>>  tcg/tcg-op.h            |   3 +
>>  tcg/tcg-opc.h           |   4 +
>>  tcg/tcg.c               |   2 +
>>  tcg/tcg.h               |  20 +++++
>>  15 files changed, 538 insertions(+), 33 deletions(-)
>>  create mode 100644 softmmu_llsc_template.h
>
> --
> Alex Bennée

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-08 18:29 ` Emilio G. Cota
@ 2015-05-11  9:10   ` alvise rigo
  2015-05-26 21:51     ` Emilio G. Cota
  0 siblings, 1 reply; 27+ messages in thread
From: alvise rigo @ 2015-05-11  9:10 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

On Fri, May 8, 2015 at 8:29 PM, Emilio G. Cota <cota@braap.org> wrote:
> On Wed, May 06, 2015 at 17:38:02 +0200, Alvise Rigo wrote:
>> This patch series provides an infrastructure for atomic
>> instruction implementation in QEMU, paving the way for TCG multi-threading.
>> The adopted design does not rely on host atomic
>> instructions and is intended to propose a 'legacy' solution for
>> translating guest atomic instructions.
>
> Patch 2 doesn't apply to current master. I fixed the conflict manually
> and get a segfault before it boots:
>
> (gdb) bt
> #0  0x0000555555657b04 in test_bit (addr=<optimized out>, nr=<optimized out>)
>     at /local/home/cota/src/qemu/include/qemu/bitops.h:119
> #1  cpu_physical_memory_excl_is_dirty (addr=18446744073709551615)
>     at /local/home/cota/src/qemu/include/exec/ram_addr.h:214
> #2  tlb_set_page_with_attrs (cpu=<optimized out>, vaddr=<optimized out>,
>     paddr=503316480, attrs=..., prot=<optimized out>, mmu_idx=3, size=1024)
>     at /local/home/cota/src/qemu/cputlb.c:328
> #3  0x0000555555714c68 in arm_cpu_handle_mmu_fault (cs=0x555556334500,
>     address=<optimized out>, access_type=0, mmu_idx=3)
>     at /local/home/cota/src/qemu/target-arm/helper.c:5813
> #4  0x00005555557077b0 in tlb_fill (cs=0x555556334500, addr=<optimized out>,
>     is_write=<optimized out>, mmu_idx=<optimized out>, retaddr=140737065132893)
>     at /local/home/cota/src/qemu/target-arm/op_helper.c:69
> #5  0x000055555565939f in helper_le_ldul_mmu (env=0x55555633c750,
>     addr=503316484, mmu_idx=3, retaddr=<optimized out>)
>     at /local/home/cota/src/qemu/softmmu_template.h:192
> #6  0x00007fffe6c623db in code_gen_buffer ()
> #7  0x00005555556156ea in cpu_tb_exec (
>     tb_ptr=0x7fffe6c62320 "A\213n\374\205\355\017\205\207", cpu=0x555556334500)
>     at /local/home/cota/src/qemu/cpu-exec.c:199
> #8  cpu_arm_exec (env=0x55555633c750)
>     at /local/home/cota/src/qemu/cpu-exec.c:519
> #9  0x000055555563c340 in tcg_cpu_exec (env=0x55555633c750)
>     at /local/home/cota/src/qemu/cpus.c:1354
> #10 tcg_exec_all () at /local/home/cota/src/qemu/cpus.c:1387
> #11 qemu_tcg_cpu_thread_fn (arg=<optimized out>)
>     at /local/home/cota/src/qemu/cpus.c:1032
> #12 0x00007ffff40dfe9a in start_thread (arg=0x7fffe4a45700)
>     at pthread_create.c:308
> #13 0x00007ffff3e0d38d in clone ()
>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #14 0x0000000000000000 in ?? ()
>
> It could be that my manual fix of the conflicts was wrong. What commit
> are your patches based on?

Hi,

the last commit was b8df9208f357d2b36e1b19634aea973618dc7ba8.

Regards,
alvise

>
> Thanks,
>
>                 Emilio

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-11  9:10   ` alvise rigo
@ 2015-05-26 21:51     ` Emilio G. Cota
  2015-05-27  7:20       ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Emilio G. Cota @ 2015-05-26 21:51 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

On Mon, May 11, 2015 at 11:10:05 +0200, alvise rigo wrote:
> the last commit was b8df9208f357d2b36e1b19634aea973618dc7ba8.

Thanks.

Unfortunately a segfault still happens very early:

$ gdb arm-softmmu/qemu-system-arm
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /local/home/cota/src/qemu/arm-softmmu/qemu-system-arm...done.
(gdb) set args  -M vexpress-a9 -kernel img/arm/zImage -drive file=img/arm/rootfs.ext2,if=sd -append "console=ttyAMA0,115200 root=/dev/mmcblk0" -net nic,model=lan9118 -net user -nographic -smp 1
(gdb) r
Starting program: /local/home/cota/src/qemu/arm-softmmu/qemu-system-arm -M vexpress-a9 -kernel img/arm/zImage -drive file=img/arm/rootfs.ext2,if=sd -append "console=ttyAMA0,115200 root=/dev/mmcblk0" -net nic,model=lan9118 -net user -nographic -smp 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe9447700 (LWP 4309)]
[New Thread 0x7fffe5246700 (LWP 4310)]
WARNING: Image format was not specified for 'img/arm/rootfs.ext2' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
[New Thread 0x7fffe4a45700 (LWP 4311)]
audio: Could not init `oss' audio driver

Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x7fffe4a45700 (LWP 4311)]
pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
162	../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000555555655c34 in test_bit (addr=<optimized out>, nr=<optimized out>)
    at /local/home/cota/src/qemu/include/qemu/bitops.h:119
119		return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
(gdb) bt
#0  0x0000555555655c34 in test_bit (addr=<optimized out>, nr=<optimized out>)
    at /local/home/cota/src/qemu/include/qemu/bitops.h:119
#1  cpu_physical_memory_excl_is_dirty (addr=18446744073709551615)
    at /local/home/cota/src/qemu/include/exec/ram_addr.h:214
#2  tlb_set_page (cpu=<optimized out>, vaddr=<optimized out>, paddr=503316480, 
    prot=<optimized out>, mmu_idx=3, size=<optimized out>)
    at /local/home/cota/src/qemu/cputlb.c:327
#3  0x0000555555712091 in arm_cpu_handle_mmu_fault (cs=0x55555632c4e0, 
    address=<optimized out>, access_type=0, mmu_idx=3)
    at /local/home/cota/src/qemu/target-arm/helper.c:5726
#4  0x0000555555704f70 in tlb_fill (cs=0x55555632c4e0, addr=<optimized out>, 
    is_write=<optimized out>, mmu_idx=<optimized out>, retaddr=140737065132893)
    at /local/home/cota/src/qemu/target-arm/op_helper.c:69
#5  0x000055555565733f in helper_le_ldul_mmu (env=0x555556334730, 
    addr=503316484, mmu_idx=3, retaddr=<optimized out>)
    at /local/home/cota/src/qemu/softmmu_template.h:190
#6  0x00007fffe6c623db in code_gen_buffer ()
#7  0x00005555556148ba in cpu_tb_exec (
    tb_ptr=0x7fffe6c62320 "A\213n\374\205\355\017\205\207", cpu=0x55555632c4e0)
    at /local/home/cota/src/qemu/cpu-exec.c:199
#8  cpu_arm_exec (env=0x555556334730)
    at /local/home/cota/src/qemu/cpu-exec.c:519
#9  0x000055555563a880 in tcg_cpu_exec (env=0x555556334730)
    at /local/home/cota/src/qemu/cpus.c:1354
#10 tcg_exec_all () at /local/home/cota/src/qemu/cpus.c:1387
#11 qemu_tcg_cpu_thread_fn (arg=<optimized out>)
    at /local/home/cota/src/qemu/cpus.c:1032
#12 0x00007ffff40dfe9a in start_thread (arg=0x7fffe4a45700)
    at pthread_create.c:308
#13 0x00007ffff3e0d38d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()

		Emilio

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-26 21:51     ` Emilio G. Cota
@ 2015-05-27  7:20       ` alvise rigo
  2015-05-27  8:51         ` Alex Bennée
  0 siblings, 1 reply; 27+ messages in thread
From: alvise rigo @ 2015-05-27  7:20 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Claudio Fontana, QEMU Developers

I'm going to respin these patches soon; I've found some issues that
I'm addressing now.

Thank you for your feedback,
alvise

On Tue, May 26, 2015 at 11:51 PM, Emilio G. Cota <cota@braap.org> wrote:
> On Mon, May 11, 2015 at 11:10:05 +0200, alvise rigo wrote:
>> the last commit was b8df9208f357d2b36e1b19634aea973618dc7ba8.
>
> Thanks.
>
> Unfortunately a segfault still happens very early:
>
> $ gdb arm-softmmu/qemu-system-arm
> GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
> Copyright (C) 2012 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://bugs.launchpad.net/gdb-linaro/>...
> Reading symbols from /local/home/cota/src/qemu/arm-softmmu/qemu-system-arm...done.
> (gdb) set args  -M vexpress-a9 -kernel img/arm/zImage -drive file=img/arm/rootfs.ext2,if=sd -append "console=ttyAMA0,115200 root=/dev/mmcblk0" -net nic,model=lan9118 -net user -nographic -smp 1
> (gdb) r
> Starting program: /local/home/cota/src/qemu/arm-softmmu/qemu-system-arm -M vexpress-a9 -kernel img/arm/zImage -drive file=img/arm/rootfs.ext2,if=sd -append "console=ttyAMA0,115200 root=/dev/mmcblk0" -net nic,model=lan9118 -net user -nographic -smp 1
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x7fffe9447700 (LWP 4309)]
> [New Thread 0x7fffe5246700 (LWP 4310)]
> WARNING: Image format was not specified for 'img/arm/rootfs.ext2' and probing guessed raw.
>          Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
>          Specify the 'raw' format explicitly to remove the restrictions.
> [New Thread 0x7fffe4a45700 (LWP 4311)]
> audio: Could not init `oss' audio driver
>
> Program received signal SIGUSR1, User defined signal 1.
> [Switching to Thread 0x7fffe4a45700 (LWP 4311)]
> pthread_cond_wait@@GLIBC_2.3.2 ()
>     at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> 162     ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
> (gdb) cont
> Continuing.
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000555555655c34 in test_bit (addr=<optimized out>, nr=<optimized out>)
>     at /local/home/cota/src/qemu/include/qemu/bitops.h:119
> 119             return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
> (gdb) bt
> #0  0x0000555555655c34 in test_bit (addr=<optimized out>, nr=<optimized out>)
>     at /local/home/cota/src/qemu/include/qemu/bitops.h:119
> #1  cpu_physical_memory_excl_is_dirty (addr=18446744073709551615)
>     at /local/home/cota/src/qemu/include/exec/ram_addr.h:214
> #2  tlb_set_page (cpu=<optimized out>, vaddr=<optimized out>, paddr=503316480,
>     prot=<optimized out>, mmu_idx=3, size=<optimized out>)
>     at /local/home/cota/src/qemu/cputlb.c:327
> #3  0x0000555555712091 in arm_cpu_handle_mmu_fault (cs=0x55555632c4e0,
>     address=<optimized out>, access_type=0, mmu_idx=3)
>     at /local/home/cota/src/qemu/target-arm/helper.c:5726
> #4  0x0000555555704f70 in tlb_fill (cs=0x55555632c4e0, addr=<optimized out>,
>     is_write=<optimized out>, mmu_idx=<optimized out>, retaddr=140737065132893)
>     at /local/home/cota/src/qemu/target-arm/op_helper.c:69
> #5  0x000055555565733f in helper_le_ldul_mmu (env=0x555556334730,
>     addr=503316484, mmu_idx=3, retaddr=<optimized out>)
>     at /local/home/cota/src/qemu/softmmu_template.h:190
> #6  0x00007fffe6c623db in code_gen_buffer ()
> #7  0x00005555556148ba in cpu_tb_exec (
>     tb_ptr=0x7fffe6c62320 "A\213n\374\205\355\017\205\207", cpu=0x55555632c4e0)
>     at /local/home/cota/src/qemu/cpu-exec.c:199
> #8  cpu_arm_exec (env=0x555556334730)
>     at /local/home/cota/src/qemu/cpu-exec.c:519
> #9  0x000055555563a880 in tcg_cpu_exec (env=0x555556334730)
>     at /local/home/cota/src/qemu/cpus.c:1354
> #10 tcg_exec_all () at /local/home/cota/src/qemu/cpus.c:1387
> #11 qemu_tcg_cpu_thread_fn (arg=<optimized out>)
>     at /local/home/cota/src/qemu/cpus.c:1032
> #12 0x00007ffff40dfe9a in start_thread (arg=0x7fffe4a45700)
>     at pthread_create.c:308
> #13 0x00007ffff3e0d38d in clone ()
>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #14 0x0000000000000000 in ?? ()
>
>                 Emilio

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
  2015-05-27  7:20       ` alvise rigo
@ 2015-05-27  8:51         ` Alex Bennée
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Bennée @ 2015-05-27  8:51 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Emilio G. Cota,
	Jani Kokkonen, VirtualOpenSystems Technical Team


alvise rigo <a.rigo@virtualopensystems.com> writes:

> I'm going to respin these patches soon, I've found some issues that
> I'm addressing now.

Thanks, please feel free to add me to your CC list.


-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-05-27  8:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-06 15:38 [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Alvise Rigo
2015-05-06 15:38 ` [Qemu-devel] [RFC 1/5] exec: Add new exclusive bitmap to ram_list Alvise Rigo
2015-05-07 17:12   ` Richard Henderson
2015-05-11  7:48     ` alvise rigo
2015-05-06 15:38 ` [Qemu-devel] [RFC 2/5] Add new TLB_EXCL flag Alvise Rigo
2015-05-07 17:25   ` Richard Henderson
2015-05-11  7:47     ` alvise rigo
2015-05-06 15:38 ` [Qemu-devel] [RFC 3/5] softmmu: Add helpers for a new slow-path Alvise Rigo
2015-05-07 17:56   ` Richard Henderson
2015-05-11  8:07     ` alvise rigo
2015-05-06 15:38 ` [Qemu-devel] [RFC 4/5] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions Alvise Rigo
2015-05-07 17:58   ` Richard Henderson
2015-05-11  8:12     ` alvise rigo
2015-05-06 15:38 ` [Qemu-devel] [RFC 5/5] target-arm: translate: implement qemu_ldlink and qemu_stcond ops Alvise Rigo
2015-05-06 15:51 ` [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation Paolo Bonzini
2015-05-06 16:00   ` Mark Burton
2015-05-06 16:21     ` alvise rigo
2015-05-06 15:55 ` Mark Burton
2015-05-06 16:19   ` alvise rigo
2015-05-06 16:20     ` Mark Burton
2015-05-08 15:22 ` Alex Bennée
2015-05-11  9:08   ` alvise rigo
2015-05-08 18:29 ` Emilio G. Cota
2015-05-11  9:10   ` alvise rigo
2015-05-26 21:51     ` Emilio G. Cota
2015-05-27  7:20       ` alvise rigo
2015-05-27  8:51         ` Alex Bennée

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.