* [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
@ 2015-07-10  8:23 Alvise Rigo
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

This is the third iteration of the patch series; the changes that move the
whole work to multi-threading start at PATCH 007.
Changes versus previous versions are listed at the bottom of this cover letter.

This patch series provides an infrastructure for implementing atomic
instructions in QEMU, paving the way for TCG multi-threading.
The adopted design does not rely on host atomic instructions and
proposes a 'legacy' solution for translating guest atomic instructions.

The underlying idea is to provide new TCG instructions that guarantee
atomicity for certain memory accesses, or more generally a way to define
memory transactions. More specifically, a new pair of TCG instructions is
implemented, qemu_ldlink_i32 and qemu_stcond_i32, which behave as
LoadLink and StoreConditional primitives (only the 32-bit variant is
implemented).  In order to achieve this, a new bitmap is added to the
ram_list structure (always unique) which flags all memory pages that
cannot be accessed directly through the fast-path, due to previous
exclusive operations. This new bitmap is coupled with a new TLB flag
which forces the slow-path execution. Any store performed by another vCPU
to the same (protected) address between the LL and the SC will make the
subsequent StoreConditional fail.
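
As a sketch of how a front end is expected to use the new pair (patch 05
does essentially this for ARM ldrex/strex); 'val', 'is_dirty', 'addr' and
'mem_idx' are placeholder TCG values, not names from the series:

    /* LoadLink: load the value and mark the page as exclusive-protected */
    tcg_gen_qemu_ldlink_i32(val, addr, mem_idx, MO_TEUL | MO_EXCL);
    /* ... guest computation on 'val' ... */
    /* StoreConditional: 'is_dirty' is set to 1 if the store must fail */
    tcg_gen_qemu_stcond_i32(is_dirty, val, addr, mem_idx, MO_TEUL | MO_EXCL);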

In theory, the provided implementation of TCG LoadLink/StoreConditional
can be used to properly handle atomic instructions on any architecture.

The new slow-path is implemented such that:
- the LoadLink behaves as a normal load slow-path, except that it clears
  the dirty flag in the bitmap. The TLB entries created from now on will
  force the slow-path. To ensure this, we flush the TLB cache of the
  other vCPUs. The vCPU also records the accessed address in a private
  variable, in order to make it visible to the other vCPUs
- the StoreConditional behaves as a normal store slow-path, except that it
  checks whether other vCPUs have set the same exclusive address

All write accesses that are forced to follow the 'legacy' slow-path set
the accessed memory page back to dirty.
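
Condensed into pseudo-C, the LoadLink side of the slow path described
above looks roughly like the following; 'legacy_load_helper',
'flush_tlb_of_other_vcpus' and 'hw_addr' are illustrative stand-ins for
the code introduced later in the series:

    ret = legacy_load_helper(env, addr);               /* normal slow-path load  */
    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {  /* first LL on this page  */
        cpu_physical_memory_clear_excl_dirty(hw_addr); /* page becomes exclusive */
        flush_tlb_of_other_vcpus();                    /* force their slow path  */
    }
    env->excl_protected_hwaddr = hw_addr;              /* visible to other vCPUs */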

In this series only the ARM ldrex/strex instructions are implemented,
for ARM and i386 hosts.
The code has been tested with bare-metal test cases and by booting Linux,
using the latest mttcg QEMU branch available at
http://git.greensocs.com/fkonrad/mttcg.git.

* Performance considerations
This implementation shows good results while booting a Linux kernel,
a workload where the many TLB flushes affect the overall performance. A
complete ARM Linux boot, without any filesystem, takes 30% longer than
the mttcg implementation, but in exchange offers an infrastructure able
to handle atomic instructions on any architecture.
Compared to the current TCG upstream, it is instead 40% faster with four
vCPUs and 2.1 times faster with eight vCPUs.
There is still margin to improve this performance, since at the moment
the TLB is flushed quite often, probably more than required.

On the other hand, the test case
https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git
(which heavily stresses the LL/SC mechanism but not so much the TLB-related
part) performs up to 1.9 times faster with 8 cores and one million
iterations when compared with the mttcg implementation.

Changes from v2:
- the bitmap accessors are now atomic
- a rendezvous between vCPUs and simple callback support before executing
  a TB have been added to handle the TLB flushes
- softmmu_template and softmmu_llsc_template have been adapted to work
  with real multi-threading

Changes from v1:
- The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
- The way the offset used to access the bitmap is calculated has
  been improved and fixed
- Setting a page as dirty now requires a vCPU to write to the protected
  address itself, not just to any address in the same page
- Addressed comments from Richard Henderson to improve the logic in
  softmmu_template.h and to simplify the methods generation through
  softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386

This work has been sponsored by Huawei Technologies Duesseldorf GmbH.

Alvise Rigo (13):
  exec: Add new exclusive bitmap to ram_list
  cputlb: Add new TLB_EXCL flag
  softmmu: Add helpers for a new slow-path
  tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
  target-arm: translate: implement qemu_ldlink and qemu_stcond ops
  target-i386: translate: implement qemu_ldlink and qemu_stcond ops
  ram_addr.h: Make exclusive bitmap accessors atomic
  exec.c: introduce a simple rendezvous support
  cpus.c: introduce simple callback support
  Simple TLB flush wrap to use as exit callback
  Introduce exit_flush_req and tcg_excl_access_lock
  softmmu_llsc_template.h: move to multithreading
  softmmu_template.h: move to multithreading

 cpus.c                  |  39 ++++++++
 cputlb.c                |  33 +++++-
 exec.c                  |  46 +++++++++
 include/exec/cpu-all.h  |   2 +
 include/exec/cpu-defs.h |   8 ++
 include/exec/memory.h   |   3 +-
 include/exec/ram_addr.h |  22 ++++
 include/qom/cpu.h       |  37 +++++++
 softmmu_llsc_template.h | 184 ++++++++++++++++++++++++++++++++++
 softmmu_template.h      | 261 +++++++++++++++++++++++++++++++++++-------------
 target-arm/translate.c  |  87 +++++++++++++++-
 tcg/arm/tcg-target.c    | 121 ++++++++++++++++------
 tcg/i386/tcg-target.c   | 136 +++++++++++++++++++++----
 tcg/tcg-be-ldst.h       |   1 +
 tcg/tcg-op.c            |  23 +++++
 tcg/tcg-op.h            |   3 +
 tcg/tcg-opc.h           |   4 +
 tcg/tcg.c               |   2 +
 tcg/tcg.h               |  20 ++++
 19 files changed, 910 insertions(+), 122 deletions(-)
 create mode 100644 softmmu_llsc_template.h

-- 
2.4.5


* [Qemu-devel] [RFC v3 01/13] exec: Add new exclusive bitmap to ram_list
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

The purpose of this new bitmap is to flag the memory pages that are in
the middle of LL/SC operations (after a LL, before a SC).
For all these pages, the corresponding TLB entries will be generated
in such a way as to force the slow-path.
When the system starts, the whole memory is dirty (the whole bitmap is
set). A page, after being marked as exclusively-clean, will *not* be
restored as dirty after the SC; the cputlb code will take care of that,
lazily setting the page as dirty when the TLB EXCL entry is about to be
overwritten.

The accessors to this bitmap are currently not atomic, but they will have
to be in a real multi-threaded TCG.
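
Purely as an illustration of the convention above (1 = dirty, 0 =
exclusive), a page goes through the following transitions using the
accessors added below:

    cpu_physical_memory_set_excl_dirty(addr);    /* boot: every page is dirty   */
    cpu_physical_memory_clear_excl_dirty(addr);  /* LoadLink: page -> exclusive */
    /* the SC does not set the bit back; cputlb restores it lazily when the
     * TLB_EXCL entry covering the page is about to be overwritten */
    cpu_physical_memory_set_excl_dirty(addr);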

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 include/exec/memory.h   |  3 ++-
 include/exec/ram_addr.h | 22 ++++++++++++++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 8ae004e..5ad6f20 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@
 #define DIRTY_MEMORY_VGA       0
 #define DIRTY_MEMORY_CODE      1
 #define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM       4        /* num of dirty bits */
 
 #include <stdint.h>
 #include <stdbool.h>
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index c113f21..2766541 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -135,6 +135,9 @@ static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start,
     if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
         bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
     }
+    if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
+        bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE], page, end - page);
+    }
     xen_modified_memory(start, length);
 }
 
@@ -249,5 +252,24 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
     return num_dirty;
 }
 
+/* Exclusive bitmap accessors. */
+static inline void cpu_physical_memory_set_excl_dirty(ram_addr_t addr)
+{
+    set_bit(addr >> TARGET_PAGE_BITS,
+            ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr)
+{
+    return test_bit(addr >> TARGET_PAGE_BITS,
+                    ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+static inline void cpu_physical_memory_clear_excl_dirty(ram_addr_t addr)
+{
+    clear_bit(addr >> TARGET_PAGE_BITS,
+              ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
 #endif
 #endif
-- 
2.4.5


* [Qemu-devel] [RFC v3 02/13] cputlb: Add new TLB_EXCL flag
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Add a new flag for the TLB entries to force all the accesses made to a
page to follow the slow-path.

When we remove a TLB entry marked as EXCL, we unset the corresponding
exclusive bit in the bitmap.

Mark the accessed page as dirty, invalidating any pending LL/SC
operation, only when a vCPU writes to the protected address.
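
Condensed from the softmmu_template.h hunk below, the check performed when
a write hits a TLB_EXCL entry boils down to:

    set_to_dirty  = lookup_cpus_ll_addr(hw_addr);            /* another vCPU's LL addr */
    set_to_dirty |= env->ll_sc_context &&
                    (env->excl_protected_hwaddr == hw_addr); /* own LL, plain store    */
    if (set_to_dirty) {
        cpu_physical_memory_set_excl_dirty(hw_addr);         /* pending SC will fail   */
    }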

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                |  18 ++++-
 include/exec/cpu-all.h  |   2 +
 include/exec/cpu-defs.h |   4 +
 softmmu_template.h      | 189 +++++++++++++++++++++++++++++++-----------------
 4 files changed, 144 insertions(+), 69 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index e5853fd..0aca407 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -380,6 +380,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
     env->tlb_v_table[mmu_idx][vidx] = *te;
     env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
 
+    if (!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL)) {
+        /* We are removing an exclusive entry: if the page is still flagged
+         * as exclusive (i.e. not dirty), restore its dirty bit. */
+        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
+                                          (te->addr_write & TARGET_PAGE_MASK);
+        if (!cpu_physical_memory_excl_is_dirty(hw_addr)) {
+            cpu_physical_memory_set_excl_dirty(hw_addr);
+        }
+    }
+
     /* refill the tlb */
     env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
     env->iotlb[mmu_idx][index].attrs = attrs;
@@ -405,7 +415,13 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
                                                    + xlat)) {
             te->addr_write = address | TLB_NOTDIRTY;
         } else {
-            te->addr_write = address;
+            if (!(address & TLB_MMIO) &&
+                !cpu_physical_memory_excl_is_dirty(section->mr->ram_addr
+                                                   + xlat)) {
+                te->addr_write = address | TLB_EXCL;
+            } else {
+                te->addr_write = address;
+            }
         }
     } else {
         te->addr_write = -1;
diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index ac06c67..632f6ce 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -311,6 +311,8 @@ extern RAMList ram_list;
 #define TLB_NOTDIRTY    (1 << 4)
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO        (1 << 5)
+/* Set if TLB entry refers to a page that requires exclusive access.  */
+#define TLB_EXCL        (1 << 6)
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index d5aecaf..c73a75f 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -165,5 +165,9 @@ typedef struct CPUIOTLBEntry {
 #define CPU_COMMON                                                      \
     /* soft mmu support */                                              \
     CPU_COMMON_TLB                                                      \
+                                                                        \
+    /* Used for atomic instruction translation. */                      \
+    bool ll_sc_context;                                                 \
+    hwaddr excl_protected_hwaddr;                                       \
 
 #endif
diff --git a/softmmu_template.h b/softmmu_template.h
index 18871f5..0edd451 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -141,6 +141,23 @@
     vidx >= 0;                                                                \
 })
 
+#define lookup_cpus_ll_addr(addr)                                             \
+({                                                                            \
+    CPUState *cpu;                                                            \
+    CPUArchState *acpu;                                                       \
+    bool hit = false;                                                         \
+                                                                              \
+    CPU_FOREACH(cpu) {                                                        \
+        acpu = (CPUArchState *)cpu->env_ptr;                                  \
+        if (cpu != current_cpu && acpu->excl_protected_hwaddr == addr) {      \
+            hit = true;                                                       \
+            break;                                                            \
+        }                                                                     \
+    }                                                                         \
+                                                                              \
+    hit;                                                                      \
+})
+
 #ifndef SOFTMMU_CODE_ACCESS
 static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
                                               CPUIOTLBEntry *iotlbentry,
@@ -414,43 +431,61 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
-    /* Handle an IO access.  */
+    /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUIOTLBEntry *iotlbentry;
-        if ((addr & (DATA_SIZE - 1)) != 0) {
-            goto do_unaligned_access;
-        }
-        iotlbentry = &env->iotlb[mmu_idx][index];
-
-        /* ??? Note that the io helpers always read data in the target
-           byte ordering.  We should push the LE/BE request down into io.  */
-        val = TGT_LE(val);
-        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
-        return;
-    }
-
-    /* Handle slow unaligned access (it spans two pages or IO).  */
-    if (DATA_SIZE > 1
-        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
-                     >= TARGET_PAGE_SIZE)) {
-        int i;
-    do_unaligned_access:
-        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
-                                 mmu_idx, retaddr);
+        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+            /* The slow-path has been forced since we are writing to
+             * exclusive-protected memory. */
+            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+            bool set_to_dirty;
+
+            /* Two cases of invalidation: the current vCPU is writing to another
+             * vCPU's exclusive address or the vCPU that issued the LoadLink is
+             * writing to it, but not through a StoreCond. */
+            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
+            set_to_dirty |= env->ll_sc_context &&
+                           (env->excl_protected_hwaddr == hw_addr);
+
+            if (set_to_dirty) {
+                cpu_physical_memory_set_excl_dirty(hw_addr);
+            } /* the vCPU is legitimately writing to the protected address */
+        } else {
+            if ((addr & (DATA_SIZE - 1)) != 0) {
+                goto do_unaligned_access;
+            }
+
+            /* ??? Note that the io helpers always read data in the target
+               byte ordering.  We should push the LE/BE request down into io. */
+            val = TGT_LE(val);
+            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+            return;
         }
-        /* XXX: not efficient, but simple */
-        /* Note: relies on the fact that tlb_fill() does not remove the
-         * previous page from the TLB cache.  */
-        for (i = DATA_SIZE - 1; i >= 0; i--) {
-            /* Little-endian extract.  */
-            uint8_t val8 = val >> (i * 8);
-            /* Note the adjustment at the beginning of the function.
-               Undo that for the recursion.  */
-            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
-                                            oi, retaddr + GETPC_ADJ);
+    } else {
+        /* Handle slow unaligned access (it spans two pages or IO).  */
+        if (DATA_SIZE > 1
+            && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                         >= TARGET_PAGE_SIZE)) {
+            int i;
+        do_unaligned_access:
+            if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+                cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                                     mmu_idx, retaddr);
+            }
+            /* XXX: not efficient, but simple */
+            /* Note: relies on the fact that tlb_fill() does not remove the
+             * previous page from the TLB cache.  */
+            for (i = DATA_SIZE - 1; i >= 0; i--) {
+                /* Little-endian extract.  */
+                uint8_t val8 = val >> (i * 8);
+                /* Note the adjustment at the beginning of the function.
+                   Undo that for the recursion.  */
+                glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
+                                                oi, retaddr + GETPC_ADJ);
+            }
+            return;
         }
-        return;
     }
 
     /* Handle aligned access or unaligned access in the same page.  */
@@ -494,43 +529,61 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
-    /* Handle an IO access.  */
+    /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUIOTLBEntry *iotlbentry;
-        if ((addr & (DATA_SIZE - 1)) != 0) {
-            goto do_unaligned_access;
-        }
-        iotlbentry = &env->iotlb[mmu_idx][index];
-
-        /* ??? Note that the io helpers always read data in the target
-           byte ordering.  We should push the LE/BE request down into io.  */
-        val = TGT_BE(val);
-        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
-        return;
-    }
-
-    /* Handle slow unaligned access (it spans two pages or IO).  */
-    if (DATA_SIZE > 1
-        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
-                     >= TARGET_PAGE_SIZE)) {
-        int i;
-    do_unaligned_access:
-        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
-                                 mmu_idx, retaddr);
+        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+            /* The slow-path has been forced since we are writing to
+             * exclusive-protected memory. */
+            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+            bool set_to_dirty;
+
+            /* Two cases of invalidation: the current vCPU is writing to another
+             * vCPU's exclusive address or the vCPU that issued the LoadLink is
+             * writing to it, but not through a StoreCond. */
+            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
+            set_to_dirty |= env->ll_sc_context &&
+                           (env->excl_protected_hwaddr == hw_addr);
+
+            if (set_to_dirty) {
+                cpu_physical_memory_set_excl_dirty(hw_addr);
+            } /* the vCPU is legitimately writing to the protected address */
+        } else {
+            if ((addr & (DATA_SIZE - 1)) != 0) {
+                goto do_unaligned_access;
+            }
+
+            /* ??? Note that the io helpers always read data in the target
+               byte ordering.  We should push the LE/BE request down into io. */
+            val = TGT_BE(val);
+            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+            return;
         }
-        /* XXX: not efficient, but simple */
-        /* Note: relies on the fact that tlb_fill() does not remove the
-         * previous page from the TLB cache.  */
-        for (i = DATA_SIZE - 1; i >= 0; i--) {
-            /* Big-endian extract.  */
-            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
-            /* Note the adjustment at the beginning of the function.
-               Undo that for the recursion.  */
-            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
-                                            oi, retaddr + GETPC_ADJ);
+    } else {
+        /* Handle slow unaligned access (it spans two pages or IO).  */
+        if (DATA_SIZE > 1
+            && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                         >= TARGET_PAGE_SIZE)) {
+            int i;
+        do_unaligned_access:
+            if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+                cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                                     mmu_idx, retaddr);
+            }
+            /* XXX: not efficient, but simple */
+            /* Note: relies on the fact that tlb_fill() does not remove the
+             * previous page from the TLB cache.  */
+            for (i = DATA_SIZE - 1; i >= 0; i--) {
+                /* Big-endian extract.  */
+                uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
+                /* Note the adjustment at the beginning of the function.
+                   Undo that for the recursion.  */
+                glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
+                                                oi, retaddr + GETPC_ADJ);
+            }
+            return;
         }
-        return;
     }
 
     /* Handle aligned access or unaligned access in the same page.  */
-- 
2.4.5


* [Qemu-devel] [RFC v3 03/13] softmmu: Add helpers for a new slow-path
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

The new helpers rely on the legacy ones to perform the actual read/write.

The StoreConditional helper (helper_le_stcond_name) returns 1 if the
store has to fail due to a concurrent access to the same page by
another vCPU.  A 'concurrent access' can be a store made by *any* vCPU
(although some implementations allow stores made by the CPU that issued
the LoadLink).

These helpers also update the TLB entry of the page involved in the
LL/SC, so that all the following accesses made by any vCPU will follow
the slow path.
In real multi-threading, these helpers will need to temporarily pause the
execution of the other vCPUs in order to update (i.e. flush) their TLB
caches accordingly.
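
Schematically, the StoreConditional helper decides the outcome as follows
(0 = success, 1 = failure); this is only a condensed view of the code
added in softmmu_llsc_template.h below:

    if (!env->ll_sc_context) {
        return 1;                                   /* no preceding LoadLink     */
    }
    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
        ret = 1;                                    /* page written after the LL */
    } else {
        helper_st_legacy(env, addr, val, mmu_idx, retaddr);
        ret = 0;                                    /* store performed           */
    }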

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                |   3 +
 softmmu_llsc_template.h | 155 ++++++++++++++++++++++++++++++++++++++++++++++++
 softmmu_template.h      |   4 ++
 tcg/tcg.h               |  18 ++++++
 4 files changed, 180 insertions(+)
 create mode 100644 softmmu_llsc_template.h

diff --git a/cputlb.c b/cputlb.c
index 0aca407..fa38714 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -475,6 +475,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
 
 #define MMUSUFFIX _mmu
 
+/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
+#define GEN_EXCLUSIVE_HELPERS
 #define SHIFT 0
 #include "softmmu_template.h"
 
@@ -487,6 +489,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
 #define SHIFT 3
 #include "softmmu_template.h"
 #undef MMUSUFFIX
+#undef GEN_EXCLUSIVE_HELPERS
 
 #define MMUSUFFIX _cmmu
 #undef GETPC_ADJ
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
new file mode 100644
index 0000000..81e9d8e
--- /dev/null
+++ b/softmmu_llsc_template.h
@@ -0,0 +1,155 @@
+/*
+ *  Software MMU support (exclusive load/store operations)
+ *
+ * Generate helpers used by TCG for qemu_ldlink/stcond ops.
+ *
+ * Included from softmmu_template.h only.
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ *  Alvise Rigo <a.rigo@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * TODO configurations not implemented:
+ *     - Signed/Unsigned Big-endian
+ *     - Signed Little-endian
+ * */
+
+#if DATA_SIZE > 1
+#define helper_le_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_le_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
+#else
+#define helper_le_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_le_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
+#endif
+
+/* helpers from cpu_ldst.h, byte-order independent versions */
+#if DATA_SIZE > 1
+#define helper_ld_legacy glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
+#define helper_st_legacy glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
+#else
+#define helper_ld_legacy glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
+#define helper_st_legacy glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
+#endif
+
+#define is_write_tlb_entry_set(env, page, index)                             \
+({                                                                           \
+    (addr & TARGET_PAGE_MASK)                                                \
+         == ((env->tlb_table[mmu_idx][index].addr_write) &                   \
+                 (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
+})                                                                           \
+
+#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+
+WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
+                                TCGMemOpIdx oi, uintptr_t retaddr)
+{
+    WORD_TYPE ret;
+    int index;
+    CPUState *cpu;
+    hwaddr hw_addr;
+    unsigned mmu_idx = get_mmuidx(oi);
+
+    /* Use the proper load helper from cpu_ldst.h */
+    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
+
+    /* The last legacy access ensures that the TLB and IOTLB entry for 'addr'
+     * have been created. */
+    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+
+    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
+     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
+    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
+
+    /* Set the exclusive-protected hwaddr. */
+    env->excl_protected_hwaddr = hw_addr;
+    env->ll_sc_context = true;
+
+    /* No need to mask hw_addr with TARGET_PAGE_MASK since
+     * cpu_physical_memory_excl_is_dirty() will take care of that. */
+    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
+        cpu_physical_memory_clear_excl_dirty(hw_addr);
+
+        /* Invalidate the TLB entry for the other processors. The next TLB
+         * entries for this page will have the TLB_EXCL flag set. */
+        CPU_FOREACH(cpu) {
+            if (cpu != current_cpu) {
+                tlb_flush(cpu, 1);
+            }
+        }
+    }
+
+    /* For this vCPU, just update the TLB entry, no need to flush. */
+    env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
+
+    return ret;
+}
+
+WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
+                                DATA_TYPE val, TCGMemOpIdx oi,
+                                uintptr_t retaddr)
+{
+    WORD_TYPE ret;
+    int index;
+    hwaddr hw_addr;
+    unsigned mmu_idx = get_mmuidx(oi);
+
+    /* If the TLB entry is not the right one, create it. */
+    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    if (!is_write_tlb_entry_set(env, addr, index)) {
+        tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
+    }
+
+    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
+     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
+    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
+
+    if (!env->ll_sc_context) {
+        /* No LoadLink has been set, the StoreCond has to fail. */
+        return 1;
+    }
+
+    env->ll_sc_context = 0;
+
+    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
+        /* Another vCPU has accessed the memory after the LoadLink. */
+        ret = 1;
+    } else {
+        helper_st_legacy(env, addr, val, mmu_idx, retaddr);
+
+        /* The StoreConditional succeeded */
+        ret = 0;
+    }
+
+    env->tlb_table[mmu_idx][index].addr_write &= ~TLB_EXCL;
+    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
+    /* It's likely that the page will be used again for exclusive accesses,
+     * for this reason we don't flush any TLB cache at the price of some
+     * additional slow paths and we don't set the page bit as dirty.
+     * The EXCL TLB entries will not remain there forever since they will be
+     * eventually removed to serve another guest page; when this happens we
+     * remove also the dirty bit (see cputlb.c).
+     * */
+
+    return ret;
+}
+
+#undef helper_le_ldlink_name
+#undef helper_le_stcond_name
+#undef helper_ld_legacy
+#undef helper_st_legacy
diff --git a/softmmu_template.h b/softmmu_template.h
index 0edd451..bc767f6 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -630,6 +630,10 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
 #endif
 #endif /* !defined(SOFTMMU_CODE_ACCESS) */
 
+#ifdef GEN_EXCLUSIVE_HELPERS
+#include "softmmu_llsc_template.h"
+#endif
+
 #undef READ_ACCESS_TYPE
 #undef SHIFT
 #undef DATA_TYPE
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 032fe10..8ca85ab 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -962,6 +962,15 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
                                     TCGMemOpIdx oi, uintptr_t retaddr);
 uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
+                                         int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+                                        int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+                                        int mmu_idx, uintptr_t retaddr);
+uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+                               int mmu_idx, uintptr_t retaddr);
 
 /* Value sign-extended to tcg register size.  */
 tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
@@ -989,6 +998,15 @@ void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                        TCGMemOpIdx oi, uintptr_t retaddr);
 void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                        TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
+                                uint8_t val, int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
+                                uint16_t val, int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
+                                uint32_t val, int mmu_idx, uintptr_t retaddr);
+uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
+                                uint64_t val, int mmu_idx, uintptr_t retaddr);
 
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
-- 
2.4.5


* [Qemu-devel] [RFC v3 04/13] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Create a new pair of instructions that implement a LoadLink/StoreConditional
mechanism.

It has not been possible to completely fold the two new opcodes into
the plain variants, since the StoreConditional always requires one
extra argument to return the success of the operation.
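
For reference, a front end is expected to consume the extra output operand
roughly like this (a sketch with placeholder temporaries and labels, close
to what the ARM patch later in the series does):

    tcg_gen_qemu_ldlink_i32(val, addr, mem_idx, MO_TEUL | MO_EXCL);
    /* ... */
    tcg_gen_qemu_stcond_i32(is_dirty, val, addr, mem_idx, MO_TEUL | MO_EXCL);
    /* branch to the failure path if the StoreConditional did not succeed */
    tcg_gen_brcondi_i32(TCG_COND_EQ, is_dirty, 1, fail_label);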

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 tcg/tcg-be-ldst.h |  1 +
 tcg/tcg-op.c      | 23 +++++++++++++++++++++++
 tcg/tcg-op.h      |  3 +++
 tcg/tcg-opc.h     |  4 ++++
 tcg/tcg.c         |  2 ++
 tcg/tcg.h         | 18 ++++++++++--------
 6 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/tcg/tcg-be-ldst.h b/tcg/tcg-be-ldst.h
index 40a2369..b3f9c51 100644
--- a/tcg/tcg-be-ldst.h
+++ b/tcg/tcg-be-ldst.h
@@ -24,6 +24,7 @@
 
 typedef struct TCGLabelQemuLdst {
     bool is_ld;             /* qemu_ld: true, qemu_st: false */
+    TCGReg llsc_success;    /* reg index for qemu_stcond outcome */
     TCGMemOpIdx oi;
     TCGType type;           /* result type of a load */
     TCGReg addrlo_reg;      /* reg index for low word of guest virtual addr */
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 45098c3..a73b522 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -1885,6 +1885,15 @@ static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr,
 #endif
 }
 
+/* An output operand to return the StoreConditional result */
+static void gen_stcond_i32(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i32 val,
+                           TCGv addr, TCGMemOp memop, TCGArg idx)
+{
+    TCGMemOpIdx oi = make_memop_idx(memop, idx);
+
+    tcg_gen_op4i_i32(opc, is_dirty, val, addr, oi);
+}
+
 static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, TCGv addr,
                          TCGMemOp memop, TCGArg idx)
 {
@@ -1911,12 +1920,26 @@ void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
     gen_ldst_i32(INDEX_op_qemu_ld_i32, val, addr, memop, idx);
 }
 
+void tcg_gen_qemu_ldlink_i32(TCGv_i32 val, TCGv addr, TCGArg idx,
+                             TCGMemOp memop)
+{
+    memop = tcg_canonicalize_memop(memop, 0, 0);
+    gen_ldst_i32(INDEX_op_qemu_ldlink_i32, val, addr, memop, idx);
+}
+
 void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     memop = tcg_canonicalize_memop(memop, 0, 1);
     gen_ldst_i32(INDEX_op_qemu_st_i32, val, addr, memop, idx);
 }
 
+void tcg_gen_qemu_stcond_i32(TCGv_i32 is_dirty, TCGv_i32 val, TCGv addr,
+                             TCGArg idx, TCGMemOp memop)
+{
+    memop = tcg_canonicalize_memop(memop, 0, 1);
+    gen_stcond_i32(INDEX_op_qemu_stcond_i32, is_dirty, val, addr, memop, idx);
+}
+
 void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index d1d763f..f183169 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -754,6 +754,9 @@ void tcg_gen_qemu_st_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_ld_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_st_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
 
+void tcg_gen_qemu_ldlink_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
+void tcg_gen_qemu_stcond_i32(TCGv_i32, TCGv_i32, TCGv, TCGArg, TCGMemOp);
+
 static inline void tcg_gen_qemu_ld8u(TCGv ret, TCGv addr, int mem_index)
 {
     tcg_gen_qemu_ld_tl(ret, addr, mem_index, MO_UB);
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 13ccb60..d6c0454 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -183,6 +183,10 @@ DEF(qemu_ld_i32, 1, TLADDR_ARGS, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
 DEF(qemu_st_i32, 0, TLADDR_ARGS + 1, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
+DEF(qemu_ldlink_i32, 1, TLADDR_ARGS, 2,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
+DEF(qemu_stcond_i32, 1, TLADDR_ARGS + 1, 2,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
 DEF(qemu_ld_i64, DATA64_ARGS, TLADDR_ARGS, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
 DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 7e088b1..8a2265e 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1068,6 +1068,8 @@ void tcg_dump_ops(TCGContext *s)
                 i = 1;
                 break;
             case INDEX_op_qemu_ld_i32:
+            case INDEX_op_qemu_ldlink_i32:
+            case INDEX_op_qemu_stcond_i32:
             case INDEX_op_qemu_st_i32:
             case INDEX_op_qemu_ld_i64:
             case INDEX_op_qemu_st_i64:
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 8ca85ab..d41a18c 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -282,6 +282,8 @@ typedef enum TCGMemOp {
     MO_TEQ   = MO_TE | MO_Q,
 
     MO_SSIZE = MO_SIZE | MO_SIGN,
+
+    MO_EXCL  = 32, /* Set for exclusive memory access */
 } TCGMemOp;
 
 typedef tcg_target_ulong TCGArg;
@@ -964,13 +966,13 @@ uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr);
 /* Exclusive variants */
 tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
-                                         int mmu_idx, uintptr_t retaddr);
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
 tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
-                                        int mmu_idx, uintptr_t retaddr);
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
 tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
-                                        int mmu_idx, uintptr_t retaddr);
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
 uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
-                               int mmu_idx, uintptr_t retaddr);
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
 
 /* Value sign-extended to tcg register size.  */
 tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
@@ -1000,13 +1002,13 @@ void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                        TCGMemOpIdx oi, uintptr_t retaddr);
 /* Exclusive variants */
 tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
-                                uint8_t val, int mmu_idx, uintptr_t retaddr);
+                            uint8_t val, TCGMemOpIdx oi, uintptr_t retaddr);
 tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
-                                uint16_t val, int mmu_idx, uintptr_t retaddr);
+                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
 tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
-                                uint32_t val, int mmu_idx, uintptr_t retaddr);
+                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
 uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
-                                uint64_t val, int mmu_idx, uintptr_t retaddr);
+                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
 
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
-- 
2.4.5


* [Qemu-devel] [RFC v3 05/13] target-arm: translate: implement qemu_ldlink and qemu_stcond ops
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Implement the strex and ldrex instructions relying on TCG's qemu_ldlink and
qemu_stcond.  For the time being, only the 32-bit instructions are supported.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 target-arm/translate.c |  87 ++++++++++++++++++++++++++++++++++-
 tcg/arm/tcg-target.c   | 121 +++++++++++++++++++++++++++++++++++++------------
 2 files changed, 178 insertions(+), 30 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 80302cd..0366c76 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -72,6 +72,8 @@ static TCGv_i64 cpu_exclusive_test;
 static TCGv_i32 cpu_exclusive_info;
 #endif
 
+static TCGv_i32 cpu_ll_sc_context;
+
 /* FIXME:  These should be removed.  */
 static TCGv_i32 cpu_F0s, cpu_F1s;
 static TCGv_i64 cpu_F0d, cpu_F1d;
@@ -103,6 +105,8 @@ void arm_translate_init(void)
         offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
     cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
         offsetof(CPUARMState, exclusive_val), "exclusive_val");
+    cpu_ll_sc_context = tcg_global_mem_new_i32(TCG_AREG0,
+        offsetof(CPUARMState, ll_sc_context), "ll_sc_context");
 #ifdef CONFIG_USER_ONLY
     cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
         offsetof(CPUARMState, exclusive_test), "exclusive_test");
@@ -961,6 +965,18 @@ DO_GEN_ST(8, MO_UB)
 DO_GEN_ST(16, MO_TEUW)
 DO_GEN_ST(32, MO_TEUL)
 
+/* Load/Store exclusive generators (always unsigned) */
+static inline void gen_aa32_ldex32(TCGv_i32 val, TCGv_i32 addr, int index)
+{
+    tcg_gen_qemu_ldlink_i32(val, addr, index, MO_TEUL | MO_EXCL);
+}
+
+static inline void gen_aa32_stex32(TCGv_i32 is_dirty, TCGv_i32 val,
+                                   TCGv_i32 addr, int index)
+{
+    tcg_gen_qemu_stcond_i32(is_dirty, val, addr, index, MO_TEUL | MO_EXCL);
+}
+
 static inline void gen_set_pc_im(DisasContext *s, target_ulong val)
 {
     tcg_gen_movi_i32(cpu_R[15], val);
@@ -7427,6 +7443,26 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     store_reg(s, rt, tmp);
 }
 
+static void gen_load_exclusive_multi(DisasContext *s, int rt, int rt2,
+                                     TCGv_i32 addr, int size)
+{
+    TCGv_i32 tmp = tcg_temp_new_i32();
+
+    switch (size) {
+    case 0:
+    case 1:
+        abort();
+    case 2:
+        gen_aa32_ldex32(tmp, addr, get_mem_index(s));
+        break;
+    case 3:
+    default:
+        abort();
+    }
+
+    store_reg(s, rt, tmp);
+}
+
 static void gen_clrex(DisasContext *s)
 {
     gen_helper_atomic_clear(cpu_env);
@@ -7460,6 +7496,52 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_temp_free_i64(val);
     tcg_temp_free_i32(tmp_size);
 }
+
+static void gen_store_exclusive_multi(DisasContext *s, int rd, int rt, int rt2,
+                                      TCGv_i32 addr, int size)
+{
+    TCGv_i32 tmp;
+    TCGv_i32 is_dirty;
+    TCGLabel *done_label;
+    TCGLabel *fail_label;
+
+    fail_label = gen_new_label();
+    done_label = gen_new_label();
+
+    is_dirty = tcg_temp_new_i32();
+    /* 'tmp' is assigned by load_reg() below; no separate allocation needed. */
+
+    /* Fail if we are not in LL/SC context. */
+    tcg_gen_brcondi_i32(TCG_COND_NE, cpu_ll_sc_context, 1, fail_label);
+
+    tmp = load_reg(s, rt);
+    switch (size) {
+    case 0:
+    case 1:
+        abort();
+        break;
+    case 2:
+        gen_aa32_stex32(is_dirty, tmp, addr, get_mem_index(s));
+        break;
+    case 3:
+    default:
+        abort();
+    }
+
+    tcg_temp_free_i32(tmp);
+
+    /* Check if the store conditional has to fail. */
+    tcg_gen_brcondi_i32(TCG_COND_EQ, is_dirty, 1, fail_label);
+    tcg_temp_free_i32(is_dirty);
+
+    /* 'tmp' was already freed right after generating the store above. */
+
+    tcg_gen_movi_i32(cpu_R[rd], 0); /* is_dirty = 0 */
+    tcg_gen_br(done_label);
+    gen_set_label(fail_label);
+    tcg_gen_movi_i32(cpu_R[rd], 1); /* is_dirty = 1 */
+    gen_set_label(done_label);
+}
 #endif
 
 /* gen_srs:
@@ -8308,7 +8390,7 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
                         } else if (insn & (1 << 20)) {
                             switch (op1) {
                             case 0: /* ldrex */
-                                gen_load_exclusive(s, rd, 15, addr, 2);
+                                gen_load_exclusive_multi(s, rd, 15, addr, 2);
                                 break;
                             case 1: /* ldrexd */
                                 gen_load_exclusive(s, rd, rd + 1, addr, 3);
@@ -8326,7 +8408,8 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
                             rm = insn & 0xf;
                             switch (op1) {
                             case 0:  /*  strex */
-                                gen_store_exclusive(s, rd, rm, 15, addr, 2);
+                                gen_store_exclusive_multi(s, rd, rm, 15,
+                                                          addr, 2);
                                 break;
                             case 1: /*  strexd */
                                 gen_store_exclusive(s, rd, rm, rm + 1, addr, 3);
diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index ae2ec7a..f2b69a0 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -1069,6 +1069,17 @@ static void * const qemu_ld_helpers[16] = {
     [MO_BESL] = helper_be_ldul_mmu,
 };
 
+/* LoadLink helpers, only unsigned. Use the macro below to access them. */
+static void * const qemu_ldex_helpers[16] = {
+    [MO_LEUL] = helper_le_ldlinkul_mmu,
+};
+
+#define LDEX_HELPER(mem_op)                                             \
+({                                                                      \
+    assert(mem_op & MO_EXCL);                                           \
+    qemu_ldex_helpers[((int)mem_op - MO_EXCL)];                         \
+})
+
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
  *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
  */
@@ -1082,6 +1093,19 @@ static void * const qemu_st_helpers[16] = {
     [MO_BEQ]  = helper_be_stq_mmu,
 };
 
+/* StoreConditional helpers. Use the macro below to access them. */
+static void * const qemu_stex_helpers[16] = {
+    [MO_LEUL] = helper_le_stcondl_mmu,
+};
+
+#define STEX_HELPER(mem_op)                                             \
+({                                                                      \
+    assert(mem_op & MO_EXCL);                                           \
+    qemu_stex_helpers[(int)mem_op - MO_EXCL];                           \
+})
+
+
+
 /* Helper routines for marshalling helper function arguments into
  * the correct registers and stack.
  * argreg is where we want to put this argument, arg is the argument itself.
@@ -1222,13 +1246,14 @@ static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
    path for a load or store, so that we can later generate the correct
    helper code.  */
 static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
-                                TCGReg datalo, TCGReg datahi, TCGReg addrlo,
-                                TCGReg addrhi, tcg_insn_unit *raddr,
-                                tcg_insn_unit *label_ptr)
+                                TCGReg llsc_success, TCGReg datalo,
+                                TCGReg datahi, TCGReg addrlo, TCGReg addrhi,
+                                tcg_insn_unit *raddr, tcg_insn_unit *label_ptr)
 {
     TCGLabelQemuLdst *label = new_ldst_label(s);
 
     label->is_ld = is_ld;
+    label->llsc_success = llsc_success;
     label->oi = oi;
     label->datalo_reg = datalo;
     label->datahi_reg = datahi;
@@ -1259,12 +1284,16 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     /* For armv6 we can use the canonical unsigned helpers and minimize
        icache usage.  For pre-armv6, use the signed helpers since we do
        not have a single insn sign-extend.  */
-    if (use_armv6_instructions) {
-        func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)];
+    if (opc & MO_EXCL) {
+        func = LDEX_HELPER(opc);
     } else {
-        func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)];
-        if (opc & MO_SIGN) {
-            opc = MO_UL;
+        if (use_armv6_instructions) {
+            func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)];
+        } else {
+            func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)];
+            if (opc & MO_SIGN) {
+                opc = MO_UL;
+            }
         }
     }
     tcg_out_call(s, func);
@@ -1336,8 +1365,15 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     argreg = tcg_out_arg_imm32(s, argreg, oi);
     argreg = tcg_out_arg_reg32(s, argreg, TCG_REG_R14);
 
-    /* Tail-call to the helper, which will return to the fast path.  */
-    tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    if (opc & MO_EXCL) {
+        tcg_out_call(s, STEX_HELPER(opc));
+        /* Save the output of the StoreConditional */
+        tcg_out_mov_reg(s, COND_AL, lb->llsc_success, TCG_REG_R0);
+        tcg_out_goto(s, COND_AL, lb->raddr);
+    } else {
+        /* Tail-call to the helper, which will return to the fast path.  */
+        tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    }
 }
 #endif /* SOFTMMU */
 
@@ -1461,7 +1497,8 @@ static inline void tcg_out_qemu_ld_direct(TCGContext *s, TCGMemOp opc,
     }
 }
 
-static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
+static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64,
+                            bool isLoadLink)
 {
     TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
     TCGMemOpIdx oi;
@@ -1484,13 +1521,20 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
     addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 1);
 
     /* This a conditional BL only to load a pointer within this opcode into LR
-       for the slow path.  We will not be using the value for a tail call.  */
-    label_ptr = s->code_ptr;
-    tcg_out_bl_noaddr(s, COND_NE);
+       for the slow path.  We will not be using the value for a tail call.
+       In the context of a LoadLink instruction, we don't check the TLB but we
+       always follow the slow path.  */
+    if (isLoadLink) {
+        label_ptr = s->code_ptr;
+        tcg_out_bl_noaddr(s, COND_AL);
+    } else {
+        label_ptr = s->code_ptr;
+        tcg_out_bl_noaddr(s, COND_NE);
 
-    tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
+        tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
+    }
 
-    add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
+    add_qemu_ldst_label(s, true, oi, 0, datalo, datahi, addrlo, addrhi,
                         s->code_ptr, label_ptr);
 #else /* !CONFIG_SOFTMMU */
     if (GUEST_BASE) {
@@ -1592,9 +1636,11 @@ static inline void tcg_out_qemu_st_direct(TCGContext *s, TCGMemOp opc,
     }
 }
 
-static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
+static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
+                            bool isStoreCond)
 {
     TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
+    TCGReg llsc_success;
     TCGMemOpIdx oi;
     TCGMemOp opc;
 #ifdef CONFIG_SOFTMMU
@@ -1603,6 +1649,9 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
     tcg_insn_unit *label_ptr;
 #endif
 
+    /* The stcond variant has one more param */
+    llsc_success = (isStoreCond ? *args++ : 0);
+
     datalo = *args++;
     datahi = (is64 ? *args++ : 0);
     addrlo = *args++;
@@ -1612,16 +1661,24 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
 
 #ifdef CONFIG_SOFTMMU
     mem_index = get_mmuidx(oi);
-    addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 0);
 
-    tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
+    if (isStoreCond) {
+        /* Always follow the slow-path for an exclusive access */
+        label_ptr = s->code_ptr;
+        tcg_out_bl_noaddr(s, COND_AL);
+    } else {
+        addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE,
+                                  mem_index, 0);
 
-    /* The conditional call must come last, as we're going to return here.  */
-    label_ptr = s->code_ptr;
-    tcg_out_bl_noaddr(s, COND_NE);
+        tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
 
-    add_qemu_ldst_label(s, false, oi, datalo, datahi, addrlo, addrhi,
-                        s->code_ptr, label_ptr);
+        /* The conditional call must come last, as we're going to return here.*/
+        label_ptr = s->code_ptr;
+        tcg_out_bl_noaddr(s, COND_NE);
+    }
+
+    add_qemu_ldst_label(s, false, oi, llsc_success, datalo, datahi, addrlo,
+                        addrhi, s->code_ptr, label_ptr);
 #else /* !CONFIG_SOFTMMU */
     if (GUEST_BASE) {
         tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP, GUEST_BASE);
@@ -1864,16 +1921,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
     case INDEX_op_qemu_ld_i32:
-        tcg_out_qemu_ld(s, args, 0);
+        tcg_out_qemu_ld(s, args, 0, 0);
+        break;
+    case INDEX_op_qemu_ldlink_i32:
+        tcg_out_qemu_ld(s, args, 0, 1); /* LoadLink */
         break;
     case INDEX_op_qemu_ld_i64:
-        tcg_out_qemu_ld(s, args, 1);
+        tcg_out_qemu_ld(s, args, 1, 0);
         break;
     case INDEX_op_qemu_st_i32:
-        tcg_out_qemu_st(s, args, 0);
+        tcg_out_qemu_st(s, args, 0, 0);
+        break;
+    case INDEX_op_qemu_stcond_i32:
+        tcg_out_qemu_st(s, args, 0, 1); /* StoreConditional */
         break;
     case INDEX_op_qemu_st_i64:
-        tcg_out_qemu_st(s, args, 1);
+        tcg_out_qemu_st(s, args, 1, 0);
         break;
 
     case INDEX_op_bswap16_i32:
@@ -1957,8 +2020,10 @@ static const TCGTargetOpDef arm_op_defs[] = {
 
 #if TARGET_LONG_BITS == 32
     { INDEX_op_qemu_ld_i32, { "r", "l" } },
+    { INDEX_op_qemu_ldlink_i32, { "r", "l" } },
     { INDEX_op_qemu_ld_i64, { "r", "r", "l" } },
     { INDEX_op_qemu_st_i32, { "s", "s" } },
+    { INDEX_op_qemu_stcond_i32, { "r", "s", "s" } },
     { INDEX_op_qemu_st_i64, { "s", "s", "s" } },
 #else
     { INDEX_op_qemu_ld_i32, { "r", "l", "l" } },
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v3 06/13] target-i386: translate: implement qemu_ldlink and qemu_stcond ops
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (4 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 05/13] target-arm: translate: implement qemu_ldlink and qemu_stcond ops Alvise Rigo
@ 2015-07-10  8:23 ` Alvise Rigo
  2015-07-17 12:56   ` Alex Bennée
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 07/13] ram_addr.h: Make exclusive bitmap accessors atomic Alvise Rigo
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Implement the strex and ldrex instructions relying on TCG's qemu_ldlink and
qemu_stcond.  For the time being, only 32-bit configurations are supported.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 tcg/i386/tcg-target.c | 136 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 114 insertions(+), 22 deletions(-)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 0d7c99c..d8250a9 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1141,6 +1141,17 @@ static void * const qemu_ld_helpers[16] = {
     [MO_BEQ]  = helper_be_ldq_mmu,
 };
 
+/* LoadLink helpers, only unsigned. Use the macro below to access them. */
+static void * const qemu_ldex_helpers[16] = {
+    [MO_LEUL] = helper_le_ldlinkul_mmu,
+};
+
+#define LDEX_HELPER(mem_op)                                             \
+({                                                                      \
+    assert(mem_op & MO_EXCL);                                           \
+    qemu_ldex_helpers[((int)mem_op - MO_EXCL)];                         \
+})
+
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
  *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
  */
@@ -1154,6 +1165,17 @@ static void * const qemu_st_helpers[16] = {
     [MO_BEQ]  = helper_be_stq_mmu,
 };
 
+/* StoreConditional helpers. Use the macro below to access them. */
+static void * const qemu_stex_helpers[16] = {
+    [MO_LEUL] = helper_le_stcondl_mmu,
+};
+
+#define STEX_HELPER(mem_op)                                             \
+({                                                                      \
+    assert(mem_op & MO_EXCL);                                           \
+    qemu_stex_helpers[(int)mem_op - MO_EXCL];                           \
+})
+
 /* Perform the TLB load and compare.
 
    Inputs:
@@ -1249,6 +1271,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
  * for a load or store, so that we can later generate the correct helper code
  */
 static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
+                                TCGReg llsc_success,
                                 TCGReg datalo, TCGReg datahi,
                                 TCGReg addrlo, TCGReg addrhi,
                                 tcg_insn_unit *raddr,
@@ -1257,6 +1280,7 @@ static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
     TCGLabelQemuLdst *label = new_ldst_label(s);
 
     label->is_ld = is_ld;
+    label->llsc_success = llsc_success;
     label->oi = oi;
     label->datalo_reg = datalo;
     label->datahi_reg = datahi;
@@ -1311,7 +1335,11 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
                      (uintptr_t)l->raddr);
     }
 
-    tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    if (opc & MO_EXCL) {
+        tcg_out_call(s, LDEX_HELPER(opc));
+    } else {
+        tcg_out_call(s, qemu_ld_helpers[opc & ~MO_SIGN]);
+    }
 
     data_reg = l->datalo_reg;
     switch (opc & MO_SSIZE) {
@@ -1415,9 +1443,16 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
         }
     }
 
-    /* "Tail call" to the helper, with the return address back inline.  */
-    tcg_out_push(s, retaddr);
-    tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    if (opc & MO_EXCL) {
+        tcg_out_call(s, STEX_HELPER(opc));
+        /* Save the output of the StoreConditional */
+        tcg_out_mov(s, TCG_TYPE_I32, l->llsc_success, TCG_REG_EAX);
+        tcg_out_jmp(s, l->raddr);
+    } else {
+        /* "Tail call" to the helper, with the return address back inline.  */
+        tcg_out_push(s, retaddr);
+        tcg_out_jmp(s, qemu_st_helpers[opc]);
+    }
 }
 #elif defined(__x86_64__) && defined(__linux__)
 # include <asm/prctl.h>
@@ -1530,7 +1565,8 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
 /* XXX: qemu_ld and qemu_st could be modified to clobber only EDX and
    EAX. It will be useful once fixed registers globals are less
    common. */
-static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
+static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64,
+                            bool isLoadLink)
 {
     TCGReg datalo, datahi, addrlo;
     TCGReg addrhi __attribute__((unused));
@@ -1553,14 +1589,34 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
     mem_index = get_mmuidx(oi);
     s_bits = opc & MO_SIZE;
 
-    tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
-                     label_ptr, offsetof(CPUTLBEntry, addr_read));
+    if (isLoadLink) {
+        TCGType t = ((TCG_TARGET_REG_BITS == 64) && (TARGET_LONG_BITS == 64)) ?
+                                                   TCG_TYPE_I64 : TCG_TYPE_I32;
+        /* The JMP address will be patched afterwards,
+         * in tcg_out_qemu_ld_slow_path (two times when
+         * TARGET_LONG_BITS > TCG_TARGET_REG_BITS). */
+        tcg_out_mov(s, t, TCG_REG_L1, addrlo);
+
+        if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+            /* Store the second part of the address. */
+            tcg_out_mov(s, t, TCG_REG_L0, addrhi);
+            /* We add 4 to include the jmp that follows. */
+            label_ptr[1] = s->code_ptr + 4;
+        }
 
-    /* TLB Hit.  */
-    tcg_out_qemu_ld_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
+        tcg_out_opc(s, OPC_JMP_long, 0, 0, 0);
+        label_ptr[0] = s->code_ptr;
+        s->code_ptr += 4;
+    } else {
+        tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
+                         label_ptr, offsetof(CPUTLBEntry, addr_read));
+
+        /* TLB Hit.  */
+        tcg_out_qemu_ld_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
+    }
 
     /* Record the current context of a load into ldst label */
-    add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
+    add_qemu_ldst_label(s, true, oi, 0, datalo, datahi, addrlo, addrhi,
                         s->code_ptr, label_ptr);
 #else
     {
@@ -1663,9 +1719,10 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
     }
 }
 
-static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
+static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
+                            bool isStoreCond)
 {
-    TCGReg datalo, datahi, addrlo;
+    TCGReg datalo, datahi, addrlo, llsc_success;
     TCGReg addrhi __attribute__((unused));
     TCGMemOpIdx oi;
     TCGMemOp opc;
@@ -1675,6 +1732,9 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
     tcg_insn_unit *label_ptr[2];
 #endif
 
+    /* The stcond variant has one more param */
+    llsc_success = (isStoreCond ? *args++ : 0);
+
     datalo = *args++;
     datahi = (TCG_TARGET_REG_BITS == 32 && is64 ? *args++ : 0);
     addrlo = *args++;
@@ -1686,15 +1746,35 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
     mem_index = get_mmuidx(oi);
     s_bits = opc & MO_SIZE;
 
-    tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
-                     label_ptr, offsetof(CPUTLBEntry, addr_write));
+    if (isStoreCond) {
+        TCGType t = ((TCG_TARGET_REG_BITS == 64) && (TARGET_LONG_BITS == 64)) ?
+                                                   TCG_TYPE_I64 : TCG_TYPE_I32;
+        /* The JMP address will be filled afterwards,
+         * in tcg_out_qemu_st_slow_path (two times when
+         * TARGET_LONG_BITS > TCG_TARGET_REG_BITS). */
+        tcg_out_mov(s, t, TCG_REG_L1, addrlo);
+
+        if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+            /* Store the second part of the address. */
+            tcg_out_mov(s, t, TCG_REG_L0, addrhi);
+            /* We add 4 to include the jmp that follows. */
+            label_ptr[1] = s->code_ptr + 4;
+        }
 
-    /* TLB Hit.  */
-    tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
+        tcg_out_opc(s, OPC_JMP_long, 0, 0, 0);
+        label_ptr[0] = s->code_ptr;
+        s->code_ptr += 4;
+    } else {
+        tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
+                         label_ptr, offsetof(CPUTLBEntry, addr_write));
+
+        /* TLB Hit.  */
+        tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
+    }
 
     /* Record the current context of a store into ldst label */
-    add_qemu_ldst_label(s, false, oi, datalo, datahi, addrlo, addrhi,
-                        s->code_ptr, label_ptr);
+    add_qemu_ldst_label(s, false, oi, llsc_success, datalo, datahi, addrlo,
+                        addrhi, s->code_ptr, label_ptr);
 #else
     {
         int32_t offset = GUEST_BASE;
@@ -1955,16 +2035,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
     case INDEX_op_qemu_ld_i32:
-        tcg_out_qemu_ld(s, args, 0);
+        tcg_out_qemu_ld(s, args, 0, 0);
+        break;
+    case INDEX_op_qemu_ldlink_i32:
+        tcg_out_qemu_ld(s, args, 0, 1);
         break;
     case INDEX_op_qemu_ld_i64:
-        tcg_out_qemu_ld(s, args, 1);
+        tcg_out_qemu_ld(s, args, 1, 0);
         break;
     case INDEX_op_qemu_st_i32:
-        tcg_out_qemu_st(s, args, 0);
+        tcg_out_qemu_st(s, args, 0, 0);
+        break;
+    case INDEX_op_qemu_stcond_i32:
+        tcg_out_qemu_st(s, args, 0, 1);
         break;
     case INDEX_op_qemu_st_i64:
-        tcg_out_qemu_st(s, args, 1);
+        tcg_out_qemu_st(s, args, 1, 0);
         break;
 
     OP_32_64(mulu2):
@@ -2186,17 +2272,23 @@ static const TCGTargetOpDef x86_op_defs[] = {
 
 #if TCG_TARGET_REG_BITS == 64
     { INDEX_op_qemu_ld_i32, { "r", "L" } },
+    { INDEX_op_qemu_ldlink_i32, { "r", "L" } },
     { INDEX_op_qemu_st_i32, { "L", "L" } },
+    { INDEX_op_qemu_stcond_i32, { "r", "L", "L" } },
     { INDEX_op_qemu_ld_i64, { "r", "L" } },
     { INDEX_op_qemu_st_i64, { "L", "L" } },
 #elif TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
     { INDEX_op_qemu_ld_i32, { "r", "L" } },
+    { INDEX_op_qemu_ldlink_i32, { "r", "L" } },
     { INDEX_op_qemu_st_i32, { "L", "L" } },
+    { INDEX_op_qemu_stcond_i32, { "r", "L", "L" } },
     { INDEX_op_qemu_ld_i64, { "r", "r", "L" } },
     { INDEX_op_qemu_st_i64, { "L", "L", "L" } },
 #else
     { INDEX_op_qemu_ld_i32, { "r", "L", "L" } },
+    { INDEX_op_qemu_ldlink_i32, { "r", "L", "L" } },
     { INDEX_op_qemu_st_i32, { "L", "L", "L" } },
+    { INDEX_op_qemu_stcond_i32, { "r", "L", "L", "L" } },
     { INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
     { INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
 #endif
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v3 07/13] ram_addr.h: Make exclusive bitmap accessors atomic
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (5 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 06/13] target-i386: " Alvise Rigo
@ 2015-07-10  8:23 ` Alvise Rigo
  2015-07-17 13:32   ` Alex Bennée
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 08/13] exec.c: introduce a simple rendezvous support Alvise Rigo
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 include/exec/ram_addr.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 2766541..e51bd65 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -255,7 +255,7 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
 /* Exclusive bitmap accessors. */
 static inline void cpu_physical_memory_set_excl_dirty(ram_addr_t addr)
 {
-    set_bit(addr >> TARGET_PAGE_BITS,
+    set_bit_atomic(addr >> TARGET_PAGE_BITS,
             ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
 }
 
@@ -267,8 +267,8 @@ static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr)
 
 static inline void cpu_physical_memory_clear_excl_dirty(ram_addr_t addr)
 {
-    clear_bit(addr >> TARGET_PAGE_BITS,
-              ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+    bitmap_test_and_clear_atomic(ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                                 addr >> TARGET_PAGE_BITS, 1);
 }
 
 #endif
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v3 08/13] exec.c: introduce a simple rendezvous support
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (6 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 07/13] ram_addr.h: Make exclusive bitmap accessors atomic Alvise Rigo
@ 2015-07-10  8:23 ` Alvise Rigo
  2015-07-17 13:45   ` Alex Bennée
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support Alvise Rigo
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

When a vCPU is about to set a memory page as exclusive, it needs to wait for
all the running vCPUs to finish executing their current TB and to know the
exact moment when that happens. For this, add a simple rendezvous mechanism
that will be used in softmmu_llsc_template.h to implement the ldlink
operation.
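
As a rough usage sketch, the requesting vCPU is expected to drive the
rendezvous as below (the real call site is added to softmmu_llsc_template.h
later in the series; wait_for_other_vcpus is a made-up name and error
handling is omitted):

    /* Sketch only: gather every other vCPU that is currently inside a TB
     * and wait until all of them have left it. */
    static void wait_for_other_vcpus(CPUState *requester)
    {
        CPUState *cpu;

        cpu_exit_init_rendezvous();

        CPU_FOREACH(cpu) {
            if (cpu != requester && cpu->tcg_executing) {
                /* The attendee calls cpu_exit_do_rendezvous() as soon as
                 * it returns from cpu_exec(), see the cpus.c hunk below. */
                cpu_exit_rendezvous_add_attendee(cpu);
                qemu_cpu_kick_thread(cpu);
            }
        }

        /* Poll until every attendee has exited its TB. */
        cpu_exit_rendezvous_wait_others(requester);
    }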

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cpus.c            |  5 +++++
 exec.c            | 45 +++++++++++++++++++++++++++++++++++++++++++++
 include/qom/cpu.h | 16 ++++++++++++++++
 3 files changed, 66 insertions(+)

diff --git a/cpus.c b/cpus.c
index aee445a..f4d938e 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1423,6 +1423,11 @@ static int tcg_cpu_exec(CPUArchState *env)
     qemu_mutex_unlock_iothread();
     ret = cpu_exec(env);
     cpu->tcg_executing = 0;
+
+    if (unlikely(cpu->pending_rdv)) {
+        cpu_exit_do_rendezvous(cpu);
+    }
+
     qemu_mutex_lock_iothread();
 #ifdef CONFIG_PROFILER
     tcg_time += profile_getclock() - ti;
diff --git a/exec.c b/exec.c
index 964e922..51958ed 100644
--- a/exec.c
+++ b/exec.c
@@ -746,6 +746,51 @@ void cpu_breakpoint_remove_all(CPUState *cpu, int mask)
     }
 }
 
+/* Rendezvous implementation.
+ * The corresponding definitions are in include/qom/cpu.h. */
+CpuExitRendezvous cpu_exit_rendezvous;
+inline void cpu_exit_init_rendezvous(void)
+{
+    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
+
+    rdv->attendees = 0;
+}
+
+inline void cpu_exit_rendezvous_add_attendee(CPUState *cpu)
+{
+    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
+
+    if (!cpu->pending_rdv) {
+        cpu->pending_rdv = 1;
+        atomic_inc(&rdv->attendees);
+    }
+}
+
+void cpu_exit_do_rendezvous(CPUState *cpu)
+{
+    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
+
+    atomic_dec(&rdv->attendees);
+
+    cpu->pending_rdv = 0;
+}
+
+void cpu_exit_rendezvous_wait_others(CPUState *cpu)
+{
+    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
+
+    while (rdv->attendees) {
+        g_usleep(TCG_RDV_POLLING_PERIOD);
+    }
+}
+
+void cpu_exit_rendezvous_release(void)
+{
+    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
+
+    rdv->attendees = 0;
+}
+
 /* enable or disable single step mode. EXCP_DEBUG is returned by the
    CPU loop after each instruction */
 void cpu_single_step(CPUState *cpu, int enabled)
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 8f3fe56..8d121b3 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -201,6 +201,20 @@ typedef struct CPUWatchpoint {
     QTAILQ_ENTRY(CPUWatchpoint) entry;
 } CPUWatchpoint;
 
+/* Rendezvous support */
+#define TCG_RDV_POLLING_PERIOD 10
+typedef struct CpuExitRendezvous {
+    volatile int attendees;
+    QemuMutex lock;
+} CpuExitRendezvous;
+
+extern CpuExitRendezvous cpu_exit_rendezvous;
+void cpu_exit_init_rendezvous(void);
+void cpu_exit_rendezvous_add_attendee(CPUState *cpu);
+void cpu_exit_do_rendezvous(CPUState *cpu);
+void cpu_exit_rendezvous_wait_others(CPUState *cpu);
+void cpu_exit_rendezvous_release(void);
+
 struct KVMState;
 struct kvm_run;
 
@@ -291,6 +305,8 @@ struct CPUState {
 
     void *opaque;
 
+    volatile int pending_rdv;
+
     /* In order to avoid passing too many arguments to the MMIO helpers,
      * we store some rarely used information in the CPU context.
      */
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (7 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 08/13] exec.c: introduce a simple rendezvous support Alvise Rigo
@ 2015-07-10  8:23 ` Alvise Rigo
  2015-07-10  9:36   ` Paolo Bonzini
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 10/13] Simple TLB flush wrap to use as exit callback Alvise Rigo
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

In order to perform "lazy" TLB invalidation requests, introduce a per-vCPU
queue of callbacks that will be fired just before the vCPU enters its next TB.
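
For illustration, a user of this interface queues a one-shot callback roughly
as follows (do_flush is a made-up name; the next patch adds the real
TLB-flush callback):

    /* Runs in the vCPU thread, right before the next TB is entered. */
    static void do_flush(CPUState *cpu, void *opaque)
    {
        tlb_flush(cpu, 1);
    }

    /* May be called from another vCPU's thread. */
    cpu_exit_callback_add(cpu, do_flush, NULL);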

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cpus.c            | 34 ++++++++++++++++++++++++++++++++++
 exec.c            |  1 +
 include/qom/cpu.h | 20 ++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/cpus.c b/cpus.c
index f4d938e..b9f0329 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1421,6 +1421,7 @@ static int tcg_cpu_exec(CPUArchState *env)
         cpu->icount_extra = count;
     }
     qemu_mutex_unlock_iothread();
+    cpu_exit_callbacks_call_all(cpu);
     ret = cpu_exec(env);
     cpu->tcg_executing = 0;
 
@@ -1469,6 +1470,39 @@ static void tcg_exec_all(CPUState *cpu)
     cpu->exit_request = 0;
 }
 
+void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
+                           void *opaque)
+{
+    CPUExitCB *cb;
+
+    cb = g_malloc(sizeof(*cb));
+    cb->callback = callback;
+    cb->opaque = opaque;
+
+    qemu_mutex_lock(&cpu->exit_cbs.mutex);
+    QTAILQ_INSERT_TAIL(&cpu->exit_cbs.exit_callbacks, cb, entry);
+    qemu_mutex_unlock(&cpu->exit_cbs.mutex);
+}
+
+void cpu_exit_callbacks_call_all(CPUState *cpu)
+{
+    CPUExitCB *cb, *next;
+
+    if (QTAILQ_EMPTY(&cpu->exit_cbs.exit_callbacks)) {
+        return;
+    }
+
+    QTAILQ_FOREACH_SAFE(cb, &cpu->exit_cbs.exit_callbacks, entry, next) {
+        cb->callback(cpu, cb->opaque);
+
+        /* one-shot callbacks, remove it after using it */
+        qemu_mutex_lock(&cpu->exit_cbs.mutex);
+        QTAILQ_REMOVE(&cpu->exit_cbs.exit_callbacks, cb, entry);
+        g_free(cb);
+        qemu_mutex_unlock(&cpu->exit_cbs.mutex);
+    }
+}
+
 void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
 {
     /* XXX: implement xxx_cpu_list for targets that still miss it */
diff --git a/exec.c b/exec.c
index 51958ed..322f2c6 100644
--- a/exec.c
+++ b/exec.c
@@ -531,6 +531,7 @@ void cpu_exec_init(CPUArchState *env)
     cpu->numa_node = 0;
     QTAILQ_INIT(&cpu->breakpoints);
     QTAILQ_INIT(&cpu->watchpoints);
+    QTAILQ_INIT(&cpu->exit_cbs.exit_callbacks);
 #ifndef CONFIG_USER_ONLY
     cpu->as = &address_space_memory;
     cpu->thread_id = qemu_get_thread_id();
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 8d121b3..0ec020b 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -201,6 +201,24 @@ typedef struct CPUWatchpoint {
     QTAILQ_ENTRY(CPUWatchpoint) entry;
 } CPUWatchpoint;
 
+/* vCPU exit callbacks */
+typedef void (*CPUExitCallback)(CPUState *cpu, void *opaque);
+struct CPUExitCBs {
+    QemuMutex mutex;
+    QTAILQ_HEAD(exit_callbacks_head, CPUExitCB) exit_callbacks;
+};
+
+typedef struct CPUExitCB {
+    CPUExitCallback callback;
+    void *opaque;
+
+    QTAILQ_ENTRY(CPUExitCB) entry;
+} CPUExitCB;
+
+void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
+                           void *opaque);
+void cpu_exit_callbacks_call_all(CPUState *cpu);
+
 /* Rendezvous support */
 #define TCG_RDV_POLLING_PERIOD 10
 typedef struct CpuExitRendezvous {
@@ -305,6 +323,8 @@ struct CPUState {
 
     void *opaque;
 
+    /* One-shot callbacks for stopping requests. */
+    struct CPUExitCBs exit_cbs;
     volatile int pending_rdv;
 
     /* In order to avoid passing too many arguments to the MMIO helpers,
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v3 10/13] Simple TLB flush wrap to use as exit callback
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (8 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support Alvise Rigo
@ 2015-07-10  8:23 ` Alvise Rigo
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 11/13] Introduce exit_flush_req and tcg_excl_access_lock Alvise Rigo
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 41+ messages in thread
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Add a simple wrapper to queue a TLB flush request for a given vCPU using the
new callback support.
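
A later patch in the series queues this callback from the vCPU that is about
to start an exclusive access, roughly as follows:

    if (!cpu->pending_tlb_flush) {
        /* Make 'cpu' flush its whole TLB before entering its next TB. */
        cpu->pending_tlb_flush = 1;
        cpu_exit_callback_add(cpu, cpu_exit_tlb_flush_all_cb, NULL);
    }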

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c          | 6 ++++++
 include/qom/cpu.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/cputlb.c b/cputlb.c
index fa38714..9794e6b 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -85,6 +85,12 @@ static void tlb_flush_async_work(void *opaque)
     g_free(params);
 }
 
+static void cpu_exit_tlb_flush_all_cb(CPUState *cpu, void *opaque)
+{
+    tlb_flush(cpu, 1);
+    cpu->pending_tlb_flush = 0;
+}
+
 void tlb_flush_all(int flush_global)
 {
     CPUState *cpu;
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 0ec020b..c5b93c9 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -326,6 +326,7 @@ struct CPUState {
     /* One-shot callbacks for stopping requests. */
     struct CPUExitCBs exit_cbs;
     volatile int pending_rdv;
+    volatile int pending_tlb_flush;
 
     /* In order to avoid passing too many arguments to the MMIO helpers,
      * we store some rarely used information in the CPU context.
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v3 11/13] Introduce exit_flush_req and tcg_excl_access_lock
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (9 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 10/13] Simple TLB flush wrap to use as exit callback Alvise Rigo
@ 2015-07-10  8:23 ` Alvise Rigo
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 12/13] softmmu_llsc_template.h: move to multithreading Alvise Rigo
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 41+ messages in thread
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Introduce two new variables to synchronize the vCPUs during atomic
operations.

- exit_flush_request allows one vCPU to make an exclusive flush request for all
  the running vCPUs
- tcg_excl_access_lock is a mutex that protects all the sensitive
  operations concerning atomic instruction emulation. Above all, the
  mutex is used to protect env->excl_protected_hwaddr (one mutex for
  all vCPUs); a short usage sketch follows below.
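
A minimal sketch of the intended locking discipline, as used by the last two
patches of the series (hw_addr stands for the guest physical address being
protected):

    /* Every read or update of the shared exclusive state goes through the
     * global mutex, e.g. when a LoadLink claims an address: */
    qemu_mutex_lock(&tcg_excl_access_lock);
    env->excl_protected_hwaddr = hw_addr;
    qemu_mutex_unlock(&tcg_excl_access_lock);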

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/cputlb.c b/cputlb.c
index 9794e6b..66df41a 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -39,6 +39,10 @@ void qemu_mutex_unlock_iothread(void);
 /* statistics */
 int tlb_flush_count;
 
+/* For atomic instruction handling. */
+volatile int exit_flush_request = 0;
+QemuMutex tcg_excl_access_lock;
+
 /* NOTE:
  * If flush_global is true (the usual case), flush all tlb entries.
  * If flush_global is false, flush (at least) all tlb entries not
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v3 12/13] softmmu_llsc_template.h: move to multithreading
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (10 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 11/13] Introduce exit_flush_req and tcg_excl_access_lock Alvise Rigo
@ 2015-07-10  8:23 ` Alvise Rigo
  2015-07-17 15:27   ` Alex Bennée
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 13/13] softmmu_template.h: " Alvise Rigo
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Update the TCG LL/SC instructions to work with multi-threading.

The basic idea remains untouched, but the whole mechanism is improved to
make use of the callback support to queue TLB flush requests and of the
rendezvous callback to synchronize all the currently running vCPUs.

In essence, if a vCPU wants to LL to a page which is not already set as
EXCL, it arranges a rendezvous with all the vCPUs that are executing a
TB and queues a TLB flush request for *all* the vCPUs.
Doing so, we make sure that:
- the running vCPUs do not touch the EXCL page while the requesting vCPU
  is marking the page as EXCL
- all the vCPUs will have the EXCL flag in the TLB entry for that
  specific page *before* entering the next TB

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                |  2 +
 include/exec/cpu-defs.h |  4 ++
 softmmu_llsc_template.h | 97 ++++++++++++++++++++++++++++++++-----------------
 3 files changed, 69 insertions(+), 34 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 66df41a..0566e0f 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -30,6 +30,8 @@
 #include "exec/ram_addr.h"
 #include "tcg/tcg.h"
 
+#include "sysemu/cpus.h"
+
 void qemu_mutex_lock_iothread(void);
 void qemu_mutex_unlock_iothread(void);
 
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index c73a75f..40742b3 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -169,5 +169,9 @@ typedef struct CPUIOTLBEntry {
     /* Used for atomic instruction translation. */                      \
     bool ll_sc_context;                                                 \
     hwaddr excl_protected_hwaddr;                                       \
+    /* Used to carry the stcond result and also as a flag to flag a
+     * normal store access made by a stcond. */                         \
+    int excl_succeeded;                                                 \
+
 
 #endif
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index 81e9d8e..4105e72 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -54,7 +54,21 @@
                  (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
 })                                                                           \
 
-#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+#define is_read_tlb_entry_set(env, page, index)                              \
+({                                                                           \
+    (addr & TARGET_PAGE_MASK)                                                \
+         == ((env->tlb_table[mmu_idx][index].addr_read) &                    \
+                 (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
+})                                                                           \
+
+/* Whenever a SC operation fails, we add a small delay to reduce the
+ * concurrency among the atomic instruction emulation code. Without this delay,
+ * in very congested situation where plain stores make all the pending LLs
+ * fail, the code could reach a stalling situation in which all the SCs happen
+ * to fail.
+ * TODO: make the delay dynamic according with the SC failing rate.
+ * */
+#define TCG_ATOMIC_INSN_EMUL_DELAY 100
 
 WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
                                 TCGMemOpIdx oi, uintptr_t retaddr)
@@ -65,35 +79,58 @@ WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
     hwaddr hw_addr;
     unsigned mmu_idx = get_mmuidx(oi);
 
-    /* Use the proper load helper from cpu_ldst.h */
-    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
-
-    /* The last legacy access ensures that the TLB and IOTLB entry for 'addr'
-     * have been created. */
     index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    if (!is_read_tlb_entry_set(env, addr, index)) {
+        tlb_fill(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+    }
 
     /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
      * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
     hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
 
     /* Set the exclusive-protected hwaddr. */
-    env->excl_protected_hwaddr = hw_addr;
-    env->ll_sc_context = true;
+    qemu_mutex_lock(&tcg_excl_access_lock);
+    if (cpu_physical_memory_excl_is_dirty(hw_addr) && !exit_flush_request) {
+        exit_flush_request = 1;
 
-    /* No need to mask hw_addr with TARGET_PAGE_MASK since
-     * cpu_physical_memory_excl_is_dirty() will take care of that. */
-    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
-        cpu_physical_memory_clear_excl_dirty(hw_addr);
+        qemu_mutex_unlock(&tcg_excl_access_lock);
+
+        cpu_exit_init_rendezvous();
 
-        /* Invalidate the TLB entry for the other processors. The next TLB
-         * entries for this page will have the TLB_EXCL flag set. */
         CPU_FOREACH(cpu) {
-            if (cpu != current_cpu) {
-                tlb_flush(cpu, 1);
+            if ((cpu->thread_id != qemu_get_thread_id())) {
+                if (!cpu->pending_tlb_flush) {
+                    /* Flush the TLB cache before executing the next TB. */
+                    cpu->pending_tlb_flush = 1;
+                    cpu_exit_callback_add(cpu, cpu_exit_tlb_flush_all_cb, NULL);
+                }
+                if (cpu->tcg_executing) {
+                    /* We want to wait all the vCPUs that are running in this
+                     * exact moment.
+                     * Add a callback to be executed as soon as the vCPU exits
+                     * from the current TB. Force it to exit. */
+                    cpu_exit_rendezvous_add_attendee(cpu);
+                    qemu_cpu_kick_thread(cpu);
+                }
             }
         }
+
+        cpu_exit_rendezvous_wait_others(ENV_GET_CPU(env));
+
+        exit_flush_request = 0;
+
+        qemu_mutex_lock(&tcg_excl_access_lock);
+        cpu_physical_memory_clear_excl_dirty(hw_addr);
     }
 
+    env->ll_sc_context = true;
+
+    /* Use the proper load helper from cpu_ldst.h */
+    env->excl_protected_hwaddr = hw_addr;
+    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
+
+    qemu_mutex_unlock(&tcg_excl_access_lock);
+
     /* For this vCPU, just update the TLB entry, no need to flush. */
     env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
 
@@ -106,7 +143,6 @@ WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
 {
     WORD_TYPE ret;
     int index;
-    hwaddr hw_addr;
     unsigned mmu_idx = get_mmuidx(oi);
 
     /* If the TLB entry is not the right one, create it. */
@@ -115,29 +151,22 @@ WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
         tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
     }
 
-    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
-     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
-    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
-
-    if (!env->ll_sc_context) {
-        /* No LoakLink has been set, the StoreCond has to fail. */
-        return 1;
-    }
-
     env->ll_sc_context = 0;
 
-    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
-        /* Another vCPU has accessed the memory after the LoadLink. */
-        ret = 1;
-    } else {
-        helper_st_legacy(env, addr, val, mmu_idx, retaddr);
+    /* We set it preventively to true to distinguish the following legacy
+     * access as one made by the store conditional wrapper. If the store
+     * conditional does not succeed, the value will be set to 0.*/
+    env->excl_succeeded = 1;
+    helper_st_legacy(env, addr, val, mmu_idx, retaddr);
 
-        /* The StoreConditional succeeded */
+    if (env->excl_succeeded) {
+        env->excl_succeeded = 0;
         ret = 0;
+    } else {
+        g_usleep(TCG_ATOMIC_INSN_EMUL_DELAY);
+        ret = 1;
     }
 
-    env->tlb_table[mmu_idx][index].addr_write &= ~TLB_EXCL;
-    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
     /* It's likely that the page will be used again for exclusive accesses,
      * for this reason we don't flush any TLB cache at the price of some
      * additional slow paths and we don't set the page bit as dirty.
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v3 13/13] softmmu_template.h: move to multithreading
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (11 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 12/13] softmmu_llsc_template.h: move to multithreading Alvise Rigo
@ 2015-07-10  8:23 ` Alvise Rigo
  2015-07-17 15:57   ` Alex Bennée
  2015-07-10  8:31 ` [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Mark Burton
  2015-07-10  8:39 ` Frederic Konrad
  14 siblings, 1 reply; 41+ messages in thread
From: Alvise Rigo @ 2015-07-10  8:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Exploiting tcg_excl_access_lock, port helper_{le,be}_st_name to work
with real multithreading.

- The macro lookup_cpus_ll_addr now directly uses
  env->excl_protected_hwaddr to invalidate the other vCPUs' LL/SC
  operations (see the sketch below)
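
Together with the previous patch, the StoreConditional outcome travels through
the new excl_succeeded field; condensed, the two sides of the handshake look
roughly like this (a sketch of the code already in the hunks, not an addition
to it):

    /* StoreConditional helper (softmmu_llsc_template.h, previous patch): */
    env->excl_succeeded = 1;            /* mark: this store comes from an SC */
    helper_st_legacy(env, addr, val, mmu_idx, retaddr);
    ret = env->excl_succeeded ? 0 : 1;  /* 0 = SC succeeded, 1 = SC failed */

    /* Store slow path (this patch), under tcg_excl_access_lock: */
    if (env->excl_succeeded &&
        (cpu_physical_memory_excl_is_dirty(hw_addr) ||
         env->excl_protected_hwaddr != hw_addr)) {
        env->excl_succeeded = 0;        /* another vCPU broke the link */
        return;                         /* the guest store is not performed */
    }
    /* ...otherwise perform the store and reset excl_protected_hwaddr. */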

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 softmmu_template.h | 110 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 89 insertions(+), 21 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index bc767f6..522454f 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -141,21 +141,24 @@
     vidx >= 0;                                                                \
 })
 
+#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+
+/* This macro requires the caller to have the tcg_excl_access_lock lock since
+ * it modifies the excl_protected_hwaddr of a running vCPU.
+ * The macro scans the excl_protected_hwaddr of all the vCPUs and compares
+ * them with the address the current vCPU is writing to. If there is a match,
+ * we reset the value, making the SC fail. */
 #define lookup_cpus_ll_addr(addr)                                             \
 ({                                                                            \
     CPUState *cpu;                                                            \
     CPUArchState *acpu;                                                       \
-    bool hit = false;                                                         \
                                                                               \
     CPU_FOREACH(cpu) {                                                        \
         acpu = (CPUArchState *)cpu->env_ptr;                                  \
         if (cpu != current_cpu && acpu->excl_protected_hwaddr == addr) {      \
-            hit = true;                                                       \
-            break;                                                            \
+            acpu->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;               \
         }                                                                     \
     }                                                                         \
-                                                                              \
-    hit;                                                                      \
 })
 
 #ifndef SOFTMMU_CODE_ACCESS
@@ -439,18 +442,52 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
              * exclusive-protected memory. */
             hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
 
-            bool set_to_dirty;
-
             /* Two cases of invalidation: the current vCPU is writing to another
              * vCPU's exclusive address or the vCPU that issued the LoadLink is
              * writing to it, but not through a StoreCond. */
-            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
-            set_to_dirty |= env->ll_sc_context &&
-                           (env->excl_protected_hwaddr == hw_addr);
+            qemu_mutex_lock(&tcg_excl_access_lock);
+
+            /* The macro lookup_cpus_ll_addr could have reset the exclusive
+             * address. Fail the SC in this case.
+             * N.B.: Here excl_succeeded == 0 means that we don't come from a
+             * store conditional.  */
+            if (env->excl_succeeded &&
+                        (env->excl_protected_hwaddr == EXCLUSIVE_RESET_ADDR)) {
+                env->excl_succeeded = 0;
+                qemu_mutex_unlock(&tcg_excl_access_lock);
+
+                return;
+            }
+
+            lookup_cpus_ll_addr(hw_addr);
+
+            if (!env->excl_succeeded) {
+                if (env->ll_sc_context &&
+                            (env->excl_protected_hwaddr == hw_addr)) {
+                    cpu_physical_memory_set_excl_dirty(hw_addr);
+                }
+            } else {
+                if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
+                        env->excl_protected_hwaddr != hw_addr) {
+                    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
+                    qemu_mutex_unlock(&tcg_excl_access_lock);
+                    env->excl_succeeded = 0;
+
+                    return;
+                }
+            }
+
+            haddr = addr + env->tlb_table[mmu_idx][index].addend;
+        #if DATA_SIZE == 1
+            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+        #else
+            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+        #endif
+
+            env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
+            qemu_mutex_unlock(&tcg_excl_access_lock);
 
-            if (set_to_dirty) {
-                cpu_physical_memory_set_excl_dirty(hw_addr);
-            } /* the vCPU is legitimately writing to the protected address */
+            return;
         } else {
             if ((addr & (DATA_SIZE - 1)) != 0) {
                 goto do_unaligned_access;
@@ -537,18 +574,49 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
              * exclusive-protected memory. */
             hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
 
-            bool set_to_dirty;
-
             /* Two cases of invalidation: the current vCPU is writing to another
              * vCPU's exclusive address or the vCPU that issued the LoadLink is
              * writing to it, but not through a StoreCond. */
-            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
-            set_to_dirty |= env->ll_sc_context &&
-                           (env->excl_protected_hwaddr == hw_addr);
+            qemu_mutex_lock(&tcg_excl_access_lock);
+
+            /* The macro lookup_cpus_ll_addr could have reset the exclusive
+             * address. Fail the SC in this case.
+             * N.B.: Here excl_succeeded == 0 means that we don't come from a
+             * store conditional.  */
+            if (env->excl_succeeded &&
+                        (env->excl_protected_hwaddr == EXCLUSIVE_RESET_ADDR)) {
+                env->excl_succeeded = 0;
+                qemu_mutex_unlock(&tcg_excl_access_lock);
+
+                return;
+            }
+
+            lookup_cpus_ll_addr(hw_addr);
+
+            if (!env->excl_succeeded) {
+                if (env->ll_sc_context &&
+                            (env->excl_protected_hwaddr == hw_addr)) {
+                    cpu_physical_memory_set_excl_dirty(hw_addr);
+                }
+            } else {
+                if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
+                        env->excl_protected_hwaddr != hw_addr) {
+                    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
+                    qemu_mutex_unlock(&tcg_excl_access_lock);
+                    env->excl_succeeded = 0;
+
+                    return;
+                }
+            }
 
-            if (set_to_dirty) {
-                cpu_physical_memory_set_excl_dirty(hw_addr);
-            } /* the vCPU is legitimately writing to the protected address */
+            haddr = addr + env->tlb_table[mmu_idx][index].addend;
+
+            glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
+
+            qemu_mutex_unlock(&tcg_excl_access_lock);
+            env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
+
+            return;
         } else {
             if ((addr & (DATA_SIZE - 1)) != 0) {
                 goto do_unaligned_access;
-- 
2.4.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (12 preceding siblings ...)
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 13/13] softmmu_template.h: " Alvise Rigo
@ 2015-07-10  8:31 ` Mark Burton
  2015-07-10  8:58   ` alvise rigo
  2015-07-10  8:39 ` Frederic Konrad
  14 siblings, 1 reply; 41+ messages in thread
From: Mark Burton @ 2015-07-10  8:31 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen,
	tech, alex.bennee

<big snip>
To be clear, for a normal user (e.g. they boot linux, they run some apps, etc)..., if they use only one core, is it true that they will see no difference in performance?
For a ‘normal user’ who does use multi-core, are you saying that a typical boot is slower?

Cheers

Mark.

> On 10 Jul 2015, at 10:23, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
> 
> * Performance considerations
> This implementation shows good results while booting a Linux kernel,
> where tons of flushes affect the overall performance. A complete ARM
> Linux boot, without any filesystem, requires 30% longer if compared to
> the mttcg implementation, benefiting however of being capable to offer
> the infrastructure to handle atomic instructions on any architecture.
> Instead compared to the current TCG upstream, it is 40% faster with four
> vCPUs and 2.1 times faster with 8 vCPUs.
> In addition, there is still margin to improve such performance, since at
> the moment TLB is flushed quite often, probably more than the required.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
  2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
                   ` (13 preceding siblings ...)
  2015-07-10  8:31 ` [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Mark Burton
@ 2015-07-10  8:39 ` Frederic Konrad
  2015-07-10  9:04   ` alvise rigo
  14 siblings, 1 reply; 41+ messages in thread
From: Frederic Konrad @ 2015-07-10  8:39 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

On 10/07/2015 10:23, Alvise Rigo wrote:
> This is the third iteration of the patch series; starting from PATCH 007
> there are the changes to move the whole work to multi-threading.
> Changes versus previous versions are at the bottom of this cover letter.
>
> This patch series provides an infrastructure for atomic
> instruction implementation in QEMU, paving the way for TCG multi-threading.
> The adopted design does not rely on host atomic
> instructions and is intended to propose a 'legacy' solution for
> translating guest atomic instructions.
>
> The underlying idea is to provide new TCG instructions that guarantee
> atomicity to some memory accesses or in general a way to define memory
> transactions. More specifically, a new pair of TCG instructions are
> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
> LoadLink and StoreConditional primitives (only 32 bit variant
> implemented).  In order to achieve this, a new bitmap is added to the
> ram_list structure (always unique) which flags all memory pages that
> could not be accessed directly through the fast-path, due to previous
> exclusive operations. This new bitmap is coupled with a new TLB flag
> which forces the slow-path execution. All stores which are performed
> between an LL/SC operation by other vCPUs to the same (protected) address
> will fail the subsequent StoreConditional.
>
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.
>
> The new slow-path is implemented such that:
> - the LoadLink behaves as a normal load slow-path, except for cleaning
>    the dirty flag in the bitmap. The TLB entries created from now on will
>    force the slow-path. To ensure it, we flush the TLB cache for the
>    other vCPUs. The vCPU also sets into a private variable the accessed
>    address, in order to make it visible to the other vCPUs
> - the StoreConditional behaves as a normal store slow-path, except for
>    checking whether other vCPUs have set the same exclusive address
>
> All those write accesses that are forced to follow the 'legacy'
> slow-path will set the accessed memory page to dirty.
>
> In this series only the ARM ldrex/strex instructions are implemented
> for ARM and i386 hosts.
> The code has been tested with bare-metal test cases and by booting Linux,
> using the latest mttcg QEMU branch available at
> http://git.greensocs.com/fkonrad/mttcg.git.
branch multi_tcg_v6 at this time.

>
> * Performance considerations
> This implementation shows good results while booting a Linux kernel,
> where tons of flushes affect the overall performance. A complete ARM
> Linux boot, without any filesystem, requires 30% longer if compared to
> the mttcg implementation, benefiting however of being capable to offer
> the infrastructure to handle atomic instructions on any architecture.
> Instead compared to the current TCG upstream, it is 40% faster with four
> vCPUs and 2.1 times faster with 8 vCPUs.
> In addition, there is still margin to improve such performance, since at
> the moment TLB is flushed quite often, probably more than the required.
>
> On the other hand, the test case
> https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git
> that heavily stresses the LL/SC mechanism but not so much the TLB-related
> part, performs up to 1.9 times faster with 8 cores and one million iterations
> if compared with the mttcg implementation.
>
> Changes from v2:
> - the bitmap accessors are now atomic
> - a rendezvous between vCPUs and a simple callback support before executing
>    a TB have been added to handle the TLB flush support
Isn't that exactly what my async_safe_work is supposed to do?

> - the softmmu_template and softmmu_llsc_template have been adapted to work
>    on real multi-threading
>
> Changes from v1:
> - The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
> - The way how the offset to access the bitmap is calculated has
>    been improved and fixed
> - A page to be set as dirty requires a vCPU to target the protected address
>    and not just an address in the page
> - Addressed comments from Richard Henderson to improve the logic in
>    softmmu_template.h and to simplify the methods generation through
>    softmmu_llsc_template.h
> - Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
>
> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>
> Alvise Rigo (13):
>    exec: Add new exclusive bitmap to ram_list
>    cputlb: Add new TLB_EXCL flag
>    softmmu: Add helpers for a new slow-path
>    tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>    target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>    target-i386: translate: implement qemu_ldlink and qemu_stcond ops
>    ram_addr.h: Make exclusive bitmap accessors atomic
>    exec.c: introduce a simple rendezvous support
>    cpus.c: introduce simple callback support
>    Simple TLB flush wrap to use as exit callback
>    Introduce exit_flush_req and tcg_excl_access_lock
>    softmmu_llsc_template.h: move to multithreading
>    softmmu_template.h: move to multithreading
>
>   cpus.c                  |  39 ++++++++
>   cputlb.c                |  33 +++++-
>   exec.c                  |  46 +++++++++
>   include/exec/cpu-all.h  |   2 +
>   include/exec/cpu-defs.h |   8 ++
>   include/exec/memory.h   |   3 +-
>   include/exec/ram_addr.h |  22 ++++
>   include/qom/cpu.h       |  37 +++++++
>   softmmu_llsc_template.h | 184 ++++++++++++++++++++++++++++++++++
>   softmmu_template.h      | 261 +++++++++++++++++++++++++++++++++++-------------
>   target-arm/translate.c  |  87 +++++++++++++++-
>   tcg/arm/tcg-target.c    | 121 ++++++++++++++++------
>   tcg/i386/tcg-target.c   | 136 +++++++++++++++++++++----
>   tcg/tcg-be-ldst.h       |   1 +
>   tcg/tcg-op.c            |  23 +++++
>   tcg/tcg-op.h            |   3 +
>   tcg/tcg-opc.h           |   4 +
>   tcg/tcg.c               |   2 +
>   tcg/tcg.h               |  20 ++++
>   19 files changed, 910 insertions(+), 122 deletions(-)
>   create mode 100644 softmmu_llsc_template.h
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
  2015-07-10  8:31 ` [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Mark Burton
@ 2015-07-10  8:58   ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-10  8:58 UTC (permalink / raw)
  To: Mark Burton
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

Hi Mark,

On Fri, Jul 10, 2015 at 10:31 AM, Mark Burton <mark.burton@greensocs.com> wrote:
> <big snip>
> To be clear, for a normal user (e.g. they boot linux, they run some apps, etc)..., if they use only one core, is it true that they will see no difference in performance?

I didn't test the one-core scenario, but I expect less than 10%
degradation with this solution compared to upstream (confirmed by a
quick test booting Linux).
Actually, the performance of these patches for a unicore system can be
improved; I will keep this in mind for the next release.

> For a ‘normal user’ who does use multi-core, are you saying that a typical boot is slower?

Slower than what?

Regards,
alvise

>
> Cheers
>
> Mark.
>
>> On 10 Jul 2015, at 10:23, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>>
>> * Performance considerations
>> This implementation shows good results while booting a Linux kernel,
>> where tons of flushes affect the overall performance. A complete ARM
>> Linux boot, without any filesystem, requires 30% longer if compared to
>> the mttcg implementation, benefiting however of being capable to offer
>> the infrastructure to handle atomic instructions on any architecture.
>> Instead compared to the current TCG upstream, it is 40% faster with four
>> vCPUs and 2.1 times faster with 8 vCPUs.
>> In addition, there is still margin to improve such performance, since at
>> the moment TLB is flushed quite often, probably more than the required.
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
  2015-07-10  8:39 ` Frederic Konrad
@ 2015-07-10  9:04   ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-10  9:04 UTC (permalink / raw)
  To: Frederic Konrad
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Fri, Jul 10, 2015 at 10:39 AM, Frederic Konrad
<fred.konrad@greensocs.com> wrote:
> On 10/07/2015 10:23, Alvise Rigo wrote:
>>
>> This is the third iteration of the patch series; starting from PATCH 007
>> there are the changes to move the whole work to multi-threading.
>> Changes versus previous versions are at the bottom of this cover letter.
>>
>> This patch series provides an infrastructure for atomic
>> instruction implementation in QEMU, paving the way for TCG
>> multi-threading.
>> The adopted design does not rely on host atomic
>> instructions and is intended to propose a 'legacy' solution for
>> translating guest atomic instructions.
>>
>> The underlying idea is to provide new TCG instructions that guarantee
>> atomicity to some memory accesses or in general a way to define memory
>> transactions. More specifically, a new pair of TCG instructions are
>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>> LoadLink and StoreConditional primitives (only 32 bit variant
>> implemented).  In order to achieve this, a new bitmap is added to the
>> ram_list structure (always unique) which flags all memory pages that
>> could not be accessed directly through the fast-path, due to previous
>> exclusive operations. This new bitmap is coupled with a new TLB flag
>> which forces the slow-path execution. All stores which are performed
>> between an LL/SC operation by other vCPUs to the same (protected) address
>> will fail the subsequent StoreConditional.
>>
>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>> can be used to properly handle atomic instructions on any architecture.
>>
>> The new slow-path is implemented such that:
>> - the LoadLink behaves as a normal load slow-path, except for cleaning
>>    the dirty flag in the bitmap. The TLB entries created from now on will
>>    force the slow-path. To ensure it, we flush the TLB cache for the
>>    other vCPUs. The vCPU also sets into a private variable the accessed
>>    address, in order to make it visible to the other vCPUs
>> - the StoreConditional behaves as a normal store slow-path, except for
>>    checking whether other vCPUs have set the same exclusive address
>>
>> All those write accesses that are forced to follow the 'legacy'
>> slow-path will set the accessed memory page to dirty.
>>
>> In this series only the ARM ldrex/strex instructions are implemented
>> for ARM and i386 hosts.
>> The code has been tested with bare-metal test cases and by booting Linux,
>> using the latest mttcg QEMU branch available at
>> http://git.greensocs.com/fkonrad/mttcg.git.
>
> branch multi_tcg_v6 at this time.
>
>>
>> * Performance considerations
>> This implementation shows good results while booting a Linux kernel,
>> where tons of flushes affect the overall performance. A complete ARM
>> Linux boot, without any filesystem, requires 30% longer if compared to
>> the mttcg implementation, benefiting however of being capable to offer
>> the infrastructure to handle atomic instructions on any architecture.
>> Instead compared to the current TCG upstream, it is 40% faster with four
>> vCPUs and 2.1 times faster with 8 vCPUs.
>> In addition, there is still margin to improve such performance, since at
>> the moment TLB is flushed quite often, probably more than the required.
>>
>> On the other hand, the test case
>> https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git
>> that heavily stresses the LL/SC mechanism but not so much the TLB-related
>> part, performs up to 1.9 times faster with 8 cores and one million
>> iterations if compared with the mttcg implementation.
>>
>> Changes from v2:
>> - the bitmap accessors are now atomic
>> - a rendezvous between vCPUs and a simple callback support before
>> executing
>>    a TB have been added to handle the TLB flush support
>
> Isn't that exactly what my async_safe_work is supposed to do?

Hi Frederic,

I started this implementation from your v4 and missed this feature
while porting to v6.
I think it's doable, and it will make things simpler and cleaner.

Thank you,
alvise

>
>
>> - the softmmu_template and softmmu_llsc_template have been adapted to work
>>    on real multi-threading
>>
>> Changes from v1:
>> - The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
>> - The way how the offset to access the bitmap is calculated has
>>    been improved and fixed
>> - A page to be set as dirty requires a vCPU to target the protected
>> address
>>    and not just an address in the page
>> - Addressed comments from Richard Henderson to improve the logic in
>>    softmmu_template.h and to simplify the methods generation through
>>    softmmu_llsc_template.h
>> - Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
>>
>> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>>
>> Alvise Rigo (13):
>>    exec: Add new exclusive bitmap to ram_list
>>    cputlb: Add new TLB_EXCL flag
>>    softmmu: Add helpers for a new slow-path
>>    tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>    target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>>    target-i386: translate: implement qemu_ldlink and qemu_stcond ops
>>    ram_addr.h: Make exclusive bitmap accessors atomic
>>    exec.c: introduce a simple rendezvous support
>>    cpus.c: introduce simple callback support
>>    Simple TLB flush wrap to use as exit callback
>>    Introduce exit_flush_req and tcg_excl_access_lock
>>    softmmu_llsc_template.h: move to multithreading
>>    softmmu_template.h: move to multithreading
>>
>>   cpus.c                  |  39 ++++++++
>>   cputlb.c                |  33 +++++-
>>   exec.c                  |  46 +++++++++
>>   include/exec/cpu-all.h  |   2 +
>>   include/exec/cpu-defs.h |   8 ++
>>   include/exec/memory.h   |   3 +-
>>   include/exec/ram_addr.h |  22 ++++
>>   include/qom/cpu.h       |  37 +++++++
>>   softmmu_llsc_template.h | 184 ++++++++++++++++++++++++++++++++++
>>   softmmu_template.h      | 261
>> +++++++++++++++++++++++++++++++++++-------------
>>   target-arm/translate.c  |  87 +++++++++++++++-
>>   tcg/arm/tcg-target.c    | 121 ++++++++++++++++------
>>   tcg/i386/tcg-target.c   | 136 +++++++++++++++++++++----
>>   tcg/tcg-be-ldst.h       |   1 +
>>   tcg/tcg-op.c            |  23 +++++
>>   tcg/tcg-op.h            |   3 +
>>   tcg/tcg-opc.h           |   4 +
>>   tcg/tcg.c               |   2 +
>>   tcg/tcg.h               |  20 ++++
>>   19 files changed, 910 insertions(+), 122 deletions(-)
>>   create mode 100644 softmmu_llsc_template.h
>>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support Alvise Rigo
@ 2015-07-10  9:36   ` Paolo Bonzini
  2015-07-10  9:47     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2015-07-10  9:36 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, jani.kokkonen, tech, alex.bennee



On 10/07/2015 10:23, Alvise Rigo wrote:
> In order to perfom "lazy" TLB invalidation requests, introduce a
> queue of callbacks at every vCPU disposal that will be fired just
> before entering the next TB.
> 
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>

Why is async_run_on_cpu not enough?

Paolo

> ---
>  cpus.c            | 34 ++++++++++++++++++++++++++++++++++
>  exec.c            |  1 +
>  include/qom/cpu.h | 20 ++++++++++++++++++++
>  3 files changed, 55 insertions(+)
> 
> diff --git a/cpus.c b/cpus.c
> index f4d938e..b9f0329 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1421,6 +1421,7 @@ static int tcg_cpu_exec(CPUArchState *env)
>          cpu->icount_extra = count;
>      }
>      qemu_mutex_unlock_iothread();
> +    cpu_exit_callbacks_call_all(cpu);
>      ret = cpu_exec(env);
>      cpu->tcg_executing = 0;
>  
> @@ -1469,6 +1470,39 @@ static void tcg_exec_all(CPUState *cpu)
>      cpu->exit_request = 0;
>  }
>  
> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
> +                           void *opaque)
> +{
> +    CPUExitCB *cb;
> +
> +    cb = g_malloc(sizeof(*cb));
> +    cb->callback = callback;
> +    cb->opaque = opaque;
> +
> +    qemu_mutex_lock(&cpu->exit_cbs.mutex);
> +    QTAILQ_INSERT_TAIL(&cpu->exit_cbs.exit_callbacks, cb, entry);
> +    qemu_mutex_unlock(&cpu->exit_cbs.mutex);
> +}
> +
> +void cpu_exit_callbacks_call_all(CPUState *cpu)
> +{
> +    CPUExitCB *cb, *next;
> +
> +    if (QTAILQ_EMPTY(&cpu->exit_cbs.exit_callbacks)) {
> +        return;
> +    }
> +
> +    QTAILQ_FOREACH_SAFE(cb, &cpu->exit_cbs.exit_callbacks, entry, next) {
> +        cb->callback(cpu, cb->opaque);
> +
> +        /* one-shot callbacks, remove it after using it */
> +        qemu_mutex_lock(&cpu->exit_cbs.mutex);
> +        QTAILQ_REMOVE(&cpu->exit_cbs.exit_callbacks, cb, entry);
> +        g_free(cb);
> +        qemu_mutex_unlock(&cpu->exit_cbs.mutex);
> +    }
> +}
> +
>  void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
>  {
>      /* XXX: implement xxx_cpu_list for targets that still miss it */
> diff --git a/exec.c b/exec.c
> index 51958ed..322f2c6 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -531,6 +531,7 @@ void cpu_exec_init(CPUArchState *env)
>      cpu->numa_node = 0;
>      QTAILQ_INIT(&cpu->breakpoints);
>      QTAILQ_INIT(&cpu->watchpoints);
> +    QTAILQ_INIT(&cpu->exit_cbs.exit_callbacks);
>  #ifndef CONFIG_USER_ONLY
>      cpu->as = &address_space_memory;
>      cpu->thread_id = qemu_get_thread_id();
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 8d121b3..0ec020b 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -201,6 +201,24 @@ typedef struct CPUWatchpoint {
>      QTAILQ_ENTRY(CPUWatchpoint) entry;
>  } CPUWatchpoint;
>  
> +/* vCPU exit callbacks */
> +typedef void (*CPUExitCallback)(CPUState *cpu, void *opaque);
> +struct CPUExitCBs {
> +    QemuMutex mutex;
> +    QTAILQ_HEAD(exit_callbacks_head, CPUExitCB) exit_callbacks;
> +};
> +
> +typedef struct CPUExitCB {
> +    CPUExitCallback callback;
> +    void *opaque;
> +
> +    QTAILQ_ENTRY(CPUExitCB) entry;
> +} CPUExitCB;
> +
> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
> +                           void *opaque);
> +void cpu_exit_callbacks_call_all(CPUState *cpu);
> +
>  /* Rendezvous support */
>  #define TCG_RDV_POLLING_PERIOD 10
>  typedef struct CpuExitRendezvous {
> @@ -305,6 +323,8 @@ struct CPUState {
>  
>      void *opaque;
>  
> +    /* One-shot callbacks for stopping requests. */
> +    struct CPUExitCBs exit_cbs;
>      volatile int pending_rdv;
>  
>      /* In order to avoid passing too many arguments to the MMIO helpers,
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support
  2015-07-10  9:36   ` Paolo Bonzini
@ 2015-07-10  9:47     ` alvise rigo
  2015-07-10  9:53       ` Frederic Konrad
  2015-07-10 10:24       ` Paolo Bonzini
  0 siblings, 2 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-10  9:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
	VirtualOpenSystems Technical Team, Alex Bennée

I tried to use it, but it would then create a deadlock at a very early
stage of the stress test.
The problem is likely related to the fact that flush_queued_work
happens with the global mutex locked.

As Frederic suggested, we can use the newly introduced
flush_queued_safe_work for this.

Regards,
alvise

On Fri, Jul 10, 2015 at 11:36 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 10/07/2015 10:23, Alvise Rigo wrote:
>> In order to perfom "lazy" TLB invalidation requests, introduce a
>> queue of callbacks at every vCPU disposal that will be fired just
>> before entering the next TB.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>
> Why is async_run_on_cpu not enough?
>
> Paolo
>
>> ---
>>  cpus.c            | 34 ++++++++++++++++++++++++++++++++++
>>  exec.c            |  1 +
>>  include/qom/cpu.h | 20 ++++++++++++++++++++
>>  3 files changed, 55 insertions(+)
>>
>> diff --git a/cpus.c b/cpus.c
>> index f4d938e..b9f0329 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -1421,6 +1421,7 @@ static int tcg_cpu_exec(CPUArchState *env)
>>          cpu->icount_extra = count;
>>      }
>>      qemu_mutex_unlock_iothread();
>> +    cpu_exit_callbacks_call_all(cpu);
>>      ret = cpu_exec(env);
>>      cpu->tcg_executing = 0;
>>
>> @@ -1469,6 +1470,39 @@ static void tcg_exec_all(CPUState *cpu)
>>      cpu->exit_request = 0;
>>  }
>>
>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>> +                           void *opaque)
>> +{
>> +    CPUExitCB *cb;
>> +
>> +    cb = g_malloc(sizeof(*cb));
>> +    cb->callback = callback;
>> +    cb->opaque = opaque;
>> +
>> +    qemu_mutex_lock(&cpu->exit_cbs.mutex);
>> +    QTAILQ_INSERT_TAIL(&cpu->exit_cbs.exit_callbacks, cb, entry);
>> +    qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>> +}
>> +
>> +void cpu_exit_callbacks_call_all(CPUState *cpu)
>> +{
>> +    CPUExitCB *cb, *next;
>> +
>> +    if (QTAILQ_EMPTY(&cpu->exit_cbs.exit_callbacks)) {
>> +        return;
>> +    }
>> +
>> +    QTAILQ_FOREACH_SAFE(cb, &cpu->exit_cbs.exit_callbacks, entry, next) {
>> +        cb->callback(cpu, cb->opaque);
>> +
>> +        /* one-shot callbacks, remove it after using it */
>> +        qemu_mutex_lock(&cpu->exit_cbs.mutex);
>> +        QTAILQ_REMOVE(&cpu->exit_cbs.exit_callbacks, cb, entry);
>> +        g_free(cb);
>> +        qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>> +    }
>> +}
>> +
>>  void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
>>  {
>>      /* XXX: implement xxx_cpu_list for targets that still miss it */
>> diff --git a/exec.c b/exec.c
>> index 51958ed..322f2c6 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -531,6 +531,7 @@ void cpu_exec_init(CPUArchState *env)
>>      cpu->numa_node = 0;
>>      QTAILQ_INIT(&cpu->breakpoints);
>>      QTAILQ_INIT(&cpu->watchpoints);
>> +    QTAILQ_INIT(&cpu->exit_cbs.exit_callbacks);
>>  #ifndef CONFIG_USER_ONLY
>>      cpu->as = &address_space_memory;
>>      cpu->thread_id = qemu_get_thread_id();
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 8d121b3..0ec020b 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -201,6 +201,24 @@ typedef struct CPUWatchpoint {
>>      QTAILQ_ENTRY(CPUWatchpoint) entry;
>>  } CPUWatchpoint;
>>
>> +/* vCPU exit callbacks */
>> +typedef void (*CPUExitCallback)(CPUState *cpu, void *opaque);
>> +struct CPUExitCBs {
>> +    QemuMutex mutex;
>> +    QTAILQ_HEAD(exit_callbacks_head, CPUExitCB) exit_callbacks;
>> +};
>> +
>> +typedef struct CPUExitCB {
>> +    CPUExitCallback callback;
>> +    void *opaque;
>> +
>> +    QTAILQ_ENTRY(CPUExitCB) entry;
>> +} CPUExitCB;
>> +
>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>> +                           void *opaque);
>> +void cpu_exit_callbacks_call_all(CPUState *cpu);
>> +
>>  /* Rendezvous support */
>>  #define TCG_RDV_POLLING_PERIOD 10
>>  typedef struct CpuExitRendezvous {
>> @@ -305,6 +323,8 @@ struct CPUState {
>>
>>      void *opaque;
>>
>> +    /* One-shot callbacks for stopping requests. */
>> +    struct CPUExitCBs exit_cbs;
>>      volatile int pending_rdv;
>>
>>      /* In order to avoid passing too many arguments to the MMIO helpers,
>>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support
  2015-07-10  9:47     ` alvise rigo
@ 2015-07-10  9:53       ` Frederic Konrad
  2015-07-10 10:06         ` alvise rigo
  2015-07-10 10:24       ` Paolo Bonzini
  1 sibling, 1 reply; 41+ messages in thread
From: Frederic Konrad @ 2015-07-10  9:53 UTC (permalink / raw)
  To: alvise rigo, Paolo Bonzini
  Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
	VirtualOpenSystems Technical Team, Alex Bennée

On 10/07/2015 11:47, alvise rigo wrote:
> I tried to use it, but it would then create a deadlock at a very early
> stage of the stress test.
> The problem is likely related to the fact that flush_queued_work
> happens with the global mutex locked.
>
> As Frederick suggested, we can use the newly introduced
> flush_queued_safe_work for this.
>
> Regards,
> alvise

It depends on the purpose.

async safe work requires all vCPUs to be exited (e.g. like your
rendez-vous).
async work doesn't; it just does the work when the vCPU is outside
cpu-exec().

Theoretically this is required only when a vCPU flushes the TLB of
another vCPU.
That's the behaviour in tlb_flush_all and tlb_page_flush_all. The
"normal" tlb_flush should just work, as it only touches its own
CPUState.
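
(For illustration only -- not from this series -- a minimal sketch of the
deferred, per-vCPU flush, assuming the 2015-era prototypes
async_run_on_cpu(CPUState *, void (*)(void *), void *) and
tlb_flush(CPUState *, int); the function names below are made up.)

    /* Runs later in the target vCPU's own thread, outside cpu_exec(). */
    static void deferred_tlb_flush(void *data)
    {
        CPUState *cpu = data;

        tlb_flush(cpu, 1);
    }

    /* Ask another vCPU to flush its TLB without forcing every vCPU out
     * of cpu_exec(), which is what the "safe" variant would require. */
    static void request_remote_tlb_flush(CPUState *target)
    {
        async_run_on_cpu(target, deferred_tlb_flush, target);
    }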

Fred

> On Fri, Jul 10, 2015 at 11:36 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 10/07/2015 10:23, Alvise Rigo wrote:
>>> In order to perfom "lazy" TLB invalidation requests, introduce a
>>> queue of callbacks at every vCPU disposal that will be fired just
>>> before entering the next TB.
>>>
>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> Why is async_run_on_cpu not enough?
>>
>> Paolo
>>
>>> ---
>>>   cpus.c            | 34 ++++++++++++++++++++++++++++++++++
>>>   exec.c            |  1 +
>>>   include/qom/cpu.h | 20 ++++++++++++++++++++
>>>   3 files changed, 55 insertions(+)
>>>
>>> diff --git a/cpus.c b/cpus.c
>>> index f4d938e..b9f0329 100644
>>> --- a/cpus.c
>>> +++ b/cpus.c
>>> @@ -1421,6 +1421,7 @@ static int tcg_cpu_exec(CPUArchState *env)
>>>           cpu->icount_extra = count;
>>>       }
>>>       qemu_mutex_unlock_iothread();
>>> +    cpu_exit_callbacks_call_all(cpu);
>>>       ret = cpu_exec(env);
>>>       cpu->tcg_executing = 0;
>>>
>>> @@ -1469,6 +1470,39 @@ static void tcg_exec_all(CPUState *cpu)
>>>       cpu->exit_request = 0;
>>>   }
>>>
>>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>>> +                           void *opaque)
>>> +{
>>> +    CPUExitCB *cb;
>>> +
>>> +    cb = g_malloc(sizeof(*cb));
>>> +    cb->callback = callback;
>>> +    cb->opaque = opaque;
>>> +
>>> +    qemu_mutex_lock(&cpu->exit_cbs.mutex);
>>> +    QTAILQ_INSERT_TAIL(&cpu->exit_cbs.exit_callbacks, cb, entry);
>>> +    qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>>> +}
>>> +
>>> +void cpu_exit_callbacks_call_all(CPUState *cpu)
>>> +{
>>> +    CPUExitCB *cb, *next;
>>> +
>>> +    if (QTAILQ_EMPTY(&cpu->exit_cbs.exit_callbacks)) {
>>> +        return;
>>> +    }
>>> +
>>> +    QTAILQ_FOREACH_SAFE(cb, &cpu->exit_cbs.exit_callbacks, entry, next) {
>>> +        cb->callback(cpu, cb->opaque);
>>> +
>>> +        /* one-shot callbacks, remove it after using it */
>>> +        qemu_mutex_lock(&cpu->exit_cbs.mutex);
>>> +        QTAILQ_REMOVE(&cpu->exit_cbs.exit_callbacks, cb, entry);
>>> +        g_free(cb);
>>> +        qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>>> +    }
>>> +}
>>> +
>>>   void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
>>>   {
>>>       /* XXX: implement xxx_cpu_list for targets that still miss it */
>>> diff --git a/exec.c b/exec.c
>>> index 51958ed..322f2c6 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -531,6 +531,7 @@ void cpu_exec_init(CPUArchState *env)
>>>       cpu->numa_node = 0;
>>>       QTAILQ_INIT(&cpu->breakpoints);
>>>       QTAILQ_INIT(&cpu->watchpoints);
>>> +    QTAILQ_INIT(&cpu->exit_cbs.exit_callbacks);
>>>   #ifndef CONFIG_USER_ONLY
>>>       cpu->as = &address_space_memory;
>>>       cpu->thread_id = qemu_get_thread_id();
>>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>>> index 8d121b3..0ec020b 100644
>>> --- a/include/qom/cpu.h
>>> +++ b/include/qom/cpu.h
>>> @@ -201,6 +201,24 @@ typedef struct CPUWatchpoint {
>>>       QTAILQ_ENTRY(CPUWatchpoint) entry;
>>>   } CPUWatchpoint;
>>>
>>> +/* vCPU exit callbacks */
>>> +typedef void (*CPUExitCallback)(CPUState *cpu, void *opaque);
>>> +struct CPUExitCBs {
>>> +    QemuMutex mutex;
>>> +    QTAILQ_HEAD(exit_callbacks_head, CPUExitCB) exit_callbacks;
>>> +};
>>> +
>>> +typedef struct CPUExitCB {
>>> +    CPUExitCallback callback;
>>> +    void *opaque;
>>> +
>>> +    QTAILQ_ENTRY(CPUExitCB) entry;
>>> +} CPUExitCB;
>>> +
>>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>>> +                           void *opaque);
>>> +void cpu_exit_callbacks_call_all(CPUState *cpu);
>>> +
>>>   /* Rendezvous support */
>>>   #define TCG_RDV_POLLING_PERIOD 10
>>>   typedef struct CpuExitRendezvous {
>>> @@ -305,6 +323,8 @@ struct CPUState {
>>>
>>>       void *opaque;
>>>
>>> +    /* One-shot callbacks for stopping requests. */
>>> +    struct CPUExitCBs exit_cbs;
>>>       volatile int pending_rdv;
>>>
>>>       /* In order to avoid passing too many arguments to the MMIO helpers,
>>>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support
  2015-07-10  9:53       ` Frederic Konrad
@ 2015-07-10 10:06         ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-10 10:06 UTC (permalink / raw)
  To: Frederic Konrad
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Fri, Jul 10, 2015 at 11:53 AM, Frederic Konrad
<fred.konrad@greensocs.com> wrote:
> On 10/07/2015 11:47, alvise rigo wrote:
>>
>> I tried to use it, but it would then create a deadlock at a very early
>> stage of the stress test.
>> The problem is likely related to the fact that flush_queued_work
>> happens with the global mutex locked.
>>
>> As Frederick suggested, we can use the newly introduced
>> flush_queued_safe_work for this.
>>
>> Regards,
>> alvise
>
>
> It depends on the purpose.
>
> async safe work will requires all VCPU to be exited (eg: like your
> rendez-vous).
> async work doesn't it will just do the work when it's outside cpu-exec().

I guess this should still work. If you look at helper_le_ldlink_name,
only the vCPUs in cpu-exec() receive a rendez-vous request, while *all*
vCPUs are asked to perform a TLB flush.

alvise

>
> Theorically this is required only when a VCPU flushes the TLB of an other
> VCPU.
> That's the behaviour in tlb_flush_all tlb_page_flush_all. The "normal"
> tlb_flush should
> just work as it only plays with it's own CPUState.
>
> Fred
>
>
>> On Fri, Jul 10, 2015 at 11:36 AM, Paolo Bonzini <pbonzini@redhat.com>
>> wrote:
>>>
>>>
>>> On 10/07/2015 10:23, Alvise Rigo wrote:
>>>>
>>>> In order to perfom "lazy" TLB invalidation requests, introduce a
>>>> queue of callbacks at every vCPU disposal that will be fired just
>>>> before entering the next TB.
>>>>
>>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>>
>>> Why is async_run_on_cpu not enough?
>>>
>>> Paolo
>>>
>>>> ---
>>>>   cpus.c            | 34 ++++++++++++++++++++++++++++++++++
>>>>   exec.c            |  1 +
>>>>   include/qom/cpu.h | 20 ++++++++++++++++++++
>>>>   3 files changed, 55 insertions(+)
>>>>
>>>> diff --git a/cpus.c b/cpus.c
>>>> index f4d938e..b9f0329 100644
>>>> --- a/cpus.c
>>>> +++ b/cpus.c
>>>> @@ -1421,6 +1421,7 @@ static int tcg_cpu_exec(CPUArchState *env)
>>>>           cpu->icount_extra = count;
>>>>       }
>>>>       qemu_mutex_unlock_iothread();
>>>> +    cpu_exit_callbacks_call_all(cpu);
>>>>       ret = cpu_exec(env);
>>>>       cpu->tcg_executing = 0;
>>>>
>>>> @@ -1469,6 +1470,39 @@ static void tcg_exec_all(CPUState *cpu)
>>>>       cpu->exit_request = 0;
>>>>   }
>>>>
>>>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>>>> +                           void *opaque)
>>>> +{
>>>> +    CPUExitCB *cb;
>>>> +
>>>> +    cb = g_malloc(sizeof(*cb));
>>>> +    cb->callback = callback;
>>>> +    cb->opaque = opaque;
>>>> +
>>>> +    qemu_mutex_lock(&cpu->exit_cbs.mutex);
>>>> +    QTAILQ_INSERT_TAIL(&cpu->exit_cbs.exit_callbacks, cb, entry);
>>>> +    qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>>>> +}
>>>> +
>>>> +void cpu_exit_callbacks_call_all(CPUState *cpu)
>>>> +{
>>>> +    CPUExitCB *cb, *next;
>>>> +
>>>> +    if (QTAILQ_EMPTY(&cpu->exit_cbs.exit_callbacks)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    QTAILQ_FOREACH_SAFE(cb, &cpu->exit_cbs.exit_callbacks, entry, next)
>>>> {
>>>> +        cb->callback(cpu, cb->opaque);
>>>> +
>>>> +        /* one-shot callbacks, remove it after using it */
>>>> +        qemu_mutex_lock(&cpu->exit_cbs.mutex);
>>>> +        QTAILQ_REMOVE(&cpu->exit_cbs.exit_callbacks, cb, entry);
>>>> +        g_free(cb);
>>>> +        qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>>>> +    }
>>>> +}
>>>> +
>>>>   void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char
>>>> *optarg)
>>>>   {
>>>>       /* XXX: implement xxx_cpu_list for targets that still miss it */
>>>> diff --git a/exec.c b/exec.c
>>>> index 51958ed..322f2c6 100644
>>>> --- a/exec.c
>>>> +++ b/exec.c
>>>> @@ -531,6 +531,7 @@ void cpu_exec_init(CPUArchState *env)
>>>>       cpu->numa_node = 0;
>>>>       QTAILQ_INIT(&cpu->breakpoints);
>>>>       QTAILQ_INIT(&cpu->watchpoints);
>>>> +    QTAILQ_INIT(&cpu->exit_cbs.exit_callbacks);
>>>>   #ifndef CONFIG_USER_ONLY
>>>>       cpu->as = &address_space_memory;
>>>>       cpu->thread_id = qemu_get_thread_id();
>>>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>>>> index 8d121b3..0ec020b 100644
>>>> --- a/include/qom/cpu.h
>>>> +++ b/include/qom/cpu.h
>>>> @@ -201,6 +201,24 @@ typedef struct CPUWatchpoint {
>>>>       QTAILQ_ENTRY(CPUWatchpoint) entry;
>>>>   } CPUWatchpoint;
>>>>
>>>> +/* vCPU exit callbacks */
>>>> +typedef void (*CPUExitCallback)(CPUState *cpu, void *opaque);
>>>> +struct CPUExitCBs {
>>>> +    QemuMutex mutex;
>>>> +    QTAILQ_HEAD(exit_callbacks_head, CPUExitCB) exit_callbacks;
>>>> +};
>>>> +
>>>> +typedef struct CPUExitCB {
>>>> +    CPUExitCallback callback;
>>>> +    void *opaque;
>>>> +
>>>> +    QTAILQ_ENTRY(CPUExitCB) entry;
>>>> +} CPUExitCB;
>>>> +
>>>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>>>> +                           void *opaque);
>>>> +void cpu_exit_callbacks_call_all(CPUState *cpu);
>>>> +
>>>>   /* Rendezvous support */
>>>>   #define TCG_RDV_POLLING_PERIOD 10
>>>>   typedef struct CpuExitRendezvous {
>>>> @@ -305,6 +323,8 @@ struct CPUState {
>>>>
>>>>       void *opaque;
>>>>
>>>> +    /* One-shot callbacks for stopping requests. */
>>>> +    struct CPUExitCBs exit_cbs;
>>>>       volatile int pending_rdv;
>>>>
>>>>       /* In order to avoid passing too many arguments to the MMIO
>>>> helpers,
>>>>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support
  2015-07-10  9:47     ` alvise rigo
  2015-07-10  9:53       ` Frederic Konrad
@ 2015-07-10 10:24       ` Paolo Bonzini
  2015-07-10 12:16         ` Frederic Konrad
  1 sibling, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2015-07-10 10:24 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
	VirtualOpenSystems Technical Team, Alex Bennée



On 10/07/2015 11:47, alvise rigo wrote:
> I tried to use it, but it would then create a deadlock at a very early
> stage of the stress test.
> The problem is likely related to the fact that flush_queued_work
> happens with the global mutex locked.

Let's fix that and move the global mutex inside the callbacks.  I can
take a look.

Paolo

> As Frederick suggested, we can use the newly introduced
> flush_queued_safe_work for this.
> 
> Regards,
> alvise
> 
> On Fri, Jul 10, 2015 at 11:36 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 10/07/2015 10:23, Alvise Rigo wrote:
>>> In order to perfom "lazy" TLB invalidation requests, introduce a
>>> queue of callbacks at every vCPU disposal that will be fired just
>>> before entering the next TB.
>>>
>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>
>> Why is async_run_on_cpu not enough?
>>
>> Paolo
>>
>>> ---
>>>  cpus.c            | 34 ++++++++++++++++++++++++++++++++++
>>>  exec.c            |  1 +
>>>  include/qom/cpu.h | 20 ++++++++++++++++++++
>>>  3 files changed, 55 insertions(+)
>>>
>>> diff --git a/cpus.c b/cpus.c
>>> index f4d938e..b9f0329 100644
>>> --- a/cpus.c
>>> +++ b/cpus.c
>>> @@ -1421,6 +1421,7 @@ static int tcg_cpu_exec(CPUArchState *env)
>>>          cpu->icount_extra = count;
>>>      }
>>>      qemu_mutex_unlock_iothread();
>>> +    cpu_exit_callbacks_call_all(cpu);
>>>      ret = cpu_exec(env);
>>>      cpu->tcg_executing = 0;
>>>
>>> @@ -1469,6 +1470,39 @@ static void tcg_exec_all(CPUState *cpu)
>>>      cpu->exit_request = 0;
>>>  }
>>>
>>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>>> +                           void *opaque)
>>> +{
>>> +    CPUExitCB *cb;
>>> +
>>> +    cb = g_malloc(sizeof(*cb));
>>> +    cb->callback = callback;
>>> +    cb->opaque = opaque;
>>> +
>>> +    qemu_mutex_lock(&cpu->exit_cbs.mutex);
>>> +    QTAILQ_INSERT_TAIL(&cpu->exit_cbs.exit_callbacks, cb, entry);
>>> +    qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>>> +}
>>> +
>>> +void cpu_exit_callbacks_call_all(CPUState *cpu)
>>> +{
>>> +    CPUExitCB *cb, *next;
>>> +
>>> +    if (QTAILQ_EMPTY(&cpu->exit_cbs.exit_callbacks)) {
>>> +        return;
>>> +    }
>>> +
>>> +    QTAILQ_FOREACH_SAFE(cb, &cpu->exit_cbs.exit_callbacks, entry, next) {
>>> +        cb->callback(cpu, cb->opaque);
>>> +
>>> +        /* one-shot callbacks, remove it after using it */
>>> +        qemu_mutex_lock(&cpu->exit_cbs.mutex);
>>> +        QTAILQ_REMOVE(&cpu->exit_cbs.exit_callbacks, cb, entry);
>>> +        g_free(cb);
>>> +        qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>>> +    }
>>> +}
>>> +
>>>  void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
>>>  {
>>>      /* XXX: implement xxx_cpu_list for targets that still miss it */
>>> diff --git a/exec.c b/exec.c
>>> index 51958ed..322f2c6 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -531,6 +531,7 @@ void cpu_exec_init(CPUArchState *env)
>>>      cpu->numa_node = 0;
>>>      QTAILQ_INIT(&cpu->breakpoints);
>>>      QTAILQ_INIT(&cpu->watchpoints);
>>> +    QTAILQ_INIT(&cpu->exit_cbs.exit_callbacks);
>>>  #ifndef CONFIG_USER_ONLY
>>>      cpu->as = &address_space_memory;
>>>      cpu->thread_id = qemu_get_thread_id();
>>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>>> index 8d121b3..0ec020b 100644
>>> --- a/include/qom/cpu.h
>>> +++ b/include/qom/cpu.h
>>> @@ -201,6 +201,24 @@ typedef struct CPUWatchpoint {
>>>      QTAILQ_ENTRY(CPUWatchpoint) entry;
>>>  } CPUWatchpoint;
>>>
>>> +/* vCPU exit callbacks */
>>> +typedef void (*CPUExitCallback)(CPUState *cpu, void *opaque);
>>> +struct CPUExitCBs {
>>> +    QemuMutex mutex;
>>> +    QTAILQ_HEAD(exit_callbacks_head, CPUExitCB) exit_callbacks;
>>> +};
>>> +
>>> +typedef struct CPUExitCB {
>>> +    CPUExitCallback callback;
>>> +    void *opaque;
>>> +
>>> +    QTAILQ_ENTRY(CPUExitCB) entry;
>>> +} CPUExitCB;
>>> +
>>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>>> +                           void *opaque);
>>> +void cpu_exit_callbacks_call_all(CPUState *cpu);
>>> +
>>>  /* Rendezvous support */
>>>  #define TCG_RDV_POLLING_PERIOD 10
>>>  typedef struct CpuExitRendezvous {
>>> @@ -305,6 +323,8 @@ struct CPUState {
>>>
>>>      void *opaque;
>>>
>>> +    /* One-shot callbacks for stopping requests. */
>>> +    struct CPUExitCBs exit_cbs;
>>>      volatile int pending_rdv;
>>>
>>>      /* In order to avoid passing too many arguments to the MMIO helpers,
>>>
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support
  2015-07-10 10:24       ` Paolo Bonzini
@ 2015-07-10 12:16         ` Frederic Konrad
  0 siblings, 0 replies; 41+ messages in thread
From: Frederic Konrad @ 2015-07-10 12:16 UTC (permalink / raw)
  To: Paolo Bonzini, alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
	VirtualOpenSystems Technical Team, Alex Bennée

On 10/07/2015 12:24, Paolo Bonzini wrote:
>
> On 10/07/2015 11:47, alvise rigo wrote:
>> I tried to use it, but it would then create a deadlock at a very early
>> stage of the stress test.
>> The problem is likely related to the fact that flush_queued_work
>> happens with the global mutex locked.
> Let's fix that and move the global mutex inside the callbacks.  I can
> take a look.
>
> Paolo

Meanwhile I can add a lock to protect the list, as you suggested, if I
remember right.

Fred

>> As Frederick suggested, we can use the newly introduced
>> flush_queued_safe_work for this.
>>
>> Regards,
>> alvise
>>
>> On Fri, Jul 10, 2015 at 11:36 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>> On 10/07/2015 10:23, Alvise Rigo wrote:
>>>> In order to perfom "lazy" TLB invalidation requests, introduce a
>>>> queue of callbacks at every vCPU disposal that will be fired just
>>>> before entering the next TB.
>>>>
>>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> Why is async_run_on_cpu not enough?
>>>
>>> Paolo
>>>
>>>> ---
>>>>   cpus.c            | 34 ++++++++++++++++++++++++++++++++++
>>>>   exec.c            |  1 +
>>>>   include/qom/cpu.h | 20 ++++++++++++++++++++
>>>>   3 files changed, 55 insertions(+)
>>>>
>>>> diff --git a/cpus.c b/cpus.c
>>>> index f4d938e..b9f0329 100644
>>>> --- a/cpus.c
>>>> +++ b/cpus.c
>>>> @@ -1421,6 +1421,7 @@ static int tcg_cpu_exec(CPUArchState *env)
>>>>           cpu->icount_extra = count;
>>>>       }
>>>>       qemu_mutex_unlock_iothread();
>>>> +    cpu_exit_callbacks_call_all(cpu);
>>>>       ret = cpu_exec(env);
>>>>       cpu->tcg_executing = 0;
>>>>
>>>> @@ -1469,6 +1470,39 @@ static void tcg_exec_all(CPUState *cpu)
>>>>       cpu->exit_request = 0;
>>>>   }
>>>>
>>>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>>>> +                           void *opaque)
>>>> +{
>>>> +    CPUExitCB *cb;
>>>> +
>>>> +    cb = g_malloc(sizeof(*cb));
>>>> +    cb->callback = callback;
>>>> +    cb->opaque = opaque;
>>>> +
>>>> +    qemu_mutex_lock(&cpu->exit_cbs.mutex);
>>>> +    QTAILQ_INSERT_TAIL(&cpu->exit_cbs.exit_callbacks, cb, entry);
>>>> +    qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>>>> +}
>>>> +
>>>> +void cpu_exit_callbacks_call_all(CPUState *cpu)
>>>> +{
>>>> +    CPUExitCB *cb, *next;
>>>> +
>>>> +    if (QTAILQ_EMPTY(&cpu->exit_cbs.exit_callbacks)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    QTAILQ_FOREACH_SAFE(cb, &cpu->exit_cbs.exit_callbacks, entry, next) {
>>>> +        cb->callback(cpu, cb->opaque);
>>>> +
>>>> +        /* one-shot callbacks, remove it after using it */
>>>> +        qemu_mutex_lock(&cpu->exit_cbs.mutex);
>>>> +        QTAILQ_REMOVE(&cpu->exit_cbs.exit_callbacks, cb, entry);
>>>> +        g_free(cb);
>>>> +        qemu_mutex_unlock(&cpu->exit_cbs.mutex);
>>>> +    }
>>>> +}
>>>> +
>>>>   void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
>>>>   {
>>>>       /* XXX: implement xxx_cpu_list for targets that still miss it */
>>>> diff --git a/exec.c b/exec.c
>>>> index 51958ed..322f2c6 100644
>>>> --- a/exec.c
>>>> +++ b/exec.c
>>>> @@ -531,6 +531,7 @@ void cpu_exec_init(CPUArchState *env)
>>>>       cpu->numa_node = 0;
>>>>       QTAILQ_INIT(&cpu->breakpoints);
>>>>       QTAILQ_INIT(&cpu->watchpoints);
>>>> +    QTAILQ_INIT(&cpu->exit_cbs.exit_callbacks);
>>>>   #ifndef CONFIG_USER_ONLY
>>>>       cpu->as = &address_space_memory;
>>>>       cpu->thread_id = qemu_get_thread_id();
>>>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>>>> index 8d121b3..0ec020b 100644
>>>> --- a/include/qom/cpu.h
>>>> +++ b/include/qom/cpu.h
>>>> @@ -201,6 +201,24 @@ typedef struct CPUWatchpoint {
>>>>       QTAILQ_ENTRY(CPUWatchpoint) entry;
>>>>   } CPUWatchpoint;
>>>>
>>>> +/* vCPU exit callbacks */
>>>> +typedef void (*CPUExitCallback)(CPUState *cpu, void *opaque);
>>>> +struct CPUExitCBs {
>>>> +    QemuMutex mutex;
>>>> +    QTAILQ_HEAD(exit_callbacks_head, CPUExitCB) exit_callbacks;
>>>> +};
>>>> +
>>>> +typedef struct CPUExitCB {
>>>> +    CPUExitCallback callback;
>>>> +    void *opaque;
>>>> +
>>>> +    QTAILQ_ENTRY(CPUExitCB) entry;
>>>> +} CPUExitCB;
>>>> +
>>>> +void cpu_exit_callback_add(CPUState *cpu, CPUExitCallback callback,
>>>> +                           void *opaque);
>>>> +void cpu_exit_callbacks_call_all(CPUState *cpu);
>>>> +
>>>>   /* Rendezvous support */
>>>>   #define TCG_RDV_POLLING_PERIOD 10
>>>>   typedef struct CpuExitRendezvous {
>>>> @@ -305,6 +323,8 @@ struct CPUState {
>>>>
>>>>       void *opaque;
>>>>
>>>> +    /* One-shot callbacks for stopping requests. */
>>>> +    struct CPUExitCBs exit_cbs;
>>>>       volatile int pending_rdv;
>>>>
>>>>       /* In order to avoid passing too many arguments to the MMIO helpers,
>>>>
>>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 02/13] cputlb: Add new TLB_EXCL flag
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 02/13] cputlb: Add new TLB_EXCL flag Alvise Rigo
@ 2015-07-16 14:32   ` Alex Bennée
  2015-07-16 15:04     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2015-07-16 14:32 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Add a new flag for the TLB entries to force all the accesses made to a
> page to follow the slow-path.
>
> In the case we remove a TLB entry marked as EXCL, we unset the
> corresponding exclusive bit in the bitmap.
>
> Mark the accessed page as dirty to invalidate any pending operation of
> LL/SC only if a vCPU writes to the protected address.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cputlb.c                |  18 ++++-
>  include/exec/cpu-all.h  |   2 +
>  include/exec/cpu-defs.h |   4 +
>  softmmu_template.h      | 189 +++++++++++++++++++++++++++++++-----------------
>  4 files changed, 144 insertions(+), 69 deletions(-)
>
> diff --git a/cputlb.c b/cputlb.c
> index e5853fd..0aca407 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -380,6 +380,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>      env->tlb_v_table[mmu_idx][vidx] = *te;
>      env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
>  
> +    if (!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL)) {
> +        /* We are removing an exclusive entry, if the corresponding exclusive
> +         * bit is set, unset it. */
> +        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
> +                                          (te->addr_write & TARGET_PAGE_MASK);
> +        if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
> +            cpu_physical_memory_set_excl_dirty(hw_addr);
> +        }

I'm confused. I'm reading that as "if the dirty exclusive bit is set,
then set the dirty exclusive bit", which doesn't seem right. The comment
seems to imply that should be cpu_physical_memory_clear_excl_dirty?
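
(For concreteness -- only an illustration of the question, not a proposed
fix, using the accessors this series introduces -- the two readings would
be:)

    /* Reading 1: the call itself is wrong and should be the clear variant. */
    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
        cpu_physical_memory_clear_excl_dirty(hw_addr);
    }

    /* Reading 2: with the series' convention (1 = dirty, 0 = exclusive),
     * "unset the exclusive bit" means setting the page back to dirty, so
     * only the condition needs to be inverted. */
    if (!cpu_physical_memory_excl_is_dirty(hw_addr)) {
        cpu_physical_memory_set_excl_dirty(hw_addr);
    }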

> +    }
> +
>      /* refill the tlb */
>      env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
>      env->iotlb[mmu_idx][index].attrs = attrs;
> @@ -405,7 +415,13 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>                                                     + xlat)) {
>              te->addr_write = address | TLB_NOTDIRTY;
>          } else {
> -            te->addr_write = address;
> +            if (!(address & TLB_MMIO) &&
> +                !cpu_physical_memory_excl_is_dirty(section->mr->ram_addr
> +                                                   + xlat)) {
> +                te->addr_write = address | TLB_EXCL;
> +            } else {
> +                te->addr_write = address;
> +            }
>          }
>      } else {
>          te->addr_write = -1;
> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> index ac06c67..632f6ce 100644
> --- a/include/exec/cpu-all.h
> +++ b/include/exec/cpu-all.h
> @@ -311,6 +311,8 @@ extern RAMList ram_list;
>  #define TLB_NOTDIRTY    (1 << 4)
>  /* Set if TLB entry is an IO callback.  */
>  #define TLB_MMIO        (1 << 5)
> +/* Set if TLB entry refers a page that requires exclusive access.  */
> +#define TLB_EXCL        (1 << 6)

I wonder if a compile-time assert should be added here to trap the case
where TARGET_PAGE_MASK starts encroaching on the lower bits? It looks
like the smallest target page size at the moment gives us 10 bits to
play with.
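
(A possible shape for such a guard -- a sketch only, assuming
QEMU_BUILD_BUG_ON from qemu/compiler.h and a compile-time constant
TARGET_PAGE_BITS; not part of this series:)

    /* All TLB flag bits must stay below the target page size, otherwise
     * they would clobber the page-frame bits of addr_write. */
    QEMU_BUILD_BUG_ON(TLB_EXCL & TARGET_PAGE_MASK);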

>  
>  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
>  void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index d5aecaf..c73a75f 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -165,5 +165,9 @@ typedef struct CPUIOTLBEntry {
>  #define CPU_COMMON                                                      \
>      /* soft mmu support */                                              \
>      CPU_COMMON_TLB                                                      \
> +                                                                        \
> +    /* Used for atomic instruction translation. */                      \
> +    bool ll_sc_context;                                                 \
> +    hwaddr excl_protected_hwaddr;                                       \
>  
>  #endif
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 18871f5..0edd451 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -141,6 +141,23 @@
>      vidx >= 0;                                                                \
>  })
>  
> +#define lookup_cpus_ll_addr(addr)                                             \
> +({                                                                            \
> +    CPUState *cpu;                                                            \
> +    CPUArchState *acpu;                                                       \
> +    bool hit = false;                                                         \
> +                                                                              \
> +    CPU_FOREACH(cpu) {                                                        \
> +        acpu = (CPUArchState *)cpu->env_ptr;                                  \
> +        if (cpu != current_cpu && acpu->excl_protected_hwaddr == addr) {      \
> +            hit = true;                                                       \
> +            break;                                                            \
> +        }                                                                     \
> +    }                                                                         \
> +                                                                              \
> +    hit;                                                                      \
> +})
> +

Is there a reason to abuse a #define like this instead of having an
inline function and letting the compiler sort it out?
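
(For comparison, a static inline with the same semantics as the macro --
a sketch only; since softmmu_template.h is included once per SHIFT, it
would have to be emitted a single time per translation unit, hence the
guard:)

    #if DATA_SIZE == 1  /* emit only once; the body does not depend on DATA_SIZE */
    static inline bool lookup_cpus_ll_addr(hwaddr addr)
    {
        CPUState *cpu;
        CPUArchState *acpu;

        CPU_FOREACH(cpu) {
            acpu = (CPUArchState *)cpu->env_ptr;
            if (cpu != current_cpu && acpu->excl_protected_hwaddr == addr) {
                return true;
            }
        }

        return false;
    }
    #endif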

>  #ifndef SOFTMMU_CODE_ACCESS
>  static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
>                                                CPUIOTLBEntry *iotlbentry,
> @@ -414,43 +431,61 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      }
>  
> -    /* Handle an IO access.  */
> +    /* Handle an IO access or exclusive access.  */
>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> -        CPUIOTLBEntry *iotlbentry;
> -        if ((addr & (DATA_SIZE - 1)) != 0) {
> -            goto do_unaligned_access;
> -        }
> -        iotlbentry = &env->iotlb[mmu_idx][index];
> -
> -        /* ??? Note that the io helpers always read data in the target
> -           byte ordering.  We should push the LE/BE request down into io.  */
> -        val = TGT_LE(val);
> -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> -        return;
> -    }
> -
> -    /* Handle slow unaligned access (it spans two pages or IO).  */
> -    if (DATA_SIZE > 1
> -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> -                     >= TARGET_PAGE_SIZE)) {
> -        int i;
> -    do_unaligned_access:
> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> -                                 mmu_idx, retaddr);
> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
> +            /* The slow-path has been forced since we are writing to
> +             * exclusive-protected memory. */
> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
> +
> +            bool set_to_dirty;
> +
> +            /* Two cases of invalidation: the current vCPU is writing to another
> +             * vCPU's exclusive address or the vCPU that issued the LoadLink is
> +             * writing to it, but not through a StoreCond. */
> +            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
> +            set_to_dirty |= env->ll_sc_context &&
> +                           (env->excl_protected_hwaddr == hw_addr);
> +
> +            if (set_to_dirty) {
> +                cpu_physical_memory_set_excl_dirty(hw_addr);
> +            } /* the vCPU is legitimately writing to the protected address */
> +        } else {
> +            if ((addr & (DATA_SIZE - 1)) != 0) {
> +                goto do_unaligned_access;
> +            }
> +
> +            /* ??? Note that the io helpers always read data in the target
> +               byte ordering.  We should push the LE/BE request down into io. */
> +            val = TGT_LE(val);
> +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> +            return;
>          }
> -        /* XXX: not efficient, but simple */
> -        /* Note: relies on the fact that tlb_fill() does not remove the
> -         * previous page from the TLB cache.  */
> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
> -            /* Little-endian extract.  */
> -            uint8_t val8 = val >> (i * 8);
> -            /* Note the adjustment at the beginning of the function.
> -               Undo that for the recursion.  */
> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> -                                            oi, retaddr + GETPC_ADJ);
> +    } else {
> +        /* Handle slow unaligned access (it spans two pages or IO).  */
> +        if (DATA_SIZE > 1
> +            && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> +                         >= TARGET_PAGE_SIZE)) {
> +            int i;
> +        do_unaligned_access:
> +            if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +                cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                                     mmu_idx, retaddr);
> +            }
> +            /* XXX: not efficient, but simple */
> +            /* Note: relies on the fact that tlb_fill() does not remove the
> +             * previous page from the TLB cache.  */
> +            for (i = DATA_SIZE - 1; i >= 0; i--) {
> +                /* Little-endian extract.  */
> +                uint8_t val8 = val >> (i * 8);
> +                /* Note the adjustment at the beginning of the function.
> +                   Undo that for the recursion.  */
> +                glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> +                                                oi, retaddr + GETPC_ADJ);
> +            }
> +            return;
>          }
> -        return;
>      }

OK, I can just about follow what is happening now with the 3 exit points
and the extra goto thrown in, but this function is starting to smell. The
changes seem reasonable, but what happens with the next tweak to the
function?
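
(One way to read the concern -- a sketch only, with a made-up helper name,
of pulling the invalidation test out of both the LE and BE store helpers
so the next tweak lands in a single place:)

    /* True if this write must invalidate a pending LL/SC sequence: either
     * another vCPU holds hw_addr as its exclusive address, or this vCPU
     * writes to its own protected address outside a StoreConditional. */
    static inline bool excl_write_invalidates(CPUArchState *env, hwaddr hw_addr)
    {
        bool set_to_dirty = lookup_cpus_ll_addr(hw_addr);

        set_to_dirty |= env->ll_sc_context &&
                        (env->excl_protected_hwaddr == hw_addr);

        return set_to_dirty;
    }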

>  
>      /* Handle aligned access or unaligned access in the same page.  */
> @@ -494,43 +529,61 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      }
>  
> -    /* Handle an IO access.  */
> +    /* Handle an IO access or exclusive access.  */
>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> -        CPUIOTLBEntry *iotlbentry;
> -        if ((addr & (DATA_SIZE - 1)) != 0) {
> -            goto do_unaligned_access;
> -        }
> -        iotlbentry = &env->iotlb[mmu_idx][index];
> -
> -        /* ??? Note that the io helpers always read data in the target
> -           byte ordering.  We should push the LE/BE request down into io.  */
> -        val = TGT_BE(val);
> -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> -        return;
> -    }
> -
> -    /* Handle slow unaligned access (it spans two pages or IO).  */
> -    if (DATA_SIZE > 1
> -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> -                     >= TARGET_PAGE_SIZE)) {
> -        int i;
> -    do_unaligned_access:
> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> -                                 mmu_idx, retaddr);
> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
> +            /* The slow-path has been forced since we are writing to
> +             * exclusive-protected memory. */
> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
> +
> +            bool set_to_dirty;
> +
> +            /* Two cases of invalidation: the current vCPU is writing to another
> +             * vCPU's exclusive address or the vCPU that issued the LoadLink is
> +             * writing to it, but not through a StoreCond. */
> +            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
> +            set_to_dirty |= env->ll_sc_context &&
> +                           (env->excl_protected_hwaddr == hw_addr);
> +
> +            if (set_to_dirty) {
> +                cpu_physical_memory_set_excl_dirty(hw_addr);
> +            } /* the vCPU is legitimately writing to the protected address */
> +        } else {
> +            if ((addr & (DATA_SIZE - 1)) != 0) {
> +                goto do_unaligned_access;
> +            }
> +
> +            /* ??? Note that the io helpers always read data in the target
> +               byte ordering.  We should push the LE/BE request down into io. */
> +            val = TGT_BE(val);
> +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> +            return;
>          }
> -        /* XXX: not efficient, but simple */
> -        /* Note: relies on the fact that tlb_fill() does not remove the
> -         * previous page from the TLB cache.  */
> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
> -            /* Big-endian extract.  */
> -            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> -            /* Note the adjustment at the beginning of the function.
> -               Undo that for the recursion.  */
> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> -                                            oi, retaddr + GETPC_ADJ);
> +    } else {
> +        /* Handle slow unaligned access (it spans two pages or IO).  */
> +        if (DATA_SIZE > 1
> +            && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> +                         >= TARGET_PAGE_SIZE)) {
> +            int i;
> +        do_unaligned_access:
> +            if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +                cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                                     mmu_idx, retaddr);
> +            }
> +            /* XXX: not efficient, but simple */
> +            /* Note: relies on the fact that tlb_fill() does not remove the
> +             * previous page from the TLB cache.  */
> +            for (i = DATA_SIZE - 1; i >= 0; i--) {
> +                /* Big-endian extract.  */
> +                uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> +                /* Note the adjustment at the beginning of the function.
> +                   Undo that for the recursion.  */
> +                glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> +                                                oi, retaddr + GETPC_ADJ);
> +            }
> +            return;
>          }
> -        return;
>      }
>  
>      /* Handle aligned access or unaligned access in the same page.  */

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v3 03/13] softmmu: Add helpers for a new slow-path
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 03/13] softmmu: Add helpers for a new slow-path Alvise Rigo
@ 2015-07-16 14:53   ` Alex Bennée
  2015-07-16 15:15     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2015-07-16 14:53 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> The new helpers rely on the legacy ones to perform the actual read/write.
>
> The StoreConditional helper (helper_le_stcond_name) returns 1 if the
> store has to fail due to a concurrent access to the same page by
> another vCPU.  A 'concurrent access' can be a store made by *any* vCPU
> (although, some implementations allow stores made by the CPU that issued
> the LoadLink).
>
> These helpers also update the TLB entry of the page involved in the
> LL/SC, so that all the following accesses made by any vCPU will follow
> the slow path.
> In real multi-threading, these helpers will require to temporarily pause
> the execution of the other vCPUs in order to update accordingly (flush)
> the TLB cache.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cputlb.c                |   3 +
>  softmmu_llsc_template.h | 155 ++++++++++++++++++++++++++++++++++++++++++++++++
>  softmmu_template.h      |   4 ++
>  tcg/tcg.h               |  18 ++++++
>  4 files changed, 180 insertions(+)
>  create mode 100644 softmmu_llsc_template.h
>
> diff --git a/cputlb.c b/cputlb.c
> index 0aca407..fa38714 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -475,6 +475,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>  
>  #define MMUSUFFIX _mmu
>  
> +/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
> +#define GEN_EXCLUSIVE_HELPERS
>  #define SHIFT 0
>  #include "softmmu_template.h"
>  
> @@ -487,6 +489,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>  #define SHIFT 3
>  #include "softmmu_template.h"
>  #undef MMUSUFFIX
> +#undef GEN_EXCLUSIVE_HELPERS
>  
>  #define MMUSUFFIX _cmmu
>  #undef GETPC_ADJ
> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
> new file mode 100644
> index 0000000..81e9d8e
> --- /dev/null
> +++ b/softmmu_llsc_template.h
> @@ -0,0 +1,155 @@
> +/*
> + *  Software MMU support (esclusive load/store operations)
> + *
> + * Generate helpers used by TCG for qemu_ldlink/stcond ops.
> + *
> + * Included from softmmu_template.h only.
> + *
> + * Copyright (c) 2015 Virtual Open Systems
> + *
> + * Authors:
> + *  Alvise Rigo <a.rigo@virtualopensystems.com>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/*
> + * TODO configurations not implemented:
> + *     - Signed/Unsigned Big-endian
> + *     - Signed Little-endian
> + * */
> +
> +#if DATA_SIZE > 1
> +#define helper_le_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_le_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
> +#else
> +#define helper_le_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_le_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
> +#endif
> +
> +/* helpers from cpu_ldst.h, byte-order independent versions */
> +#if DATA_SIZE > 1
> +#define helper_ld_legacy glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st_legacy glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
> +#else
> +#define helper_ld_legacy glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st_legacy glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
> +#endif
> +
> +#define is_write_tlb_entry_set(env, page, index)                             \
> +({                                                                           \
> +    (addr & TARGET_PAGE_MASK)                                                \
> +         == ((env->tlb_table[mmu_idx][index].addr_write) &                   \
> +                 (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
> +})                                                                           \
> +
> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
> +
> +WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
> +                                TCGMemOpIdx oi, uintptr_t retaddr)
> +{
> +    WORD_TYPE ret;
> +    int index;
> +    CPUState *cpu;
> +    hwaddr hw_addr;
> +    unsigned mmu_idx = get_mmuidx(oi);
> +
> +    /* Use the proper load helper from cpu_ldst.h */
> +    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
> +
> +    /* The last legacy access ensures that the TLB and IOTLB entry for 'addr'
> +     * have been created. */
> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +
> +    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
> +     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
> +    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;

I'm having trouble parsing the comment w.r.t the code here. Aren't the
TLB entries already page aligned? Should you not have:

    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    offset = (addr & ~TARGET_PAGE_MASK)

    hw_addr = env->iotlb[mmu_idx][index].addr + offset

?

> +
> +    /* Set the exclusive-protected hwaddr. */
> +    env->excl_protected_hwaddr = hw_addr;
> +    env->ll_sc_context = true;
> +
> +    /* No need to mask hw_addr with TARGET_PAGE_MASK since
> +     * cpu_physical_memory_excl_is_dirty() will take care of that. */
> +    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
> +        cpu_physical_memory_clear_excl_dirty(hw_addr);
> +
> +        /* Invalidate the TLB entry for the other processors. The next TLB
> +         * entries for this page will have the TLB_EXCL flag set. */
> +        CPU_FOREACH(cpu) {
> +            if (cpu != current_cpu) {
> +                tlb_flush(cpu, 1);
> +            }
> +        }
> +    }
> +
> +    /* For this vCPU, just update the TLB entry, no need to flush. */
> +    env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
> +
> +    return ret;
> +}
> +
> +WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
> +                                DATA_TYPE val, TCGMemOpIdx oi,
> +                                uintptr_t retaddr)
> +{
> +    WORD_TYPE ret;
> +    int index;
> +    hwaddr hw_addr;
> +    unsigned mmu_idx = get_mmuidx(oi);
> +
> +    /* If the TLB entry is not the right one, create it. */
> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +    if (!is_write_tlb_entry_set(env, addr, index)) {
> +        tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
> +    }
> +
> +    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
> +     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
> +    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
> +
> +    if (!env->ll_sc_context) {
> +        /* No LoadLink has been set, the StoreCond has to fail. */
> +        return 1;
> +    }
> +
> +    env->ll_sc_context = 0;
> +
> +    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
> +        /* Another vCPU has accessed the memory after the LoadLink. */
> +        ret = 1;
> +    } else {
> +        helper_st_legacy(env, addr, val, mmu_idx, retaddr);
> +
> +        /* The StoreConditional succeeded */
> +        ret = 0;
> +    }
> +
> +    env->tlb_table[mmu_idx][index].addr_write &= ~TLB_EXCL;
> +    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
> +    /* It's likely that the page will be used again for exclusive accesses;
> +     * for this reason we don't flush any TLB cache, at the price of some
> +     * additional slow paths, and we don't set the page bit as dirty.
> +     * The EXCL TLB entries will not remain there forever since they will
> +     * eventually be removed to serve another guest page; when this happens
> +     * we also remove the dirty bit (see cputlb.c).
> +     * */

We are explaining code that was never added here in the first place?

> +
> +    return ret;
> +}
> +
> +#undef helper_le_ldlink_name
> +#undef helper_le_stcond_name
> +#undef helper_ld_legacy
> +#undef helper_st_legacy
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 0edd451..bc767f6 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -630,6 +630,10 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
>  #endif
>  #endif /* !defined(SOFTMMU_CODE_ACCESS) */
>  
> +#ifdef GEN_EXCLUSIVE_HELPERS
> +#include "softmmu_llsc_template.h"
> +#endif
> +
>  #undef READ_ACCESS_TYPE
>  #undef SHIFT
>  #undef DATA_TYPE
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 032fe10..8ca85ab 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -962,6 +962,15 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
>                                      TCGMemOpIdx oi, uintptr_t retaddr);
>  uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
>                             TCGMemOpIdx oi, uintptr_t retaddr);
> +/* Exclusive variants */
> +tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
> +                                         int mmu_idx, uintptr_t retaddr);
> +tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
> +                                        int mmu_idx, uintptr_t retaddr);
> +tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
> +                                        int mmu_idx, uintptr_t retaddr);
> +uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
> +                               int mmu_idx, uintptr_t retaddr);
>  
>  /* Value sign-extended to tcg register size.  */
>  tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
> @@ -989,6 +998,15 @@ void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
>                         TCGMemOpIdx oi, uintptr_t retaddr);
>  void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
>                         TCGMemOpIdx oi, uintptr_t retaddr);
> +/* Exclusive variants */
> +tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
> +                                uint8_t val, int mmu_idx, uintptr_t retaddr);
> +tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
> +                                uint16_t val, int mmu_idx, uintptr_t retaddr);
> +tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
> +                                uint32_t val, int mmu_idx, uintptr_t retaddr);
> +uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
> +                                uint64_t val, int mmu_idx, uintptr_t retaddr);
>  
>  /* Temporary aliases until backends are converted.  */
>  #ifdef TARGET_WORDS_BIGENDIAN

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC v3 02/13] cputlb: Add new TLB_EXCL flag
  2015-07-16 14:32   ` Alex Bennée
@ 2015-07-16 15:04     ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-16 15:04 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

On Thu, Jul 16, 2015 at 4:32 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
> > Add a new flag for the TLB entries to force all the accesses made to a
> > page to follow the slow-path.
> >
> > In the case we remove a TLB entry marked as EXCL, we unset the
> > corresponding exclusive bit in the bitmap.
> >
> > Mark the accessed page as dirty to invalidate any pending operation of
> > LL/SC only if a vCPU writes to the protected address.
> >
> > Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> > Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> > Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> > ---
> >  cputlb.c                |  18 ++++-
> >  include/exec/cpu-all.h  |   2 +
> >  include/exec/cpu-defs.h |   4 +
> >  softmmu_template.h      | 189 +++++++++++++++++++++++++++++++-----------------
> >  4 files changed, 144 insertions(+), 69 deletions(-)
> >
> > diff --git a/cputlb.c b/cputlb.c
> > index e5853fd..0aca407 100644
> > --- a/cputlb.c
> > +++ b/cputlb.c
> > @@ -380,6 +380,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
> >      env->tlb_v_table[mmu_idx][vidx] = *te;
> >      env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
> >
> > +    if (!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL)) {
> > +        /* We are removing an exclusive entry, if the corresponding exclusive
> > +         * bit is set, unset it. */
> > +        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
> > +                                          (te->addr_write & TARGET_PAGE_MASK);
> > +        if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
> > +            cpu_physical_memory_set_excl_dirty(hw_addr);
> > +        }
>
> I'm confused. I'm reading that as "if the dirty exclusive bit is set
> then set the dirty exclusive bit", which doesn't seem right. The comment
> seems to imply that it should be cpu_physical_memory_clear_excl_dirty?


Yes, you are right; I've already fixed this issue in the upcoming v4.
It should be:

if (!cpu_physical_memory_excl_is_dirty(hw_addr)) {
    cpu_physical_memory_set_excl_dirty(hw_addr);
}

I will also make the comment more clear.
The rationale is to restore the dirty state of a page when a vCPU is
deleting the EXCL TLB entry associated with that page.

This piece of code actually lowers performance quite a lot, since it
then forces all the other vCPUs to flush at the next stcond. This is
why I'm currently testing a version of this patch series where each
vCPU has its own EXCL bit: the performance is much better, at the cost
of a bigger bitmap.
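
Roughly, the per-vCPU layout would be something like the sketch below
(purely illustrative: neither this helper nor its bitmap layout exist in
v3, and the excl_bitmap/nr_cpus parameters are assumptions):

/* One EXCL bit per (page, vCPU) pair instead of one per page, so only the
 * vCPUs that actually protected a page have to take the slow path / flush. */
static inline bool excl_is_dirty_for_cpu(const unsigned long *excl_bitmap,
                                         ram_addr_t addr, int cpu_index,
                                         int nr_cpus)
{
    unsigned long page = addr >> TARGET_PAGE_BITS;

    return test_bit(page * nr_cpus + cpu_index, excl_bitmap);
}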

>
>
> > +    }
> > +
> >      /* refill the tlb */
> >      env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
> >      env->iotlb[mmu_idx][index].attrs = attrs;
> > @@ -405,7 +415,13 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
> >                                                     + xlat)) {
> >              te->addr_write = address | TLB_NOTDIRTY;
> >          } else {
> > -            te->addr_write = address;
> > +            if (!(address & TLB_MMIO) &&
> > +                !cpu_physical_memory_excl_is_dirty(section->mr->ram_addr
> > +                                                   + xlat)) {
> > +                te->addr_write = address | TLB_EXCL;
> > +            } else {
> > +                te->addr_write = address;
> > +            }
> >          }
> >      } else {
> >          te->addr_write = -1;
> > diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> > index ac06c67..632f6ce 100644
> > --- a/include/exec/cpu-all.h
> > +++ b/include/exec/cpu-all.h
> > @@ -311,6 +311,8 @@ extern RAMList ram_list;
> >  #define TLB_NOTDIRTY    (1 << 4)
> >  /* Set if TLB entry is an IO callback.  */
> >  #define TLB_MMIO        (1 << 5)
> > +/* Set if TLB entry refers a page that requires exclusive access.  */
> > +#define TLB_EXCL        (1 << 6)
>
> I wonder if a compile time assert should be added here to trap the case
> when TARGET_PAGE_MASK starts encroaching on the lower bits? It looks
> like the smallest at the moment gives us 10 bits to play with.

Yes, it absolutely makes sense. I will take this into account for the
next version.
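
Something next to the flag definitions along these lines should be enough
(a sketch only; treating QEMU_BUILD_BUG_ON as the right tool here is an
assumption):

/* Trip at build time if the new flag bits ever reach into TARGET_PAGE_MASK
 * territory, i.e. if some target's page size drops to 64 bytes or less. */
QEMU_BUILD_BUG_ON(TLB_EXCL >= TARGET_PAGE_SIZE);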

>
> >
> >  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
> >  void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
> > diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> > index d5aecaf..c73a75f 100644
> > --- a/include/exec/cpu-defs.h
> > +++ b/include/exec/cpu-defs.h
> > @@ -165,5 +165,9 @@ typedef struct CPUIOTLBEntry {
> >  #define CPU_COMMON                                                      \
> >      /* soft mmu support */                                              \
> >      CPU_COMMON_TLB                                                      \
> > +                                                                        \
> > +    /* Used for atomic instruction translation. */                      \
> > +    bool ll_sc_context;                                                 \
> > +    hwaddr excl_protected_hwaddr;                                       \
> >
> >  #endif
> > diff --git a/softmmu_template.h b/softmmu_template.h
> > index 18871f5..0edd451 100644
> > --- a/softmmu_template.h
> > +++ b/softmmu_template.h
> > @@ -141,6 +141,23 @@
> >      vidx >= 0;                                                                \
> >  })
> >
> > +#define lookup_cpus_ll_addr(addr)                                             \
> > +({                                                                            \
> > +    CPUState *cpu;                                                            \
> > +    CPUArchState *acpu;                                                       \
> > +    bool hit = false;                                                         \
> > +                                                                              \
> > +    CPU_FOREACH(cpu) {                                                        \
> > +        acpu = (CPUArchState *)cpu->env_ptr;                                  \
> > +        if (cpu != current_cpu && acpu->excl_protected_hwaddr == addr) {      \
> > +            hit = true;                                                       \
> > +            break;                                                            \
> > +        }                                                                     \
> > +    }                                                                         \
> > +                                                                              \
> > +    hit;                                                                      \
> > +})
> > +
>
> Is there a reason to abuse a #define like this instead of having an
> inline and letting the compiler sort it out?

No, I can also move it to cputlb.c if necessary.
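
For reference, the same lookup written as a function would be roughly
(a sketch, assuming it ends up next to the other helpers so that
CPUArchState and CPU_FOREACH are visible):

static inline bool lookup_cpus_ll_addr(hwaddr addr)
{
    CPUState *cpu;

    CPU_FOREACH(cpu) {
        CPUArchState *acpu = (CPUArchState *)cpu->env_ptr;

        if (cpu != current_cpu && acpu->excl_protected_hwaddr == addr) {
            return true;
        }
    }

    return false;
}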

>
> >  #ifndef SOFTMMU_CODE_ACCESS
> >  static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
> >                                                CPUIOTLBEntry *iotlbentry,
> > @@ -414,43 +431,61 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
> >          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> >      }
> >
> > -    /* Handle an IO access.  */
> > +    /* Handle an IO access or exclusive access.  */
> >      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> > -        CPUIOTLBEntry *iotlbentry;
> > -        if ((addr & (DATA_SIZE - 1)) != 0) {
> > -            goto do_unaligned_access;
> > -        }
> > -        iotlbentry = &env->iotlb[mmu_idx][index];
> > -
> > -        /* ??? Note that the io helpers always read data in the target
> > -           byte ordering.  We should push the LE/BE request down into io.  */
> > -        val = TGT_LE(val);
> > -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> > -        return;
> > -    }
> > -
> > -    /* Handle slow unaligned access (it spans two pages or IO).  */
> > -    if (DATA_SIZE > 1
> > -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> > -                     >= TARGET_PAGE_SIZE)) {
> > -        int i;
> > -    do_unaligned_access:
> > -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> > -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> > -                                 mmu_idx, retaddr);
> > +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> > +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
> > +            /* The slow-path has been forced since we are writing to
> > +             * exclusive-protected memory. */
> > +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
> > +
> > +            bool set_to_dirty;
> > +
> > +            /* Two cases of invalidation: the current vCPU is writing to another
> > +             * vCPU's exclusive address or the vCPU that issued the LoadLink is
> > +             * writing to it, but not through a StoreCond. */
> > +            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
> > +            set_to_dirty |= env->ll_sc_context &&
> > +                           (env->excl_protected_hwaddr == hw_addr);
> > +
> > +            if (set_to_dirty) {
> > +                cpu_physical_memory_set_excl_dirty(hw_addr);
> > +            } /* the vCPU is legitimately writing to the protected address */
> > +        } else {
> > +            if ((addr & (DATA_SIZE - 1)) != 0) {
> > +                goto do_unaligned_access;
> > +            }
> > +
> > +            /* ??? Note that the io helpers always read data in the target
> > +               byte ordering.  We should push the LE/BE request down into io. */
> > +            val = TGT_LE(val);
> > +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> > +            return;
> >          }
> > -        /* XXX: not efficient, but simple */
> > -        /* Note: relies on the fact that tlb_fill() does not remove the
> > -         * previous page from the TLB cache.  */
> > -        for (i = DATA_SIZE - 1; i >= 0; i--) {
> > -            /* Little-endian extract.  */
> > -            uint8_t val8 = val >> (i * 8);
> > -            /* Note the adjustment at the beginning of the function.
> > -               Undo that for the recursion.  */
> > -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> > -                                            oi, retaddr + GETPC_ADJ);
> > +    } else {
> > +        /* Handle slow unaligned access (it spans two pages or IO).  */
> > +        if (DATA_SIZE > 1
> > +            && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> > +                         >= TARGET_PAGE_SIZE)) {
> > +            int i;
> > +        do_unaligned_access:
> > +            if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> > +                cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> > +                                     mmu_idx, retaddr);
> > +            }
> > +            /* XXX: not efficient, but simple */
> > +            /* Note: relies on the fact that tlb_fill() does not remove the
> > +             * previous page from the TLB cache.  */
> > +            for (i = DATA_SIZE - 1; i >= 0; i--) {
> > +                /* Little-endian extract.  */
> > +                uint8_t val8 = val >> (i * 8);
> > +                /* Note the adjustment at the beginning of the function.
> > +                   Undo that for the recursion.  */
> > +                glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> > +                                                oi, retaddr + GETPC_ADJ);
> > +            }
> > +            return;
> >          }
> > -        return;
> >      }
>
> OK, I can just about follow what happened now with the 3 exit points and
> the extra goto thrown in, but this function is starting to smell. The changes
> seem reasonable, but what happens with the next tweak to the function?

I agree with you... it's a bit messy, but it does not touch the likely
path at all.
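
For what it's worth, one way to keep the store helpers flat would be to
pull the TLB_EXCL branch out into its own small helper, roughly like this
(a sketch reusing the names from the patch above; no such helper exists in
the series):

static inline void handle_excl_store(CPUArchState *env,
                                     CPUIOTLBEntry *iotlbentry,
                                     target_ulong addr)
{
    hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
    bool set_to_dirty;

    /* Two cases of invalidation: another vCPU's protected address, or the
     * vCPU that issued the LoadLink writing outside of a StoreConditional. */
    set_to_dirty = lookup_cpus_ll_addr(hw_addr);
    set_to_dirty |= env->ll_sc_context &&
                    (env->excl_protected_hwaddr == hw_addr);

    if (set_to_dirty) {
        cpu_physical_memory_set_excl_dirty(hw_addr);
    }
}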

Thank you,
alvise

>
> >
> >      /* Handle aligned access or unaligned access in the same page.  */
> > @@ -494,43 +529,61 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
> >          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> >      }
> >
> > -    /* Handle an IO access.  */
> > +    /* Handle an IO access or exclusive access.  */
> >      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> > -        CPUIOTLBEntry *iotlbentry;
> > -        if ((addr & (DATA_SIZE - 1)) != 0) {
> > -            goto do_unaligned_access;
> > -        }
> > -        iotlbentry = &env->iotlb[mmu_idx][index];
> > -
> > -        /* ??? Note that the io helpers always read data in the target
> > -           byte ordering.  We should push the LE/BE request down into io.  */
> > -        val = TGT_BE(val);
> > -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> > -        return;
> > -    }
> > -
> > -    /* Handle slow unaligned access (it spans two pages or IO).  */
> > -    if (DATA_SIZE > 1
> > -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> > -                     >= TARGET_PAGE_SIZE)) {
> > -        int i;
> > -    do_unaligned_access:
> > -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> > -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> > -                                 mmu_idx, retaddr);
> > +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> > +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
> > +            /* The slow-path has been forced since we are writing to
> > +             * exclusive-protected memory. */
> > +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
> > +
> > +            bool set_to_dirty;
> > +
> > +            /* Two cases of invalidation: the current vCPU is writing to another
> > +             * vCPU's exclusive address or the vCPU that issued the LoadLink is
> > +             * writing to it, but not through a StoreCond. */
> > +            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
> > +            set_to_dirty |= env->ll_sc_context &&
> > +                           (env->excl_protected_hwaddr == hw_addr);
> > +
> > +            if (set_to_dirty) {
> > +                cpu_physical_memory_set_excl_dirty(hw_addr);
> > +            } /* the vCPU is legitimately writing to the protected address */
> > +        } else {
> > +            if ((addr & (DATA_SIZE - 1)) != 0) {
> > +                goto do_unaligned_access;
> > +            }
> > +
> > +            /* ??? Note that the io helpers always read data in the target
> > +               byte ordering.  We should push the LE/BE request down into io. */
> > +            val = TGT_BE(val);
> > +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> > +            return;
> >          }
> > -        /* XXX: not efficient, but simple */
> > -        /* Note: relies on the fact that tlb_fill() does not remove the
> > -         * previous page from the TLB cache.  */
> > -        for (i = DATA_SIZE - 1; i >= 0; i--) {
> > -            /* Big-endian extract.  */
> > -            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> > -            /* Note the adjustment at the beginning of the function.
> > -               Undo that for the recursion.  */
> > -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> > -                                            oi, retaddr + GETPC_ADJ);
> > +    } else {
> > +        /* Handle slow unaligned access (it spans two pages or IO).  */
> > +        if (DATA_SIZE > 1
> > +            && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> > +                         >= TARGET_PAGE_SIZE)) {
> > +            int i;
> > +        do_unaligned_access:
> > +            if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> > +                cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> > +                                     mmu_idx, retaddr);
> > +            }
> > +            /* XXX: not efficient, but simple */
> > +            /* Note: relies on the fact that tlb_fill() does not remove the
> > +             * previous page from the TLB cache.  */
> > +            for (i = DATA_SIZE - 1; i >= 0; i--) {
> > +                /* Big-endian extract.  */
> > +                uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> > +                /* Note the adjustment at the beginning of the function.
> > +                   Undo that for the recursion.  */
> > +                glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> > +                                                oi, retaddr + GETPC_ADJ);
> > +            }
> > +            return;
> >          }
> > -        return;
> >      }
> >
> >      /* Handle aligned access or unaligned access in the same page.  */
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v3 03/13] softmmu: Add helpers for a new slow-path
  2015-07-16 14:53   ` Alex Bennée
@ 2015-07-16 15:15     ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-16 15:15 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

On Thu, Jul 16, 2015 at 4:53 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> The new helpers rely on the legacy ones to perform the actual read/write.
>>
>> The StoreConditional helper (helper_le_stcond_name) returns 1 if the
>> store has to fail due to a concurrent access to the same page by
>> another vCPU.  A 'concurrent access' can be a store made by *any* vCPU
>> (although some implementations allow stores made by the CPU that issued
>> the LoadLink).
>>
>> These helpers also update the TLB entry of the page involved in the
>> LL/SC, so that all the following accesses made by any vCPU will follow
>> the slow path.
>> In real multi-threading, these helpers will require temporarily pausing
>> the execution of the other vCPUs in order to update (flush) the TLB
>> cache accordingly.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  cputlb.c                |   3 +
>>  softmmu_llsc_template.h | 155 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  softmmu_template.h      |   4 ++
>>  tcg/tcg.h               |  18 ++++++
>>  4 files changed, 180 insertions(+)
>>  create mode 100644 softmmu_llsc_template.h
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index 0aca407..fa38714 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -475,6 +475,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>>
>>  #define MMUSUFFIX _mmu
>>
>> +/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
>> +#define GEN_EXCLUSIVE_HELPERS
>>  #define SHIFT 0
>>  #include "softmmu_template.h"
>>
>> @@ -487,6 +489,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>>  #define SHIFT 3
>>  #include "softmmu_template.h"
>>  #undef MMUSUFFIX
>> +#undef GEN_EXCLUSIVE_HELPERS
>>
>>  #define MMUSUFFIX _cmmu
>>  #undef GETPC_ADJ
>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>> new file mode 100644
>> index 0000000..81e9d8e
>> --- /dev/null
>> +++ b/softmmu_llsc_template.h
>> @@ -0,0 +1,155 @@
>> +/*
>> + *  Software MMU support (exclusive load/store operations)
>> + *
>> + * Generate helpers used by TCG for qemu_ldlink/stcond ops.
>> + *
>> + * Included from softmmu_template.h only.
>> + *
>> + * Copyright (c) 2015 Virtual Open Systems
>> + *
>> + * Authors:
>> + *  Alvise Rigo <a.rigo@virtualopensystems.com>
>> + *
>> + * This library is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU Lesser General Public
>> + * License as published by the Free Software Foundation; either
>> + * version 2 of the License, or (at your option) any later version.
>> + *
>> + * This library is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * Lesser General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU Lesser General Public
>> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +/*
>> + * TODO configurations not implemented:
>> + *     - Signed/Unsigned Big-endian
>> + *     - Signed Little-endian
>> + * */
>> +
>> +#if DATA_SIZE > 1
>> +#define helper_le_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
>> +#define helper_le_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
>> +#else
>> +#define helper_le_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
>> +#define helper_le_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
>> +#endif
>> +
>> +/* helpers from cpu_ldst.h, byte-order independent versions */
>> +#if DATA_SIZE > 1
>> +#define helper_ld_legacy glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
>> +#define helper_st_legacy glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
>> +#else
>> +#define helper_ld_legacy glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
>> +#define helper_st_legacy glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
>> +#endif
>> +
>> +#define is_write_tlb_entry_set(env, page, index)                             \
>> +({                                                                           \
>> +    (addr & TARGET_PAGE_MASK)                                                \
>> +         == ((env->tlb_table[mmu_idx][index].addr_write) &                   \
>> +                 (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
>> +})                                                                           \
>> +
>> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>> +
>> +WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
>> +                                TCGMemOpIdx oi, uintptr_t retaddr)
>> +{
>> +    WORD_TYPE ret;
>> +    int index;
>> +    CPUState *cpu;
>> +    hwaddr hw_addr;
>> +    unsigned mmu_idx = get_mmuidx(oi);
>> +
>> +    /* Use the proper load helper from cpu_ldst.h */
>> +    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
>> +
>> +    /* The last legacy access ensures that the TLB and IOTLB entry for 'addr'
>> +     * have been created. */
>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>> +
>> +    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>> +     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>> +    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>
> I'm having trouble parsing the comment w.r.t the code here. Aren't the
> TLB entries already page aligned? Should you not have:
>
>     index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>     offset = (addr & ~TARGET_PAGE_MASK)
>
>     hw_addr = env->iotlb[mmu_idx][index].addr + offset
>
> ?

If I'm not wrong, the iotlb entries can have one of the two low bits
set; this is why I'm ANDing with TARGET_PAGE_MASK (we could reduce the
mask if needed).
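
In other words (this is just a reading of the existing fill code in
cputlb.c, nothing added by the patch):

/* At fill time cputlb.c does:
 *     env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
 * i.e. a page-aligned (physical - virtual) delta, possibly with TLB flag
 * bits such as TLB_MMIO set in the low bits.  Masking strips those flags,
 * and adding the full guest 'addr' (not just the page offset) yields the
 * physical address, because the -vaddr part is already folded in. */
hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;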

Thank you,
alvise

>
>> +
>> +    /* Set the exclusive-protected hwaddr. */
>> +    env->excl_protected_hwaddr = hw_addr;
>> +    env->ll_sc_context = true;
>> +
>> +    /* No need to mask hw_addr with TARGET_PAGE_MASK since
>> +     * cpu_physical_memory_excl_is_dirty() will take care of that. */
>> +    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
>> +        cpu_physical_memory_clear_excl_dirty(hw_addr);
>> +
>> +        /* Invalidate the TLB entry for the other processors. The next TLB
>> +         * entries for this page will have the TLB_EXCL flag set. */
>> +        CPU_FOREACH(cpu) {
>> +            if (cpu != current_cpu) {
>> +                tlb_flush(cpu, 1);
>> +            }
>> +        }
>> +    }
>> +
>> +    /* For this vCPU, just update the TLB entry, no need to flush. */
>> +    env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>> +
>> +    return ret;
>> +}
>> +
>> +WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
>> +                                DATA_TYPE val, TCGMemOpIdx oi,
>> +                                uintptr_t retaddr)
>> +{
>> +    WORD_TYPE ret;
>> +    int index;
>> +    hwaddr hw_addr;
>> +    unsigned mmu_idx = get_mmuidx(oi);
>> +
>> +    /* If the TLB entry is not the right one, create it. */
>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>> +    if (!is_write_tlb_entry_set(env, addr, index)) {
>> +        tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
>> +    }
>> +
>> +    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>> +     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>> +    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>> +
>> +    if (!env->ll_sc_context) {
>> +        /* No LoadLink has been set, the StoreCond has to fail. */
>> +        return 1;
>> +    }
>> +
>> +    env->ll_sc_context = 0;
>> +
>> +    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
>> +        /* Another vCPU has accessed the memory after the LoadLink. */
>> +        ret = 1;
>> +    } else {
>> +        helper_st_legacy(env, addr, val, mmu_idx, retaddr);
>> +
>> +        /* The StoreConditional succeeded */
>> +        ret = 0;
>> +    }
>> +
>> +    env->tlb_table[mmu_idx][index].addr_write &= ~TLB_EXCL;
>> +    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
>> +    /* It's likely that the page will be used again for exclusive accesses;
>> +     * for this reason we don't flush any TLB cache, at the price of some
>> +     * additional slow paths, and we don't set the page bit as dirty.
>> +     * The EXCL TLB entries will not remain there forever since they will
>> +     * eventually be removed to serve another guest page; when this happens
>> +     * we also remove the dirty bit (see cputlb.c).
>> +     * */
>
> We are explaining code that was never added here in the first place?
>
>> +
>> +    return ret;
>> +}
>> +
>> +#undef helper_le_ldlink_name
>> +#undef helper_le_stcond_name
>> +#undef helper_ld_legacy
>> +#undef helper_st_legacy
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 0edd451..bc767f6 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -630,6 +630,10 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
>>  #endif
>>  #endif /* !defined(SOFTMMU_CODE_ACCESS) */
>>
>> +#ifdef GEN_EXCLUSIVE_HELPERS
>> +#include "softmmu_llsc_template.h"
>> +#endif
>> +
>>  #undef READ_ACCESS_TYPE
>>  #undef SHIFT
>>  #undef DATA_TYPE
>> diff --git a/tcg/tcg.h b/tcg/tcg.h
>> index 032fe10..8ca85ab 100644
>> --- a/tcg/tcg.h
>> +++ b/tcg/tcg.h
>> @@ -962,6 +962,15 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
>>                                      TCGMemOpIdx oi, uintptr_t retaddr);
>>  uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
>>                             TCGMemOpIdx oi, uintptr_t retaddr);
>> +/* Exclusive variants */
>> +tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
>> +                                         int mmu_idx, uintptr_t retaddr);
>> +tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
>> +                                        int mmu_idx, uintptr_t retaddr);
>> +tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
>> +                                        int mmu_idx, uintptr_t retaddr);
>> +uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
>> +                               int mmu_idx, uintptr_t retaddr);
>>
>>  /* Value sign-extended to tcg register size.  */
>>  tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
>> @@ -989,6 +998,15 @@ void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr);
>>  void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr);
>> +/* Exclusive variants */
>> +tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
>> +                                uint8_t val, int mmu_idx, uintptr_t retaddr);
>> +tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
>> +                                uint16_t val, int mmu_idx, uintptr_t retaddr);
>> +tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
>> +                                uint32_t val, int mmu_idx, uintptr_t retaddr);
>> +uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
>> +                                uint64_t val, int mmu_idx, uintptr_t retaddr);
>>
>>  /* Temporary aliases until backends are converted.  */
>>  #ifdef TARGET_WORDS_BIGENDIAN
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v3 04/13] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 04/13] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions Alvise Rigo
@ 2015-07-17  9:49   ` Alex Bennée
  2015-07-17 10:05     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2015-07-17  9:49 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Create a new pair of instructions that implement a LoadLink/StoreConditional
> mechanism.
>
> It has not been possible to completely include the two new opcodes
> in the plain variants, since the StoreConditional will always require
> one more argument to store the success of the operation.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  tcg/tcg-be-ldst.h |  1 +
>  tcg/tcg-op.c      | 23 +++++++++++++++++++++++
>  tcg/tcg-op.h      |  3 +++
>  tcg/tcg-opc.h     |  4 ++++
>  tcg/tcg.c         |  2 ++
>  tcg/tcg.h         | 18 ++++++++++--------
>  6 files changed, 43 insertions(+), 8 deletions(-)
>
> diff --git a/tcg/tcg-be-ldst.h b/tcg/tcg-be-ldst.h
> index 40a2369..b3f9c51 100644
> --- a/tcg/tcg-be-ldst.h
> +++ b/tcg/tcg-be-ldst.h
> @@ -24,6 +24,7 @@
>  
>  typedef struct TCGLabelQemuLdst {
>      bool is_ld;             /* qemu_ld: true, qemu_st: false */
> +    TCGReg llsc_success;    /* reg index for qemu_stcond outcome */
>      TCGMemOpIdx oi;
>      TCGType type;           /* result type of a load */
>      TCGReg addrlo_reg;      /* reg index for low word of guest virtual addr */
> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> index 45098c3..a73b522 100644
> --- a/tcg/tcg-op.c
> +++ b/tcg/tcg-op.c
> @@ -1885,6 +1885,15 @@ static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr,
>  #endif
>  }
>  
> +/* An output operand to return the StoreConditional result */
> +static void gen_stcond_i32(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i32 val,
> +                           TCGv addr, TCGMemOp memop, TCGArg idx)
> +{
> +    TCGMemOpIdx oi = make_memop_idx(memop, idx);
> +
> +    tcg_gen_op4i_i32(opc, is_dirty, val, addr, oi);

This breaks on 64-bit builds, as TCGv addr can be 64 bits wide.

> +}
> +
>  static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, TCGv addr,
>                           TCGMemOp memop, TCGArg idx)
>  {
> @@ -1911,12 +1920,26 @@ void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
>      gen_ldst_i32(INDEX_op_qemu_ld_i32, val, addr, memop, idx);
>  }
>  
> +void tcg_gen_qemu_ldlink_i32(TCGv_i32 val, TCGv addr, TCGArg idx,
> +                             TCGMemOp memop)
> +{
> +    memop = tcg_canonicalize_memop(memop, 0, 0);
> +    gen_ldst_i32(INDEX_op_qemu_ldlink_i32, val, addr, memop, idx);
> +}
> +
>  void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
>  {
>      memop = tcg_canonicalize_memop(memop, 0, 1);
>      gen_ldst_i32(INDEX_op_qemu_st_i32, val, addr, memop, idx);
>  }
>  
> +void tcg_gen_qemu_stcond_i32(TCGv_i32 is_dirty, TCGv_i32 val, TCGv addr,
> +                             TCGArg idx, TCGMemOp memop)
> +{
> +    memop = tcg_canonicalize_memop(memop, 0, 1);
> +    gen_stcond_i32(INDEX_op_qemu_stcond_i32, is_dirty, val, addr, memop, idx);
> +}
> +
>  void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
>  {
>      if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
> diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
> index d1d763f..f183169 100644
> --- a/tcg/tcg-op.h
> +++ b/tcg/tcg-op.h
> @@ -754,6 +754,9 @@ void tcg_gen_qemu_st_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
>  void tcg_gen_qemu_ld_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
>  void tcg_gen_qemu_st_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
>  
> +void tcg_gen_qemu_ldlink_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
> +void tcg_gen_qemu_stcond_i32(TCGv_i32, TCGv_i32, TCGv, TCGArg, TCGMemOp);
> +
>  static inline void tcg_gen_qemu_ld8u(TCGv ret, TCGv addr, int mem_index)
>  {
>      tcg_gen_qemu_ld_tl(ret, addr, mem_index, MO_UB);
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 13ccb60..d6c0454 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -183,6 +183,10 @@ DEF(qemu_ld_i32, 1, TLADDR_ARGS, 1,
>      TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
>  DEF(qemu_st_i32, 0, TLADDR_ARGS + 1, 1,
>      TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
> +DEF(qemu_ldlink_i32, 1, TLADDR_ARGS, 2,
> +    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
> +DEF(qemu_stcond_i32, 1, TLADDR_ARGS + 1, 2,
> +    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
>  DEF(qemu_ld_i64, DATA64_ARGS, TLADDR_ARGS, 1,
>      TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
>  DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 7e088b1..8a2265e 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -1068,6 +1068,8 @@ void tcg_dump_ops(TCGContext *s)
>                  i = 1;
>                  break;
>              case INDEX_op_qemu_ld_i32:
> +            case INDEX_op_qemu_ldlink_i32:
> +            case INDEX_op_qemu_stcond_i32:
>              case INDEX_op_qemu_st_i32:
>              case INDEX_op_qemu_ld_i64:
>              case INDEX_op_qemu_st_i64:
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 8ca85ab..d41a18c 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -282,6 +282,8 @@ typedef enum TCGMemOp {
>      MO_TEQ   = MO_TE | MO_Q,
>  
>      MO_SSIZE = MO_SIZE | MO_SIGN,
> +
> +    MO_EXCL  = 32, /* Set for exclusive memory access */
>  } TCGMemOp;
>  
>  typedef tcg_target_ulong TCGArg;
> @@ -964,13 +966,13 @@ uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
>                             TCGMemOpIdx oi, uintptr_t retaddr);
>  /* Exclusive variants */
>  tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
> -                                         int mmu_idx, uintptr_t retaddr);
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>  tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
> -                                        int mmu_idx, uintptr_t retaddr);
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>  tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
> -                                        int mmu_idx, uintptr_t retaddr);
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>  uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
> -                               int mmu_idx, uintptr_t retaddr);
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>  
>  /* Value sign-extended to tcg register size.  */
>  tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
> @@ -1000,13 +1002,13 @@ void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
>                         TCGMemOpIdx oi, uintptr_t retaddr);
>  /* Exclusive variants */
>  tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
> -                                uint8_t val, int mmu_idx, uintptr_t retaddr);
> +                            uint8_t val, TCGMemOpIdx oi, uintptr_t retaddr);
>  tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
> -                                uint16_t val, int mmu_idx, uintptr_t retaddr);
> +                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
>  tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
> -                                uint32_t val, int mmu_idx, uintptr_t retaddr);
> +                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
>  uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
> -                                uint64_t val, int mmu_idx, uintptr_t retaddr);
> +                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
>  
>  /* Temporary aliases until backends are converted.  */
>  #ifdef TARGET_WORDS_BIGENDIAN

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC v3 04/13] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
  2015-07-17  9:49   ` Alex Bennée
@ 2015-07-17 10:05     ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-17 10:05 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

You are right. I will fix this case as in gen_ldst_i32().
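
Roughly along these lines (a sketch only; the exact op emitters, and
whether the 32-bit-host split-address case needs extra handling, may well
change in v4):

static void gen_stcond_i32(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i32 val,
                           TCGv addr, TCGMemOp memop, TCGArg idx)
{
    TCGMemOpIdx oi = make_memop_idx(memop, idx);
#if TARGET_LONG_BITS == 32
    tcg_gen_op4i_i32(opc, is_dirty, val, addr, oi);
#else
    /* addr is a TCGv_i64 for 64-bit guests, so emit a raw op with mixed
     * operand widths, as gen_ldst_i32() does. */
    tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I32(is_dirty), GET_TCGV_I32(val),
                GET_TCGV_I64(addr), oi);
#endif
}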

Thanks,
alvise

On Fri, Jul 17, 2015 at 11:49 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Create a new pair of instructions that implement a LoadLink/StoreConditional
>> mechanism.
>>
>> It has not been possible to completely include the two new opcodes
>> in the plain variants, since the StoreConditional will always require
>> one more argument to store the success of the operation.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  tcg/tcg-be-ldst.h |  1 +
>>  tcg/tcg-op.c      | 23 +++++++++++++++++++++++
>>  tcg/tcg-op.h      |  3 +++
>>  tcg/tcg-opc.h     |  4 ++++
>>  tcg/tcg.c         |  2 ++
>>  tcg/tcg.h         | 18 ++++++++++--------
>>  6 files changed, 43 insertions(+), 8 deletions(-)
>>
>> diff --git a/tcg/tcg-be-ldst.h b/tcg/tcg-be-ldst.h
>> index 40a2369..b3f9c51 100644
>> --- a/tcg/tcg-be-ldst.h
>> +++ b/tcg/tcg-be-ldst.h
>> @@ -24,6 +24,7 @@
>>
>>  typedef struct TCGLabelQemuLdst {
>>      bool is_ld;             /* qemu_ld: true, qemu_st: false */
>> +    TCGReg llsc_success;    /* reg index for qemu_stcond outcome */
>>      TCGMemOpIdx oi;
>>      TCGType type;           /* result type of a load */
>>      TCGReg addrlo_reg;      /* reg index for low word of guest virtual addr */
>> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
>> index 45098c3..a73b522 100644
>> --- a/tcg/tcg-op.c
>> +++ b/tcg/tcg-op.c
>> @@ -1885,6 +1885,15 @@ static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr,
>>  #endif
>>  }
>>
>> +/* An output operand to return the StoreConditional result */
>> +static void gen_stcond_i32(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i32 val,
>> +                           TCGv addr, TCGMemOp memop, TCGArg idx)
>> +{
>> +    TCGMemOpIdx oi = make_memop_idx(memop, idx);
>> +
>> +    tcg_gen_op4i_i32(opc, is_dirty, val, addr, oi);
>
> This breaks on 64-bit builds, as TCGv addr can be 64 bits wide.
>
>> +}
>> +
>>  static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, TCGv addr,
>>                           TCGMemOp memop, TCGArg idx)
>>  {
>> @@ -1911,12 +1920,26 @@ void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
>>      gen_ldst_i32(INDEX_op_qemu_ld_i32, val, addr, memop, idx);
>>  }
>>
>> +void tcg_gen_qemu_ldlink_i32(TCGv_i32 val, TCGv addr, TCGArg idx,
>> +                             TCGMemOp memop)
>> +{
>> +    memop = tcg_canonicalize_memop(memop, 0, 0);
>> +    gen_ldst_i32(INDEX_op_qemu_ldlink_i32, val, addr, memop, idx);
>> +}
>> +
>>  void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
>>  {
>>      memop = tcg_canonicalize_memop(memop, 0, 1);
>>      gen_ldst_i32(INDEX_op_qemu_st_i32, val, addr, memop, idx);
>>  }
>>
>> +void tcg_gen_qemu_stcond_i32(TCGv_i32 is_dirty, TCGv_i32 val, TCGv addr,
>> +                             TCGArg idx, TCGMemOp memop)
>> +{
>> +    memop = tcg_canonicalize_memop(memop, 0, 1);
>> +    gen_stcond_i32(INDEX_op_qemu_stcond_i32, is_dirty, val, addr, memop, idx);
>> +}
>> +
>>  void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
>>  {
>>      if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
>> diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
>> index d1d763f..f183169 100644
>> --- a/tcg/tcg-op.h
>> +++ b/tcg/tcg-op.h
>> @@ -754,6 +754,9 @@ void tcg_gen_qemu_st_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
>>  void tcg_gen_qemu_ld_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
>>  void tcg_gen_qemu_st_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
>>
>> +void tcg_gen_qemu_ldlink_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
>> +void tcg_gen_qemu_stcond_i32(TCGv_i32, TCGv_i32, TCGv, TCGArg, TCGMemOp);
>> +
>>  static inline void tcg_gen_qemu_ld8u(TCGv ret, TCGv addr, int mem_index)
>>  {
>>      tcg_gen_qemu_ld_tl(ret, addr, mem_index, MO_UB);
>> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
>> index 13ccb60..d6c0454 100644
>> --- a/tcg/tcg-opc.h
>> +++ b/tcg/tcg-opc.h
>> @@ -183,6 +183,10 @@ DEF(qemu_ld_i32, 1, TLADDR_ARGS, 1,
>>      TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
>>  DEF(qemu_st_i32, 0, TLADDR_ARGS + 1, 1,
>>      TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
>> +DEF(qemu_ldlink_i32, 1, TLADDR_ARGS, 2,
>> +    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
>> +DEF(qemu_stcond_i32, 1, TLADDR_ARGS + 1, 2,
>> +    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
>>  DEF(qemu_ld_i64, DATA64_ARGS, TLADDR_ARGS, 1,
>>      TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
>>  DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>> diff --git a/tcg/tcg.c b/tcg/tcg.c
>> index 7e088b1..8a2265e 100644
>> --- a/tcg/tcg.c
>> +++ b/tcg/tcg.c
>> @@ -1068,6 +1068,8 @@ void tcg_dump_ops(TCGContext *s)
>>                  i = 1;
>>                  break;
>>              case INDEX_op_qemu_ld_i32:
>> +            case INDEX_op_qemu_ldlink_i32:
>> +            case INDEX_op_qemu_stcond_i32:
>>              case INDEX_op_qemu_st_i32:
>>              case INDEX_op_qemu_ld_i64:
>>              case INDEX_op_qemu_st_i64:
>> diff --git a/tcg/tcg.h b/tcg/tcg.h
>> index 8ca85ab..d41a18c 100644
>> --- a/tcg/tcg.h
>> +++ b/tcg/tcg.h
>> @@ -282,6 +282,8 @@ typedef enum TCGMemOp {
>>      MO_TEQ   = MO_TE | MO_Q,
>>
>>      MO_SSIZE = MO_SIZE | MO_SIGN,
>> +
>> +    MO_EXCL  = 32, /* Set for exclusive memory access */
>>  } TCGMemOp;
>>
>>  typedef tcg_target_ulong TCGArg;
>> @@ -964,13 +966,13 @@ uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
>>                             TCGMemOpIdx oi, uintptr_t retaddr);
>>  /* Exclusive variants */
>>  tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
>> -                                         int mmu_idx, uintptr_t retaddr);
>> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>>  tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
>> -                                        int mmu_idx, uintptr_t retaddr);
>> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>>  tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
>> -                                        int mmu_idx, uintptr_t retaddr);
>> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>>  uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
>> -                               int mmu_idx, uintptr_t retaddr);
>> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>>
>>  /* Value sign-extended to tcg register size.  */
>>  tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
>> @@ -1000,13 +1002,13 @@ void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr);
>>  /* Exclusive variants */
>>  tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
>> -                                uint8_t val, int mmu_idx, uintptr_t retaddr);
>> +                            uint8_t val, TCGMemOpIdx oi, uintptr_t retaddr);
>>  tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
>> -                                uint16_t val, int mmu_idx, uintptr_t retaddr);
>> +                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
>>  tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
>> -                                uint32_t val, int mmu_idx, uintptr_t retaddr);
>> +                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
>>  uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
>> -                                uint64_t val, int mmu_idx, uintptr_t retaddr);
>> +                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
>>
>>  /* Temporary aliases until backends are converted.  */
>>  #ifdef TARGET_WORDS_BIGENDIAN
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v3 05/13] target-arm: translate: implement qemu_ldlink and qemu_stcond ops
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 05/13] target-arm: translate: implement qemu_ldlink and qemu_stcond ops Alvise Rigo
@ 2015-07-17 12:51   ` Alex Bennée
  2015-07-17 13:01     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2015-07-17 12:51 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Implement the strex and ldrex instructions relying on TCG's qemu_ldlink and
> qemu_stcond.  For the time being only the 32-bit instructions are supported.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  target-arm/translate.c |  87 ++++++++++++++++++++++++++++++++++-
>  tcg/arm/tcg-target.c   | 121 +++++++++++++++++++++++++++++++++++++------------
>  2 files changed, 178 insertions(+), 30 deletions(-)
>
> diff --git a/target-arm/translate.c b/target-arm/translate.c
> index 80302cd..0366c76 100644
> --- a/target-arm/translate.c
> +++ b/target-arm/translate.c
> @@ -72,6 +72,8 @@ static TCGv_i64 cpu_exclusive_test;
>  static TCGv_i32 cpu_exclusive_info;
>  #endif
>  
> +static TCGv_i32 cpu_ll_sc_context;
> +
>  /* FIXME:  These should be removed.  */
>  static TCGv_i32 cpu_F0s, cpu_F1s;
>  static TCGv_i64 cpu_F0d, cpu_F1d;
> @@ -103,6 +105,8 @@ void arm_translate_init(void)
>          offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
>      cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
>          offsetof(CPUARMState, exclusive_val), "exclusive_val");
> +    cpu_ll_sc_context = tcg_global_mem_new_i32(TCG_AREG0,
> +        offsetof(CPUARMState, ll_sc_context), "ll_sc_context");
>  #ifdef CONFIG_USER_ONLY
>      cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
>          offsetof(CPUARMState, exclusive_test), "exclusive_test");
> @@ -961,6 +965,18 @@ DO_GEN_ST(8, MO_UB)
>  DO_GEN_ST(16, MO_TEUW)
>  DO_GEN_ST(32, MO_TEUL)
>  
> +/* Load/Store exclusive generators (always unsigned) */
> +static inline void gen_aa32_ldex32(TCGv_i32 val, TCGv_i32 addr, int index)
> +{
> +    tcg_gen_qemu_ldlink_i32(val, addr, index, MO_TEUL | MO_EXCL);
> +}
> +
> +static inline void gen_aa32_stex32(TCGv_i32 is_dirty, TCGv_i32 val,
> +                                   TCGv_i32 addr, int index)
> +{
> +    tcg_gen_qemu_stcond_i32(is_dirty, val, addr, index, MO_TEUL | MO_EXCL);
> +}
> +
>  static inline void gen_set_pc_im(DisasContext *s, target_ulong val)
>  {
>      tcg_gen_movi_i32(cpu_R[15], val);
> @@ -7427,6 +7443,26 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>      store_reg(s, rt, tmp);
>  }
>  
> +static void gen_load_exclusive_multi(DisasContext *s, int rt, int rt2,
> +                                     TCGv_i32 addr, int size)
> +{
> +    TCGv_i32 tmp = tcg_temp_new_i32();
> +
> +    switch (size) {
> +    case 0:
> +    case 1:
> +        abort();
> +    case 2:
> +        gen_aa32_ldex32(tmp, addr, get_mem_index(s));
> +        break;
> +    case 3:
> +    default:
> +        abort();
> +    }
> +
> +    store_reg(s, rt, tmp);
> +}
> +
>  static void gen_clrex(DisasContext *s)
>  {
>      gen_helper_atomic_clear(cpu_env);
> @@ -7460,6 +7496,52 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>      tcg_temp_free_i64(val);
>      tcg_temp_free_i32(tmp_size);
>  }
> +
> +static void gen_store_exclusive_multi(DisasContext *s, int rd, int rt, int rt2,
> +                                      TCGv_i32 addr, int size)
> +{
> +    TCGv_i32 tmp;
> +    TCGv_i32 is_dirty;
> +    TCGLabel *done_label;
> +    TCGLabel *fail_label;
> +
> +    fail_label = gen_new_label();
> +    done_label = gen_new_label();
> +
> +    tmp = tcg_temp_new_i32();
> +    is_dirty = tcg_temp_new_i32();
> +
> +    /* Fail if we are not in LL/SC context. */
> +    tcg_gen_brcondi_i32(TCG_COND_NE, cpu_ll_sc_context, 1, fail_label);
> +
> +    tmp = load_reg(s, rt);
> +    switch (size) {
> +    case 0:
> +    case 1:
> +        abort();
> +        break;
> +    case 2:
> +        gen_aa32_stex32(is_dirty, tmp, addr, get_mem_index(s));
> +        break;
> +    case 3:
> +    default:
> +        abort();
> +    }
> +
> +    tcg_temp_free_i32(tmp);
> +
> +    /* Check if the store conditional has to fail. */
> +    tcg_gen_brcondi_i32(TCG_COND_EQ, is_dirty, 1, fail_label);
> +    tcg_temp_free_i32(is_dirty);
> +
> +    tcg_temp_free_i32(tmp);
> +
> +    tcg_gen_movi_i32(cpu_R[rd], 0); /* is_dirty = 0 */
> +    tcg_gen_br(done_label);
> +    gen_set_label(fail_label);
> +    tcg_gen_movi_i32(cpu_R[rd], 1); /* is_dirty = 1 */
> +    gen_set_label(done_label);
> +}
>  #endif
>  
>  /* gen_srs:
> @@ -8308,7 +8390,7 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
>                          } else if (insn & (1 << 20)) {
>                              switch (op1) {
>                              case 0: /* ldrex */
> -                                gen_load_exclusive(s, rd, 15, addr, 2);
> +                                gen_load_exclusive_multi(s, rd, 15, addr, 2);
>                                  break;
>                              case 1: /* ldrexd */
>                                  gen_load_exclusive(s, rd, rd + 1, addr, 3);
> @@ -8326,7 +8408,8 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
>                              rm = insn & 0xf;
>                              switch (op1) {
>                              case 0:  /*  strex */
> -                                gen_store_exclusive(s, rd, rm, 15, addr, 2);
> +                                gen_store_exclusive_multi(s, rd, rm, 15,
> +                                                          addr, 2);
>                                  break;
>                              case 1: /*  strexd */
>                                  gen_store_exclusive(s, rd, rm, rm + 1, addr, 3);
> diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
> index ae2ec7a..f2b69a0 100644
> --- a/tcg/arm/tcg-target.c
> +++ b/tcg/arm/tcg-target.c
> @@ -1069,6 +1069,17 @@ static void * const qemu_ld_helpers[16] = {
>      [MO_BESL] = helper_be_ldul_mmu,
>  };

So I'm guessing we'll be implementing this for every TCG backend? 

>  
> +/* LoadLink helpers, only unsigned. Use the macro below to access them. */
> +static void * const qemu_ldex_helpers[16] = {
> +    [MO_LEUL] = helper_le_ldlinkul_mmu,
> +};
> +
> +#define LDEX_HELPER(mem_op)                                             \
> +({                                                                      \
> +    assert(mem_op & MO_EXCL);                                           \
> +    qemu_ldex_helpers[((int)mem_op - MO_EXCL)];                         \
> +})
> +
>  /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
>   *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
>   */
> @@ -1082,6 +1093,19 @@ static void * const qemu_st_helpers[16] = {
>      [MO_BEQ]  = helper_be_stq_mmu,
>  };
>  
> +/* StoreConditional helpers. Use the macro below to access them. */
> +static void * const qemu_stex_helpers[16] = {
> +    [MO_LEUL] = helper_le_stcondl_mmu,
> +};
> +
> +#define STEX_HELPER(mem_op)                                             \
> +({                                                                      \
> +    assert(mem_op & MO_EXCL);                                           \
> +    qemu_stex_helpers[(int)mem_op - MO_EXCL];                           \
> +})

Can the lookup not be merged with the existing ldst_helpers?

> +
> +
> +
>  /* Helper routines for marshalling helper function arguments into
>   * the correct registers and stack.
>   * argreg is where we want to put this argument, arg is the argument itself.
> @@ -1222,13 +1246,14 @@ static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
>     path for a load or store, so that we can later generate the correct
>     helper code.  */
>  static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
> -                                TCGReg datalo, TCGReg datahi, TCGReg addrlo,
> -                                TCGReg addrhi, tcg_insn_unit *raddr,
> -                                tcg_insn_unit *label_ptr)
> +                                TCGReg llsc_success, TCGReg datalo,
> +                                TCGReg datahi, TCGReg addrlo, TCGReg addrhi,
> +                                tcg_insn_unit *raddr, tcg_insn_unit *label_ptr)
>  {
>      TCGLabelQemuLdst *label = new_ldst_label(s);
>  
>      label->is_ld = is_ld;
> +    label->llsc_success = llsc_success;
>      label->oi = oi;
>      label->datalo_reg = datalo;
>      label->datahi_reg = datahi;
> @@ -1259,12 +1284,16 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
>      /* For armv6 we can use the canonical unsigned helpers and minimize
>         icache usage.  For pre-armv6, use the signed helpers since we do
>         not have a single insn sign-extend.  */
> -    if (use_armv6_instructions) {
> -        func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)];
> +    if (opc & MO_EXCL) {
> +        func = LDEX_HELPER(opc);
>      } else {
> -        func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)];
> -        if (opc & MO_SIGN) {
> -            opc = MO_UL;
> +        if (use_armv6_instructions) {
> +            func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)];
> +        } else {
> +            func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)];
> +            if (opc & MO_SIGN) {
> +                opc = MO_UL;
> +            }
>          }
>      }
>      tcg_out_call(s, func);
> @@ -1336,8 +1365,15 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
>      argreg = tcg_out_arg_imm32(s, argreg, oi);
>      argreg = tcg_out_arg_reg32(s, argreg, TCG_REG_R14);
>  
> -    /* Tail-call to the helper, which will return to the fast path.  */
> -    tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
> +    if (opc & MO_EXCL) {
> +        tcg_out_call(s, STEX_HELPER(opc));
> +        /* Save the output of the StoreConditional */
> +        tcg_out_mov_reg(s, COND_AL, lb->llsc_success, TCG_REG_R0);
> +        tcg_out_goto(s, COND_AL, lb->raddr);
> +    } else {
> +        /* Tail-call to the helper, which will return to the fast path.  */
> +        tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
> +    }
>  }
>  #endif /* SOFTMMU */
>  
> @@ -1461,7 +1497,8 @@ static inline void tcg_out_qemu_ld_direct(TCGContext *s, TCGMemOp opc,
>      }
>  }
>  
> -static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
> +static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64,
> +                            bool isLoadLink)
>  {
>      TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
>      TCGMemOpIdx oi;
> @@ -1484,13 +1521,20 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
>      addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 1);
>  
>      /* This a conditional BL only to load a pointer within this opcode into LR
> -       for the slow path.  We will not be using the value for a tail call.  */
> -    label_ptr = s->code_ptr;
> -    tcg_out_bl_noaddr(s, COND_NE);
> +       for the slow path.  We will not be using the value for a tail call.
> +       In the context of a LoadLink instruction, we don't check the TLB but we
> +       always follow the slow path.  */
> +    if (isLoadLink) {
> +        label_ptr = s->code_ptr;
> +        tcg_out_bl_noaddr(s, COND_AL);
> +    } else {
> +        label_ptr = s->code_ptr;
> +        tcg_out_bl_noaddr(s, COND_NE);

This seems a little redundant: we could set label_ptr outside the
isLoadLink check, as it will be the same in both cases. However, if we are
always taking the slow path for exclusive accesses, then why should we
generate a tcg_out_tlb_read()? Do we need some side effects from calling
it for the slow path?
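
For the first point, something like this is all I mean (just a sketch):

    label_ptr = s->code_ptr;
    tcg_out_bl_noaddr(s, isLoadLink ? COND_AL : COND_NE);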

>  
> -    tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
> +        tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
> +    }
>  
> -    add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
> +    add_qemu_ldst_label(s, true, oi, 0, datalo, datahi, addrlo, addrhi,
>                          s->code_ptr, label_ptr);
>  #else /* !CONFIG_SOFTMMU */
>      if (GUEST_BASE) {
> @@ -1592,9 +1636,11 @@ static inline void tcg_out_qemu_st_direct(TCGContext *s, TCGMemOp opc,
>      }
>  }
>  
> -static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
> +static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
> +                            bool isStoreCond)
>  {
>      TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
> +    TCGReg llsc_success;
>      TCGMemOpIdx oi;
>      TCGMemOp opc;
>  #ifdef CONFIG_SOFTMMU
> @@ -1603,6 +1649,9 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>      tcg_insn_unit *label_ptr;
>  #endif
>  
> +    /* The stcond variant has one more param */
> +    llsc_success = (isStoreCond ? *args++ : 0);
> +
>      datalo = *args++;
>      datahi = (is64 ? *args++ : 0);
>      addrlo = *args++;
> @@ -1612,16 +1661,24 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>  
>  #ifdef CONFIG_SOFTMMU
>      mem_index = get_mmuidx(oi);
> -    addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 0);
>  
> -    tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
> +    if (isStoreCond) {
> +        /* Always follow the slow-path for an exclusive access */
> +        label_ptr = s->code_ptr;
> +        tcg_out_bl_noaddr(s, COND_AL);
> +    } else {
> +        addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE,
> +                                  mem_index, 0);
>  
> -    /* The conditional call must come last, as we're going to return here.  */
> -    label_ptr = s->code_ptr;
> -    tcg_out_bl_noaddr(s, COND_NE);
> +        tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
>  
> -    add_qemu_ldst_label(s, false, oi, datalo, datahi, addrlo, addrhi,
> -                        s->code_ptr, label_ptr);
> +        /* The conditional call must come last, as we're going to return here.*/
> +        label_ptr = s->code_ptr;
> +        tcg_out_bl_noaddr(s, COND_NE);
> +    }
> +
> +    add_qemu_ldst_label(s, false, oi, llsc_success, datalo, datahi, addrlo,
> +                        addrhi, s->code_ptr, label_ptr);

Indeed this does what I commented about above ;-)

>  #else /* !CONFIG_SOFTMMU */
>      if (GUEST_BASE) {
>          tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP, GUEST_BASE);
> @@ -1864,16 +1921,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>          break;
>  
>      case INDEX_op_qemu_ld_i32:
> -        tcg_out_qemu_ld(s, args, 0);
> +        tcg_out_qemu_ld(s, args, 0, 0);
> +        break;
> +    case INDEX_op_qemu_ldlink_i32:
> +        tcg_out_qemu_ld(s, args, 0, 1); /* LoadLink */
>          break;
>      case INDEX_op_qemu_ld_i64:
> -        tcg_out_qemu_ld(s, args, 1);
> +        tcg_out_qemu_ld(s, args, 1, 0);
>          break;
>      case INDEX_op_qemu_st_i32:
> -        tcg_out_qemu_st(s, args, 0);
> +        tcg_out_qemu_st(s, args, 0, 0);
> +        break;
> +    case INDEX_op_qemu_stcond_i32:
> +        tcg_out_qemu_st(s, args, 0, 1); /* StoreConditional */
>          break;
>      case INDEX_op_qemu_st_i64:
> -        tcg_out_qemu_st(s, args, 1);
> +        tcg_out_qemu_st(s, args, 1, 0);
>          break;
>  
>      case INDEX_op_bswap16_i32:
> @@ -1957,8 +2020,10 @@ static const TCGTargetOpDef arm_op_defs[] = {
>  
>  #if TARGET_LONG_BITS == 32
>      { INDEX_op_qemu_ld_i32, { "r", "l" } },
> +    { INDEX_op_qemu_ldlink_i32, { "r", "l" } },
>      { INDEX_op_qemu_ld_i64, { "r", "r", "l" } },
>      { INDEX_op_qemu_st_i32, { "s", "s" } },
> +    { INDEX_op_qemu_stcond_i32, { "r", "s", "s" } },
>      { INDEX_op_qemu_st_i64, { "s", "s", "s" } },
>  #else
>      { INDEX_op_qemu_ld_i32, { "r", "l", "l" } },

-- 
Alex Bennée

* Re: [Qemu-devel] [RFC v3 06/13] target-i386: translate: implement qemu_ldlink and qemu_stcond ops
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 06/13] target-i386: " Alvise Rigo
@ 2015-07-17 12:56   ` Alex Bennée
  2015-07-17 13:27     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2015-07-17 12:56 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Implement strex and ldrex instructions relying on TCG's qemu_ldlink and
> qemu_stcond.  For the time being only 32bit configurations are supported.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  tcg/i386/tcg-target.c | 136 ++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 114 insertions(+), 22 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
> index 0d7c99c..d8250a9 100644
> --- a/tcg/i386/tcg-target.c
> +++ b/tcg/i386/tcg-target.c
> @@ -1141,6 +1141,17 @@ static void * const qemu_ld_helpers[16] = {
>      [MO_BEQ]  = helper_be_ldq_mmu,
>  };
>  
> +/* LoadLink helpers, only unsigned. Use the macro below to access them. */
> +static void * const qemu_ldex_helpers[16] = {
> +    [MO_LEUL] = helper_le_ldlinkul_mmu,
> +};
> +
> +#define LDEX_HELPER(mem_op)                                             \
> +({                                                                      \
> +    assert(mem_op & MO_EXCL);                                           \
> +    qemu_ldex_helpers[((int)mem_op - MO_EXCL)];                         \
> +})
> +
>  /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
>   *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
>   */
> @@ -1154,6 +1165,17 @@ static void * const qemu_st_helpers[16] = {
>      [MO_BEQ]  = helper_be_stq_mmu,
>  };
>  
> +/* StoreConditional helpers. Use the macro below to access them. */
> +static void * const qemu_stex_helpers[16] = {
> +    [MO_LEUL] = helper_le_stcondl_mmu,
> +};
> +
> +#define STEX_HELPER(mem_op)                                             \
> +({                                                                      \
> +    assert(mem_op & MO_EXCL);                                           \
> +    qemu_stex_helpers[(int)mem_op - MO_EXCL];                           \
> +})
> +

Same comments as for target-arm.

Do we need to be protecting backends with HAS_LDST_EXCL defines or some
such macro hackery? What currently happens if you use the new TCG ops
when the backend doesn't support them? Is supporting all backends a
prerequisite for the series?

>  /* Perform the TLB load and compare.
>  
>     Inputs:
> @@ -1249,6 +1271,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
>   * for a load or store, so that we can later generate the correct helper code
>   */
>  static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
> +                                TCGReg llsc_success,
>                                  TCGReg datalo, TCGReg datahi,
>                                  TCGReg addrlo, TCGReg addrhi,
>                                  tcg_insn_unit *raddr,
> @@ -1257,6 +1280,7 @@ static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
>      TCGLabelQemuLdst *label = new_ldst_label(s);
>  
>      label->is_ld = is_ld;
> +    label->llsc_success = llsc_success;
>      label->oi = oi;
>      label->datalo_reg = datalo;
>      label->datahi_reg = datahi;
> @@ -1311,7 +1335,11 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
>                       (uintptr_t)l->raddr);
>      }
>  
> -    tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
> +    if (opc & MO_EXCL) {
> +        tcg_out_call(s, LDEX_HELPER(opc));
> +    } else {
> +        tcg_out_call(s, qemu_ld_helpers[opc & ~MO_SIGN]);
> +    }
>  
>      data_reg = l->datalo_reg;
>      switch (opc & MO_SSIZE) {
> @@ -1415,9 +1443,16 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
>          }
>      }
>  
> -    /* "Tail call" to the helper, with the return address back inline.  */
> -    tcg_out_push(s, retaddr);
> -    tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
> +    if (opc & MO_EXCL) {
> +        tcg_out_call(s, STEX_HELPER(opc));
> +        /* Save the output of the StoreConditional */
> +        tcg_out_mov(s, TCG_TYPE_I32, l->llsc_success, TCG_REG_EAX);
> +        tcg_out_jmp(s, l->raddr);
> +    } else {
> +        /* "Tail call" to the helper, with the return address back inline.  */
> +        tcg_out_push(s, retaddr);
> +        tcg_out_jmp(s, qemu_st_helpers[opc]);
> +    }
>  }
>  #elif defined(__x86_64__) && defined(__linux__)
>  # include <asm/prctl.h>
> @@ -1530,7 +1565,8 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
>  /* XXX: qemu_ld and qemu_st could be modified to clobber only EDX and
>     EAX. It will be useful once fixed registers globals are less
>     common. */
> -static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
> +static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64,
> +                            bool isLoadLink)
>  {
>      TCGReg datalo, datahi, addrlo;
>      TCGReg addrhi __attribute__((unused));
> @@ -1553,14 +1589,34 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
>      mem_index = get_mmuidx(oi);
>      s_bits = opc & MO_SIZE;
>  
> -    tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
> -                     label_ptr, offsetof(CPUTLBEntry, addr_read));
> +    if (isLoadLink) {
> +        TCGType t = ((TCG_TARGET_REG_BITS == 64) && (TARGET_LONG_BITS == 64)) ?
> +                                                   TCG_TYPE_I64 : TCG_TYPE_I32;
> +        /* The JMP address will be patched afterwards,
> +         * in tcg_out_qemu_ld_slow_path (two times when
> +         * TARGET_LONG_BITS > TCG_TARGET_REG_BITS). */
> +        tcg_out_mov(s, t, TCG_REG_L1, addrlo);
> +
> +        if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
> +            /* Store the second part of the address. */
> +            tcg_out_mov(s, t, TCG_REG_L0, addrhi);
> +            /* We add 4 to include the jmp that follows. */
> +            label_ptr[1] = s->code_ptr + 4;
> +        }
>  
> -    /* TLB Hit.  */
> -    tcg_out_qemu_ld_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
> +        tcg_out_opc(s, OPC_JMP_long, 0, 0, 0);
> +        label_ptr[0] = s->code_ptr;
> +        s->code_ptr += 4;
> +    } else {
> +        tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
> +                         label_ptr, offsetof(CPUTLBEntry, addr_read));
> +
> +        /* TLB Hit.  */
> +        tcg_out_qemu_ld_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
> +    }
>  
>      /* Record the current context of a load into ldst label */
> -    add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
> +    add_qemu_ldst_label(s, true, oi, 0, datalo, datahi, addrlo, addrhi,
>                          s->code_ptr, label_ptr);
>  #else
>      {
> @@ -1663,9 +1719,10 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
>      }
>  }
>  
> -static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
> +static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
> +                            bool isStoreCond)
>  {
> -    TCGReg datalo, datahi, addrlo;
> +    TCGReg datalo, datahi, addrlo, llsc_success;
>      TCGReg addrhi __attribute__((unused));
>      TCGMemOpIdx oi;
>      TCGMemOp opc;
> @@ -1675,6 +1732,9 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>      tcg_insn_unit *label_ptr[2];
>  #endif
>  
> +    /* The stcond variant has one more param */
> +    llsc_success = (isStoreCond ? *args++ : 0);
> +
>      datalo = *args++;
>      datahi = (TCG_TARGET_REG_BITS == 32 && is64 ? *args++ : 0);
>      addrlo = *args++;
> @@ -1686,15 +1746,35 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>      mem_index = get_mmuidx(oi);
>      s_bits = opc & MO_SIZE;
>  
> -    tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
> -                     label_ptr, offsetof(CPUTLBEntry, addr_write));
> +    if (isStoreCond) {
> +        TCGType t = ((TCG_TARGET_REG_BITS == 64) && (TARGET_LONG_BITS == 64)) ?
> +                                                   TCG_TYPE_I64 : TCG_TYPE_I32;
> +        /* The JMP address will be filled afterwards,
> +         * in tcg_out_qemu_ld_slow_path (two times when
> +         * TARGET_LONG_BITS > TCG_TARGET_REG_BITS). */
> +        tcg_out_mov(s, t, TCG_REG_L1, addrlo);
> +
> +        if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
> +            /* Store the second part of the address. */
> +            tcg_out_mov(s, t, TCG_REG_L0, addrhi);
> +            /* We add 4 to include the jmp that follows. */
> +            label_ptr[1] = s->code_ptr + 4;
> +        }
>  
> -    /* TLB Hit.  */
> -    tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
> +        tcg_out_opc(s, OPC_JMP_long, 0, 0, 0);
> +        label_ptr[0] = s->code_ptr;
> +        s->code_ptr += 4;
> +    } else {
> +        tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
> +                         label_ptr, offsetof(CPUTLBEntry, addr_write));
> +
> +        /* TLB Hit.  */
> +        tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
> +    }
>  
>      /* Record the current context of a store into ldst label */
> -    add_qemu_ldst_label(s, false, oi, datalo, datahi, addrlo, addrhi,
> -                        s->code_ptr, label_ptr);
> +    add_qemu_ldst_label(s, false, oi, llsc_success, datalo, datahi, addrlo,
> +                        addrhi, s->code_ptr, label_ptr);
>  #else
>      {
>          int32_t offset = GUEST_BASE;
> @@ -1955,16 +2035,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>          break;
>  
>      case INDEX_op_qemu_ld_i32:
> -        tcg_out_qemu_ld(s, args, 0);
> +        tcg_out_qemu_ld(s, args, 0, 0);
> +        break;
> +    case INDEX_op_qemu_ldlink_i32:
> +        tcg_out_qemu_ld(s, args, 0, 1);
>          break;
>      case INDEX_op_qemu_ld_i64:
> -        tcg_out_qemu_ld(s, args, 1);
> +        tcg_out_qemu_ld(s, args, 1, 0);
>          break;
>      case INDEX_op_qemu_st_i32:
> -        tcg_out_qemu_st(s, args, 0);
> +        tcg_out_qemu_st(s, args, 0, 0);
> +        break;
> +    case INDEX_op_qemu_stcond_i32:
> +        tcg_out_qemu_st(s, args, 0, 1);
>          break;
>      case INDEX_op_qemu_st_i64:
> -        tcg_out_qemu_st(s, args, 1);
> +        tcg_out_qemu_st(s, args, 1, 0);
>          break;
>  
>      OP_32_64(mulu2):
> @@ -2186,17 +2272,23 @@ static const TCGTargetOpDef x86_op_defs[] = {
>  
>  #if TCG_TARGET_REG_BITS == 64
>      { INDEX_op_qemu_ld_i32, { "r", "L" } },
> +    { INDEX_op_qemu_ldlink_i32, { "r", "L" } },
>      { INDEX_op_qemu_st_i32, { "L", "L" } },
> +    { INDEX_op_qemu_stcond_i32, { "r", "L", "L" } },
>      { INDEX_op_qemu_ld_i64, { "r", "L" } },
>      { INDEX_op_qemu_st_i64, { "L", "L" } },
>  #elif TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
>      { INDEX_op_qemu_ld_i32, { "r", "L" } },
> +    { INDEX_op_qemu_ldlink_i32, { "r", "L" } },
>      { INDEX_op_qemu_st_i32, { "L", "L" } },
> +    { INDEX_op_qemu_stcond_i32, { "r", "L", "L" } },
>      { INDEX_op_qemu_ld_i64, { "r", "r", "L" } },
>      { INDEX_op_qemu_st_i64, { "L", "L", "L" } },
>  #else
>      { INDEX_op_qemu_ld_i32, { "r", "L", "L" } },
> +    { INDEX_op_qemu_ldlink_i32, { "r", "L", "L" } },
>      { INDEX_op_qemu_st_i32, { "L", "L", "L" } },
> +    { INDEX_op_qemu_stcond_i32, { "r", "L", "L", "L" } },
>      { INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
>      { INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
>  #endif

-- 
Alex Bennée

* Re: [Qemu-devel] [RFC v3 05/13] target-arm: translate: implement qemu_ldlink and qemu_stcond ops
  2015-07-17 12:51   ` Alex Bennée
@ 2015-07-17 13:01     ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-17 13:01 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

On Fri, Jul 17, 2015 at 2:51 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Implement strex and ldrex instructions relying on TCG's qemu_ldlink and
>> qemu_stcond.  For the time being only the 32bit instructions are supported.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  target-arm/translate.c |  87 ++++++++++++++++++++++++++++++++++-
>>  tcg/arm/tcg-target.c   | 121 +++++++++++++++++++++++++++++++++++++------------
>>  2 files changed, 178 insertions(+), 30 deletions(-)
>>
>> diff --git a/target-arm/translate.c b/target-arm/translate.c
>> index 80302cd..0366c76 100644
>> --- a/target-arm/translate.c
>> +++ b/target-arm/translate.c
>> @@ -72,6 +72,8 @@ static TCGv_i64 cpu_exclusive_test;
>>  static TCGv_i32 cpu_exclusive_info;
>>  #endif
>>
>> +static TCGv_i32 cpu_ll_sc_context;
>> +
>>  /* FIXME:  These should be removed.  */
>>  static TCGv_i32 cpu_F0s, cpu_F1s;
>>  static TCGv_i64 cpu_F0d, cpu_F1d;
>> @@ -103,6 +105,8 @@ void arm_translate_init(void)
>>          offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
>>      cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
>>          offsetof(CPUARMState, exclusive_val), "exclusive_val");
>> +    cpu_ll_sc_context = tcg_global_mem_new_i32(TCG_AREG0,
>> +        offsetof(CPUARMState, ll_sc_context), "ll_sc_context");
>>  #ifdef CONFIG_USER_ONLY
>>      cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
>>          offsetof(CPUARMState, exclusive_test), "exclusive_test");
>> @@ -961,6 +965,18 @@ DO_GEN_ST(8, MO_UB)
>>  DO_GEN_ST(16, MO_TEUW)
>>  DO_GEN_ST(32, MO_TEUL)
>>
>> +/* Load/Store exclusive generators (always unsigned) */
>> +static inline void gen_aa32_ldex32(TCGv_i32 val, TCGv_i32 addr, int index)
>> +{
>> +    tcg_gen_qemu_ldlink_i32(val, addr, index, MO_TEUL | MO_EXCL);
>> +}
>> +
>> +static inline void gen_aa32_stex32(TCGv_i32 is_dirty, TCGv_i32 val,
>> +                                   TCGv_i32 addr, int index)
>> +{
>> +    tcg_gen_qemu_stcond_i32(is_dirty, val, addr, index, MO_TEUL | MO_EXCL);
>> +}
>> +
>>  static inline void gen_set_pc_im(DisasContext *s, target_ulong val)
>>  {
>>      tcg_gen_movi_i32(cpu_R[15], val);
>> @@ -7427,6 +7443,26 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>>      store_reg(s, rt, tmp);
>>  }
>>
>> +static void gen_load_exclusive_multi(DisasContext *s, int rt, int rt2,
>> +                                     TCGv_i32 addr, int size)
>> +{
>> +    TCGv_i32 tmp = tcg_temp_new_i32();
>> +
>> +    switch (size) {
>> +    case 0:
>> +    case 1:
>> +        abort();
>> +    case 2:
>> +        gen_aa32_ldex32(tmp, addr, get_mem_index(s));
>> +        break;
>> +    case 3:
>> +    default:
>> +        abort();
>> +    }
>> +
>> +    store_reg(s, rt, tmp);
>> +}
>> +
>>  static void gen_clrex(DisasContext *s)
>>  {
>>      gen_helper_atomic_clear(cpu_env);
>> @@ -7460,6 +7496,52 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>>      tcg_temp_free_i64(val);
>>      tcg_temp_free_i32(tmp_size);
>>  }
>> +
>> +static void gen_store_exclusive_multi(DisasContext *s, int rd, int rt, int rt2,
>> +                                      TCGv_i32 addr, int size)
>> +{
>> +    TCGv_i32 tmp;
>> +    TCGv_i32 is_dirty;
>> +    TCGLabel *done_label;
>> +    TCGLabel *fail_label;
>> +
>> +    fail_label = gen_new_label();
>> +    done_label = gen_new_label();
>> +
>> +    tmp = tcg_temp_new_i32();
>> +    is_dirty = tcg_temp_new_i32();
>> +
>> +    /* Fail if we are not in LL/SC context. */
>> +    tcg_gen_brcondi_i32(TCG_COND_NE, cpu_ll_sc_context, 1, fail_label);
>> +
>> +    tmp = load_reg(s, rt);
>> +    switch (size) {
>> +    case 0:
>> +    case 1:
>> +        abort();
>> +        break;
>> +    case 2:
>> +        gen_aa32_stex32(is_dirty, tmp, addr, get_mem_index(s));
>> +        break;
>> +    case 3:
>> +    default:
>> +        abort();
>> +    }
>> +
>> +    tcg_temp_free_i32(tmp);
>> +
>> +    /* Check if the store conditional has to fail. */
>> +    tcg_gen_brcondi_i32(TCG_COND_EQ, is_dirty, 1, fail_label);
>> +    tcg_temp_free_i32(is_dirty);
>> +
>> +    tcg_temp_free_i32(tmp);
>> +
>> +    tcg_gen_movi_i32(cpu_R[rd], 0); /* is_dirty = 0 */
>> +    tcg_gen_br(done_label);
>> +    gen_set_label(fail_label);
>> +    tcg_gen_movi_i32(cpu_R[rd], 1); /* is_dirty = 1 */
>> +    gen_set_label(done_label);
>> +}
>>  #endif
>>
>>  /* gen_srs:
>> @@ -8308,7 +8390,7 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
>>                          } else if (insn & (1 << 20)) {
>>                              switch (op1) {
>>                              case 0: /* ldrex */
>> -                                gen_load_exclusive(s, rd, 15, addr, 2);
>> +                                gen_load_exclusive_multi(s, rd, 15, addr, 2);
>>                                  break;
>>                              case 1: /* ldrexd */
>>                                  gen_load_exclusive(s, rd, rd + 1, addr, 3);
>> @@ -8326,7 +8408,8 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
>>                              rm = insn & 0xf;
>>                              switch (op1) {
>>                              case 0:  /*  strex */
>> -                                gen_store_exclusive(s, rd, rm, 15, addr, 2);
>> +                                gen_store_exclusive_multi(s, rd, rm, 15,
>> +                                                          addr, 2);
>>                                  break;
>>                              case 1: /*  strexd */
>>                                  gen_store_exclusive(s, rd, rm, rm + 1, addr, 3);
>> diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
>> index ae2ec7a..f2b69a0 100644
>> --- a/tcg/arm/tcg-target.c
>> +++ b/tcg/arm/tcg-target.c
>> @@ -1069,6 +1069,17 @@ static void * const qemu_ld_helpers[16] = {
>>      [MO_BESL] = helper_be_ldul_mmu,
>>  };
>
> So I'm guessing we'll be implementing this for every TCG backend?

Correct. Depending on which atomic instructions the guest architecture
provides, only a subset of these helpers will need to be implemented.
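
For example (just a sketch), once a frontend also emits 64-bit exclusive
accesses, the table would simply grow the corresponding entry:

    static void * const qemu_ldex_helpers[16] = {
        [MO_LEUL] = helper_le_ldlinkul_mmu,
        [MO_LEQ]  = helper_le_ldlinkq_mmu,
    };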

>
>>
>> +/* LoadLink helpers, only unsigned. Use the macro below to access them. */
>> +static void * const qemu_ldex_helpers[16] = {
>> +    [MO_LEUL] = helper_le_ldlinkul_mmu,
>> +};
>> +
>> +#define LDEX_HELPER(mem_op)                                             \
>> +({                                                                      \
>> +    assert(mem_op & MO_EXCL);                                           \
>> +    qemu_ldex_helpers[((int)mem_op - MO_EXCL)];                         \
>> +})
>> +
>>  /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
>>   *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
>>   */
>> @@ -1082,6 +1093,19 @@ static void * const qemu_st_helpers[16] = {
>>      [MO_BEQ]  = helper_be_stq_mmu,
>>  };
>>
>> +/* StoreConditional helpers. Use the macro below to access them. */
>> +static void * const qemu_stex_helpers[16] = {
>> +    [MO_LEUL] = helper_le_stcondl_mmu,
>> +};
>> +
>> +#define STEX_HELPER(mem_op)                                             \
>> +({                                                                      \
>> +    assert(mem_op & MO_EXCL);                                           \
>> +    qemu_stex_helpers[(int)mem_op - MO_EXCL];                           \
>> +})
>
> Can the lookup not be merged with the existing ldst_helpers?

Sure.
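
Something along these lines could work (sketch only; the exact indexing
depends on where MO_EXCL ends up in TCGMemOp, so EXCL_IDX below is just a
placeholder for that offset):

    /* One table covering both the normal and the exclusive variants. */
    static void * const qemu_ld_helpers[2 * 16] = {
        [MO_UB]   = helper_ret_ldub_mmu,
        [MO_LEUL] = helper_le_ldul_mmu,
        /* ... */
        [EXCL_IDX + MO_LEUL] = helper_le_ldlinkul_mmu,
    };

    func = qemu_ld_helpers[(opc & (MO_BSWAP | MO_SIZE)) +
                           (opc & MO_EXCL ? EXCL_IDX : 0)];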

>
>> +
>> +
>> +
>>  /* Helper routines for marshalling helper function arguments into
>>   * the correct registers and stack.
>>   * argreg is where we want to put this argument, arg is the argument itself.
>> @@ -1222,13 +1246,14 @@ static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
>>     path for a load or store, so that we can later generate the correct
>>     helper code.  */
>>  static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
>> -                                TCGReg datalo, TCGReg datahi, TCGReg addrlo,
>> -                                TCGReg addrhi, tcg_insn_unit *raddr,
>> -                                tcg_insn_unit *label_ptr)
>> +                                TCGReg llsc_success, TCGReg datalo,
>> +                                TCGReg datahi, TCGReg addrlo, TCGReg addrhi,
>> +                                tcg_insn_unit *raddr, tcg_insn_unit *label_ptr)
>>  {
>>      TCGLabelQemuLdst *label = new_ldst_label(s);
>>
>>      label->is_ld = is_ld;
>> +    label->llsc_success = llsc_success;
>>      label->oi = oi;
>>      label->datalo_reg = datalo;
>>      label->datahi_reg = datahi;
>> @@ -1259,12 +1284,16 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
>>      /* For armv6 we can use the canonical unsigned helpers and minimize
>>         icache usage.  For pre-armv6, use the signed helpers since we do
>>         not have a single insn sign-extend.  */
>> -    if (use_armv6_instructions) {
>> -        func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)];
>> +    if (opc & MO_EXCL) {
>> +        func = LDEX_HELPER(opc);
>>      } else {
>> -        func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)];
>> -        if (opc & MO_SIGN) {
>> -            opc = MO_UL;
>> +        if (use_armv6_instructions) {
>> +            func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)];
>> +        } else {
>> +            func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)];
>> +            if (opc & MO_SIGN) {
>> +                opc = MO_UL;
>> +            }
>>          }
>>      }
>>      tcg_out_call(s, func);
>> @@ -1336,8 +1365,15 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
>>      argreg = tcg_out_arg_imm32(s, argreg, oi);
>>      argreg = tcg_out_arg_reg32(s, argreg, TCG_REG_R14);
>>
>> -    /* Tail-call to the helper, which will return to the fast path.  */
>> -    tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
>> +    if (opc & MO_EXCL) {
>> +        tcg_out_call(s, STEX_HELPER(opc));
>> +        /* Save the output of the StoreConditional */
>> +        tcg_out_mov_reg(s, COND_AL, lb->llsc_success, TCG_REG_R0);
>> +        tcg_out_goto(s, COND_AL, lb->raddr);
>> +    } else {
>> +        /* Tail-call to the helper, which will return to the fast path.  */
>> +        tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
>> +    }
>>  }
>>  #endif /* SOFTMMU */
>>
>> @@ -1461,7 +1497,8 @@ static inline void tcg_out_qemu_ld_direct(TCGContext *s, TCGMemOp opc,
>>      }
>>  }
>>
>> -static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
>> +static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64,
>> +                            bool isLoadLink)
>>  {
>>      TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
>>      TCGMemOpIdx oi;
>> @@ -1484,13 +1521,20 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
>>      addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 1);
>>
>>      /* This a conditional BL only to load a pointer within this opcode into LR
>> -       for the slow path.  We will not be using the value for a tail call.  */
>> -    label_ptr = s->code_ptr;
>> -    tcg_out_bl_noaddr(s, COND_NE);
>> +       for the slow path.  We will not be using the value for a tail call.
>> +       In the context of a LoadLink instruction, we don't check the TLB but we
>> +       always follow the slow path.  */
>> +    if (isLoadLink) {
>> +        label_ptr = s->code_ptr;
>> +        tcg_out_bl_noaddr(s, COND_AL);
>> +    } else {
>> +        label_ptr = s->code_ptr;
>> +        tcg_out_bl_noaddr(s, COND_NE);
>
> This seems a little redundant: we could set label_ptr outside the
> isLoadLink check, as it will be the same in both cases. However, if we are
> always taking the slow path for exclusive accesses, then why should we
> generate a tcg_out_tlb_read()? Do we need some side effects from calling
> it for the slow path?

Good point, we can avoid the TLB read when we are doing an LL/SC.
In fact, the first thing we do in softmmu_llsc_template.h is to update
the corresponding TLB entry if necessary.
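
i.e., something along these lines (sketch only, mirroring what the
StoreConditional side already does):

    if (isLoadLink) {
        /* Always follow the slow path for an exclusive access: no TLB read. */
        label_ptr = s->code_ptr;
        tcg_out_bl_noaddr(s, COND_AL);
    } else {
        addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE,
                                  mem_index, 1);
        label_ptr = s->code_ptr;
        tcg_out_bl_noaddr(s, COND_NE);
        tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
    }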

Thank you,
alvise

>
>>
>> -    tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
>> +        tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
>> +    }
>>
>> -    add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
>> +    add_qemu_ldst_label(s, true, oi, 0, datalo, datahi, addrlo, addrhi,
>>                          s->code_ptr, label_ptr);
>>  #else /* !CONFIG_SOFTMMU */
>>      if (GUEST_BASE) {
>> @@ -1592,9 +1636,11 @@ static inline void tcg_out_qemu_st_direct(TCGContext *s, TCGMemOp opc,
>>      }
>>  }
>>
>> -static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>> +static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
>> +                            bool isStoreCond)
>>  {
>>      TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
>> +    TCGReg llsc_success;
>>      TCGMemOpIdx oi;
>>      TCGMemOp opc;
>>  #ifdef CONFIG_SOFTMMU
>> @@ -1603,6 +1649,9 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>>      tcg_insn_unit *label_ptr;
>>  #endif
>>
>> +    /* The stcond variant has one more param */
>> +    llsc_success = (isStoreCond ? *args++ : 0);
>> +
>>      datalo = *args++;
>>      datahi = (is64 ? *args++ : 0);
>>      addrlo = *args++;
>> @@ -1612,16 +1661,24 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>>
>>  #ifdef CONFIG_SOFTMMU
>>      mem_index = get_mmuidx(oi);
>> -    addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 0);
>>
>> -    tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
>> +    if (isStoreCond) {
>> +        /* Always follow the slow-path for an exclusive access */
>> +        label_ptr = s->code_ptr;
>> +        tcg_out_bl_noaddr(s, COND_AL);
>> +    } else {
>> +        addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE,
>> +                                  mem_index, 0);
>>
>> -    /* The conditional call must come last, as we're going to return here.  */
>> -    label_ptr = s->code_ptr;
>> -    tcg_out_bl_noaddr(s, COND_NE);
>> +        tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
>>
>> -    add_qemu_ldst_label(s, false, oi, datalo, datahi, addrlo, addrhi,
>> -                        s->code_ptr, label_ptr);
>> +        /* The conditional call must come last, as we're going to return here.*/
>> +        label_ptr = s->code_ptr;
>> +        tcg_out_bl_noaddr(s, COND_NE);
>> +    }
>> +
>> +    add_qemu_ldst_label(s, false, oi, llsc_success, datalo, datahi, addrlo,
>> +                        addrhi, s->code_ptr, label_ptr);
>
> Indeed this does what I commented about above ;-)
>
>>  #else /* !CONFIG_SOFTMMU */
>>      if (GUEST_BASE) {
>>          tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP, GUEST_BASE);
>> @@ -1864,16 +1921,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>>          break;
>>
>>      case INDEX_op_qemu_ld_i32:
>> -        tcg_out_qemu_ld(s, args, 0);
>> +        tcg_out_qemu_ld(s, args, 0, 0);
>> +        break;
>> +    case INDEX_op_qemu_ldlink_i32:
>> +        tcg_out_qemu_ld(s, args, 0, 1); /* LoadLink */
>>          break;
>>      case INDEX_op_qemu_ld_i64:
>> -        tcg_out_qemu_ld(s, args, 1);
>> +        tcg_out_qemu_ld(s, args, 1, 0);
>>          break;
>>      case INDEX_op_qemu_st_i32:
>> -        tcg_out_qemu_st(s, args, 0);
>> +        tcg_out_qemu_st(s, args, 0, 0);
>> +        break;
>> +    case INDEX_op_qemu_stcond_i32:
>> +        tcg_out_qemu_st(s, args, 0, 1); /* StoreConditional */
>>          break;
>>      case INDEX_op_qemu_st_i64:
>> -        tcg_out_qemu_st(s, args, 1);
>> +        tcg_out_qemu_st(s, args, 1, 0);
>>          break;
>>
>>      case INDEX_op_bswap16_i32:
>> @@ -1957,8 +2020,10 @@ static const TCGTargetOpDef arm_op_defs[] = {
>>
>>  #if TARGET_LONG_BITS == 32
>>      { INDEX_op_qemu_ld_i32, { "r", "l" } },
>> +    { INDEX_op_qemu_ldlink_i32, { "r", "l" } },
>>      { INDEX_op_qemu_ld_i64, { "r", "r", "l" } },
>>      { INDEX_op_qemu_st_i32, { "s", "s" } },
>> +    { INDEX_op_qemu_stcond_i32, { "r", "s", "s" } },
>>      { INDEX_op_qemu_st_i64, { "s", "s", "s" } },
>>  #else
>>      { INDEX_op_qemu_ld_i32, { "r", "l", "l" } },
>
> --
> Alex Bennée

* Re: [Qemu-devel] [RFC v3 06/13] target-i386: translate: implement qemu_ldlink and qemu_stcond ops
  2015-07-17 12:56   ` Alex Bennée
@ 2015-07-17 13:27     ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-17 13:27 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

On Fri, Jul 17, 2015 at 2:56 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Implement strex and ldrex instructions relying on TCG's qemu_ldlink and
>> qemu_stcond.  For the time being only 32bit configurations are supported.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  tcg/i386/tcg-target.c | 136 ++++++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 114 insertions(+), 22 deletions(-)
>>
>> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
>> index 0d7c99c..d8250a9 100644
>> --- a/tcg/i386/tcg-target.c
>> +++ b/tcg/i386/tcg-target.c
>> @@ -1141,6 +1141,17 @@ static void * const qemu_ld_helpers[16] = {
>>      [MO_BEQ]  = helper_be_ldq_mmu,
>>  };
>>
>> +/* LoadLink helpers, only unsigned. Use the macro below to access them. */
>> +static void * const qemu_ldex_helpers[16] = {
>> +    [MO_LEUL] = helper_le_ldlinkul_mmu,
>> +};
>> +
>> +#define LDEX_HELPER(mem_op)                                             \
>> +({                                                                      \
>> +    assert(mem_op & MO_EXCL);                                           \
>> +    qemu_ldex_helpers[((int)mem_op - MO_EXCL)];                         \
>> +})
>> +
>>  /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
>>   *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
>>   */
>> @@ -1154,6 +1165,17 @@ static void * const qemu_st_helpers[16] = {
>>      [MO_BEQ]  = helper_be_stq_mmu,
>>  };
>>
>> +/* StoreConditional helpers. Use the macro below to access them. */
>> +static void * const qemu_stex_helpers[16] = {
>> +    [MO_LEUL] = helper_le_stcondl_mmu,
>> +};
>> +
>> +#define STEX_HELPER(mem_op)                                             \
>> +({                                                                      \
>> +    assert(mem_op & MO_EXCL);                                           \
>> +    qemu_stex_helpers[(int)mem_op - MO_EXCL];                           \
>> +})
>> +
>
> Same comments as for target-arm.
>
> Do we need to be protecting backends with HAS_LDST_EXCL defines or some
> such macro hackery? What currently happens if you use the new TCG ops
> when the backend doesn't support them? Is supporting all backends a
> prerequisite for the series?

I think the ideal approach would be to have all the backends implement
the slow path for atomic instructions, so that a HAS_LDST_EXCL macro
would not be needed. A frontend could then choose whether or not to rely
on the slow path.
So yes, ideally it's a prerequisite.
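
If a transitional guard were really needed in the meantime, it could look
something like this in the frontend (sketch only, the macro name is
hypothetical):

    #ifdef HAS_LDST_EXCL
        gen_load_exclusive_multi(s, rd, 15, addr, 2);
    #else
        gen_load_exclusive(s, rd, 15, addr, 2);  /* legacy helper-based path */
    #endif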

Regards,
alvise

>
>>  /* Perform the TLB load and compare.
>>
>>     Inputs:
>> @@ -1249,6 +1271,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
>>   * for a load or store, so that we can later generate the correct helper code
>>   */
>>  static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
>> +                                TCGReg llsc_success,
>>                                  TCGReg datalo, TCGReg datahi,
>>                                  TCGReg addrlo, TCGReg addrhi,
>>                                  tcg_insn_unit *raddr,
>> @@ -1257,6 +1280,7 @@ static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
>>      TCGLabelQemuLdst *label = new_ldst_label(s);
>>
>>      label->is_ld = is_ld;
>> +    label->llsc_success = llsc_success;
>>      label->oi = oi;
>>      label->datalo_reg = datalo;
>>      label->datahi_reg = datahi;
>> @@ -1311,7 +1335,11 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
>>                       (uintptr_t)l->raddr);
>>      }
>>
>> -    tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
>> +    if (opc & MO_EXCL) {
>> +        tcg_out_call(s, LDEX_HELPER(opc));
>> +    } else {
>> +        tcg_out_call(s, qemu_ld_helpers[opc & ~MO_SIGN]);
>> +    }
>>
>>      data_reg = l->datalo_reg;
>>      switch (opc & MO_SSIZE) {
>> @@ -1415,9 +1443,16 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
>>          }
>>      }
>>
>> -    /* "Tail call" to the helper, with the return address back inline.  */
>> -    tcg_out_push(s, retaddr);
>> -    tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
>> +    if (opc & MO_EXCL) {
>> +        tcg_out_call(s, STEX_HELPER(opc));
>> +        /* Save the output of the StoreConditional */
>> +        tcg_out_mov(s, TCG_TYPE_I32, l->llsc_success, TCG_REG_EAX);
>> +        tcg_out_jmp(s, l->raddr);
>> +    } else {
>> +        /* "Tail call" to the helper, with the return address back inline.  */
>> +        tcg_out_push(s, retaddr);
>> +        tcg_out_jmp(s, qemu_st_helpers[opc]);
>> +    }
>>  }
>>  #elif defined(__x86_64__) && defined(__linux__)
>>  # include <asm/prctl.h>
>> @@ -1530,7 +1565,8 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
>>  /* XXX: qemu_ld and qemu_st could be modified to clobber only EDX and
>>     EAX. It will be useful once fixed registers globals are less
>>     common. */
>> -static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
>> +static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64,
>> +                            bool isLoadLink)
>>  {
>>      TCGReg datalo, datahi, addrlo;
>>      TCGReg addrhi __attribute__((unused));
>> @@ -1553,14 +1589,34 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
>>      mem_index = get_mmuidx(oi);
>>      s_bits = opc & MO_SIZE;
>>
>> -    tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
>> -                     label_ptr, offsetof(CPUTLBEntry, addr_read));
>> +    if (isLoadLink) {
>> +        TCGType t = ((TCG_TARGET_REG_BITS == 64) && (TARGET_LONG_BITS == 64)) ?
>> +                                                   TCG_TYPE_I64 : TCG_TYPE_I32;
>> +        /* The JMP address will be patched afterwards,
>> +         * in tcg_out_qemu_ld_slow_path (two times when
>> +         * TARGET_LONG_BITS > TCG_TARGET_REG_BITS). */
>> +        tcg_out_mov(s, t, TCG_REG_L1, addrlo);
>> +
>> +        if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
>> +            /* Store the second part of the address. */
>> +            tcg_out_mov(s, t, TCG_REG_L0, addrhi);
>> +            /* We add 4 to include the jmp that follows. */
>> +            label_ptr[1] = s->code_ptr + 4;
>> +        }
>>
>> -    /* TLB Hit.  */
>> -    tcg_out_qemu_ld_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
>> +        tcg_out_opc(s, OPC_JMP_long, 0, 0, 0);
>> +        label_ptr[0] = s->code_ptr;
>> +        s->code_ptr += 4;
>> +    } else {
>> +        tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
>> +                         label_ptr, offsetof(CPUTLBEntry, addr_read));
>> +
>> +        /* TLB Hit.  */
>> +        tcg_out_qemu_ld_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
>> +    }
>>
>>      /* Record the current context of a load into ldst label */
>> -    add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
>> +    add_qemu_ldst_label(s, true, oi, 0, datalo, datahi, addrlo, addrhi,
>>                          s->code_ptr, label_ptr);
>>  #else
>>      {
>> @@ -1663,9 +1719,10 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
>>      }
>>  }
>>
>> -static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>> +static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
>> +                            bool isStoreCond)
>>  {
>> -    TCGReg datalo, datahi, addrlo;
>> +    TCGReg datalo, datahi, addrlo, llsc_success;
>>      TCGReg addrhi __attribute__((unused));
>>      TCGMemOpIdx oi;
>>      TCGMemOp opc;
>> @@ -1675,6 +1732,9 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>>      tcg_insn_unit *label_ptr[2];
>>  #endif
>>
>> +    /* The stcond variant has one more param */
>> +    llsc_success = (isStoreCond ? *args++ : 0);
>> +
>>      datalo = *args++;
>>      datahi = (TCG_TARGET_REG_BITS == 32 && is64 ? *args++ : 0);
>>      addrlo = *args++;
>> @@ -1686,15 +1746,35 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
>>      mem_index = get_mmuidx(oi);
>>      s_bits = opc & MO_SIZE;
>>
>> -    tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
>> -                     label_ptr, offsetof(CPUTLBEntry, addr_write));
>> +    if (isStoreCond) {
>> +        TCGType t = ((TCG_TARGET_REG_BITS == 64) && (TARGET_LONG_BITS == 64)) ?
>> +                                                   TCG_TYPE_I64 : TCG_TYPE_I32;
>> +        /* The JMP address will be filled afterwards,
>> +         * in tcg_out_qemu_ld_slow_path (two times when
>> +         * TARGET_LONG_BITS > TCG_TARGET_REG_BITS). */
>> +        tcg_out_mov(s, t, TCG_REG_L1, addrlo);
>> +
>> +        if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
>> +            /* Store the second part of the address. */
>> +            tcg_out_mov(s, t, TCG_REG_L0, addrhi);
>> +            /* We add 4 to include the jmp that follows. */
>> +            label_ptr[1] = s->code_ptr + 4;
>> +        }
>>
>> -    /* TLB Hit.  */
>> -    tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
>> +        tcg_out_opc(s, OPC_JMP_long, 0, 0, 0);
>> +        label_ptr[0] = s->code_ptr;
>> +        s->code_ptr += 4;
>> +    } else {
>> +        tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
>> +                         label_ptr, offsetof(CPUTLBEntry, addr_write));
>> +
>> +        /* TLB Hit.  */
>> +        tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
>> +    }
>>
>>      /* Record the current context of a store into ldst label */
>> -    add_qemu_ldst_label(s, false, oi, datalo, datahi, addrlo, addrhi,
>> -                        s->code_ptr, label_ptr);
>> +    add_qemu_ldst_label(s, false, oi, llsc_success, datalo, datahi, addrlo,
>> +                        addrhi, s->code_ptr, label_ptr);
>>  #else
>>      {
>>          int32_t offset = GUEST_BASE;
>> @@ -1955,16 +2035,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>>          break;
>>
>>      case INDEX_op_qemu_ld_i32:
>> -        tcg_out_qemu_ld(s, args, 0);
>> +        tcg_out_qemu_ld(s, args, 0, 0);
>> +        break;
>> +    case INDEX_op_qemu_ldlink_i32:
>> +        tcg_out_qemu_ld(s, args, 0, 1);
>>          break;
>>      case INDEX_op_qemu_ld_i64:
>> -        tcg_out_qemu_ld(s, args, 1);
>> +        tcg_out_qemu_ld(s, args, 1, 0);
>>          break;
>>      case INDEX_op_qemu_st_i32:
>> -        tcg_out_qemu_st(s, args, 0);
>> +        tcg_out_qemu_st(s, args, 0, 0);
>> +        break;
>> +    case INDEX_op_qemu_stcond_i32:
>> +        tcg_out_qemu_st(s, args, 0, 1);
>>          break;
>>      case INDEX_op_qemu_st_i64:
>> -        tcg_out_qemu_st(s, args, 1);
>> +        tcg_out_qemu_st(s, args, 1, 0);
>>          break;
>>
>>      OP_32_64(mulu2):
>> @@ -2186,17 +2272,23 @@ static const TCGTargetOpDef x86_op_defs[] = {
>>
>>  #if TCG_TARGET_REG_BITS == 64
>>      { INDEX_op_qemu_ld_i32, { "r", "L" } },
>> +    { INDEX_op_qemu_ldlink_i32, { "r", "L" } },
>>      { INDEX_op_qemu_st_i32, { "L", "L" } },
>> +    { INDEX_op_qemu_stcond_i32, { "r", "L", "L" } },
>>      { INDEX_op_qemu_ld_i64, { "r", "L" } },
>>      { INDEX_op_qemu_st_i64, { "L", "L" } },
>>  #elif TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
>>      { INDEX_op_qemu_ld_i32, { "r", "L" } },
>> +    { INDEX_op_qemu_ldlink_i32, { "r", "L" } },
>>      { INDEX_op_qemu_st_i32, { "L", "L" } },
>> +    { INDEX_op_qemu_stcond_i32, { "r", "L", "L" } },
>>      { INDEX_op_qemu_ld_i64, { "r", "r", "L" } },
>>      { INDEX_op_qemu_st_i64, { "L", "L", "L" } },
>>  #else
>>      { INDEX_op_qemu_ld_i32, { "r", "L", "L" } },
>> +    { INDEX_op_qemu_ldlink_i32, { "r", "L", "L" } },
>>      { INDEX_op_qemu_st_i32, { "L", "L", "L" } },
>> +    { INDEX_op_qemu_stcond_i32, { "r", "L", "L", "L" } },
>>      { INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
>>      { INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
>>  #endif
>
> --
> Alex Bennée

* Re: [Qemu-devel] [RFC v3 07/13] ram_addr.h: Make exclusive bitmap accessors atomic
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 07/13] ram_addr.h: Make exclusive bitmap accessors atomic Alvise Rigo
@ 2015-07-17 13:32   ` Alex Bennée
  0 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2015-07-17 13:32 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  include/exec/ram_addr.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 2766541..e51bd65 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -255,7 +255,7 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
>  /* Exclusive bitmap accessors. */
>  static inline void cpu_physical_memory_set_excl_dirty(ram_addr_t addr)
>  {
> -    set_bit(addr >> TARGET_PAGE_BITS,
> +    set_bit_atomic(addr >> TARGET_PAGE_BITS,
>              ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
>  }
>  
> @@ -267,8 +267,8 @@ static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr)
>  
>  static inline void cpu_physical_memory_clear_excl_dirty(ram_addr_t addr)
>  {
> -    clear_bit(addr >> TARGET_PAGE_BITS,
> -              ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
> +    bitmap_test_and_clear_atomic(ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
> +                                 addr >> TARGET_PAGE_BITS, 1);

Does this call for simply implementing a clear_bit_atomic() rather than
the fancy bitmap_test_and_clear_atomic? Looking at atomic.h, it seems the
primitives you need are there.
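
For what it's worth, a minimal sketch of such a clear_bit_atomic() (modelled
on the existing set_bit_atomic() in include/qemu/bitops.h, and assuming the
atomic_and() helper from qemu/atomic.h) could be as small as:

/* Sketch only, not from the series: atomically clear one bit, mirroring
 * set_bit_atomic() but using atomic_and() instead of atomic_or(). */
static inline void clear_bit_atomic(long nr, unsigned long *addr)
{
    unsigned long mask = BIT_MASK(nr);
    unsigned long *p = addr + BIT_WORD(nr);

    atomic_and(p, ~mask);
}

The accessor above would then simply call
clear_bit_atomic(addr >> TARGET_PAGE_BITS,
                 ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE])
instead of the test-and-clear variant.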

>  }
>  
>  #endif

-- 
Alex Bennée

* Re: [Qemu-devel] [RFC v3 08/13] exec.c: introduce a simple rendezvous support
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 08/13] exec.c: introduce a simple rendezvous support Alvise Rigo
@ 2015-07-17 13:45   ` Alex Bennée
  2015-07-17 13:54     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2015-07-17 13:45 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> When a vCPU is about to set a memory page as exclusive, it needs to wait
> that all the running vCPUs finish to execute the current TB and to be aware
> of the exact moment when that happens. For this, add a simple rendezvous
> mechanism that will be used in softmmu_llsc_template.h to implement the
> ldlink operation.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cpus.c            |  5 +++++
>  exec.c            | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  include/qom/cpu.h | 16 ++++++++++++++++
>  3 files changed, 66 insertions(+)
>
> diff --git a/cpus.c b/cpus.c
> index aee445a..f4d938e 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1423,6 +1423,11 @@ static int tcg_cpu_exec(CPUArchState *env)
>      qemu_mutex_unlock_iothread();
>      ret = cpu_exec(env);
>      cpu->tcg_executing = 0;
> +
> +    if (unlikely(cpu->pending_rdv)) {
> +        cpu_exit_do_rendezvous(cpu);
> +    }
> +

I'll ignore this stuff for now as I assume we can all use the async work
patch of Fred's?


>      qemu_mutex_lock_iothread();
>  #ifdef CONFIG_PROFILER
>      tcg_time += profile_getclock() - ti;
> diff --git a/exec.c b/exec.c
> index 964e922..51958ed 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -746,6 +746,51 @@ void cpu_breakpoint_remove_all(CPUState *cpu, int mask)
>      }
>  }
>  
> +/* Rendezvous implementation.
> + * The corresponding definitions are in include/qom/cpu.h. */
> +CpuExitRendezvous cpu_exit_rendezvous;
> +inline void cpu_exit_init_rendezvous(void)
> +{
> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
> +
> +    rdv->attendees = 0;
> +}
> +
> +inline void cpu_exit_rendezvous_add_attendee(CPUState *cpu)
> +{
> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
> +
> +    if (!cpu->pending_rdv) {
> +        cpu->pending_rdv = 1;
> +        atomic_inc(&rdv->attendees);
> +    }
> +}
> +
> +void cpu_exit_do_rendezvous(CPUState *cpu)
> +{
> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
> +
> +    atomic_dec(&rdv->attendees);
> +
> +    cpu->pending_rdv = 0;
> +}
> +
> +void cpu_exit_rendezvous_wait_others(CPUState *cpu)
> +{
> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
> +
> +    while (rdv->attendees) {
> +        g_usleep(TCG_RDV_POLLING_PERIOD);
> +    }
> +}
> +
> +void cpu_exit_rendezvous_release(void)
> +{
> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
> +
> +    rdv->attendees = 0;
> +}
> +
>  /* enable or disable single step mode. EXCP_DEBUG is returned by the
>     CPU loop after each instruction */
>  void cpu_single_step(CPUState *cpu, int enabled)
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 8f3fe56..8d121b3 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -201,6 +201,20 @@ typedef struct CPUWatchpoint {
>      QTAILQ_ENTRY(CPUWatchpoint) entry;
>  } CPUWatchpoint;
>  
> +/* Rendezvous support */
> +#define TCG_RDV_POLLING_PERIOD 10
> +typedef struct CpuExitRendezvous {
> +    volatile int attendees;
> +    QemuMutex lock;
> +} CpuExitRendezvous;
> +
> +extern CpuExitRendezvous cpu_exit_rendezvous;
> +void cpu_exit_init_rendezvous(void);
> +void cpu_exit_rendezvous_add_attendee(CPUState *cpu);
> +void cpu_exit_do_rendezvous(CPUState *cpu);
> +void cpu_exit_rendezvous_wait_others(CPUState *cpu);
> +void cpu_exit_rendezvous_release(void);
> +
>  struct KVMState;
>  struct kvm_run;
>  
> @@ -291,6 +305,8 @@ struct CPUState {
>  
>      void *opaque;
>  
> +    volatile int pending_rdv;
> +

I will however echo the "hmmm" on Fred's patch about the optimistic use
of volatile here. As Peter mentioned it is a red flag and I would prefer
explicit memory consistency behaviour used and documented.


>      /* In order to avoid passing too many arguments to the MMIO helpers,
>       * we store some rarely used information in the CPU context.
>       */

-- 
Alex Bennée

* Re: [Qemu-devel] [RFC v3 08/13] exec.c: introduce a simple rendezvous support
  2015-07-17 13:45   ` Alex Bennée
@ 2015-07-17 13:54     ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-17 13:54 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

On Fri, Jul 17, 2015 at 3:45 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> When a vCPU is about to set a memory page as exclusive, it needs to wait
>> that all the running vCPUs finish to execute the current TB and to be aware
>> of the exact moment when that happens. For this, add a simple rendezvous
>> mechanism that will be used in softmmu_llsc_template.h to implement the
>> ldlink operation.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  cpus.c            |  5 +++++
>>  exec.c            | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  include/qom/cpu.h | 16 ++++++++++++++++
>>  3 files changed, 66 insertions(+)
>>
>> diff --git a/cpus.c b/cpus.c
>> index aee445a..f4d938e 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -1423,6 +1423,11 @@ static int tcg_cpu_exec(CPUArchState *env)
>>      qemu_mutex_unlock_iothread();
>>      ret = cpu_exec(env);
>>      cpu->tcg_executing = 0;
>> +
>> +    if (unlikely(cpu->pending_rdv)) {
>> +        cpu_exit_do_rendezvous(cpu);
>> +    }
>> +
>
> I'll ignore this stuff for now as I assume we can all use the async work
> patch of Fred's?

Yes, it will more likely be based on the plain async_run_on_cpu.
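
For illustration only, a rough sketch of what that could look like, assuming
the plain async_run_on_cpu(cpu, func, data) interface and with the helper
names invented for this example:

/* Hypothetical sketch: schedule a TLB flush on every other vCPU through
 * async_run_on_cpu() instead of the ad-hoc rendezvous machinery. */
static void tlb_flush_async(void *data)
{
    CPUState *cpu = data;

    tlb_flush(cpu, 1);
}

static void queue_tlb_flush_on_other_cpus(CPUState *current)
{
    CPUState *cpu;

    CPU_FOREACH(cpu) {
        if (cpu != current) {
            async_run_on_cpu(cpu, tlb_flush_async, cpu);
        }
    }
}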

>
>
>>      qemu_mutex_lock_iothread();
>>  #ifdef CONFIG_PROFILER
>>      tcg_time += profile_getclock() - ti;
>> diff --git a/exec.c b/exec.c
>> index 964e922..51958ed 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -746,6 +746,51 @@ void cpu_breakpoint_remove_all(CPUState *cpu, int mask)
>>      }
>>  }
>>
>> +/* Rendezvous implementation.
>> + * The corresponding definitions are in include/qom/cpu.h. */
>> +CpuExitRendezvous cpu_exit_rendezvous;
>> +inline void cpu_exit_init_rendezvous(void)
>> +{
>> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
>> +
>> +    rdv->attendees = 0;
>> +}
>> +
>> +inline void cpu_exit_rendezvous_add_attendee(CPUState *cpu)
>> +{
>> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
>> +
>> +    if (!cpu->pending_rdv) {
>> +        cpu->pending_rdv = 1;
>> +        atomic_inc(&rdv->attendees);
>> +    }
>> +}
>> +
>> +void cpu_exit_do_rendezvous(CPUState *cpu)
>> +{
>> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
>> +
>> +    atomic_dec(&rdv->attendees);
>> +
>> +    cpu->pending_rdv = 0;
>> +}
>> +
>> +void cpu_exit_rendezvous_wait_others(CPUState *cpu)
>> +{
>> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
>> +
>> +    while (rdv->attendees) {
>> +        g_usleep(TCG_RDV_POLLING_PERIOD);
>> +    }
>> +}
>> +
>> +void cpu_exit_rendezvous_release(void)
>> +{
>> +    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;
>> +
>> +    rdv->attendees = 0;
>> +}
>> +
>>  /* enable or disable single step mode. EXCP_DEBUG is returned by the
>>     CPU loop after each instruction */
>>  void cpu_single_step(CPUState *cpu, int enabled)
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 8f3fe56..8d121b3 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -201,6 +201,20 @@ typedef struct CPUWatchpoint {
>>      QTAILQ_ENTRY(CPUWatchpoint) entry;
>>  } CPUWatchpoint;
>>
>> +/* Rendezvous support */
>> +#define TCG_RDV_POLLING_PERIOD 10
>> +typedef struct CpuExitRendezvous {
>> +    volatile int attendees;
>> +    QemuMutex lock;
>> +} CpuExitRendezvous;
>> +
>> +extern CpuExitRendezvous cpu_exit_rendezvous;
>> +void cpu_exit_init_rendezvous(void);
>> +void cpu_exit_rendezvous_add_attendee(CPUState *cpu);
>> +void cpu_exit_do_rendezvous(CPUState *cpu);
>> +void cpu_exit_rendezvous_wait_others(CPUState *cpu);
>> +void cpu_exit_rendezvous_release(void);
>> +
>>  struct KVMState;
>>  struct kvm_run;
>>
>> @@ -291,6 +305,8 @@ struct CPUState {
>>
>>      void *opaque;
>>
>> +    volatile int pending_rdv;
>> +
>
> I will however echo the "hmmm" on Fred's patch about the optimistic use
> of volatile here. As Peter mentioned it is a red flag and I would prefer
> explicit memory consistency behaviour used and documented.

In my local branch I'm now using atomic_set/atomic_read, basically
what is defined in qemu/atomic.h. Is this enough?
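
For reference, a sketch of the polling/release pair with those accessors,
keeping the CpuExitRendezvous layout from the patch minus the volatile
qualifier (whether the relaxed ordering is sufficient is exactly the open
question here):

void cpu_exit_rendezvous_wait_others(CPUState *cpu)
{
    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;

    /* atomic_read() gives a tear-free read; ordering stays relaxed. */
    while (atomic_read(&rdv->attendees)) {
        g_usleep(TCG_RDV_POLLING_PERIOD);
    }
}

void cpu_exit_rendezvous_release(void)
{
    CpuExitRendezvous *rdv = &cpu_exit_rendezvous;

    atomic_set(&rdv->attendees, 0);
}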

Regards,
alvise

>
>
>>      /* In order to avoid passing too many arguments to the MMIO helpers,
>>       * we store some rarely used information in the CPU context.
>>       */
>
> --
> Alex Bennée

* Re: [Qemu-devel] [RFC v3 12/13] softmmu_llsc_template.h: move to multithreading
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 12/13] softmmu_llsc_template.h: move to multithreading Alvise Rigo
@ 2015-07-17 15:27   ` Alex Bennée
  2015-07-17 15:31     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2015-07-17 15:27 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Update the TCG LL/SC instructions to work in multi-threading.
>
> The basic idea remains untouched, but the whole mechanism is improved to
> make use of the callback support to query TLB flush requests and the
> rendezvous callback to synchronize all the currently running vCPUs.
>
> In essence, if a vCPU wants to LL to a page which is not already set as
> EXCL, it will arrange a rendezvous with all the vCPUs that are executing
> a TB and query a TLB flush for *all* the vCPUs.
> Doing so, we make sure that:
> - the running vCPUs do not touch the EXCL page while the requesting vCPU
>   is setting the transaction to EXCL of the page
> - all the vCPUs will have the EXCL flag in the TLB entry for that
>   specific page *before* entering the next TB
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cputlb.c                |  2 +
>  include/exec/cpu-defs.h |  4 ++
>  softmmu_llsc_template.h | 97 ++++++++++++++++++++++++++++++++-----------------
>  3 files changed, 69 insertions(+), 34 deletions(-)
>
> diff --git a/cputlb.c b/cputlb.c
> index 66df41a..0566e0f 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -30,6 +30,8 @@
>  #include "exec/ram_addr.h"
>  #include "tcg/tcg.h"
>  
> +#include "sysemu/cpus.h"
> +
>  void qemu_mutex_lock_iothread(void);
>  void qemu_mutex_unlock_iothread(void);
>  
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index c73a75f..40742b3 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -169,5 +169,9 @@ typedef struct CPUIOTLBEntry {
>      /* Used for atomic instruction translation. */                      \
>      bool ll_sc_context;                                                 \
>      hwaddr excl_protected_hwaddr;                                       \
> +    /* Used to carry the stcond result and also as a flag to flag a
> +     * normal store access made by a stcond. */                         \
> +    int excl_succeeded;                                                 \
> +
>  
>  #endif
> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
> index 81e9d8e..4105e72 100644
> --- a/softmmu_llsc_template.h
> +++ b/softmmu_llsc_template.h
> @@ -54,7 +54,21 @@
>                   (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
>  })                                                                           \
>  
> -#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
> +#define is_read_tlb_entry_set(env, page, index)                              \
> +({                                                                           \
> +    (addr & TARGET_PAGE_MASK)                                                \
> +         == ((env->tlb_table[mmu_idx][index].addr_read) &                    \
> +                 (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
> +})                                                                           \
> +
> +/* Whenever a SC operation fails, we add a small delay to reduce the
> + * concurrency among the atomic instruction emulation code. Without this delay,
> + * in very congested situation where plain stores make all the pending LLs
> + * fail, the code could reach a stalling situation in which all the SCs happen
> + * to fail.
> + * TODO: make the delay dynamic according with the SC failing rate.
> + * */
> +#define TCG_ATOMIC_INSN_EMUL_DELAY 100

I'd be tempted to split out this sort of chicanery into a separate patch. 

>  
>  WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
>                                  TCGMemOpIdx oi, uintptr_t retaddr)
> @@ -65,35 +79,58 @@ WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
>      hwaddr hw_addr;
>      unsigned mmu_idx = get_mmuidx(oi);
>  
> -    /* Use the proper load helper from cpu_ldst.h */
> -    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
> -
> -    /* The last legacy access ensures that the TLB and IOTLB entry for 'addr'
> -     * have been created. */
>      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +    if (!is_read_tlb_entry_set(env, addr, index)) {
> +        tlb_fill(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
> +    }
>  
>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>  
>      /* Set the exclusive-protected hwaddr. */
> -    env->excl_protected_hwaddr = hw_addr;
> -    env->ll_sc_context = true;
> +    qemu_mutex_lock(&tcg_excl_access_lock);
> +    if (cpu_physical_memory_excl_is_dirty(hw_addr) && !exit_flush_request) {
> +        exit_flush_request = 1;
>  
> -    /* No need to mask hw_addr with TARGET_PAGE_MASK since
> -     * cpu_physical_memory_excl_is_dirty() will take care of that. */
> -    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
> -        cpu_physical_memory_clear_excl_dirty(hw_addr);
> +        qemu_mutex_unlock(&tcg_excl_access_lock);
> +
> +        cpu_exit_init_rendezvous();
>  
> -        /* Invalidate the TLB entry for the other processors. The next TLB
> -         * entries for this page will have the TLB_EXCL flag set. */
>          CPU_FOREACH(cpu) {
> -            if (cpu != current_cpu) {
> -                tlb_flush(cpu, 1);
> +            if ((cpu->thread_id != qemu_get_thread_id())) {
> +                if (!cpu->pending_tlb_flush) {
> +                    /* Flush the TLB cache before executing the next TB. */
> +                    cpu->pending_tlb_flush = 1;
> +                    cpu_exit_callback_add(cpu, cpu_exit_tlb_flush_all_cb, NULL);
> +                }
> +                if (cpu->tcg_executing) {
> +                    /* We want to wait all the vCPUs that are running in this
> +                     * exact moment.
> +                     * Add a callback to be executed as soon as the vCPU exits
> +                     * from the current TB. Force it to exit. */
> +                    cpu_exit_rendezvous_add_attendee(cpu);
> +                    qemu_cpu_kick_thread(cpu);
> +                }
>              }
>          }
> +
> +        cpu_exit_rendezvous_wait_others(ENV_GET_CPU(env));
> +
> +        exit_flush_request = 0;
> +
> +        qemu_mutex_lock(&tcg_excl_access_lock);
> +        cpu_physical_memory_clear_excl_dirty(hw_addr);
>      }
>  
> +    env->ll_sc_context = true;
> +
> +    /* Use the proper load helper from cpu_ldst.h */
> +    env->excl_protected_hwaddr = hw_addr;
> +    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
> +
> +    qemu_mutex_unlock(&tcg_excl_access_lock);
> +
>      /* For this vCPU, just update the TLB entry, no need to flush. */
>      env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>  
> @@ -106,7 +143,6 @@ WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
>  {
>      WORD_TYPE ret;
>      int index;
> -    hwaddr hw_addr;
>      unsigned mmu_idx = get_mmuidx(oi);
>  
>      /* If the TLB entry is not the right one, create it. */
> @@ -115,29 +151,22 @@ WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
>          tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
>      }
>  
> -    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
> -     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
> -    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
> -
> -    if (!env->ll_sc_context) {
> -        /* No LoakLink has been set, the StoreCond has to fail. */
> -        return 1;
> -    }
> -
>      env->ll_sc_context = 0;
>  
> -    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
> -        /* Another vCPU has accessed the memory after the LoadLink. */
> -        ret = 1;
> -    } else {
> -        helper_st_legacy(env, addr, val, mmu_idx, retaddr);
> +    /* We set it preventively to true to distinguish the following legacy
> +     * access as one made by the store conditional wrapper. If the store
> +     * conditional does not succeed, the value will be set to 0.*/
> +    env->excl_succeeded = 1;
> +    helper_st_legacy(env, addr, val, mmu_idx, retaddr);
>  
> -        /* The StoreConditional succeeded */
> +    if (env->excl_succeeded) {
> +        env->excl_succeeded = 0;
>          ret = 0;
> +    } else {
> +        g_usleep(TCG_ATOMIC_INSN_EMUL_DELAY);
> +        ret = 1;
>      }
>  
> -    env->tlb_table[mmu_idx][index].addr_write &= ~TLB_EXCL;
> -    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
>      /* It's likely that the page will be used again for exclusive accesses,
>       * for this reason we don't flush any TLB cache at the price of some
>       * additional slow paths and we don't set the page bit as dirty.

-- 
Alex Bennée

* Re: [Qemu-devel] [RFC v3 12/13] softmmu_llsc_template.h: move to multithreading
  2015-07-17 15:27   ` Alex Bennée
@ 2015-07-17 15:31     ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-17 15:31 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

On Fri, Jul 17, 2015 at 5:27 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Update the TCG LL/SC instructions to work in multi-threading.
>>
>> The basic idea remains untouched, but the whole mechanism is improved to
>> make use of the callback support to query TLB flush requests and the
>> rendezvous callback to synchronize all the currently running vCPUs.
>>
>> In essence, if a vCPU wants to LL to a page which is not already set as
>> EXCL, it will arrange a rendezvous with all the vCPUs that are executing
>> a TB and query a TLB flush for *all* the vCPUs.
>> Doing so, we make sure that:
>> - the running vCPUs do not touch the EXCL page while the requesting vCPU
>>   is setting the transaction to EXCL of the page
>> - all the vCPUs will have the EXCL flag in the TLB entry for that
>>   specific page *before* entering the next TB
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  cputlb.c                |  2 +
>>  include/exec/cpu-defs.h |  4 ++
>>  softmmu_llsc_template.h | 97 ++++++++++++++++++++++++++++++++-----------------
>>  3 files changed, 69 insertions(+), 34 deletions(-)
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index 66df41a..0566e0f 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -30,6 +30,8 @@
>>  #include "exec/ram_addr.h"
>>  #include "tcg/tcg.h"
>>
>> +#include "sysemu/cpus.h"
>> +
>>  void qemu_mutex_lock_iothread(void);
>>  void qemu_mutex_unlock_iothread(void);
>>
>> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
>> index c73a75f..40742b3 100644
>> --- a/include/exec/cpu-defs.h
>> +++ b/include/exec/cpu-defs.h
>> @@ -169,5 +169,9 @@ typedef struct CPUIOTLBEntry {
>>      /* Used for atomic instruction translation. */                      \
>>      bool ll_sc_context;                                                 \
>>      hwaddr excl_protected_hwaddr;                                       \
>> +    /* Used to carry the stcond result and also as a flag to flag a
>> +     * normal store access made by a stcond. */                         \
>> +    int excl_succeeded;                                                 \
>> +
>>
>>  #endif
>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>> index 81e9d8e..4105e72 100644
>> --- a/softmmu_llsc_template.h
>> +++ b/softmmu_llsc_template.h
>> @@ -54,7 +54,21 @@
>>                   (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
>>  })                                                                           \
>>
>> -#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>> +#define is_read_tlb_entry_set(env, page, index)                              \
>> +({                                                                           \
>> +    (addr & TARGET_PAGE_MASK)                                                \
>> +         == ((env->tlb_table[mmu_idx][index].addr_read) &                    \
>> +                 (TARGET_PAGE_MASK | TLB_INVALID_MASK));                     \
>> +})                                                                           \
>> +
>> +/* Whenever a SC operation fails, we add a small delay to reduce the
>> + * concurrency among the atomic instruction emulation code. Without this delay,
>> + * in very congested situation where plain stores make all the pending LLs
>> + * fail, the code could reach a stalling situation in which all the SCs happen
>> + * to fail.
>> + * TODO: make the delay dynamic according with the SC failing rate.
>> + * */
>> +#define TCG_ATOMIC_INSN_EMUL_DELAY 100
>
> I'd be tempted to split out this sort of chicanery into a separate patch.

OK, I think it's a good idea since it's not strictly required.

Regards,
alvise

>
>>
>>  WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
>>                                  TCGMemOpIdx oi, uintptr_t retaddr)
>> @@ -65,35 +79,58 @@ WORD_TYPE helper_le_ldlink_name(CPUArchState *env, target_ulong addr,
>>      hwaddr hw_addr;
>>      unsigned mmu_idx = get_mmuidx(oi);
>>
>> -    /* Use the proper load helper from cpu_ldst.h */
>> -    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
>> -
>> -    /* The last legacy access ensures that the TLB and IOTLB entry for 'addr'
>> -     * have been created. */
>>      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>> +    if (!is_read_tlb_entry_set(env, addr, index)) {
>> +        tlb_fill(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>> +    }
>>
>>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>>
>>      /* Set the exclusive-protected hwaddr. */
>> -    env->excl_protected_hwaddr = hw_addr;
>> -    env->ll_sc_context = true;
>> +    qemu_mutex_lock(&tcg_excl_access_lock);
>> +    if (cpu_physical_memory_excl_is_dirty(hw_addr) && !exit_flush_request) {
>> +        exit_flush_request = 1;
>>
>> -    /* No need to mask hw_addr with TARGET_PAGE_MASK since
>> -     * cpu_physical_memory_excl_is_dirty() will take care of that. */
>> -    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
>> -        cpu_physical_memory_clear_excl_dirty(hw_addr);
>> +        qemu_mutex_unlock(&tcg_excl_access_lock);
>> +
>> +        cpu_exit_init_rendezvous();
>>
>> -        /* Invalidate the TLB entry for the other processors. The next TLB
>> -         * entries for this page will have the TLB_EXCL flag set. */
>>          CPU_FOREACH(cpu) {
>> -            if (cpu != current_cpu) {
>> -                tlb_flush(cpu, 1);
>> +            if ((cpu->thread_id != qemu_get_thread_id())) {
>> +                if (!cpu->pending_tlb_flush) {
>> +                    /* Flush the TLB cache before executing the next TB. */
>> +                    cpu->pending_tlb_flush = 1;
>> +                    cpu_exit_callback_add(cpu, cpu_exit_tlb_flush_all_cb, NULL);
>> +                }
>> +                if (cpu->tcg_executing) {
>> +                    /* We want to wait all the vCPUs that are running in this
>> +                     * exact moment.
>> +                     * Add a callback to be executed as soon as the vCPU exits
>> +                     * from the current TB. Force it to exit. */
>> +                    cpu_exit_rendezvous_add_attendee(cpu);
>> +                    qemu_cpu_kick_thread(cpu);
>> +                }
>>              }
>>          }
>> +
>> +        cpu_exit_rendezvous_wait_others(ENV_GET_CPU(env));
>> +
>> +        exit_flush_request = 0;
>> +
>> +        qemu_mutex_lock(&tcg_excl_access_lock);
>> +        cpu_physical_memory_clear_excl_dirty(hw_addr);
>>      }
>>
>> +    env->ll_sc_context = true;
>> +
>> +    /* Use the proper load helper from cpu_ldst.h */
>> +    env->excl_protected_hwaddr = hw_addr;
>> +    ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
>> +
>> +    qemu_mutex_unlock(&tcg_excl_access_lock);
>> +
>>      /* For this vCPU, just update the TLB entry, no need to flush. */
>>      env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>
>> @@ -106,7 +143,6 @@ WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
>>  {
>>      WORD_TYPE ret;
>>      int index;
>> -    hwaddr hw_addr;
>>      unsigned mmu_idx = get_mmuidx(oi);
>>
>>      /* If the TLB entry is not the right one, create it. */
>> @@ -115,29 +151,22 @@ WORD_TYPE helper_le_stcond_name(CPUArchState *env, target_ulong addr,
>>          tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
>>      }
>>
>> -    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>> -     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>> -    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>> -
>> -    if (!env->ll_sc_context) {
>> -        /* No LoakLink has been set, the StoreCond has to fail. */
>> -        return 1;
>> -    }
>> -
>>      env->ll_sc_context = 0;
>>
>> -    if (cpu_physical_memory_excl_is_dirty(hw_addr)) {
>> -        /* Another vCPU has accessed the memory after the LoadLink. */
>> -        ret = 1;
>> -    } else {
>> -        helper_st_legacy(env, addr, val, mmu_idx, retaddr);
>> +    /* We set it preventively to true to distinguish the following legacy
>> +     * access as one made by the store conditional wrapper. If the store
>> +     * conditional does not succeed, the value will be set to 0.*/
>> +    env->excl_succeeded = 1;
>> +    helper_st_legacy(env, addr, val, mmu_idx, retaddr);
>>
>> -        /* The StoreConditional succeeded */
>> +    if (env->excl_succeeded) {
>> +        env->excl_succeeded = 0;
>>          ret = 0;
>> +    } else {
>> +        g_usleep(TCG_ATOMIC_INSN_EMUL_DELAY);
>> +        ret = 1;
>>      }
>>
>> -    env->tlb_table[mmu_idx][index].addr_write &= ~TLB_EXCL;
>> -    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
>>      /* It's likely that the page will be used again for exclusive accesses,
>>       * for this reason we don't flush any TLB cache at the price of some
>>       * additional slow paths and we don't set the page bit as dirty.
>
> --
> Alex Bennée

* Re: [Qemu-devel] [RFC v3 13/13] softmmu_template.h: move to multithreading
  2015-07-10  8:23 ` [Qemu-devel] [RFC v3 13/13] softmmu_template.h: " Alvise Rigo
@ 2015-07-17 15:57   ` Alex Bennée
  2015-07-17 16:19     ` alvise rigo
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2015-07-17 15:57 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Exploiting the tcg_excl_access_lock, port the helper_{le,be}_st_name to
> work in real multithreading.
>
> - The macro lookup_cpus_ll_addr now uses directly the
>   env->excl_protected_addr to invalidate others' LL/SC operations
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  softmmu_template.h | 110 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 89 insertions(+), 21 deletions(-)
>
> diff --git a/softmmu_template.h b/softmmu_template.h
> index bc767f6..522454f 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -141,21 +141,24 @@
>      vidx >= 0;                                                                \
>  })
>  
> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
> +
> +/* This macro requires the caller to have the tcg_excl_access_lock lock since
> + * it modifies the excl_protected_hwaddr of a running vCPU.
> + * The macros scans all the excl_protected_hwaddr of all the vCPUs and compare
> + * them with the address the current vCPU is writing to. If there is a match,
> + * we reset the value, making the SC fail. */

It would have been nice if we had started with a comment when the
function^H^H^H^H^H macro was first introduced and then updated here.

>  #define lookup_cpus_ll_addr(addr)                                             \
>  ({                                                                            \
>      CPUState *cpu;                                                            \
>      CPUArchState *acpu;                                                       \
> -    bool hit = false;                                                         \
>                                                                                \
>      CPU_FOREACH(cpu) {                                                        \
>          acpu = (CPUArchState *)cpu->env_ptr;                                  \
>          if (cpu != current_cpu && acpu->excl_protected_hwaddr == addr) {      \
> -            hit = true;                                                       \
> -            break;                                                            \
> +            acpu->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;               \
>          }                                                                     \
>      }                                                                         \
> -                                                                              \
> -    hit;                                                                      \
>  })

My comment about using an inline function in the earlier patch stands.

>  
>  #ifndef SOFTMMU_CODE_ACCESS
> @@ -439,18 +442,52 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>               * exclusive-protected memory. */
>              hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
>  
> -            bool set_to_dirty;
> -
>              /* Two cases of invalidation: the current vCPU is writing to another
>               * vCPU's exclusive address or the vCPU that issued the LoadLink is
>               * writing to it, but not through a StoreCond. */
> -            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
> -            set_to_dirty |= env->ll_sc_context &&
> -                           (env->excl_protected_hwaddr == hw_addr);
> +            qemu_mutex_lock(&tcg_excl_access_lock);
> +
> +            /* The macro lookup_cpus_ll_addr could have reset the exclusive
> +             * address. Fail the SC in this case.
> +             * N.B.: Here excl_succeeded == 0 means that we don't come from a
> +             * store conditional.  */
> +            if (env->excl_succeeded &&
> +                        (env->excl_protected_hwaddr == EXCLUSIVE_RESET_ADDR)) {
> +                env->excl_succeeded = 0;
> +                qemu_mutex_unlock(&tcg_excl_access_lock);
> +
> +                return;
> +            }
> +
> +            lookup_cpus_ll_addr(hw_addr);

Add a comment for the side effect; also, I'm confused by the above comment given
we call lookup_cpus_ll_addr() after the comment about it tweaking excl_succeeded.

> +
> +            if (!env->excl_succeeded) {
> +                if (env->ll_sc_context &&
> +                            (env->excl_protected_hwaddr == hw_addr)) {
> +                    cpu_physical_memory_set_excl_dirty(hw_addr);
> +                }
> +            } else {
> +                if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
> +                        env->excl_protected_hwaddr != hw_addr) {
> +                    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
> +                    qemu_mutex_unlock(&tcg_excl_access_lock);
> +                    env->excl_succeeded = 0;
> +
> +                    return;
> +                }
> +            }

I'm wondering if it can be more naturally written the other way round to
aid comprehension:

if (env->excl_succeeded) {
   if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
       env->excl_protected_hwaddr != hw_addr) {
       ..do stuff..
       return
   }
} else {
   if (env->ll_sc_context &&
      (env->excl_protected_hwaddr == hw_addr)) {
      cpu_physical_memory_set_excl_dirty(hw_addr);
   }
}

Although now I'm confused as to why we push on in 3 of the 4 cases.

> +
> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
> +        #if DATA_SIZE == 1
> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
> +        #else
> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
> +        #endif
> +
> +            env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
> +            qemu_mutex_unlock(&tcg_excl_access_lock);
>  
> -            if (set_to_dirty) {
> -                cpu_physical_memory_set_excl_dirty(hw_addr);
> -            } /* the vCPU is legitimately writing to the protected address */
> +            return;
>          } else {
>              if ((addr & (DATA_SIZE - 1)) != 0) {
>                  goto do_unaligned_access;
> @@ -537,18 +574,49 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>               * exclusive-protected memory. */
>              hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
>  
> -            bool set_to_dirty;
> -
>              /* Two cases of invalidation: the current vCPU is writing to another
>               * vCPU's exclusive address or the vCPU that issued the LoadLink is
>               * writing to it, but not through a StoreCond. */
> -            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
> -            set_to_dirty |= env->ll_sc_context &&
> -                           (env->excl_protected_hwaddr == hw_addr);
> +            qemu_mutex_lock(&tcg_excl_access_lock);
> +
> +            /* The macro lookup_cpus_ll_addr could have reset the exclusive
> +             * address. Fail the SC in this case.
> +             * N.B.: Here excl_succeeded == 0 means that we don't come from a
> +             * store conditional.  */
> +            if (env->excl_succeeded &&
> +                        (env->excl_protected_hwaddr == EXCLUSIVE_RESET_ADDR)) {
> +                env->excl_succeeded = 0;
> +                qemu_mutex_unlock(&tcg_excl_access_lock);
> +
> +                return;
> +            }
> +
> +            lookup_cpus_ll_addr(hw_addr);
> +
> +            if (!env->excl_succeeded) {
> +                if (env->ll_sc_context &&
> +                            (env->excl_protected_hwaddr == hw_addr)) {
> +                    cpu_physical_memory_set_excl_dirty(hw_addr);
> +                }
> +            } else {
> +                if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
> +                        env->excl_protected_hwaddr != hw_addr) {
> +                    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
> +                    qemu_mutex_unlock(&tcg_excl_access_lock);
> +                    env->excl_succeeded = 0;
> +
> +                    return;
> +                }
> +            }

Given the amount of copy/paste between the two functions I wonder if
there is some commonality to be re-factored out here?

>  
> -            if (set_to_dirty) {
> -                cpu_physical_memory_set_excl_dirty(hw_addr);
> -            } /* the vCPU is legitimately writing to the protected address */
> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
> +
> +            glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
> +
> +            qemu_mutex_unlock(&tcg_excl_access_lock);
> +            env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
> +
> +            return;
>          } else {
>              if ((addr & (DATA_SIZE - 1)) != 0) {
>                  goto do_unaligned_access;

-- 
Alex Bennée

* Re: [Qemu-devel] [RFC v3 13/13] softmmu_template.h: move to multithreading
  2015-07-17 15:57   ` Alex Bennée
@ 2015-07-17 16:19     ` alvise rigo
  0 siblings, 0 replies; 41+ messages in thread
From: alvise rigo @ 2015-07-17 16:19 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team

On Fri, Jul 17, 2015 at 5:57 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Exploiting the tcg_excl_access_lock, port the helper_{le,be}_st_name to
>> work in real multithreading.
>>
>> - The macro lookup_cpus_ll_addr now uses directly the
>>   env->excl_protected_addr to invalidate others' LL/SC operations
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  softmmu_template.h | 110 +++++++++++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 89 insertions(+), 21 deletions(-)
>>
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index bc767f6..522454f 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -141,21 +141,24 @@
>>      vidx >= 0;                                                                \
>>  })
>>
>> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>> +
>> +/* This macro requires the caller to have the tcg_excl_access_lock lock since
>> + * it modifies the excl_protected_hwaddr of a running vCPU.
>> + * The macros scans all the excl_protected_hwaddr of all the vCPUs and compare
>> + * them with the address the current vCPU is writing to. If there is a match,
>> + * we reset the value, making the SC fail. */
>
> It would have been nice if we had started with a comment when the
> function^H^H^H^H^H macro was first introduced and then updated here.

OK. Related to this, I'm thinking of refactoring the patches so as to
drop the "move to multithreading" part, since it makes things more
confusing (even if my intent was the opposite).

>
>>  #define lookup_cpus_ll_addr(addr)                                             \
>>  ({                                                                            \
>>      CPUState *cpu;                                                            \
>>      CPUArchState *acpu;                                                       \
>> -    bool hit = false;                                                         \
>>                                                                                \
>>      CPU_FOREACH(cpu) {                                                        \
>>          acpu = (CPUArchState *)cpu->env_ptr;                                  \
>>          if (cpu != current_cpu && acpu->excl_protected_hwaddr == addr) {      \
>> -            hit = true;                                                       \
>> -            break;                                                            \
>> +            acpu->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;               \
>>          }                                                                     \
>>      }                                                                         \
>> -                                                                              \
>> -    hit;                                                                      \
>>  })
>
> My comment about using an inline function in the earlier patch stands.
>
>>
>>  #ifndef SOFTMMU_CODE_ACCESS
>> @@ -439,18 +442,52 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>               * exclusive-protected memory. */
>>              hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
>>
>> -            bool set_to_dirty;
>> -
>>              /* Two cases of invalidation: the current vCPU is writing to another
>>               * vCPU's exclusive address or the vCPU that issued the LoadLink is
>>               * writing to it, but not through a StoreCond. */
>> -            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
>> -            set_to_dirty |= env->ll_sc_context &&
>> -                           (env->excl_protected_hwaddr == hw_addr);
>> +            qemu_mutex_lock(&tcg_excl_access_lock);
>> +
>> +            /* The macro lookup_cpus_ll_addr could have reset the exclusive
>> +             * address. Fail the SC in this case.
>> +             * N.B.: Here excl_succeeded == 0 means that we don't come from a
>> +             * store conditional.  */
>> +            if (env->excl_succeeded &&
>> +                        (env->excl_protected_hwaddr == EXCLUSIVE_RESET_ADDR)) {
>> +                env->excl_succeeded = 0;
>> +                qemu_mutex_unlock(&tcg_excl_access_lock);
>> +
>> +                return;
>> +            }
>> +
>> +            lookup_cpus_ll_addr(hw_addr);
>
> Add a comment for the side effect; also, I'm confused by the above comment given
> we call lookup_cpus_ll_addr() after the comment about it tweaking excl_succeeded.

I will improve all the comments to avoid any misinterpretation. I will
also be more verbose in the cover letter about how the whole machinery
works.

>
>> +
>> +            if (!env->excl_succeeded) {
>> +                if (env->ll_sc_context &&
>> +                            (env->excl_protected_hwaddr == hw_addr)) {
>> +                    cpu_physical_memory_set_excl_dirty(hw_addr);
>> +                }
>> +            } else {
>> +                if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
>> +                        env->excl_protected_hwaddr != hw_addr) {
>> +                    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
>> +                    qemu_mutex_unlock(&tcg_excl_access_lock);
>> +                    env->excl_succeeded = 0;
>> +
>> +                    return;
>> +                }
>> +            }
>
> I'm wondering if it can be more naturally written the other way round to
> aid comprehension:
>
> if (env->excl_succeeded) {
>    if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
>        env->excl_protected_hwaddr != hw_addr) {
>        ..do stuff..
>        return
>    }
> } else {
>    if (env->ll_sc_context &&
>       (env->excl_protected_hwaddr == hw_addr)) {
>       cpu_physical_memory_set_excl_dirty(hw_addr);
>    }
> }
>
> Although now I'm confused as to why we push on in 3 of the 4 cases.

I will try to find another way to write this code, starting by
adding more comments about the corner cases that we are addressing.

>
>> +
>> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> +        #if DATA_SIZE == 1
>> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>> +        #else
>> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>> +        #endif
>> +
>> +            env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
>> +            qemu_mutex_unlock(&tcg_excl_access_lock);
>>
>> -            if (set_to_dirty) {
>> -                cpu_physical_memory_set_excl_dirty(hw_addr);
>> -            } /* the vCPU is legitimately writing to the protected address */
>> +            return;
>>          } else {
>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>>                  goto do_unaligned_access;
>> @@ -537,18 +574,49 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>               * exclusive-protected memory. */
>>              hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
>>
>> -            bool set_to_dirty;
>> -
>>              /* Two cases of invalidation: the current vCPU is writing to another
>>               * vCPU's exclusive address or the vCPU that issued the LoadLink is
>>               * writing to it, but not through a StoreCond. */
>> -            set_to_dirty = lookup_cpus_ll_addr(hw_addr);
>> -            set_to_dirty |= env->ll_sc_context &&
>> -                           (env->excl_protected_hwaddr == hw_addr);
>> +            qemu_mutex_lock(&tcg_excl_access_lock);
>> +
>> +            /* The macro lookup_cpus_ll_addr could have reset the exclusive
>> +             * address. Fail the SC in this case.
>> +             * N.B.: Here excl_succeeded == 0 means that we don't come from a
>> +             * store conditional.  */
>> +            if (env->excl_succeeded &&
>> +                        (env->excl_protected_hwaddr == EXCLUSIVE_RESET_ADDR)) {
>> +                env->excl_succeeded = 0;
>> +                qemu_mutex_unlock(&tcg_excl_access_lock);
>> +
>> +                return;
>> +            }
>> +
>> +            lookup_cpus_ll_addr(hw_addr);
>> +
>> +            if (!env->excl_succeeded) {
>> +                if (env->ll_sc_context &&
>> +                            (env->excl_protected_hwaddr == hw_addr)) {
>> +                    cpu_physical_memory_set_excl_dirty(hw_addr);
>> +                }
>> +            } else {
>> +                if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
>> +                        env->excl_protected_hwaddr != hw_addr) {
>> +                    env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
>> +                    qemu_mutex_unlock(&tcg_excl_access_lock);
>> +                    env->excl_succeeded = 0;
>> +
>> +                    return;
>> +                }
>> +            }
>
> Given the amount of copy/paste between the two functions I wonder if
> there is some commonality to be re-factored out here?

Either we create inline functions outside this file (like in cputlb.c)
or we define new macros.
I don't see other viable ways to re-factor this code.
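
As a rough illustration of the inline-function route (the helper name is
invented; the body is just the common part of the two quoted hunks, so the
locking and flag conventions are the same as in the patch):

/* Sketch: common pre-store bookkeeping shared by the LE/BE store helpers.
 * Returns true if the caller should perform the actual store (with
 * tcg_excl_access_lock still held), false if it must bail out early. */
static inline bool excl_store_check(CPUArchState *env, hwaddr hw_addr)
{
    qemu_mutex_lock(&tcg_excl_access_lock);

    /* A racing store may already have reset our exclusive address. */
    if (env->excl_succeeded &&
        env->excl_protected_hwaddr == EXCLUSIVE_RESET_ADDR) {
        env->excl_succeeded = 0;
        qemu_mutex_unlock(&tcg_excl_access_lock);
        return false;
    }

    lookup_cpus_ll_addr(hw_addr); /* resets the other vCPUs' LL address */

    if (!env->excl_succeeded) {
        /* Plain store from the vCPU that issued the LL: re-dirty the page. */
        if (env->ll_sc_context && env->excl_protected_hwaddr == hw_addr) {
            cpu_physical_memory_set_excl_dirty(hw_addr);
        }
    } else if (cpu_physical_memory_excl_is_dirty(hw_addr) ||
               env->excl_protected_hwaddr != hw_addr) {
        /* StoreConditional to a page touched in the meantime: fail it. */
        env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
        env->excl_succeeded = 0;
        qemu_mutex_unlock(&tcg_excl_access_lock);
        return false;
    }

    return true; /* caller stores, resets excl_protected_hwaddr and unlocks */
}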

Thank you,
alvise

>
>>
>> -            if (set_to_dirty) {
>> -                cpu_physical_memory_set_excl_dirty(hw_addr);
>> -            } /* the vCPU is legitimately writing to the protected address */
>> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> +
>> +            glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
>> +
>> +            qemu_mutex_unlock(&tcg_excl_access_lock);
>> +            env->excl_protected_hwaddr = EXCLUSIVE_RESET_ADDR;
>> +
>> +            return;
>>          } else {
>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>>                  goto do_unaligned_access;
>
> --
> Alex Bennée

Thread overview: 41+ messages
2015-07-10  8:23 [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Alvise Rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 01/13] exec: Add new exclusive bitmap to ram_list Alvise Rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 02/13] cputlb: Add new TLB_EXCL flag Alvise Rigo
2015-07-16 14:32   ` Alex Bennée
2015-07-16 15:04     ` alvise rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 03/13] softmmu: Add helpers for a new slow-path Alvise Rigo
2015-07-16 14:53   ` Alex Bennée
2015-07-16 15:15     ` alvise rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 04/13] tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions Alvise Rigo
2015-07-17  9:49   ` Alex Bennée
2015-07-17 10:05     ` alvise rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 05/13] target-arm: translate: implement qemu_ldlink and qemu_stcond ops Alvise Rigo
2015-07-17 12:51   ` Alex Bennée
2015-07-17 13:01     ` alvise rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 06/13] target-i386: " Alvise Rigo
2015-07-17 12:56   ` Alex Bennée
2015-07-17 13:27     ` alvise rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 07/13] ram_addr.h: Make exclusive bitmap accessors atomic Alvise Rigo
2015-07-17 13:32   ` Alex Bennée
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 08/13] exec.c: introduce a simple rendezvous support Alvise Rigo
2015-07-17 13:45   ` Alex Bennée
2015-07-17 13:54     ` alvise rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 09/13] cpus.c: introduce simple callback support Alvise Rigo
2015-07-10  9:36   ` Paolo Bonzini
2015-07-10  9:47     ` alvise rigo
2015-07-10  9:53       ` Frederic Konrad
2015-07-10 10:06         ` alvise rigo
2015-07-10 10:24       ` Paolo Bonzini
2015-07-10 12:16         ` Frederic Konrad
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 10/13] Simple TLB flush wrap to use as exit callback Alvise Rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 11/13] Introduce exit_flush_req and tcg_excl_access_lock Alvise Rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 12/13] softmmu_llsc_template.h: move to multithreading Alvise Rigo
2015-07-17 15:27   ` Alex Bennée
2015-07-17 15:31     ` alvise rigo
2015-07-10  8:23 ` [Qemu-devel] [RFC v3 13/13] softmmu_template.h: " Alvise Rigo
2015-07-17 15:57   ` Alex Bennée
2015-07-17 16:19     ` alvise rigo
2015-07-10  8:31 ` [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation Mark Burton
2015-07-10  8:58   ` alvise rigo
2015-07-10  8:39 ` Frederic Konrad
2015-07-10  9:04   ` alvise rigo
