* [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation
From: Alvise Rigo @ 2015-09-24  8:32 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

This is the fifth iteration of the patch series which applies to the
upstream branch of QEMU (v2.4.0).

Changes versus previous versions are at the bottom of this cover letter.

The code is also available at the following repository:
https://git.virtualopensystems.com/dev/qemu-mt.git
branch:
slowpath-for-atomic-v5-no-mttcg
(branch slowpath-for-atomic-v5-mttcg for the version based on mttcg)

This patch series provides an infrastructure for implementing atomic
instructions in QEMU, thus offering a 'legacy' solution for translating
guest atomic instructions. Moreover, it can be considered a first step
toward a multi-threaded TCG.

The underlying idea is to provide new TCG helpers (a sort of softmmu
helper) that guarantee atomicity for some memory accesses or, more
generally, provide a way to define memory transactions.

More specifically, the new softmmu helpers behave as LoadLink and
StoreConditional instructions, and are called from TCG code by means of
target-specific helpers. This work includes the implementation for all
the ARM atomic instructions; see target-arm/op_helper.c.

The implementation relies heavily on the software TLB together with a new
bitmap added to the ram_list structure, which flags, on a per-vCPU basis,
all the memory pages that are in the middle of a LoadLink (LL) /
StoreConditional (SC) sequence.  Since all these pages can be accessed
directly through the fast-path, and such accesses could alter a vCPU's
linked value, the new bitmap has been coupled with a new TLB flag for the
TLB virtual address which forces the slow-path execution for all the
accesses to a page containing a linked address.

The new slow-path is implemented such that:
- the LL behaves as a normal load slow-path, except that it clears the
  dirty flag in the bitmap.  The cputlb.c code, while generating a TLB
  entry, checks whether there is at least one vCPU with the bit cleared
  in the exclusive bitmap; in that case the TLB entry will have the EXCL
  flag set, thus forcing the slow-path.  In order to ensure that all the
  vCPUs follow the slow-path for that page, we flush the TLB cache of
  all the other vCPUs.

  The LL also records the linked address and the size of the access in a
  vCPU-private variable. After the corresponding SC, this address is set
  back to a reset value.

- the SC can fail, returning 1, or succeed, returning 0.  It must always
  come after an LL and must access the same address 'linked' by the
  previous LL, otherwise it fails. If, in the time window delimited by a
  legitimate pair of LL/SC operations, another write access happens to
  the linked address, the SC will fail.

In theory, the provided implementation of TCG LoadLink/StoreConditional
can be used to properly handle atomic instructions on any architecture.
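
To make the intended semantics concrete, here is a toy, single-threaded
C model of an LL/SC pair (illustration only, not QEMU code; the names
ll_load/sc_store/plain_store are hypothetical stand-ins for the new
helpers and for a regular store):

    #include <stdint.h>
    #include <stdio.h>

    /* "No link currently held", as in the patches. */
    #define EXCLUSIVE_RESET_ADDR ((uintptr_t)-1)

    static uintptr_t linked_addr = EXCLUSIVE_RESET_ADDR;

    static uint32_t ll_load(uint32_t *p)
    {
        linked_addr = (uintptr_t)p;              /* LL: record the linked address */
        return *p;
    }

    static int sc_store(uint32_t *p, uint32_t v)
    {
        if (linked_addr != (uintptr_t)p) {
            return 1;                            /* fail: link lost or wrong address */
        }
        *p = v;
        linked_addr = EXCLUSIVE_RESET_ADDR;      /* the link is consumed by the SC */
        return 0;                                /* success */
    }

    static void plain_store(uint32_t *p, uint32_t v)
    {
        *p = v;
        if (linked_addr == (uintptr_t)p) {
            linked_addr = EXCLUSIVE_RESET_ADDR;  /* a write to the linked address
                                                  * breaks the link */
        }
    }

    int main(void)
    {
        uint32_t x = 41, old;

        old = ll_load(&x);
        plain_store(&x, 100);
        printf("SC after intervening write: %d\n", sc_store(&x, old + 1)); /* 1 */

        old = ll_load(&x);
        printf("SC: %d, x = %u\n", sc_store(&x, old + 1), x);              /* 0, 101 */
        return 0;
    }

In the real implementation this state is distributed: each vCPU keeps its
own excl_protected_range, the per-page exclusive bitmap tracks which vCPUs
may still take the fast-path, and the TLB_EXCL flag funnels every
potentially conflicting store through the slow-path, where the check above
is performed.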

The code has been tested with bare-metal test cases and by booting Linux.

* Performance considerations
The new slow-path adds some overhead to the translation of the ARM
atomic instructions, since their emulation no longer happens purely in
TCG-generated code but requires the execution of two helper functions.
Despite this, the additional time required to boot an ARM Linux kernel
on an i7 clocked at 2.5GHz is negligible.
On an LL/SC-bound test scenario, however - such as:
https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
solution requires 30% (1 million iterations) to 70% (10 million
iterations) additional time for the test to complete.

Changes from v4:
- Reworked the exclusive bitmap to be of fixed size (8 bits per page)
- The slow-path is now TCG backend independent, no need to touch
  tcg/* anymore as suggested by Aurelien Jarno.

Changes from v3:
- based on upstream QEMU
- addressed comments from Alex Bennée
- the slow path can be enabled by the user with:
  ./configure --enable-tcg-ldst-excl only if the backend supports it
- all the ARM ldex/stex instructions now make use of the slow path
- added aarch64 TCG backend support
- part of the code has been rewritten

Changes from v2:
- the bitmap accessors are now atomic
- a rendezvous between vCPUs and simple callback support before executing
  a TB have been added to handle TLB flushes
- the softmmu_template and softmmu_llsc_template have been adapted to work
  with real multi-threading

Changes from v1:
- The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
- The way the offset used to access the bitmap is calculated has
  been improved and fixed
- Setting a page as dirty now requires a vCPU to target the protected
  address, not just any address in the page
- Addressed comments from Richard Henderson to improve the logic in
  softmmu_template.h and to simplify the methods generation through
  softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386

This work has been sponsored by Huawei Technologies Duesseldorf GmbH.

Alvise Rigo (6):
  exec.c: Add new exclusive bitmap to ram_list
  softmmu: Add new TLB_EXCL flag
  softmmu: Add helpers for a new slowpath
  target-arm: Create new runtime helpers for excl accesses
  configure: Use slow-path for atomic only when the softmmu is enabled
  target-arm: translate: Use ld/st excl for atomic insns

 configure               |  11 +++++
 cputlb.c                |  43 ++++++++++++++++-
 exec.c                  |   8 +++-
 include/exec/cpu-all.h  |   8 ++++
 include/exec/cpu-defs.h |  12 +++++
 include/exec/memory.h   |   3 +-
 include/exec/ram_addr.h |  75 +++++++++++++++++++++++++++++
 softmmu_llsc_template.h | 124 ++++++++++++++++++++++++++++++++++++++++++++++++
 softmmu_template.h      | 124 +++++++++++++++++++++++++++++++++++++++---------
 target-arm/helper.h     |  10 ++++
 target-arm/op_helper.c  |  94 ++++++++++++++++++++++++++++++++++++
 target-arm/translate.c  | 121 ++++++++++++++++++++++++++++++++++++++++++++--
 tcg/tcg.h               |  30 ++++++++++++
 13 files changed, 633 insertions(+), 30 deletions(-)
 create mode 100644 softmmu_llsc_template.h

-- 
2.5.3

* [Qemu-devel] [RFC v5 1/6] exec.c: Add new exclusive bitmap to ram_list
From: Alvise Rigo @ 2015-09-24  8:32 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

The purpose of this new bitmap is to flag the memory pages that are in
the middle of LL/SC operations (after an LL, before an SC) on a per-vCPU
basis.
For all these pages, the corresponding TLB entries will be generated in
such a way as to force the slow-path if at least one vCPU has the bit
not set.
When the system starts, the whole memory is dirty (the whole bitmap is
set). A page, after being marked as exclusively-clean, is restored to
dirty after the SC.

For each page we keep 8 bits, shared among all the vCPUs available in
the system. In general, bit n % 8 corresponds to vCPU n.
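
For illustration only, the indexing just described boils down to the
following arithmetic (a minimal sketch mirroring the
EXCL_BITMAP_GET_BYTE_OFFSET()/EXCL_IDX() macros added below;
TARGET_PAGE_BITS == 12 is assumed purely for the example):

    #include <stdio.h>

    #define TARGET_PAGE_BITS    12   /* assumption for the example */
    #define EXCL_BITMAP_CELL_SZ 8    /* one 8-bit cell per page */

    static unsigned long excl_byte_offset(unsigned long ram_addr)
    {
        return ram_addr >> TARGET_PAGE_BITS;   /* cell (byte) of the page */
    }

    static unsigned excl_bit(unsigned cpu)
    {
        return cpu % EXCL_BITMAP_CELL_SZ;      /* bit for vCPU n is n % 8 */
    }

    int main(void)
    {
        /* vCPU 10, page at ram address 0x5000 -> cell 5, bit 2 */
        printf("cell %lu, bit %u\n", excl_byte_offset(0x5000), excl_bit(10));
        return 0;
    }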

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 exec.c                  |  8 ++++--
 include/exec/memory.h   |  3 +-
 include/exec/ram_addr.h | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+), 3 deletions(-)

diff --git a/exec.c b/exec.c
index 0a4a0c5..cbe559f 100644
--- a/exec.c
+++ b/exec.c
@@ -1496,11 +1496,15 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
         int i;
 
         /* ram_list.dirty_memory[] is protected by the iothread lock.  */
-        for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
+        for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
             ram_list.dirty_memory[i] =
                 bitmap_zero_extend(ram_list.dirty_memory[i],
                                    old_ram_size, new_ram_size);
-       }
+        }
+        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] = bitmap_zero_extend(
+                ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                old_ram_size * EXCL_BITMAP_CELL_SZ,
+                new_ram_size * EXCL_BITMAP_CELL_SZ);
     }
     cpu_physical_memory_set_dirty_range(new_block->offset,
                                         new_block->used_length,
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 94d20ea..b71cb98 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@
 #define DIRTY_MEMORY_VGA       0
 #define DIRTY_MEMORY_CODE      1
 #define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM       4        /* num of dirty bits */
 
 #include <stdint.h>
 #include <stdbool.h>
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index c113f21..0016a4b 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
 
 #ifndef CONFIG_USER_ONLY
 #include "hw/xen/xen.h"
+#include "sysemu/sysemu.h"
 
 ram_addr_t qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
                                     bool share, const char *mem_path,
@@ -44,6 +45,13 @@ int qemu_ram_resize(ram_addr_t base, ram_addr_t newsize, Error **errp);
 #define DIRTY_CLIENTS_ALL     ((1 << DIRTY_MEMORY_NUM) - 1)
 #define DIRTY_CLIENTS_NOCODE  (DIRTY_CLIENTS_ALL & ~(1 << DIRTY_MEMORY_CODE))
 
+/* Exclusive bitmap support. */
+#define EXCL_BITMAP_CELL_SZ 8
+#define EXCL_BITMAP_GET_BIT_OFFSET(addr) \
+        (EXCL_BITMAP_CELL_SZ * (addr >> TARGET_PAGE_BITS))
+#define EXCL_BITMAP_GET_BYTE_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
+#define EXCL_IDX(cpu) (cpu % EXCL_BITMAP_CELL_SZ)
+
 static inline bool cpu_physical_memory_get_dirty(ram_addr_t start,
                                                  ram_addr_t length,
                                                  unsigned client)
@@ -135,6 +143,11 @@ static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start,
     if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
         bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
     }
+    if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
+        bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE],
+                        page * EXCL_BITMAP_CELL_SZ,
+                        (end - page) * EXCL_BITMAP_CELL_SZ);
+    }
     xen_modified_memory(start, length);
 }
 
@@ -249,5 +262,67 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
     return num_dirty;
 }
 
+/* One cell for each page. The n-th bit of a cell describes all the i-th vCPUs
+ * such that (i % EXCL_BITMAP_CELL_SZ) == n.
+ * A bit set to zero ensures that all the vCPUs described by the bit have the
+ * EXCL_BIT set for the page. */
+static inline void cpu_physical_memory_set_excl_dirty(ram_addr_t addr,
+                                                      uint32_t cpu)
+{
+    set_bit_atomic(EXCL_BITMAP_GET_BIT_OFFSET(addr) + EXCL_IDX(cpu),
+            ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+static inline int cpu_physical_memory_excl_atleast_one_clean(ram_addr_t addr)
+{
+    uint8_t *bitmap;
+
+    bitmap = (uint8_t *)(ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+
+    /* This is safe even if smp_cpus < 8 since the unused bits are always 1. */
+    return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] != UCHAR_MAX;
+}
+
+/* Return true if the @cpu has the bit set for the page of @addr.
+ * If @cpu == smp_cpus return true if at least one vCPU has the dirty bit set
+ * for that page. */
+static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr,
+                                                    unsigned long cpu)
+{
+    uint8_t *bitmap;
+
+    bitmap = (uint8_t *)ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
+
+    if (cpu == smp_cpus) {
+        if (smp_cpus >= EXCL_BITMAP_CELL_SZ) {
+            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)];
+        } else {
+            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] &
+                                            ((1 << smp_cpus) - 1);
+        }
+    } else {
+        return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] & (1 << EXCL_IDX(cpu));
+    }
+}
+
+/* Clean the dirty bit of @cpu. If @cpu == smp_cpus clean the dirty bit for all
+ * the vCPUs. */
+static inline int cpu_physical_memory_clear_excl_dirty(ram_addr_t addr,
+                                                        uint32_t cpu)
+{
+    if (cpu == smp_cpus) {
+        int nr = (smp_cpus >= EXCL_BITMAP_CELL_SZ) ?
+                            EXCL_BITMAP_CELL_SZ : smp_cpus;
+
+        return bitmap_test_and_clear_atomic(
+                        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                        EXCL_BITMAP_GET_BIT_OFFSET(addr), nr);
+    } else {
+        return bitmap_test_and_clear_atomic(
+                        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                        EXCL_BITMAP_GET_BIT_OFFSET(addr) + EXCL_IDX(cpu), 1);
+    }
+}
+
 #endif
 #endif
-- 
2.5.3

* [Qemu-devel] [RFC v5 2/6] softmmu: Add new TLB_EXCL flag
From: Alvise Rigo @ 2015-09-24  8:32 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Add a new TLB flag to force all the accesses made to a page to follow
the slow-path.

When we remove a TLB entry marked as EXCL, we set the corresponding bit
in the exclusive bitmap back to dirty.
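
As a minimal sketch of why this works (not QEMU code; the page size is an
assumption for the example), the TCG fast-path hit test compares the
page-aligned guest address against the raw addr_write value, so any flag
bit stored in the entry - TLB_NOTDIRTY, TLB_MMIO or the new TLB_EXCL -
makes the comparison fail and routes the access to the store helper:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_PAGE_BITS 12                            /* assumption */
    #define TARGET_PAGE_MASK (~((1u << TARGET_PAGE_BITS) - 1))
    #define TLB_EXCL         (1u << 6)

    /* Simplified version of the fast-path TLB hit test. */
    static bool fast_path_hit(uint32_t vaddr, uint32_t tlb_addr_write)
    {
        return (vaddr & TARGET_PAGE_MASK) == tlb_addr_write;
    }

    int main(void)
    {
        uint32_t vaddr = 0x00012345;
        uint32_t entry_plain = vaddr & TARGET_PAGE_MASK;   /* 0x00012000 */
        uint32_t entry_excl  = entry_plain | TLB_EXCL;     /* 0x00012040 */

        printf("plain entry: hit = %d\n", fast_path_hit(vaddr, entry_plain)); /* 1 */
        printf("EXCL entry:  hit = %d\n", fast_path_hit(vaddr, entry_excl));  /* 0 */
        return 0;
    }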

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                |  40 ++++++++++++++++-
 include/exec/cpu-all.h  |   8 ++++
 include/exec/cpu-defs.h |  12 ++++++
 softmmu_template.h      | 112 ++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 149 insertions(+), 23 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index a506086..1e25a2a 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -299,6 +299,14 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
     env->tlb_v_table[mmu_idx][vidx] = *te;
     env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
 
+    if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL))) {
+        /* We are removing an exclusive entry, set the page to dirty. This
+         * is not be necessary if the vCPU has performed both SC and LL. */
+        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
+                                          (te->addr_write & TARGET_PAGE_MASK);
+        cpu_physical_memory_set_excl_dirty(hw_addr, cpu->cpu_index);
+    }
+
     /* refill the tlb */
     env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
     env->iotlb[mmu_idx][index].attrs = attrs;
@@ -324,7 +332,15 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
                                                    + xlat)) {
             te->addr_write = address | TLB_NOTDIRTY;
         } else {
-            te->addr_write = address;
+            if (!(address & TLB_MMIO) &&
+                cpu_physical_memory_excl_atleast_one_clean(section->mr->ram_addr
+                                                           + xlat)) {
+                /* There is at least one vCPU that has flagged the address as
+                 * exclusive. */
+                te->addr_write = address | TLB_EXCL;
+            } else {
+                te->addr_write = address;
+            }
         }
     } else {
         te->addr_write = -1;
@@ -376,6 +392,28 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
     return qemu_ram_addr_from_host_nofail(p);
 }
 
+/* Atomic insn translation TLB support. */
+#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+/* For every vCPU compare the exclusive address and reset it in case of a
+ * match. Since only one vCPU is running at once, no lock has to be held to
+ * guard this operation. */
+static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
+{
+    CPUState *cpu;
+    CPUArchState *acpu;
+
+    CPU_FOREACH(cpu) {
+        acpu = (CPUArchState *)cpu->env_ptr;
+
+        if (acpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
+            ranges_overlap(acpu->excl_protected_range.begin,
+            acpu->excl_protected_range.end - acpu->excl_protected_range.begin,
+            addr, size)) {
+            acpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+        }
+    }
+}
+
 #define MMUSUFFIX _mmu
 
 #define SHIFT 0
diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index ea6a9a6..ad6afcb 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -320,6 +320,14 @@ extern RAMList ram_list;
 #define TLB_NOTDIRTY    (1 << 4)
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO        (1 << 5)
+/* Set if TLB entry references a page that requires exclusive access.  */
+#define TLB_EXCL        (1 << 6)
+
+/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
+ * above. */
+#if TLB_EXCL >= TARGET_PAGE_SIZE
+#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
+#endif
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 98b9cff..a67f295 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -27,6 +27,7 @@
 #include <inttypes.h>
 #include "qemu/osdep.h"
 #include "qemu/queue.h"
+#include "qemu/range.h"
 #include "tcg-target.h"
 #ifndef CONFIG_USER_ONLY
 #include "exec/hwaddr.h"
@@ -150,5 +151,16 @@ typedef struct CPUIOTLBEntry {
 #define CPU_COMMON                                                      \
     /* soft mmu support */                                              \
     CPU_COMMON_TLB                                                      \
+                                                                        \
+    /* Used by the atomic insn translation backend. */                  \
+    int ll_sc_context;                                                  \
+    /* vCPU current exclusive addresses range.
+     * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
+     * in the middle of a LL/SC. */                                     \
+    struct Range excl_protected_range;                                  \
+    /* Used to carry the SC result but also to flag a normal (legacy)
+     * store access made by a stcond (see softmmu_template.h). */       \
+    int excl_succeeded;                                                 \
+
 
 #endif
diff --git a/softmmu_template.h b/softmmu_template.h
index d42d89d..e4431e8 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -409,19 +409,53 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
-    /* Handle an IO access.  */
+    /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUIOTLBEntry *iotlbentry;
-        if ((addr & (DATA_SIZE - 1)) != 0) {
-            goto do_unaligned_access;
+        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+            /* The slow-path has been forced since we are writing to
+             * exclusive-protected memory. */
+            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+            /* The function lookup_and_reset_cpus_ll_addr could have reset the
+             * exclusive address. Fail the SC in this case.
+             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
+             * not been called by a softmmu_llsc_template.h. */
+            if(env->excl_succeeded) {
+                if (env->excl_protected_range.begin != hw_addr) {
+                    /* The vCPU is SC-ing to an unprotected address. */
+                    env->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+                    env->excl_succeeded = 0;
+
+                    return;
+                }
+
+                cpu_physical_memory_set_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
+            }
+
+            haddr = addr + env->tlb_table[mmu_idx][index].addend;
+        #if DATA_SIZE == 1
+            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+        #else
+            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+        #endif
+
+            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
+
+            return;
+        } else {
+            if ((addr & (DATA_SIZE - 1)) != 0) {
+                goto do_unaligned_access;
+            }
+            iotlbentry = &env->iotlb[mmu_idx][index];
+
+            /* ??? Note that the io helpers always read data in the target
+               byte ordering.  We should push the LE/BE request down into io.  */
+            val = TGT_LE(val);
+            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+            return;
         }
-        iotlbentry = &env->iotlb[mmu_idx][index];
-
-        /* ??? Note that the io helpers always read data in the target
-           byte ordering.  We should push the LE/BE request down into io.  */
-        val = TGT_LE(val);
-        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
-        return;
     }
 
     /* Handle slow unaligned access (it spans two pages or IO).  */
@@ -489,19 +523,53 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
-    /* Handle an IO access.  */
+    /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUIOTLBEntry *iotlbentry;
-        if ((addr & (DATA_SIZE - 1)) != 0) {
-            goto do_unaligned_access;
+        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+            /* The slow-path has been forced since we are writing to
+             * exclusive-protected memory. */
+            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+            /* The function lookup_and_reset_cpus_ll_addr could have reset the
+             * exclusive address. Fail the SC in this case.
+             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
+             * not been called by a softmmu_llsc_template.h. */
+            if(env->excl_succeeded) {
+                if (env->excl_protected_range.begin != hw_addr) {
+                    /* The vCPU is SC-ing to an unprotected address. */
+                    env->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+                    env->excl_succeeded = 0;
+
+                    return;
+                }
+
+                cpu_physical_memory_set_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
+            }
+
+            haddr = addr + env->tlb_table[mmu_idx][index].addend;
+        #if DATA_SIZE == 1
+            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+        #else
+            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+        #endif
+
+            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
+
+            return;
+        } else {
+            if ((addr & (DATA_SIZE - 1)) != 0) {
+                goto do_unaligned_access;
+            }
+            iotlbentry = &env->iotlb[mmu_idx][index];
+
+            /* ??? Note that the io helpers always read data in the target
+               byte ordering.  We should push the LE/BE request down into io.  */
+            val = TGT_BE(val);
+            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+            return;
         }
-        iotlbentry = &env->iotlb[mmu_idx][index];
-
-        /* ??? Note that the io helpers always read data in the target
-           byte ordering.  We should push the LE/BE request down into io.  */
-        val = TGT_BE(val);
-        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
-        return;
     }
 
     /* Handle slow unaligned access (it spans two pages or IO).  */
-- 
2.5.3

* [Qemu-devel] [RFC v5 3/6] softmmu: Add helpers for a new slowpath
From: Alvise Rigo @ 2015-09-24  8:32 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

The new helpers rely on the legacy ones to perform the actual read/write.

The LoadLink helper (helper_ldlink_name) prepares the way for the
following SC operation. It sets the linked address and the size of the
access.
This helper also updates the TLB entry of the page involved in the
LL/SC for those vCPUs that have the bit set (dirty), so that subsequent
accesses made by any vCPU will follow the slow path.

The StoreConditional helper (helper_stcond_name) returns 1 if the
store has to fail due to a concurrent access to the same page by
another vCPU. A 'concurrent access' can be a store made by *any* vCPU
(although some implementations allow stores made by the CPU that issued
the LoadLink).
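
For illustration, the interplay between the SC wrapper and the underlying
store helper can be modelled as follows (a toy, single-threaded sketch of
the excl_succeeded handshake, not QEMU code):

    #include <stdint.h>
    #include <stdio.h>

    struct toy_env {
        uintptr_t protected_begin;   /* stands in for excl_protected_range.begin */
        int excl_succeeded;
    };

    /* Stands in for the legacy store helper in softmmu_template.h. */
    static void toy_store(struct toy_env *env, uint32_t *p, uint32_t v)
    {
        if (env->excl_succeeded && env->protected_begin != (uintptr_t)p) {
            env->excl_succeeded = 0;             /* SC to an unprotected address */
            return;                              /* nothing is written */
        }
        *p = v;                                  /* normal store, or successful SC */
    }

    /* Stands in for helper_stcond_name. */
    static int toy_stcond(struct toy_env *env, uint32_t *p, uint32_t v)
    {
        int failed;

        env->excl_succeeded = 1;                 /* mark the next store as an SC */
        toy_store(env, p, v);
        failed = !env->excl_succeeded;
        env->excl_succeeded = 0;
        return failed;                           /* 0 = success, 1 = failure */
    }

    int main(void)
    {
        uint32_t x = 0;
        struct toy_env env = { .protected_begin = (uintptr_t)&x, .excl_succeeded = 0 };

        printf("SC on the protected address: %d\n", toy_stcond(&env, &x, 1)); /* 0 */

        env.protected_begin = (uintptr_t)-1;     /* the link has been reset */
        printf("SC after the link is lost:   %d\n", toy_stcond(&env, &x, 2)); /* 1 */
        return 0;
    }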

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                |   3 ++
 softmmu_llsc_template.h | 124 ++++++++++++++++++++++++++++++++++++++++++++++++
 softmmu_template.h      |  12 +++++
 tcg/tcg.h               |  30 ++++++++++++
 4 files changed, 169 insertions(+)
 create mode 100644 softmmu_llsc_template.h

diff --git a/cputlb.c b/cputlb.c
index 1e25a2a..d5aae7c 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -416,6 +416,8 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
 
 #define MMUSUFFIX _mmu
 
+/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
+#define GEN_EXCLUSIVE_HELPERS
 #define SHIFT 0
 #include "softmmu_template.h"
 
@@ -428,6 +430,7 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
 #define SHIFT 3
 #include "softmmu_template.h"
 #undef MMUSUFFIX
+#undef GEN_EXCLUSIVE_HELPERS
 
 #define MMUSUFFIX _cmmu
 #undef GETPC_ADJ
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
new file mode 100644
index 0000000..9f22834
--- /dev/null
+++ b/softmmu_llsc_template.h
@@ -0,0 +1,124 @@
+/*
+ *  Software MMU support (esclusive load/store operations)
+ *
+ * Generate helpers used by TCG for qemu_ldlink/stcond ops.
+ *
+ * Included from softmmu_template.h only.
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ *  Alvise Rigo <a.rigo@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* This template does not generate together the le and be version, but only one
+ * of the two depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has been set.
+ * The same nomenclature as softmmu_template.h is used for the exclusive
+ * helpers.  */
+
+#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
+
+#define helper_ldlink_name  glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
+
+#else /* LE helpers + 8bit helpers (generated only once for both LE end BE) */
+
+#if DATA_SIZE > 1
+#define helper_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
+#else /* DATA_SIZE <= 1 */
+#define helper_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
+#endif
+
+#endif
+
+WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
+                                TCGMemOpIdx oi, uintptr_t retaddr)
+{
+    WORD_TYPE ret;
+    int index;
+    CPUState *cpu;
+    hwaddr hw_addr;
+    unsigned mmu_idx = get_mmuidx(oi);
+
+    /* Use the proper load helper from cpu_ldst.h */
+    ret = helper_ld(env, addr, mmu_idx, retaddr);
+
+    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+
+    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
+     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
+    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
+
+    cpu_physical_memory_clear_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
+    /* If all the vCPUs have the EXCL bit set for this page there is no need
+     * to request any flush. */
+    if (cpu_physical_memory_excl_is_dirty(hw_addr, smp_cpus)) {
+        CPU_FOREACH(cpu) {
+            if (current_cpu != cpu) {
+                if (cpu_physical_memory_excl_is_dirty(hw_addr,
+                                                    cpu->cpu_index)) {
+                    cpu_physical_memory_clear_excl_dirty(hw_addr,
+                                                         cpu->cpu_index);
+                    tlb_flush(cpu, 1);
+                }
+            }
+        }
+    }
+
+    env->excl_protected_range.begin = hw_addr;
+    env->excl_protected_range.end = hw_addr + DATA_SIZE;
+
+    /* For this vCPU, just update the TLB entry, no need to flush. */
+    env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
+
+    return ret;
+}
+
+WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
+                             DATA_TYPE val, TCGMemOpIdx oi,
+                             uintptr_t retaddr)
+{
+    WORD_TYPE ret;
+    unsigned mmu_idx = get_mmuidx(oi);
+
+    /* We set it preventively to true to distinguish the following legacy
+     * access as one made by the store conditional wrapper. If the store
+     * conditional does not succeed, the value will be set to 0.*/
+    env->excl_succeeded = 1;
+    helper_st(env, addr, val, mmu_idx, retaddr);
+
+    if (env->excl_succeeded) {
+        env->excl_succeeded = 0;
+        ret = 0;
+    } else {
+        ret = 1;
+    }
+
+    return ret;
+}
+
+#undef helper_ldlink_name
+#undef helper_stcond_name
+#undef helper_ld
+#undef helper_st
diff --git a/softmmu_template.h b/softmmu_template.h
index e4431e8..ad65d20 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -640,6 +640,18 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
 #endif
 #endif /* !defined(SOFTMMU_CODE_ACCESS) */
 
+#ifdef GEN_EXCLUSIVE_HELPERS
+
+#if DATA_SIZE > 1 /* The 8-bit helpers are generate along with LE helpers */
+#define BIGENDIAN_EXCLUSIVE_HELPERS
+#include "softmmu_llsc_template.h"
+#undef BIGENDIAN_EXCLUSIVE_HELPERS
+#endif
+
+#include "softmmu_llsc_template.h"
+
+#endif /* !defined(GEN_EXCLUSIVE_HELPERS) */
+
 #undef READ_ACCESS_TYPE
 #undef SHIFT
 #undef DATA_TYPE
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 231a781..f8e6e68 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -957,6 +957,21 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
                                     TCGMemOpIdx oi, uintptr_t retaddr);
 uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_be_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
 
 /* Value sign-extended to tcg register size.  */
 tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
@@ -984,6 +999,21 @@ void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                        TCGMemOpIdx oi, uintptr_t retaddr);
 void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                        TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
+                            uint8_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
+                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
+                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
+                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_stcondw_mmu(CPUArchState *env, target_ulong addr,
+                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_stcondl_mmu(CPUArchState *env, target_ulong addr,
+                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_be_stcondq_mmu(CPUArchState *env, target_ulong addr,
+                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
 
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
-- 
2.5.3

* [Qemu-devel] [RFC v5 4/6] target-arm: Create new runtime helpers for excl accesses
From: Alvise Rigo @ 2015-09-24  8:32 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Introduce a set of new runtime helpers to handle exclusive instructions.
These helpers are used as hooks to call the respective LL/SC helpers in
softmmu_llsc_template.h from TCG code.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 target-arm/helper.h    | 10 ++++++
 target-arm/op_helper.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)

diff --git a/target-arm/helper.h b/target-arm/helper.h
index 827b33d..8e7a7c2 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -530,6 +530,16 @@ DEF_HELPER_2(dc_zva, void, env, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 
+DEF_HELPER_3(ldlink_aa32_i8, i32, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i16, i32, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i32, i32, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i64, i64, env, i32, i32)
+
+DEF_HELPER_4(stcond_aa32_i8, i32, env, i32, i32, i32)
+DEF_HELPER_4(stcond_aa32_i16, i32, env, i32, i32, i32)
+DEF_HELPER_4(stcond_aa32_i32, i32, env, i32, i32, i32)
+DEF_HELPER_4(stcond_aa32_i64, i32, env, i32, i64, i32)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #endif
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 663c05d..d832ba8 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -969,3 +969,97 @@ uint32_t HELPER(ror_cc)(CPUARMState *env, uint32_t x, uint32_t i)
         return ((uint32_t)x >> shift) | (x << (32 - shift));
     }
 }
+
+/* LoadLink helpers, only unsigned. */
+static void * const qemu_ldex_helpers[16] = {
+    [MO_UB]   = helper_ret_ldlinkub_mmu,
+
+    [MO_LEUW] = helper_le_ldlinkuw_mmu,
+    [MO_LEUL] = helper_le_ldlinkul_mmu,
+    [MO_LEQ]  = helper_le_ldlinkq_mmu,
+
+    [MO_BEUW] = helper_be_ldlinkuw_mmu,
+    [MO_BEUL] = helper_be_ldlinkul_mmu,
+    [MO_BEQ]  = helper_be_ldlinkq_mmu,
+};
+
+#define LDEX_HELPER(SUFF, OPC)                                          \
+uint32_t HELPER(ldlink_aa32_i##SUFF)(CPUARMState *env, uint32_t addr,   \
+                                                       uint32_t index)  \
+{                                                                       \
+    CPUArchState *state = env;                                          \
+    TCGMemOpIdx op;                                                     \
+                                                                        \
+    op = make_memop_idx(OPC, index);                                    \
+                                                                        \
+    tcg_target_ulong (*func)(CPUArchState *env, target_ulong addr,      \
+                             TCGMemOpIdx oi, uintptr_t retaddr);        \
+    func = qemu_ldex_helpers[OPC];                                      \
+                                                                        \
+    return (uint32_t)func(state, addr, op, GETRA());                    \
+}
+
+LDEX_HELPER(8, MO_UB)
+LDEX_HELPER(16, MO_TEUW)
+LDEX_HELPER(32, MO_TEUL)
+
+uint64_t HELPER(ldlink_aa32_i64)(CPUARMState *env, uint32_t addr,
+                                                   uint32_t index)
+{
+    CPUArchState *state = env;
+    TCGMemOpIdx op;
+
+    op = make_memop_idx(MO_TEQ, index);
+
+    uint64_t (*func)(CPUArchState *env, target_ulong addr,
+                             TCGMemOpIdx oi, uintptr_t retaddr);
+    func = qemu_ldex_helpers[MO_TEQ];
+
+    return func(state, addr, op, GETRA());
+}
+
+/* StoreConditional helpers. Use the macro below to access them. */
+static void * const qemu_stex_helpers[16] = {
+    [MO_UB]   = helper_ret_stcondb_mmu,
+    [MO_LEUW] = helper_le_stcondw_mmu,
+    [MO_LEUL] = helper_le_stcondl_mmu,
+    [MO_LEQ]  = helper_le_stcondq_mmu,
+    [MO_BEUW] = helper_be_stcondw_mmu,
+    [MO_BEUL] = helper_be_stcondl_mmu,
+    [MO_BEQ]  = helper_be_stcondq_mmu,
+};
+
+#define STEX_HELPER(SUFF, DATA_TYPE, OPC)                               \
+uint32_t HELPER(stcond_aa32_i##SUFF)(CPUARMState *env, uint32_t addr,   \
+                                     uint32_t val, uint32_t index)      \
+{                                                                       \
+    CPUArchState *state = env;                                          \
+    TCGMemOpIdx op;                                                     \
+                                                                        \
+    op = make_memop_idx(OPC, index);                                    \
+                                                                        \
+    tcg_target_ulong (*func)(CPUArchState *env, target_ulong addr,      \
+                DATA_TYPE val, TCGMemOpIdx oi, uintptr_t retaddr);      \
+    func = qemu_stex_helpers[OPC];                                      \
+                                                                        \
+    return (uint32_t)func(state, addr, val, op, GETRA());               \
+}
+
+STEX_HELPER(8, uint8_t, MO_UB)
+STEX_HELPER(16, uint16_t, MO_TEUW)
+STEX_HELPER(32, uint32_t, MO_TEUL)
+
+uint32_t HELPER(stcond_aa32_i64)(CPUARMState *env, uint32_t addr,
+                                 uint64_t val, uint32_t index)
+{
+    CPUArchState *state = env;
+    TCGMemOpIdx op;
+
+    op = make_memop_idx(MO_TEQ, index);
+
+    tcg_target_ulong (*func)(CPUArchState *env, target_ulong addr,
+                 uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+    func = qemu_stex_helpers[MO_TEQ];
+
+    return (uint32_t)func(state, addr, val, op, GETRA());
+}
-- 
2.5.3

* [Qemu-devel] [RFC v5 5/6] configure: Use slow-path for atomic only when the softmmu is enabled
From: Alvise Rigo @ 2015-09-24  8:32 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Use the new slow path for atomic instruction translation when the
softmmu is enabled.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 configure | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/configure b/configure
index cd219d8..5f72977 100755
--- a/configure
+++ b/configure
@@ -1391,6 +1391,13 @@ if test "$ARCH" = "unknown"; then
     fi
 fi
 
+# Use the slow-path for atomic instructions if the softmmu is enabled
+if test "$softmmu" = "yes"; then
+    tcg_use_ldst_excl="yes"
+else
+    tcg_use_ldst_excl="no"
+fi
+
 # Consult white-list to determine whether to enable werror
 # by default.  Only enable by default for git builds
 z_version=`cut -f3 -d. $source_path/VERSION`
@@ -4542,6 +4549,7 @@ echo "Install blobs     $blobs"
 echo "KVM support       $kvm"
 echo "RDMA support      $rdma"
 echo "TCG interpreter   $tcg_interpreter"
+echo "use ld/st excl    $tcg_use_ldst_excl"
 echo "fdt support       $fdt"
 echo "preadv support    $preadv"
 echo "fdatasync         $fdatasync"
@@ -4920,6 +4928,9 @@ fi
 if test "$tcg_interpreter" = "yes" ; then
   echo "CONFIG_TCG_INTERPRETER=y" >> $config_host_mak
 fi
+if test "$tcg_use_ldst_excl" = "yes" ; then
+  echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
+fi
 if test "$fdatasync" = "yes" ; then
   echo "CONFIG_FDATASYNC=y" >> $config_host_mak
 fi
-- 
2.5.3

* [Qemu-devel] [RFC v5 6/6] target-arm: translate: Use ld/st excl for atomic insns
From: Alvise Rigo @ 2015-09-24  8:32 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini

Use the new LL/SC runtime helpers from softmmu_llsc_template.h to handle
the ARM atomic instructions.

In general, the helper generator
gen_helper_{ldlink,stcond}_aa32_i{8,16,32,64}() emits a call that ends up
in the corresponding helper_{le,be}_{ldlink,stcond}{ub,uw,ul,q}_mmu()
implemented in softmmu_llsc_template.h.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 target-arm/translate.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 117 insertions(+), 4 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 69ac18c..fa85455 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -65,8 +65,12 @@ TCGv_ptr cpu_env;
 static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
 static TCGv_i32 cpu_R[16];
 static TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
+#ifndef CONFIG_TCG_USE_LDST_EXCL
 static TCGv_i64 cpu_exclusive_addr;
 static TCGv_i64 cpu_exclusive_val;
+#else
+static TCGv_i32 cpu_ll_sc_context;
+#endif
 #ifdef CONFIG_USER_ONLY
 static TCGv_i64 cpu_exclusive_test;
 static TCGv_i32 cpu_exclusive_info;
@@ -99,10 +103,15 @@ void arm_translate_init(void)
     cpu_VF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, VF), "VF");
     cpu_ZF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, ZF), "ZF");
 
+#ifdef CONFIG_TCG_USE_LDST_EXCL
+    cpu_ll_sc_context = tcg_global_mem_new_i32(TCG_AREG0,
+        offsetof(CPUARMState, ll_sc_context), "ll_sc_context");
+#else
     cpu_exclusive_addr = tcg_global_mem_new_i64(TCG_AREG0,
         offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
     cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
         offsetof(CPUARMState, exclusive_val), "exclusive_val");
+#endif
 #ifdef CONFIG_USER_ONLY
     cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
         offsetof(CPUARMState, exclusive_test), "exclusive_test");
@@ -7382,15 +7391,62 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
     tcg_gen_or_i32(cpu_ZF, lo, hi);
 }
 
-/* Load/Store exclusive instructions are implemented by remembering
+/* If the softmmu is enabled, the translation of Load/Store exclusive
+ * instructions will rely on the gen_helper_{ldlink,stcond} helpers,
+ * offloading most of the work to the softmmu_llsc_template.h functions.
+
+   Otherwise, these instructions are implemented by remembering
    the value/address loaded, and seeing if these are the same
    when the store is performed. This should be sufficient to implement
    the architecturally mandated semantics, and avoids having to monitor
    regular stores.
 
-   In system emulation mode only one CPU will be running at once, so
-   this sequence is effectively atomic.  In user emulation mode we
-   throw an exception and handle the atomic operation elsewhere.  */
+   In user emulation mode we throw an exception and handle the atomic
+   operation elsewhere.  */
+#ifdef CONFIG_TCG_USE_LDST_EXCL
+static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
+                               TCGv_i32 addr, int size)
+ {
+    TCGv_i32 tmp = tcg_temp_new_i32();
+    TCGv_i32 mem_idx = tcg_temp_new_i32();
+
+    tcg_gen_movi_i32(mem_idx, get_mem_index(s));
+
+    if (size != 3) {
+        switch (size) {
+        case 0:
+            gen_helper_ldlink_aa32_i8(tmp, cpu_env, addr, mem_idx);
+            break;
+        case 1:
+            gen_helper_ldlink_aa32_i16(tmp, cpu_env, addr, mem_idx);
+            break;
+        case 2:
+            gen_helper_ldlink_aa32_i32(tmp, cpu_env, addr, mem_idx);
+            break;
+        default:
+            abort();
+        }
+
+        store_reg(s, rt, tmp);
+    } else {
+        TCGv_i64 tmp64 = tcg_temp_new_i64();
+        TCGv_i32 tmph = tcg_temp_new_i32();
+
+        gen_helper_ldlink_aa32_i64(tmp64, cpu_env, addr, mem_idx);
+        tcg_gen_extr_i64_i32(tmp, tmph, tmp64);
+
+        store_reg(s, rt, tmp);
+        store_reg(s, rt2, tmph);
+
+        tcg_temp_free_i64(tmp64);
+    }
+
+    tcg_temp_free_i32(mem_idx);
+
+    /* From now on we are in LL/SC context. */
+    tcg_gen_movi_i32(cpu_ll_sc_context, 1);
+}
+#else
 static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
                                TCGv_i32 addr, int size)
 {
@@ -7429,10 +7485,14 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     store_reg(s, rt, tmp);
     tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
 }
+#endif
 
 static void gen_clrex(DisasContext *s)
 {
+#ifdef CONFIG_TCG_USE_LDST_EXCL
+#else
     tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+#endif
 }
 
 #ifdef CONFIG_USER_ONLY
@@ -7444,6 +7504,59 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
                      size | (rd << 4) | (rt << 8) | (rt2 << 12));
     gen_exception_internal_insn(s, 4, EXCP_STREX);
 }
+#elif defined CONFIG_TCG_USE_LDST_EXCL
+static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
+                                TCGv_i32 addr, int size)
+{
+    TCGv_i32 tmp, mem_idx;
+    TCGLabel *done_label, *fail_label;
+
+    fail_label = gen_new_label();
+    done_label = gen_new_label();
+    mem_idx = tcg_temp_new_i32();
+
+    /* Fail if we are not in LL/SC context. */
+    tcg_gen_brcondi_i32(TCG_COND_NE, cpu_ll_sc_context, 1, fail_label);
+
+    tcg_gen_movi_i32(mem_idx, get_mem_index(s));
+    tmp = load_reg(s, rt);
+
+    if (size != 3) {
+        switch (size) {
+        case 0:
+            gen_helper_stcond_aa32_i8(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
+            break;
+        case 1:
+            gen_helper_stcond_aa32_i16(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
+            break;
+        case 2:
+            gen_helper_stcond_aa32_i32(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
+            break;
+        default:
+            abort();
+        }
+    } else {
+        TCGv_i64 tmp64;
+        TCGv_i32 tmp2;
+
+        tmp64 = tcg_temp_new_i64();
+        tmp2 = load_reg(s, rt2);
+        tcg_gen_concat_i32_i64(tmp64, tmp, tmp2);
+        gen_helper_stcond_aa32_i64(cpu_R[rd], cpu_env, addr, tmp64, mem_idx);
+
+        tcg_temp_free_i32(tmp2);
+        tcg_temp_free_i64(tmp64);
+    }
+    tcg_gen_br(done_label);
+
+    gen_set_label(fail_label);
+    tcg_gen_movi_i32(cpu_ll_sc_context, 0);
+    tcg_gen_movi_i32(cpu_R[rd], 1);
+    gen_set_label(done_label);
+
+    tcg_temp_free_i32(tmp);
+    tcg_temp_free_i32(mem_idx);
+}
 #else
 static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
                                 TCGv_i32 addr, int size)
-- 
2.5.3

* Re: [Qemu-devel] [RFC v5 1/6] exec.c: Add new exclusive bitmap to ram_list
From: Richard Henderson @ 2015-09-26 17:15 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, jani.kokkonen, tech, alex.bennee, pbonzini

On 09/24/2015 01:32 AM, Alvise Rigo wrote:
> +    if (cpu == smp_cpus) {
> +        if (smp_cpus >= EXCL_BITMAP_CELL_SZ) {
> +            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)];
> +        } else {
> +            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] &
> +                                            ((1 << smp_cpus) - 1);
> +        }
> +    } else {
> +        return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] & (1 << EXCL_IDX(cpu));
> +    }

How can more than one cpu have the same address exclusively?

Isn't this scheme giving a whole page to a cpu, not a cacheline?
That's going to cause ll/sc conflicts where real hardware wouldn't.



r~

* Re: [Qemu-devel] [RFC v5 1/6] exec.c: Add new exclusive bitmap to ram_list
From: alvise rigo @ 2015-09-28  7:28 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Sat, Sep 26, 2015 at 7:15 PM, Richard Henderson <rth@twiddle.net> wrote:

> On 09/24/2015 01:32 AM, Alvise Rigo wrote:
> > +    if (cpu == smp_cpus) {
> > +        if (smp_cpus >= EXCL_BITMAP_CELL_SZ) {
> > +            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)];
> > +        } else {
> > +            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] &
> > +                                            ((1 << smp_cpus) - 1);
> > +        }
> > +    } else {
> > +        return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] & (1 <<
> EXCL_IDX(cpu));
> > +    }
>
> How can more than one cpu have the same address exclusively?
>

The bitmap is used to track which vCPUs have the EXCL flag set for that
particular page.
A bit set to 0 ensures that all the corresponding vCPUs are following the
slow-path for that page.


>
> Isn't this scheme giving a whole page to a cpu, not a cacheline?
>

The actual exclusive range is set in CPUArchState; this is the information
that we evaluate to see whether there was a conflict or not.

Regards,
alvise


> That's going to cause ll/sc conflicts where real hardware wouldn't.
>
>
>
> r~
>

* Re: [Qemu-devel] [RFC v5 2/6] softmmu: Add new TLB_EXCL flag
From: Richard Henderson @ 2015-09-30  3:34 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, jani.kokkonen, tech, alex.bennee, pbonzini

On 09/24/2015 06:32 PM, Alvise Rigo wrote:
> +    if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL))) {
> +        /* We are removing an exclusive entry, set the page to dirty. This
> +         * is not be necessary if the vCPU has performed both SC and LL. */
> +        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
> +                                          (te->addr_write & TARGET_PAGE_MASK);
> +        cpu_physical_memory_set_excl_dirty(hw_addr, cpu->cpu_index);
> +    }

Um... this seems dangerous.

(1) I don't see why EXCL support should differ depending on whether MMIO is
set or not.  Either we support exclusive accesses on memory-mapped io like we
do on ram (in which case this is wrong) or we don't (in which case this is
unnecessary).

(2) Doesn't this prevent a target from accessing a page during a ll/sc sequence 
that aliases within our trivial hash?  Such a case on arm might be

	mov	r3, #0x100000
	ldrex	r0, [r2]
	ldr	r1, [r2, r3]
	add	r0, r0, r1
	strex	r0, [r2]

AFAIK, Alpha is the only target we have which specifies that any normal memory 
access during a ll+sc sequence may fail the sc.

(3) I'm finding the "clean/dirty" words less helpful than they could be,
especially since "clean" implies "some cpu has an excl lock on the page",
which is the reverse of what seems natural, though understandable given the
implementation.  Perhaps we could rename these helpers?

> @@ -376,6 +392,28 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>       return qemu_ram_addr_from_host_nofail(p);
>   }
>
> +/* Atomic insn translation TLB support. */
> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
> +/* For every vCPU compare the exclusive address and reset it in case of a
> + * match. Since only one vCPU is running at once, no lock has to be held to
> + * guard this operation. */
> +static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
> +{
> +    CPUState *cpu;
> +    CPUArchState *acpu;
> +
> +    CPU_FOREACH(cpu) {
> +        acpu = (CPUArchState *)cpu->env_ptr;
> +
> +        if (acpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
> +            ranges_overlap(acpu->excl_protected_range.begin,
> +            acpu->excl_protected_range.end - acpu->excl_protected_range.begin,
> +            addr, size)) {

Watch the indentation here... it ought to line up with the previous argument on
the line above, just after the (.  This may require you to split the subtract
across the line too, but that's ok.


>   void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
>   void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 98b9cff..a67f295 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -27,6 +27,7 @@
>   #include <inttypes.h>
>   #include "qemu/osdep.h"
>   #include "qemu/queue.h"
> +#include "qemu/range.h"
>   #include "tcg-target.h"
>   #ifndef CONFIG_USER_ONLY
>   #include "exec/hwaddr.h"
> @@ -150,5 +151,16 @@ typedef struct CPUIOTLBEntry {
>   #define CPU_COMMON                                                      \
>       /* soft mmu support */                                              \
>       CPU_COMMON_TLB                                                      \
> +                                                                        \
> +    /* Used by the atomic insn translation backend. */                  \
> +    int ll_sc_context;                                                  \
> +    /* vCPU current exclusive addresses range.
> +     * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
> +     * in the middle of a LL/SC. */                                     \
> +    struct Range excl_protected_range;                                  \
> +    /* Used to carry the SC result but also to flag a normal (legacy)
> +     * store access made by a stcond (see softmmu_template.h). */       \
> +    int excl_succeeded;                                                 \


This seems to be required by softmmu_template.h?  In which case this must be in 
CPU_COMMON_TLB.

Please expand on this "legacy store access" comment here.  I don't understand.

> +
>
>   #endif
> diff --git a/softmmu_template.h b/softmmu_template.h
> index d42d89d..e4431e8 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -409,19 +409,53 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>           tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>       }
>
> -    /* Handle an IO access.  */
> +    /* Handle an IO access or exclusive access.  */
>       if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> -        CPUIOTLBEntry *iotlbentry;
> -        if ((addr & (DATA_SIZE - 1)) != 0) {
> -            goto do_unaligned_access;
> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> +
> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
> +            /* The slow-path has been forced since we are writing to
> +             * exclusive-protected memory. */
> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
> +
> +            /* The function lookup_and_reset_cpus_ll_addr could have reset the
> +             * exclusive address. Fail the SC in this case.
> +             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
> +             * not been called by a softmmu_llsc_template.h. */
> +            if(env->excl_succeeded) {
> +                if (env->excl_protected_range.begin != hw_addr) {
> +                    /* The vCPU is SC-ing to an unprotected address. */
> +                    env->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
> +                    env->excl_succeeded = 0;
> +
> +                    return;

Huh?  How does a normal store ever fail?  Surely you aren't using the normal
store path to support a true store_conditional?

> +                }
> +
> +                cpu_physical_memory_set_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
> +            }
> +
> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
> +        #if DATA_SIZE == 1
> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
> +        #else
> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
> +        #endif

You're assuming that TLB_EXCL can never have any other TLB_ bits set.  We 
definitely have to support TLB_EXCL+TLB_NOTDIRTY, and probably +TLB_MMIO too 
(iirc that's how watchpoints are implemented).
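
To make the point concrete (sketch only, not actual patch code):

    /* an equality test hides any other TLB_ bit set together with EXCL: */
    if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
        /* never reached for TLB_EXCL | TLB_NOTDIRTY */
    }

    /* a bit test keeps the other flags visible: */
    if (tlb_addr & TLB_EXCL) {
        /* handle the exclusive case, then still honour TLB_NOTDIRTY
         * and TLB_MMIO on the same entry */
    }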


r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 3/6] softmmu: Add helpers for a new slowpath
  2015-09-24  8:32 ` [Qemu-devel] [RFC v5 3/6] softmmu: Add helpers for a new slowpath Alvise Rigo
@ 2015-09-30  3:58   ` Richard Henderson
  2015-09-30  9:46     ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2015-09-30  3:58 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, jani.kokkonen, tech, alex.bennee, pbonzini

On 09/24/2015 06:32 PM, Alvise Rigo wrote:
> The new helpers rely on the legacy ones to perform the actual read/write.
>
> The LoadLink helper (helper_ldlink_name) prepares the way for the
> following SC operation. It sets the linked address and the size of the
> access.
> These helper also update the TLB entry of the page involved in the
> LL/SC for those vCPUs that have the bit set (dirty), so that the
> following accesses made by all the vCPUs will follow the slow path.
>
> The StoreConditional helper (helper_stcond_name) returns 1 if the
> store has to fail due to a concurrent access to the same page by
> another vCPU. A 'concurrent access' can be a store made by *any* vCPU
> (although, some implementations allow stores made by the CPU that issued
> the LoadLink).
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>   cputlb.c                |   3 ++
>   softmmu_llsc_template.h | 124 ++++++++++++++++++++++++++++++++++++++++++++++++
>   softmmu_template.h      |  12 +++++
>   tcg/tcg.h               |  30 ++++++++++++
>   4 files changed, 169 insertions(+)
>   create mode 100644 softmmu_llsc_template.h
>
> diff --git a/cputlb.c b/cputlb.c
> index 1e25a2a..d5aae7c 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -416,6 +416,8 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>
>   #define MMUSUFFIX _mmu
>
> +/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
> +#define GEN_EXCLUSIVE_HELPERS
>   #define SHIFT 0
>   #include "softmmu_template.h"
>
> @@ -428,6 +430,7 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>   #define SHIFT 3
>   #include "softmmu_template.h"
>   #undef MMUSUFFIX
> +#undef GEN_EXCLUSIVE_HELPERS
>
>   #define MMUSUFFIX _cmmu
>   #undef GETPC_ADJ
> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
> new file mode 100644
> index 0000000..9f22834
> --- /dev/null
> +++ b/softmmu_llsc_template.h
> @@ -0,0 +1,124 @@
> +/*
> + *  Software MMU support (esclusive load/store operations)
> + *
> + * Generate helpers used by TCG for qemu_ldlink/stcond ops.
> + *
> + * Included from softmmu_template.h only.
> + *
> + * Copyright (c) 2015 Virtual Open Systems
> + *
> + * Authors:
> + *  Alvise Rigo <a.rigo@virtualopensystems.com>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/* This template does not generate together the le and be version, but only one
> + * of the two depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has been set.
> + * The same nomenclature as softmmu_template.h is used for the exclusive
> + * helpers.  */
> +
> +#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
> +
> +#define helper_ldlink_name  glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name  glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
> +
> +#else /* LE helpers + 8bit helpers (generated only once for both LE end BE) */
> +
> +#if DATA_SIZE > 1
> +#define helper_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
> +#else /* DATA_SIZE <= 1 */
> +#define helper_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
> +#endif
> +
> +#endif
> +
> +WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
> +                                TCGMemOpIdx oi, uintptr_t retaddr)
> +{
> +    WORD_TYPE ret;
> +    int index;
> +    CPUState *cpu;
> +    hwaddr hw_addr;
> +    unsigned mmu_idx = get_mmuidx(oi);
> +
> +    /* Use the proper load helper from cpu_ldst.h */
> +    ret = helper_ld(env, addr, mmu_idx, retaddr);
> +
> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +
> +    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
> +     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
> +    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
> +
> +    cpu_physical_memory_clear_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
> +    /* If all the vCPUs have the EXCL bit set for this page there is no need
> +     * to request any flush. */
> +    if (cpu_physical_memory_excl_is_dirty(hw_addr, smp_cpus)) {
> +        CPU_FOREACH(cpu) {
> +            if (current_cpu != cpu) {
> +                if (cpu_physical_memory_excl_is_dirty(hw_addr,
> +                                                    cpu->cpu_index)) {
> +                    cpu_physical_memory_clear_excl_dirty(hw_addr,
> +                                                         cpu->cpu_index);
> +                    tlb_flush(cpu, 1);
> +                }

Why would you need to indicate that another cpu has started an exclusive 
operation on this page?  That seems definitely wrong.

I think that all you wanted was to flush the other cpu, so that it notices that 
this cpu has started an exclusive operation.

It would be great if most of this were pulled out to a subroutine so that these 
helper functions didn't accrue so much duplicate code.

It would be fantastic to implement a tlb_flush_phys_page so that we didn't have
to flush the entire tlb of the other cpus.
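
Roughly like this (just a sketch of what I mean; the function name is invented):

    /* Make the other vCPUs refill their TLB entries for the page, so they
     * notice that this vCPU has started an exclusive operation on it. */
    static void flush_other_vcpus(void)
    {
        CPUState *cpu;

        CPU_FOREACH(cpu) {
            if (cpu != current_cpu) {
                tlb_flush(cpu, 1);
            }
        }
    }

with tlb_flush() ideally replaced by the per-page flush mentioned above.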

> +    /* We set it preventively to true to distinguish the following legacy
> +     * access as one made by the store conditional wrapper. If the store
> +     * conditional does not succeed, the value will be set to 0.*/
> +    env->excl_succeeded = 1;

Ah -- "distinguish" is the word that was missing from the comments in the 
previous patches.

> diff --git a/softmmu_template.h b/softmmu_template.h
> index e4431e8..ad65d20 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -640,6 +640,18 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
>   #endif
>   #endif /* !defined(SOFTMMU_CODE_ACCESS) */
>
> +#ifdef GEN_EXCLUSIVE_HELPERS
> +
> +#if DATA_SIZE > 1 /* The 8-bit helpers are generate along with LE helpers */
> +#define BIGENDIAN_EXCLUSIVE_HELPERS
> +#include "softmmu_llsc_template.h"
> +#undef BIGENDIAN_EXCLUSIVE_HELPERS
> +#endif
> +
> +#include "softmmu_llsc_template.h"
> +
> +#endif /* !defined(GEN_EXCLUSIVE_HELPERS) */

I'm not especially keen on this.  Not that what we currently have in 
softmmu_template.h is terribly better.

I wonder what can be done to clean all of this up, short of actually 
transitioning to c++ templates...


r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 4/6] target-arm: Create new runtime helpers for excl accesses
  2015-09-24  8:32 ` [Qemu-devel] [RFC v5 4/6] target-arm: Create new runtime helpers for excl accesses Alvise Rigo
@ 2015-09-30  4:03   ` Richard Henderson
  2015-09-30 10:16     ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2015-09-30  4:03 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, jani.kokkonen, tech, alex.bennee, pbonzini

On 09/24/2015 06:32 PM, Alvise Rigo wrote:
> Introduce a set of new runtime helpers do handle exclusive instructions.
> This helpers are used as hooks to call the respective LL/SC helpers in
> softmmu_llsc_template.h from TCG code.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>   target-arm/helper.h    | 10 ++++++
>   target-arm/op_helper.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 104 insertions(+)
>
> diff --git a/target-arm/helper.h b/target-arm/helper.h
> index 827b33d..8e7a7c2 100644
> --- a/target-arm/helper.h
> +++ b/target-arm/helper.h
> @@ -530,6 +530,16 @@ DEF_HELPER_2(dc_zva, void, env, i64)
>   DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>   DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>
> +DEF_HELPER_3(ldlink_aa32_i8, i32, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i16, i32, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i32, i32, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i64, i64, env, i32, i32)
> +
> +DEF_HELPER_4(stcond_aa32_i8, i32, env, i32, i32, i32)
> +DEF_HELPER_4(stcond_aa32_i16, i32, env, i32, i32, i32)
> +DEF_HELPER_4(stcond_aa32_i32, i32, env, i32, i32, i32)
> +DEF_HELPER_4(stcond_aa32_i64, i32, env, i32, i64, i32)
> +
>   #ifdef TARGET_AARCH64
>   #include "helper-a64.h"
>   #endif
> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
> index 663c05d..d832ba8 100644
> --- a/target-arm/op_helper.c
> +++ b/target-arm/op_helper.c
> @@ -969,3 +969,97 @@ uint32_t HELPER(ror_cc)(CPUARMState *env, uint32_t x, uint32_t i)
>           return ((uint32_t)x >> shift) | (x << (32 - shift));
>       }
>   }
> +
> +/* LoadLink helpers, only unsigned. */
> +static void * const qemu_ldex_helpers[16] = {
> +    [MO_UB]   = helper_ret_ldlinkub_mmu,
> +
> +    [MO_LEUW] = helper_le_ldlinkuw_mmu,
> +    [MO_LEUL] = helper_le_ldlinkul_mmu,
> +    [MO_LEQ]  = helper_le_ldlinkq_mmu,
> +
> +    [MO_BEUW] = helper_be_ldlinkuw_mmu,
> +    [MO_BEUL] = helper_be_ldlinkul_mmu,
> +    [MO_BEQ]  = helper_be_ldlinkq_mmu,
> +};
> +
> +#define LDEX_HELPER(SUFF, OPC)                                          \
> +uint32_t HELPER(ldlink_aa32_i##SUFF)(CPUARMState *env, uint32_t addr,   \
> +                                                       uint32_t index)  \
> +{                                                                       \
> +    CPUArchState *state = env;                                          \
> +    TCGMemOpIdx op;                                                     \
> +                                                                        \
> +    op = make_memop_idx(OPC, index);                                    \
> +                                                                        \
> +    tcg_target_ulong (*func)(CPUArchState *env, target_ulong addr,      \
> +                             TCGMemOpIdx oi, uintptr_t retaddr);        \
> +    func = qemu_ldex_helpers[OPC];                                      \
> +                                                                        \
> +    return (uint32_t)func(state, addr, op, GETRA());                    \
> +}
> +
> +LDEX_HELPER(8, MO_UB)
> +LDEX_HELPER(16, MO_TEUW)
> +LDEX_HELPER(32, MO_TEUL)

This is not what Aurelien meant.  I cannot see any reason at present why 
generic wrappers, available for all targets, shouldn't be sufficient.

See tcg/tcg-runtime.h and tcg-runtime.c.

You shouldn't need to look up a function in a table like this.  The decision 
about whether to call a BE or LE helper should have been made in the translator.
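
I.e. something like this in the translator (sketch only, helper names invented):

    #ifdef TARGET_WORDS_BIGENDIAN
        gen_helper_be_ldlink_i32(dst, cpu_env, addr, mem_idx);
    #else
        gen_helper_le_ldlink_i32(dst, cpu_env, addr, mem_idx);
    #endif

so that each runtime helper handles exactly one memop and the table goes away.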


r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 5/6] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-09-24  8:32 ` [Qemu-devel] [RFC v5 5/6] configure: Use slow-path for atomic only when the softmmu is enabled Alvise Rigo
@ 2015-09-30  4:05   ` Richard Henderson
  2015-09-30  9:51     ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2015-09-30  4:05 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, jani.kokkonen, tech, alex.bennee, pbonzini

On 09/24/2015 06:32 PM, Alvise Rigo wrote:
> Use the new slow path for atomic instruction translation when the
> softmmu is enabled.

Um... why?  TCG_USE_LDST_EXCL would appear to be 100% redundant with SOFTMMU.


r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation
  2015-09-24  8:32 [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation Alvise Rigo
                   ` (5 preceding siblings ...)
  2015-09-24  8:32 ` [Qemu-devel] [RFC v5 6/6] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
@ 2015-09-30  4:44 ` Paolo Bonzini
  2015-09-30  8:14   ` alvise rigo
  2015-10-01 19:32   ` Emilio G. Cota
  6 siblings, 2 replies; 27+ messages in thread
From: Paolo Bonzini @ 2015-09-30  4:44 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, Emilio G. Cota



On 24/09/2015 10:32, Alvise Rigo wrote:
> The implementation heavily uses the software TLB together with a new
> bitmap that has been added to the ram_list structure which flags, on a
> per-CPU basis, all the memory pages that are in the middle of a LoadLink
> (LL), StoreConditional (SC) operation.  Since all these pages can be
> accessed directly through the fast-path and alter a vCPU's linked value,
> the new bitmap has been coupled with a new TLB flag for the TLB virtual
> address which forces the slow-path execution for all the accesses to a
> page containing a linked address.

Alvise, Emilio,

I have a doubt about your patches for ll/sc emulation, that I hope you
can clarify.

From 10000ft, both approaches rely on checking a flag during stores.
This is split between the TLB and the CPUState for Alvise's patches (in
order to exploit the existing fast-path checks), and entirely in the
radix tree for Emilio's.  However, the idea is the same.

Now, the patches are okay for serial emulation, but I am not sure if it's
possible to do lock-free ll/sc emulation, because there is a race.

If we check the flag before the store, the race is as follows:

   CPU0        CPU1
   -------------------------------------------------------
   check flag
               load locked:
                  set flag
                  load value (normal load on CPU)
   store
               store conditional (normal store on CPU)

where the sc doesn't fail.  For completeness, if we check it afterwards
(which would be possible with Emilio's approach, though not for the
TLB-based one):

   CPU0        CPU1
   ------------------------------------------------------
               load locked
                  set bit
                  load value (normal load on CPU)
   store
               store conditional (normal store on CPU)
   check flag

and again the sc doesn't fail.
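
In C-like terms the problem is that the fast-path store is two independent
memory operations (the function name below is made up):

    /* CPU0, fast-path store */
    if (!page_is_excl_protected(addr)) {    /* check flag */
        /* <-- CPU1 can run its whole LL here: set flag, load value */
        *haddr = val;                       /* plain store */
    }
    /* CPU1's later SC still sees the flag it set itself and succeeds,
     * although the value it loaded is now stale. */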

Most solutions I can think of are impractical:

- hardware ll/sc in CPU1. x86 doesn't have it.

- hardware transactional memory in CPU0, checking the bit after the
store and aborting the transaction (I think).  HTM just doesn't exist.

- some kind of store-in-progress (SIP) flag that ll can test and force
failure of the corresponding sc.  For example, each CPU could store a
(last_store_address, last_store_value) tuple. If the value that LL loads
disagrees with any CPU, the LL would direct the SC to fail.  A store
would look like:

         store value to last_store_value
         smp_wmb()
         store address to last_store_address
         smp_mb()
         load TLB or radix tree

The memory barrier orders the store to the SIP flag and the load from
the TLB, and is probably too expensive. :(

- some array of atomic global generation counts, incremented
unconditionally on every store and checked between ll and sc.  Cacheline
bounce fiesta, hence extremely slow. :(

Tell me I'm wrong. :)

If I'm right, we can still keep the opcodes and implement them with a
simple cmpxchg.  It would provide a nice generic tool to implement
atomic operations, and it will work correctly if the target has ll/sc.
However, ll/sc-on-cmpxchg (e.g., ARM-on-x86) would be susceptible to the
ABA problem.
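
Roughly (sketch only; excl_addr/excl_val are made-up per-CPU fields and the
GCC builtin just stands in for whatever cmpxchg primitive we would use):

    /* LL: remember where and what was loaded (host pointer for brevity) */
    env->excl_addr = haddr;
    env->excl_val  = *(uint32_t *)haddr;

    /* SC: succeed iff the value still matches */
    sc_ok = __sync_bool_compare_and_swap((uint32_t *)env->excl_addr,
                                         env->excl_val, new_val);

This is fine for cmpxchg-style callers, but an intervening A->B->A change of
the location is invisible to it, while real ll/sc would catch it.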

Paolo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation
  2015-09-30  4:44 ` [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation Paolo Bonzini
@ 2015-09-30  8:14   ` alvise rigo
  2015-09-30 13:20     ` Paolo Bonzini
  2015-10-01 19:32   ` Emilio G. Cota
  1 sibling, 1 reply; 27+ messages in thread
From: alvise rigo @ 2015-09-30  8:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Claudio Fontana, QEMU Developers, Emilio G. Cota,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

Hi Paolo,

On Wed, Sep 30, 2015 at 6:44 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 24/09/2015 10:32, Alvise Rigo wrote:
>> The implementation heavily uses the software TLB together with a new
>> bitmap that has been added to the ram_list structure which flags, on a
>> per-CPU basis, all the memory pages that are in the middle of a LoadLink
>> (LL), StoreConditional (SC) operation.  Since all these pages can be
>> accessed directly through the fast-path and alter a vCPU's linked value,
>> the new bitmap has been coupled with a new TLB flag for the TLB virtual
>> address which forces the slow-path execution for all the accesses to a
>> page containing a linked address.
>
> Alvise, Emilio,
>
> I have a doubt about your patches for ll/sc emulation, that I hope you
> can clarify.
>
> From 10000ft, both approaches rely on checking a flag during stores.
> This is split between the TLB and the CPUState for Alvise's patches (in
> order to exploit the existing fast-path checks), and entirely in the
> radix tree for Emilio's.  However, the idea is the same.
>
> Now, the patch are okay for serial emulation, but I am not sure if it's
> possible to do lock-free ll/sc emulation, because there is a race.

Do you mean not to use any locking mechanism at all on the emulation side?

>
> If we check the flag before the store, the race is as follows:
>
>    CPU0        CPU1
>    -------------------------------------------------------
>    check flag
>                load locked:
>                   set flag
>                   load value (normal load on CPU)
>    store
>                store conditional (normal store on CPU)
>
> where the sc doesn't fail.  For completeness, if we check it afterwards

Shouldn't this be prevented by the tcg_excl_access_lock in the
patch series based on mttcg (branch slowpath-for-atomic-v5-mttcg)?
Consider also that CPU0 will always finish its store operation before
the transition "flag not set -> flag set" completes.

> (which would be possible with Emilio's approach, though not for the
> TLB-based one):
>
>    CPU0        CPU1
>    ------------------------------------------------------
>                load locked
>                   set bit
>                   load value (normal load on CPU)
>    store
>                store conditional (normal store on CPU)
>    check flag
>
> and again the sc doesn't fail.
>
> Most solutions I can think of are impractical:
>
> - hardware ll/sc in CPU1. x86 doesn't have it.
>
> - hardware transactional memory in CPU0, checking the bit after the
> store and abort the transaction (I think).  HTM just doesn't exist.
>
> - some kind of store-in-progress (SIP) flag that ll can test and force
> failure of the corresponding sc.  For example, each CPU could store a
> (last_store_address, last_store_value) tuple. If the value that LL loads
> disagrees with any CPU, the LL would direct the SC to fail.  A store
> would look like:
>
>          store value to last_store_value
>          smp_wmb()
>          store address to last_store_address
>          smp_mb()
>          load TLB or radix tree
>
> The memory barrier orders the store to the SIP flag and the load from
> the TLB, and is probably too expensive. :(

Umm, I agree with you that this could be too expensive.

>
> - some array of atomic global generation counts, incremented
> unconditionally on every store and checked between ll and sc.  Cacheline
> bounce fiesta, hence extremely slow. :(
>
> Tell me I'm wrong. :)
>
> If I'm right, we can still keep the opcodes and implement them with a
> simple cmpxchg.  It would provide a nice generic tool to implement
> atomic operations, and it will work correctly if the target has ll/sc.
> However, ll/sc-on-cmpxchg (e.g., ARM-on-x86) would be susceptible to the
> ABA problem.

This was one of my fears that led me to the ll/sc approach. I think it
could be even more likely in emulation, since we can't assume that the
time between an LL and its SC is small enough to avoid "ABA" accesses.
We could solve this issue once again with an access counter, but this
would require incrementing it in the fast path, which would kill the
performance (see the sketch below).
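
Just to make the cost explicit, the counter idea would be something like the
following sketch (all names are made up), with the atomic increment sitting
on the fast path of every store:

    /* every store, fast path included */
    __sync_fetch_and_add(&store_gen[PAGE_IDX(addr)], 1);
    *haddr = val;

    /* LL records the generation of the protected page ... */
    env->excl_gen = store_gen[PAGE_IDX(addr)];

    /* ... and SC fails if it moved, even across an A->B->A sequence */
    sc_fails = (env->excl_gen != store_gen[PAGE_IDX(env->excl_addr)]);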

Regards,
alvise

>
> Paolo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 2/6] softmmu: Add new TLB_EXCL flag
  2015-09-30  3:34   ` Richard Henderson
@ 2015-09-30  9:24     ` alvise rigo
  2015-09-30 11:09       ` Peter Maydell
  0 siblings, 1 reply; 27+ messages in thread
From: alvise rigo @ 2015-09-30  9:24 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Wed, Sep 30, 2015 at 5:34 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 09/24/2015 06:32 PM, Alvise Rigo wrote:
>>
>> +    if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write &
>> TLB_EXCL))) {
>> +        /* We are removing an exclusive entry, set the page to dirty.
>> This
>> +         * is not be necessary if the vCPU has performed both SC and LL.
>> */
>> +        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr &
>> TARGET_PAGE_MASK) +
>> +                                          (te->addr_write &
>> TARGET_PAGE_MASK);
>> +        cpu_physical_memory_set_excl_dirty(hw_addr, cpu->cpu_index);
>> +    }
>
>
> Um... this seems dangerous.
>
> (1) I don't see why EXCL support should differ whether MMIO is set or not.
> Either we support exclusive accesses on memory-mapped io like we do on ram
> (in which case this is wrong) or we don't (in which case this is
> unnecessary).

I was not sure whether or not we had to support MMIO memory as well.
In theory there shouldn't be any issue in including memory-mapped io
too; I will consider this for the next version.

>
> (2) Doesn't this prevent a target from accessing a page during a ll/sc
> sequence that aliases within our trivial hash?  Such a case on arm might be
>
>         mov     r3, #0x100000
>         ldrex   r0, [r2]
>         ldr     r1, [r2, r3]
>         add     r0, r0, r1
>         strex   r0, [r2]
>

I'm not sure I got it. When the CPU issues the ldrex, the page will be
set as "clean" (meaning that all the CPUs will then follow the
slow-path for that page) and the exclusive range - [r2, r2+4] in this
case - is stored in the CPU state.
The forced slow-path is used to check whether a normal store is hitting
the exclusive range of any CPU; normal loads are not affected.
I don't see any problem in the code above, what am I missing?

> AFAIK, Alpha is the only target we have which specifies that any normal
> memory access during a ll+sc sequence may fail the sc.

I will dig into it because I remember that the Alpha architecture
behaves like ARM in the handling of LDxL/STxC instructions.

>
> (3) I'm finding the "clean/dirty" words less helpful than they could be,
> especially since "clean" implies "some cpu has an excl lock on the page",
> which is reverse of what seems natural but understandable given the
> implementation.  Perhaps we could rename these helpers?
>
>> @@ -376,6 +392,28 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1,
>> target_ulong addr)
>>       return qemu_ram_addr_from_host_nofail(p);
>>   }
>>
>> +/* Atomic insn translation TLB support. */
>> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>> +/* For every vCPU compare the exclusive address and reset it in case of a
>> + * match. Since only one vCPU is running at once, no lock has to be held
>> to
>> + * guard this operation. */
>> +static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr
>> size)
>> +{
>> +    CPUState *cpu;
>> +    CPUArchState *acpu;
>> +
>> +    CPU_FOREACH(cpu) {
>> +        acpu = (CPUArchState *)cpu->env_ptr;
>> +
>> +        if (acpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
>> +            ranges_overlap(acpu->excl_protected_range.begin,
>> +            acpu->excl_protected_range.end -
>> acpu->excl_protected_range.begin,
>> +            addr, size)) {
>
>
> Watch the indentation here... it ought to line up with the previous argument
> on the line above, just after the (.  This may require you split the
> subtract across the line too but that's ok.

OK, I will fix it.

>
>
>
>>   void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
>>   void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
>> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
>> index 98b9cff..a67f295 100644
>> --- a/include/exec/cpu-defs.h
>> +++ b/include/exec/cpu-defs.h
>> @@ -27,6 +27,7 @@
>>   #include <inttypes.h>
>>   #include "qemu/osdep.h"
>>   #include "qemu/queue.h"
>> +#include "qemu/range.h"
>>   #include "tcg-target.h"
>>   #ifndef CONFIG_USER_ONLY
>>   #include "exec/hwaddr.h"
>> @@ -150,5 +151,16 @@ typedef struct CPUIOTLBEntry {
>>   #define CPU_COMMON
>> \
>>       /* soft mmu support */
>> \
>>       CPU_COMMON_TLB
>> \
>> +                                                                        \
>> +    /* Used by the atomic insn translation backend. */                  \
>> +    int ll_sc_context;                                                  \
>> +    /* vCPU current exclusive addresses range.
>> +     * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
>> +     * in the middle of a LL/SC. */                                     \
>> +    struct Range excl_protected_range;                                  \
>> +    /* Used to carry the SC result but also to flag a normal (legacy)
>> +     * store access made by a stcond (see softmmu_template.h). */       \
>> +    int excl_succeeded;                                                 \
>
>
>
> This seems to be required by softmmu_template.h?  In which case this must be
> in CPU_COMMON_TLB.
>
> Please expand on this "legacy store access" comment here.  I don't
> understand.

ACK.

>
>
>> +
>>
>>   #endif
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index d42d89d..e4431e8 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -409,19 +409,53 @@ void helper_le_st_name(CPUArchState *env,
>> target_ulong addr, DATA_TYPE val,
>>           tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>>       }
>>
>> -    /* Handle an IO access.  */
>> +    /* Handle an IO access or exclusive access.  */
>>       if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>> -        CPUIOTLBEntry *iotlbentry;
>> -        if ((addr & (DATA_SIZE - 1)) != 0) {
>> -            goto do_unaligned_access;
>> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>> +
>> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
>> +            /* The slow-path has been forced since we are writing to
>> +             * exclusive-protected memory. */
>> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) +
>> addr;
>> +
>> +            /* The function lookup_and_reset_cpus_ll_addr could have
>> reset the
>> +             * exclusive address. Fail the SC in this case.
>> +             * N.B.: Here excl_succeeded == 0 means that
>> helper_le_st_name has
>> +             * not been called by a softmmu_llsc_template.h. */
>> +            if(env->excl_succeeded) {
>> +                if (env->excl_protected_range.begin != hw_addr) {
>> +                    /* The vCPU is SC-ing to an unprotected address. */
>> +                    env->excl_protected_range.begin =
>> EXCLUSIVE_RESET_ADDR;
>> +                    env->excl_succeeded = 0;
>> +
>> +                    return;
>
>
> Huh?  How does a normal store ever fail.  Surely you aren't using the normal
> store path to support true store_conditional?

As written in the comment, we are not doing a normal store, but the
store of an SC operation.

>
>> +                }
>> +
>> +                cpu_physical_memory_set_excl_dirty(hw_addr,
>> ENV_GET_CPU(env)->cpu_index);
>> +            }
>> +
>> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> +        #if DATA_SIZE == 1
>> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>> +        #else
>> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>> +        #endif
>
>
> You're assuming that TLB_EXCL can never have any other TLB_ bits set.  We

Yes, I was assuming this...

> definitely have to support TLB_EXCL+TLB_NOTDIRTY, and probably +TLB_MMIO too
> (iirc that's how watchpoints are implemeneted).

...but indeed I was wrong, we need to support these combinations as well.
I will address these points in the next version.

Thank you,
alvise

>
>
> r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 3/6] softmmu: Add helpers for a new slowpath
  2015-09-30  3:58   ` Richard Henderson
@ 2015-09-30  9:46     ` alvise rigo
  2015-09-30 20:42       ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: alvise rigo @ 2015-09-30  9:46 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Wed, Sep 30, 2015 at 5:58 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 09/24/2015 06:32 PM, Alvise Rigo wrote:
>>
>> The new helpers rely on the legacy ones to perform the actual read/write.
>>
>> The LoadLink helper (helper_ldlink_name) prepares the way for the
>> following SC operation. It sets the linked address and the size of the
>> access.
>> These helper also update the TLB entry of the page involved in the
>> LL/SC for those vCPUs that have the bit set (dirty), so that the
>> following accesses made by all the vCPUs will follow the slow path.
>>
>> The StoreConditional helper (helper_stcond_name) returns 1 if the
>> store has to fail due to a concurrent access to the same page by
>> another vCPU. A 'concurrent access' can be a store made by *any* vCPU
>> (although, some implementations allow stores made by the CPU that issued
>> the LoadLink).
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>   cputlb.c                |   3 ++
>>   softmmu_llsc_template.h | 124
>> ++++++++++++++++++++++++++++++++++++++++++++++++
>>   softmmu_template.h      |  12 +++++
>>   tcg/tcg.h               |  30 ++++++++++++
>>   4 files changed, 169 insertions(+)
>>   create mode 100644 softmmu_llsc_template.h
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index 1e25a2a..d5aae7c 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -416,6 +416,8 @@ static inline void
>> lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>>
>>   #define MMUSUFFIX _mmu
>>
>> +/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
>> +#define GEN_EXCLUSIVE_HELPERS
>>   #define SHIFT 0
>>   #include "softmmu_template.h"
>>
>> @@ -428,6 +430,7 @@ static inline void
>> lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>>   #define SHIFT 3
>>   #include "softmmu_template.h"
>>   #undef MMUSUFFIX
>> +#undef GEN_EXCLUSIVE_HELPERS
>>
>>   #define MMUSUFFIX _cmmu
>>   #undef GETPC_ADJ
>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>> new file mode 100644
>> index 0000000..9f22834
>> --- /dev/null
>> +++ b/softmmu_llsc_template.h
>> @@ -0,0 +1,124 @@
>> +/*
>> + *  Software MMU support (esclusive load/store operations)
>> + *
>> + * Generate helpers used by TCG for qemu_ldlink/stcond ops.
>> + *
>> + * Included from softmmu_template.h only.
>> + *
>> + * Copyright (c) 2015 Virtual Open Systems
>> + *
>> + * Authors:
>> + *  Alvise Rigo <a.rigo@virtualopensystems.com>
>> + *
>> + * This library is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU Lesser General Public
>> + * License as published by the Free Software Foundation; either
>> + * version 2 of the License, or (at your option) any later version.
>> + *
>> + * This library is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * Lesser General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU Lesser General Public
>> + * License along with this library; if not, see
>> <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +/* This template does not generate together the le and be version, but
>> only one
>> + * of the two depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has been
>> set.
>> + * The same nomenclature as softmmu_template.h is used for the exclusive
>> + * helpers.  */
>> +
>> +#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
>> +
>> +#define helper_ldlink_name  glue(glue(helper_be_ldlink, USUFFIX),
>> MMUSUFFIX)
>> +#define helper_stcond_name  glue(glue(helper_be_stcond, SUFFIX),
>> MMUSUFFIX)
>> +#define helper_ld glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
>> +#define helper_st glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
>> +
>> +#else /* LE helpers + 8bit helpers (generated only once for both LE end
>> BE) */
>> +
>> +#if DATA_SIZE > 1
>> +#define helper_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX),
>> MMUSUFFIX)
>> +#define helper_stcond_name  glue(glue(helper_le_stcond, SUFFIX),
>> MMUSUFFIX)
>> +#define helper_ld glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
>> +#define helper_st glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
>> +#else /* DATA_SIZE <= 1 */
>> +#define helper_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX),
>> MMUSUFFIX)
>> +#define helper_stcond_name  glue(glue(helper_ret_stcond, SUFFIX),
>> MMUSUFFIX)
>> +#define helper_ld glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
>> +#define helper_st glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
>> +#endif
>> +
>> +#endif
>> +
>> +WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>> +                                TCGMemOpIdx oi, uintptr_t retaddr)
>> +{
>> +    WORD_TYPE ret;
>> +    int index;
>> +    CPUState *cpu;
>> +    hwaddr hw_addr;
>> +    unsigned mmu_idx = get_mmuidx(oi);
>> +
>> +    /* Use the proper load helper from cpu_ldst.h */
>> +    ret = helper_ld(env, addr, mmu_idx, retaddr);
>> +
>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>> +
>> +    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>> +     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>> +    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
>> addr;
>> +
>> +    cpu_physical_memory_clear_excl_dirty(hw_addr,
>> ENV_GET_CPU(env)->cpu_index);
>> +    /* If all the vCPUs have the EXCL bit set for this page there is no
>> need
>> +     * to request any flush. */
>> +    if (cpu_physical_memory_excl_is_dirty(hw_addr, smp_cpus)) {
>> +        CPU_FOREACH(cpu) {
>> +            if (current_cpu != cpu) {
>> +                if (cpu_physical_memory_excl_is_dirty(hw_addr,
>> +                                                    cpu->cpu_index)) {
>> +                    cpu_physical_memory_clear_excl_dirty(hw_addr,
>> +                                                         cpu->cpu_index);
>> +                    tlb_flush(cpu, 1);
>> +                }
>
>
> Why would you need to indicate that another cpu has started an exclusive
> operation on this page?  That seems definitely wrong.

cpu_physical_memory_clear_excl_dirty() clears the dirty bit for the page,
which is what makes its TLB entries be generated with the EXCL flag.

>
> I think that all you wanted was to flush the other cpu, so that it notices
> that this cpu has started an exclusive operation.

Indeed, after calling cpu_physical_memory_clear_excl_dirty and
flushing the TLB, the CPU will be forced to create its TLB entries
from scratch, with the TLB_EXCL flag where necessary.

>
> It would be great if most of this were pulled out to a subroutine so that
> these helper functions didn't accrue so much duplicate code.
>
> It would be fantastic to implement a tlb_flush_phys_page to that we didn't
> have to flush the entire tlb of the other cpus.

Yes, this would be great, but it would also require a structure
mapping PHYS_ADDR -> TLB_ENTRIES, which is not provided by the softmmu
at the moment.

>
>> +    /* We set it preventively to true to distinguish the following legacy
>> +     * access as one made by the store conditional wrapper. If the store
>> +     * conditional does not succeed, the value will be set to 0.*/
>> +    env->excl_succeeded = 1;
>
>
> Ah -- "distinguish" is the word that was missing from the comments in the
> previous patches.
>
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index e4431e8..ad65d20 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -640,6 +640,18 @@ void probe_write(CPUArchState *env, target_ulong
>> addr, int mmu_idx,
>>   #endif
>>   #endif /* !defined(SOFTMMU_CODE_ACCESS) */
>>
>> +#ifdef GEN_EXCLUSIVE_HELPERS
>> +
>> +#if DATA_SIZE > 1 /* The 8-bit helpers are generate along with LE helpers
>> */
>> +#define BIGENDIAN_EXCLUSIVE_HELPERS
>> +#include "softmmu_llsc_template.h"
>> +#undef BIGENDIAN_EXCLUSIVE_HELPERS
>> +#endif
>> +
>> +#include "softmmu_llsc_template.h"
>> +
>> +#endif /* !defined(GEN_EXCLUSIVE_HELPERS) */
>
>
> I'm not especially keen on this.  Not that what we currently have in
> softmmu_template.h is terribly better.

What I had already proposed in the past was to copy the actual store
code from softmmu_template to softmmu_llsc_template.
But since we now need to support all the TLB_ flags, this would really
produce a lot of code duplication.

>
> I wonder what can be done to clean all of this up, short of actually
> transitioning to c++ templates...
>

Is there already an example in QEMU I can look at?
Wouldn't this require a new compiler dependency?

Thanks,
alvise

>
> r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 5/6] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-09-30  4:05   ` Richard Henderson
@ 2015-09-30  9:51     ` alvise rigo
  0 siblings, 0 replies; 27+ messages in thread
From: alvise rigo @ 2015-09-30  9:51 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Wed, Sep 30, 2015 at 6:05 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 09/24/2015 06:32 PM, Alvise Rigo wrote:
>>
>> Use the new slow path for atomic instruction translation when the
>> softmmu is enabled.
>
>
> Um... why?  TCG_USE_LDST_EXCL would appear to be 100% redundant with
> SOFTMMU.

Oops, while modifying the previous version of the patch I didn't notice
that I had created a redundant variable.
I will remove it.

Regards,
alvise

>
>
> r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 4/6] target-arm: Create new runtime helpers for excl accesses
  2015-09-30  4:03   ` Richard Henderson
@ 2015-09-30 10:16     ` alvise rigo
  0 siblings, 0 replies; 27+ messages in thread
From: alvise rigo @ 2015-09-30 10:16 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Wed, Sep 30, 2015 at 6:03 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 09/24/2015 06:32 PM, Alvise Rigo wrote:
>>
>> Introduce a set of new runtime helpers do handle exclusive instructions.
>> This helpers are used as hooks to call the respective LL/SC helpers in
>> softmmu_llsc_template.h from TCG code.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>   target-arm/helper.h    | 10 ++++++
>>   target-arm/op_helper.c | 94
>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 104 insertions(+)
>>
>> diff --git a/target-arm/helper.h b/target-arm/helper.h
>> index 827b33d..8e7a7c2 100644
>> --- a/target-arm/helper.h
>> +++ b/target-arm/helper.h
>> @@ -530,6 +530,16 @@ DEF_HELPER_2(dc_zva, void, env, i64)
>>   DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>>   DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>>
>> +DEF_HELPER_3(ldlink_aa32_i8, i32, env, i32, i32)
>> +DEF_HELPER_3(ldlink_aa32_i16, i32, env, i32, i32)
>> +DEF_HELPER_3(ldlink_aa32_i32, i32, env, i32, i32)
>> +DEF_HELPER_3(ldlink_aa32_i64, i64, env, i32, i32)
>> +
>> +DEF_HELPER_4(stcond_aa32_i8, i32, env, i32, i32, i32)
>> +DEF_HELPER_4(stcond_aa32_i16, i32, env, i32, i32, i32)
>> +DEF_HELPER_4(stcond_aa32_i32, i32, env, i32, i32, i32)
>> +DEF_HELPER_4(stcond_aa32_i64, i32, env, i32, i64, i32)
>> +
>>   #ifdef TARGET_AARCH64
>>   #include "helper-a64.h"
>>   #endif
>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>> index 663c05d..d832ba8 100644
>> --- a/target-arm/op_helper.c
>> +++ b/target-arm/op_helper.c
>> @@ -969,3 +969,97 @@ uint32_t HELPER(ror_cc)(CPUARMState *env, uint32_t x,
>> uint32_t i)
>>           return ((uint32_t)x >> shift) | (x << (32 - shift));
>>       }
>>   }
>> +
>> +/* LoadLink helpers, only unsigned. */
>> +static void * const qemu_ldex_helpers[16] = {
>> +    [MO_UB]   = helper_ret_ldlinkub_mmu,
>> +
>> +    [MO_LEUW] = helper_le_ldlinkuw_mmu,
>> +    [MO_LEUL] = helper_le_ldlinkul_mmu,
>> +    [MO_LEQ]  = helper_le_ldlinkq_mmu,
>> +
>> +    [MO_BEUW] = helper_be_ldlinkuw_mmu,
>> +    [MO_BEUL] = helper_be_ldlinkul_mmu,
>> +    [MO_BEQ]  = helper_be_ldlinkq_mmu,
>> +};
>> +
>> +#define LDEX_HELPER(SUFF, OPC)                                          \
>> +uint32_t HELPER(ldlink_aa32_i##SUFF)(CPUARMState *env, uint32_t addr,   \
>> +                                                       uint32_t index)  \
>> +{                                                                       \
>> +    CPUArchState *state = env;                                          \
>> +    TCGMemOpIdx op;                                                     \
>> +                                                                        \
>> +    op = make_memop_idx(OPC, index);                                    \
>> +                                                                        \
>> +    tcg_target_ulong (*func)(CPUArchState *env, target_ulong addr,      \
>> +                             TCGMemOpIdx oi, uintptr_t retaddr);        \
>> +    func = qemu_ldex_helpers[OPC];                                      \
>> +                                                                        \
>> +    return (uint32_t)func(state, addr, op, GETRA());                    \
>> +}
>> +
>> +LDEX_HELPER(8, MO_UB)
>> +LDEX_HELPER(16, MO_TEUW)
>> +LDEX_HELPER(32, MO_TEUL)
>
>
> This is not what Aurelien meant.  I cannot see any reason at present why
> generic wrappers, available for all targets, shouldn't be sufficient.

I thought we could create ad-hoc helpers for each architecture - for
instance cmpxchg-like helpers for x86.
But now that I think about it, I can imagine just two types of helpers
(for the two atomic approaches - ll/sc and cmpxchg), which of course
can be generic.

>
> See tcg/tcg-runtime.h and tcg-runtime.c.

I will move them there.

>
> You shouldn't need to look up a function in a table like this.  The decision
> about whether to call a BE or LE helper should have been made in the
> translator.

In the next version I will:

- extend the macro to generate both versions of the helper and not just one
- make the decision in translate.c, always using MO_TE as 'toggle'

Thank you,
alvise

>
>
> r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 2/6] softmmu: Add new TLB_EXCL flag
  2015-09-30  9:24     ` alvise rigo
@ 2015-09-30 11:09       ` Peter Maydell
  2015-09-30 12:44         ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2015-09-30 11:09 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson

On 30 September 2015 at 10:24, alvise rigo
<a.rigo@virtualopensystems.com> wrote:
> On Wed, Sep 30, 2015 at 5:34 AM, Richard Henderson <rth@twiddle.net> wrote:
>> (1) I don't see why EXCL support should differ whether MMIO is set or not.
>> Either we support exclusive accesses on memory-mapped io like we do on ram
>> (in which case this is wrong) or we don't (in which case this is
>> unnecessary).
>
> I was not sure whether or not we had to support also MMIO memory.
> In theory there shouldn't be any issues for including also
> memory-mapped io, I will consider this for the next version.

Worth considering the interaction between exclusives and other
cases for which we force the slowpath, notably watchpoints.

>> AFAIK, Alpha is the only target we have which specifies that any normal
>> memory access during a ll+sc sequence may fail the sc.
>
> I will dig into it because I remember that the Alpha architecture
> behaves like ARM in the handling of LDxL/STxC instructions.

ARM semantics are that a non-exclusive store by this CPU between
a ldrex and a strex might result in loss of the (local) monitor,
but non-exclusive loads by this CPU won't. (It's an IMPDEF
choice.)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 2/6] softmmu: Add new TLB_EXCL flag
  2015-09-30 11:09       ` Peter Maydell
@ 2015-09-30 12:44         ` alvise rigo
  2015-09-30 20:37           ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: alvise rigo @ 2015-09-30 12:44 UTC (permalink / raw)
  To: Peter Maydell
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson

On Wed, Sep 30, 2015 at 1:09 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 30 September 2015 at 10:24, alvise rigo
> <a.rigo@virtualopensystems.com> wrote:
>> On Wed, Sep 30, 2015 at 5:34 AM, Richard Henderson <rth@twiddle.net> wrote:
>>> (1) I don't see why EXCL support should differ whether MMIO is set or not.
>>> Either we support exclusive accesses on memory-mapped io like we do on ram
>>> (in which case this is wrong) or we don't (in which case this is
>>> unnecessary).
>>
>> I was not sure whether or not we had to support also MMIO memory.
>> In theory there shouldn't be any issues for including also
>> memory-mapped io, I will consider this for the next version.
>
> Worth considering the interaction between exclusives and other
> cases for which we force the slowpath, notably watchpoints.
>
>>> AFAIK, Alpha is the only target we have which specifies that any normal
>>> memory access during a ll+sc sequence may fail the sc.
>>
>> I will dig into it because I remember that the Alpha architecture
>> behaves like ARM in the handling of LDxL/STxC instructions.
>
> ARM semantics are that a non-exclusive store by this CPU between
> a ldrex and a strex might result in loss of the (local) monitor,
> but non-exclusive loads by this CPU won't. (It's an IMPDEF
> choice.)

Indeed, what is implemented by this patch series is one of the
permissible choices - very close to the one implemented by the current
TCG - that could match all the other architectures with similar
semantics (now I'm not sure about Alpha).
In this regard, I was wondering, should these semantics be somehow
target-* dependent?
Like having some per-architecture functions that, for each LoadLink,
set the size of the exclusive memory region to be protected and decide
whether a normal store/load will make one CPU's SC fail.
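
Something like this, just to make the question concrete (purely hypothetical
interface):

    typedef struct CPULLSCOps {
        /* size in bytes of the region protected by a LoadLink at addr */
        hwaddr (*excl_size)(CPUState *cs, hwaddr addr);
        /* does a normal store/load by this vCPU clear its own monitor? */
        bool (*store_clears_monitor)(CPUState *cs);
        bool (*load_clears_monitor)(CPUState *cs);
    } CPULLSCOps;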

Thank you,
alvise

>
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation
  2015-09-30  8:14   ` alvise rigo
@ 2015-09-30 13:20     ` Paolo Bonzini
  0 siblings, 0 replies; 27+ messages in thread
From: Paolo Bonzini @ 2015-09-30 13:20 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Emilio G. Cota,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée



On 30/09/2015 10:14, alvise rigo wrote:
>> From 10000ft, both approaches rely on checking a flag during stores.
>> This is split between the TLB and the CPUState for Alvise's patches (in
>> order to exploit the existing fast-path checks), and entirely in the
>> radix tree for Emilio's.  However, the idea is the same.
>>
>> Now, the patch are okay for serial emulation, but I am not sure if it's
>> possible to do lock-free ll/sc emulation, because there is a race.
> 
> Do you mean to not use any locking mechanism at all at the emulation side?

Not using it in the fast path of the store, at least.

>>
>> If we check the flag before the store, the race is as follows:
>>
>>    CPU0        CPU1
>>    -------------------------------------------------------
>>    check flag
>>                load locked:
>>                   set flag
>>                   load value (normal load on CPU)
>>    store
>>                store conditional (normal store on CPU)
>>
>> where the sc doesn't fail.  For completeness, if we check it afterwards
> 
> Shouldn't this be prevented by the tcg_excl_access_lock in the
> patch series based on mttcg (branch slowpath-for-atomic-v5-mttcg)?

No, the issue happens already in the fast path, i.e. when checking the TLB.

Have you ever heard of consensus numbers?  Basically, it's a tool to
prove that it is impossible to implement an algorithm X in a wait-free
manner.

For LL/SC/store, consider a simple case with two CPUs, one executing LL
and one executing a store.  This does not require any lock on the LL/SC
side, because there is only one CPU running that code.  It is also okay
if it requires a lock to synchronize between LL/store in the slow path.
 However, we want a wait-free fast path.  If we can formalize the fast
path as a consensus problem, consensus numbers let us prove whether it
can be done or not.

In fact, it's easy for the CPUs to use consensus to decide the outcome
of a subsequent SC instruction.  If the LL comes before the store, the
SC fails.  If the LL comes after the store, the SC succeeds.  Because
there are two CPUs, this consensus problem can be solved with any
primitive whose consensus number is >= 2.

Unfortunately atomic registers (i.e. memory cells, like QEMU's TLBs)
have consensus number 1.  You need test-and-set, compare-and-swap,
atomic writes to two memory locations or something like that.  All very
expensive stuff.
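
To make the consensus angle concrete, here is the textbook two-process
"who came first?" decision written with C11 atomics (an illustration
only, not QEMU code).  With a compare-and-swap both sides agree on a
single winner; with nothing but plain loads and stores of a flag (which
is all a TLB entry gives you in the fast path) no wait-free equivalent
exists:

#include <stdatomic.h>

enum { NOBODY = 0, LL_FIRST, STORE_FIRST };
static atomic_int winner;        /* zero-initialized, i.e. NOBODY */

/* Called once by the LL side (who_am_i == LL_FIRST) and once by the
 * store side (who_am_i == STORE_FIRST); both get the same answer,
 * because winner is written at most once. */
int decide(int who_am_i)
{
    int expected = NOBODY;
    /* Only the first caller installs itself as the winner. */
    atomic_compare_exchange_strong(&winner, &expected, who_am_i);
    return atomic_load(&winner);
}

If the store side learns LL_FIRST it takes the slow path and the SC
fails; if it learns STORE_FIRST the SC succeeds.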

I attach a PDF with some pseudocode examples, and a Promela model of the
same.  (Yes, I was nerd-sniped).

>> If I'm right, we can still keep the opcodes and implement them with a
>> simple cmpxchg.  It would provide a nice generic tool to implement
>> atomic operations, and it will work correctly if the target has ll/sc.
>> However, ll/sc-on-cmpxchg (e.g., ARM-on-x86) would be susceptible to the
>> ABA problem.
> 
> This was one of my fears that led me to the ll/sc approach. I think it
> could be even more likely under emulation, since we can't assume that
> the time between an LL and its SC is small enough to avoid "ABA"
> accesses.

In practice cmpxchg works because most primitives and lock-free data
structures are written against cmpxchg, not LL/SC.  ABA is avoided
through garbage collection, RCU or hazard pointers, without relying on
LL/SC semantics.
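
For reference, the ABA exposure being discussed looks like this in an
ll/sc-on-cmpxchg emulation (a minimal sketch, not code from either
series; names are made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t *addr;   /* address "linked" by the LL */
    uint32_t val;             /* value observed by the LL */
} LLState;

static uint32_t emulated_ll(LLState *st, _Atomic uint32_t *addr)
{
    st->addr = addr;
    st->val = atomic_load(addr);
    return st->val;
}

static bool emulated_sc(LLState *st, uint32_t newval)
{
    uint32_t expected = st->val;
    /* Succeeds as long as the *value* still matches, even if other
     * vCPUs stored B and then A back in between (the ABA case), where
     * a real strex would fail. */
    return atomic_compare_exchange_strong(st->addr, &expected, newval);
}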

Again, your patches are still very useful to provide the abstraction.

Paolo

[-- Attachment #2: llsc.pdf --]
[-- Type: application/pdf, Size: 45140 bytes --]

[-- Attachment #3: llsc.promela --]
[-- Type: text/plain, Size: 3300 bytes --]

int ll_val;        // value read by LL
int val;           // current value of memory cell
int mark;          // mark to trigger slow path in stores
int listed;        // abstract representation of "locked list"
int sc_failure;    // did SC succeed or fail?

/* common implementation of store-conditional */
#define SC                                           \
   atomic {                                          \
       sc_failure = !listed;                         \
       assert(sc_failure || val == ll_val);          \
       if :: sc_failure -> skip;                     \
          :: else -> mark = 0; listed = 0; val = 2;  \
       fi;                                           \
   }

// trivial implementation, obviously broken - does not even try
#if 0
#define LL                                    \
   listed = 1;                                \
   ll_val = val;                              \
   mark = 1;

#define STORE                                 \
   val = 1
#endif

// -------------------------------------------------------------------

// broken implementation #1
// STORE can read the mark before LL has set it.  If it stores
// the new value after LL has read the memory, SC will not
// notice the conflict.

#if 0
#define LL                                    \
   ll_val = val;                              \
   atomic {                                   \
     listed = 1;                              \
     mark = 1;                                \
   }

#define STORE                                     \
   if :: mark -> atomic { listed = 0; mark = 0; } \
      :: else -> skip;                            \
   fi;                                            \
   val = 1
#endif

// -------------------------------------------------------------------

// broken implementation #2
// If STORE reads the mark before LL has set it, SC will not
// notice the conflict.

#if 0
#define LL                                    \
   atomic {                                   \
     listed = 1;                              \
     ll_val = val;                            \
     mark = 1;                                \
   }

#define STORE                                 \
   if :: mark -> atomic { listed = 0; mark = 0; } \
      :: else -> skip;                            \
   fi;                                            \
   val = 1
#endif

// -------------------------------------------------------------------

// This works but uses transactional memory in STORE

#if 1
#define LL                                    \
   listed = 1;                                \
   mark = 1;                                  \
   ll_val = val;                              \

#define STORE                                              \
  atomic {                                                 \
    if :: mark -> slow = 1;                                \
       :: else -> val = 1;                                 \
    fi                                                     \
  }                                                        \
  if :: slow -> atomic { mark = 0; val = 1; listed = 0; }; \
     :: else -> skip;                                      \
  fi

#endif

active proctype llsc()
{
   LL;
   SC;
}

active proctype store()
{
   int slow;       // defaults to 0; set only by the slow-path branch of STORE
   STORE;
}
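
// The model can be checked with Spin: "spin -a llsc.promela" generates a
// verifier (pan.c), which is compiled with a C compiler and run; an
// assertion violation reported by pan corresponds to an SC that succeeds
// despite a conflicting store.  To see the broken variants fail, flip the
// #if 0 / #if 1 guards above.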

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 2/6] softmmu: Add new TLB_EXCL flag
  2015-09-30 12:44         ` alvise rigo
@ 2015-09-30 20:37           ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2015-09-30 20:37 UTC (permalink / raw)
  To: alvise rigo, Peter Maydell
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On 09/30/2015 10:44 PM, alvise rigo wrote:
> On Wed, Sep 30, 2015 at 1:09 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> On 30 September 2015 at 10:24, alvise rigo
>> <a.rigo@virtualopensystems.com> wrote:
>>> On Wed, Sep 30, 2015 at 5:34 AM, Richard Henderson <rth@twiddle.net> wrote:
>>>> (1) I don't see why EXCL support should differ whether MMIO is set or not.
>>>> Either we support exclusive accesses on memory-mapped io like we do on ram
>>>> (in which case this is wrong) or we don't (in which case this is
>>>> unnecessary).
>>>
>>> I was not sure whether or not we also had to support MMIO memory.
>>> In theory there shouldn't be any issue in also covering
>>> memory-mapped I/O; I will consider this for the next version.
>>
>> Worth considering the interaction between exclusives and other
>> cases for which we force the slowpath, notably watchpoints.
>>
>>>> AFAIK, Alpha is the only target we have which specifies that any normal
>>>> memory access during a ll+sc sequence may fail the sc.
>>>
>>> I will dig into it because I remember that the Alpha architecture
>>> behaves like ARM in the handling of LDxL/STxC instructions.
>>
>> ARM semantics are that a non-exclusive store by this CPU between
>> a ldrex and a strex might result in loss of the (local) monitor,
>> but non-exclusive loads by this CPU won't. (It's an IMPDEF
>> choice.)
>
> Indeed, what is implemented by this patch series is one of the
> permissible choices - very close to the one implemented by the current
> TCG - that could match all the other architectures with similar
> semantics

Except for the case I pointed out -- where a ldr to a page that conflicts with 
the ldrex page in the tlb will cause the replacement of the tlb entry; at that 
point you'll see that the outgoing tlb entry has TLB_EXCL set, and you'll set 
the dirty bit (i.e. clear the lock bit) on the ldrex page.

I.e. this is a case that *must* pass on arm, but your implementation won't.
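
Written out as a guest-side test it is something like the following
(just a sketch with GCC ARM inline asm, not part of any test suite; it
only builds for an ARM target and assumes the two addresses are chosen
so that their pages alias in the softmmu TLB and evict each other):

#include <stdint.h>

int strex_status_after_plain_load(volatile uint32_t *a, volatile uint32_t *b)
{
    uint32_t old, tmp, failed;

    __asm__ volatile("ldrex %0, [%1]" : "=r"(old) : "r"(a));
    /* Non-exclusive load: architecturally this does not clear the local
     * monitor, but in the proposed scheme it can evict a's TLB entry
     * and drop the EXCL state. */
    __asm__ volatile("ldr %0, [%1]" : "=r"(tmp) : "r"(b));
    __asm__ volatile("strex %0, %1, [%2]"
                     : "=&r"(failed)
                     : "r"(old + 1), "r"(a)
                     : "memory");
    (void)tmp;
    /* In the absence of other writers, ARM requires this strex to
     * succeed (failed == 0); the scheme described above would fail it. */
    return failed;
}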

> (now I'm not sure about Alpha).

Alpha is trivial, as always.  Those guys wrote the spec such that anything hard 
isn't guaranteed to work, but also isn't guaranteed to fail.

The rule is: no memory accesses of any kind, lest the lock flag be lost.  In 
practice, it's a test of the MESI state of the cacheline, which, I assume, is 
what most other targets use.

> Like having some per-architecture functions that, for each LoadLink,
> set the size of the exclusive memory region to be protected

We probably do need that; I sort of assumed that would simply be part of the 
implementation on the instruction side, not done as a side-effect of the actual 
load.

> and decide whether a normal store/load will make one CPU's SC fail.

There's likely no need to do this.  And certainly I don't think we should do it 
right away.


r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 3/6] softmmu: Add helpers for a new slowpath
  2015-09-30  9:46     ` alvise rigo
@ 2015-09-30 20:42       ` Richard Henderson
  2015-10-01  8:05         ` alvise rigo
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2015-09-30 20:42 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On 09/30/2015 07:46 PM, alvise rigo wrote:
> On Wed, Sep 30, 2015 at 5:58 AM, Richard Henderson <rth@twiddle.net> wrote:
>> Why would you need to indicate that another cpu has started an exclusive
>> operation on this page?  That seems definitely wrong.
>
> The cpu_physical_memory_clear_excl_dirty() sets the flag to generate
> the TLB entry with the EXCL flag.

Yes, but surely the clearing of dirty on current_cpu is enough to cause the 
other cpus to see that they need to set TLB_EXCL when reloading their tlb entries.

Why do you need to manipulate the *other* cpu's dirty bit?

>> I wonder what can be done to clean all of this up, short of actually
>> transitioning to c++ templates...
>>
>
> Is there already an example in QEMU I can look at?

No.


r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 3/6] softmmu: Add helpers for a new slowpath
  2015-09-30 20:42       ` Richard Henderson
@ 2015-10-01  8:05         ` alvise rigo
  2015-10-01 19:34           ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: alvise rigo @ 2015-10-01  8:05 UTC (permalink / raw)
  To: Richard Henderson
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Wed, Sep 30, 2015 at 10:42 PM, Richard Henderson <rth@twiddle.net> wrote:
>
> On 09/30/2015 07:46 PM, alvise rigo wrote:
>>
>> On Wed, Sep 30, 2015 at 5:58 AM, Richard Henderson <rth@twiddle.net> wrote:
>>>
>>> Why would you need to indicate that another cpu has started an exclusive
>>> operation on this page?  That seems definitely wrong.
>>
>>
>> The cpu_physical_memory_clear_excl_dirty() sets the flag to generate
>> the TLB entry with the EXCL flag.
>
>
> Yes, but surely the clearing of dirty on current_cpu is enough to cause the other cpus to see that they need to set TLB_EXCL when reloading their tlb entries.
>
> Why do you need to manipulate the *other* cpu's dirty bit?

Because then we can assume that a cpu with the bit cleared is
guaranteed to have its TLB entries for that specific page flagged with
EXCL. Moreover, knowing which cpus have the EXCL flag set allows us to
reduce the flushing requests whenever a new LL is issued on the same
page.
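
The idea in pseudo-QEMU terms (the excl-dirty helpers below are guesses
at the series' API; their exact names and signatures may differ):

#include "qom/cpu.h"         /* CPUState, CPU_FOREACH */
#include "exec/exec-all.h"   /* tlb_flush */
#include "exec/ram_addr.h"   /* ram_addr_t, dirty-bitmap helpers */

static void start_exclusive_on_page(CPUState *this_cpu, ram_addr_t page)
{
    CPUState *cpu;

    CPU_FOREACH(cpu) {
        if (cpu == this_cpu) {
            continue;
        }
        /* Bit already clear: that vCPU's TLB entries for this page are
         * already tagged TLB_EXCL, so no flush is needed. */
        if (!cpu_physical_memory_excl_is_dirty(page, cpu->cpu_index)) {
            continue;
        }
        cpu_physical_memory_clear_excl_dirty(page, cpu->cpu_index);
        tlb_flush(cpu, 1);
    }
}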

>
>
>>> I wonder what can be done to clean all of this up, short of actually
>>> transitioning to c++ templates...
>>>
>>
>> Is there already an example in QEMU I can look at?
>
>
> No.

Thanks,
alvise

>
>
>
> r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation
  2015-09-30  4:44 ` [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation Paolo Bonzini
  2015-09-30  8:14   ` alvise rigo
@ 2015-10-01 19:32   ` Emilio G. Cota
  1 sibling, 0 replies; 27+ messages in thread
From: Emilio G. Cota @ 2015-10-01 19:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, claudio.fontana, Alvise Rigo, qemu-devel, jani.kokkonen,
	tech, alex.bennee

On Wed, Sep 30, 2015 at 06:44:32 +0200, Paolo Bonzini wrote:
> I have a doubt about your patches for ll/sc emulation, that I hope you
> can clarify.
> 
> From 10000ft, both approaches rely on checking a flag during stores.
> This is split between the TLB and the CPUState for Alvise's patches (in
> order to exploit the existing fast-path checks), and entirely in the
> radix tree for Emilio's.  However, the idea is the same.

Not quite the same idea, see below.

> Now, the patches are okay for serial emulation, but I am not sure if it's
> possible to do lock-free ll/sc emulation, because there is a race.
> 
> If we check the flag before the store, the race is as follows:
> 
>    CPU0        CPU1
>    -------------------------------------------------------
>    check flag
>                load locked:
>                   set flag
>                   load value (normal load on CPU)
>    store
>                store conditional (normal store on CPU)
> 
> where the sc doesn't fail.  For completeness, if we check it afterwards
> (which would be possible with Emilio's approach, though not for the
> TLB-based one):
> 
>    CPU0        CPU1
>    ------------------------------------------------------
>                load locked
>                   set bit
>                   load value (normal load on CPU)
>    store
>                store conditional (normal store on CPU)
>    check flag
> 
> and again the sc doesn't fail.

(snip)
> Tell me I'm wrong. :)

The difference between Alvise's implementation and what I submitted is
that in the radix tree a cache line that has *ever* had an atomic op
performed on it is marked as "slow path" for the rest of the execution,
meaning that *all* subsequent stores to said cache line will take the
cache line entry's lock.

This does not fix the race completely (e.g. you could have a store
and an atomic op concurrently executing on a line that hasn't yet
had an atomic op on it), but it significantly narrows it. My guess
is that it would be very hard for practical programs to trigger.
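
In sketch form, per tracked cache line (illustration only, ignoring the
radix-tree lookup itself; the names are made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <pthread.h>

typedef struct CacheLineEntry {
    atomic_bool ever_atomic;   /* set once by the LL/SC helpers, never cleared */
    pthread_mutex_t lock;      /* PTHREAD_MUTEX_INITIALIZER; taken by stores
                                  once ever_atomic is set */
} CacheLineEntry;

/* Called by the LL/SC helpers the first time they touch this line. */
static void mark_line_atomic(CacheLineEntry *e)
{
    atomic_store(&e->ever_atomic, true);
}

/* Called on every emulated store that hits this line. */
static void store_to_line(CacheLineEntry *e, uint32_t *host_addr, uint32_t v)
{
    if (!atomic_load(&e->ever_atomic)) {
        /* Fast path; the residual race is a store passing this check
         * just before an LL sets ever_atomic on the same line. */
        *host_addr = v;
        return;
    }
    pthread_mutex_lock(&e->lock);     /* slow path, for the rest of execution */
    *host_addr = v;
    pthread_mutex_unlock(&e->lock);
}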

		E.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v5 3/6] softmmu: Add helpers for a new slowpath
  2015-10-01  8:05         ` alvise rigo
@ 2015-10-01 19:34           ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2015-10-01 19:34 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On 10/01/2015 06:05 PM, alvise rigo wrote:
> On Wed, Sep 30, 2015 at 10:42 PM, Richard Henderson <rth@twiddle.net> wrote:
>>
>> On 09/30/2015 07:46 PM, alvise rigo wrote:
>>>
>>> On Wed, Sep 30, 2015 at 5:58 AM, Richard Henderson <rth@twiddle.net> wrote:
>>>>
>>>> Why would you need to indicate that another cpu has started an exclusive
>>>> operation on this page?  That seems definitely wrong.
>>>
>>>
>>> The cpu_physical_memory_clear_excl_dirty() sets the flag to generate
>>> the TLB entry with the EXCL flag.
>>
>>
>> Yes, but surely the clearing of dirty on current_cpu is enough to cause the other cpus to see that they need to set TLB_EXCL when reloading their tlb entries.
>>
>> Why do you need to manipulate the *other* cpu's dirty bit?
>
> Because then we can assume that a cpu with the bit cleared is
> guaranteed to have its TLB entries for that specific page flagged with
> EXCL. Moreover, knowing which cpus have the EXCL flag set allows us to
> reduce the flushing requests whenever a new LL is issued on the same
> page.

Does it actually help, or is that a guess without numbers?


r~

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-10-01 19:34 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-24  8:32 [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation Alvise Rigo
2015-09-24  8:32 ` [Qemu-devel] [RFC v5 1/6] exec.c: Add new exclusive bitmap to ram_list Alvise Rigo
2015-09-26 17:15   ` Richard Henderson
2015-09-28  7:28     ` alvise rigo
2015-09-24  8:32 ` [Qemu-devel] [RFC v5 2/6] softmmu: Add new TLB_EXCL flag Alvise Rigo
2015-09-30  3:34   ` Richard Henderson
2015-09-30  9:24     ` alvise rigo
2015-09-30 11:09       ` Peter Maydell
2015-09-30 12:44         ` alvise rigo
2015-09-30 20:37           ` Richard Henderson
2015-09-24  8:32 ` [Qemu-devel] [RFC v5 3/6] softmmu: Add helpers for a new slowpath Alvise Rigo
2015-09-30  3:58   ` Richard Henderson
2015-09-30  9:46     ` alvise rigo
2015-09-30 20:42       ` Richard Henderson
2015-10-01  8:05         ` alvise rigo
2015-10-01 19:34           ` Richard Henderson
2015-09-24  8:32 ` [Qemu-devel] [RFC v5 4/6] target-arm: Create new runtime helpers for excl accesses Alvise Rigo
2015-09-30  4:03   ` Richard Henderson
2015-09-30 10:16     ` alvise rigo
2015-09-24  8:32 ` [Qemu-devel] [RFC v5 5/6] configure: Use slow-path for atomic only when the softmmu is enabled Alvise Rigo
2015-09-30  4:05   ` Richard Henderson
2015-09-30  9:51     ` alvise rigo
2015-09-24  8:32 ` [Qemu-devel] [RFC v5 6/6] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
2015-09-30  4:44 ` [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation Paolo Bonzini
2015-09-30  8:14   ` alvise rigo
2015-09-30 13:20     ` Paolo Bonzini
2015-10-01 19:32   ` Emilio G. Cota
