All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics
@ 2016-06-27 19:01 Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers Emilio G. Cota
                   ` (30 more replies)
  0 siblings, 31 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

This RFC provides softmmu/cpu primitives for cmpxchg operations.
These primitives are not TCG ops; they are meant to be called from target
helper functions. They apply to both system and user modes.

To increase performance for architectures that have "locked" versions
of regular ops (e.g. x86), helpers are also provided for most
of these atomic ops (e.g. atomic_add_and_fetch). Performance numbers
that compare atomic_* vs. cmpxchg-only emulation of atomics for x86
are available in patch 20's commit log.

cmpxchg helpers are also used for emulating LL/SC (in arm/aarch64),
starting from patch 21.

# Correctness
Using cmpxchg to emulate LL/SC is not correct, because it can
suffer from the ABA problem. However, I argue that a fully correct
implementation is not necessary for most real applications out
there. In other words, while it would be trivial to write a
parallel program that would incorrectly be emulated due to ABA,
most parallel code is written assuming only cmpxchg is available,
e.g. the kernel or gcc provide cmpxchg and other (simpler) "atomic"
ops, but nothing that would prevent ABA. [*]

[*] http://yarchive.net/comp/linux/cmpxchg_ll_sc_portability.html

I propose to have a cmpxchg-based implementation as the default,
to then add an option to enable a worse-performing-yet-correct LL/SC
implementation. I have a working solution that works in user and
system modes, but requires instrumenting *all* stores with helpers,
which has significant overhead when compared to cmpxchg. See
this bar chart, with measurements taken on an Intel Haswell:
http://imgur.com/KKb7S4t Overhead for those SPEC workloads is
on average ~20% ("store tracking"). Note that most of this comes from
calling the helpers ("store helpers only"). This implementation, however,
scales well, since no single lock is taken for emulating atomic accesses
(each cache line, when necessary, gets its own lock). If there's interest,
I can share this implementation.

# Performance and Scalability
The cmpxchg-based implementation in this RFC scales. It doesn't
require cross-CPU communication apart from the inevitable cache
line contention in the guest workload. In other words, no external
subsystems (e.g. TLBs, CPU loop) are touched, nor heavily-contended
locks are added.
The only locks in this RFC are added to emulate 16b atomics;
for now just a small table of locks is used. Using locks for
emulating these atomics is OK, since there are no "regular" (i.e.
non-atomic) loads/stores that could race with 16b atomics.

See the commit log in patches 22 (ARM) and 26 (aarch64) for a performance
comparison vs. the existing linux-user emulation on a 64-core system.
Tests are done using a newly added benchmark that stresses the cache
hierarchy with a configurable level of contention (patch 19).

# Testing
All implementations (x86_64, ARM, aarch64) pass the ck_pr validation
tests in concurrencykit[*] (ck/regressions/ck_pr_validate) in user-mode.
I have not tested the "paired" flavours of LL/SC in aarch64, so help
on this would be appreciated. I haven't even compile-tested on
32-bit hosts.
[*] http://concurrencykit.org/

# Why this is an RFC, and not a PATCH set
I'd like to have your input on the following:

- Is it OK *not* to write on a failed cmpxchg on user-only x86?
  On system-mode we have the write TLB access, so we'd get
  a write fault (all atomic/cmpxchg ops fault as a write fault,
  which AFAIK is the right thing to do), so that should suffice.

- What to do when atomic ops are used on something other than RAM?
  Should we have a "slow path" that is not atomic for these cases, or
  it's OK to assume code is bogus? For now, I just wrote XXX.

- In target-i386 code, I might have added unnecessary temps; I'm not
  sure which temps need to retain a meaningful value after translating
  instructions, so what I did is to make sure they retain the same
  value they'd have if the lock prefix wasn't there.

- Is it worth adding more checks to ensure that the LOCK prefix is
  where it should be? In some places I'm assuming "if (LOCK_PREFIX)"
  then we must have a memory operand (not a register operand).
  Assuming that instructions are always well-formed might be overly
  optimistic.

- How/where to fail when cmpxchg16 is performed on an address that
  is not 16b-aligned? x86 requires cmpxchg16b to be aligned, but
  doesn't enforce alignment on other atomics.

- Unaligned atomic ops: AFAIK only x86 supports them. I'm just punting
  on this since I'm passing the possibly-unaligned address to the hosts'
  compiler. AFAIK gcc deals with this by enlarging the size of the
  atomic to be used, but if for instance we're doing a cmpxchg8b
  on an unaligned address, very few hosts have a cmpxchg16b to deal
  with this.

- What to do with architectures that cannot guarantee atomic regular
  accesses? For instance, when emulating a 64-bit system on a 32-bit
  one. This will break a lot of parallel code, unless we serialize all
  loads/stores. I assume we won't support MTTCG on these architectures,
  and be done with this, right? Otherwise we'd have to instrument all
  loads and stores.

- ARM's cmpxchg syscall in linux-user could be improved; I've ignored
  it so far.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 20:11   ` Richard Henderson
  2016-06-27 19:01 ` [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock Emilio G. Cota
                   ` (29 subsequent siblings)
  30 siblings, 1 reply; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 softmmu_template.h | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h          | 16 +++++++++++++++
 2 files changed, 74 insertions(+)

diff --git a/softmmu_template.h b/softmmu_template.h
index 208f808..7b519dc 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -548,6 +548,64 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
     }
 }
 #endif
+
+DATA_TYPE
+glue(glue(helper_cmpxchg, SUFFIX),
+     MMUSUFFIX)(CPUArchState *env, target_ulong addr, DATA_TYPE old,
+                DATA_TYPE new, TCGMemOpIdx oi, uintptr_t retaddr)
+{
+    unsigned mmu_idx = get_mmuidx(oi);
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    uintptr_t haddr;
+
+    /* Adjust the given return address.  */
+    retaddr -= GETPC_ADJ;
+
+    /* If the TLB entry is for a different page, reload and try again.  */
+    if ((addr & TARGET_PAGE_MASK)
+        != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
+        if (unlikely((addr & (DATA_SIZE - 1)) != 0
+                     && (get_memop(oi) & MO_AMASK) == MO_ALIGN)) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                                 mmu_idx, retaddr);
+        }
+        if (!VICTIM_TLB_HIT(addr_write)) {
+            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
+        }
+        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    }
+
+    /* Handle an IO access.  */
+    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
+        /* XXX */
+        abort();
+    }
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                     >= TARGET_PAGE_SIZE)) {
+        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                                 mmu_idx, retaddr);
+        }
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
+    if (unlikely((addr & (DATA_SIZE - 1)) != 0
+                 && (get_memop(oi) & MO_AMASK) == MO_ALIGN)) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                             mmu_idx, retaddr);
+    }
+    /*
+     * If the host allows unaligned accesses, then let the compiler
+     * do its thing when performing the access on the host.
+     */
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+    return atomic_cmpxchg((DATA_TYPE *)haddr, old, new);
+}
+
 #endif /* !defined(SOFTMMU_CODE_ACCESS) */
 
 #undef READ_ACCESS_TYPE
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 66d7fc0..1fd7ec3 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -1101,6 +1101,22 @@ uint64_t helper_be_ldq_cmmu(CPUArchState *env, target_ulong addr,
 # define helper_ret_ldq_cmmu  helper_le_ldq_cmmu
 #endif
 
+uint8_t helper_cmpxchgb_mmu(CPUArchState *env, target_ulong addr,
+                            uint8_t old, uint8_t new,
+                            TCGMemOpIdx oi, uintptr_t retaddr);
+
+uint16_t helper_cmpxchgw_mmu(CPUArchState *env, target_ulong addr,
+                             uint16_t old, uint16_t new,
+                             TCGMemOpIdx oi, uintptr_t retaddr);
+
+uint32_t helper_cmpxchgl_mmu(CPUArchState *env, target_ulong addr,
+                             uint32_t old, uint32_t new,
+                             TCGMemOpIdx oi, uintptr_t retaddr);
+
+uint64_t helper_cmpxchgq_mmu(CPUArchState *env, target_ulong addr,
+                             uint64_t old, uint64_t new,
+                             TCGMemOpIdx oi, uintptr_t retaddr);
+
 #endif /* CONFIG_SOFTMMU */
 
 #endif /* TCG_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 20:07   ` Richard Henderson
  2016-06-27 19:01 ` [Qemu-devel] [RFC 03/30] cpu_ldst: add cpu_cmpxchg helpers Emilio G. Cota
                   ` (28 subsequent siblings)
  30 siblings, 1 reply; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

This set of locks will allow us to correctly emulate cmpxchg16
in a parallel TCG. The key observation is that no architecture
supports 16-byte regular atomic load/stores; only "locked" accesses
(e.g. via cmpxchg16b on x86) are allowed, and therefore we can emulate
them by using locks.

We use a small array of locks so that we can have some scalability.
Further improvements are possible (e.g. using a radix tree); but
we should have a workload to benchmark in order to justify the
additional complexity.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c        |  1 +
 linux-user/main.c |  1 +
 tcg/tcg.h         |  5 +++++
 translate-all.c   | 39 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 46 insertions(+)

diff --git a/cpu-exec.c b/cpu-exec.c
index b840e1d..26f3bd6 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -643,6 +643,7 @@ int cpu_exec(CPUState *cpu)
 #endif /* buggy compiler */
             cpu->can_do_io = 1;
             tb_lock_reset();
+            tcg_cmpxchg_lock_reset();
         }
     } /* for(;;) */
 
diff --git a/linux-user/main.c b/linux-user/main.c
index 78d8d04..af9e8e3 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -140,6 +140,7 @@ void fork_end(int child)
         pthread_cond_init(&exclusive_cond, NULL);
         pthread_cond_init(&exclusive_resume, NULL);
         qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
+        tcg_cmpxchg_lock_init();
         gdbserver_fork(thread_cpu);
     } else {
         pthread_mutex_unlock(&exclusive_lock);
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1fd7ec3..1c9c8bc 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -650,6 +650,11 @@ void tb_lock(void);
 void tb_unlock(void);
 void tb_lock_reset(void);
 
+void tcg_cmpxchg_lock(uintptr_t addr);
+void tcg_cmpxchg_unlock(void);
+void tcg_cmpxchg_lock_reset(void);
+void tcg_cmpxchg_lock_init(void);
+
 static inline void *tcg_malloc(int size)
 {
     TCGContext *s = &tcg_ctx;
diff --git a/translate-all.c b/translate-all.c
index eaa95e4..19432e5 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -153,6 +153,44 @@ void tb_lock_reset(void)
 #endif
 }
 
+#define TCG_CMPXCHG_NR_LOCKS 16
+static QemuMutex tcg_cmpxchg_locks[TCG_CMPXCHG_NR_LOCKS];
+static __thread QemuMutex *tcg_cmpxchg_curr_lock;
+
+void tcg_cmpxchg_lock(uintptr_t addr)
+{
+    assert(tcg_cmpxchg_curr_lock == NULL);
+    /* choose lock based on cache line address. We assume lines are 64b long */
+    addr >>= 6;
+    addr &= TCG_CMPXCHG_NR_LOCKS - 1;
+    tcg_cmpxchg_curr_lock = &tcg_cmpxchg_locks[addr];
+    qemu_mutex_lock(tcg_cmpxchg_curr_lock);
+}
+
+void tcg_cmpxchg_unlock(void)
+{
+    qemu_mutex_unlock(tcg_cmpxchg_curr_lock);
+    tcg_cmpxchg_curr_lock = NULL;
+}
+
+void tcg_cmpxchg_lock_reset(void)
+{
+    if (unlikely(tcg_cmpxchg_curr_lock)) {
+        tcg_cmpxchg_unlock();
+    }
+}
+
+void tcg_cmpxchg_lock_init(void)
+{
+    int i;
+
+    /* set current to NULL; useful after a child forks in user-mode */
+    tcg_cmpxchg_curr_lock = NULL;
+    for (i = 0; i < TCG_CMPXCHG_NR_LOCKS; i++) {
+        qemu_mutex_init(&tcg_cmpxchg_locks[i]);
+    }
+}
+
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
 
 void cpu_gen_init(void)
@@ -731,6 +769,7 @@ static inline void code_gen_alloc(size_t tb_size)
     tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock, tcg_ctx.code_gen_max_blocks);
 
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
+    tcg_cmpxchg_lock_init();
 }
 
 static void tb_htable_init(void)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 03/30] cpu_ldst: add cpu_cmpxchg helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 04/30] target-i386: add cmpxchg helpers Emilio G. Cota
                   ` (27 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu_atomic_template.h          | 56 +++++++++++++++++++++++++++++
 include/exec/cpu_atomic_useronly_template.h | 39 ++++++++++++++++++++
 include/exec/cpu_ldst_template.h            |  2 ++
 include/exec/cpu_ldst_useronly_template.h   |  2 ++
 4 files changed, 99 insertions(+)
 create mode 100644 include/exec/cpu_atomic_template.h
 create mode 100644 include/exec/cpu_atomic_useronly_template.h

diff --git a/include/exec/cpu_atomic_template.h b/include/exec/cpu_atomic_template.h
new file mode 100644
index 0000000..13f4ffd
--- /dev/null
+++ b/include/exec/cpu_atomic_template.h
@@ -0,0 +1,56 @@
+#include "tcg/tcg.h"
+
+static inline DATA_TYPE
+glue(glue(glue(cpu_cmpxchg, SUFFIX), MEMSUFFIX),
+     _ra)(CPUArchState *env, target_ulong ptr, DATA_TYPE old, DATA_TYPE new,
+          uintptr_t ra)
+{
+    target_ulong addr;
+    TCGMemOpIdx oi;
+    int page_index;
+    DATA_TYPE ret;
+    int mmu_idx;
+
+    addr = ptr;
+    page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    mmu_idx = CPU_MMU_INDEX;
+    if (unlikely(env->tlb_table[mmu_idx][page_index].addr_write !=
+                 (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
+        oi = make_memop_idx(SHIFT, mmu_idx);
+        ret = glue(glue(helper_cmpxchg, SUFFIX), MMUSUFFIX)(env, addr, old, new,
+                                                            oi, ra);
+    } else {
+        uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend;
+
+        ret = atomic_cmpxchg((DATA_TYPE *)hostaddr, old, new);
+    }
+    return ret;
+}
+
+/* define cmpxchgo once ldq and stq have been defined */
+#if DATA_SIZE == 8
+/* returns true on success, false on failure */
+static inline bool
+glue(glue(cpu_cmpxchgo, MEMSUFFIX), _ra)(CPUArchState *env, target_ulong ptr,
+                                         uint64_t *old_lo, uint64_t *old_hi,
+                                         uint64_t new_lo, uint64_t new_hi,
+                                         uintptr_t retaddr)
+{
+    uint64_t orig_lo, orig_hi;
+    bool ret = true;
+
+    tcg_cmpxchg_lock(ptr);
+    orig_lo = glue(glue(cpu_ldq, MEMSUFFIX), _ra)(env, ptr, retaddr);
+    orig_hi = glue(glue(cpu_ldq, MEMSUFFIX), _ra)(env, ptr + 8, retaddr);
+    if (orig_lo == *old_lo && orig_hi == *old_hi) {
+        glue(glue(cpu_stq, MEMSUFFIX), _ra)(env, ptr, new_lo, retaddr);
+        glue(glue(cpu_stq, MEMSUFFIX), _ra)(env, ptr + 8, new_hi, retaddr);
+    } else {
+        *old_lo = orig_lo;
+        *old_hi = orig_hi;
+        ret = false;
+    }
+    tcg_cmpxchg_unlock();
+    return ret;
+}
+#endif
diff --git a/include/exec/cpu_atomic_useronly_template.h b/include/exec/cpu_atomic_useronly_template.h
new file mode 100644
index 0000000..c7c0a3e
--- /dev/null
+++ b/include/exec/cpu_atomic_useronly_template.h
@@ -0,0 +1,39 @@
+#include "tcg/tcg.h"
+
+static inline DATA_TYPE
+glue(glue(glue(cpu_cmpxchg, SUFFIX), MEMSUFFIX),
+     _ra)(CPUArchState *env, target_ulong ptr, DATA_TYPE old, DATA_TYPE new,
+          uintptr_t ra)
+{
+    DATA_TYPE *hostaddr = g2h(ptr);
+
+    return atomic_cmpxchg(hostaddr, old, new);
+}
+
+/* define cmpxchgo once ldq and stq have been defined */
+#if DATA_SIZE == 8
+/* returns true on success, false on failure */
+static inline bool
+glue(glue(cpu_cmpxchgo, MEMSUFFIX),
+     _ra)(CPUArchState *env, target_ulong ptr, uint64_t *old_lo,
+          uint64_t *old_hi, uint64_t new_lo, uint64_t new_hi, uintptr_t retaddr)
+{
+    uint64_t *hostaddr = g2h(ptr);
+    uint64_t orig_lo, orig_hi;
+    bool ret = true;
+
+    tcg_cmpxchg_lock(ptr);
+    orig_lo = *hostaddr;
+    orig_hi = *(hostaddr + 1);
+    if (orig_lo == *old_lo && orig_hi == *old_hi) {
+        *hostaddr = new_lo;
+        *(hostaddr + 1) = new_hi;
+    } else {
+        *old_lo = orig_lo;
+        *old_hi = orig_hi;
+        ret = false;
+    }
+    tcg_cmpxchg_unlock();
+    return ret;
+}
+#endif
diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h
index eaf69a1..2329b66 100644
--- a/include/exec/cpu_ldst_template.h
+++ b/include/exec/cpu_ldst_template.h
@@ -194,6 +194,8 @@ glue(glue(cpu_st, SUFFIX), MEMSUFFIX)(CPUArchState *env, target_ulong ptr,
     glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(env, ptr, v, 0);
 }
 
+#include "exec/cpu_atomic_template.h"
+
 #endif /* !SOFTMMU_CODE_ACCESS */
 
 #undef RES_TYPE
diff --git a/include/exec/cpu_ldst_useronly_template.h b/include/exec/cpu_ldst_useronly_template.h
index b1378bf..e16f892 100644
--- a/include/exec/cpu_ldst_useronly_template.h
+++ b/include/exec/cpu_ldst_useronly_template.h
@@ -118,6 +118,8 @@ glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
 {
     glue(glue(cpu_st, SUFFIX), MEMSUFFIX)(env, ptr, v);
 }
+
+#include "exec/cpu_atomic_useronly_template.h"
 #endif
 
 #undef RES_TYPE
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 04/30] target-i386: add cmpxchg helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (2 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 03/30] cpu_ldst: add cpu_cmpxchg helpers Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 05/30] target-i386: emulate LOCK'ed cmpxchg using " Emilio G. Cota
                   ` (26 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/helper.h     |  4 ++++
 target-i386/mem_helper.c | 15 +++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/target-i386/helper.h b/target-i386/helper.h
index 1320edc..af84836 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -74,8 +74,12 @@ DEF_HELPER_3(boundw, void, env, tl, int)
 DEF_HELPER_3(boundl, void, env, tl, int)
 DEF_HELPER_1(rsm, void, env)
 DEF_HELPER_2(into, void, env, int)
+DEF_HELPER_4(cmpxchgb, tl, env, tl, tl, tl)
+DEF_HELPER_4(cmpxchgw, tl, env, tl, tl, tl)
+DEF_HELPER_4(cmpxchgl, tl, env, tl, tl, tl)
 DEF_HELPER_2(cmpxchg8b, void, env, tl)
 #ifdef TARGET_X86_64
+DEF_HELPER_4(cmpxchgq, tl, env, tl, tl, tl)
 DEF_HELPER_2(cmpxchg16b, void, env, tl)
 #endif
 DEF_HELPER_1(single_step, void, env)
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index c2f4769..3b17326 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -56,6 +56,21 @@ void helper_lock_init(void)
 }
 #endif
 
+#define GEN_CMPXCHG_HELPER(NAME)                                      \
+target_ulong glue(helper_, NAME)(CPUX86State *env, target_ulong addr, \
+                                 target_ulong old, target_ulong new)  \
+{                                                                     \
+    return glue(glue(cpu_, NAME), _data_ra)(env, addr, old, new, GETPC()); \
+}
+
+GEN_CMPXCHG_HELPER(cmpxchgb)
+GEN_CMPXCHG_HELPER(cmpxchgw)
+GEN_CMPXCHG_HELPER(cmpxchgl)
+#ifdef TARGET_X86_64
+GEN_CMPXCHG_HELPER(cmpxchgq)
+#endif
+#undef GEN_CMPXCHG_HELPER
+
 void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
 {
     uint64_t d;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 05/30] target-i386: emulate LOCK'ed cmpxchg using cmpxchg helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (3 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 04/30] target-i386: add cmpxchg helpers Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 06/30] target-i386: emulate LOCK'ed cmpxchg8b/16b " Emilio G. Cota
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

The diff here is uglier than necessary. All this does is to turn

FOO

into:

if (s->prefix & PREFIX_LOCK) {
  BAR
} else {
  FOO
}

where FOO is the original implementation of an unlocked cmpxchg.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/translate.c | 83 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 58 insertions(+), 25 deletions(-)

diff --git a/target-i386/translate.c b/target-i386/translate.c
index 7dea18b..fba90e7 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -1253,6 +1253,24 @@ static void gen_helper_fp_arith_STN_ST0(int op, int opreg)
     }
 }
 
+static void gen_cmpxchg(TCGv ret, TCGv addr, TCGv old, TCGv new, TCGMemOp ot)
+{
+    switch (ot & 3) {
+    case 0:
+        return gen_helper_cmpxchgb(ret, cpu_env, addr, old, new);
+    case 1:
+        return gen_helper_cmpxchgw(ret, cpu_env, addr, old, new);
+    case 2:
+        return gen_helper_cmpxchgl(ret, cpu_env, addr, old, new);
+#ifdef TARGET_X86_64
+    case 3:
+        return gen_helper_cmpxchgq(ret, cpu_env, addr, old, new);
+#endif
+    default:
+        tcg_abort();
+    }
+}
+
 /* if d == OR_TMP0, it means memory operand (address in A0) */
 static void gen_op(DisasContext *s1, int op, TCGMemOp ot, int d)
 {
@@ -5082,37 +5100,52 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             t2 = tcg_temp_local_new();
             a0 = tcg_temp_local_new();
             gen_op_mov_v_reg(ot, t1, reg);
-            if (mod == 3) {
-                rm = (modrm & 7) | REX_B(s);
-                gen_op_mov_v_reg(ot, t0, rm);
-            } else {
+
+            if (s->prefix & PREFIX_LOCK) {
+                if (mod == 3) {
+                    goto illegal_op;
+                }
+                label1 = gen_new_label();
                 gen_lea_modrm(env, s, modrm);
                 tcg_gen_mov_tl(a0, cpu_A0);
-                gen_op_ld_v(s, ot, t0, a0);
-                rm = 0; /* avoid warning */
-            }
-            label1 = gen_new_label();
-            tcg_gen_mov_tl(t2, cpu_regs[R_EAX]);
-            gen_extu(ot, t0);
-            gen_extu(ot, t2);
-            tcg_gen_brcond_tl(TCG_COND_EQ, t2, t0, label1);
-            label2 = gen_new_label();
-            if (mod == 3) {
+                tcg_gen_mov_tl(t2, cpu_regs[R_EAX]);
+                gen_cmpxchg(t0, a0, t2, t1, ot);
+                tcg_gen_brcond_tl(TCG_COND_EQ, t2, t0, label1);
                 gen_op_mov_reg_v(ot, R_EAX, t0);
-                tcg_gen_br(label2);
                 gen_set_label(label1);
-                gen_op_mov_reg_v(ot, rm, t1);
             } else {
-                /* perform no-op store cycle like physical cpu; must be
-                   before changing accumulator to ensure idempotency if
-                   the store faults and the instruction is restarted */
-                gen_op_st_v(s, ot, t0, a0);
-                gen_op_mov_reg_v(ot, R_EAX, t0);
-                tcg_gen_br(label2);
-                gen_set_label(label1);
-                gen_op_st_v(s, ot, t1, a0);
+                if (mod == 3) {
+                    rm = (modrm & 7) | REX_B(s);
+                    gen_op_mov_v_reg(ot, t0, rm);
+                } else {
+                    gen_lea_modrm(env, s, modrm);
+                    tcg_gen_mov_tl(a0, cpu_A0);
+                    gen_op_ld_v(s, ot, t0, a0);
+                    rm = 0; /* avoid warning */
+                }
+                label1 = gen_new_label();
+                tcg_gen_mov_tl(t2, cpu_regs[R_EAX]);
+                gen_extu(ot, t0);
+                gen_extu(ot, t2);
+                tcg_gen_brcond_tl(TCG_COND_EQ, t2, t0, label1);
+                label2 = gen_new_label();
+                if (mod == 3) {
+                    gen_op_mov_reg_v(ot, R_EAX, t0);
+                    tcg_gen_br(label2);
+                    gen_set_label(label1);
+                    gen_op_mov_reg_v(ot, rm, t1);
+                } else {
+                    /* perform no-op store cycle like physical cpu; must be
+                       before changing accumulator to ensure idempotency if
+                       the store faults and the instruction is restarted */
+                    gen_op_st_v(s, ot, t0, a0);
+                    gen_op_mov_reg_v(ot, R_EAX, t0);
+                    tcg_gen_br(label2);
+                    gen_set_label(label1);
+                    gen_op_st_v(s, ot, t1, a0);
+                }
+                gen_set_label(label2);
             }
-            gen_set_label(label2);
             tcg_gen_mov_tl(cpu_cc_src, t0);
             tcg_gen_mov_tl(cpu_cc_srcT, t2);
             tcg_gen_sub_tl(cpu_cc_dst, t2, t0);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 06/30] target-i386: emulate LOCK'ed cmpxchg8b/16b using cmpxchg helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (4 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 05/30] target-i386: emulate LOCK'ed cmpxchg using " Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 07/30] atomics: add atomic_xor Emilio G. Cota
                   ` (24 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

For consistency, rename the existing cmpxchg8b/16b helpers by appending _unlocked
to them, to stress that they are not atomic.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/helper.h     |  2 ++
 target-i386/mem_helper.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++--
 target-i386/translate.c  | 12 +++++++++--
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/target-i386/helper.h b/target-i386/helper.h
index af84836..2bb0d1f 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -78,9 +78,11 @@ DEF_HELPER_4(cmpxchgb, tl, env, tl, tl, tl)
 DEF_HELPER_4(cmpxchgw, tl, env, tl, tl, tl)
 DEF_HELPER_4(cmpxchgl, tl, env, tl, tl, tl)
 DEF_HELPER_2(cmpxchg8b, void, env, tl)
+DEF_HELPER_2(cmpxchg8b_unlocked, void, env, tl)
 #ifdef TARGET_X86_64
 DEF_HELPER_4(cmpxchgq, tl, env, tl, tl, tl)
 DEF_HELPER_2(cmpxchg16b, void, env, tl)
+DEF_HELPER_2(cmpxchg16b_unlocked, void, env, tl)
 #endif
 DEF_HELPER_1(single_step, void, env)
 DEF_HELPER_1(cpuid, void, env)
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index 3b17326..b002aba 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -71,7 +71,7 @@ GEN_CMPXCHG_HELPER(cmpxchgq)
 #endif
 #undef GEN_CMPXCHG_HELPER
 
-void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
+void helper_cmpxchg8b_unlocked(CPUX86State *env, target_ulong a0)
 {
     uint64_t d;
     int eflags;
@@ -92,8 +92,36 @@ void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
     CC_SRC = eflags;
 }
 
+void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
+{
+    uint64_t d;
+    uint64_t old;
+    uint64_t new;
+    int eflags;
+
+    old = env->regs[R_EDX];
+    old <<= 32;
+    old |= env->regs[R_EAX];
+
+    new = env->regs[R_ECX];
+    new <<= 32;
+    new |= env->regs[R_EBX];
+
+    eflags = cpu_cc_compute_all(env, CC_OP);
+
+    d = cpu_cmpxchgq_data_ra(env, a0, old, new, GETPC());
+    if (d == old) {
+        eflags |= CC_Z;
+    } else {
+        env->regs[R_EDX] = (uint32_t)(d >> 32);
+        env->regs[R_EAX] = (uint32_t)d;
+        eflags &= ~CC_Z;
+    }
+    CC_SRC = eflags;
+}
+
 #ifdef TARGET_X86_64
-void helper_cmpxchg16b(CPUX86State *env, target_ulong a0)
+void helper_cmpxchg16b_unlocked(CPUX86State *env, target_ulong a0)
 {
     uint64_t d0, d1;
     int eflags;
@@ -118,6 +146,28 @@ void helper_cmpxchg16b(CPUX86State *env, target_ulong a0)
     }
     CC_SRC = eflags;
 }
+
+void helper_cmpxchg16b(CPUX86State *env, target_ulong a0)
+{
+    uint64_t d0 = env->regs[R_EAX];
+    uint64_t d1 = env->regs[R_EDX];
+    int eflags;
+
+    if ((a0 & 0xf) != 0) {
+        raise_exception_ra(env, EXCP0D_GPF, GETPC());
+    }
+    eflags = cpu_cc_compute_all(env, CC_OP);
+
+    if (cpu_cmpxchgo_data_ra(env, a0, &d0, &d1, env->regs[R_EBX],
+                             env->regs[R_ECX], GETPC())) {
+        eflags |= CC_Z;
+    } else {
+        env->regs[R_EDX] = d1;
+        env->regs[R_EAX] = d0;
+        eflags &= ~CC_Z;
+    }
+    CC_SRC = eflags;
+}
 #endif
 
 void helper_boundw(CPUX86State *env, target_ulong a0, int v)
diff --git a/target-i386/translate.c b/target-i386/translate.c
index fba90e7..9abd82f 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -5166,14 +5166,22 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             if (!(s->cpuid_ext_features & CPUID_EXT_CX16))
                 goto illegal_op;
             gen_lea_modrm(env, s, modrm);
-            gen_helper_cmpxchg16b(cpu_env, cpu_A0);
+            if (s->prefix & PREFIX_LOCK) {
+                gen_helper_cmpxchg16b(cpu_env, cpu_A0);
+            } else {
+                gen_helper_cmpxchg16b_unlocked(cpu_env, cpu_A0);
+            }
         } else
 #endif        
         {
             if (!(s->cpuid_features & CPUID_CX8))
                 goto illegal_op;
             gen_lea_modrm(env, s, modrm);
-            gen_helper_cmpxchg8b(cpu_env, cpu_A0);
+            if (s->prefix & PREFIX_LOCK) {
+                gen_helper_cmpxchg8b(cpu_env, cpu_A0);
+            } else {
+                gen_helper_cmpxchg8b_unlocked(cpu_env, cpu_A0);
+            }
         }
         set_cc_op(s, CC_OP_EFLAGS);
         break;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 07/30] atomics: add atomic_xor
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (5 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 06/30] target-i386: emulate LOCK'ed cmpxchg8b/16b " Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 08/30] atomics: add atomic_op_fetch variants Emilio G. Cota
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

This paves the way for upcoming work.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/atomic.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 7a59096..a5531da 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -161,6 +161,7 @@
 #define atomic_fetch_sub(ptr, n) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST)
 #define atomic_fetch_and(ptr, n) __atomic_fetch_and(ptr, n, __ATOMIC_SEQ_CST)
 #define atomic_fetch_or(ptr, n)  __atomic_fetch_or(ptr, n, __ATOMIC_SEQ_CST)
+#define atomic_fetch_xor(ptr, n) __atomic_fetch_xor(ptr, n, __ATOMIC_SEQ_CST)
 
 /* And even shorter names that return void.  */
 #define atomic_inc(ptr)    ((void) __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST))
@@ -169,6 +170,7 @@
 #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))
 #define atomic_and(ptr, n) ((void) __atomic_fetch_and(ptr, n, __ATOMIC_SEQ_CST))
 #define atomic_or(ptr, n)  ((void) __atomic_fetch_or(ptr, n, __ATOMIC_SEQ_CST))
+#define atomic_xor(ptr, n) ((void) __atomic_fetch_xor(ptr, n, __ATOMIC_SEQ_CST))
 
 #else /* __ATOMIC_RELAXED */
 
@@ -355,6 +357,7 @@
 #define atomic_fetch_sub       __sync_fetch_and_sub
 #define atomic_fetch_and       __sync_fetch_and_and
 #define atomic_fetch_or        __sync_fetch_and_or
+#define atomic_fetch_xor       __sync_fetch_and_xor
 #define atomic_cmpxchg         __sync_val_compare_and_swap
 
 /* And even shorter names that return void.  */
@@ -364,6 +367,7 @@
 #define atomic_sub(ptr, n)     ((void) __sync_fetch_and_sub(ptr, n))
 #define atomic_and(ptr, n)     ((void) __sync_fetch_and_and(ptr, n))
 #define atomic_or(ptr, n)      ((void) __sync_fetch_and_or(ptr, n))
+#define atomic_xor(ptr, n)     ((void) __sync_fetch_and_xor(ptr, n))
 
 #endif /* __ATOMIC_RELAXED */
 #endif /* __QEMU_ATOMIC_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 08/30] atomics: add atomic_op_fetch variants
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (6 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 07/30] atomics: add atomic_xor Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 09/30] softmmu: add atomic helpers Emilio G. Cota
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

This paves the way for upcoming work.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/atomic.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index a5531da..d75bfde 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -163,6 +163,14 @@
 #define atomic_fetch_or(ptr, n)  __atomic_fetch_or(ptr, n, __ATOMIC_SEQ_CST)
 #define atomic_fetch_xor(ptr, n) __atomic_fetch_xor(ptr, n, __ATOMIC_SEQ_CST)
 
+#define atomic_inc_fetch(ptr)    __atomic_add_fetch(ptr, 1, __ATOMIC_SEQ_CST)
+#define atomic_dec_fetch(ptr)    __atomic_sub_fetch(ptr, 1, __ATOMIC_SEQ_CST)
+#define atomic_add_fetch(ptr, n) __atomic_add_fetch(ptr, n, __ATOMIC_SEQ_CST)
+#define atomic_sub_fetch(ptr, n) __atomic_sub_fetch(ptr, n, __ATOMIC_SEQ_CST)
+#define atomic_and_fetch(ptr, n) __atomic_and_fetch(ptr, n, __ATOMIC_SEQ_CST)
+#define atomic_or_fetch(ptr, n)  __atomic_or_fetch(ptr, n, __ATOMIC_SEQ_CST)
+#define atomic_xor_fetch(ptr, n) __atomic_xor_fetch(ptr, n, __ATOMIC_SEQ_CST)
+
 /* And even shorter names that return void.  */
 #define atomic_inc(ptr)    ((void) __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST))
 #define atomic_dec(ptr)    ((void) __atomic_fetch_sub(ptr, 1, __ATOMIC_SEQ_CST))
@@ -358,6 +366,15 @@
 #define atomic_fetch_and       __sync_fetch_and_and
 #define atomic_fetch_or        __sync_fetch_and_or
 #define atomic_fetch_xor       __sync_fetch_and_xor
+
+#define atomic_inc_fetch(ptr)  __sync_add_and_fetch(ptr, 1)
+#define atomic_dec_fetch(ptr)  __sync_add_and_fetch(ptr, -1)
+#define atomic_add_fetch       __sync_add_and_fetch
+#define atomic_sub_fetch       __sync_sub_and_fetch
+#define atomic_and_fetch       __sync_and_and_fetch
+#define atomic_or_fetch        __sync_or_and_fetch
+#define atomic_xor_fetch       __sync_xor_and_fetch
+
 #define atomic_cmpxchg         __sync_val_compare_and_swap
 
 /* And even shorter names that return void.  */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 09/30] softmmu: add atomic helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (7 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 08/30] atomics: add atomic_op_fetch variants Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 10/30] cpu_ldst: add cpu_atomic helpers Emilio G. Cota
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 softmmu_template.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h          | 30 ++++++++++++++++++++++
 2 files changed, 105 insertions(+)

diff --git a/softmmu_template.h b/softmmu_template.h
index 7b519dc..fda92ff 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -606,6 +606,81 @@ glue(glue(helper_cmpxchg, SUFFIX),
     return atomic_cmpxchg((DATA_TYPE *)haddr, old, new);
 }
 
+#define GEN_ATOMIC_HELPER(NAME)                                         \
+DATA_TYPE                                                               \
+glue(glue(glue(helper_atomic_, NAME), SUFFIX),                          \
+     MMUSUFFIX)(CPUArchState *env, target_ulong addr, DATA_TYPE val,    \
+                TCGMemOpIdx oi, uintptr_t retaddr)                      \
+{                                                                       \
+    unsigned mmu_idx = get_mmuidx(oi);                                  \
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);        \
+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;  \
+    uintptr_t haddr;                                                    \
+                                                                        \
+    /* Adjust the given return address.  */                             \
+    retaddr -= GETPC_ADJ;                                               \
+                                                                        \
+    /* If the TLB entry is for a different page, reload and try again */\
+    if ((addr & TARGET_PAGE_MASK)                                       \
+        != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {        \
+        if (unlikely((addr & (DATA_SIZE - 1)) != 0                      \
+                     && (get_memop(oi) & MO_AMASK) == MO_ALIGN)) {      \
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,\
+                                 mmu_idx, retaddr);                     \
+        }                                                               \
+        if (!VICTIM_TLB_HIT(addr_write)) {                              \
+            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx,   \
+                     retaddr);                                          \
+        }                                                               \
+        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;           \
+    }                                                                   \
+                                                                        \
+    /* Handle an IO access.  */                                         \
+    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {                       \
+        /* XXX */                                                       \
+        abort();                                                        \
+    }                                                                   \
+                                                                        \
+    /* Handle slow unaligned access (it spans two pages or IO).  */     \
+    if (DATA_SIZE > 1                                                   \
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1          \
+                    >= TARGET_PAGE_SIZE)) {                             \
+        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {                   \
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE, \
+                                 mmu_idx, retaddr);                     \
+        }                                                               \
+    }                                                                   \
+                                                                        \
+    /* Handle aligned access or unaligned access in the same page.  */  \
+    if (unlikely((addr & (DATA_SIZE - 1)) != 0                          \
+                 && (get_memop(oi) & MO_AMASK) == MO_ALIGN)) {          \
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,    \
+                             mmu_idx, retaddr);                         \
+    }                                                                   \
+    /*                                                                  \
+     * If the host allows unaligned accesses, then let the compiler     \
+     * do its thing when performing the access on the host.             \
+     */                                                                 \
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;               \
+    return glue(atomic_, NAME)((DATA_TYPE *)haddr, val);                \
+}                                                                       \
+
+GEN_ATOMIC_HELPER(fetch_add)
+GEN_ATOMIC_HELPER(fetch_sub)
+GEN_ATOMIC_HELPER(fetch_and)
+GEN_ATOMIC_HELPER(fetch_or)
+GEN_ATOMIC_HELPER(fetch_xor)
+
+GEN_ATOMIC_HELPER(add_fetch)
+GEN_ATOMIC_HELPER(sub_fetch)
+GEN_ATOMIC_HELPER(and_fetch)
+GEN_ATOMIC_HELPER(or_fetch)
+GEN_ATOMIC_HELPER(xor_fetch)
+
+GEN_ATOMIC_HELPER(xchg)
+
+#undef GEN_ATOMIC_HELPER
+
 #endif /* !defined(SOFTMMU_CODE_ACCESS) */
 
 #undef READ_ACCESS_TYPE
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1c9c8bc..09aab4e 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -1122,6 +1122,36 @@ uint64_t helper_cmpxchgq_mmu(CPUArchState *env, target_ulong addr,
                              uint64_t old, uint64_t new,
                              TCGMemOpIdx oi, uintptr_t retaddr);
 
+#define GEN_ATOMIC_HELPER(NAME, TYPE, SUFFIX)                           \
+TYPE glue(glue(glue(helper_atomic_,                                     \
+                    NAME),                                              \
+               SUFFIX),                                                 \
+          _mmu)(CPUArchState *env, target_ulong addr, TYPE val,         \
+                TCGMemOpIdx oi, uintptr_t retaddr);
+
+#define GEN_ATOMIC_HELPER_ALL(NAME) \
+    GEN_ATOMIC_HELPER(NAME, uint8_t, b)   \
+    GEN_ATOMIC_HELPER(NAME, uint16_t, w)  \
+    GEN_ATOMIC_HELPER(NAME, uint32_t, l)  \
+    GEN_ATOMIC_HELPER(NAME, uint64_t, q)
+
+GEN_ATOMIC_HELPER_ALL(fetch_add)
+GEN_ATOMIC_HELPER_ALL(fetch_sub)
+GEN_ATOMIC_HELPER_ALL(fetch_and)
+GEN_ATOMIC_HELPER_ALL(fetch_or)
+GEN_ATOMIC_HELPER_ALL(fetch_xor)
+
+GEN_ATOMIC_HELPER_ALL(add_fetch)
+GEN_ATOMIC_HELPER_ALL(sub_fetch)
+GEN_ATOMIC_HELPER_ALL(and_fetch)
+GEN_ATOMIC_HELPER_ALL(or_fetch)
+GEN_ATOMIC_HELPER_ALL(xor_fetch)
+
+GEN_ATOMIC_HELPER_ALL(xchg)
+
+#undef GEN_ATOMIC_HELPER_ALL
+#undef GEN_ATOMIC_HELPER
+
 #endif /* CONFIG_SOFTMMU */
 
 #endif /* TCG_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 10/30] cpu_ldst: add cpu_atomic helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (8 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 09/30] softmmu: add atomic helpers Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 11/30] target-i386: add atomic helpers Emilio G. Cota
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu_atomic_template.h          | 41 +++++++++++++++++++++++++++++
 include/exec/cpu_atomic_useronly_template.h | 25 ++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/include/exec/cpu_atomic_template.h b/include/exec/cpu_atomic_template.h
index 13f4ffd..36c9593 100644
--- a/include/exec/cpu_atomic_template.h
+++ b/include/exec/cpu_atomic_template.h
@@ -54,3 +54,44 @@ glue(glue(cpu_cmpxchgo, MEMSUFFIX), _ra)(CPUArchState *env, target_ulong ptr,
     return ret;
 }
 #endif
+
+#define GEN_ATOMIC_HELPER(NAME)                                         \
+static inline DATA_TYPE                                                 \
+glue(glue(glue(glue(cpu_atomic_, NAME), SUFFIX), MEMSUFFIX),            \
+     _ra)(CPUArchState *env, target_ulong ptr, DATA_TYPE val, uintptr_t ra) \
+{                                                                       \
+    int page_index;                                                     \
+    target_ulong addr;                                                  \
+    int mmu_idx;                                                        \
+    TCGMemOpIdx oi;                                                     \
+                                                                        \
+    addr = ptr;                                                         \
+    page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);       \
+    mmu_idx = CPU_MMU_INDEX;                                            \
+    if (unlikely(env->tlb_table[mmu_idx][page_index].addr_write !=      \
+                 (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {      \
+        oi = make_memop_idx(SHIFT, mmu_idx);                            \
+        return glue(glue(glue(helper_atomic_, NAME), SUFFIX),           \
+                    MMUSUFFIX)(env, addr, val, oi, ra);                 \
+    } else {                                                            \
+        uintptr_t hostaddr = addr +                                     \
+            env->tlb_table[mmu_idx][page_index].addend;                 \
+                                                                        \
+        return glue(atomic_, NAME)((DATA_TYPE *)hostaddr, val);         \
+    }                                                                   \
+}
+
+GEN_ATOMIC_HELPER(fetch_add)
+GEN_ATOMIC_HELPER(fetch_sub)
+GEN_ATOMIC_HELPER(fetch_and)
+GEN_ATOMIC_HELPER(fetch_or)
+GEN_ATOMIC_HELPER(fetch_xor)
+
+GEN_ATOMIC_HELPER(add_fetch)
+GEN_ATOMIC_HELPER(sub_fetch)
+GEN_ATOMIC_HELPER(and_fetch)
+GEN_ATOMIC_HELPER(or_fetch)
+GEN_ATOMIC_HELPER(xor_fetch)
+
+GEN_ATOMIC_HELPER(xchg)
+#undef GEN_ATOMIC_HELPER
diff --git a/include/exec/cpu_atomic_useronly_template.h b/include/exec/cpu_atomic_useronly_template.h
index c7c0a3e..c4b447e 100644
--- a/include/exec/cpu_atomic_useronly_template.h
+++ b/include/exec/cpu_atomic_useronly_template.h
@@ -37,3 +37,28 @@ glue(glue(cpu_cmpxchgo, MEMSUFFIX),
     return ret;
 }
 #endif
+
+#define GEN_ATOMIC_HELPER(NAME)                                             \
+static inline DATA_TYPE                                                     \
+glue(glue(glue(glue(cpu_atomic_, NAME), SUFFIX), MEMSUFFIX),                \
+     _ra)(CPUArchState *env, target_ulong ptr, DATA_TYPE val, uintptr_t ra) \
+{                                                                           \
+    DATA_TYPE *hostaddr = g2h(ptr);                                         \
+                                                                            \
+    return glue(atomic_, NAME)(hostaddr, val);                              \
+}
+
+GEN_ATOMIC_HELPER(fetch_add)
+GEN_ATOMIC_HELPER(fetch_sub)
+GEN_ATOMIC_HELPER(fetch_and)
+GEN_ATOMIC_HELPER(fetch_or)
+GEN_ATOMIC_HELPER(fetch_xor)
+
+GEN_ATOMIC_HELPER(add_fetch)
+GEN_ATOMIC_HELPER(sub_fetch)
+GEN_ATOMIC_HELPER(and_fetch)
+GEN_ATOMIC_HELPER(or_fetch)
+GEN_ATOMIC_HELPER(xor_fetch)
+
+GEN_ATOMIC_HELPER(xchg)
+#undef GEN_ATOMIC_HELPER
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 11/30] target-i386: add atomic helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (9 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 10/30] cpu_ldst: add cpu_atomic helpers Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 20:27   ` Richard Henderson
  2016-06-27 19:01 ` [Qemu-devel] [RFC 12/30] target-i386: emulate LOCK'ed OP instructions using " Emilio G. Cota
                   ` (19 subsequent siblings)
  30 siblings, 1 reply; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

This patch only adds the helpers. Functions to invoke the helpers
from translated code are generated in subsequent patches.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/helper.h     | 34 ++++++++++++++++++++++++++++++++++
 target-i386/mem_helper.c | 38 ++++++++++++++++++++++++++++++++++++++
 target-i386/translate.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 117 insertions(+)

diff --git a/target-i386/helper.h b/target-i386/helper.h
index 2bb0d1f..df68204 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -84,6 +84,40 @@ DEF_HELPER_4(cmpxchgq, tl, env, tl, tl, tl)
 DEF_HELPER_2(cmpxchg16b, void, env, tl)
 DEF_HELPER_2(cmpxchg16b_unlocked, void, env, tl)
 #endif
+
+/*
+ * Use glue(foo, glue(bar, baz)) instead of glue(glue(foo, bar), baz), otherwise
+ * some gcc's (e.g. v4.6.3) can get confused with the surrounding DEF_HELPER.
+ */
+#ifndef TARGET_X86_64
+#define DEF_ATOMIC_ALL(NAME)                                    \
+    DEF_HELPER_3(glue(atomic_, glue(NAME, b)), tl, env, tl, tl) \
+    DEF_HELPER_3(glue(atomic_, glue(NAME, w)), tl, env, tl, tl) \
+    DEF_HELPER_3(glue(atomic_, glue(NAME, l)), tl, env, tl, tl)
+#else /* 64-bit */
+#define DEF_ATOMIC_ALL(NAME)                                    \
+    DEF_HELPER_3(glue(atomic_, glue(NAME, b)), tl, env, tl, tl) \
+    DEF_HELPER_3(glue(atomic_, glue(NAME, w)), tl, env, tl, tl) \
+    DEF_HELPER_3(glue(atomic_, glue(NAME, l)), tl, env, tl, tl) \
+    DEF_HELPER_3(glue(atomic_, glue(NAME, q)), tl, env, tl, tl)
+#endif
+
+DEF_ATOMIC_ALL(fetch_add)
+DEF_ATOMIC_ALL(fetch_and)
+DEF_ATOMIC_ALL(fetch_or)
+DEF_ATOMIC_ALL(fetch_sub)
+DEF_ATOMIC_ALL(fetch_xor)
+
+DEF_ATOMIC_ALL(add_fetch)
+DEF_ATOMIC_ALL(and_fetch)
+DEF_ATOMIC_ALL(or_fetch)
+DEF_ATOMIC_ALL(sub_fetch)
+DEF_ATOMIC_ALL(xor_fetch)
+
+DEF_ATOMIC_ALL(xchg)
+
+#undef DEF_ATOMIC_ALL
+
 DEF_HELPER_1(single_step, void, env)
 DEF_HELPER_1(cpuid, void, env)
 DEF_HELPER_1(rdtsc, void, env)
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index b002aba..13f4f3b 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -170,6 +170,44 @@ void helper_cmpxchg16b(CPUX86State *env, target_ulong a0)
 }
 #endif
 
+#define GEN_ATOMIC_HELPER(NAME)                                     \
+target_ulong                                                        \
+glue(helper_atomic_,                                                \
+     NAME)(CPUArchState *env, target_ulong addr, target_ulong val)  \
+{                                                                   \
+    return glue(glue(cpu_atomic_, NAME), _data_ra)(env, addr, val, GETPC()); \
+}
+
+#ifndef TARGET_X86_64
+#define GEN_ATOMIC_HELPER_ALL(NAME)              \
+    GEN_ATOMIC_HELPER(glue(NAME, b))             \
+    GEN_ATOMIC_HELPER(glue(NAME, w))             \
+    GEN_ATOMIC_HELPER(glue(NAME, l))
+#else /* 64-bit */
+#define GEN_ATOMIC_HELPER_ALL(NAME)              \
+    GEN_ATOMIC_HELPER(glue(NAME, b))             \
+    GEN_ATOMIC_HELPER(glue(NAME, w))             \
+    GEN_ATOMIC_HELPER(glue(NAME, l))             \
+    GEN_ATOMIC_HELPER(glue(NAME, q))
+#endif /* TARGET_X86_64 */
+
+GEN_ATOMIC_HELPER_ALL(fetch_add)
+GEN_ATOMIC_HELPER_ALL(fetch_and)
+GEN_ATOMIC_HELPER_ALL(fetch_or)
+GEN_ATOMIC_HELPER_ALL(fetch_sub)
+GEN_ATOMIC_HELPER_ALL(fetch_xor)
+
+GEN_ATOMIC_HELPER_ALL(add_fetch)
+GEN_ATOMIC_HELPER_ALL(and_fetch)
+GEN_ATOMIC_HELPER_ALL(or_fetch)
+GEN_ATOMIC_HELPER_ALL(sub_fetch)
+GEN_ATOMIC_HELPER_ALL(xor_fetch)
+
+GEN_ATOMIC_HELPER_ALL(xchg)
+
+#undef GEN_ATOMIC_HELPER
+#undef GEN_ATOMIC_HELPER_ALL
+
 void helper_boundw(CPUX86State *env, target_ulong a0, int v)
 {
     int low, high;
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 9abd82f..eead9d7 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -1271,6 +1271,51 @@ static void gen_cmpxchg(TCGv ret, TCGv addr, TCGv old, TCGv new, TCGMemOp ot)
     }
 }
 
+#ifndef TARGET_X86_64
+#define GEN_ATOMIC_HELPER(NAME)                                           \
+static void                                                               \
+glue(gen_atomic_, NAME)(TCGv ret, TCGv addr, TCGv reg, TCGMemOp ot)       \
+{                                                                         \
+    switch (ot & 3) {                                                     \
+    case 0:                                                               \
+        glue(glue(gen_helper_atomic_, NAME), b)(ret, cpu_env, addr, reg); \
+        break;                                                            \
+    case 1:                                                               \
+        glue(glue(gen_helper_atomic_, NAME), w)(ret, cpu_env, addr, reg); \
+        break;                                                            \
+    case 2:                                                               \
+        glue(glue(gen_helper_atomic_, NAME), l)(ret, cpu_env, addr, reg); \
+        break;                                                            \
+    default:                                                              \
+        tcg_abort();                                                      \
+    }                                                                     \
+}
+#else /* 64-bit */
+#define GEN_ATOMIC_HELPER(NAME)                                           \
+static void                                                               \
+glue(gen_atomic_, NAME)(TCGv ret, TCGv addr, TCGv reg, TCGMemOp ot)       \
+{                                                                         \
+    switch (ot & 3) {                                                     \
+    case 0:                                                               \
+        glue(glue(gen_helper_atomic_, NAME), b)(ret, cpu_env, addr, reg); \
+        break;                                                            \
+    case 1:                                                               \
+        glue(glue(gen_helper_atomic_, NAME), w)(ret, cpu_env, addr, reg); \
+        break;                                                            \
+    case 2:                                                               \
+        glue(glue(gen_helper_atomic_, NAME), l)(ret, cpu_env, addr, reg); \
+        break;                                                            \
+    case 3:                                                               \
+        glue(glue(gen_helper_atomic_, NAME), q)(ret, cpu_env, addr, reg); \
+        break;                                                            \
+    default:                                                              \
+        tcg_abort();                                                      \
+    }                                                                     \
+}
+#endif /* TARGET_X86_64 */
+
+#undef GEN_ATOMIC_HELPER
+
 /* if d == OR_TMP0, it means memory operand (address in A0) */
 static void gen_op(DisasContext *s1, int op, TCGMemOp ot, int d)
 {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 12/30] target-i386: emulate LOCK'ed OP instructions using atomic helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (10 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 11/30] target-i386: add atomic helpers Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:01 ` [Qemu-devel] [RFC 13/30] target-i386: emulate LOCK'ed INC using atomic helper Emilio G. Cota
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/translate.c | 87 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 69 insertions(+), 18 deletions(-)

diff --git a/target-i386/translate.c b/target-i386/translate.c
index eead9d7..1dc7014 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -1314,6 +1314,13 @@ glue(gen_atomic_, NAME)(TCGv ret, TCGv addr, TCGv reg, TCGMemOp ot)       \
 }
 #endif /* TARGET_X86_64 */
 
+GEN_ATOMIC_HELPER(fetch_sub)
+
+GEN_ATOMIC_HELPER(add_fetch)
+GEN_ATOMIC_HELPER(and_fetch)
+GEN_ATOMIC_HELPER(or_fetch)
+GEN_ATOMIC_HELPER(sub_fetch)
+GEN_ATOMIC_HELPER(xor_fetch)
 #undef GEN_ATOMIC_HELPER
 
 /* if d == OR_TMP0, it means memory operand (address in A0) */
@@ -1321,55 +1328,99 @@ static void gen_op(DisasContext *s1, int op, TCGMemOp ot, int d)
 {
     if (d != OR_TMP0) {
         gen_op_mov_v_reg(ot, cpu_T0, d);
-    } else {
+    } else if (!(s1->prefix & PREFIX_LOCK)) {
         gen_op_ld_v(s1, ot, cpu_T0, cpu_A0);
     }
     switch(op) {
     case OP_ADCL:
         gen_compute_eflags_c(s1, cpu_tmp4);
-        tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_T1);
-        tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_tmp4);
-        gen_op_st_rm_T0_A0(s1, ot, d);
+        if (s1->prefix & PREFIX_LOCK) {
+            TCGv t0;
+
+            t0 = tcg_temp_new();
+            tcg_gen_add_tl(t0, cpu_tmp4, cpu_T1);
+            gen_atomic_add_fetch(cpu_T0, cpu_A0, t0, ot);
+            tcg_temp_free(t0);
+        } else {
+            tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_T1);
+            tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_tmp4);
+            gen_op_st_rm_T0_A0(s1, ot, d);
+        }
         gen_op_update3_cc(cpu_tmp4);
         set_cc_op(s1, CC_OP_ADCB + ot);
         break;
     case OP_SBBL:
         gen_compute_eflags_c(s1, cpu_tmp4);
-        tcg_gen_sub_tl(cpu_T0, cpu_T0, cpu_T1);
-        tcg_gen_sub_tl(cpu_T0, cpu_T0, cpu_tmp4);
-        gen_op_st_rm_T0_A0(s1, ot, d);
+        if (s1->prefix & PREFIX_LOCK) {
+            TCGv t0;
+
+            t0 = tcg_temp_new();
+            tcg_gen_add_tl(t0, cpu_T1, cpu_tmp4);
+            gen_atomic_sub_fetch(cpu_T0, cpu_A0, t0, ot);
+            tcg_temp_free(t0);
+        } else {
+            tcg_gen_sub_tl(cpu_T0, cpu_T0, cpu_T1);
+            tcg_gen_sub_tl(cpu_T0, cpu_T0, cpu_tmp4);
+            gen_op_st_rm_T0_A0(s1, ot, d);
+        }
         gen_op_update3_cc(cpu_tmp4);
         set_cc_op(s1, CC_OP_SBBB + ot);
         break;
     case OP_ADDL:
-        tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_T1);
-        gen_op_st_rm_T0_A0(s1, ot, d);
+        if (s1->prefix & PREFIX_LOCK) {
+            gen_atomic_add_fetch(cpu_T0, cpu_A0, cpu_T1, ot);
+        } else {
+            tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_T1);
+            gen_op_st_rm_T0_A0(s1, ot, d);
+        }
         gen_op_update2_cc();
         set_cc_op(s1, CC_OP_ADDB + ot);
         break;
     case OP_SUBL:
-        tcg_gen_mov_tl(cpu_cc_srcT, cpu_T0);
-        tcg_gen_sub_tl(cpu_T0, cpu_T0, cpu_T1);
-        gen_op_st_rm_T0_A0(s1, ot, d);
+        if (s1->prefix & PREFIX_LOCK) {
+            TCGv t0;
+
+            t0 = tcg_temp_new();
+            gen_atomic_fetch_sub(t0, cpu_A0, cpu_T1, ot);
+            tcg_gen_mov_tl(cpu_cc_srcT, t0);
+            tcg_gen_sub_tl(cpu_T0, t0, cpu_T1);
+            tcg_temp_free(t0);
+        } else {
+            tcg_gen_mov_tl(cpu_cc_srcT, cpu_T0);
+            tcg_gen_sub_tl(cpu_T0, cpu_T0, cpu_T1);
+            gen_op_st_rm_T0_A0(s1, ot, d);
+        }
         gen_op_update2_cc();
         set_cc_op(s1, CC_OP_SUBB + ot);
         break;
     default:
     case OP_ANDL:
-        tcg_gen_and_tl(cpu_T0, cpu_T0, cpu_T1);
-        gen_op_st_rm_T0_A0(s1, ot, d);
+        if (s1->prefix & PREFIX_LOCK) {
+            gen_atomic_and_fetch(cpu_T0, cpu_A0, cpu_T1, ot);
+        } else {
+            tcg_gen_and_tl(cpu_T0, cpu_T0, cpu_T1);
+            gen_op_st_rm_T0_A0(s1, ot, d);
+        }
         gen_op_update1_cc();
         set_cc_op(s1, CC_OP_LOGICB + ot);
         break;
     case OP_ORL:
-        tcg_gen_or_tl(cpu_T0, cpu_T0, cpu_T1);
-        gen_op_st_rm_T0_A0(s1, ot, d);
+        if (s1->prefix & PREFIX_LOCK) {
+            gen_atomic_or_fetch(cpu_T0, cpu_A0, cpu_T1, ot);
+        } else {
+            tcg_gen_or_tl(cpu_T0, cpu_T0, cpu_T1);
+            gen_op_st_rm_T0_A0(s1, ot, d);
+        }
         gen_op_update1_cc();
         set_cc_op(s1, CC_OP_LOGICB + ot);
         break;
     case OP_XORL:
-        tcg_gen_xor_tl(cpu_T0, cpu_T0, cpu_T1);
-        gen_op_st_rm_T0_A0(s1, ot, d);
+        if (s1->prefix & PREFIX_LOCK) {
+            gen_atomic_xor_fetch(cpu_T0, cpu_A0, cpu_T1, ot);
+        } else {
+            tcg_gen_xor_tl(cpu_T0, cpu_T0, cpu_T1);
+            gen_op_st_rm_T0_A0(s1, ot, d);
+        }
         gen_op_update1_cc();
         set_cc_op(s1, CC_OP_LOGICB + ot);
         break;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 13/30] target-i386: emulate LOCK'ed INC using atomic helper
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (11 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 12/30] target-i386: emulate LOCK'ed OP instructions using " Emilio G. Cota
@ 2016-06-27 19:01 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 14/30] target-i386: emulate LOCK'ed NOT " Emilio G. Cota
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:01 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/translate.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/target-i386/translate.c b/target-i386/translate.c
index 1dc7014..a9fa25a 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -1433,9 +1433,30 @@ static void gen_op(DisasContext *s1, int op, TCGMemOp ot, int d)
     }
 }
 
+static void gen_inc_locked(DisasContext *s1, TCGMemOp ot, int c)
+{
+    TCGv t0;
+
+    t0 = tcg_temp_new();
+    tcg_gen_movi_tl(t0, 1);
+    gen_compute_eflags_c(s1, cpu_cc_src);
+    if (c > 0) {
+        gen_atomic_add_fetch(cpu_T0, cpu_A0, t0, ot);
+        set_cc_op(s1, CC_OP_INCB + ot);
+    } else {
+        gen_atomic_sub_fetch(cpu_T0, cpu_A0, t0, ot);
+        set_cc_op(s1, CC_OP_DECB + ot);
+    }
+    tcg_gen_mov_tl(cpu_cc_dst, cpu_T0);
+    tcg_temp_free(t0);
+}
+
 /* if d == OR_TMP0, it means memory operand (address in A0) */
 static void gen_inc(DisasContext *s1, TCGMemOp ot, int d, int c)
 {
+    if (s1->prefix & PREFIX_LOCK) {
+        return gen_inc_locked(s1, ot, c);
+    }
     if (d != OR_TMP0) {
         gen_op_mov_v_reg(ot, cpu_T0, d);
     } else {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 14/30] target-i386: emulate LOCK'ed NOT using atomic helper
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (12 preceding siblings ...)
  2016-06-27 19:01 ` [Qemu-devel] [RFC 13/30] target-i386: emulate LOCK'ed INC using atomic helper Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 15/30] target-i386: emulate LOCK'ed NEG using cmpxchg helper Emilio G. Cota
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/translate.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/target-i386/translate.c b/target-i386/translate.c
index a9fa25a..fcccb1a 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -4784,11 +4784,19 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             set_cc_op(s, CC_OP_LOGICB + ot);
             break;
         case 2: /* not */
-            tcg_gen_not_tl(cpu_T0, cpu_T0);
-            if (mod != 3) {
-                gen_op_st_v(s, ot, cpu_T0, cpu_A0);
+            if (s->prefix & PREFIX_LOCK) {
+                if (mod == 3) {
+                    goto illegal_op;
+                }
+                tcg_gen_movi_tl(cpu_T0, ~0);
+                gen_atomic_xor_fetch(cpu_T0, cpu_A0, cpu_T0, ot);
             } else {
-                gen_op_mov_reg_v(ot, rm, cpu_T0);
+                tcg_gen_not_tl(cpu_T0, cpu_T0);
+                if (mod != 3) {
+                    gen_op_st_v(s, ot, cpu_T0, cpu_A0);
+                } else {
+                    gen_op_mov_reg_v(ot, rm, cpu_T0);
+                }
             }
             break;
         case 3: /* neg */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 15/30] target-i386: emulate LOCK'ed NEG using cmpxchg helper
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (13 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 14/30] target-i386: emulate LOCK'ed NOT " Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 16/30] target-i386: emulate LOCK'ed XADD using atomic helper Emilio G. Cota
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/translate.c | 34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/target-i386/translate.c b/target-i386/translate.c
index fcccb1a..a5a633b 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -4800,11 +4800,37 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             }
             break;
         case 3: /* neg */
-            tcg_gen_neg_tl(cpu_T0, cpu_T0);
-            if (mod != 3) {
-                gen_op_st_v(s, ot, cpu_T0, cpu_A0);
+            if (s->prefix & PREFIX_LOCK) {
+                TCGLabel *label1;
+                TCGv a0, t0, t1, t2;
+
+                if (mod == 3) {
+                    goto illegal_op;
+                }
+                a0 = tcg_temp_local_new();
+                t0 = tcg_temp_local_new();
+                t1 = tcg_temp_local_new();
+                t2 = tcg_temp_new();
+                label1 = gen_new_label();
+
+                tcg_gen_mov_tl(a0, cpu_A0);
+                gen_set_label(label1);
+                gen_op_ld_v(s, ot, t0, a0);
+                tcg_gen_neg_tl(t1, t0);
+                gen_cmpxchg(t2, a0, t0, t1, ot);
+                tcg_gen_brcond_tl(TCG_COND_NE, t2, t0, label1);
+                tcg_temp_free(t2);
+                tcg_temp_free(a0);
+                tcg_temp_free(t1);
+                tcg_gen_mov_tl(cpu_T0, t0);
+                tcg_temp_free(t0);
             } else {
-                gen_op_mov_reg_v(ot, rm, cpu_T0);
+                tcg_gen_neg_tl(cpu_T0, cpu_T0);
+                if (mod != 3) {
+                    gen_op_st_v(s, ot, cpu_T0, cpu_A0);
+                } else {
+                    gen_op_mov_reg_v(ot, rm, cpu_T0);
+                }
             }
             gen_op_update_neg_cc();
             set_cc_op(s, CC_OP_SUBB + ot);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 16/30] target-i386: emulate LOCK'ed XADD using atomic helper
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (14 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 15/30] target-i386: emulate LOCK'ed NEG using cmpxchg helper Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 17/30] target-i386: emulate LOCK'ed BTX ops using atomic helpers Emilio G. Cota
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/translate.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/target-i386/translate.c b/target-i386/translate.c
index a5a633b..7df744e 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -1314,6 +1314,7 @@ glue(gen_atomic_, NAME)(TCGv ret, TCGv addr, TCGv reg, TCGMemOp ot)       \
 }
 #endif /* TARGET_X86_64 */
 
+GEN_ATOMIC_HELPER(fetch_add)
 GEN_ATOMIC_HELPER(fetch_sub)
 
 GEN_ATOMIC_HELPER(add_fetch)
@@ -5227,10 +5228,16 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             gen_op_mov_reg_v(ot, rm, cpu_T0);
         } else {
             gen_lea_modrm(env, s, modrm);
-            gen_op_mov_v_reg(ot, cpu_T0, reg);
-            gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
-            tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_T1);
-            gen_op_st_v(s, ot, cpu_T0, cpu_A0);
+            if (s->prefix & PREFIX_LOCK) {
+                gen_op_mov_v_reg(ot, cpu_T0, reg);
+                gen_atomic_fetch_add(cpu_T1, cpu_A0, cpu_T0, ot);
+                tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_T1);
+            } else {
+                gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
+                gen_op_mov_v_reg(ot, cpu_T0, reg);
+                tcg_gen_add_tl(cpu_T0, cpu_T0, cpu_T1);
+                gen_op_st_v(s, ot, cpu_T0, cpu_A0);
+            }
             gen_op_mov_reg_v(ot, reg, cpu_T1);
         }
         gen_op_update2_cc();
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 17/30] target-i386: emulate LOCK'ed BTX ops using atomic helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (15 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 16/30] target-i386: emulate LOCK'ed XADD using atomic helper Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 18/30] target-i386: emulate XCHG using atomic helper Emilio G. Cota
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/translate.c | 76 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 52 insertions(+), 24 deletions(-)

diff --git a/target-i386/translate.c b/target-i386/translate.c
index 7df744e..b76b0ae 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -1315,7 +1315,10 @@ glue(gen_atomic_, NAME)(TCGv ret, TCGv addr, TCGv reg, TCGMemOp ot)       \
 #endif /* TARGET_X86_64 */
 
 GEN_ATOMIC_HELPER(fetch_add)
+GEN_ATOMIC_HELPER(fetch_and)
+GEN_ATOMIC_HELPER(fetch_or)
 GEN_ATOMIC_HELPER(fetch_sub)
+GEN_ATOMIC_HELPER(fetch_xor)
 
 GEN_ATOMIC_HELPER(add_fetch)
 GEN_ATOMIC_HELPER(and_fetch)
@@ -6796,32 +6799,57 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
         }
     bt_op:
         tcg_gen_andi_tl(cpu_T1, cpu_T1, (1 << (3 + ot)) - 1);
-        tcg_gen_shr_tl(cpu_tmp4, cpu_T0, cpu_T1);
-        switch(op) {
-        case 0:
-            break;
-        case 1:
-            tcg_gen_movi_tl(cpu_tmp0, 1);
-            tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T1);
-            tcg_gen_or_tl(cpu_T0, cpu_T0, cpu_tmp0);
-            break;
-        case 2:
-            tcg_gen_movi_tl(cpu_tmp0, 1);
-            tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T1);
-            tcg_gen_andc_tl(cpu_T0, cpu_T0, cpu_tmp0);
-            break;
-        default:
-        case 3:
+        if (s->prefix & PREFIX_LOCK) {
+            TCGv t0;
+
+            t0 = tcg_temp_new();
             tcg_gen_movi_tl(cpu_tmp0, 1);
             tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T1);
-            tcg_gen_xor_tl(cpu_T0, cpu_T0, cpu_tmp0);
-            break;
-        }
-        if (op != 0) {
-            if (mod != 3) {
-                gen_op_st_v(s, ot, cpu_T0, cpu_A0);
-            } else {
-                gen_op_mov_reg_v(ot, rm, cpu_T0);
+            switch (op) {
+            case 1:
+                gen_atomic_fetch_or(t0, cpu_A0, cpu_tmp0, ot);
+                tcg_gen_or_tl(cpu_T0, t0, cpu_tmp0);
+                break;
+            case 2:
+                tcg_gen_not_tl(cpu_tmp0, cpu_tmp0);
+                gen_atomic_fetch_and(t0, cpu_A0, cpu_tmp0, ot);
+                tcg_gen_and_tl(cpu_T0, t0, cpu_tmp0);
+                break;
+            case 3:
+                gen_atomic_fetch_xor(t0, cpu_A0, cpu_tmp0, ot);
+                tcg_gen_xor_tl(cpu_T0, t0, cpu_tmp0);
+                break;
+            }
+            tcg_gen_shr_tl(cpu_tmp4, t0, cpu_T1);
+            tcg_temp_free(t0);
+        } else {
+            tcg_gen_shr_tl(cpu_tmp4, cpu_T0, cpu_T1);
+            switch (op) {
+            case 0:
+                break;
+            case 1:
+                tcg_gen_movi_tl(cpu_tmp0, 1);
+                tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T1);
+                tcg_gen_or_tl(cpu_T0, cpu_T0, cpu_tmp0);
+                break;
+            case 2:
+                tcg_gen_movi_tl(cpu_tmp0, 1);
+                tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T1);
+                tcg_gen_andc_tl(cpu_T0, cpu_T0, cpu_tmp0);
+                break;
+            default:
+            case 3:
+                tcg_gen_movi_tl(cpu_tmp0, 1);
+                tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T1);
+                tcg_gen_xor_tl(cpu_T0, cpu_T0, cpu_tmp0);
+                break;
+            }
+            if (op != 0) {
+                if (mod != 3) {
+                    gen_op_st_v(s, ot, cpu_T0, cpu_A0);
+                } else {
+                    gen_op_mov_reg_v(ot, rm, cpu_T0);
+                }
             }
         }
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 18/30] target-i386: emulate XCHG using atomic helper
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (16 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 17/30] target-i386: emulate LOCK'ed BTX ops using atomic helpers Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 19/30] tests: add atomic_add-bench Emilio G. Cota
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/translate.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/target-i386/translate.c b/target-i386/translate.c
index b76b0ae..48c6742 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -1325,6 +1325,8 @@ GEN_ATOMIC_HELPER(and_fetch)
 GEN_ATOMIC_HELPER(or_fetch)
 GEN_ATOMIC_HELPER(sub_fetch)
 GEN_ATOMIC_HELPER(xor_fetch)
+
+GEN_ATOMIC_HELPER(xchg)
 #undef GEN_ATOMIC_HELPER
 
 /* if d == OR_TMP0, it means memory operand (address in A0) */
@@ -5666,12 +5668,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             gen_lea_modrm(env, s, modrm);
             gen_op_mov_v_reg(ot, cpu_T0, reg);
             /* for xchg, lock is implicit */
-            if (!(prefixes & PREFIX_LOCK))
-                gen_helper_lock();
-            gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
-            gen_op_st_v(s, ot, cpu_T0, cpu_A0);
-            if (!(prefixes & PREFIX_LOCK))
-                gen_helper_unlock();
+            gen_atomic_xchg(cpu_T1, cpu_A0, cpu_T0, ot);
             gen_op_mov_reg_v(ot, reg, cpu_T1);
         }
         break;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 19/30] tests: add atomic_add-bench
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (17 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 18/30] target-i386: emulate XCHG using atomic helper Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 20/30] target-i386: remove helper_lock() Emilio G. Cota
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

With this microbenchmark we can measure the overhead of emulating atomic
instructions with a configurable degree of contention.

The benchmark spawns $n threads, each performing $o atomic ops (additions)
in a loop. Each atomic operation is performed on a different cache line
(assuming lines are 64b long) that is randomly selected from a range [0, $r).

[ Note: each $foo corresponds to a -foo flag ]

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore         |   1 +
 tests/Makefile.include   |   4 +-
 tests/atomic_add-bench.c | 180 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 184 insertions(+), 1 deletion(-)
 create mode 100644 tests/atomic_add-bench.c

diff --git a/tests/.gitignore b/tests/.gitignore
index 840ea39..52488a0 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -1,3 +1,4 @@
+atomic_add-bench
 check-qdict
 check-qfloat
 check-qint
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 6c09962..7421778 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -408,7 +408,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
 	tests/rcutorture.o tests/test-rcu-list.o \
 	tests/test-qdist.o \
-	tests/test-qht.o tests/qht-bench.o tests/test-qht-par.o
+	tests/test-qht.o tests/qht-bench.o tests/test-qht-par.o \
+	tests/atomic_add-bench.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -451,6 +452,7 @@ tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
 tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
 tests/test-qht-par$(EXESUF): tests/test-qht-par.o tests/qht-bench$(EXESUF) $(test-util-obj-y)
 tests/qht-bench$(EXESUF): tests/qht-bench.o $(test-util-obj-y)
+tests/atomic_add-bench$(EXESUF): tests/atomic_add-bench.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/atomic_add-bench.c b/tests/atomic_add-bench.c
new file mode 100644
index 0000000..5bbecf6
--- /dev/null
+++ b/tests/atomic_add-bench.c
@@ -0,0 +1,180 @@
+#include "qemu/osdep.h"
+#include "qemu/thread.h"
+#include "qemu/host-utils.h"
+#include "qemu/processor.h"
+
+struct thread_info {
+    uint64_t r;
+} QEMU_ALIGNED(64);
+
+struct count {
+    unsigned long val;
+} QEMU_ALIGNED(64);
+
+static QemuThread *threads;
+static struct thread_info *th_info;
+static unsigned int n_threads = 1;
+static unsigned int n_ready_threads;
+static struct count *counts;
+static unsigned long n_ops = 10000;
+static double duration;
+static unsigned int range = 1;
+static bool test_start;
+
+static const char commands_string[] =
+    " -n = number of threads\n"
+    " -o = number of ops per thread\n"
+    " -r = range (will be rounded up to pow2)";
+
+static void usage_complete(char *argv[])
+{
+    fprintf(stderr, "Usage: %s [options]\n", argv[0]);
+    fprintf(stderr, "options:\n%s\n", commands_string);
+}
+
+/*
+ * From: https://en.wikipedia.org/wiki/Xorshift
+ * This is faster than rand_r(), and gives us a wider range (RAND_MAX is only
+ * guaranteed to be >= INT_MAX).
+ */
+static uint64_t xorshift64star(uint64_t x)
+{
+    x ^= x >> 12; /* a */
+    x ^= x << 25; /* b */
+    x ^= x >> 27; /* c */
+    return x * UINT64_C(2685821657736338717);
+}
+
+static void *thread_func(void *arg)
+{
+    struct thread_info *info = arg;
+    unsigned long i;
+
+    atomic_inc(&n_ready_threads);
+    while (!atomic_mb_read(&test_start)) {
+        cpu_relax();
+    }
+
+    for (i = 0; i < n_ops; i++) {
+        unsigned int index;
+
+        info->r = xorshift64star(info->r);
+        index = info->r & (range - 1);
+        atomic_inc(&counts[index].val);
+    }
+    return NULL;
+}
+
+static inline
+uint64_t ts_subtract(const struct timespec *a, const struct timespec *b)
+{
+    uint64_t ns;
+
+    ns = (b->tv_sec - a->tv_sec) * 1000000000ULL;
+    ns += (b->tv_nsec - a->tv_nsec);
+    return ns;
+}
+
+static void run_test(void)
+{
+    unsigned int i;
+    struct timespec ts_start, ts_end;
+
+    while (atomic_read(&n_ready_threads) != n_threads) {
+        cpu_relax();
+    }
+    atomic_mb_set(&test_start, true);
+
+    clock_gettime(CLOCK_MONOTONIC, &ts_start);
+    for (i = 0; i < n_threads; i++) {
+        qemu_thread_join(&threads[i]);
+    }
+    clock_gettime(CLOCK_MONOTONIC, &ts_end);
+    duration = ts_subtract(&ts_start, &ts_end) / 1e9;
+}
+
+static void create_threads(void)
+{
+    unsigned int i;
+
+    threads = g_new(QemuThread, n_threads);
+    th_info = g_new(struct thread_info, n_threads);
+    counts = qemu_memalign(64, sizeof(*counts) * range);
+
+    for (i = 0; i < n_threads; i++) {
+        struct thread_info *info = &th_info[i];
+
+        info->r = (i + 1) ^ time(NULL);
+        qemu_thread_create(&threads[i], NULL, thread_func, info,
+                           QEMU_THREAD_JOINABLE);
+    }
+}
+
+static void pr_params(void)
+{
+    printf("Parameters:\n");
+    printf(" # of threads:      %u\n", n_threads);
+    printf(" n_ops:             %lu\n", n_ops);
+    printf(" ops' range:        %u\n", range);
+}
+
+static void pr_stats(void)
+{
+    unsigned long long val = 0;
+    unsigned int i;
+    double tx;
+
+    for (i = 0; i < range; i++) {
+        val += counts[i].val;
+    }
+    assert(val == n_threads * n_ops);
+    tx = val / duration / 1e6;
+
+    printf("Results:\n");
+    printf("Duration:            %.2f s\n", duration);
+    printf(" Throughput:         %.2f Mops/s\n", tx);
+    printf(" Throughput/thread:  %.2f Mops/s/thread\n", tx / n_threads);
+}
+
+static void parse_args(int argc, char *argv[])
+{
+    unsigned long long n_ops_ull;
+    int c;
+
+    for (;;) {
+        c = getopt(argc, argv, "hn:o:r:");
+        if (c < 0) {
+            break;
+        }
+        switch (c) {
+        case 'h':
+            usage_complete(argv);
+            exit(0);
+        case 'n':
+            n_threads = atoi(optarg);
+            break;
+        case 'o':
+            n_ops_ull = atoll(optarg);
+            if (n_ops_ull > ULONG_MAX) {
+                fprintf(stderr,
+                        "fatal: -o cannot be greater than %lu\n", ULONG_MAX);
+                exit(1);
+            }
+            n_ops = n_ops_ull;
+            break;
+        case 'r':
+            range = pow2ceil(atoi(optarg));
+            break;
+        }
+    }
+}
+
+int main(int argc, char *argv[])
+{
+    parse_args(argc, argv);
+    pr_params();
+    create_threads();
+    run_test();
+    pr_stats();
+    return 0;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 20/30] target-i386: remove helper_lock()
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (18 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 19/30] tests: add atomic_add-bench Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 21/30] target-arm: add cmpxchg helpers Emilio G. Cota
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

It's been superseded by the atomic helpers.

The use of the atomic helpers provides a significant performance and scalability
improvement. Below is the result of running the atomic_add-test microbenchmark with:
 $ x86_64-linux-user/qemu-x86_64 tests/atomic_add-bench -o 5000000 -r $r -n $n
, where $n is the number of threads and $r is the allowed range for the additions.

The scenarios measured are:
- atomic: implements x86' ADDL with the atomic_add helper (i.e. this patchset)
- cmpxchg: implement x86' ADDL with a TCG loop using the cmpxchg helper
- master: before this patchset

Results sorted in ascending range, i.e. descending degree of contention.
Y axis is Throughput in Mops/s. Tests are run on an AMD machine with 64
Opteron 6376 cores.

                atomic_add-bench: 5000000 ops/thread, [0,1] range

  25 ++---------+----------+---------+----------+----------+----------+---++
     + atomic +-E--+       +         +          +          +          +    |
     |cmpxchg +-H--+                                                       |
  20 +Emaster +-N--+                                                      ++
     ||                                                                    |
     |++                                                                   |
     ||                                                                    |
  15 +++                                                                  ++
     |N|                                                                   |
     |+|                                                                   |
  10 ++|                                                                  ++
     |+|+                                                                  |
     | |    -+E+------        +++  ---+E+------+E+------+E+-----+E+------+E|
     |+E+E+- +++     +E+------+E+--                                        |
   5 ++|+                                                                 ++
     |+N+H+---                                 +++                         |
     ++++N+--+H++----+++   +  +++  --++H+------+H+------+H++----+H+---+--- |
   0 ++---------+-----H----+---H-----+----------+----------+----------+---H+
     0          10         20        30         40         50         60
                                Number of threads

                atomic_add-bench: 5000000 ops/thread, [0,2] range

  25 ++---------+----------+---------+----------+----------+----------+---++
     ++atomic +-E--+       +         +          +          +          +    |
     |cmpxchg +-H--+                                                       |
  20 ++master +-N--+                                                      ++
     |E|                                                                   |
     |++                                                                   |
     ||E                                                                   |
  15 ++|                                                                  ++
     |N||                                                                  |
     |+||                                   ---+E+------+E+-----+E+------+E|
  10 ++| |        ---+E+------+E+-----+E+---                    +++      +++
     ||H+E+--+E+--                                                         |
     |+++++                                                                |
     | ||                                                                  |
   5 ++|+H+--                                  +++                        ++
     |+N+    -                              ---+H+------+H+------          |
     +  +N+--+H++----+H+---+--+H+----++H+---    +          +    +H+---+--+H|
   0 ++---------+----------+---------+----------+----------+----------+---++
     0          10         20        30         40         50         60
                                Number of threads

                atomic_add-bench: 5000000 ops/thread, [0,8] range

  40 ++---------+----------+---------+----------+----------+----------+---++
     ++atomic +-E--+       +         +          +          +          +    |
  35 +cmpxchg +-H--+                                                      ++
     | master +-N--+               ---+E+------+E+------+E+-----+E+------+E|
  30 ++|                   ---+E+--   +++                                 ++
     | |            -+E+---                                                |
  25 ++E        ---- +++                                                  ++
     |+++++ -+E+                                                           |
  20 +E+ E-- +++                                                          ++
     |H|+++                                                                |
     |+|                                       +H+-------                  |
  15 ++H+                                   ---+++      +H+------         ++
     |N++H+--                         +++---                    +H+------++|
  10 ++ +++  -       +++           ---+H+                       +++      +H+
     | |     +H+-----+H+------+H+--                                        |
   5 ++|                      +++                                         ++
     ++N+N+--+N++          +         +          +          +          +    |
   0 ++---------+----------+---------+----------+----------+----------+---++
     0          10         20        30         40         50         60
                                Number of threads

               atomic_add-bench: 5000000 ops/thread, [0,128] range

  160 ++---------+---------+----------+---------+----------+----------+---++
      + atomic +-E--+      +          +         +          +          +    |
  140 +cmpxchg +-H--+                          +++      +++               ++
      | master +-N--+                           E--------E------+E+------++|
  120 ++                                      --|        |      +++       E+
      |                                     -- +++      +++              ++|
  100 ++                                   -                              ++
      |                                +++-                     +++      ++|
   80 ++                              -+E+    -+H+------+H+------H--------++
      |                           ----    ----                  +++       H|
      |            ---+E+-----+E+-  ---+H+                               ++|
   60 ++     +E+---   +++  ---+H+---                                      ++
      |    --+++   ---+H+--                                                |
   40 ++ +E+-+H+---                                                       ++
      |  +H+                                                               |
   20 +EE+                                                                ++
      +N+        +         +          +         +          +          +    |
    0 ++N-N---N--+---------+----------+---------+----------+----------+---++
      0          10        20         30        40         50         60
                                Number of threads

              atomic_add-bench: 5000000 ops/thread, [0,1024] range

  350 ++---------+---------+----------+---------+----------+----------+---++
      + atomic +-E--+      +          +         +          +          +    |
  300 +cmpxchg +-H--+                                                    +++
      | master +-N--+                                           +++       ||
      |                                                 +++      |    ----E|
  250 ++                                                 |   ----E----    ++
      |                                              ----E---    |    ---+H|
  200 ++                                      -+E+---   +++  ---+H+---    ++
      |                                   ----         -+H+--              |
      |                                +E+     +++ ---- +++                |
  150 ++                            ---+++  ---+H+-                       ++
      |                          ---  -+H+--                               |
  100 ++                   ---+E+ ---- +++                                ++
      |      +++   ---+E+-----+H+-                                         |
      |     -+E+------+H+--                                                |
   50 ++ +E+                                                              ++
      +EE+       +         +          +         +          +          +    |
    0 ++N-N---N--+---------+----------+---------+----------+----------+---++
      0          10        20         30        40         50         60
                                Number of threads

  hi-res: http://imgur.com/a/fMRmq

For master I stopped measuring master after 8 threads, because there is little
point in measuring the well-known performance collapse of a contended lock.

Note that using atomic helpers instead of cmpxchg is not only more performant
(when the host implements natively the same atomics, as is in this case); it also
simplifies the necessary TCG/helper code of the implementation. Compare
the difference between the atomic and cmpxchg implementations:

     case OP_ADDL:
         if (s1->prefix & PREFIX_LOCK) {
-            gen_atomic_add_fetch(cpu_T0, cpu_A0, cpu_T1, ot);
+            TCGv t0, t1, t2, a0;
+            TCGLabel *retry;
+
+            t0 = tcg_temp_local_new();
+            t1 = tcg_temp_local_new();
+            t2 = tcg_temp_local_new();
+            a0 = tcg_temp_local_new();
+            retry = gen_new_label();
+
+            tcg_gen_mov_tl(a0, cpu_A0);
+            tcg_gen_mov_tl(t1, cpu_T1);
+            gen_set_label(retry);
+            gen_op_ld_v(s1, ot, cpu_T0, a0);
+            tcg_gen_add_tl(t2, cpu_T0, t1);
+            gen_cmpxchg(t0, a0, cpu_T0, t2, ot);
+            tcg_gen_brcond_tl(TCG_COND_NE, t0, cpu_T0, retry);
+
+            tcg_gen_mov_tl(cpu_T0, t2);
+            tcg_gen_mov_tl(cpu_T1, t1);
+
+            tcg_temp_free(t0);
+            tcg_temp_free(t1);
+            tcg_temp_free(t2);
+            tcg_temp_free(a0);
         } else {

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-i386/helper.h     |  2 --
 target-i386/mem_helper.c | 33 ---------------------------------
 target-i386/translate.c  | 15 ---------------
 3 files changed, 50 deletions(-)

diff --git a/target-i386/helper.h b/target-i386/helper.h
index df68204..17a750b 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -1,8 +1,6 @@
 DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 
-DEF_HELPER_0(lock, void)
-DEF_HELPER_0(unlock, void)
 DEF_HELPER_3(write_eflags, void, env, tl, i32)
 DEF_HELPER_1(read_eflags, tl, env)
 DEF_HELPER_2(divb_AL, void, env, tl)
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index 13f4f3b..949666a 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -23,39 +23,6 @@
 #include "exec/exec-all.h"
 #include "exec/cpu_ldst.h"
 
-/* broken thread support */
-
-#if defined(CONFIG_USER_ONLY)
-QemuMutex global_cpu_lock;
-
-void helper_lock(void)
-{
-    qemu_mutex_lock(&global_cpu_lock);
-}
-
-void helper_unlock(void)
-{
-    qemu_mutex_unlock(&global_cpu_lock);
-}
-
-void helper_lock_init(void)
-{
-    qemu_mutex_init(&global_cpu_lock);
-}
-#else
-void helper_lock(void)
-{
-}
-
-void helper_unlock(void)
-{
-}
-
-void helper_lock_init(void)
-{
-}
-#endif
-
 #define GEN_CMPXCHG_HELPER(NAME)                                      \
 target_ulong glue(helper_, NAME)(CPUX86State *env, target_ulong addr, \
                                  target_ulong old, target_ulong new)  \
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 48c6742..8365c83 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -4636,10 +4636,6 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
     s->aflag = aflag;
     s->dflag = dflag;
 
-    /* lock generation */
-    if (prefixes & PREFIX_LOCK)
-        gen_helper_lock();
-
     /* now check op code */
  reswitch:
     switch(b) {
@@ -8304,20 +8300,11 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
     default:
         goto unknown_op;
     }
-    /* lock generation */
-    if (s->prefix & PREFIX_LOCK)
-        gen_helper_unlock();
     return s->pc;
  illegal_op:
-    if (s->prefix & PREFIX_LOCK)
-        gen_helper_unlock();
-    /* XXX: ensure that no lock was generated */
     gen_illegal_opcode(s);
     return s->pc;
  unknown_op:
-    if (s->prefix & PREFIX_LOCK)
-        gen_helper_unlock();
-    /* XXX: ensure that no lock was generated */
     gen_unknown_opcode(env, s);
     return s->pc;
 }
@@ -8409,8 +8396,6 @@ void tcg_x86_init(void)
                                      offsetof(CPUX86State, bnd_regs[i].ub),
                                      bnd_regu_names[i]);
     }
-
-    helper_lock_init();
 }
 
 /* generate intermediate code for basic block 'tb'.  */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 21/30] target-arm: add cmpxchg helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (19 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 20/30] target-i386: remove helper_lock() Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 22/30] target-arm: emulate LL/SC using " Emilio G. Cota
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/helper.c | 35 +++++++++++++++++++++++++++++++++++
 target-arm/helper.h |  5 +++++
 2 files changed, 40 insertions(+)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index 1f9cdac..b38bfbd 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -9409,3 +9409,38 @@ uint32_t HELPER(crc32c)(uint32_t acc, uint32_t val, uint32_t bytes)
     /* Linux crc32c converts the output to one's complement.  */
     return crc32c(acc, buf, bytes) ^ 0xffffffff;
 }
+
+/* returns 0 on success; 1 otherwise */
+#define GEN_CMPXCHG(NAME)                                               \
+uint32_t                                                                \
+HELPER(NAME)(CPUARMState *env, uint32_t addr, uint64_t old64, uint32_t new) \
+{                                                                       \
+    uint32_t old = old64;                                               \
+    uint32_t read;                                                      \
+                                                                        \
+    read = glue(glue(cpu_, NAME), _data_ra)(env, addr, old, new, GETPC()); \
+    return read != old;                                                 \
+}
+GEN_CMPXCHG(cmpxchgb)
+GEN_CMPXCHG(cmpxchgw)
+GEN_CMPXCHG(cmpxchgl)
+#undef GEN_CMPXCHG
+
+/*
+ * Returns 0 on success; 1 otherwise.
+ * Bringing in @new_lo and @new_hi is unusual given that @old is 64-bit,
+ * but we do it to save a few TCG instructions.
+ */
+uint32_t HELPER(cmpxchgq)(CPUARMState *env, uint32_t addr, uint64_t old,
+                          uint32_t new_lo, uint32_t new_hi)
+{
+    uint64_t read;
+    uint64_t new;
+
+    new = new_hi;
+    new <<= 32;
+    new |= new_lo;
+
+    read = cpu_cmpxchgq_data_ra(env, addr, old, new, GETPC());
+    return read != old;
+}
diff --git a/target-arm/helper.h b/target-arm/helper.h
index 84aa637..f3d6f26 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -537,6 +537,11 @@ DEF_HELPER_2(dc_zva, void, env, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 
+DEF_HELPER_4(cmpxchgb, i32, env, i32, i64, i32)
+DEF_HELPER_4(cmpxchgw, i32, env, i32, i64, i32)
+DEF_HELPER_4(cmpxchgl, i32, env, i32, i64, i32)
+DEF_HELPER_5(cmpxchgq, i32, env, i32, i64, i32, i32)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 22/30] target-arm: emulate LL/SC using cmpxchg helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (20 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 21/30] target-arm: add cmpxchg helpers Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 23/30] target-arm: add atomic_xchg helper Emilio G. Cota
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Emulating LL/SC with cmpxchg is not correct, since it can
suffer from the ABA problem. Portable parallel code, however,
is written assuming only cmpxchg--and not LL/SC--is available.
This means that in practice emulating LL/SC with cmpxchg is
a viable alternative.

The appended emulates LL/SC pairs in ARM with cmpxchg helpers.
This works in both user and system mode. In usermode, it avoids
pausing all other CPUs to perform the LL/SC pair. The subsequent
performance and scalability improvement is significant, as the
plots below show. They plot the throughput of atomic_add-bench
compiled for ARM and executed on a 64-core x86 machine.

Hi-res plots: http://imgur.com/a/aNQpB

               atomic_add-bench: 1000000 ops/thread, [0,1] range

  9 ++---------+----------+----------+----------+----------+----------+---++
    +cmpxchg +-E--+       +          +          +          +          +    |
  8 +Emaster +-H--+                                                       ++
    | |                                                                    |
  7 ++E                                                                   ++
    | |                                                                    |
  6 ++++                                                                  ++
    |  |                                                                   |
  5 ++ |                                                                  ++
  4 ++ |                                                                  ++
    |  |                                                                   |
  3 ++ |                                                                  ++
    |   |                                                                  |
  2 ++  |                                                                 ++
    |H++E+---                                  +++  ---+E+------+E+------+E|
  1 +++     +E+-----+E+------+E+------+E+------+E+--   +++      +++       ++
    ++H+       +    +++   +  +++     ++++       +          +          +    |
  0 ++--H----H-+-----H----+----------+----------+----------+----------+---++
    0          10         20         30         40         50         60
                               Number of threads

                atomic_add-bench: 1000000 ops/thread, [0,2] range

  16 ++---------+----------+---------+----------+----------+----------+---++
     +cmpxchg +-E--+       +         +          +          +          +    |
  14 ++master +-H--+                                                      ++
     | |                                                                   |
  12 ++|                                                                  ++
     | E                                                                   |
  10 ++|                                                                  ++
     | |                                                                   |
   8 ++++                                                                 ++
     |E+|                                                                  |
     |  |                                                                  |
   6 ++ |                                                                 ++
     |   |                                                                 |
   4 ++  |                                                                ++
     |  +E+---       +++      +++              +++           ---+E+------+E|
   2 +H+     +E+------E-------+E+-----+E+------+E+------+E+--            +++
     + |        +    +++   +         ++++       +          +          +    |
   0 ++H-H----H-+-----H----+---------+----------+----------+----------+---++
     0          10         20        30         40         50         60
                                Number of threads

               atomic_add-bench: 1000000 ops/thread, [0,128] range

  70 ++---------+----------+---------+----------+----------+----------+---++
     +cmpxchg +-E--+       +         +          +       ++++          +    |
  60 ++master +-H--+                                 ----E------+E+-------++
     |                                        -+E+---   +++     +++      +E|
     |                                +++ ---- +++                       ++|
  50 ++                       +++  ---+E+-                                ++
     |                        -E---                                        |
  40 ++                    ---+++                                         ++
     |               +++---                                                |
     |              -+E+                                                   |
  30 ++      +++----                                                      ++
     |       +E+                                                           |
  20 ++ +++--                                                             ++
     |  +E+                                                                |
     |+E+                                                                  |
  10 +E+                                                                  ++
     +          +          +         +          +          +          +    |
   0 +HH-H----H-+-----H----+---------+----------+----------+----------+---++
     0          10         20        30         40         50         60
                                Number of threads

              atomic_add-bench: 1000000 ops/thread, [0,1024] range

  120 ++---------+---------+----------+---------+----------+----------+---++
      +cmpxchg +-E--+      +          +         +          +          +    |
      | master +-H--+                                                    ++|
  100 ++                                                              ----E+
      |                                                 +++  ---+E+---   ++|
      |                                                --E---   +++        |
   80 ++                                           ---- +++               ++
      |                                     ---+E+-                        |
   60 ++                              -+E+--                              ++
      |                       +++ ---- +++                                 |
      |                      -+E+-                                         |
   40 ++              +++----                                             ++
      |      +++   ---+E+                                                  |
      |     -+E+---                                                        |
   20 ++ +E+                                                              ++
      |+E+++                                                               |
      +E+        +         +          +         +          +          +    |
    0 +HH-H---H--+-----H---+----------+---------+----------+----------+---++
      0          10        20         30        40         50         60
                                Number of threads

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/translate.c | 80 +++++++++-----------------------------------------
 1 file changed, 14 insertions(+), 66 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index bd5d5cb..0d4a1a9 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -7715,15 +7715,6 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
     tcg_gen_or_i32(cpu_ZF, lo, hi);
 }
 
-/* Load/Store exclusive instructions are implemented by remembering
-   the value/address loaded, and seeing if these are the same
-   when the store is performed. This should be sufficient to implement
-   the architecturally mandated semantics, and avoids having to monitor
-   regular stores.
-
-   In system emulation mode only one CPU will be running at once, so
-   this sequence is effectively atomic.  In user emulation mode we
-   throw an exception and handle the atomic operation elsewhere.  */
 static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
                                TCGv_i32 addr, int size)
 {
@@ -7768,21 +7759,11 @@ static void gen_clrex(DisasContext *s)
     tcg_gen_movi_i64(cpu_exclusive_addr, -1);
 }
 
-#ifdef CONFIG_USER_ONLY
-static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
-                                TCGv_i32 addr, int size)
-{
-    tcg_gen_extu_i32_i64(cpu_exclusive_test, addr);
-    tcg_gen_movi_i32(cpu_exclusive_info,
-                     size | (rd << 4) | (rt << 8) | (rt2 << 12));
-    gen_exception_internal_insn(s, 4, EXCP_STREX);
-}
-#else
 static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
                                 TCGv_i32 addr, int size)
 {
-    TCGv_i32 tmp;
-    TCGv_i64 val64, extaddr;
+    TCGv_i32 t0, t1, t2;
+    TCGv_i64 extaddr;
     TCGLabel *done_label;
     TCGLabel *fail_label;
 
@@ -7799,69 +7780,36 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, extaddr, cpu_exclusive_addr, fail_label);
     tcg_temp_free_i64(extaddr);
 
-    tmp = tcg_temp_new_i32();
+    t0 = tcg_temp_new_i32();
+    t1 = load_reg(s, rt);
     switch (size) {
     case 0:
-        gen_aa32_ld8u(s, tmp, addr, get_mem_index(s));
+        gen_helper_cmpxchgb(t0, cpu_env, addr, cpu_exclusive_val, t1);
         break;
     case 1:
-        gen_aa32_ld16u(s, tmp, addr, get_mem_index(s));
+        gen_helper_cmpxchgw(t0, cpu_env, addr, cpu_exclusive_val, t1);
         break;
     case 2:
-    case 3:
-        gen_aa32_ld32u(s, tmp, addr, get_mem_index(s));
-        break;
-    default:
-        abort();
-    }
-
-    val64 = tcg_temp_new_i64();
-    if (size == 3) {
-        TCGv_i32 tmp2 = tcg_temp_new_i32();
-        TCGv_i32 tmp3 = tcg_temp_new_i32();
-        tcg_gen_addi_i32(tmp2, addr, 4);
-        gen_aa32_ld32u(s, tmp3, tmp2, get_mem_index(s));
-        tcg_temp_free_i32(tmp2);
-        tcg_gen_concat_i32_i64(val64, tmp, tmp3);
-        tcg_temp_free_i32(tmp3);
-    } else {
-        tcg_gen_extu_i32_i64(val64, tmp);
-    }
-    tcg_temp_free_i32(tmp);
-
-    tcg_gen_brcond_i64(TCG_COND_NE, val64, cpu_exclusive_val, fail_label);
-    tcg_temp_free_i64(val64);
-
-    tmp = load_reg(s, rt);
-    switch (size) {
-    case 0:
-        gen_aa32_st8(s, tmp, addr, get_mem_index(s));
+        gen_helper_cmpxchgl(t0, cpu_env, addr, cpu_exclusive_val, t1);
         break;
-    case 1:
-        gen_aa32_st16(s, tmp, addr, get_mem_index(s));
-        break;
-    case 2:
     case 3:
-        gen_aa32_st32(s, tmp, addr, get_mem_index(s));
+        t2 = load_reg(s, rt2);
+        gen_helper_cmpxchgq(t0, cpu_env, addr, cpu_exclusive_val, t1, t2);
+        tcg_temp_free_i32(t2);
         break;
     default:
         abort();
     }
-    tcg_temp_free_i32(tmp);
-    if (size == 3) {
-        tcg_gen_addi_i32(addr, addr, 4);
-        tmp = load_reg(s, rt2);
-        gen_aa32_st32(s, tmp, addr, get_mem_index(s));
-        tcg_temp_free_i32(tmp);
-    }
-    tcg_gen_movi_i32(cpu_R[rd], 0);
+    tcg_temp_free_i32(t1);
+    tcg_gen_mov_i32(cpu_R[rd], t0);
+    tcg_temp_free_i32(t0);
     tcg_gen_br(done_label);
+
     gen_set_label(fail_label);
     tcg_gen_movi_i32(cpu_R[rd], 1);
     gen_set_label(done_label);
     tcg_gen_movi_i64(cpu_exclusive_addr, -1);
 }
-#endif
 
 /* gen_srs:
  * @env: CPUARMState
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 23/30] target-arm: add atomic_xchg helper
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (21 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 22/30] target-arm: emulate LL/SC using " Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 24/30] target-arm: emulate SWP with " Emilio G. Cota
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/helper.c | 10 ++++++++++
 target-arm/helper.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index b38bfbd..adab296 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -9444,3 +9444,13 @@ uint32_t HELPER(cmpxchgq)(CPUARMState *env, uint32_t addr, uint64_t old,
     read = cpu_cmpxchgq_data_ra(env, addr, old, new, GETPC());
     return read != old;
 }
+
+uint32_t HELPER(atomic_xchgb)(CPUARMState *env, uint32_t addr, uint32_t val)
+{
+    return cpu_atomic_xchgb_data_ra(env, addr, val, GETPC());
+}
+
+uint32_t HELPER(atomic_xchgl)(CPUARMState *env, uint32_t addr, uint32_t val)
+{
+    return cpu_atomic_xchgl_data_ra(env, addr, val, GETPC());
+}
diff --git a/target-arm/helper.h b/target-arm/helper.h
index f3d6f26..7a36bb3 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -542,6 +542,9 @@ DEF_HELPER_4(cmpxchgw, i32, env, i32, i64, i32)
 DEF_HELPER_4(cmpxchgl, i32, env, i32, i64, i32)
 DEF_HELPER_5(cmpxchgq, i32, env, i32, i64, i32, i32)
 
+DEF_HELPER_3(atomic_xchgb, i32, env, i32, i32)
+DEF_HELPER_3(atomic_xchgl, i32, env, i32, i32)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 24/30] target-arm: emulate SWP with atomic_xchg helper
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (22 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 23/30] target-arm: add atomic_xchg helper Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 25/30] helper: add DEF_HELPER_6 Emilio G. Cota
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/translate.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 0d4a1a9..b177388 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -8783,18 +8783,13 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
                         /* SWP instruction */
                         rm = (insn) & 0xf;
 
-                        /* ??? This is not really atomic.  However we know
-                           we never have multiple CPUs running in parallel,
-                           so it is good enough.  */
                         addr = load_reg(s, rn);
                         tmp = load_reg(s, rm);
                         tmp2 = tcg_temp_new_i32();
                         if (insn & (1 << 22)) {
-                            gen_aa32_ld8u(s, tmp2, addr, get_mem_index(s));
-                            gen_aa32_st8(s, tmp, addr, get_mem_index(s));
+                            gen_helper_atomic_xchgb(tmp2, cpu_env, addr, tmp);
                         } else {
-                            gen_aa32_ld32u(s, tmp2, addr, get_mem_index(s));
-                            gen_aa32_st32(s, tmp, addr, get_mem_index(s));
+                            gen_helper_atomic_xchgl(tmp2, cpu_env, addr, tmp);
                         }
                         tcg_temp_free_i32(tmp);
                         tcg_temp_free_i32(addr);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 25/30] helper: add DEF_HELPER_6
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (23 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 24/30] target-arm: emulate SWP with " Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 26/30] target-arm: add cmpxchg helpers for aarch64 Emilio G. Cota
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/helper-gen.h   | 11 +++++++++++
 include/exec/helper-head.h  |  2 ++
 include/exec/helper-proto.h |  5 +++++
 include/exec/helper-tcg.h   |  7 +++++++
 4 files changed, 25 insertions(+)

diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 0d0da3a..050974d 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -56,6 +56,16 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
   tcg_gen_callN(&tcg_ctx, HELPER(name), dh_retvar(ret), 5, args);       \
 }
 
+#define DEF_HELPER_FLAGS_6(name, flags, ret, t1, t2, t3, t4, t5, t6)    \
+static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
+    dh_arg_decl(t1, 1),  dh_arg_decl(t2, 2), dh_arg_decl(t3, 3),        \
+    dh_arg_decl(t4, 4), dh_arg_decl(t5, 5), dh_arg_decl(t6, 6))         \
+{                                                                       \
+  TCGArg args[6] = { dh_arg(t1, 1), dh_arg(t2, 2), dh_arg(t3, 3),       \
+                     dh_arg(t4, 4), dh_arg(t5, 5), dh_arg(t6, 6) };     \
+  tcg_gen_callN(&tcg_ctx, HELPER(name), dh_retvar(ret), 6, args);       \
+}
+
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "trace/generated-helpers-wrappers.h"
@@ -67,6 +77,7 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
 #undef DEF_HELPER_FLAGS_3
 #undef DEF_HELPER_FLAGS_4
 #undef DEF_HELPER_FLAGS_5
+#undef DEF_HELPER_FLAGS_6
 #undef GEN_HELPER
 
 #endif /* HELPER_GEN_H */
diff --git a/include/exec/helper-head.h b/include/exec/helper-head.h
index 74f8f03..82d9b4e 100644
--- a/include/exec/helper-head.h
+++ b/include/exec/helper-head.h
@@ -130,6 +130,8 @@
     DEF_HELPER_FLAGS_4(name, 0, ret, t1, t2, t3, t4)
 #define DEF_HELPER_5(name, ret, t1, t2, t3, t4, t5) \
     DEF_HELPER_FLAGS_5(name, 0, ret, t1, t2, t3, t4, t5)
+#define DEF_HELPER_6(name, ret, t1, t2, t3, t4, t5, t6) \
+    DEF_HELPER_FLAGS_6(name, 0, ret, t1, t2, t3, t4, t5, t6)
 
 /* MAX_OPC_PARAM_IARGS must be set to n if last entry is DEF_HELPER_FLAGS_n. */
 
diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h
index effdd43..3f32d06 100644
--- a/include/exec/helper-proto.h
+++ b/include/exec/helper-proto.h
@@ -26,6 +26,10 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
 dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
                             dh_ctype(t4), dh_ctype(t5));
 
+#define DEF_HELPER_FLAGS_6(name, flags, ret, t1, t2, t3, t4, t5, t6)  \
+dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
+                            dh_ctype(t4), dh_ctype(t5), dh_ctype(t6));
+
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
@@ -36,5 +40,6 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
 #undef DEF_HELPER_FLAGS_3
 #undef DEF_HELPER_FLAGS_4
 #undef DEF_HELPER_FLAGS_5
+#undef DEF_HELPER_FLAGS_6
 
 #endif /* HELPER_PROTO_H */
diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h
index 79fa3c8..6bdb152 100644
--- a/include/exec/helper-tcg.h
+++ b/include/exec/helper-tcg.h
@@ -35,6 +35,12 @@
     | dh_sizemask(t2, 2) | dh_sizemask(t3, 3) | dh_sizemask(t4, 4) \
     | dh_sizemask(t5, 5) },
 
+#define DEF_HELPER_FLAGS_6(NAME, FLAGS, ret, t1, t2, t3, t4, t5, t6) \
+  { .func = HELPER(NAME), .name = #NAME, .flags = FLAGS, \
+    .sizemask = dh_sizemask(ret, 0) | dh_sizemask(t1, 1) \
+    | dh_sizemask(t2, 2) | dh_sizemask(t3, 3) | dh_sizemask(t4, 4) \
+    | dh_sizemask(t5, 5) | dh_sizemask(t6, 6) },
+
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
@@ -45,5 +51,6 @@
 #undef DEF_HELPER_FLAGS_3
 #undef DEF_HELPER_FLAGS_4
 #undef DEF_HELPER_FLAGS_5
+#undef DEF_HELPER_FLAGS_6
 
 #endif /* HELPER_TCG_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 26/30] target-arm: add cmpxchg helpers for aarch64
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (24 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 25/30] helper: add DEF_HELPER_6 Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 27/30] target-arm: emulate aarch64's LL/SC using cmpxchg helpers Emilio G. Cota
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/helper-a64.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++
 target-arm/helper-a64.h |  6 ++++++
 2 files changed, 56 insertions(+)

diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 41e48a4..f859e8e 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -19,6 +19,8 @@
 
 #include "qemu/osdep.h"
 #include "cpu.h"
+#include "exec/exec-all.h"
+#include "exec/cpu_ldst.h"
 #include "exec/gdbstub.h"
 #include "exec/helper-proto.h"
 #include "qemu/host-utils.h"
@@ -444,3 +446,51 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, uint32_t bytes)
     /* Linux crc32c converts the output to one's complement.  */
     return crc32c(acc, buf, bytes) ^ 0xffffffff;
 }
+
+/* returns 0 on success; 1 otherwise */
+#define GEN_CMPXCHG64(SUFFIX)                                           \
+uint64_t                                                                \
+HELPER(glue(cmpxchg64, SUFFIX))(CPUARMState *env, uint64_t addr, uint64_t old, \
+                                uint64_t new)                           \
+{                                                                       \
+    uint64_t read;                                                      \
+                                                                        \
+    read = glue(glue(glue(cpu_, cmpxchg), SUFFIX),                      \
+                _data_ra)(env, addr, old, new, GETPC());                \
+    return read != old;                                                 \
+}
+
+GEN_CMPXCHG64(b)
+GEN_CMPXCHG64(w)
+GEN_CMPXCHG64(l)
+GEN_CMPXCHG64(q)
+#undef GEN_CMPXCHG64
+
+/* returns 0 on success; 1 otherwise */
+uint64_t
+HELPER(paired_cmpxchg64l)(CPUARMState *env, uint64_t addr, uint64_t old_lo,
+                          uint64_t old_hi, uint64_t new_lo, uint64_t new_hi)
+{
+    uint64_t old, new;
+    uint64_t read;
+
+    old = old_hi;
+    old <<= 32;
+    old |= old_lo;
+
+    new = new_hi;
+    new <<= 32;
+    new |= new_lo;
+
+    read = cpu_cmpxchgq_data_ra(env, addr, old, new, GETPC());
+    return read != old;
+}
+
+/* returns 0 on success; 1 otherwise */
+uint64_t
+HELPER(paired_cmpxchg64q)(CPUARMState *env, uint64_t addr, uint64_t old_lo,
+                          uint64_t old_hi, uint64_t new_lo, uint64_t new_hi)
+{
+    return !cpu_cmpxchgo_data_ra(env, addr, &old_lo, &old_hi, new_lo, new_hi,
+                                 GETPC());
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index 1d3d10f..e7880cf 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -46,3 +46,9 @@ DEF_HELPER_FLAGS_2(frecpx_f32, TCG_CALL_NO_RWG, f32, f32, ptr)
 DEF_HELPER_FLAGS_2(fcvtx_f64_to_f32, TCG_CALL_NO_RWG, f32, f64, env)
 DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
+DEF_HELPER_4(cmpxchg64b, i64, env, i64, i64, i64)
+DEF_HELPER_4(cmpxchg64w, i64, env, i64, i64, i64)
+DEF_HELPER_4(cmpxchg64l, i64, env, i64, i64, i64)
+DEF_HELPER_4(cmpxchg64q, i64, env, i64, i64, i64)
+DEF_HELPER_6(paired_cmpxchg64l, i64, env, i64, i64, i64, i64, i64)
+DEF_HELPER_6(paired_cmpxchg64q, i64, env, i64, i64, i64, i64, i64)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 27/30] target-arm: emulate aarch64's LL/SC using cmpxchg helpers
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (25 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 26/30] target-arm: add cmpxchg helpers for aarch64 Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 28/30] linux-user: remove handling of ARM's EXCP_STREX Emilio G. Cota
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

Emulating LL/SC with cmpxchg is not correct, since it can
suffer from the ABA problem. Portable parallel code, however,
is written assuming only cmpxchg--and not LL/SC--is available.
This means that in practice emulating LL/SC with cmpxchg is
a viable alternative.

The appended emulates LL/SC pairs in aarch64 with cmpxchg helpers.
This works in both user and system mode. In usermode, it avoids
pausing all other CPUs to perform the LL/SC pair. The subsequent
performance and scalability improvement is significant, as the
plots below show. They plot the throughput of atomic_add-bench
compiled for ARM and executed on a 64-core x86 machine.

Hi-res plots: http://imgur.com/a/JVc8Y

                atomic_add-bench: 1000000 ops/thread, [0,1] range

  18 ++---------+----------+---------+----------+----------+----------+---++
     +cmpxchg +-E--+       +         +          +          +          +    |
  16 ++master +-H--+                                                      ++
     ||                                                                    |
  14 ++                                                                   ++
     | |                                                                   |
  12 ++|                                                                  ++
     | |                                                                   |
  10 ++++                                                                 ++
   8 ++E                                                                  ++
     |+++                                                                  |
   6 ++ |                                                                 ++
     |  |                                                                  |
   4 ++ |                                                                 ++
     |   |                                                                 |
   2 +H++E+---                                                            ++
     + |     +E++----+E+---+--+E+----++E+------+E+------+E++----+E+---+--+E|
   0 ++H-H----H-+-----H----+---------+----------+----------+----------+---++
     0          10         20        30         40         50         60
                                Number of threads

                atomic_add-bench: 1000000 ops/thread, [0,2] range

  18 ++---------+----------+---------+----------+----------+----------+---++
     +cmpxchg +-E--+       +         +          +          +          +    |
  16 ++master +-H--+                                                      ++
     | |                                                                   |
  14 ++E                                                                  ++
     | |                                                                   |
  12 ++|                                                                  ++
     |+++                                                                  |
  10 ++ |                                                                 ++
   8 ++ |                                                                 ++
     |  |                                                                  |
   6 ++ |                                                                 ++
     |   |                                                                 |
   4 ++  |                                                                ++
     |  +E+---                                                             |
   2 +H+     +E+-----+++              +++      +++   ---+E+-----+E+------+++
     +++        +    +E+---+--+E+----++E+------+E+---   ++++    +++   +  +E|
   0 ++H-H----H-+-----H----+---------+----------+----------+----------+---++
     0          10         20        30         40         50         60
                                Number of threads

               atomic_add-bench: 1000000 ops/thread, [0,128] range

  70 ++---------+----------+---------+----------+----------+----------+---++
     +cmpxchg +-E--+       +         +          +          +          +    |
  60 ++master +-H--+                  +++            ---+E+-----+E+------+E+
     |                        +E+------E-------+E+---                      |
     |                     ---        +++                                  |
  50 ++              +++---                                               ++
     |              -+E+                                                   |
  40 ++      +++----                                                      ++
     |        E-                                                           |
     |      --|                                                            |
  30 ++   -- +++                                                          ++
     |  +E+                                                                |
  20 ++E+                                                                 ++
     |E+                                                                   |
     |                                                                     |
  10 ++                                                                   ++
     +          +          +         +          +          +          +    |
   0 +HH-H----H-+-----H----+---------+----------+----------+----------+---++
     0          10         20        30         40         50         60
                                Number of threads

              atomic_add-bench: 1000000 ops/thread, [0,1024] range

  160 ++---------+---------+----------+---------+----------+----------+---++
      +cmpxchg +-E--+      +          +         +          +          +    |
  140 ++master +-H--+                                           +++      +++
      |                                                -+E+-----+E+-------E|
  120 ++                                       +++ ----                  +++
      |                                +++  ----E--                        |
  100 ++                              --E---   +++                        ++
      |                       +++ ---- +++                                 |
   80 ++                     --E--                                        ++
      |                  ---- +++                                          |
      |              -+E+                                                  |
   60 ++         ---- +++                                                 ++
      |      +E+-                                                          |
   40 ++   --                                                             ++
      |  +E+                                                               |
   20 +EE+                                                                ++
      +++        +         +          +         +          +          +    |
    0 +HH-H---H--+-----H---+----------+---------+----------+----------+---++
      0          10        20         30        40         50         60
                                Number of threads

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/translate-a64.c | 79 ++++++++++++++++++----------------------------
 1 file changed, 30 insertions(+), 49 deletions(-)

diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index f5e29d2..1676519 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1754,17 +1754,6 @@ static void disas_b_exc_sys(DisasContext *s, uint32_t insn)
     }
 }
 
-/*
- * Load/Store exclusive instructions are implemented by remembering
- * the value/address loaded, and seeing if these are the same
- * when the store is performed. This is not actually the architecturally
- * mandated semantics, but it works for typical guest code sequences
- * and avoids having to monitor regular stores.
- *
- * In system emulation mode only one CPU will be running at once, so
- * this sequence is effectively atomic.  In user emulation mode we
- * throw an exception and handle the atomic operation elsewhere.
- */
 static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
                                TCGv_i64 addr, int size, bool is_pair)
 {
@@ -1794,16 +1783,6 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     tcg_gen_mov_i64(cpu_exclusive_addr, addr);
 }
 
-#ifdef CONFIG_USER_ONLY
-static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
-                                TCGv_i64 addr, int size, int is_pair)
-{
-    tcg_gen_mov_i64(cpu_exclusive_test, addr);
-    tcg_gen_movi_i32(cpu_exclusive_info,
-                     size | is_pair << 2 | (rd << 4) | (rt << 9) | (rt2 << 14));
-    gen_exception_internal_insn(s, 4, EXCP_STREX);
-}
-#else
 static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
                                 TCGv_i64 inaddr, int size, int is_pair)
 {
@@ -1831,46 +1810,48 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
-    tcg_gen_qemu_ld_i64(tmp, addr, get_mem_index(s), s->be_data + size);
-    tcg_gen_brcond_i64(TCG_COND_NE, tmp, cpu_exclusive_val, fail_label);
-    tcg_temp_free_i64(tmp);
-
     if (is_pair) {
-        TCGv_i64 addrhi = tcg_temp_new_i64();
-        TCGv_i64 tmphi = tcg_temp_new_i64();
-
-        tcg_gen_addi_i64(addrhi, addr, 1 << size);
-        tcg_gen_qemu_ld_i64(tmphi, addrhi, get_mem_index(s),
-                            s->be_data + size);
-        tcg_gen_brcond_i64(TCG_COND_NE, tmphi, cpu_exclusive_high, fail_label);
-
-        tcg_temp_free_i64(tmphi);
-        tcg_temp_free_i64(addrhi);
-    }
-
-    /* We seem to still have the exclusive monitor, so do the store */
-    tcg_gen_qemu_st_i64(cpu_reg(s, rt), addr, get_mem_index(s),
-                        s->be_data + size);
-    if (is_pair) {
-        TCGv_i64 addrhi = tcg_temp_new_i64();
+        if (size == 2) {
+            gen_helper_paired_cmpxchg64l(tmp, cpu_env, addr, cpu_exclusive_val,
+                                         cpu_exclusive_high,
+                                         cpu_reg(s, rt), cpu_reg(s, rt2));
+        } else {
+            gen_helper_paired_cmpxchg64q(tmp, cpu_env, addr, cpu_exclusive_val,
+                                         cpu_exclusive_high, cpu_reg(s, rt),
+                                         cpu_reg(s, rt2));
+        }
+    } else {
+        TCGv_i64 val = cpu_reg(s, rt);
 
-        tcg_gen_addi_i64(addrhi, addr, 1 << size);
-        tcg_gen_qemu_st_i64(cpu_reg(s, rt2), addrhi,
-                            get_mem_index(s), s->be_data + size);
-        tcg_temp_free_i64(addrhi);
+        switch (size) {
+        case 0:
+            gen_helper_cmpxchg64b(tmp, cpu_env, addr, cpu_exclusive_val, val);
+            break;
+        case 1:
+            gen_helper_cmpxchg64w(tmp, cpu_env, addr, cpu_exclusive_val, val);
+            break;
+        case 2:
+            gen_helper_cmpxchg64l(tmp, cpu_env, addr, cpu_exclusive_val, val);
+            break;
+        case 3:
+            gen_helper_cmpxchg64q(tmp, cpu_env, addr, cpu_exclusive_val, val);
+            break;
+        default:
+            abort();
+        }
     }
 
     tcg_temp_free_i64(addr);
 
-    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
+    tcg_temp_free_i64(tmp);
     tcg_gen_br(done_label);
+
     gen_set_label(fail_label);
     tcg_gen_movi_i64(cpu_reg(s, rd), 1);
     gen_set_label(done_label);
     tcg_gen_movi_i64(cpu_exclusive_addr, -1);
-
 }
-#endif
 
 /* Update the Sixty-Four bit (SF) registersize. This logic is derived
  * from the ARMv8 specs for LDR (Shared decode for all encodings).
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 28/30] linux-user: remove handling of ARM's EXCP_STREX
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (26 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 27/30] target-arm: emulate aarch64's LL/SC using cmpxchg helpers Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 29/30] linux-user: remove handling of aarch64's EXCP_STREX Emilio G. Cota
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

The exception is not emitted anymore.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c | 93 -------------------------------------------------------
 1 file changed, 93 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index af9e8e3..c6c92a6 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -634,94 +634,6 @@ do_kernel_trap(CPUARMState *env)
     return 0;
 }
 
-/* Store exclusive handling for AArch32 */
-static int do_strex(CPUARMState *env)
-{
-    uint64_t val;
-    int size;
-    int rc = 1;
-    int segv = 0;
-    uint32_t addr;
-    start_exclusive();
-    if (env->exclusive_addr != env->exclusive_test) {
-        goto fail;
-    }
-    /* We know we're always AArch32 so the address is in uint32_t range
-     * unless it was the -1 exclusive-monitor-lost value (which won't
-     * match exclusive_test above).
-     */
-    assert(extract64(env->exclusive_addr, 32, 32) == 0);
-    addr = env->exclusive_addr;
-    size = env->exclusive_info & 0xf;
-    switch (size) {
-    case 0:
-        segv = get_user_u8(val, addr);
-        break;
-    case 1:
-        segv = get_user_data_u16(val, addr, env);
-        break;
-    case 2:
-    case 3:
-        segv = get_user_data_u32(val, addr, env);
-        break;
-    default:
-        abort();
-    }
-    if (segv) {
-        env->exception.vaddress = addr;
-        goto done;
-    }
-    if (size == 3) {
-        uint32_t valhi;
-        segv = get_user_data_u32(valhi, addr + 4, env);
-        if (segv) {
-            env->exception.vaddress = addr + 4;
-            goto done;
-        }
-        if (arm_cpu_bswap_data(env)) {
-            val = deposit64((uint64_t)valhi, 32, 32, val);
-        } else {
-            val = deposit64(val, 32, 32, valhi);
-        }
-    }
-    if (val != env->exclusive_val) {
-        goto fail;
-    }
-
-    val = env->regs[(env->exclusive_info >> 8) & 0xf];
-    switch (size) {
-    case 0:
-        segv = put_user_u8(val, addr);
-        break;
-    case 1:
-        segv = put_user_data_u16(val, addr, env);
-        break;
-    case 2:
-    case 3:
-        segv = put_user_data_u32(val, addr, env);
-        break;
-    }
-    if (segv) {
-        env->exception.vaddress = addr;
-        goto done;
-    }
-    if (size == 3) {
-        val = env->regs[(env->exclusive_info >> 12) & 0xf];
-        segv = put_user_data_u32(val, addr + 4, env);
-        if (segv) {
-            env->exception.vaddress = addr + 4;
-            goto done;
-        }
-    }
-    rc = 0;
-fail:
-    env->regs[15] += 4;
-    env->regs[(env->exclusive_info >> 4) & 0xf] = rc;
-done:
-    end_exclusive();
-    return segv;
-}
-
 void cpu_loop(CPUARMState *env)
 {
     CPUState *cs = CPU(arm_env_get_cpu(env));
@@ -891,11 +803,6 @@ void cpu_loop(CPUARMState *env)
         case EXCP_INTERRUPT:
             /* just indicate that signals should be handled asap */
             break;
-        case EXCP_STREX:
-            if (!do_strex(env)) {
-                break;
-            }
-            /* fall through for segv */
         case EXCP_PREFETCH_ABORT:
         case EXCP_DATA_ABORT:
             addr = env->exception.vaddress;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 29/30] linux-user: remove handling of aarch64's EXCP_STREX
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (27 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 28/30] linux-user: remove handling of ARM's EXCP_STREX Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-27 19:02 ` [Qemu-devel] [RFC 30/30] target-arm: remove EXCP_STREX + cpu_exclusive_{test, info} Emilio G. Cota
  2016-06-28  8:45 ` [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Lluís Vilanova
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

The exception is not emitted anymore.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c | 125 ------------------------------------------------------
 1 file changed, 125 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index c6c92a6..1c3db37 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -848,124 +848,6 @@ void cpu_loop(CPUARMState *env)
 
 #else
 
-/*
- * Handle AArch64 store-release exclusive
- *
- * rs = gets the status result of store exclusive
- * rt = is the register that is stored
- * rt2 = is the second register store (in STP)
- *
- */
-static int do_strex_a64(CPUARMState *env)
-{
-    uint64_t val;
-    int size;
-    bool is_pair;
-    int rc = 1;
-    int segv = 0;
-    uint64_t addr;
-    int rs, rt, rt2;
-
-    start_exclusive();
-    /* size | is_pair << 2 | (rs << 4) | (rt << 9) | (rt2 << 14)); */
-    size = extract32(env->exclusive_info, 0, 2);
-    is_pair = extract32(env->exclusive_info, 2, 1);
-    rs = extract32(env->exclusive_info, 4, 5);
-    rt = extract32(env->exclusive_info, 9, 5);
-    rt2 = extract32(env->exclusive_info, 14, 5);
-
-    addr = env->exclusive_addr;
-
-    if (addr != env->exclusive_test) {
-        goto finish;
-    }
-
-    switch (size) {
-    case 0:
-        segv = get_user_u8(val, addr);
-        break;
-    case 1:
-        segv = get_user_u16(val, addr);
-        break;
-    case 2:
-        segv = get_user_u32(val, addr);
-        break;
-    case 3:
-        segv = get_user_u64(val, addr);
-        break;
-    default:
-        abort();
-    }
-    if (segv) {
-        env->exception.vaddress = addr;
-        goto error;
-    }
-    if (val != env->exclusive_val) {
-        goto finish;
-    }
-    if (is_pair) {
-        if (size == 2) {
-            segv = get_user_u32(val, addr + 4);
-        } else {
-            segv = get_user_u64(val, addr + 8);
-        }
-        if (segv) {
-            env->exception.vaddress = addr + (size == 2 ? 4 : 8);
-            goto error;
-        }
-        if (val != env->exclusive_high) {
-            goto finish;
-        }
-    }
-    /* handle the zero register */
-    val = rt == 31 ? 0 : env->xregs[rt];
-    switch (size) {
-    case 0:
-        segv = put_user_u8(val, addr);
-        break;
-    case 1:
-        segv = put_user_u16(val, addr);
-        break;
-    case 2:
-        segv = put_user_u32(val, addr);
-        break;
-    case 3:
-        segv = put_user_u64(val, addr);
-        break;
-    }
-    if (segv) {
-        goto error;
-    }
-    if (is_pair) {
-        /* handle the zero register */
-        val = rt2 == 31 ? 0 : env->xregs[rt2];
-        if (size == 2) {
-            segv = put_user_u32(val, addr + 4);
-        } else {
-            segv = put_user_u64(val, addr + 8);
-        }
-        if (segv) {
-            env->exception.vaddress = addr + (size == 2 ? 4 : 8);
-            goto error;
-        }
-    }
-    rc = 0;
-finish:
-    env->pc += 4;
-    /* rs == 31 encodes a write to the ZR, thus throwing away
-     * the status return. This is rather silly but valid.
-     */
-    if (rs < 31) {
-        env->xregs[rs] = rc;
-    }
-error:
-    /* instruction faulted, PC does not advance */
-    /* either way a strex releases any exclusive lock we have */
-    env->exclusive_addr = -1;
-    end_exclusive();
-    return segv;
-}
-
 /* AArch64 main loop */
 void cpu_loop(CPUARMState *env)
 {
@@ -1006,11 +888,6 @@ void cpu_loop(CPUARMState *env)
             info._sifields._sigfault._addr = env->pc;
             queue_signal(env, info.si_signo, &info);
             break;
-        case EXCP_STREX:
-            if (!do_strex_a64(env)) {
-                break;
-            }
-            /* fall through for segv */
         case EXCP_PREFETCH_ABORT:
         case EXCP_DATA_ABORT:
             info.si_signo = TARGET_SIGSEGV;
@@ -1043,8 +920,6 @@ void cpu_loop(CPUARMState *env)
         process_pending_signals(env);
         /* Exception return on AArch64 always clears the exclusive monitor,
          * so any return to running guest code implies this.
-         * A strex (successful or otherwise) also clears the monitor, so
-         * we don't need to specialcase EXCP_STREX.
          */
         env->exclusive_addr = -1;
     }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Qemu-devel] [RFC 30/30] target-arm: remove EXCP_STREX + cpu_exclusive_{test, info}
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (28 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 29/30] linux-user: remove handling of aarch64's EXCP_STREX Emilio G. Cota
@ 2016-06-27 19:02 ` Emilio G. Cota
  2016-06-28  8:45 ` [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Lluís Vilanova
  30 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 19:02 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

The exception is not emitted anymore; remove it and the associated
TCG variables.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/cpu.h       | 17 ++++++-----------
 target-arm/internals.h |  4 +---
 target-arm/translate.c | 10 ----------
 target-arm/translate.h |  4 ----
 4 files changed, 7 insertions(+), 28 deletions(-)

diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index 7938ddc..0b2ed28 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -46,13 +46,12 @@
 #define EXCP_BKPT            7
 #define EXCP_EXCEPTION_EXIT  8   /* Return from v7M exception.  */
 #define EXCP_KERNEL_TRAP     9   /* Jumped to kernel code page.  */
-#define EXCP_STREX          10
-#define EXCP_HVC            11   /* HyperVisor Call */
-#define EXCP_HYP_TRAP       12
-#define EXCP_SMC            13   /* Secure Monitor Call */
-#define EXCP_VIRQ           14
-#define EXCP_VFIQ           15
-#define EXCP_SEMIHOST       16   /* semihosting call (A64 only) */
+#define EXCP_HVC            10   /* HyperVisor Call */
+#define EXCP_HYP_TRAP       11
+#define EXCP_SMC            12   /* Secure Monitor Call */
+#define EXCP_VIRQ           13
+#define EXCP_VFIQ           14
+#define EXCP_SEMIHOST       15   /* semihosting call (A64 only) */
 
 #define ARMV7M_EXCP_RESET   1
 #define ARMV7M_EXCP_NMI     2
@@ -475,10 +474,6 @@ typedef struct CPUARMState {
     uint64_t exclusive_addr;
     uint64_t exclusive_val;
     uint64_t exclusive_high;
-#if defined(CONFIG_USER_ONLY)
-    uint64_t exclusive_test;
-    uint32_t exclusive_info;
-#endif
 
     /* iwMMXt coprocessor state.  */
     struct {
diff --git a/target-arm/internals.h b/target-arm/internals.h
index 466be0b..5ab3b28 100644
--- a/target-arm/internals.h
+++ b/target-arm/internals.h
@@ -46,8 +46,7 @@ static inline bool excp_is_internal(int excp)
         || excp == EXCP_HALTED
         || excp == EXCP_EXCEPTION_EXIT
         || excp == EXCP_KERNEL_TRAP
-        || excp == EXCP_SEMIHOST
-        || excp == EXCP_STREX;
+        || excp == EXCP_SEMIHOST;
 }
 
 /* Exception names for debug logging; note that not all of these
@@ -63,7 +62,6 @@ static const char * const excnames[] = {
     [EXCP_BKPT] = "Breakpoint",
     [EXCP_EXCEPTION_EXIT] = "QEMU v7M exception exit",
     [EXCP_KERNEL_TRAP] = "QEMU intercept of kernel commpage",
-    [EXCP_STREX] = "QEMU intercept of STREX",
     [EXCP_HVC] = "Hypervisor Call",
     [EXCP_HYP_TRAP] = "Hypervisor Trap",
     [EXCP_SMC] = "Secure Monitor Call",
diff --git a/target-arm/translate.c b/target-arm/translate.c
index b177388..8dec0d7 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -64,10 +64,6 @@ static TCGv_i32 cpu_R[16];
 TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
 TCGv_i64 cpu_exclusive_addr;
 TCGv_i64 cpu_exclusive_val;
-#ifdef CONFIG_USER_ONLY
-TCGv_i64 cpu_exclusive_test;
-TCGv_i32 cpu_exclusive_info;
-#endif
 
 /* FIXME:  These should be removed.  */
 static TCGv_i32 cpu_F0s, cpu_F1s;
@@ -101,12 +97,6 @@ void arm_translate_init(void)
         offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
     cpu_exclusive_val = tcg_global_mem_new_i64(cpu_env,
         offsetof(CPUARMState, exclusive_val), "exclusive_val");
-#ifdef CONFIG_USER_ONLY
-    cpu_exclusive_test = tcg_global_mem_new_i64(cpu_env,
-        offsetof(CPUARMState, exclusive_test), "exclusive_test");
-    cpu_exclusive_info = tcg_global_mem_new_i32(cpu_env,
-        offsetof(CPUARMState, exclusive_info), "exclusive_info");
-#endif
 
     a64_translate_init();
 }
diff --git a/target-arm/translate.h b/target-arm/translate.h
index dbd7ac8..d4e205e 100644
--- a/target-arm/translate.h
+++ b/target-arm/translate.h
@@ -77,10 +77,6 @@ extern TCGv_env cpu_env;
 extern TCGv_i32 cpu_NF, cpu_ZF, cpu_CF, cpu_VF;
 extern TCGv_i64 cpu_exclusive_addr;
 extern TCGv_i64 cpu_exclusive_val;
-#ifdef CONFIG_USER_ONLY
-extern TCGv_i64 cpu_exclusive_test;
-extern TCGv_i32 cpu_exclusive_info;
-#endif
 
 static inline int arm_dc_feature(DisasContext *dc, int feature)
 {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock
  2016-06-27 19:01 ` [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock Emilio G. Cota
@ 2016-06-27 20:07   ` Richard Henderson
  2016-06-27 20:41     ` Emilio G. Cota
  0 siblings, 1 reply; 44+ messages in thread
From: Richard Henderson @ 2016-06-27 20:07 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Alvise Rigo, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée

On 06/27/2016 12:01 PM, Emilio G. Cota wrote:
> This set of locks will allow us to correctly emulate cmpxchg16
> in a parallel TCG. The key observation is that no architecture
> supports 16-byte regular atomic load/stores; only "locked" accesses
> (e.g. via cmpxchg16b on x86) are allowed, and therefore we can emulate
> them by using locks.
>
> We use a small array of locks so that we can have some scalability.
> Further improvements are possible (e.g. using a radix tree); but
> we should have a workload to benchmark in order to justify the
> additional complexity.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c        |  1 +
>  linux-user/main.c |  1 +
>  tcg/tcg.h         |  5 +++++
>  translate-all.c   | 39 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 46 insertions(+)

As formulated, this doesn't work.

In order to support cmpxchg16 without a native one, you have to use locks on 
*all* operations, lest a 4-byte atomic operation and a 16-byte operation be 
simultaneous in the same address range.

Thankfully, the most common hosts (x86_64, aarch64, power7, s390x) do have a 
16-byte cmpxchg, so this shouldn't really matter much in practice.

It would be nice to continue to support the other hosts (arm32, mips, ppc32, 
sparc, i686) without locks when the guest doesn't require wider atomics than 
the host suports.


r~

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers
  2016-06-27 19:01 ` [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers Emilio G. Cota
@ 2016-06-27 20:11   ` Richard Henderson
  2016-06-27 21:19     ` Emilio G. Cota
  0 siblings, 1 reply; 44+ messages in thread
From: Richard Henderson @ 2016-06-27 20:11 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Alvise Rigo, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée

On 06/27/2016 12:01 PM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  softmmu_template.h | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcg/tcg.h          | 16 +++++++++++++++
>  2 files changed, 74 insertions(+)
>
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 208f808..7b519dc 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -548,6 +548,64 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
>      }
>  }
>  #endif
> +
> +DATA_TYPE
> +glue(glue(helper_cmpxchg, SUFFIX),
> +     MMUSUFFIX)(CPUArchState *env, target_ulong addr, DATA_TYPE old,
> +                DATA_TYPE new, TCGMemOpIdx oi, uintptr_t retaddr)
> +{
> +    unsigned mmu_idx = get_mmuidx(oi);
> +    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> +    uintptr_t haddr;
> +
> +    /* Adjust the given return address.  */
> +    retaddr -= GETPC_ADJ;
> +
> +    /* If the TLB entry is for a different page, reload and try again.  */
> +    if ((addr & TARGET_PAGE_MASK)
> +        != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
> +        if (unlikely((addr & (DATA_SIZE - 1)) != 0
> +                     && (get_memop(oi) & MO_AMASK) == MO_ALIGN)) {
> +            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                                 mmu_idx, retaddr);
> +        }
> +        if (!VICTIM_TLB_HIT(addr_write)) {
> +            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
> +        }
> +        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> +    }

You need to verify that addr_read == addr_write as well, so that tlb_fill can 
signal an exception in the rare case of a guest attempting a cmpxchg on a 
write-only page.

> +    /*
> +     * If the host allows unaligned accesses, then let the compiler
> +     * do its thing when performing the access on the host.
> +     */
> +    haddr = addr + env->tlb_table[mmu_idx][index].addend;
> +    return atomic_cmpxchg((DATA_TYPE *)haddr, old, new);

Host endian operation?


r~

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 11/30] target-i386: add atomic helpers
  2016-06-27 19:01 ` [Qemu-devel] [RFC 11/30] target-i386: add atomic helpers Emilio G. Cota
@ 2016-06-27 20:27   ` Richard Henderson
  2016-06-27 21:39     ` Emilio G. Cota
  0 siblings, 1 reply; 44+ messages in thread
From: Richard Henderson @ 2016-06-27 20:27 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Alvise Rigo, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée

On 06/27/2016 12:01 PM, Emilio G. Cota wrote:
> This patch only adds the helpers. Functions to invoke the helpers
> from translated code are generated in subsequent patches.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  target-i386/helper.h     | 34 ++++++++++++++++++++++++++++++++++
>  target-i386/mem_helper.c | 38 ++++++++++++++++++++++++++++++++++++++
>  target-i386/translate.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 117 insertions(+)


These are bare wrappers around what you've just added to exec/cpu_ldst*.

(1) Is there any reason these shouldn't go into tcg-runtime.h and tcg-runtime.c 
instead?

(2) If so, is there any good reason to add these to cpu_ldst* instead of *only* 
adding them as helpers to tcg-runtime.c?


r~

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock
  2016-06-27 20:07   ` Richard Henderson
@ 2016-06-27 20:41     ` Emilio G. Cota
  2016-06-27 21:02       ` Richard Henderson
  0 siblings, 1 reply; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 20:41 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Peter Maydell, Alvise Rigo,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On Mon, Jun 27, 2016 at 13:07:42 -0700, Richard Henderson wrote:
> On 06/27/2016 12:01 PM, Emilio G. Cota wrote:
> >This set of locks will allow us to correctly emulate cmpxchg16
> >in a parallel TCG. The key observation is that no architecture
> >supports 16-byte regular atomic load/stores; only "locked" accesses
> >(e.g. via cmpxchg16b on x86) are allowed, and therefore we can emulate
> >them by using locks.
> >
> >We use a small array of locks so that we can have some scalability.
> >Further improvements are possible (e.g. using a radix tree); but
> >we should have a workload to benchmark in order to justify the
> >additional complexity.
> >
> >Signed-off-by: Emilio G. Cota <cota@braap.org>
> >---
> > cpu-exec.c        |  1 +
> > linux-user/main.c |  1 +
> > tcg/tcg.h         |  5 +++++
> > translate-all.c   | 39 +++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 46 insertions(+)
> 
> As formulated, this doesn't work.
> 
> In order to support cmpxchg16 without a native one, you have to use locks on
> *all* operations, lest a 4-byte atomic operation and a 16-byte operation be
> simultaneous in the same address range.

Ouch, of course you're right.
I had in mind a use case where cmpxchg16 races with other cmpxchg16's only,
which doesn't always have to be the case.

> Thankfully, the most common hosts (x86_64, aarch64, power7, s390x) do have a
> 16-byte cmpxchg, so this shouldn't really matter much in practice.
> 
> It would be nice to continue to support the other hosts (arm32, mips, ppc32,
> sparc, i686) without locks when the guest doesn't require wider atomics than
> the host suports.

ABA problem notwithstanding, as long as the guest workload doesn't require
atomics wider than that of the host, using cmpxchg as in this RFC can work.

Supporting 64-bit hosts on 32-bit guests has the problem of non-atomicity
of 64-bit accesses, however.

		E.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock
  2016-06-27 20:41     ` Emilio G. Cota
@ 2016-06-27 21:02       ` Richard Henderson
  0 siblings, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2016-06-27 21:02 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Peter Maydell, Alvise Rigo,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On 06/27/2016 01:41 PM, Emilio G. Cota wrote:
> Supporting 64-bit hosts on 32-bit guests has the problem of non-atomicity
> of 64-bit accesses, however.

It does.  It would be possible to do something with armv7 and i686 hosts, as 
64-bit atomic ops exist, but it's probably not worth the effort.

As you might be able to glean from the multiple sparc64-on-i686 threads, I'm 
becoming annoyed with i686.  I wouldn't be disappointed to completely deprecate it.


r~

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers
  2016-06-27 20:11   ` Richard Henderson
@ 2016-06-27 21:19     ` Emilio G. Cota
  2016-06-27 21:43       ` Richard Henderson
  0 siblings, 1 reply; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 21:19 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Peter Maydell, Alvise Rigo,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On Mon, Jun 27, 2016 at 13:11:28 -0700, Richard Henderson wrote:
> On 06/27/2016 12:01 PM, Emilio G. Cota wrote:
> >Signed-off-by: Emilio G. Cota <cota@braap.org>
> >---
> > softmmu_template.h | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > tcg/tcg.h          | 16 +++++++++++++++
> > 2 files changed, 74 insertions(+)
> >
> >diff --git a/softmmu_template.h b/softmmu_template.h
> >index 208f808..7b519dc 100644
> >--- a/softmmu_template.h
> >+++ b/softmmu_template.h
> >@@ -548,6 +548,64 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
> >     }
> > }
> > #endif
> >+
> >+DATA_TYPE
> >+glue(glue(helper_cmpxchg, SUFFIX),
> >+     MMUSUFFIX)(CPUArchState *env, target_ulong addr, DATA_TYPE old,
> >+                DATA_TYPE new, TCGMemOpIdx oi, uintptr_t retaddr)
> >+{
> >+    unsigned mmu_idx = get_mmuidx(oi);
> >+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> >+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> >+    uintptr_t haddr;
> >+
> >+    /* Adjust the given return address.  */
> >+    retaddr -= GETPC_ADJ;
> >+
> >+    /* If the TLB entry is for a different page, reload and try again.  */
> >+    if ((addr & TARGET_PAGE_MASK)
> >+        != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
> >+        if (unlikely((addr & (DATA_SIZE - 1)) != 0
> >+                     && (get_memop(oi) & MO_AMASK) == MO_ALIGN)) {
> >+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> >+                                 mmu_idx, retaddr);
> >+        }
> >+        if (!VICTIM_TLB_HIT(addr_write)) {
> >+            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
> >+        }
> >+        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> >+    }
> 
> You need to verify that addr_read == addr_write as well, so that tlb_fill
> can signal an exception in the rare case of a guest attempting a cmpxchg on
> a write-only page.

Will do.

> >+    /*
> >+     * If the host allows unaligned accesses, then let the compiler
> >+     * do its thing when performing the access on the host.
> >+     */
> >+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
> >+    return atomic_cmpxchg((DATA_TYPE *)haddr, old, new);
> 
> Host endian operation?

I forgot to add byte ordering in the cover letter under "why this is
an RFC" -- I admit I'm confused by all the macro trickery done for
regular loads and stores.

We store data in memory as per the guests' byte ordering, right?
If so, I don't see how it would be possible to leverage the
host compiler for things like atomic_add -- we'd increment garbage,
not a meaningful value.

		Emilio

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 11/30] target-i386: add atomic helpers
  2016-06-27 20:27   ` Richard Henderson
@ 2016-06-27 21:39     ` Emilio G. Cota
  0 siblings, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-27 21:39 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Peter Maydell, Alvise Rigo,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On Mon, Jun 27, 2016 at 13:27:35 -0700, Richard Henderson wrote:
> On 06/27/2016 12:01 PM, Emilio G. Cota wrote:
> >This patch only adds the helpers. Functions to invoke the helpers
> >from translated code are generated in subsequent patches.
> >
> >Signed-off-by: Emilio G. Cota <cota@braap.org>
> >---
> > target-i386/helper.h     | 34 ++++++++++++++++++++++++++++++++++
> > target-i386/mem_helper.c | 38 ++++++++++++++++++++++++++++++++++++++
> > target-i386/translate.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 117 insertions(+)
> 
> 
> These are bare wrappers around what you've just added to exec/cpu_ldst*.
> 
> (1) Is there any reason these shouldn't go into tcg-runtime.h and
> tcg-runtime.c instead?
> 
> (2) If so, is there any good reason to add these to cpu_ldst* instead of
> *only* adding them as helpers to tcg-runtime.c?

The only reason is my incompetence :-) I didn't know about tcg-runtime.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers
  2016-06-27 21:19     ` Emilio G. Cota
@ 2016-06-27 21:43       ` Richard Henderson
  2016-06-27 21:48         ` Peter Maydell
  0 siblings, 1 reply; 44+ messages in thread
From: Richard Henderson @ 2016-06-27 21:43 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Peter Maydell, Alvise Rigo,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On 06/27/2016 02:19 PM, Emilio G. Cota wrote:
>> Host endian operation?
>
> I forgot to add byte ordering in the cover letter under "why this is
> an RFC" -- I admit I'm confused by all the macro trickery done for
> regular loads and stores.
>
> We store data in memory as per the guests' byte ordering, right?

Sometimes.  The guest can also explicitly request a reversed byte load/store. 
This is used both for explicit byte-reversing instructions (e.g. x86 movbe) and 
toggling the system byte order (arm setbe).


> If so, I don't see how it would be possible to leverage the
> host compiler for things like atomic_add -- we'd increment garbage,
> not a meaningful value.

All you need to do is byte-reverse the data.

   bswap(a + b) == bswap(a) + bswap(b).


r~

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers
  2016-06-27 21:43       ` Richard Henderson
@ 2016-06-27 21:48         ` Peter Maydell
  2016-06-27 21:53           ` Richard Henderson
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Maydell @ 2016-06-27 21:48 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Emilio G. Cota, QEMU Developers, MTTCG Devel, Alvise Rigo,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On 27 June 2016 at 22:43, Richard Henderson <rth@twiddle.net> wrote:
> All you need to do is byte-reverse the data.
>
>   bswap(a + b) == bswap(a) + bswap(b).

?

0xFF + 0xFF == 0x1FE, bswap(0x1FE) == 0xFE010000
bswap(0xFF) + bswap(0xFF) == 0xFF000000 + 0xFF000000 == 0x1FE000000
(or 0xFE000000 with truncate to 32-bit).

Or am I missing something?

thanks
-- PMM

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers
  2016-06-27 21:48         ` Peter Maydell
@ 2016-06-27 21:53           ` Richard Henderson
  0 siblings, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2016-06-27 21:53 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Emilio G. Cota, QEMU Developers, MTTCG Devel, Alvise Rigo,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On 06/27/2016 02:48 PM, Peter Maydell wrote:
> On 27 June 2016 at 22:43, Richard Henderson <rth@twiddle.net> wrote:
>> All you need to do is byte-reverse the data.
>>
>>   bswap(a + b) == bswap(a) + bswap(b).
>
> ?
>
> 0xFF + 0xFF == 0x1FE, bswap(0x1FE) == 0xFE010000
> bswap(0xFF) + bswap(0xFF) == 0xFF000000 + 0xFF000000 == 0x1FE000000
> (or 0xFE000000 with truncate to 32-bit).
>
> Or am I missing something?

No, you're quite right.  I think I didn't have enough coffee this morning.

So we'd need a cmpxchg loop for reverse-endian add.


r~

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics
  2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
                   ` (29 preceding siblings ...)
  2016-06-27 19:02 ` [Qemu-devel] [RFC 30/30] target-arm: remove EXCP_STREX + cpu_exclusive_{test, info} Emilio G. Cota
@ 2016-06-28  8:45 ` Lluís Vilanova
  2016-06-28 15:48   ` Richard Henderson
  2016-06-28 19:52   ` Emilio G. Cota
  30 siblings, 2 replies; 44+ messages in thread
From: Lluís Vilanova @ 2016-06-28  8:45 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson, Sergey Fedorov, Alvise Rigo, Peter Maydell

Emilio G Cota writes:
[...]
> - What to do when atomic ops are used on something other than RAM?
>   Should we have a "slow path" that is not atomic for these cases, or
>   it's OK to assume code is bogus? For now, I just wrote XXX.
[...]

You mean, for example, on I/O space? In these cases, it depends on the specific
device you're accessing and the interconnect used to access it.

TL;DR: In some cases, it makes sense to support atomics outside RAM.

For example, PCIe has support for expressing atomic operations in its messages
(I'm sure other interconnects do too). But in the end it depends on whether the
device supports them (I'm not sure if the device can reject atomics and produce
an error to whomever tried to do the atomic access, or if they are simply
ignored).

Cheers,
  Lluis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics
  2016-06-28  8:45 ` [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Lluís Vilanova
@ 2016-06-28 15:48   ` Richard Henderson
  2016-06-28 19:52   ` Emilio G. Cota
  1 sibling, 0 replies; 44+ messages in thread
From: Richard Henderson @ 2016-06-28 15:48 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel, Alex Bennée,
	Paolo Bonzini, Sergey Fedorov, Alvise Rigo, Peter Maydell

On 06/28/2016 01:45 AM, Lluís Vilanova wrote:
> Emilio G Cota writes:
> [...]
>> - What to do when atomic ops are used on something other than RAM?
>>   Should we have a "slow path" that is not atomic for these cases, or
>>   it's OK to assume code is bogus? For now, I just wrote XXX.
> [...]
>
> You mean, for example, on I/O space? In these cases, it depends on the specific
> device you're accessing and the interconnect used to access it.
>
> TL;DR: In some cases, it makes sense to support atomics outside RAM.
>
> For example, PCIe has support for expressing atomic operations in its messages
> (I'm sure other interconnects do too). But in the end it depends on whether the
> device supports them (I'm not sure if the device can reject atomics and produce
> an error to whomever tried to do the atomic access, or if they are simply
> ignored).

Indeed.  Thankfully, that's rare.  Many cpus explicitly say that the atomic ops 
can't be used on non-cachable memory, since they use the cache coherency 
protocol to implement the atomicity.

That said, I can imagine that this probably works on x86, and supporting this 
is going to require a stop-the-world kind of emulation.


r~

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics
  2016-06-28  8:45 ` [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Lluís Vilanova
  2016-06-28 15:48   ` Richard Henderson
@ 2016-06-28 19:52   ` Emilio G. Cota
  1 sibling, 0 replies; 44+ messages in thread
From: Emilio G. Cota @ 2016-06-28 19:52 UTC (permalink / raw)
  To: Lluís Vilanova, Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

On Tue, Jun 28, 2016 at 08:48:28 -0700, Richard Henderson wrote:
> On 06/28/2016 01:45 AM, Lluís Vilanova wrote:
> >Emilio G Cota writes:
> >[...]
> >>- What to do when atomic ops are used on something other than RAM?
> >>  Should we have a "slow path" that is not atomic for these cases, or
> >>  it's OK to assume code is bogus? For now, I just wrote XXX.
> >[...]
> >
> >You mean, for example, on I/O space? In these cases, it depends on the specific
> >device you're accessing and the interconnect used to access it.

Yes, exactly. Anything non-cacheable, really.

> >TL;DR: In some cases, it makes sense to support atomics outside RAM.
> >
> >For example, PCIe has support for expressing atomic operations in its messages
> >(I'm sure other interconnects do too). But in the end it depends on whether the
> >device supports them (I'm not sure if the device can reject atomics and produce
> >an error to whomever tried to do the atomic access, or if they are simply
> >ignored).

But these messages wouldn't be generated as a result of calling cmpxchg
on the memory-mapped I/O address, right?

> Indeed.  Thankfully, that's rare.  Many cpus explicitly say that the atomic
> ops can't be used on non-cachable memory, since they use the cache coherency
> protocol to implement the atomicity.
>
> That said, I can imagine that this probably works on x86, and supporting
> this is going to require a stop-the-world kind of emulation.

I'm assuming virtually all device drivers serialize accesses so that
"read-modify-write" cycles can be implemented as a read+write
on the bus. I have written quite a few drivers and it never occurred
to me to write cmpxchg or equivalent on an I/O address.

That said, for a non-RFC submission of this patchset, what should
we do? Just abort() for now, or do a non-atomic cycle? Stop-the-world
isn't available yet, and I wouldn't want to wait for it--this is not
a huge deal-breaker for most code out there, I think.

Thanks,
 
		Emilio

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2016-06-28 19:52 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers Emilio G. Cota
2016-06-27 20:11   ` Richard Henderson
2016-06-27 21:19     ` Emilio G. Cota
2016-06-27 21:43       ` Richard Henderson
2016-06-27 21:48         ` Peter Maydell
2016-06-27 21:53           ` Richard Henderson
2016-06-27 19:01 ` [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock Emilio G. Cota
2016-06-27 20:07   ` Richard Henderson
2016-06-27 20:41     ` Emilio G. Cota
2016-06-27 21:02       ` Richard Henderson
2016-06-27 19:01 ` [Qemu-devel] [RFC 03/30] cpu_ldst: add cpu_cmpxchg helpers Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 04/30] target-i386: add cmpxchg helpers Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 05/30] target-i386: emulate LOCK'ed cmpxchg using " Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 06/30] target-i386: emulate LOCK'ed cmpxchg8b/16b " Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 07/30] atomics: add atomic_xor Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 08/30] atomics: add atomic_op_fetch variants Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 09/30] softmmu: add atomic helpers Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 10/30] cpu_ldst: add cpu_atomic helpers Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 11/30] target-i386: add atomic helpers Emilio G. Cota
2016-06-27 20:27   ` Richard Henderson
2016-06-27 21:39     ` Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 12/30] target-i386: emulate LOCK'ed OP instructions using " Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 13/30] target-i386: emulate LOCK'ed INC using atomic helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 14/30] target-i386: emulate LOCK'ed NOT " Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 15/30] target-i386: emulate LOCK'ed NEG using cmpxchg helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 16/30] target-i386: emulate LOCK'ed XADD using atomic helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 17/30] target-i386: emulate LOCK'ed BTX ops using atomic helpers Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 18/30] target-i386: emulate XCHG using atomic helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 19/30] tests: add atomic_add-bench Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 20/30] target-i386: remove helper_lock() Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 21/30] target-arm: add cmpxchg helpers Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 22/30] target-arm: emulate LL/SC using " Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 23/30] target-arm: add atomic_xchg helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 24/30] target-arm: emulate SWP with " Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 25/30] helper: add DEF_HELPER_6 Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 26/30] target-arm: add cmpxchg helpers for aarch64 Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 27/30] target-arm: emulate aarch64's LL/SC using cmpxchg helpers Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 28/30] linux-user: remove handling of ARM's EXCP_STREX Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 29/30] linux-user: remove handling of aarch64's EXCP_STREX Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 30/30] target-arm: remove EXCP_STREX + cpu_exclusive_{test, info} Emilio G. Cota
2016-06-28  8:45 ` [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Lluís Vilanova
2016-06-28 15:48   ` Richard Henderson
2016-06-28 19:52   ` Emilio G. Cota

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.