* [Qemu-devel]  [RFC 00/10] MTTCG: Slow-path for atomic insns
@ 2016-05-26 16:35 Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}() Alvise Rigo
                   ` (10 more replies)
  0 siblings, 11 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

Hi,

This series ports the latest iteration of the LL/SC work on top of the
latest MTTCG reference branch posted recently by Alex.

These patches apply on top of the following series:

- [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86
  https://github.com/stsquad/qemu/tree/mttcg/enable-mttcg-for-armv7-v1
- [RFC v8 00/14] Slow-path for atomic instruction translation
  https://git.virtualopensystems.com/dev/qemu-mt/tree/\
  slowpath-for-atomic-v8-no-mttcg - only minor changes have been necessary
- Few recent patches from Emilio regarding the spinlock implementation

Overall, these patches allow the LL/SC infrastructure to work in multi-threaded
mode (patches 01, 02 and 04) and make TLB flushes to other VCPUs safe.

Patch 03 introduces a new API to submit a work item to a VCPU and wait for its
completion. This API is used to request the TLB flushes that result from the
emulation of some ARM instructions. Patches 07, 08 and 09 modify the current
tlb_flush_* functions to use the new API.  Patch 10 fixes a rare hang that I
was experiencing with this branch.
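
For reference, a minimal usage sketch of the API introduced in patch 03; the
callback and wrapper names below are made up for illustration only:

    /* Sketch only: ask @other to run do_remote_flush() and wait for it. */
    static void do_remote_flush(CPUState *cpu, void *opaque)
    {
        /* runs in @other's thread, e.g. flush its TLB here */
    }

    static void request_remote_flush(CPUState *this_cpu, CPUState *other)
    {
        async_wait_run_on_cpu(other, this_cpu, do_remote_flush, NULL);
        /* this_cpu exits the CPU loop and does not resume guest code
         * until @other has processed the work item. */
    }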

The whole work can be fetched from the following repository:
git@git.virtualopensystems.com:dev/qemu-mt.git
at the branch "slowpath-for-atomic-v8-mttcg".

Alvise Rigo (10):
  exec: Introduce tcg_exclusive_{lock,unlock}()
  softmmu_llsc_template.h: Move to multi-threading
  cpus: Introduce async_wait_run_on_cpu()
  cputlb: Introduce tlb_flush_other()
  target-arm: End TB after ldrex instruction
  cputlb: Add tlb_tables_flush_bitmap()
  cputlb: Query tlb_flush_by_mmuidx
  cputlb: Query tlb_flush_page_by_mmuidx
  cputlb: Query tlb_flush_page_all
  cpus: Do not sleep if some work item is pending

 cpus.c                     |  48 ++++++++++-
 cputlb.c                   | 202 ++++++++++++++++++++++++++++++++++-----------
 exec.c                     |  18 ++++
 include/exec/exec-all.h    |  13 +--
 include/qom/cpu.h          |  36 ++++++++
 softmmu_llsc_template.h    |  13 ++-
 softmmu_template.h         |   6 ++
 target-arm/helper.c        |  79 +++++++++---------
 target-arm/op_helper.c     |   6 ++
 target-arm/translate-a64.c |   2 +
 target-arm/translate.c     |   2 +
 11 files changed, 327 insertions(+), 98 deletions(-)

-- 
2.8.3

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Qemu-devel]  [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-05-31 15:03   ` Pranith Kumar
  2016-06-08  9:21   ` Alex Bennée
  2016-05-26 16:35 ` [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading Alvise Rigo
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

Add tcg_exclusive_{lock,unlock}() functions that will be used for making
the emulation of LL and SC instructions thread safe.
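
As a usage sketch (illustrative only, not part of the patch), the expected
pattern in the softmmu helpers is simply:

    tcg_exclusive_lock();
    /* ... manipulate the cpu_exclusive_* state ... */
    tcg_exclusive_unlock();

Note that the lock is not reference counted: the thread-local flag only makes
nested tcg_exclusive_lock() calls harmless, and the first
tcg_exclusive_unlock() releases the spinlock.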

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cpus.c            |  2 ++
 exec.c            | 18 ++++++++++++++++++
 include/qom/cpu.h |  5 +++++
 3 files changed, 25 insertions(+)

diff --git a/cpus.c b/cpus.c
index 860e7ba..b9ec903 100644
--- a/cpus.c
+++ b/cpus.c
@@ -961,6 +961,8 @@ void qemu_init_cpu_loop(void)
     qemu_cond_init(&qemu_work_cond);
     qemu_mutex_init(&qemu_global_mutex);
 
+    qemu_spin_init(&cpu_exclusive_lock);
+
     qemu_thread_get_self(&io_thread);
 
     safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 128);
diff --git a/exec.c b/exec.c
index a24b31c..1c72113 100644
--- a/exec.c
+++ b/exec.c
@@ -197,6 +197,24 @@ void cpu_exclusive_history_free(void)
         g_free(excl_history.c_array);
     }
 }
+
+__thread bool cpu_have_exclusive_lock;
+QemuSpin cpu_exclusive_lock;
+inline void tcg_exclusive_lock(void)
+{
+    if (!cpu_have_exclusive_lock) {
+        qemu_spin_lock(&cpu_exclusive_lock);
+        cpu_have_exclusive_lock = true;
+    }
+}
+
+inline void tcg_exclusive_unlock(void)
+{
+    if (cpu_have_exclusive_lock) {
+        cpu_have_exclusive_lock = false;
+        qemu_spin_unlock(&cpu_exclusive_lock);
+    }
+}
 #endif
 
 #if !defined(CONFIG_USER_ONLY)
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 0f51870..019f06d 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -201,6 +201,11 @@ typedef struct CPUClass {
     void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
 } CPUClass;
 
+/* Protect cpu_exclusive_* variable .*/
+void tcg_exclusive_lock(void);
+void tcg_exclusive_unlock(void);
+extern QemuSpin cpu_exclusive_lock;
+
 #ifdef HOST_WORDS_BIGENDIAN
 typedef struct icount_decr_u16 {
     uint16_t high;
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}() Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-06-10 15:21   ` Sergey Fedorov
  2016-05-26 16:35 ` [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu() Alvise Rigo
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

Using tcg_exclusive_{lock,unlock}(), make the emulation of
LoadLink/StoreConditional thread safe.

During an LL access, this lock protects the load access itself, the
update of the exclusive history and the update of the VCPU's protected
range.  During an SC access, the lock protects the store access itself,
the possible reset of other VCPUs' protected ranges and the reset of the
exclusive context of the calling VCPU.

The lock is also taken when a normal store happens to access an
exclusive page, in order to reset other VCPUs' protected ranges in case
of collision.

Moreover, adapt target-arm to also cope with the new multi-threaded
execution.
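
Schematically, the LL path becomes (a simplified sketch of the
softmmu_llsc_template.h hunk below):

    tcg_exclusive_lock();
    ret = helper_ld(env, addr, oi, retaddr);  /* the load access itself */
    /* ... update the exclusive history, flush other VCPUs' TLBs ... */
    cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
    tcg_exclusive_unlock();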

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 softmmu_llsc_template.h | 11 +++++++++--
 softmmu_template.h      |  6 ++++++
 target-arm/op_helper.c  |  6 ++++++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index 2c4a494..d3810c0 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
     hwaddr hw_addr;
     unsigned mmu_idx = get_mmuidx(oi);
 
+    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+
+    tcg_exclusive_lock();
+
     /* Use the proper load helper from cpu_ldst.h */
     ret = helper_ld(env, addr, oi, retaddr);
 
-    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-
     /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
      * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
     hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
@@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
 
     cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
 
+    tcg_exclusive_unlock();
+
     /* From now on we are in LL/SC context */
     this_cpu->ll_sc_context = true;
 
@@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
          * access as one made by the store conditional wrapper. If the store
          * conditional does not succeed, the value will be set to 0.*/
         cpu->excl_succeeded = true;
+
+        tcg_exclusive_lock();
         helper_st(env, addr, val, oi, retaddr);
 
         if (cpu->excl_succeeded) {
@@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
 
     /* Unset LL/SC context */
     cc->cpu_reset_excl_context(cpu);
+    tcg_exclusive_unlock();
 
     return ret;
 }
diff --git a/softmmu_template.h b/softmmu_template.h
index 76fe37e..9363a7b 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
         }
     }
 
+    /* Take the lock in case we are not coming from a SC */
+    tcg_exclusive_lock();
+
     smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
                               get_mmuidx(oi), index, retaddr);
 
     reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
 
+    tcg_exclusive_unlock();
+
     return;
 }
 
@@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
         if (tlb_addr & TLB_EXCL) {
+
             smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
                                        retaddr);
             return;
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index e22afc5..19ea52d 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
     cs->exception_index = excp;
     env->exception.syndrome = syndrome;
     env->exception.target_el = target_el;
+    tcg_exclusive_lock();
     cc->cpu_reset_excl_context(cs);
+    tcg_exclusive_unlock();
     cpu_loop_exit(cs);
 }
 
@@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
     CPUState *cs = ENV_GET_CPU(env);
     CPUClass *cc = CPU_GET_CLASS(cs);
 
+    tcg_exclusive_lock();
     cc->cpu_reset_excl_context(cs);
+    tcg_exclusive_unlock();
 }
 
 uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
@@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
 
     aarch64_save_sp(env, cur_el);
 
+    tcg_exclusive_lock();
     cc->cpu_reset_excl_context(cs);
+    tcg_exclusive_unlock();
 
     /* We must squash the PSTATE.SS bit to zero unless both of the
      * following hold:
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel]  [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}() Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-06-08 13:54   ` Alex Bennée
  2016-05-26 16:35 ` [Qemu-devel] [RFC 04/10] cputlb: Introduce tlb_flush_other() Alvise Rigo
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

Introduce a new function that allows the calling VCPU to add a work item
to another VCPU (the target VCPU). This new function differs from
async_run_on_cpu() in that it makes the calling VCPU wait for the target
VCPU to finish the work item. The mechanism uses halt_cond to wait and,
while waiting, to process any pending work items.
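
A usage sketch (the callback name is hypothetical):

    static void remote_work(CPUState *cpu, void *data)
    {
        /* executed by @cpu's thread */
    }

    /* from @wcpu's thread: */
    async_wait_run_on_cpu(cpu, wcpu, remote_work, NULL);
    /* cpu_exit(wcpu) is called internally; once back in the CPU loop,
     * @wcpu does not execute guest code until its pending_work_items
     * counter drops back to zero. */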

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cpus.c            | 44 ++++++++++++++++++++++++++++++++++++++++++--
 include/qom/cpu.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/cpus.c b/cpus.c
index b9ec903..7bc96e2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -89,7 +89,7 @@ static bool cpu_thread_is_idle(CPUState *cpu)
     if (cpu->stop || cpu->queued_work_first) {
         return false;
     }
-    if (cpu_is_stopped(cpu)) {
+    if (cpu_is_stopped(cpu) || async_waiting_for_work(cpu)) {
         return true;
     }
     if (!cpu->halted || cpu_has_work(cpu) ||
@@ -1012,6 +1012,7 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
     wi->func = func;
     wi->data = data;
     wi->free = true;
+    wi->wcpu = NULL;
 
     qemu_mutex_lock(&cpu->work_mutex);
     if (cpu->queued_work_first == NULL) {
@@ -1027,6 +1028,40 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
     qemu_cpu_kick(cpu);
 }
 
+void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu, run_on_cpu_func func,
+                                                                    void *data)
+{
+    struct qemu_work_item *wwi;
+
+    assert(wcpu != cpu);
+
+    wwi = g_malloc0(sizeof(struct qemu_work_item));
+    wwi->func = func;
+    wwi->data = data;
+    wwi->free = true;
+    wwi->wcpu = wcpu;
+
+    /* Increase the number of pending work items */
+    atomic_inc(&wcpu->pending_work_items);
+
+    qemu_mutex_lock(&cpu->work_mutex);
+    /* Add the waiting work items at the beginning to free as soon as possible
+     * the waiting CPU. */
+    if (cpu->queued_work_first == NULL) {
+        cpu->queued_work_last = wwi;
+    } else {
+        wwi->next = cpu->queued_work_first;
+    }
+    cpu->queued_work_first = wwi;
+    wwi->done = false;
+    qemu_mutex_unlock(&cpu->work_mutex);
+
+    qemu_cpu_kick(cpu);
+
+    /* In order to wait, @wcpu has to exit the CPU loop */
+    cpu_exit(wcpu);
+}
+
 /*
  * Safe work interface
  *
@@ -1120,6 +1155,10 @@ static void flush_queued_work(CPUState *cpu)
         qemu_mutex_unlock(&cpu->work_mutex);
         wi->func(cpu, wi->data);
         qemu_mutex_lock(&cpu->work_mutex);
+        if (wi->wcpu != NULL) {
+            atomic_dec(&wi->wcpu->pending_work_items);
+            qemu_cond_broadcast(wi->wcpu->halt_cond);
+        }
         if (wi->free) {
             g_free(wi);
         } else {
@@ -1406,7 +1445,8 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     while (1) {
         bool sleep = false;
 
-        if (cpu_can_run(cpu) && !async_safe_work_pending()) {
+        if (cpu_can_run(cpu) && !async_safe_work_pending()
+                             && !async_waiting_for_work(cpu)) {
             int r;
 
             atomic_inc(&tcg_scheduled_cpus);
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 019f06d..7be82ed 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -259,6 +259,8 @@ struct qemu_work_item {
     void *data;
     int done;
     bool free;
+    /* CPU waiting for this work item to finish. If NULL, no CPU is waiting. */
+    CPUState *wcpu;
 };
 
 /**
@@ -303,6 +305,7 @@ struct qemu_work_item {
  * @kvm_fd: vCPU file descriptor for KVM.
  * @work_mutex: Lock to prevent multiple access to queued_work_*.
  * @queued_work_first: First asynchronous work pending.
+ * @pending_work_items: Work items for which the CPU needs to wait completion.
  *
  * State of one CPU core or thread.
  */
@@ -337,6 +340,7 @@ struct CPUState {
 
     QemuMutex work_mutex;
     struct qemu_work_item *queued_work_first, *queued_work_last;
+    int pending_work_items;
 
     CPUAddressSpace *cpu_ases;
     int num_ases;
@@ -398,6 +402,9 @@ struct CPUState {
      * by a stcond (see softmmu_template.h). */
     bool excl_succeeded;
 
+    /* True if some CPU requested a TLB flush for this CPU. */
+    bool pending_tlb_flush;
+
     /* Note that this is accessed at the start of every TB via a negative
        offset from AREG0.  Leave this field at the end so as to make the
        (absolute value) offset as small as possible.  This reduces code
@@ -680,6 +687,19 @@ void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
 void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
 
 /**
+ * async_wait_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @wcpu: The vCPU submitting the work.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously.
+ * The vCPU @wcpu will wait for @cpu to finish the job.
+ */
+void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu, run_on_cpu_func func,
+                                                                   void *data);
+
+/**
  * async_safe_run_on_cpu:
  * @cpu: The vCPU to run on.
  * @func: The function to be executed.
@@ -699,6 +719,17 @@ void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
 bool async_safe_work_pending(void);
 
 /**
+ * async_waiting_for_work:
+ *
+ * Check whether there are work items for which @cpu is waiting completion.
+ * Returns: @true if work items are pending for completion, @false otherwise.
+ */
+static inline bool async_waiting_for_work(CPUState *cpu)
+{
+    return atomic_mb_read(&cpu->pending_work_items) != 0;
+}
+
+/**
  * qemu_get_cpu:
  * @index: The CPUState@cpu_index value of the CPU to obtain.
  *
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel]  [RFC 04/10] cputlb: Introduce tlb_flush_other()
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
                   ` (2 preceding siblings ...)
  2016-05-26 16:35 ` [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu() Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 05/10] target-arm: End TB after ldrex instruction Alvise Rigo
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

In some cases (like in softmmu_llsc_template.h) we know for certain that
we need to flush other VCPUs' TLBs. tlb_flush_other() serves this
purpose, allowing the VCPU @cpu to request a global flush of @other.

In addition, use it in softmmu_llsc_template.h and, where possible, in
tlb_flush().
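
The typical caller pattern (mirroring the softmmu_llsc_template.h hunk
below) is:

    CPU_FOREACH(cpu) {
        if (this_cpu != cpu) {
            tlb_flush_other(this_cpu, cpu, 1);   /* global flush of @cpu */
        }
    }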

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                | 28 +++++++++++++++++++++++-----
 softmmu_llsc_template.h |  2 +-
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 1586b64..55f7447 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -81,12 +81,24 @@ static void tlb_flush_nocheck(CPUState *cpu, int flush_global)
     env->tlb_flush_addr = -1;
     env->tlb_flush_mask = 0;
     tlb_flush_count++;
-    /* atomic_mb_set(&cpu->pending_tlb_flush, 0); */
 }
 
 static void tlb_flush_global_async_work(CPUState *cpu, void *opaque)
 {
     tlb_flush_nocheck(cpu, GPOINTER_TO_INT(opaque));
+    atomic_mb_set(&cpu->pending_tlb_flush, false);
+}
+
+static void tlb_flush_other(CPUState *cpu, CPUState *other, int flush_global)
+{
+    if (other->created) {
+        if (!atomic_xchg(&other->pending_tlb_flush, true)) {
+            async_wait_run_on_cpu(other, cpu, tlb_flush_global_async_work,
+                                  GINT_TO_POINTER(flush_global));
+        }
+    } else {
+        tlb_flush_nocheck(other, flush_global);
+    }
 }
 
 /* NOTE:
@@ -103,11 +115,17 @@ static void tlb_flush_global_async_work(CPUState *cpu, void *opaque)
  */
 void tlb_flush(CPUState *cpu, int flush_global)
 {
-    if (cpu->created) {
-        async_run_on_cpu(cpu, tlb_flush_global_async_work,
-                         GINT_TO_POINTER(flush_global));
-    } else {
+    /* if @cpu has not been created yet or it is the current_cpu, we do not
+     * need to query the flush. */
+    if (current_cpu == cpu || !cpu->created) {
         tlb_flush_nocheck(cpu, flush_global);
+    } else {
+        if (current_cpu) {
+            tlb_flush_other(current_cpu, cpu, flush_global);
+        } else {
+            async_run_on_cpu(cpu, tlb_flush_global_async_work,
+                             GINT_TO_POINTER(flush_global));
+        }
     }
 }
 
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index d3810c0..51ce58f 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -81,7 +81,7 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
             excl_history_put_addr(hw_addr);
             CPU_FOREACH(cpu) {
                 if (this_cpu != cpu) {
-                    tlb_flush(cpu, 1);
+                    tlb_flush_other(this_cpu, cpu, 1);
                 }
             }
         }
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [RFC 05/10] target-arm: End TB after ldrex instruction
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
                   ` (3 preceding siblings ...)
  2016-05-26 16:35 ` [Qemu-devel] [RFC 04/10] cputlb: Introduce tlb_flush_other() Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 06/10] cputlb: Add tlb_tables_flush_bitmap() Alvise Rigo
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

A VCPU executing a ldrex instruction might request TLB flushes of other
VCPUs: in this case, the calling VCPU uses cpu_exit() to leave the cpu
loop and waits for the other VCPUs to perform the flush. In order to
leave the cpu loop as soon as possible, end the TB after the ldrex
instruction.

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 target-arm/translate-a64.c | 2 ++
 target-arm/translate.c     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 376cb1c..2a14c14 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1875,6 +1875,8 @@ static void disas_ldst_excl(DisasContext *s, uint32_t insn)
         if (!is_store) {
             s->is_ldex = true;
             gen_load_exclusive(s, rt, rt2, tcg_addr, size, is_pair);
+            gen_a64_set_pc_im(s->pc);
+            s->is_jmp = DISAS_JUMP;
         } else {
             gen_store_exclusive(s, rs, rt, rt2, tcg_addr, size, is_pair);
         }
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 0677e04..7c1cb19 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -8807,6 +8807,8 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
                             default:
                                 abort();
                             }
+                            gen_set_pc_im(s, s->pc);
+                            s->is_jmp = DISAS_JUMP;
                         } else {
                             rm = insn & 0xf;
                             switch (op1) {
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel]  [RFC 06/10] cputlb: Add tlb_tables_flush_bitmap()
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
                   ` (4 preceding siblings ...)
  2016-05-26 16:35 ` [Qemu-devel] [RFC 05/10] target-arm: End TB after ldrex instruction Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 07/10] cputlb: Query tlb_flush_by_mmuidx Alvise Rigo
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

Add a simple helper function to flush the TLB at the indexes specified
by a bitmap. The function becomes more useful in the following patches,
when tlb_flush_by_mmuidx() requests can be submitted to other VCPUs.
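
A usage sketch, mirroring what v_tlb_flush_by_mmuidx() does below (the MMU
index variables are placeholders):

    DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };

    set_bit(mmu_idx_a, idxmap);            /* indexes to flush */
    set_bit(mmu_idx_b, idxmap);
    tlb_tables_flush_bitmap(cpu, idxmap);  /* flushes them plus tb_jmp_cache */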

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 55f7447..5bbbf1b 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -129,15 +129,34 @@ void tlb_flush(CPUState *cpu, int flush_global)
     }
 }
 
-static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+/* Flush tlb_table[] and tlb_v_table[] of @cpu at MMU indexes given by @bitmap.
+ * Flush also tb_jmp_cache. */
+static inline void tlb_tables_flush_bitmap(CPUState *cpu, unsigned long *bitmap)
 {
-    CPUArchState *env = cpu->env_ptr;
+    int mmu_idx;
 
     tlb_debug("start\n");
     /* must reset current TB so that interrupts cannot modify the
        links while we are modifying them */
     cpu->current_tb = NULL;
 
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        if (test_bit(mmu_idx, bitmap)) {
+            CPUArchState *env = cpu->env_ptr;
+
+            tlb_debug("%d\n", mmu_idx);
+
+            memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
+            memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+        }
+    }
+    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+}
+
+static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+{
+    DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
+
     for (;;) {
         int mmu_idx = va_arg(argp, int);
 
@@ -145,13 +164,10 @@ static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
             break;
         }
 
-        tlb_debug("%d\n", mmu_idx);
-
-        memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
-        memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+        set_bit(mmu_idx, idxmap);
     }
 
-    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+    tlb_tables_flush_bitmap(cpu, idxmap);
 }
 
 void tlb_flush_by_mmuidx(CPUState *cpu, ...)
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel]  [RFC 07/10] cputlb: Query tlb_flush_by_mmuidx
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
                   ` (5 preceding siblings ...)
  2016-05-26 16:35 ` [Qemu-devel] [RFC 06/10] cputlb: Add tlb_tables_flush_bitmap() Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 08/10] cputlb: Query tlb_flush_page_by_mmuidx Alvise Rigo
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

Some architectures need to flush the TLB by MMU index. As with
tlb_flush(), these flushes also have to be properly submitted to the
target VCPU. For the time being, this type of flush is used only by the
ARM/aarch64 targets and is the result of guest instruction emulation, so
we can always safely obtain the CPUState of the current VCPU without
relying on current_cpu. This, however, slightly complicates the function
prototype by adding an argument pointing to the current VCPU's CPUState.
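
With the new prototype, the callers in target-arm/helper.c (see the hunks
below) look like:

    /* local flush: both arguments are the same CPU */
    tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S1E2, -1);

    /* cross-VCPU flush: @this_cs submits the flush to @other_cs and waits */
    tlb_flush_by_mmuidx(this_cs, other_cs, ARMMMUIdx_S12NSE1,
                        ARMMMUIdx_S12NSE0, -1);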

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                | 49 +++++++++++++++++++++++++++++++++++++++----------
 include/exec/exec-all.h |  4 ++--
 target-arm/helper.c     | 40 +++++++++++++++++++++-------------------
 3 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 5bbbf1b..73624d6 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -59,6 +59,8 @@
 /* We need a solution for stuffing 64 bit pointers in 32 bit ones if
  * we care about this combination */
 QEMU_BUILD_BUG_ON(sizeof(target_ulong) > sizeof(void *));
+/* Size, in bytes, of the bitmap used by tlb_flush_by_mmuidx functions */
+#define MMUIDX_BITMAP_SIZE sizeof(unsigned long) * BITS_TO_LONGS(NB_MMU_MODES)
 
 /* statistics */
 int tlb_flush_count;
@@ -153,10 +155,41 @@ static inline void tlb_tables_flush_bitmap(CPUState *cpu, unsigned long *bitmap)
     memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
 }
 
-static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+struct TLBFlushByMMUIdxParams {
+    DECLARE_BITMAP(idx_to_flush, NB_MMU_MODES);
+};
+
+static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, void *opaque)
+{
+    struct TLBFlushByMMUIdxParams *params = opaque;
+
+    tlb_tables_flush_bitmap(cpu, params->idx_to_flush);
+
+    g_free(params);
+}
+
+static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target,
+                                         unsigned long *idxmap)
 {
+    if (!qemu_cpu_is_self(target)) {
+        struct TLBFlushByMMUIdxParams *params;
+
+        params = g_malloc(sizeof(struct TLBFlushByMMUIdxParams));
+        memcpy(params->idx_to_flush, idxmap, MMUIDX_BITMAP_SIZE);
+        async_wait_run_on_cpu(target, cpu, tlb_flush_by_mmuidx_async_work,
+                              params);
+    } else {
+        tlb_tables_flush_bitmap(cpu, idxmap);
+    }
+}
+
+void tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target_cpu, ...)
+{
+    va_list argp;
     DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
 
+    va_start(argp, target_cpu);
+
     for (;;) {
         int mmu_idx = va_arg(argp, int);
 
@@ -167,15 +200,9 @@ static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
         set_bit(mmu_idx, idxmap);
     }
 
-    tlb_tables_flush_bitmap(cpu, idxmap);
-}
-
-void tlb_flush_by_mmuidx(CPUState *cpu, ...)
-{
-    va_list argp;
-    va_start(argp, cpu);
-    v_tlb_flush_by_mmuidx(cpu, argp);
     va_end(argp);
+
+    v_tlb_flush_by_mmuidx(cpu, target_cpu, idxmap);
 }
 
 static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
@@ -244,7 +271,9 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...)
                   TARGET_FMT_lx "/" TARGET_FMT_lx ")\n",
                   env->tlb_flush_addr, env->tlb_flush_mask);
 
-        v_tlb_flush_by_mmuidx(cpu, argp);
+        /* Temporarily use current_cpu until tlb_flush_page_by_mmuidx
+         * is reworked */
+        tlb_flush_by_mmuidx(current_cpu, cpu, argp);
         va_end(argp);
         return;
     }
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index bc97683..066870b 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -152,7 +152,7 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...);
  * Flush all entries from the TLB of the specified CPU, for the specified
  * MMU indexes.
  */
-void tlb_flush_by_mmuidx(CPUState *cpu, ...);
+void tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target, ...);
 /**
  * tlb_set_page_with_attrs:
  * @cpu: CPU to add this TLB entry for
@@ -205,7 +205,7 @@ static inline void tlb_flush_page_by_mmuidx(CPUState *cpu,
 {
 }
 
-static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
+static inline void tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target, ...)
 {
 }
 static inline void tlb_flush_page_all(target_ulong addr)
diff --git a/target-arm/helper.c b/target-arm/helper.c
index bc9fbda..3dcd910 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -2388,7 +2388,7 @@ static void vttbr_write(CPUARMState *env, const ARMCPRegInfo *ri,
 
     /* Accesses to VTTBR may change the VMID so we must flush the TLB.  */
     if (raw_read(env, ri) != value) {
-        tlb_flush_by_mmuidx(cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0,
+        tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0,
                             ARMMMUIdx_S2NS, -1);
         raw_write(env, ri, value);
     }
@@ -2748,9 +2748,9 @@ static void tlbi_aa64_vmalle1_write(CPUARMState *env, const ARMCPRegInfo *ri,
     CPUState *cs = CPU(cpu);
 
     if (arm_is_secure_below_el3(env)) {
-        tlb_flush_by_mmuidx(cs, ARMMMUIdx_S1SE1, ARMMMUIdx_S1SE0, -1);
+        tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S1SE1, ARMMMUIdx_S1SE0, -1);
     } else {
-        tlb_flush_by_mmuidx(cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0, -1);
+        tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0, -1);
     }
 }
 
@@ -2758,13 +2758,14 @@ static void tlbi_aa64_vmalle1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                       uint64_t value)
 {
     bool sec = arm_is_secure_below_el3(env);
-    CPUState *other_cs;
+    CPUState *other_cs, *this_cs = ENV_GET_CPU(env);
 
     CPU_FOREACH(other_cs) {
         if (sec) {
-            tlb_flush_by_mmuidx(other_cs, ARMMMUIdx_S1SE1, ARMMMUIdx_S1SE0, -1);
+            tlb_flush_by_mmuidx(this_cs, other_cs, ARMMMUIdx_S1SE1,
+                                ARMMMUIdx_S1SE0, -1);
         } else {
-            tlb_flush_by_mmuidx(other_cs, ARMMMUIdx_S12NSE1,
+            tlb_flush_by_mmuidx(this_cs, other_cs, ARMMMUIdx_S12NSE1,
                                 ARMMMUIdx_S12NSE0, -1);
         }
     }
@@ -2781,13 +2782,13 @@ static void tlbi_aa64_alle1_write(CPUARMState *env, const ARMCPRegInfo *ri,
     CPUState *cs = CPU(cpu);
 
     if (arm_is_secure_below_el3(env)) {
-        tlb_flush_by_mmuidx(cs, ARMMMUIdx_S1SE1, ARMMMUIdx_S1SE0, -1);
+        tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S1SE1, ARMMMUIdx_S1SE0, -1);
     } else {
         if (arm_feature(env, ARM_FEATURE_EL2)) {
-            tlb_flush_by_mmuidx(cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0,
+            tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0,
                                 ARMMMUIdx_S2NS, -1);
         } else {
-            tlb_flush_by_mmuidx(cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0, -1);
+            tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0, -1);
         }
     }
 }
@@ -2798,7 +2799,7 @@ static void tlbi_aa64_alle2_write(CPUARMState *env, const ARMCPRegInfo *ri,
     ARMCPU *cpu = arm_env_get_cpu(env);
     CPUState *cs = CPU(cpu);
 
-    tlb_flush_by_mmuidx(cs, ARMMMUIdx_S1E2, -1);
+    tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S1E2, -1);
 }
 
 static void tlbi_aa64_alle3_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -2807,7 +2808,7 @@ static void tlbi_aa64_alle3_write(CPUARMState *env, const ARMCPRegInfo *ri,
     ARMCPU *cpu = arm_env_get_cpu(env);
     CPUState *cs = CPU(cpu);
 
-    tlb_flush_by_mmuidx(cs, ARMMMUIdx_S1E3, -1);
+    tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S1E3, -1);
 }
 
 static void tlbi_aa64_alle1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -2819,16 +2820,17 @@ static void tlbi_aa64_alle1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
      */
     bool sec = arm_is_secure_below_el3(env);
     bool has_el2 = arm_feature(env, ARM_FEATURE_EL2);
-    CPUState *other_cs;
+    CPUState *other_cs, *this_cs = ENV_GET_CPU(env);
 
     CPU_FOREACH(other_cs) {
         if (sec) {
-            tlb_flush_by_mmuidx(other_cs, ARMMMUIdx_S1SE1, ARMMMUIdx_S1SE0, -1);
+            tlb_flush_by_mmuidx(this_cs, other_cs, ARMMMUIdx_S1SE1,
+                                ARMMMUIdx_S1SE0, -1);
         } else if (has_el2) {
-            tlb_flush_by_mmuidx(other_cs, ARMMMUIdx_S12NSE1,
+            tlb_flush_by_mmuidx(this_cs, other_cs, ARMMMUIdx_S12NSE1,
                                 ARMMMUIdx_S12NSE0, ARMMMUIdx_S2NS, -1);
         } else {
-            tlb_flush_by_mmuidx(other_cs, ARMMMUIdx_S12NSE1,
+            tlb_flush_by_mmuidx(this_cs, other_cs, ARMMMUIdx_S12NSE1,
                                 ARMMMUIdx_S12NSE0, -1);
         }
     }
@@ -2837,20 +2839,20 @@ static void tlbi_aa64_alle1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
 static void tlbi_aa64_alle2is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                     uint64_t value)
 {
-    CPUState *other_cs;
+    CPUState *other_cs, *this_cs = ENV_GET_CPU(env);
 
     CPU_FOREACH(other_cs) {
-        tlb_flush_by_mmuidx(other_cs, ARMMMUIdx_S1E2, -1);
+        tlb_flush_by_mmuidx(this_cs, other_cs, ARMMMUIdx_S1E2, -1);
     }
 }
 
 static void tlbi_aa64_alle3is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                     uint64_t value)
 {
-    CPUState *other_cs;
+    CPUState *other_cs, *this_cs = ENV_GET_CPU(env);
 
     CPU_FOREACH(other_cs) {
-        tlb_flush_by_mmuidx(other_cs, ARMMMUIdx_S1E3, -1);
+        tlb_flush_by_mmuidx(this_cs, other_cs, ARMMMUIdx_S1E3, -1);
     }
 }
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel]  [RFC 08/10] cputlb: Query tlb_flush_page_by_mmuidx
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
                   ` (6 preceding siblings ...)
  2016-05-26 16:35 ` [Qemu-devel] [RFC 07/10] cputlb: Query tlb_flush_by_mmuidx Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 09/10] cputlb: Query tlb_flush_page_all Alvise Rigo
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

As in the previous commit, make tlb_flush_page_by_mmuidx() submit the
flushes as work items when targeting a different VCPU.
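
As with the previous patch, the callers (see the target-arm/helper.c hunks
below) now pass the requesting VCPU explicitly:

    /* local flush of one page ... */
    tlb_flush_page_by_mmuidx(cs, cs, pageaddr, ARMMMUIdx_S1E2, -1);

    /* ... or a flush submitted to another VCPU and waited for by @this_cs */
    tlb_flush_page_by_mmuidx(this_cs, other_cs, pageaddr,
                             ARMMMUIdx_S1E3, -1);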

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                | 90 ++++++++++++++++++++++++++++++++++---------------
 include/exec/exec-all.h |  5 +--
 target-arm/helper.c     | 35 ++++++++++---------
 3 files changed, 85 insertions(+), 45 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 73624d6..77a1997 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -157,6 +157,8 @@ static inline void tlb_tables_flush_bitmap(CPUState *cpu, unsigned long *bitmap)
 
 struct TLBFlushByMMUIdxParams {
     DECLARE_BITMAP(idx_to_flush, NB_MMU_MODES);
+    /* Used by tlb_flush_page_by_mmuidx */
+    target_ulong addr;
 };
 
 static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, void *opaque)
@@ -255,28 +257,13 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
     tb_flush_jmp_cache(cpu, addr);
 }
 
-void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...)
+static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu, void *opaque)
 {
     CPUArchState *env = cpu->env_ptr;
-    int i, k;
-    va_list argp;
-
-    va_start(argp, addr);
-
-    tlb_debug("addr "TARGET_FMT_lx"\n", addr);
-
-    /* Check if we need to flush due to large pages.  */
-    if ((addr & env->tlb_flush_mask) == env->tlb_flush_addr) {
-        tlb_debug("forced full flush ("
-                  TARGET_FMT_lx "/" TARGET_FMT_lx ")\n",
-                  env->tlb_flush_addr, env->tlb_flush_mask);
+    struct TLBFlushByMMUIdxParams *params = opaque;
+    target_ulong addr = params->addr;
+    int mmu_idx, i;
 
-        /* Temporarily use current_cpu until tlb_flush_page_by_mmuidx
-         * is reworked */
-        tlb_flush_by_mmuidx(current_cpu, cpu, argp);
-        va_end(argp);
-        return;
-    }
     /* must reset current TB so that interrupts cannot modify the
        links while we are modifying them */
     cpu->current_tb = NULL;
@@ -284,6 +271,49 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...)
     addr &= TARGET_PAGE_MASK;
     i = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
 
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        if (test_bit(mmu_idx, params->idx_to_flush)) {
+            int k;
+
+            tlb_debug("idx %d\n", mmu_idx);
+            tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
+            /* check whether there are vltb entries that need to be flushed */
+            for (k = 0; k < CPU_VTLB_SIZE; k++) {
+                tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
+            }
+        }
+    }
+
+    tb_flush_jmp_cache(cpu, addr);
+
+    g_free(params);
+}
+
+static void v_tlb_flush_page_by_mmuidx(CPUState *cpu, CPUState *target_cpu,
+                                target_ulong addr, unsigned long *idxmap)
+{
+    if (!qemu_cpu_is_self(target_cpu)) {
+        struct TLBFlushByMMUIdxParams *params;
+
+        params = g_malloc(sizeof(struct TLBFlushByMMUIdxParams));
+        params->addr = addr;
+        memcpy(params->idx_to_flush, idxmap, MMUIDX_BITMAP_SIZE);
+        async_wait_run_on_cpu(target_cpu, cpu,
+                              tlb_flush_page_by_mmuidx_async_work, params);
+    } else {
+        tlb_tables_flush_bitmap(cpu, idxmap);
+    }
+}
+
+void tlb_flush_page_by_mmuidx(CPUState *cpu, CPUState *target,
+                              target_ulong addr, ...)
+{
+    DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
+    CPUArchState *env = target->env_ptr;
+    va_list argp;
+
+    va_start(argp, addr);
+
     for (;;) {
         int mmu_idx = va_arg(argp, int);
 
@@ -291,18 +321,24 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...)
             break;
         }
 
-        tlb_debug("idx %d\n", mmu_idx);
+        set_bit(mmu_idx, idxmap);
+    }
 
-        tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
+    va_end(argp);
 
-        /* check whether there are vltb entries that need to be flushed */
-        for (k = 0; k < CPU_VTLB_SIZE; k++) {
-            tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
-        }
+    tlb_debug("addr "TARGET_FMT_lx"\n", addr);
+
+    /* Check if we need to flush due to large pages.  */
+    if ((addr & env->tlb_flush_mask) == env->tlb_flush_addr) {
+        tlb_debug("forced full flush ("
+                  TARGET_FMT_lx "/" TARGET_FMT_lx ")\n",
+                  env->tlb_flush_addr, env->tlb_flush_mask);
+
+        v_tlb_flush_by_mmuidx(cpu, target, idxmap);
+        return;
     }
-    va_end(argp);
 
-    tb_flush_jmp_cache(cpu, addr);
+    v_tlb_flush_page_by_mmuidx(cpu, target, addr, idxmap);
 }
 
 static void tlb_flush_page_async_work(CPUState *cpu, void *opaque)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 066870b..cb891d2 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -143,7 +143,8 @@ void tlb_flush(CPUState *cpu, int flush_global);
  * Flush one page from the TLB of the specified CPU, for the specified
  * MMU indexes.
  */
-void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...);
+void tlb_flush_page_by_mmuidx(CPUState *cpu, CPUState *target,
+                              target_ulong addr, ...);
 /**
  * tlb_flush_by_mmuidx:
  * @cpu: CPU whose TLB should be flushed
@@ -200,7 +201,7 @@ static inline void tlb_flush(CPUState *cpu, int flush_global)
 {
 }
 
-static inline void tlb_flush_page_by_mmuidx(CPUState *cpu,
+static inline void tlb_flush_page_by_mmuidx(CPUState *cpu, CPUState *target,
                                             target_ulong addr, ...)
 {
 }
diff --git a/target-arm/helper.c b/target-arm/helper.c
index 3dcd910..0187c0a 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -2869,10 +2869,10 @@ static void tlbi_aa64_vae1_write(CPUARMState *env, const ARMCPRegInfo *ri,
     uint64_t pageaddr = sextract64(value << 12, 0, 56);
 
     if (arm_is_secure_below_el3(env)) {
-        tlb_flush_page_by_mmuidx(cs, pageaddr, ARMMMUIdx_S1SE1,
+        tlb_flush_page_by_mmuidx(cs, cs, pageaddr, ARMMMUIdx_S1SE1,
                                  ARMMMUIdx_S1SE0, -1);
     } else {
-        tlb_flush_page_by_mmuidx(cs, pageaddr, ARMMMUIdx_S12NSE1,
+        tlb_flush_page_by_mmuidx(cs, cs, pageaddr, ARMMMUIdx_S12NSE1,
                                  ARMMMUIdx_S12NSE0, -1);
     }
 }
@@ -2888,7 +2888,7 @@ static void tlbi_aa64_vae2_write(CPUARMState *env, const ARMCPRegInfo *ri,
     CPUState *cs = CPU(cpu);
     uint64_t pageaddr = sextract64(value << 12, 0, 56);
 
-    tlb_flush_page_by_mmuidx(cs, pageaddr, ARMMMUIdx_S1E2, -1);
+    tlb_flush_page_by_mmuidx(cs, cs, pageaddr, ARMMMUIdx_S1E2, -1);
 }
 
 static void tlbi_aa64_vae3_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -2902,23 +2902,23 @@ static void tlbi_aa64_vae3_write(CPUARMState *env, const ARMCPRegInfo *ri,
     CPUState *cs = CPU(cpu);
     uint64_t pageaddr = sextract64(value << 12, 0, 56);
 
-    tlb_flush_page_by_mmuidx(cs, pageaddr, ARMMMUIdx_S1E3, -1);
+    tlb_flush_page_by_mmuidx(cs, cs, pageaddr, ARMMMUIdx_S1E3, -1);
 }
 
 static void tlbi_aa64_vae1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                    uint64_t value)
 {
     bool sec = arm_is_secure_below_el3(env);
-    CPUState *other_cs;
+    CPUState *other_cs, *this_cs = ENV_GET_CPU(env);
     uint64_t pageaddr = sextract64(value << 12, 0, 56);
 
     CPU_FOREACH(other_cs) {
         if (sec) {
-            tlb_flush_page_by_mmuidx(other_cs, pageaddr, ARMMMUIdx_S1SE1,
-                                     ARMMMUIdx_S1SE0, -1);
+            tlb_flush_page_by_mmuidx(this_cs, other_cs, pageaddr,
+                                     ARMMMUIdx_S1SE1, ARMMMUIdx_S1SE0, -1);
         } else {
-            tlb_flush_page_by_mmuidx(other_cs, pageaddr, ARMMMUIdx_S12NSE1,
-                                     ARMMMUIdx_S12NSE0, -1);
+            tlb_flush_page_by_mmuidx(this_cs, other_cs, pageaddr,
+                                     ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0, -1);
         }
     }
 }
@@ -2926,22 +2926,24 @@ static void tlbi_aa64_vae1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
 static void tlbi_aa64_vae2is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                    uint64_t value)
 {
-    CPUState *other_cs;
+    CPUState *other_cs, *this_cs = ENV_GET_CPU(env);
     uint64_t pageaddr = sextract64(value << 12, 0, 56);
 
     CPU_FOREACH(other_cs) {
-        tlb_flush_page_by_mmuidx(other_cs, pageaddr, ARMMMUIdx_S1E2, -1);
+        tlb_flush_page_by_mmuidx(this_cs, other_cs, pageaddr,
+                                 ARMMMUIdx_S1E2, -1);
     }
 }
 
 static void tlbi_aa64_vae3is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                    uint64_t value)
 {
-    CPUState *other_cs;
+    CPUState *other_cs, *this_cs = ENV_GET_CPU(env);
     uint64_t pageaddr = sextract64(value << 12, 0, 56);
 
     CPU_FOREACH(other_cs) {
-        tlb_flush_page_by_mmuidx(other_cs, pageaddr, ARMMMUIdx_S1E3, -1);
+        tlb_flush_page_by_mmuidx(this_cs, other_cs, pageaddr,
+                                 ARMMMUIdx_S1E3, -1);
     }
 }
 
@@ -2964,13 +2966,13 @@ static void tlbi_aa64_ipas2e1_write(CPUARMState *env, const ARMCPRegInfo *ri,
 
     pageaddr = sextract64(value << 12, 0, 48);
 
-    tlb_flush_page_by_mmuidx(cs, pageaddr, ARMMMUIdx_S2NS, -1);
+    tlb_flush_page_by_mmuidx(cs, cs, pageaddr, ARMMMUIdx_S2NS, -1);
 }
 
 static void tlbi_aa64_ipas2e1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                       uint64_t value)
 {
-    CPUState *other_cs;
+    CPUState *other_cs, *this_cs = ENV_GET_CPU(env);
     uint64_t pageaddr;
 
     if (!arm_feature(env, ARM_FEATURE_EL2) || !(env->cp15.scr_el3 & SCR_NS)) {
@@ -2980,7 +2982,8 @@ static void tlbi_aa64_ipas2e1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
     pageaddr = sextract64(value << 12, 0, 48);
 
     CPU_FOREACH(other_cs) {
-        tlb_flush_page_by_mmuidx(other_cs, pageaddr, ARMMMUIdx_S2NS, -1);
+        tlb_flush_page_by_mmuidx(this_cs, other_cs, pageaddr,
+                                 ARMMMUIdx_S2NS, -1);
     }
 }
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel]  [RFC 09/10] cputlb: Query tlb_flush_page_all
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
                   ` (7 preceding siblings ...)
  2016-05-26 16:35 ` [Qemu-devel] [RFC 08/10] cputlb: Query tlb_flush_page_by_mmuidx Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-05-26 16:35 ` [Qemu-devel] [RFC 10/10] cpus: Do not sleep if some work item is pending Alvise Rigo
  2016-06-10 15:21 ` [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alex Bennée
  10 siblings, 0 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

Make tlb_flush_page_all() safe by waiting, through
async_wait_run_on_cpu(), for the requested flushes to actually complete.
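
The callers now pass the requesting VCPU explicitly, e.g. (see the
target-arm/helper.c hunk below):

    tlb_flush_page_all(ENV_GET_CPU(env), value & TARGET_PAGE_MASK);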

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                | 15 ++++++++++-----
 include/exec/exec-all.h |  4 ++--
 target-arm/helper.c     |  4 ++--
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 77a1997..4ed0cc8 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -346,13 +346,18 @@ static void tlb_flush_page_async_work(CPUState *cpu, void *opaque)
     tlb_flush_page(cpu, GPOINTER_TO_UINT(opaque));
 }
 
-void tlb_flush_page_all(target_ulong addr)
+void tlb_flush_page_all(CPUState *this_cpu, target_ulong addr)
 {
-    CPUState *cpu;
+    CPUState *other_cpu;
 
-    CPU_FOREACH(cpu) {
-        async_run_on_cpu(cpu, tlb_flush_page_async_work,
-                         GUINT_TO_POINTER(addr));
+    CPU_FOREACH(other_cpu) {
+        if (other_cpu != this_cpu) {
+            async_wait_run_on_cpu(other_cpu, this_cpu,
+                                  tlb_flush_page_async_work,
+                                  GUINT_TO_POINTER(addr));
+        } else {
+            tlb_flush_page(current_cpu, addr);
+        }
     }
 }
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index cb891d2..36f1b81 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -191,7 +191,7 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
 void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr);
 void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
                  uintptr_t retaddr);
-void tlb_flush_page_all(target_ulong addr);
+void tlb_flush_page_all(CPUState *this_cpu, target_ulong addr);
 #else
 static inline void tlb_flush_page(CPUState *cpu, target_ulong addr)
 {
@@ -209,7 +209,7 @@ static inline void tlb_flush_page_by_mmuidx(CPUState *cpu, CPUState *target,
 static inline void tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target, ...)
 {
 }
-static inline void tlb_flush_page_all(target_ulong addr)
+static inline void tlb_flush_page_all(CPUState *this_cpu, target_ulong addr)
 {
 }
 #endif
diff --git a/target-arm/helper.c b/target-arm/helper.c
index 0187c0a..8988c8b 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -554,13 +554,13 @@ static void tlbiasid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
 static void tlbimva_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    tlb_flush_page_all(value & TARGET_PAGE_MASK);
+    tlb_flush_page_all(ENV_GET_CPU(env), value & TARGET_PAGE_MASK);
 }
 
 static void tlbimvaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    tlb_flush_page_all(value & TARGET_PAGE_MASK);
+    tlb_flush_page_all(ENV_GET_CPU(env), value & TARGET_PAGE_MASK);
 }
 
 static const ARMCPRegInfo cp_reginfo[] = {
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [RFC 10/10] cpus: Do not sleep if some work item is pending
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
                   ` (8 preceding siblings ...)
  2016-05-26 16:35 ` [Qemu-devel] [RFC 09/10] cputlb: Query tlb_flush_page_all Alvise Rigo
@ 2016-05-26 16:35 ` Alvise Rigo
  2016-06-10 15:21 ` [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alex Bennée
  10 siblings, 0 replies; 36+ messages in thread
From: Alvise Rigo @ 2016-05-26 16:35 UTC (permalink / raw)
  To: mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, serge.fdrv, cota, peter.maydell, Alvise Rigo

If a VCPU returns EXCP_HALTED from guest code execution and in the
meantime receives a work item, it will go to sleep without processing
the job.

Before sleeping, check whether any work item has been queued.

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cpus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cpus.c b/cpus.c
index 7bc96e2..3d19a2e 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1477,7 +1477,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 
         handle_icount_deadline();
 
-        if (sleep) {
+        if (sleep && cpu->queued_work_first == NULL) {
             qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
         }
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()
  2016-05-26 16:35 ` [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}() Alvise Rigo
@ 2016-05-31 15:03   ` Pranith Kumar
  2016-06-02 16:21     ` alvise rigo
  2016-06-08  9:21   ` Alex Bennée
  1 sibling, 1 reply; 36+ messages in thread
From: Pranith Kumar @ 2016-05-31 15:03 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: MTTCG Devel, Alex Bennée, qemu-devel, jani.kokkonen,
	claudio.fontana, tech, KONRAD Frédéric, Paolo Bonzini,
	Richard Henderson, Sergey Fedorov, Emilio G. Cota, Peter Maydell

Hi Alvise,

On Thu, May 26, 2016 at 12:35 PM, Alvise Rigo
<a.rigo@virtualopensystems.com> wrote:
> Add tcg_exclusive_{lock,unlock}() functions that will be used for making
> the emulation of LL and SC instructions thread safe.
>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>

<snip>

> +__thread bool cpu_have_exclusive_lock;
> +QemuSpin cpu_exclusive_lock;
> +inline void tcg_exclusive_lock(void)
> +{
> +    if (!cpu_have_exclusive_lock) {
> +        qemu_spin_lock(&cpu_exclusive_lock);
> +        cpu_have_exclusive_lock = true;
> +    }
> +}
> +
> +inline void tcg_exclusive_unlock(void)
> +{
> +    if (cpu_have_exclusive_lock) {
> +        cpu_have_exclusive_lock = false;
> +        qemu_spin_unlock(&cpu_exclusive_lock);
> +    }
> +}

I think the unlock() here should have an assert if
cpu_have_exclusive_lock is false. From what I can see, a thread should
either take the exclusive lock or wait spinning for it in lock(). So
unlock() should always see cpu_have_exclusive_lock as true. It is a
good place to find locking bugs.
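
Something along these lines (just a sketch of the suggested variant):

    inline void tcg_exclusive_unlock(void)
    {
        assert(cpu_have_exclusive_lock);
        cpu_have_exclusive_lock = false;
        qemu_spin_unlock(&cpu_exclusive_lock);
    }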

-- 
Pranith

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()
  2016-05-31 15:03   ` Pranith Kumar
@ 2016-06-02 16:21     ` alvise rigo
  0 siblings, 0 replies; 36+ messages in thread
From: alvise rigo @ 2016-06-02 16:21 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: MTTCG Devel, Alex Bennée, qemu-devel, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Emilio G. Cota, Peter Maydell

Hi Pranith,

Thank you for the hint, I will keep this in mind for the next version.

Regards,
alvise

On Tue, May 31, 2016 at 5:03 PM, Pranith Kumar <bobby.prani@gmail.com> wrote:
> Hi Alvise,
>
> On Thu, May 26, 2016 at 12:35 PM, Alvise Rigo
> <a.rigo@virtualopensystems.com> wrote:
>> Add tcg_exclusive_{lock,unlock}() functions that will be used for making
>> the emulation of LL and SC instructions thread safe.
>>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>
> <snip>
>
>> +__thread bool cpu_have_exclusive_lock;
>> +QemuSpin cpu_exclusive_lock;
>> +inline void tcg_exclusive_lock(void)
>> +{
>> +    if (!cpu_have_exclusive_lock) {
>> +        qemu_spin_lock(&cpu_exclusive_lock);
>> +        cpu_have_exclusive_lock = true;
>> +    }
>> +}
>> +
>> +inline void tcg_exclusive_unlock(void)
>> +{
>> +    if (cpu_have_exclusive_lock) {
>> +        cpu_have_exclusive_lock = false;
>> +        qemu_spin_unlock(&cpu_exclusive_lock);
>> +    }
>> +}
>
> I think the unlock() here should have an assert if
> cpu_have_exclusive_lock is false. From what I can see, a thread should
> either take the exclusive lock or wait spinning for it in lock(). So
> unlock() should always see cpu_have_exclusive_lock as true. It is a
> good place to find locking bugs.
>
> --
> Pranith

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()
  2016-05-26 16:35 ` [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}() Alvise Rigo
  2016-05-31 15:03   ` Pranith Kumar
@ 2016-06-08  9:21   ` Alex Bennée
  2016-06-08 10:00     ` alvise rigo
  1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2016-06-08  9:21 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, qemu-devel, jani.kokkonen, claudio.fontana, tech,
	fred.konrad, pbonzini, rth, serge.fdrv, cota, peter.maydell


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Add tcg_exclusive_{lock,unlock}() functions that will be used for making
> the emulation of LL and SC instructions thread safe.

I wonder how much similarity there is to the mechanism linux-user ends
up using for its exclusive start/end?

>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cpus.c            |  2 ++
>  exec.c            | 18 ++++++++++++++++++
>  include/qom/cpu.h |  5 +++++
>  3 files changed, 25 insertions(+)
>
> diff --git a/cpus.c b/cpus.c
> index 860e7ba..b9ec903 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -961,6 +961,8 @@ void qemu_init_cpu_loop(void)
>      qemu_cond_init(&qemu_work_cond);
>      qemu_mutex_init(&qemu_global_mutex);
>
> +    qemu_spin_init(&cpu_exclusive_lock);
> +
>      qemu_thread_get_self(&io_thread);
>
>      safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 128);
> diff --git a/exec.c b/exec.c
> index a24b31c..1c72113 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -197,6 +197,24 @@ void cpu_exclusive_history_free(void)
>          g_free(excl_history.c_array);
>      }
>  }
> +
> +__thread bool cpu_have_exclusive_lock;
> +QemuSpin cpu_exclusive_lock;
> +inline void tcg_exclusive_lock(void)
> +{
> +    if (!cpu_have_exclusive_lock) {
> +        qemu_spin_lock(&cpu_exclusive_lock);
> +        cpu_have_exclusive_lock = true;
> +    }
> +}
> +
> +inline void tcg_exclusive_unlock(void)
> +{
> +    if (cpu_have_exclusive_lock) {
> +        cpu_have_exclusive_lock = false;
> +        qemu_spin_unlock(&cpu_exclusive_lock);
> +    }
> +}
>  #endif
>
>  #if !defined(CONFIG_USER_ONLY)
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 0f51870..019f06d 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -201,6 +201,11 @@ typedef struct CPUClass {
>      void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
>  } CPUClass;
>
> +/* Protect cpu_exclusive_* variable .*/
> +void tcg_exclusive_lock(void);
> +void tcg_exclusive_unlock(void);
> +extern QemuSpin cpu_exclusive_lock;
> +
>  #ifdef HOST_WORDS_BIGENDIAN
>  typedef struct icount_decr_u16 {
>      uint16_t high;


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()
  2016-06-08  9:21   ` Alex Bennée
@ 2016-06-08 10:00     ` alvise rigo
  2016-06-08 11:32       ` Peter Maydell
  2016-06-08 13:52       ` Alex Bennée
  0 siblings, 2 replies; 36+ messages in thread
From: alvise rigo @ 2016-06-08 10:00 UTC (permalink / raw)
  To: Alex Bennée
  Cc: MTTCG Devel, QEMU Developers, Jani Kokkonen, Claudio Fontana,
	VirtualOpenSystems Technical Team, KONRAD Frédéric,
	Paolo Bonzini, Richard Henderson, Sergey Fedorov, Emilio G. Cota,
	Peter Maydell

As far as I understand, linux-user uses a mutex to make the atomic
accesses exclusive with respect to other CPUs' atomic accesses. So
basically in the LDREX case it implements:
lock() -> access() -> unlock()
This patch series makes the atomic accesses exclusive with respect to
every memory access, which the softmmu makes possible. The access is now
something like:
lock() -> softmmu_access() -> unlock()
where "softmmu_access()" is not just a memory access, but includes the
manipulation of the EXCL bitmap and possible requests for TLB flushes.
So there are similarities, but they are pretty much confined to the
locking/unlocking of a spinlock/mutex.
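
To put the two patterns side by side (a rough sketch only; I am quoting
the linux-user helpers from memory, so treat those names as
illustrative):

    /* linux-user today: only the exclusive ops themselves are serialized */
    start_exclusive();
    emulate_ldrex_or_strex();
    end_exclusive();

    /* this series: the lock brackets the whole softmmu slow path */
    tcg_exclusive_lock();
    ret = helper_ld(env, addr, oi, retaddr);   /* the access itself */
    cpu_physical_memory_set_excl(hw_addr);     /* EXCL bitmap update */
    /* ... possibly tlb_flush_other() requests to the other vCPUs ... */
    tcg_exclusive_unlock();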

This made me think: how can linux-user properly work with upstream
TCG, for instance, in an absurd configuration like target-arm on an
ARM host?

alvise

On Wed, Jun 8, 2016 at 11:21 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Add tcg_exclusive_{lock,unlock}() functions that will be used for making
>> the emulation of LL and SC instructions thread safe.
>
> I wonder how much similarity there is to the mechanism linus-user ends
> up using for it's exclusive-start/end?
>
>>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  cpus.c            |  2 ++
>>  exec.c            | 18 ++++++++++++++++++
>>  include/qom/cpu.h |  5 +++++
>>  3 files changed, 25 insertions(+)
>>
>> diff --git a/cpus.c b/cpus.c
>> index 860e7ba..b9ec903 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -961,6 +961,8 @@ void qemu_init_cpu_loop(void)
>>      qemu_cond_init(&qemu_work_cond);
>>      qemu_mutex_init(&qemu_global_mutex);
>>
>> +    qemu_spin_init(&cpu_exclusive_lock);
>> +
>>      qemu_thread_get_self(&io_thread);
>>
>>      safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 128);
>> diff --git a/exec.c b/exec.c
>> index a24b31c..1c72113 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -197,6 +197,24 @@ void cpu_exclusive_history_free(void)
>>          g_free(excl_history.c_array);
>>      }
>>  }
>> +
>> +__thread bool cpu_have_exclusive_lock;
>> +QemuSpin cpu_exclusive_lock;
>> +inline void tcg_exclusive_lock(void)
>> +{
>> +    if (!cpu_have_exclusive_lock) {
>> +        qemu_spin_lock(&cpu_exclusive_lock);
>> +        cpu_have_exclusive_lock = true;
>> +    }
>> +}
>> +
>> +inline void tcg_exclusive_unlock(void)
>> +{
>> +    if (cpu_have_exclusive_lock) {
>> +        cpu_have_exclusive_lock = false;
>> +        qemu_spin_unlock(&cpu_exclusive_lock);
>> +    }
>> +}
>>  #endif
>>
>>  #if !defined(CONFIG_USER_ONLY)
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 0f51870..019f06d 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -201,6 +201,11 @@ typedef struct CPUClass {
>>      void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
>>  } CPUClass;
>>
>> +/* Protect cpu_exclusive_* variable .*/
>> +void tcg_exclusive_lock(void);
>> +void tcg_exclusive_unlock(void);
>> +extern QemuSpin cpu_exclusive_lock;
>> +
>>  #ifdef HOST_WORDS_BIGENDIAN
>>  typedef struct icount_decr_u16 {
>>      uint16_t high;
>
>
> --
> Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()
  2016-06-08 10:00     ` alvise rigo
@ 2016-06-08 11:32       ` Peter Maydell
  2016-06-08 13:52       ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Maydell @ 2016-06-08 11:32 UTC (permalink / raw)
  To: alvise rigo
  Cc: Alex Bennée, MTTCG Devel, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Emilio G. Cota

On 8 June 2016 at 11:00, alvise rigo <a.rigo@virtualopensystems.com> wrote:
> As far as I understand, linux-user uses a mutex to make the atomic
> accesses exclusive with respect to other CPU's atomic accesses. So
> basically in the LDREX case it implements:
> lock() -> access() -> unlock()
> This patch series makes the atomic accesses exclusive with respect to
> every memory access, this is allowed by the softmmu. The access is now
> something like:
> lock() -> softmmu_access() -> unlock()
> where "softmmu_access()" is not just a memory access, but includes a
> manipulation of the EXCL bitmap and possible queries of TLB flushes.
> So there are similarities, but are pretty much confined to the
> locking/unlocking of a spinlock/mutex.
>
> This made me think, how does linux-user can properly work with
> upstream TCG, for instance, in an absurd configuration like target-arm
> on ARM host?

linux-user's exclusives handling is "broken but happens to sort of
work most of the time". Fixing this and bringing it into line with
how we want to handle exclusives in the multithreaded system emulation
case is one of the things I was hoping would come out of the MTTCG
work...

thanks
-- PMM

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()
  2016-06-08 10:00     ` alvise rigo
  2016-06-08 11:32       ` Peter Maydell
@ 2016-06-08 13:52       ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2016-06-08 13:52 UTC (permalink / raw)
  To: alvise rigo
  Cc: MTTCG Devel, QEMU Developers, Jani Kokkonen, Claudio Fontana,
	VirtualOpenSystems Technical Team, KONRAD Frédéric,
	Paolo Bonzini, Richard Henderson, Sergey Fedorov, Emilio G. Cota,
	Peter Maydell


alvise rigo <a.rigo@virtualopensystems.com> writes:

> As far as I understand, linux-user uses a mutex to make the atomic
> accesses exclusive with respect to other CPU's atomic accesses. So
> basically in the LDREX case it implements:
> lock() -> access() -> unlock()
> This patch series makes the atomic accesses exclusive with respect to
> every memory access, this is allowed by the softmmu. The access is now
> something like:
> lock() -> softmmu_access() -> unlock()
> where "softmmu_access()" is not just a memory access, but includes a
> manipulation of the EXCL bitmap and possible queries of TLB flushes.
> So there are similarities, but are pretty much confined to the
> locking/unlocking of a spinlock/mutex.

But couldn't this be achieved with the various other solutions for
keeping the rest of the system quiescent while we do something tricky?

> This made me think, how does linux-user can properly work with
> upstream TCG, for instance, in an absurd configuration like target-arm
> on ARM host?
>
> alvise
>
> On Wed, Jun 8, 2016 at 11:21 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>>
>>> Add tcg_exclusive_{lock,unlock}() functions that will be used for making
>>> the emulation of LL and SC instructions thread safe.
>>
>> I wonder how much similarity there is to the mechanism linus-user ends
>> up using for it's exclusive-start/end?
>>
>>>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> ---
>>>  cpus.c            |  2 ++
>>>  exec.c            | 18 ++++++++++++++++++
>>>  include/qom/cpu.h |  5 +++++
>>>  3 files changed, 25 insertions(+)
>>>
>>> diff --git a/cpus.c b/cpus.c
>>> index 860e7ba..b9ec903 100644
>>> --- a/cpus.c
>>> +++ b/cpus.c
>>> @@ -961,6 +961,8 @@ void qemu_init_cpu_loop(void)
>>>      qemu_cond_init(&qemu_work_cond);
>>>      qemu_mutex_init(&qemu_global_mutex);
>>>
>>> +    qemu_spin_init(&cpu_exclusive_lock);
>>> +
>>>      qemu_thread_get_self(&io_thread);
>>>
>>>      safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 128);
>>> diff --git a/exec.c b/exec.c
>>> index a24b31c..1c72113 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -197,6 +197,24 @@ void cpu_exclusive_history_free(void)
>>>          g_free(excl_history.c_array);
>>>      }
>>>  }
>>> +
>>> +__thread bool cpu_have_exclusive_lock;
>>> +QemuSpin cpu_exclusive_lock;
>>> +inline void tcg_exclusive_lock(void)
>>> +{
>>> +    if (!cpu_have_exclusive_lock) {
>>> +        qemu_spin_lock(&cpu_exclusive_lock);
>>> +        cpu_have_exclusive_lock = true;
>>> +    }
>>> +}
>>> +
>>> +inline void tcg_exclusive_unlock(void)
>>> +{
>>> +    if (cpu_have_exclusive_lock) {
>>> +        cpu_have_exclusive_lock = false;
>>> +        qemu_spin_unlock(&cpu_exclusive_lock);
>>> +    }
>>> +}
>>>  #endif
>>>
>>>  #if !defined(CONFIG_USER_ONLY)
>>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>>> index 0f51870..019f06d 100644
>>> --- a/include/qom/cpu.h
>>> +++ b/include/qom/cpu.h
>>> @@ -201,6 +201,11 @@ typedef struct CPUClass {
>>>      void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
>>>  } CPUClass;
>>>
>>> +/* Protect cpu_exclusive_* variable .*/
>>> +void tcg_exclusive_lock(void);
>>> +void tcg_exclusive_unlock(void);
>>> +extern QemuSpin cpu_exclusive_lock;
>>> +
>>>  #ifdef HOST_WORDS_BIGENDIAN
>>>  typedef struct icount_decr_u16 {
>>>      uint16_t high;
>>
>>
>> --
>> Alex Bennée


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()
  2016-05-26 16:35 ` [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu() Alvise Rigo
@ 2016-06-08 13:54   ` Alex Bennée
  2016-06-08 14:10     ` alvise rigo
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2016-06-08 13:54 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, qemu-devel, jani.kokkonen, claudio.fontana, tech,
	fred.konrad, pbonzini, rth, serge.fdrv, cota, peter.maydell


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Introduce a new function that allows the calling VCPU to add a work item
> to another VCPU (aka the target VCPU). This new function differs from
> async_run_on_cpu() since it makes the calling VCPU wait for the target
> VCPU to finish the work item. The mechanism makes use of the halt_cond
> to wait and, if needed, to process pending work items in the meantime.

Isn't this exactly what would happen if you used run_on_cpu()? That will
stall the current vCPU and busy-wait until the work item is completed.

>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cpus.c            | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  include/qom/cpu.h | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 73 insertions(+), 2 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index b9ec903..7bc96e2 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -89,7 +89,7 @@ static bool cpu_thread_is_idle(CPUState *cpu)
>      if (cpu->stop || cpu->queued_work_first) {
>          return false;
>      }
> -    if (cpu_is_stopped(cpu)) {
> +    if (cpu_is_stopped(cpu) || async_waiting_for_work(cpu)) {
>          return true;
>      }
>      if (!cpu->halted || cpu_has_work(cpu) ||
> @@ -1012,6 +1012,7 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>      wi->func = func;
>      wi->data = data;
>      wi->free = true;
> +    wi->wcpu = NULL;
>
>      qemu_mutex_lock(&cpu->work_mutex);
>      if (cpu->queued_work_first == NULL) {
> @@ -1027,6 +1028,40 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>      qemu_cpu_kick(cpu);
>  }
>
> +void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu, run_on_cpu_func func,
> +                                                                    void *data)
> +{
> +    struct qemu_work_item *wwi;
> +
> +    assert(wcpu != cpu);
> +
> +    wwi = g_malloc0(sizeof(struct qemu_work_item));
> +    wwi->func = func;
> +    wwi->data = data;
> +    wwi->free = true;
> +    wwi->wcpu = wcpu;
> +
> +    /* Increase the number of pending work items */
> +    atomic_inc(&wcpu->pending_work_items);
> +
> +    qemu_mutex_lock(&cpu->work_mutex);
> +    /* Add the waiting work items at the beginning to free as soon as possible
> +     * the waiting CPU. */
> +    if (cpu->queued_work_first == NULL) {
> +        cpu->queued_work_last = wwi;
> +    } else {
> +        wwi->next = cpu->queued_work_first;
> +    }
> +    cpu->queued_work_first = wwi;
> +    wwi->done = false;
> +    qemu_mutex_unlock(&cpu->work_mutex);
> +
> +    qemu_cpu_kick(cpu);
> +
> +    /* In order to wait, @wcpu has to exit the CPU loop */
> +    cpu_exit(wcpu);
> +}
> +
>  /*
>   * Safe work interface
>   *
> @@ -1120,6 +1155,10 @@ static void flush_queued_work(CPUState *cpu)
>          qemu_mutex_unlock(&cpu->work_mutex);
>          wi->func(cpu, wi->data);
>          qemu_mutex_lock(&cpu->work_mutex);
> +        if (wi->wcpu != NULL) {
> +            atomic_dec(&wi->wcpu->pending_work_items);
> +            qemu_cond_broadcast(wi->wcpu->halt_cond);
> +        }
>          if (wi->free) {
>              g_free(wi);
>          } else {
> @@ -1406,7 +1445,8 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>      while (1) {
>          bool sleep = false;
>
> -        if (cpu_can_run(cpu) && !async_safe_work_pending()) {
> +        if (cpu_can_run(cpu) && !async_safe_work_pending()
> +                             && !async_waiting_for_work(cpu)) {
>              int r;
>
>              atomic_inc(&tcg_scheduled_cpus);
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 019f06d..7be82ed 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -259,6 +259,8 @@ struct qemu_work_item {
>      void *data;
>      int done;
>      bool free;
> +    /* CPU waiting for this work item to finish. If NULL, no CPU is waiting. */
> +    CPUState *wcpu;
>  };
>
>  /**
> @@ -303,6 +305,7 @@ struct qemu_work_item {
>   * @kvm_fd: vCPU file descriptor for KVM.
>   * @work_mutex: Lock to prevent multiple access to queued_work_*.
>   * @queued_work_first: First asynchronous work pending.
> + * @pending_work_items: Work items for which the CPU needs to wait completion.
>   *
>   * State of one CPU core or thread.
>   */
> @@ -337,6 +340,7 @@ struct CPUState {
>
>      QemuMutex work_mutex;
>      struct qemu_work_item *queued_work_first, *queued_work_last;
> +    int pending_work_items;
>
>      CPUAddressSpace *cpu_ases;
>      int num_ases;
> @@ -398,6 +402,9 @@ struct CPUState {
>       * by a stcond (see softmmu_template.h). */
>      bool excl_succeeded;
>
> +    /* True if some CPU requested a TLB flush for this CPU. */
> +    bool pending_tlb_flush;
> +
>      /* Note that this is accessed at the start of every TB via a negative
>         offset from AREG0.  Leave this field at the end so as to make the
>         (absolute value) offset as small as possible.  This reduces code
> @@ -680,6 +687,19 @@ void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
>  void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
>
>  /**
> + * async_wait_run_on_cpu:
> + * @cpu: The vCPU to run on.
> + * @wpu: The vCPU submitting the work.
> + * @func: The function to be executed.
> + * @data: Data to pass to the function.
> + *
> + * Schedules the function @func for execution on the vCPU @cpu asynchronously.
> + * The vCPU @wcpu will wait for @cpu to finish the job.
> + */
> +void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu, run_on_cpu_func func,
> +                                                                   void *data);
> +
> +/**
>   * async_safe_run_on_cpu:
>   * @cpu: The vCPU to run on.
>   * @func: The function to be executed.
> @@ -699,6 +719,17 @@ void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
>  bool async_safe_work_pending(void);
>
>  /**
> + * async_waiting_for_work:
> + *
> + * Check whether there are work items for which @cpu is waiting completion.
> + * Returns: @true if work items are pending for completion, @false otherwise.
> + */
> +static inline bool async_waiting_for_work(CPUState *cpu)
> +{
> +    return atomic_mb_read(&cpu->pending_work_items) != 0;
> +}
> +
> +/**
>   * qemu_get_cpu:
>   * @index: The CPUState@cpu_index value of the CPU to obtain.
>   *


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()
  2016-06-08 13:54   ` Alex Bennée
@ 2016-06-08 14:10     ` alvise rigo
  2016-06-08 14:53       ` Sergey Fedorov
  0 siblings, 1 reply; 36+ messages in thread
From: alvise rigo @ 2016-06-08 14:10 UTC (permalink / raw)
  To: Alex Bennée
  Cc: MTTCG Devel, QEMU Developers, Jani Kokkonen, Claudio Fontana,
	VirtualOpenSystems Technical Team, KONRAD Frédéric,
	Paolo Bonzini, Richard Henderson, Sergey Fedorov, Emilio G. Cota,
	Peter Maydell

Using run_on_cpu() we might deadlock QEMU if other vCPUs are waiting
for the current vCPU. We need to exit from the vCPU loop in order to
avoid this.
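
A minimal sketch of the scenario I am worried about (simplified, two
vCPUs targeting each other):

    /* vCPU0, from a helper:             vCPU1, from a helper:
     *   run_on_cpu(cpu1, f, ...);         run_on_cpu(cpu0, g, ...);
     *   -> blocks until cpu1 runs f       -> blocks until cpu0 runs g
     *
     * Neither vCPU ever gets back to flush_queued_work(), so both wait
     * forever.  With async_wait_run_on_cpu() the caller instead calls
     * cpu_exit() and can service its own queue while it waits.
     */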

Regards,
alvise

On Wed, Jun 8, 2016 at 3:54 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Introduce a new function that allows the calling VCPU to add a work item
>> to another VCPU (aka target VCPU). This new function differs from
>> async_run_on_cpu() since it makes the calling VCPU waiting for the target
>> VCPU to finish the work item. The mechanism makes use of the halt_cond
>> to wait and in case process pending work items.
>
> Isn't this exactly what would happen if you use run_on_cpu(). That will
> stall the current vCPU and busy wait until the work item is completed.
>
>>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  cpus.c            | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>  include/qom/cpu.h | 31 +++++++++++++++++++++++++++++++
>>  2 files changed, 73 insertions(+), 2 deletions(-)
>>
>> diff --git a/cpus.c b/cpus.c
>> index b9ec903..7bc96e2 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -89,7 +89,7 @@ static bool cpu_thread_is_idle(CPUState *cpu)
>>      if (cpu->stop || cpu->queued_work_first) {
>>          return false;
>>      }
>> -    if (cpu_is_stopped(cpu)) {
>> +    if (cpu_is_stopped(cpu) || async_waiting_for_work(cpu)) {
>>          return true;
>>      }
>>      if (!cpu->halted || cpu_has_work(cpu) ||
>> @@ -1012,6 +1012,7 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>>      wi->func = func;
>>      wi->data = data;
>>      wi->free = true;
>> +    wi->wcpu = NULL;
>>
>>      qemu_mutex_lock(&cpu->work_mutex);
>>      if (cpu->queued_work_first == NULL) {
>> @@ -1027,6 +1028,40 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>>      qemu_cpu_kick(cpu);
>>  }
>>
>> +void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu, run_on_cpu_func func,
>> +                                                                    void *data)
>> +{
>> +    struct qemu_work_item *wwi;
>> +
>> +    assert(wcpu != cpu);
>> +
>> +    wwi = g_malloc0(sizeof(struct qemu_work_item));
>> +    wwi->func = func;
>> +    wwi->data = data;
>> +    wwi->free = true;
>> +    wwi->wcpu = wcpu;
>> +
>> +    /* Increase the number of pending work items */
>> +    atomic_inc(&wcpu->pending_work_items);
>> +
>> +    qemu_mutex_lock(&cpu->work_mutex);
>> +    /* Add the waiting work items at the beginning to free as soon as possible
>> +     * the waiting CPU. */
>> +    if (cpu->queued_work_first == NULL) {
>> +        cpu->queued_work_last = wwi;
>> +    } else {
>> +        wwi->next = cpu->queued_work_first;
>> +    }
>> +    cpu->queued_work_first = wwi;
>> +    wwi->done = false;
>> +    qemu_mutex_unlock(&cpu->work_mutex);
>> +
>> +    qemu_cpu_kick(cpu);
>> +
>> +    /* In order to wait, @wcpu has to exit the CPU loop */
>> +    cpu_exit(wcpu);
>> +}
>> +
>>  /*
>>   * Safe work interface
>>   *
>> @@ -1120,6 +1155,10 @@ static void flush_queued_work(CPUState *cpu)
>>          qemu_mutex_unlock(&cpu->work_mutex);
>>          wi->func(cpu, wi->data);
>>          qemu_mutex_lock(&cpu->work_mutex);
>> +        if (wi->wcpu != NULL) {
>> +            atomic_dec(&wi->wcpu->pending_work_items);
>> +            qemu_cond_broadcast(wi->wcpu->halt_cond);
>> +        }
>>          if (wi->free) {
>>              g_free(wi);
>>          } else {
>> @@ -1406,7 +1445,8 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>>      while (1) {
>>          bool sleep = false;
>>
>> -        if (cpu_can_run(cpu) && !async_safe_work_pending()) {
>> +        if (cpu_can_run(cpu) && !async_safe_work_pending()
>> +                             && !async_waiting_for_work(cpu)) {
>>              int r;
>>
>>              atomic_inc(&tcg_scheduled_cpus);
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 019f06d..7be82ed 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -259,6 +259,8 @@ struct qemu_work_item {
>>      void *data;
>>      int done;
>>      bool free;
>> +    /* CPU waiting for this work item to finish. If NULL, no CPU is waiting. */
>> +    CPUState *wcpu;
>>  };
>>
>>  /**
>> @@ -303,6 +305,7 @@ struct qemu_work_item {
>>   * @kvm_fd: vCPU file descriptor for KVM.
>>   * @work_mutex: Lock to prevent multiple access to queued_work_*.
>>   * @queued_work_first: First asynchronous work pending.
>> + * @pending_work_items: Work items for which the CPU needs to wait completion.
>>   *
>>   * State of one CPU core or thread.
>>   */
>> @@ -337,6 +340,7 @@ struct CPUState {
>>
>>      QemuMutex work_mutex;
>>      struct qemu_work_item *queued_work_first, *queued_work_last;
>> +    int pending_work_items;
>>
>>      CPUAddressSpace *cpu_ases;
>>      int num_ases;
>> @@ -398,6 +402,9 @@ struct CPUState {
>>       * by a stcond (see softmmu_template.h). */
>>      bool excl_succeeded;
>>
>> +    /* True if some CPU requested a TLB flush for this CPU. */
>> +    bool pending_tlb_flush;
>> +
>>      /* Note that this is accessed at the start of every TB via a negative
>>         offset from AREG0.  Leave this field at the end so as to make the
>>         (absolute value) offset as small as possible.  This reduces code
>> @@ -680,6 +687,19 @@ void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
>>  void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
>>
>>  /**
>> + * async_wait_run_on_cpu:
>> + * @cpu: The vCPU to run on.
>> + * @wpu: The vCPU submitting the work.
>> + * @func: The function to be executed.
>> + * @data: Data to pass to the function.
>> + *
>> + * Schedules the function @func for execution on the vCPU @cpu asynchronously.
>> + * The vCPU @wcpu will wait for @cpu to finish the job.
>> + */
>> +void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu, run_on_cpu_func func,
>> +                                                                   void *data);
>> +
>> +/**
>>   * async_safe_run_on_cpu:
>>   * @cpu: The vCPU to run on.
>>   * @func: The function to be executed.
>> @@ -699,6 +719,17 @@ void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
>>  bool async_safe_work_pending(void);
>>
>>  /**
>> + * async_waiting_for_work:
>> + *
>> + * Check whether there are work items for which @cpu is waiting completion.
>> + * Returns: @true if work items are pending for completion, @false otherwise.
>> + */
>> +static inline bool async_waiting_for_work(CPUState *cpu)
>> +{
>> +    return atomic_mb_read(&cpu->pending_work_items) != 0;
>> +}
>> +
>> +/**
>>   * qemu_get_cpu:
>>   * @index: The CPUState@cpu_index value of the CPU to obtain.
>>   *
>
>
> --
> Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()
  2016-06-08 14:10     ` alvise rigo
@ 2016-06-08 14:53       ` Sergey Fedorov
  2016-06-08 15:20         ` Alex Bennée
  0 siblings, 1 reply; 36+ messages in thread
From: Sergey Fedorov @ 2016-06-08 14:53 UTC (permalink / raw)
  To: alvise rigo, Alex Bennée
  Cc: MTTCG Devel, QEMU Developers, Jani Kokkonen, Claudio Fontana,
	VirtualOpenSystems Technical Team, KONRAD Frédéric,
	Paolo Bonzini, Richard Henderson, Emilio G. Cota, Peter Maydell

On 08/06/16 17:10, alvise rigo wrote:
> Using run_on_cpu() we might deadlock QEMU if other vCPUs are waiting
> for the current vCPU. We need to exit from the vCPU loop in order to
> avoid this.

I see, we could deadlock indeed. Alternatively, we may want to fix
run_on_cpu() so that it does not wait for completion by itself when
called from the CPU loop.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()
  2016-06-08 14:53       ` Sergey Fedorov
@ 2016-06-08 15:20         ` Alex Bennée
  2016-06-08 16:24           ` alvise rigo
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2016-06-08 15:20 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: alvise rigo, MTTCG Devel, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell


Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 08/06/16 17:10, alvise rigo wrote:
>> Using run_on_cpu() we might deadlock QEMU if other vCPUs are waiting
>> for the current vCPU. We need to exit from the vCPU loop in order to
>> avoid this.
>
> I see, we could deadlock indeed. Alternatively, we may want fix
> run_on_cpu() to avoid waiting for completion by itself when called from
> CPU loop.

async_safe_run_on_cpu can't deadlock, as all vCPUs are suspended (or
waiting) for the work to complete. The tasks are run in strict order,
so if you queue async tasks for the other vCPUs first you can ensure
everything is in the state you want by the time you finally service the
calling vCPU.
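
For instance (do_flush()/finish_op() are hypothetical helpers, just to
illustrate the ordering):

    CPUState *other;

    /* Queue the other vCPUs first ... */
    CPU_FOREACH(other) {
        if (other != cpu) {
            async_safe_run_on_cpu(other, do_flush, NULL);
        }
    }
    /* ... then the calling vCPU: by the time finish_op() runs, all the
     * work items queued above have already been serviced. */
    async_safe_run_on_cpu(cpu, finish_op, NULL);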

>
> Kind regards,
> Sergey


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()
  2016-06-08 15:20         ` Alex Bennée
@ 2016-06-08 16:24           ` alvise rigo
  2016-06-13  9:26             ` Alex Bennée
  0 siblings, 1 reply; 36+ messages in thread
From: alvise rigo @ 2016-06-08 16:24 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Sergey Fedorov, MTTCG Devel, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell

I think that async_safe_run_on_cpu() does a different thing: it
queues a job on the target vCPU and wants all the others to "observe"
the submitted task. However, we are only certain that the target vCPU
has observed the task; the others might still be running guest code.

alvise

On Wed, Jun 8, 2016 at 5:20 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 08/06/16 17:10, alvise rigo wrote:
>>> Using run_on_cpu() we might deadlock QEMU if other vCPUs are waiting
>>> for the current vCPU. We need to exit from the vCPU loop in order to
>>> avoid this.
>>
>> I see, we could deadlock indeed. Alternatively, we may want fix
>> run_on_cpu() to avoid waiting for completion by itself when called from
>> CPU loop.
>
> async_safe_run_on_cpu can't deadlock as all vCPUs are suspended (or
> waiting) for the work to complete. The tasks are run in strict order so
> if you queued async tasks for other vCPUs first you could ensure
> everything is in the state you want it when you finally service the
> calling vCPU.
>
>>
>> Kind regards,
>> Sergey
>
>
> --
> Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns
  2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
                   ` (9 preceding siblings ...)
  2016-05-26 16:35 ` [Qemu-devel] [RFC 10/10] cpus: Do not sleep if some work item is pending Alvise Rigo
@ 2016-06-10 15:21 ` Alex Bennée
  2016-06-10 15:30   ` alvise rigo
  10 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2016-06-10 15:21 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, qemu-devel, jani.kokkonen, claudio.fontana, tech,
	fred.konrad, pbonzini, rth, serge.fdrv, cota, peter.maydell


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Hi,
>
> This series ports the latest iteration of the LL/SC work on top of the
> latest MTTCG reference branch posted recently by Alex.
>
> These patches apply on top of the following series:
>
> - [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86
>   https://github.com/stsquad/qemu/tree/mttcg/enable-mttcg-for-armv7-v1
> - [RFC v8 00/14] Slow-path for atomic instruction translation
>   https://git.virtualopensystems.com/dev/qemu-mt/tree/\
>   slowpath-for-atomic-v8-no-mttcg - only minor changes have been necessary
> - Few recent patches from Emilio regarding the spinlock implementation
>
> Overall, these patches allow the LL/SC infrastructure to work in multi-threaded
> mode (patches 01-02-04) and make TLB flushes to other VCPUs safe.
>
> Patch 03 introduces a new API to submit a work item to a VCPU and wait for its
> completion. This API is used to query TLB flushes that result from the
> emulation of some ARM instructions. Patches 07, 08 and 09 modify the current
> tlb_flush_* functions to use the new API.  Patch 10 fixes a rare hang that I
> was experiencing with this branch.
>
> The whole work can be fetched from the following repository:
> git@git.virtualopensystems.com:dev/qemu-mt.git
> at the branch "slowpath-for-atomic-v8-mttcg".

Hmm this branch has build failures for linux-user and other
architectures. Is this the latest one?

>
> Alvise Rigo (10):
>   exec: Introduce tcg_exclusive_{lock,unlock}()
>   softmmu_llsc_template.h: Move to multi-threading
>   cpus: Introduce async_wait_run_on_cpu()
>   cputlb: Introduce tlb_flush_other()
>   target-arm: End TB after ldrex instruction
>   cputlb: Add tlb_tables_flush_bitmap()
>   cputlb: Query tlb_flush_by_mmuidx
>   cputlb: Query tlb_flush_page_by_mmuidx
>   cputlb: Query tlb_flush_page_all
>   cpus: Do not sleep if some work item is pending
>
>  cpus.c                     |  48 ++++++++++-
>  cputlb.c                   | 202 ++++++++++++++++++++++++++++++++++-----------
>  exec.c                     |  18 ++++
>  include/exec/exec-all.h    |  13 +--
>  include/qom/cpu.h          |  36 ++++++++
>  softmmu_llsc_template.h    |  13 ++-
>  softmmu_template.h         |   6 ++
>  target-arm/helper.c        |  79 +++++++++---------
>  target-arm/op_helper.c     |   6 ++
>  target-arm/translate-a64.c |   2 +
>  target-arm/translate.c     |   2 +
>  11 files changed, 327 insertions(+), 98 deletions(-)


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-05-26 16:35 ` [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading Alvise Rigo
@ 2016-06-10 15:21   ` Sergey Fedorov
  2016-06-10 15:53     ` alvise rigo
  2016-06-10 16:15     ` Alex Bennée
  0 siblings, 2 replies; 36+ messages in thread
From: Sergey Fedorov @ 2016-06-10 15:21 UTC (permalink / raw)
  To: Alvise Rigo, mttcg, alex.bennee
  Cc: qemu-devel, jani.kokkonen, claudio.fontana, tech, fred.konrad,
	pbonzini, rth, cota, peter.maydell

On 26/05/16 19:35, Alvise Rigo wrote:
> Using tcg_exclusive_{lock,unlock}(), make the emulation of
> LoadLink/StoreConditional thread safe.
>
> During an LL access, this lock protects the load access itself, the
> update of the exclusive history and the update of the VCPU's protected
> range.  In a SC access, the lock protects the store access itself, the
> possible reset of other VCPUs' protected range and the reset of the
> exclusive context of calling VCPU.
>
> The lock is also taken when a normal store happens to access an
> exclusive page to reset other VCPUs' protected range in case of
> collision.

I think the key problem here is that the load in the LL helper can race
with a concurrent regular fast-path store. It's probably easier to
annotate the source here:

     1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
     2                                  TCGMemOpIdx oi, uintptr_t retaddr)
     3  {
     4      WORD_TYPE ret;
     5      int index;
     6      CPUState *this_cpu = ENV_GET_CPU(env);
     7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
     8      hwaddr hw_addr;
     9      unsigned mmu_idx = get_mmuidx(oi);

    10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);

    11      tcg_exclusive_lock();

    12      /* Use the proper load helper from cpu_ldst.h */
    13      ret = helper_ld(env, addr, oi, retaddr);

    14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
    15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
    16      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
    17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read & TLB_MMIO))) {
    18          /* If all the vCPUs have the EXCL bit set for this page there is no need
    19           * to request any flush. */
    20          if (cpu_physical_memory_set_excl(hw_addr)) {
    21              CPUState *cpu;

    22              excl_history_put_addr(hw_addr);
    23              CPU_FOREACH(cpu) {
    24                  if (this_cpu != cpu) {
    25                      tlb_flush_other(this_cpu, cpu, 1);
    26                  }
    27              }
    28          }
    29          /* For this vCPU, just update the TLB entry, no need to flush. */
    30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
    31      } else {
    32          /* Set a pending exclusive access in the MemoryRegion */
    33          MemoryRegion *mr = iotlb_to_region(this_cpu,
    34                                             env->iotlb[mmu_idx][index].addr,
    35                                             env->iotlb[mmu_idx][index].attrs);
    36          mr->pending_excl_access = true;
    37      }

    38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);

    39      tcg_exclusive_unlock();

    40      /* From now on we are in LL/SC context */
    41      this_cpu->ll_sc_context = true;

    42      return ret;
    43  }


The exclusive lock at line 11 doesn't help if a concurrent fast-path
store to this address occurs after we have finished the load at line 13
but before the TLB is flushed as a result of line 25. If we reorder the
load to happen after the TLB flush request, we still must be sure that
the flush is complete before we can do the load safely.
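
Spelling out the interleaving I have in mind (vCPU0 does the LL, vCPU1 a
regular store to the same address; numbers refer to the listing above):

    vCPU0 (LL helper, lock held)      vCPU1 (regular fast-path store)
    ----------------------------      -------------------------------
    11  tcg_exclusive_lock()
    13  ret = helper_ld(...)
                                      store goes through the fast path:
                                      vCPU1's TLB entry has no TLB_EXCL
                                      yet, so vCPU0's protected range is
                                      not reset
    25  tlb_flush_other(vCPU1)        (too late)

    A later SC on vCPU0 can then succeed even though the location changed
    after the load at line 13.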

>
> Moreover, adapt target-arm to also cope with the new multi-threaded
> execution.
>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  softmmu_llsc_template.h | 11 +++++++++--
>  softmmu_template.h      |  6 ++++++
>  target-arm/op_helper.c  |  6 ++++++
>  3 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
> index 2c4a494..d3810c0 100644
> --- a/softmmu_llsc_template.h
> +++ b/softmmu_llsc_template.h
> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>      hwaddr hw_addr;
>      unsigned mmu_idx = get_mmuidx(oi);
>  
> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +
> +    tcg_exclusive_lock();
> +
>      /* Use the proper load helper from cpu_ldst.h */
>      ret = helper_ld(env, addr, oi, retaddr);
>  
> -    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> -
>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
> @@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>  
>      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>  
> +    tcg_exclusive_unlock();
> +
>      /* From now on we are in LL/SC context */
>      this_cpu->ll_sc_context = true;
>  
> @@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>           * access as one made by the store conditional wrapper. If the store
>           * conditional does not succeed, the value will be set to 0.*/
>          cpu->excl_succeeded = true;
> +
> +        tcg_exclusive_lock();
>          helper_st(env, addr, val, oi, retaddr);
>  
>          if (cpu->excl_succeeded) {
> @@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>  
>      /* Unset LL/SC context */
>      cc->cpu_reset_excl_context(cpu);
> +    tcg_exclusive_unlock();
>  
>      return ret;
>  }
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 76fe37e..9363a7b 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
>          }
>      }
>  
> +    /* Take the lock in case we are not coming from a SC */
> +    tcg_exclusive_lock();
> +
>      smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
>                                get_mmuidx(oi), index, retaddr);
>  
>      reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
>  
> +    tcg_exclusive_unlock();
> +
>      return;
>  }
>  
> @@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>      /* Handle an IO access or exclusive access.  */
>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>          if (tlb_addr & TLB_EXCL) {
> +
>              smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
>                                         retaddr);
>              return;
> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
> index e22afc5..19ea52d 100644
> --- a/target-arm/op_helper.c
> +++ b/target-arm/op_helper.c
> @@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
>      cs->exception_index = excp;
>      env->exception.syndrome = syndrome;
>      env->exception.target_el = target_el;
> +    tcg_exclusive_lock();
>      cc->cpu_reset_excl_context(cs);
> +    tcg_exclusive_unlock();
>      cpu_loop_exit(cs);
>  }
>  
> @@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
>      CPUState *cs = ENV_GET_CPU(env);
>      CPUClass *cc = CPU_GET_CLASS(cs);
>  
> +    tcg_exclusive_lock();
>      cc->cpu_reset_excl_context(cs);
> +    tcg_exclusive_unlock();
>  }
>  
>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
> @@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
>  
>      aarch64_save_sp(env, cur_el);
>  
> +    tcg_exclusive_lock();
>      cc->cpu_reset_excl_context(cs);
> +    tcg_exclusive_unlock();
>  
>      /* We must squash the PSTATE.SS bit to zero unless both of the
>       * following hold:

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns
  2016-06-10 15:21 ` [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alex Bennée
@ 2016-06-10 15:30   ` alvise rigo
  0 siblings, 0 replies; 36+ messages in thread
From: alvise rigo @ 2016-06-10 15:30 UTC (permalink / raw)
  To: Alex Bennée
  Cc: MTTCG Devel, QEMU Developers, Jani Kokkonen, Claudio Fontana,
	VirtualOpenSystems Technical Team, KONRAD Frédéric,
	Paolo Bonzini, Richard Henderson, Sergey Fedorov, Emilio G. Cota,
	Peter Maydell

I might have broken something while rebasing on top of
enable-mttcg-for-armv7-v1.
I will sort this problem out.

Thank you,
alvise

On Fri, Jun 10, 2016 at 5:21 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Hi,
>>
>> This series ports the latest iteration of the LL/SC work on top of the
>> latest MTTCG reference branch posted recently by Alex.
>>
>> These patches apply on top of the following series:
>>
>> - [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86
>>   https://github.com/stsquad/qemu/tree/mttcg/enable-mttcg-for-armv7-v1
>> - [RFC v8 00/14] Slow-path for atomic instruction translation
>>   https://git.virtualopensystems.com/dev/qemu-mt/tree/\
>>   slowpath-for-atomic-v8-no-mttcg - only minor changes have been necessary
>> - Few recent patches from Emilio regarding the spinlock implementation
>>
>> Overall, these patches allow the LL/SC infrastructure to work in multi-threaded
>> mode (patches 01-02-04) and make TLB flushes to other VCPUs safe.
>>
>> Patch 03 introduces a new API to submit a work item to a VCPU and wait for its
>> completion. This API is used to query TLB flushes that result from the
>> emulation of some ARM instructions. Patches 07, 08 and 09 modify the current
>> tlb_flush_* functions to use the new API.  Patch 10 fixes a rare hang that I
>> was experiencing with this branch.
>>
>> The whole work can be fetched from the following repository:
>> git@git.virtualopensystems.com:dev/qemu-mt.git
>> at the branch "slowpath-for-atomic-v8-mttcg".
>
> Hmm this branch has build failures for linux-user and other
> architectures. Is this the latest one?
>
>>
>> Alvise Rigo (10):
>>   exec: Introduce tcg_exclusive_{lock,unlock}()
>>   softmmu_llsc_template.h: Move to multi-threading
>>   cpus: Introduce async_wait_run_on_cpu()
>>   cputlb: Introduce tlb_flush_other()
>>   target-arm: End TB after ldrex instruction
>>   cputlb: Add tlb_tables_flush_bitmap()
>>   cputlb: Query tlb_flush_by_mmuidx
>>   cputlb: Query tlb_flush_page_by_mmuidx
>>   cputlb: Query tlb_flush_page_all
>>   cpus: Do not sleep if some work item is pending
>>
>>  cpus.c                     |  48 ++++++++++-
>>  cputlb.c                   | 202 ++++++++++++++++++++++++++++++++++-----------
>>  exec.c                     |  18 ++++
>>  include/exec/exec-all.h    |  13 +--
>>  include/qom/cpu.h          |  36 ++++++++
>>  softmmu_llsc_template.h    |  13 ++-
>>  softmmu_template.h         |   6 ++
>>  target-arm/helper.c        |  79 +++++++++---------
>>  target-arm/op_helper.c     |   6 ++
>>  target-arm/translate-a64.c |   2 +
>>  target-arm/translate.c     |   2 +
>>  11 files changed, 327 insertions(+), 98 deletions(-)
>
>
> --
> Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-10 15:21   ` Sergey Fedorov
@ 2016-06-10 15:53     ` alvise rigo
  2016-06-10 16:00       ` Sergey Fedorov
  2016-06-14 12:00       ` Alex Bennée
  2016-06-10 16:15     ` Alex Bennée
  1 sibling, 2 replies; 36+ messages in thread
From: alvise rigo @ 2016-06-10 15:53 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: MTTCG Devel, Alex Bennée, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell

On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> On 26/05/16 19:35, Alvise Rigo wrote:
>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>> LoadLink/StoreConditional thread safe.
>>
>> During an LL access, this lock protects the load access itself, the
>> update of the exclusive history and the update of the VCPU's protected
>> range.  In a SC access, the lock protects the store access itself, the
>> possible reset of other VCPUs' protected range and the reset of the
>> exclusive context of calling VCPU.
>>
>> The lock is also taken when a normal store happens to access an
>> exclusive page to reset other VCPUs' protected range in case of
>> collision.
>
> I think the key problem here is that the load in LL helper can race with
> a concurrent regular fast-path store. It's probably easier to annotate
> the source here:
>
>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>      3  {
>      4      WORD_TYPE ret;
>      5      int index;
>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>      8      hwaddr hw_addr;
>      9      unsigned mmu_idx = get_mmuidx(oi);
>
>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>
>     11      tcg_exclusive_lock();
>
>     12      /* Use the proper load helper from cpu_ldst.h */
>     13      ret = helper_ld(env, addr, oi, retaddr);
>
>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
> + xlat)
>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>     16      hw_addr = (env->iotlb[mmu_idx][index].addr &
> TARGET_PAGE_MASK) + addr;
>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
> TLB_MMIO))) {
>     18          /* If all the vCPUs have the EXCL bit set for this page
> there is no need
>     19           * to request any flush. */
>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>     21              CPUState *cpu;
>
>     22              excl_history_put_addr(hw_addr);
>     23              CPU_FOREACH(cpu) {
>     24                  if (this_cpu != cpu) {
>     25                      tlb_flush_other(this_cpu, cpu, 1);
>     26                  }
>     27              }
>     28          }
>     29          /* For this vCPU, just update the TLB entry, no need to
> flush. */
>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>     31      } else {
>     32          /* Set a pending exclusive access in the MemoryRegion */
>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>     34
> env->iotlb[mmu_idx][index].addr,
>     35
> env->iotlb[mmu_idx][index].attrs);
>     36          mr->pending_excl_access = true;
>     37      }
>
>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>
>     39      tcg_exclusive_unlock();
>
>     40      /* From now on we are in LL/SC context */
>     41      this_cpu->ll_sc_context = true;
>
>     42      return ret;
>     43  }
>
>
> The exclusive lock at line 11 doesn't help if concurrent fast-patch
> store at this address occurs after we finished load at line 13 but
> before TLB is flushed as a result of line 25. If we reorder the load to
> happen after the TLB flush request we still must be sure that the flush
> is complete before we can do the load safely.

You are right, the risk actually exists. One solution to the problem
could be to ignore the data acquired by the load and redo the LL after
the flushes have been completed (basically, disas_ctx->pc points to
the LL instruction). The second time around the LL will happen without
flush requests and the access will actually be protected by the lock.
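
Roughly, one possible shape of this using cpu_loop_exit_restore() (an
untested sketch; "flushes_requested" is a placeholder for however we
detect that tlb_flush_other() was needed):

    /* at the end of helper_ldlink_name(), before returning */
    if (flushes_requested) {
        tcg_exclusive_unlock();
        /* Throw away 'ret' and restart the guest instruction once the
         * flushes have completed; since the TB ends right after the LL,
         * the guest PC still points at it. */
        cpu_loop_exit_restore(this_cpu, retaddr);
    }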

Regards,
alvise

>
>>
>> Moreover, adapt target-arm to also cope with the new multi-threaded
>> execution.
>>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  softmmu_llsc_template.h | 11 +++++++++--
>>  softmmu_template.h      |  6 ++++++
>>  target-arm/op_helper.c  |  6 ++++++
>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>> index 2c4a494..d3810c0 100644
>> --- a/softmmu_llsc_template.h
>> +++ b/softmmu_llsc_template.h
>> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>      hwaddr hw_addr;
>>      unsigned mmu_idx = get_mmuidx(oi);
>>
>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>> +
>> +    tcg_exclusive_lock();
>> +
>>      /* Use the proper load helper from cpu_ldst.h */
>>      ret = helper_ld(env, addr, oi, retaddr);
>>
>> -    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>> -
>>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>> @@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>
>>      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>
>> +    tcg_exclusive_unlock();
>> +
>>      /* From now on we are in LL/SC context */
>>      this_cpu->ll_sc_context = true;
>>
>> @@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>           * access as one made by the store conditional wrapper. If the store
>>           * conditional does not succeed, the value will be set to 0.*/
>>          cpu->excl_succeeded = true;
>> +
>> +        tcg_exclusive_lock();
>>          helper_st(env, addr, val, oi, retaddr);
>>
>>          if (cpu->excl_succeeded) {
>> @@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>
>>      /* Unset LL/SC context */
>>      cc->cpu_reset_excl_context(cpu);
>> +    tcg_exclusive_unlock();
>>
>>      return ret;
>>  }
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 76fe37e..9363a7b 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
>>          }
>>      }
>>
>> +    /* Take the lock in case we are not coming from a SC */
>> +    tcg_exclusive_lock();
>> +
>>      smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
>>                                get_mmuidx(oi), index, retaddr);
>>
>>      reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
>>
>> +    tcg_exclusive_unlock();
>> +
>>      return;
>>  }
>>
>> @@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>      /* Handle an IO access or exclusive access.  */
>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>          if (tlb_addr & TLB_EXCL) {
>> +
>>              smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
>>                                         retaddr);
>>              return;
>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>> index e22afc5..19ea52d 100644
>> --- a/target-arm/op_helper.c
>> +++ b/target-arm/op_helper.c
>> @@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
>>      cs->exception_index = excp;
>>      env->exception.syndrome = syndrome;
>>      env->exception.target_el = target_el;
>> +    tcg_exclusive_lock();
>>      cc->cpu_reset_excl_context(cs);
>> +    tcg_exclusive_unlock();
>>      cpu_loop_exit(cs);
>>  }
>>
>> @@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
>>      CPUState *cs = ENV_GET_CPU(env);
>>      CPUClass *cc = CPU_GET_CLASS(cs);
>>
>> +    tcg_exclusive_lock();
>>      cc->cpu_reset_excl_context(cs);
>> +    tcg_exclusive_unlock();
>>  }
>>
>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>> @@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
>>
>>      aarch64_save_sp(env, cur_el);
>>
>> +    tcg_exclusive_lock();
>>      cc->cpu_reset_excl_context(cs);
>> +    tcg_exclusive_unlock();
>>
>>      /* We must squash the PSTATE.SS bit to zero unless both of the
>>       * following hold:
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-10 15:53     ` alvise rigo
@ 2016-06-10 16:00       ` Sergey Fedorov
  2016-06-10 16:04         ` alvise rigo
  2016-06-14 12:00       ` Alex Bennée
  1 sibling, 1 reply; 36+ messages in thread
From: Sergey Fedorov @ 2016-06-10 16:00 UTC (permalink / raw)
  To: alvise rigo
  Cc: MTTCG Devel, Alex Bennée, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell

On 10/06/16 18:53, alvise rigo wrote:
> On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>> On 26/05/16 19:35, Alvise Rigo wrote:
>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>> LoadLink/StoreConditional thread safe.
>>>
>>> During an LL access, this lock protects the load access itself, the
>>> update of the exclusive history and the update of the VCPU's protected
>>> range.  In a SC access, the lock protects the store access itself, the
>>> possible reset of other VCPUs' protected range and the reset of the
>>> exclusive context of calling VCPU.
>>>
>>> The lock is also taken when a normal store happens to access an
>>> exclusive page to reset other VCPUs' protected range in case of
>>> collision.
>> I think the key problem here is that the load in LL helper can race with
>> a concurrent regular fast-path store. It's probably easier to annotate
>> the source here:
>>
>>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>>      3  {
>>      4      WORD_TYPE ret;
>>      5      int index;
>>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>      8      hwaddr hw_addr;
>>      9      unsigned mmu_idx = get_mmuidx(oi);
>>
>>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>
>>     11      tcg_exclusive_lock();
>>
>>     12      /* Use the proper load helper from cpu_ldst.h */
>>     13      ret = helper_ld(env, addr, oi, retaddr);
>>
>>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
>> + xlat)
>>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>     16      hw_addr = (env->iotlb[mmu_idx][index].addr &
>> TARGET_PAGE_MASK) + addr;
>>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
>> TLB_MMIO))) {
>>     18          /* If all the vCPUs have the EXCL bit set for this page
>> there is no need
>>     19           * to request any flush. */
>>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>>     21              CPUState *cpu;
>>
>>     22              excl_history_put_addr(hw_addr);
>>     23              CPU_FOREACH(cpu) {
>>     24                  if (this_cpu != cpu) {
>>     25                      tlb_flush_other(this_cpu, cpu, 1);
>>     26                  }
>>     27              }
>>     28          }
>>     29          /* For this vCPU, just update the TLB entry, no need to
>> flush. */
>>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>     31      } else {
>>     32          /* Set a pending exclusive access in the MemoryRegion */
>>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>>     34
>> env->iotlb[mmu_idx][index].addr,
>>     35
>> env->iotlb[mmu_idx][index].attrs);
>>     36          mr->pending_excl_access = true;
>>     37      }
>>
>>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>
>>     39      tcg_exclusive_unlock();
>>
>>     40      /* From now on we are in LL/SC context */
>>     41      this_cpu->ll_sc_context = true;
>>
>>     42      return ret;
>>     43  }
>>
>>
>> The exclusive lock at line 11 doesn't help if a concurrent fast-path
>> store at this address occurs after we finished load at line 13 but
>> before TLB is flushed as a result of line 25. If we reorder the load to
>> happen after the TLB flush request we still must be sure that the flush
>> is complete before we can do the load safely.
> You are right, the risk actually exists. One solution to the problem
> could be to ignore the data acquired by the load and redo the LL after
> the flushes have been completed (basically the disas_ctx->pc points to
> the LL instruction). This time the LL will happen without flush
> requests and the access will be actually protected by the lock.

Yes, as long as no other CPU evicts an entry with the same address
from the exclusive history...

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-10 16:00       ` Sergey Fedorov
@ 2016-06-10 16:04         ` alvise rigo
  0 siblings, 0 replies; 36+ messages in thread
From: alvise rigo @ 2016-06-10 16:04 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: MTTCG Devel, Alex Bennée, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell

This would require filling the whole history again, which I find very
unlikely. In any case, this has to be documented.

Thank you,
alvise

On Fri, Jun 10, 2016 at 6:00 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> On 10/06/16 18:53, alvise rigo wrote:
>> On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>> On 26/05/16 19:35, Alvise Rigo wrote:
>>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>>> LoadLink/StoreConditional thread safe.
>>>>
>>>> During an LL access, this lock protects the load access itself, the
>>>> update of the exclusive history and the update of the VCPU's protected
>>>> range.  In a SC access, the lock protects the store access itself, the
>>>> possible reset of other VCPUs' protected range and the reset of the
>>>> exclusive context of calling VCPU.
>>>>
>>>> The lock is also taken when a normal store happens to access an
>>>> exclusive page to reset other VCPUs' protected range in case of
>>>> collision.
>>> I think the key problem here is that the load in LL helper can race with
>>> a concurrent regular fast-path store. It's probably easier to annotate
>>> the source here:
>>>
>>>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>>>      3  {
>>>      4      WORD_TYPE ret;
>>>      5      int index;
>>>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>>>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>>      8      hwaddr hw_addr;
>>>      9      unsigned mmu_idx = get_mmuidx(oi);
>>>
>>>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>
>>>     11      tcg_exclusive_lock();
>>>
>>>     12      /* Use the proper load helper from cpu_ldst.h */
>>>     13      ret = helper_ld(env, addr, oi, retaddr);
>>>
>>>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
>>> + xlat)
>>>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>     16      hw_addr = (env->iotlb[mmu_idx][index].addr &
>>> TARGET_PAGE_MASK) + addr;
>>>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
>>> TLB_MMIO))) {
>>>     18          /* If all the vCPUs have the EXCL bit set for this page
>>> there is no need
>>>     19           * to request any flush. */
>>>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>>>     21              CPUState *cpu;
>>>
>>>     22              excl_history_put_addr(hw_addr);
>>>     23              CPU_FOREACH(cpu) {
>>>     24                  if (this_cpu != cpu) {
>>>     25                      tlb_flush_other(this_cpu, cpu, 1);
>>>     26                  }
>>>     27              }
>>>     28          }
>>>     29          /* For this vCPU, just update the TLB entry, no need to
>>> flush. */
>>>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>>     31      } else {
>>>     32          /* Set a pending exclusive access in the MemoryRegion */
>>>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>>>     34
>>> env->iotlb[mmu_idx][index].addr,
>>>     35
>>> env->iotlb[mmu_idx][index].attrs);
>>>     36          mr->pending_excl_access = true;
>>>     37      }
>>>
>>>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>
>>>     39      tcg_exclusive_unlock();
>>>
>>>     40      /* From now on we are in LL/SC context */
>>>     41      this_cpu->ll_sc_context = true;
>>>
>>>     42      return ret;
>>>     43  }
>>>
>>>
>>> The exclusive lock at line 11 doesn't help if a concurrent fast-path
>>> store at this address occurs after we finished load at line 13 but
>>> before TLB is flushed as a result of line 25. If we reorder the load to
>>> happen after the TLB flush request we still must be sure that the flush
>>> is complete before we can do the load safely.
>> You are right, the risk actually exists. One solution to the problem
>> could be to ignore the data acquired by the load and redo the LL after
>> the flushes have been completed (basically the disas_ctx->pc points to
>> the LL instruction). This time the LL will happen without flush
>> requests and the access will be actually protected by the lock.
>
> Yes, as long as no other CPU evicts an entry with the same address
> from the exclusive history...
>
> Kind regards,
> Sergey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-10 15:21   ` Sergey Fedorov
  2016-06-10 15:53     ` alvise rigo
@ 2016-06-10 16:15     ` Alex Bennée
  2016-06-11 19:53       ` Sergey Fedorov
  2016-06-14  8:37       ` Alex Bennée
  1 sibling, 2 replies; 36+ messages in thread
From: Alex Bennée @ 2016-06-10 16:15 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Alvise Rigo, mttcg, qemu-devel, jani.kokkonen, claudio.fontana,
	tech, fred.konrad, pbonzini, rth, cota, peter.maydell


Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 26/05/16 19:35, Alvise Rigo wrote:
>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>> LoadLink/StoreConditional thread safe.
>>
>> During an LL access, this lock protects the load access itself, the
>> update of the exclusive history and the update of the VCPU's protected
>> range.  In a SC access, the lock protects the store access itself, the
>> possible reset of other VCPUs' protected range and the reset of the
>> exclusive context of calling VCPU.
>>
>> The lock is also taken when a normal store happens to access an
>> exclusive page to reset other VCPUs' protected range in case of
>> collision.
>
> I think the key problem here is that the load in LL helper can race with
> a concurrent regular fast-path store. It's probably easier to annotate
> the source here:
>
>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>      3  {
>      4      WORD_TYPE ret;
>      5      int index;
>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>      8      hwaddr hw_addr;
>      9      unsigned mmu_idx = get_mmuidx(oi);
>
>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>
>     11      tcg_exclusive_lock();
>
>     12      /* Use the proper load helper from cpu_ldst.h */
>     13      ret = helper_ld(env, addr, oi, retaddr);
>
>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
> + xlat)
>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>     16      hw_addr = (env->iotlb[mmu_idx][index].addr &
> TARGET_PAGE_MASK) + addr;
>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
> TLB_MMIO))) {
>     18          /* If all the vCPUs have the EXCL bit set for this page
> there is no need
>     19           * to request any flush. */
>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>     21              CPUState *cpu;
>
>     22              excl_history_put_addr(hw_addr);
>     23              CPU_FOREACH(cpu) {
>     24                  if (this_cpu != cpu) {
>     25                      tlb_flush_other(this_cpu, cpu, 1);
>     26                  }
>     27              }
>     28          }
>     29          /* For this vCPU, just update the TLB entry, no need to
> flush. */
>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>     31      } else {
>     32          /* Set a pending exclusive access in the MemoryRegion */
>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>     34
> env->iotlb[mmu_idx][index].addr,
>     35
> env->iotlb[mmu_idx][index].attrs);
>     36          mr->pending_excl_access = true;
>     37      }
>
>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>
>     39      tcg_exclusive_unlock();
>
>     40      /* From now on we are in LL/SC context */
>     41      this_cpu->ll_sc_context = true;
>
>     42      return ret;
>     43  }
>
>
> The exclusive lock at line 11 doesn't help if a concurrent fast-path
> store at this address occurs after we finished load at line 13 but
> before TLB is flushed as a result of line 25. If we reorder the load to
> happen after the TLB flush request we still must be sure that the flush
> is complete before we can do the load safely.

I think this can be fixed using async_safe_run_on_cpu and tweaking the
ldlink helper.

  * Change the helper_ldlink call
    - pass it offset-of(cpu->reg[n]) so it can store result of load
    - maybe pass it next-pc (unless there is some other way to know)

  vCPU runs until the ldlink instruction occurs and jumps to the helper

  * Once in the helper_ldlink
    - queue up an async helper function with info of offset
    - cpu_loop_exit_restore(with next PC)

  The vCPU that issued the ldlink exits immediately, waits until all vCPUs are
  out of generated code.

  * Once in helper_ldlink async helper
    - Everything at this point is quiescent, no vCPU activity
    - Flush all TLBs/set flags
    - Do the load from memory, store directly into cpu->reg[n]

The key thing is once we are committed to load in the async helper
nothing else can get in the way. Any stores before we are in the helper
happen as normal, once we exit the async helper all potential
conflicting stores will slow path.

There is a little messing about in knowing the next PC which is simple
in the ARM case but gets a bit more complicated for architectures that
have deferred jump slots. I haven't looked into this nit yet.
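
Purely to make that shape concrete (this is not code from the series): a rough
sketch follows, where the LLWork container, the helper_ldlink_async() callback
and the exact async_safe_run_on_cpu() signature are all assumptions taken from
the snippets quoted in this thread.

    /* Sketch only: all names below except helper_ldlink_name(),
     * async_safe_run_on_cpu() and cpu_loop_exit_restore() are invented. */
    typedef struct LLWork {
        target_ulong addr;
        TCGMemOpIdx oi;
        size_t reg_offset;   /* offset-of(cpu->reg[n]) passed by the translator */
    } LLWork;

    static void helper_ldlink_async(CPUState *cpu, void *opaque)
    {
        LLWork *w = opaque;

        /* Everything is quiescent here: no vCPU is executing generated code,
         * so the flush and the load cannot race with a fast-path store. */
        flush_other_tlbs_and_set_excl(cpu, w->addr);        /* hypothetical */
        load_into_reg(cpu, w->addr, w->oi, w->reg_offset);  /* hypothetical */
        g_free(w);
    }

    WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
                                 TCGMemOpIdx oi, uintptr_t retaddr)
    {
        CPUState *cs = ENV_GET_CPU(env);
        LLWork *w = g_new(LLWork, 1);

        w->addr = addr;
        w->oi = oi;
        w->reg_offset = 0;   /* filled in from the translated code in reality */

        async_safe_run_on_cpu(cs, helper_ldlink_async, w);
        /* Leave the loop; arranging to resume at the *next* PC, as described
         * above, is glossed over in this sketch. */
        cpu_loop_exit_restore(cs, retaddr);
        return 0;            /* not reached */
    }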

>
>>
>> Moreover, adapt target-arm to also cope with the new multi-threaded
>> execution.
>>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  softmmu_llsc_template.h | 11 +++++++++--
>>  softmmu_template.h      |  6 ++++++
>>  target-arm/op_helper.c  |  6 ++++++
>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>> index 2c4a494..d3810c0 100644
>> --- a/softmmu_llsc_template.h
>> +++ b/softmmu_llsc_template.h
>> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>      hwaddr hw_addr;
>>      unsigned mmu_idx = get_mmuidx(oi);
>>
>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>> +
>> +    tcg_exclusive_lock();
>> +
>>      /* Use the proper load helper from cpu_ldst.h */
>>      ret = helper_ld(env, addr, oi, retaddr);
>>
>> -    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>> -
>>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>> @@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>
>>      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>
>> +    tcg_exclusive_unlock();
>> +
>>      /* From now on we are in LL/SC context */
>>      this_cpu->ll_sc_context = true;
>>
>> @@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>           * access as one made by the store conditional wrapper. If the store
>>           * conditional does not succeed, the value will be set to 0.*/
>>          cpu->excl_succeeded = true;
>> +
>> +        tcg_exclusive_lock();
>>          helper_st(env, addr, val, oi, retaddr);
>>
>>          if (cpu->excl_succeeded) {
>> @@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>
>>      /* Unset LL/SC context */
>>      cc->cpu_reset_excl_context(cpu);
>> +    tcg_exclusive_unlock();
>>
>>      return ret;
>>  }
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 76fe37e..9363a7b 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
>>          }
>>      }
>>
>> +    /* Take the lock in case we are not coming from a SC */
>> +    tcg_exclusive_lock();
>> +
>>      smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
>>                                get_mmuidx(oi), index, retaddr);
>>
>>      reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
>>
>> +    tcg_exclusive_unlock();
>> +
>>      return;
>>  }
>>
>> @@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>      /* Handle an IO access or exclusive access.  */
>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>          if (tlb_addr & TLB_EXCL) {
>> +
>>              smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
>>                                         retaddr);
>>              return;
>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>> index e22afc5..19ea52d 100644
>> --- a/target-arm/op_helper.c
>> +++ b/target-arm/op_helper.c
>> @@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
>>      cs->exception_index = excp;
>>      env->exception.syndrome = syndrome;
>>      env->exception.target_el = target_el;
>> +    tcg_exclusive_lock();
>>      cc->cpu_reset_excl_context(cs);
>> +    tcg_exclusive_unlock();
>>      cpu_loop_exit(cs);
>>  }
>>
>> @@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
>>      CPUState *cs = ENV_GET_CPU(env);
>>      CPUClass *cc = CPU_GET_CLASS(cs);
>>
>> +    tcg_exclusive_lock();
>>      cc->cpu_reset_excl_context(cs);
>> +    tcg_exclusive_unlock();
>>  }
>>
>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>> @@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
>>
>>      aarch64_save_sp(env, cur_el);
>>
>> +    tcg_exclusive_lock();
>>      cc->cpu_reset_excl_context(cs);
>> +    tcg_exclusive_unlock();
>>
>>      /* We must squash the PSTATE.SS bit to zero unless both of the
>>       * following hold:


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-10 16:15     ` Alex Bennée
@ 2016-06-11 19:53       ` Sergey Fedorov
  2016-06-14  8:37       ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Sergey Fedorov @ 2016-06-11 19:53 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Alvise Rigo, mttcg, qemu-devel, jani.kokkonen, claudio.fontana,
	tech, fred.konrad, pbonzini, rth, cota, peter.maydell

On 10/06/16 19:15, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 26/05/16 19:35, Alvise Rigo wrote:
>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>> LoadLink/StoreConditional thread safe.
>>>
>>> During an LL access, this lock protects the load access itself, the
>>> update of the exclusive history and the update of the VCPU's protected
>>> range.  In a SC access, the lock protects the store access itself, the
>>> possible reset of other VCPUs' protected range and the reset of the
>>> exclusive context of calling VCPU.
>>>
>>> The lock is also taken when a normal store happens to access an
>>> exclusive page to reset other VCPUs' protected range in case of
>>> collision.
>> I think the key problem here is that the load in LL helper can race with
>> a concurrent regular fast-path store. 
(snip)
> I think this can be fixed using async_safe_run_on_cpu and tweaking the
> ldlink helper.
>
>   * Change the helper_ldlink call
>     - pass it offset-of(cpu->reg[n]) so it can store result of load
>     - maybe pass it next-pc (unless there is some other way to know)
>
>   vCPU runs until the ldlink instruction occurs and jumps to the helper
>
>   * Once in the helper_ldlink
>     - queue up an async helper function with info of offset
>     - cpu_loop_exit_restore(with next PC)
>
>   The vCPU that issued the ldlink exits immediately, waits until all vCPUs are
>   out of generated code.
>
>   * Once in helper_ldlink async helper
>     - Everything at this point is quiescent, no vCPU activity
>     - Flush all TLBs/set flags
>     - Do the load from memory, store directly into cpu->reg[n]
>
> The key thing is once we are committed to load in the async helper
> nothing else can get in the way. Any stores before we are in the helper
> happen as normal, once we exit the async helper all potential
> conflicting stores will slow path.
>
> There is a little messing about in knowing the next PC which is simple
> in the ARM case but gets a bit more complicated for architectures that
> have deferred jump slots. I haven't looked into this nit yet.


Hmm, this looks pretty similar to what linux-user code actually does,
e.g. in do_strex(). Just restarting the LL instruction as Alvise
suggests may well be an easier approach (or may not).

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()
  2016-06-08 16:24           ` alvise rigo
@ 2016-06-13  9:26             ` Alex Bennée
  0 siblings, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2016-06-13  9:26 UTC (permalink / raw)
  To: alvise rigo
  Cc: Sergey Fedorov, MTTCG Devel, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell


alvise rigo <a.rigo@virtualopensystems.com> writes:

> I think that async_safe_run_on_cpu() does a different thing: it
> submits a job to the target vCPU and wants all the others to "observe"
> the submitted task. However, we will only have the certainty that the
> target vCPU observed the task; the others might still be running in the
> guest code.

For the code to have run, every vCPU will have come out of the run loop and
synced up at that point. No safe work is run with guest code executing.

>
> alvise
>
> On Wed, Jun 8, 2016 at 5:20 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>>
>>> On 08/06/16 17:10, alvise rigo wrote:
>>>> Using run_on_cpu() we might deadlock QEMU if other vCPUs are waiting
>>>> for the current vCPU. We need to exit from the vCPU loop in order to
>>>> avoid this.
>>>
>>> I see, we could deadlock indeed. Alternatively, we may want to fix
>>> run_on_cpu() to avoid waiting for completion by itself when called from
>>> the CPU loop.
>>
>> async_safe_run_on_cpu can't deadlock as all vCPUs are suspended (or
>> waiting) for the work to complete. The tasks are run in strict order so
>> if you queued async tasks for other vCPUs first you could ensure
>> everything is in the state you want it when you finally service the
>> calling vCPU.
>>
>>>
>>> Kind regards,
>>> Sergey
>>
>>
>> --
>> Alex Bennée


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-10 16:15     ` Alex Bennée
  2016-06-11 19:53       ` Sergey Fedorov
@ 2016-06-14  8:37       ` Alex Bennée
  2016-06-14  9:31         ` Sergey Fedorov
  1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2016-06-14  8:37 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Alvise Rigo, mttcg, qemu-devel, jani.kokkonen, claudio.fontana,
	tech, fred.konrad, pbonzini, rth, cota, peter.maydell


Alex Bennée <alex.bennee@linaro.org> writes:

> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 26/05/16 19:35, Alvise Rigo wrote:
>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>> LoadLink/StoreConditional thread safe.
>>>
>>> During an LL access, this lock protects the load access itself, the
>>> update of the exclusive history and the update of the VCPU's protected
>>> range.  In a SC access, the lock protects the store access itself, the
>>> possible reset of other VCPUs' protected range and the reset of the
>>> exclusive context of calling VCPU.
>>>
>>> The lock is also taken when a normal store happens to access an
>>> exclusive page to reset other VCPUs' protected range in case of
>>> collision.
>>
>> I think the key problem here is that the load in LL helper can race with
>> a concurrent regular fast-path store. It's probably easier to annotate
>> the source here:
>>
>>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>>      3  {
>>      4      WORD_TYPE ret;
>>      5      int index;
>>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>      8      hwaddr hw_addr;
>>      9      unsigned mmu_idx = get_mmuidx(oi);
>>
>>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>
>>     11      tcg_exclusive_lock();
>>
>>     12      /* Use the proper load helper from cpu_ldst.h */
>>     13      ret = helper_ld(env, addr, oi, retaddr);
>>
>>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
>> + xlat)
>>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>     16      hw_addr = (env->iotlb[mmu_idx][index].addr &
>> TARGET_PAGE_MASK) + addr;
>>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
>> TLB_MMIO))) {
>>     18          /* If all the vCPUs have the EXCL bit set for this page
>> there is no need
>>     19           * to request any flush. */
>>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>>     21              CPUState *cpu;
>>
>>     22              excl_history_put_addr(hw_addr);
>>     23              CPU_FOREACH(cpu) {
>>     24                  if (this_cpu != cpu) {
>>     25                      tlb_flush_other(this_cpu, cpu, 1);
>>     26                  }
>>     27              }
>>     28          }
>>     29          /* For this vCPU, just update the TLB entry, no need to
>> flush. */
>>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>     31      } else {
>>     32          /* Set a pending exclusive access in the MemoryRegion */
>>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>>     34
>> env->iotlb[mmu_idx][index].addr,
>>     35
>> env->iotlb[mmu_idx][index].attrs);
>>     36          mr->pending_excl_access = true;
>>     37      }
>>
>>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>
>>     39      tcg_exclusive_unlock();
>>
>>     40      /* From now on we are in LL/SC context */
>>     41      this_cpu->ll_sc_context = true;
>>
>>     42      return ret;
>>     43  }
>>
>>
>> The exclusive lock at line 11 doesn't help if a concurrent fast-path
>> store at this address occurs after we finished load at line 13 but
>> before TLB is flushed as a result of line 25. If we reorder the load to
>> happen after the TLB flush request we still must be sure that the flush
>> is complete before we can do the load safely.
>
> I think this can be fixed using async_safe_run_on_cpu and tweaking the
> ldlink helper.
>
>   * Change the helper_ldlink call
>     - pass it offset-of(cpu->reg[n]) so it can store result of load
>     - maybe pass it next-pc (unless there is some other way to know)
>
>   vCPU runs until the ldlink instruction occurs and jumps to the helper
>
>   * Once in the helper_ldlink
>     - queue up an async helper function with info of offset
>     - cpu_loop_exit_restore(with next PC)
>
>   The vCPU that issued the ldlink exits immediately, waits until all vCPUs are
>   out of generated code.
>
>   * Once in helper_ldlink async helper
>     - Everything at this point is quiescent, no vCPU activity
>     - Flush all TLBs/set flags
>     - Do the load from memory, store directly into cpu->reg[n]
>
> The key thing is once we are committed to load in the async helper
> nothing else can get in the way. Any stores before we are in the helper
> happen as normal, once we exit the async helper all potential
> conflicting stores will slow path.
>
> There is a little messing about in knowing the next PC which is simple
> in the ARM case but gets a bit more complicated for architectures that
> have deferred jump slots. I haven't looked into this nit yet.

Thinking on it further the messing about with offset and PCs could be
solved just by having a simple flag:

 - First run, no flag set, queue safe work, restart loop at PC
 - Second run, flag set, do load, clear flag

As long as the flag is per-vCPU I think it's safe from races.
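
If it helps, the flag-based variant could look roughly like the following;
pending_ll and queue_ll_flush_work() are invented names, the rest follows the
helper quoted above, so treat it as a sketch rather than a patch.

    WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
                                 TCGMemOpIdx oi, uintptr_t retaddr)
    {
        CPUState *cs = ENV_GET_CPU(env);
        WORD_TYPE ret;

        if (!cs->pending_ll) {
            /* First run: no flag set, queue the safe work (TLB flushes and
             * EXCL bit setting) and restart the loop at this PC. */
            cs->pending_ll = true;
            queue_ll_flush_work(cs, addr);
            cpu_loop_exit_restore(cs, retaddr);
        }

        /* Second run: flushes are done, do the load and clear the flag. */
        cs->pending_ll = false;
        tcg_exclusive_lock();
        ret = helper_ld(env, addr, oi, retaddr);
        /* ... update the exclusive history and protected range as before ... */
        tcg_exclusive_unlock();
        return ret;
    }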

>
>>
>>>
>>> Moreover, adapt target-arm to also cope with the new multi-threaded
>>> execution.
>>>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> ---
>>>  softmmu_llsc_template.h | 11 +++++++++--
>>>  softmmu_template.h      |  6 ++++++
>>>  target-arm/op_helper.c  |  6 ++++++
>>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>>> index 2c4a494..d3810c0 100644
>>> --- a/softmmu_llsc_template.h
>>> +++ b/softmmu_llsc_template.h
>>> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>      hwaddr hw_addr;
>>>      unsigned mmu_idx = get_mmuidx(oi);
>>>
>>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>> +
>>> +    tcg_exclusive_lock();
>>> +
>>>      /* Use the proper load helper from cpu_ldst.h */
>>>      ret = helper_ld(env, addr, oi, retaddr);
>>>
>>> -    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>> -
>>>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>>> @@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>
>>>      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>
>>> +    tcg_exclusive_unlock();
>>> +
>>>      /* From now on we are in LL/SC context */
>>>      this_cpu->ll_sc_context = true;
>>>
>>> @@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>           * access as one made by the store conditional wrapper. If the store
>>>           * conditional does not succeed, the value will be set to 0.*/
>>>          cpu->excl_succeeded = true;
>>> +
>>> +        tcg_exclusive_lock();
>>>          helper_st(env, addr, val, oi, retaddr);
>>>
>>>          if (cpu->excl_succeeded) {
>>> @@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>
>>>      /* Unset LL/SC context */
>>>      cc->cpu_reset_excl_context(cpu);
>>> +    tcg_exclusive_unlock();
>>>
>>>      return ret;
>>>  }
>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>> index 76fe37e..9363a7b 100644
>>> --- a/softmmu_template.h
>>> +++ b/softmmu_template.h
>>> @@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
>>>          }
>>>      }
>>>
>>> +    /* Take the lock in case we are not coming from a SC */
>>> +    tcg_exclusive_lock();
>>> +
>>>      smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
>>>                                get_mmuidx(oi), index, retaddr);
>>>
>>>      reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
>>>
>>> +    tcg_exclusive_unlock();
>>> +
>>>      return;
>>>  }
>>>
>>> @@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>      /* Handle an IO access or exclusive access.  */
>>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>>          if (tlb_addr & TLB_EXCL) {
>>> +
>>>              smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
>>>                                         retaddr);
>>>              return;
>>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>>> index e22afc5..19ea52d 100644
>>> --- a/target-arm/op_helper.c
>>> +++ b/target-arm/op_helper.c
>>> @@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
>>>      cs->exception_index = excp;
>>>      env->exception.syndrome = syndrome;
>>>      env->exception.target_el = target_el;
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>      cpu_loop_exit(cs);
>>>  }
>>>
>>> @@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
>>>      CPUState *cs = ENV_GET_CPU(env);
>>>      CPUClass *cc = CPU_GET_CLASS(cs);
>>>
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>  }
>>>
>>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>>> @@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
>>>
>>>      aarch64_save_sp(env, cur_el);
>>>
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>
>>>      /* We must squash the PSTATE.SS bit to zero unless both of the
>>>       * following hold:


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-14  8:37       ` Alex Bennée
@ 2016-06-14  9:31         ` Sergey Fedorov
  0 siblings, 0 replies; 36+ messages in thread
From: Sergey Fedorov @ 2016-06-14  9:31 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Alvise Rigo, mttcg, qemu-devel, jani.kokkonen, claudio.fontana,
	tech, fred.konrad, pbonzini, rth, cota, peter.maydell

On 14/06/16 11:37, Alex Bennée wrote:
> Alex Bennée <alex.bennee@linaro.org> writes:
>
>> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>>
>>> On 26/05/16 19:35, Alvise Rigo wrote:
>>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>>> LoadLink/StoreConditional thread safe.
>>>>
>>>> During an LL access, this lock protects the load access itself, the
>>>> update of the exclusive history and the update of the VCPU's protected
>>>> range.  In a SC access, the lock protects the store access itself, the
>>>> possible reset of other VCPUs' protected range and the reset of the
>>>> exclusive context of calling VCPU.
>>>>
>>>> The lock is also taken when a normal store happens to access an
>>>> exclusive page to reset other VCPUs' protected range in case of
>>>> collision.
>>> I think the key problem here is that the load in LL helper can race with
>>> a concurrent regular fast-path store. It's probably easier to annotate
>>> the source here:
>>>
>>>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>>>      3  {
>>>      4      WORD_TYPE ret;
>>>      5      int index;
>>>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>>>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>>      8      hwaddr hw_addr;
>>>      9      unsigned mmu_idx = get_mmuidx(oi);
>>>
>>>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>
>>>     11      tcg_exclusive_lock();
>>>
>>>     12      /* Use the proper load helper from cpu_ldst.h */
>>>     13      ret = helper_ld(env, addr, oi, retaddr);
>>>
>>>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
>>> + xlat)
>>>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>     16      hw_addr = (env->iotlb[mmu_idx][index].addr &
>>> TARGET_PAGE_MASK) + addr;
>>>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
>>> TLB_MMIO))) {
>>>     18          /* If all the vCPUs have the EXCL bit set for this page
>>> there is no need
>>>     19           * to request any flush. */
>>>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>>>     21              CPUState *cpu;
>>>
>>>     22              excl_history_put_addr(hw_addr);
>>>     23              CPU_FOREACH(cpu) {
>>>     24                  if (this_cpu != cpu) {
>>>     25                      tlb_flush_other(this_cpu, cpu, 1);
>>>     26                  }
>>>     27              }
>>>     28          }
>>>     29          /* For this vCPU, just update the TLB entry, no need to
>>> flush. */
>>>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>>     31      } else {
>>>     32          /* Set a pending exclusive access in the MemoryRegion */
>>>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>>>     34
>>> env->iotlb[mmu_idx][index].addr,
>>>     35
>>> env->iotlb[mmu_idx][index].attrs);
>>>     36          mr->pending_excl_access = true;
>>>     37      }
>>>
>>>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>
>>>     39      tcg_exclusive_unlock();
>>>
>>>     40      /* From now on we are in LL/SC context */
>>>     41      this_cpu->ll_sc_context = true;
>>>
>>>     42      return ret;
>>>     43  }
>>>
>>>
>>> The exclusive lock at line 11 doesn't help if a concurrent fast-path
>>> store at this address occurs after we finished load at line 13 but
>>> before TLB is flushed as a result of line 25. If we reorder the load to
>>> happen after the TLB flush request we still must be sure that the flush
>>> is complete before we can do the load safely.
>> I think this can be fixed using async_safe_run_on_cpu and tweaking the
>> ldlink helper.
>>
>>   * Change the helper_ldlink call
>>     - pass it offset-of(cpu->reg[n]) so it can store result of load
>>     - maybe pass it next-pc (unless there is some other way to know)
>>
>>   vCPU runs until the ldlink instruction occurs and jumps to the helper
>>
>>   * Once in the helper_ldlink
>>     - queue up an async helper function with info of offset
>>     - cpu_loop_exit_restore(with next PC)
>>
>>   The vCPU that issued the ldlink exits immediately, waits until all vCPUs are
>>   out of generated code.
>>
>>   * Once in helper_ldlink async helper
>>     - Everything at this point is quiescent, no vCPU activity
>>     - Flush all TLBs/set flags
>>     - Do the load from memory, store directly into cpu->reg[n]
>>
>> The key thing is once we are committed to load in the async helper
>> nothing else can get in the way. Any stores before we are in the helper
>> happen as normal, once we exit the async helper all potential
>> conflicting stores will slow path.
>>
>> There is a little messing about in knowing the next PC which is simple
>> in the ARM case but gets a bit more complicated for architectures that
>> have deferred jump slots. I haven't looked into this nit yet.
> Thinking on it further the messing about with offset and PCs could be
> solved just by having a simple flag:
>
>  - First run, no flag set, queue safe work, restart loop at PC
>  - Second run, flag set, do load, clear flag
>
> As long as the flag is per-vCPU I think it's safe from races.

As soon as we have flushed the TLBs properly, we can set an exclusive flag for
the page. On the next attempt to execute the LL we can do the actual load and
proceed further. I think there's no need to introduce a special flag.
That is basically what Alvise suggested, see
http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417787.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-10 15:53     ` alvise rigo
  2016-06-10 16:00       ` Sergey Fedorov
@ 2016-06-14 12:00       ` Alex Bennée
  2016-06-14 12:58         ` alvise rigo
  1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2016-06-14 12:00 UTC (permalink / raw)
  To: alvise rigo
  Cc: Sergey Fedorov, MTTCG Devel, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell


alvise rigo <a.rigo@virtualopensystems.com> writes:

> On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>> On 26/05/16 19:35, Alvise Rigo wrote:
>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>> LoadLink/StoreConditional thread safe.
>>>
>>> During an LL access, this lock protects the load access itself, the
>>> update of the exclusive history and the update of the VCPU's protected
>>> range.  In a SC access, the lock protects the store access itself, the
>>> possible reset of other VCPUs' protected range and the reset of the
>>> exclusive context of calling VCPU.
>>>
>>> The lock is also taken when a normal store happens to access an
>>> exclusive page to reset other VCPUs' protected range in case of
>>> collision.
>>
>> I think the key problem here is that the load in LL helper can race with
>> a concurrent regular fast-path store. It's probably easier to annotate
>> the source here:
>>
>>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>>      3  {
>>      4      WORD_TYPE ret;
>>      5      int index;
>>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>      8      hwaddr hw_addr;
>>      9      unsigned mmu_idx = get_mmuidx(oi);
>>
>>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>
>>     11      tcg_exclusive_lock();
>>
>>     12      /* Use the proper load helper from cpu_ldst.h */
>>     13      ret = helper_ld(env, addr, oi, retaddr);
>>
>>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
>> + xlat)
>>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>     16      hw_addr = (env->iotlb[mmu_idx][index].addr &
>> TARGET_PAGE_MASK) + addr;
>>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
>> TLB_MMIO))) {
>>     18          /* If all the vCPUs have the EXCL bit set for this page
>> there is no need
>>     19           * to request any flush. */
>>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>>     21              CPUState *cpu;
>>
>>     22              excl_history_put_addr(hw_addr);
>>     23              CPU_FOREACH(cpu) {
>>     24                  if (this_cpu != cpu) {
>>     25                      tlb_flush_other(this_cpu, cpu, 1);
>>     26                  }
>>     27              }
>>     28          }
>>     29          /* For this vCPU, just update the TLB entry, no need to
>> flush. */
>>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>     31      } else {
>>     32          /* Set a pending exclusive access in the MemoryRegion */
>>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>>     34
>> env->iotlb[mmu_idx][index].addr,
>>     35
>> env->iotlb[mmu_idx][index].attrs);
>>     36          mr->pending_excl_access = true;
>>     37      }
>>
>>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>
>>     39      tcg_exclusive_unlock();
>>
>>     40      /* From now on we are in LL/SC context */
>>     41      this_cpu->ll_sc_context = true;
>>
>>     42      return ret;
>>     43  }
>>
>>
>> The exclusive lock at line 11 doesn't help if a concurrent fast-path
>> store at this address occurs after we finished load at line 13 but
>> before TLB is flushed as a result of line 25. If we reorder the load to
>> happen after the TLB flush request we still must be sure that the flush
>> is complete before we can do the load safely.
>
> You are right, the risk actually exists. One solution to the problem
> could be to ignore the data acquired by the load and redo the LL after
> the flushes have been completed (basically the disas_ctx->pc points to
> the LL instruction). This time the LL will happen without flush
> requests and the access will be actually protected by the lock.

So is just punting the work off to an async safe task and restarting the
vCPU thread going to be enough?

>
> Regards,
> alvise
>
>>
>>>
>>> Moreover, adapt target-arm to also cope with the new multi-threaded
>>> execution.
>>>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> ---
>>>  softmmu_llsc_template.h | 11 +++++++++--
>>>  softmmu_template.h      |  6 ++++++
>>>  target-arm/op_helper.c  |  6 ++++++
>>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>>> index 2c4a494..d3810c0 100644
>>> --- a/softmmu_llsc_template.h
>>> +++ b/softmmu_llsc_template.h
>>> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>      hwaddr hw_addr;
>>>      unsigned mmu_idx = get_mmuidx(oi);
>>>
>>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>> +
>>> +    tcg_exclusive_lock();
>>> +
>>>      /* Use the proper load helper from cpu_ldst.h */
>>>      ret = helper_ld(env, addr, oi, retaddr);
>>>
>>> -    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>> -
>>>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>>> @@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>
>>>      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>
>>> +    tcg_exclusive_unlock();
>>> +
>>>      /* From now on we are in LL/SC context */
>>>      this_cpu->ll_sc_context = true;
>>>
>>> @@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>           * access as one made by the store conditional wrapper. If the store
>>>           * conditional does not succeed, the value will be set to 0.*/
>>>          cpu->excl_succeeded = true;
>>> +
>>> +        tcg_exclusive_lock();
>>>          helper_st(env, addr, val, oi, retaddr);
>>>
>>>          if (cpu->excl_succeeded) {
>>> @@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>
>>>      /* Unset LL/SC context */
>>>      cc->cpu_reset_excl_context(cpu);
>>> +    tcg_exclusive_unlock();
>>>
>>>      return ret;
>>>  }
>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>> index 76fe37e..9363a7b 100644
>>> --- a/softmmu_template.h
>>> +++ b/softmmu_template.h
>>> @@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
>>>          }
>>>      }
>>>
>>> +    /* Take the lock in case we are not coming from a SC */
>>> +    tcg_exclusive_lock();
>>> +
>>>      smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
>>>                                get_mmuidx(oi), index, retaddr);
>>>
>>>      reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
>>>
>>> +    tcg_exclusive_unlock();
>>> +
>>>      return;
>>>  }
>>>
>>> @@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>      /* Handle an IO access or exclusive access.  */
>>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>>          if (tlb_addr & TLB_EXCL) {
>>> +
>>>              smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
>>>                                         retaddr);
>>>              return;
>>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>>> index e22afc5..19ea52d 100644
>>> --- a/target-arm/op_helper.c
>>> +++ b/target-arm/op_helper.c
>>> @@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
>>>      cs->exception_index = excp;
>>>      env->exception.syndrome = syndrome;
>>>      env->exception.target_el = target_el;
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>      cpu_loop_exit(cs);
>>>  }
>>>
>>> @@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
>>>      CPUState *cs = ENV_GET_CPU(env);
>>>      CPUClass *cc = CPU_GET_CLASS(cs);
>>>
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>  }
>>>
>>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>>> @@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
>>>
>>>      aarch64_save_sp(env, cur_el);
>>>
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>
>>>      /* We must squash the PSTATE.SS bit to zero unless both of the
>>>       * following hold:
>>


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-14 12:00       ` Alex Bennée
@ 2016-06-14 12:58         ` alvise rigo
  2016-06-14 13:14           ` Alex Bennée
  0 siblings, 1 reply; 36+ messages in thread
From: alvise rigo @ 2016-06-14 12:58 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Sergey Fedorov, MTTCG Devel, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell

1. LL(x)                   // x requires a flush
2. query flush to all the n VCPUs
3. exit from the CPU loop and wait until all the flushes are done
4. enter the loop to re-execute LL(x). This time no flush is required

Now, points 2. and 3. can be done either with n calls of
async_safe_run_on_cpu() or with n calls of async_wait_run_on_cpu(). In my
opinion the former is not really suited to this use case, since it would call
cpu_exit() n^2 times and it would not really ensure that the VCPU has
exited from the guest code to make an iteration of the CPU loop.
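
In code, steps 2.-4. would sit in the LL helper roughly as below; this reuses
tlb_flush_other() and the variables from the quoted helper, and whether the
flush requests are backed by async_wait_run_on_cpu() or async_safe_run_on_cpu()
is exactly the point under discussion.

    CPUState *cpu;

    /* Step 2: query a flush to every other vCPU. */
    CPU_FOREACH(cpu) {
        if (cpu != this_cpu) {
            tlb_flush_other(this_cpu, cpu, 1);
        }
    }
    /* Step 3: leave the CPU loop and wait until all the flushes are done. */
    cpu_loop_exit_restore(this_cpu, retaddr);
    /* Step 4: on re-entry the LL is re-executed; this time no flush is
     * required and the access is protected by tcg_exclusive_lock(). */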

Regards,
alvise

On Tue, Jun 14, 2016 at 2:00 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> alvise rigo <a.rigo@virtualopensystems.com> writes:
>
>> On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>> On 26/05/16 19:35, Alvise Rigo wrote:
>>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>>> LoadLink/StoreConditional thread safe.
>>>>
>>>> During an LL access, this lock protects the load access itself, the
>>>> update of the exclusive history and the update of the VCPU's protected
>>>> range.  In a SC access, the lock protects the store access itself, the
>>>> possible reset of other VCPUs' protected range and the reset of the
>>>> exclusive context of calling VCPU.
>>>>
>>>> The lock is also taken when a normal store happens to access an
>>>> exclusive page to reset other VCPUs' protected range in case of
>>>> collision.
>>>
>>> I think the key problem here is that the load in LL helper can race with
>>> a concurrent regular fast-path store. It's probably easier to annotate
>>> the source here:
>>>
>>>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>>>      3  {
>>>      4      WORD_TYPE ret;
>>>      5      int index;
>>>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>>>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>>      8      hwaddr hw_addr;
>>>      9      unsigned mmu_idx = get_mmuidx(oi);
>>>
>>>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>
>>>     11      tcg_exclusive_lock();
>>>
>>>     12      /* Use the proper load helper from cpu_ldst.h */
>>>     13      ret = helper_ld(env, addr, oi, retaddr);
>>>
>>>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
>>> + xlat)
>>>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>     16      hw_addr = (env->iotlb[mmu_idx][index].addr &
>>> TARGET_PAGE_MASK) + addr;
>>>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
>>> TLB_MMIO))) {
>>>     18          /* If all the vCPUs have the EXCL bit set for this page
>>> there is no need
>>>     19           * to request any flush. */
>>>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>>>     21              CPUState *cpu;
>>>
>>>     22              excl_history_put_addr(hw_addr);
>>>     23              CPU_FOREACH(cpu) {
>>>     24                  if (this_cpu != cpu) {
>>>     25                      tlb_flush_other(this_cpu, cpu, 1);
>>>     26                  }
>>>     27              }
>>>     28          }
>>>     29          /* For this vCPU, just update the TLB entry, no need to
>>> flush. */
>>>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>>     31      } else {
>>>     32          /* Set a pending exclusive access in the MemoryRegion */
>>>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>>>     34
>>> env->iotlb[mmu_idx][index].addr,
>>>     35
>>> env->iotlb[mmu_idx][index].attrs);
>>>     36          mr->pending_excl_access = true;
>>>     37      }
>>>
>>>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>
>>>     39      tcg_exclusive_unlock();
>>>
>>>     40      /* From now on we are in LL/SC context */
>>>     41      this_cpu->ll_sc_context = true;
>>>
>>>     42      return ret;
>>>     43  }
>>>
>>>
>>> The exclusive lock at line 11 doesn't help if a concurrent fast-path
>>> store at this address occurs after we finished load at line 13 but
>>> before TLB is flushed as a result of line 25. If we reorder the load to
>>> happen after the TLB flush request we still must be sure that the flush
>>> is complete before we can do the load safely.
>>
>> You are right, the risk actually exists. One solution to the problem
>> could be to ignore the data acquired by the load and redo the LL after
>> the flushes have been completed (basically the disas_ctx->pc points to
>> the LL instruction). This time the LL will happen without flush
>> requests and the access will be actually protected by the lock.
>
> So is just punting the work off to an async safe task and restarting the
> vCPU thread going to be enough?
>
>>
>> Regards,
>> alvise
>>
>>>
>>>>
>>>> Moreover, adapt target-arm to also cope with the new multi-threaded
>>>> execution.
>>>>
>>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>>> ---
>>>>  softmmu_llsc_template.h | 11 +++++++++--
>>>>  softmmu_template.h      |  6 ++++++
>>>>  target-arm/op_helper.c  |  6 ++++++
>>>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>>>> index 2c4a494..d3810c0 100644
>>>> --- a/softmmu_llsc_template.h
>>>> +++ b/softmmu_llsc_template.h
>>>> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>>      hwaddr hw_addr;
>>>>      unsigned mmu_idx = get_mmuidx(oi);
>>>>
>>>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>> +
>>>> +    tcg_exclusive_lock();
>>>> +
>>>>      /* Use the proper load helper from cpu_ldst.h */
>>>>      ret = helper_ld(env, addr, oi, retaddr);
>>>>
>>>> -    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>> -
>>>>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>>>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>>>> @@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>>
>>>>      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>>
>>>> +    tcg_exclusive_unlock();
>>>> +
>>>>      /* From now on we are in LL/SC context */
>>>>      this_cpu->ll_sc_context = true;
>>>>
>>>> @@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>>           * access as one made by the store conditional wrapper. If the store
>>>>           * conditional does not succeed, the value will be set to 0.*/
>>>>          cpu->excl_succeeded = true;
>>>> +
>>>> +        tcg_exclusive_lock();
>>>>          helper_st(env, addr, val, oi, retaddr);
>>>>
>>>>          if (cpu->excl_succeeded) {
>>>> @@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>>
>>>>      /* Unset LL/SC context */
>>>>      cc->cpu_reset_excl_context(cpu);
>>>> +    tcg_exclusive_unlock();
>>>>
>>>>      return ret;
>>>>  }
>>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>>> index 76fe37e..9363a7b 100644
>>>> --- a/softmmu_template.h
>>>> +++ b/softmmu_template.h
>>>> @@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
>>>>          }
>>>>      }
>>>>
>>>> +    /* Take the lock in case we are not coming from a SC */
>>>> +    tcg_exclusive_lock();
>>>> +
>>>>      smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
>>>>                                get_mmuidx(oi), index, retaddr);
>>>>
>>>>      reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
>>>>
>>>> +    tcg_exclusive_unlock();
>>>> +
>>>>      return;
>>>>  }
>>>>
>>>> @@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>      /* Handle an IO access or exclusive access.  */
>>>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>>>          if (tlb_addr & TLB_EXCL) {
>>>> +
>>>>              smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
>>>>                                         retaddr);
>>>>              return;
>>>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>>>> index e22afc5..19ea52d 100644
>>>> --- a/target-arm/op_helper.c
>>>> +++ b/target-arm/op_helper.c
>>>> @@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
>>>>      cs->exception_index = excp;
>>>>      env->exception.syndrome = syndrome;
>>>>      env->exception.target_el = target_el;
>>>> +    tcg_exclusive_lock();
>>>>      cc->cpu_reset_excl_context(cs);
>>>> +    tcg_exclusive_unlock();
>>>>      cpu_loop_exit(cs);
>>>>  }
>>>>
>>>> @@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
>>>>      CPUState *cs = ENV_GET_CPU(env);
>>>>      CPUClass *cc = CPU_GET_CLASS(cs);
>>>>
>>>> +    tcg_exclusive_lock();
>>>>      cc->cpu_reset_excl_context(cs);
>>>> +    tcg_exclusive_unlock();
>>>>  }
>>>>
>>>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>>>> @@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
>>>>
>>>>      aarch64_save_sp(env, cur_el);
>>>>
>>>> +    tcg_exclusive_lock();
>>>>      cc->cpu_reset_excl_context(cs);
>>>> +    tcg_exclusive_unlock();
>>>>
>>>>      /* We must squash the PSTATE.SS bit to zero unless both of the
>>>>       * following hold:
>>>
>
>
> --
> Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
  2016-06-14 12:58         ` alvise rigo
@ 2016-06-14 13:14           ` Alex Bennée
  0 siblings, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2016-06-14 13:14 UTC (permalink / raw)
  To: alvise rigo
  Cc: Sergey Fedorov, MTTCG Devel, QEMU Developers, Jani Kokkonen,
	Claudio Fontana, VirtualOpenSystems Technical Team,
	KONRAD Frédéric, Paolo Bonzini, Richard Henderson,
	Emilio G. Cota, Peter Maydell


alvise rigo <a.rigo@virtualopensystems.com> writes:

> 1. LL(x)                   // x requires a flush
> 2. query flush to all the n VCPUs
> 3. exit from the CPU loop and wait until all the flushes are done
> 4. enter the loop to re-execute LL(x). This time no flush is required
>
> Now, points 2. and 3. can be done either with n calls of
> async_safe_run_on_cpu() or with n calls of async_wait_run_on_cpu(). In my
> opinion the former is not really suited to this use case, since it would call
> cpu_exit() n^2 times and it would not really ensure that the VCPU has
> exited from the guest code to make an iteration of the CPU loop.

async_safe_run_on_cpu maybe should be renamed async_safe_run_on_system()
as all vCPUs are held. You only need one helper to do all the flushing
and bit-setting.
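For illustration, a minimal sketch of such a single helper could look like the
following. This is only a sketch: the void-pointer callback signature assumed
for async_safe_run_on_cpu(), plus the ExclFlushInfo and do_excl_flush_and_set
names, are inventions for the example; cpu_physical_memory_set_excl(),
excl_history_put_addr(), CPU_FOREACH() and tlb_flush() are the pieces already
used by this series.

/* Assumed context: built inside the softmmu/cputlb code, so the usual QEMU
 * includes (qom/cpu.h, exec/exec-all.h) are already available. */
typedef struct ExclFlushInfo {
    CPUState *requester;   /* vCPU that issued the LL */
    hwaddr hw_addr;        /* guest-physical address of the access */
} ExclFlushInfo;

static void do_excl_flush_and_set(void *opaque)
{
    ExclFlushInfo *info = opaque;
    CPUState *cpu;

    /* This runs while every vCPU is held, so neither the bit-setting nor
     * the flushes can race with a fast-path store. */
    if (cpu_physical_memory_set_excl(info->hw_addr)) {
        excl_history_put_addr(info->hw_addr);
        CPU_FOREACH(cpu) {
            if (cpu != info->requester) {
                tlb_flush(cpu, 1);
            }
        }
    }
    g_free(info);
}

/* From the LL slow path, instead of n per-vCPU flush requests:
 *
 *   ExclFlushInfo *info = g_new(ExclFlushInfo, 1);
 *   info->requester = this_cpu;
 *   info->hw_addr = hw_addr;
 *   async_safe_run_on_cpu(this_cpu, do_excl_flush_and_set, info);
 */

The point being that one work item, run while the whole system is quiesced,
replaces the n cross-vCPU requests.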

>
> Regards,
> alvise
>
> On Tue, Jun 14, 2016 at 2:00 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> alvise rigo <a.rigo@virtualopensystems.com> writes:
>>
>>> On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>>> On 26/05/16 19:35, Alvise Rigo wrote:
>>>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>>>> LoadLink/StoreConditional thread safe.
>>>>>
>>>>> During an LL access, this lock protects the load access itself, the
>>>>> update of the exclusive history and the update of the VCPU's protected
>>>>> range.  In an SC access, the lock protects the store access itself, the
>>>>> possible reset of other VCPUs' protected ranges and the reset of the
>>>>> exclusive context of the calling VCPU.
>>>>>
>>>>> The lock is also taken when a normal store happens to access an
>>>>> exclusive page to reset other VCPUs' protected range in case of
>>>>> collision.
>>>>
>>>> I think the key problem here is that the load in the LL helper can race
>>>> with a concurrent regular fast-path store. It's probably easier to
>>>> annotate the source here:
>>>>
>>>>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>>>>      3  {
>>>>      4      WORD_TYPE ret;
>>>>      5      int index;
>>>>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>>>>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>>>      8      hwaddr hw_addr;
>>>>      9      unsigned mmu_idx = get_mmuidx(oi);
>>>>
>>>>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>>
>>>>     11      tcg_exclusive_lock();
>>>>
>>>>     12      /* Use the proper load helper from cpu_ldst.h */
>>>>     13      ret = helper_ld(env, addr, oi, retaddr);
>>>>
>>>>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>>>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>>     16      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>>>>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read & TLB_MMIO))) {
>>>>     18          /* If all the vCPUs have the EXCL bit set for this page there is no need
>>>>     19           * to request any flush. */
>>>>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>>>>     21              CPUState *cpu;
>>>>
>>>>     22              excl_history_put_addr(hw_addr);
>>>>     23              CPU_FOREACH(cpu) {
>>>>     24                  if (this_cpu != cpu) {
>>>>     25                      tlb_flush_other(this_cpu, cpu, 1);
>>>>     26                  }
>>>>     27              }
>>>>     28          }
>>>>     29          /* For this vCPU, just update the TLB entry, no need to flush. */
>>>>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>>>     31      } else {
>>>>     32          /* Set a pending exclusive access in the MemoryRegion */
>>>>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>>>>     34                                             env->iotlb[mmu_idx][index].addr,
>>>>     35                                             env->iotlb[mmu_idx][index].attrs);
>>>>     36          mr->pending_excl_access = true;
>>>>     37      }
>>>>
>>>>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>>
>>>>     39      tcg_exclusive_unlock();
>>>>
>>>>     40      /* From now on we are in LL/SC context */
>>>>     41      this_cpu->ll_sc_context = true;
>>>>
>>>>     42      return ret;
>>>>     43  }
>>>>
>>>>
>>>> The exclusive lock at line 11 doesn't help if a concurrent fast-path
>>>> store to this address occurs after we have finished the load at line 13
>>>> but before the TLB is flushed as a result of line 25. If we reorder the
>>>> load to happen after the TLB flush request, we still must be sure that
>>>> the flush is complete before we can do the load safely.
>>>
>>> You are right, the risk actually exists. One solution to the problem
>>> could be to ignore the data acquired by the load and redo the LL after
>>> the flushes have been completed (basically, disas_ctx->pc keeps pointing
>>> to the LL instruction). This time the LL will happen without flush
>>> requests and the access will actually be protected by the lock.
>>
>> So is just punting the work off to an async safe task and restarting the
>> vCPU thread going to be enough?
>>
>>>
>>> Regards,
>>> alvise
>>>
>>>>
>>>>>
>>>>> Moreover, adapt target-arm to also cope with the new multi-threaded
>>>>> execution.
>>>>>
>>>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>>>> ---
>>>>>  softmmu_llsc_template.h | 11 +++++++++--
>>>>>  softmmu_template.h      |  6 ++++++
>>>>>  target-arm/op_helper.c  |  6 ++++++
>>>>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>>>>> index 2c4a494..d3810c0 100644
>>>>> --- a/softmmu_llsc_template.h
>>>>> +++ b/softmmu_llsc_template.h
>>>>> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>>>      hwaddr hw_addr;
>>>>>      unsigned mmu_idx = get_mmuidx(oi);
>>>>>
>>>>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>>> +
>>>>> +    tcg_exclusive_lock();
>>>>> +
>>>>>      /* Use the proper load helper from cpu_ldst.h */
>>>>>      ret = helper_ld(env, addr, oi, retaddr);
>>>>>
>>>>> -    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>>> -
>>>>>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>>>>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>>>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>>>>> @@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>>>
>>>>>      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>>>
>>>>> +    tcg_exclusive_unlock();
>>>>> +
>>>>>      /* From now on we are in LL/SC context */
>>>>>      this_cpu->ll_sc_context = true;
>>>>>
>>>>> @@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>>>           * access as one made by the store conditional wrapper. If the store
>>>>>           * conditional does not succeed, the value will be set to 0.*/
>>>>>          cpu->excl_succeeded = true;
>>>>> +
>>>>> +        tcg_exclusive_lock();
>>>>>          helper_st(env, addr, val, oi, retaddr);
>>>>>
>>>>>          if (cpu->excl_succeeded) {
>>>>> @@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>>>
>>>>>      /* Unset LL/SC context */
>>>>>      cc->cpu_reset_excl_context(cpu);
>>>>> +    tcg_exclusive_unlock();
>>>>>
>>>>>      return ret;
>>>>>  }
>>>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>>>> index 76fe37e..9363a7b 100644
>>>>> --- a/softmmu_template.h
>>>>> +++ b/softmmu_template.h
>>>>> @@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
>>>>>          }
>>>>>      }
>>>>>
>>>>> +    /* Take the lock in case we are not coming from a SC */
>>>>> +    tcg_exclusive_lock();
>>>>> +
>>>>>      smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
>>>>>                                get_mmuidx(oi), index, retaddr);
>>>>>
>>>>>      reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
>>>>>
>>>>> +    tcg_exclusive_unlock();
>>>>> +
>>>>>      return;
>>>>>  }
>>>>>
>>>>> @@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>>      /* Handle an IO access or exclusive access.  */
>>>>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>>>>          if (tlb_addr & TLB_EXCL) {
>>>>> +
>>>>>              smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
>>>>>                                         retaddr);
>>>>>              return;
>>>>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>>>>> index e22afc5..19ea52d 100644
>>>>> --- a/target-arm/op_helper.c
>>>>> +++ b/target-arm/op_helper.c
>>>>> @@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
>>>>>      cs->exception_index = excp;
>>>>>      env->exception.syndrome = syndrome;
>>>>>      env->exception.target_el = target_el;
>>>>> +    tcg_exclusive_lock();
>>>>>      cc->cpu_reset_excl_context(cs);
>>>>> +    tcg_exclusive_unlock();
>>>>>      cpu_loop_exit(cs);
>>>>>  }
>>>>>
>>>>> @@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
>>>>>      CPUState *cs = ENV_GET_CPU(env);
>>>>>      CPUClass *cc = CPU_GET_CLASS(cs);
>>>>>
>>>>> +    tcg_exclusive_lock();
>>>>>      cc->cpu_reset_excl_context(cs);
>>>>> +    tcg_exclusive_unlock();
>>>>>  }
>>>>>
>>>>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>>>>> @@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
>>>>>
>>>>>      aarch64_save_sp(env, cur_el);
>>>>>
>>>>> +    tcg_exclusive_lock();
>>>>>      cc->cpu_reset_excl_context(cs);
>>>>> +    tcg_exclusive_unlock();
>>>>>
>>>>>      /* We must squash the PSTATE.SS bit to zero unless both of the
>>>>>       * following hold:
>>>>
>>
>>
>> --
>> Alex Bennée
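
For completeness, the "redo the LL" idea discussed above might be expressed
with a small helper along the lines of the sketch below. Again this is only a
sketch: queue_excl_flush() is an invented placeholder for whatever ends up
wrapping the safe-work API, and restarting via cpu_loop_exit_restore() is just
one possible way of making the LL re-execute once the flushes are done.

static void excl_maybe_flush_and_restart(CPUState *this_cpu, hwaddr hw_addr,
                                         uintptr_t retaddr)
{
    /* Called from the LL helper with tcg_exclusive_lock() held.
     * cpu_physical_memory_set_excl() returns true when some other vCPU may
     * still have a writable, non-EXCL TLB entry for this page. */
    if (cpu_physical_memory_set_excl(hw_addr)) {
        excl_history_put_addr(hw_addr);
        tcg_exclusive_unlock();
        /* Ask for the flushes to be performed while all vCPUs are held... */
        queue_excl_flush(this_cpu, hw_addr);
        /* ...and abandon this attempt: the guest state is restored so that
         * the LL instruction is re-executed, and whatever value the load
         * returned this time around is never used. */
        cpu_loop_exit_restore(this_cpu, retaddr);
    }
    /* No flush needed: the caller carries on with the load under the lock. */
}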


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, newest message: 2016-06-14 13:14 UTC

Thread overview: 36+ messages
2016-05-26 16:35 [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alvise Rigo
2016-05-26 16:35 ` [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}() Alvise Rigo
2016-05-31 15:03   ` Pranith Kumar
2016-06-02 16:21     ` alvise rigo
2016-06-08  9:21   ` Alex Bennée
2016-06-08 10:00     ` alvise rigo
2016-06-08 11:32       ` Peter Maydell
2016-06-08 13:52       ` Alex Bennée
2016-05-26 16:35 ` [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading Alvise Rigo
2016-06-10 15:21   ` Sergey Fedorov
2016-06-10 15:53     ` alvise rigo
2016-06-10 16:00       ` Sergey Fedorov
2016-06-10 16:04         ` alvise rigo
2016-06-14 12:00       ` Alex Bennée
2016-06-14 12:58         ` alvise rigo
2016-06-14 13:14           ` Alex Bennée
2016-06-10 16:15     ` Alex Bennée
2016-06-11 19:53       ` Sergey Fedorov
2016-06-14  8:37       ` Alex Bennée
2016-06-14  9:31         ` Sergey Fedorov
2016-05-26 16:35 ` [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu() Alvise Rigo
2016-06-08 13:54   ` Alex Bennée
2016-06-08 14:10     ` alvise rigo
2016-06-08 14:53       ` Sergey Fedorov
2016-06-08 15:20         ` Alex Bennée
2016-06-08 16:24           ` alvise rigo
2016-06-13  9:26             ` Alex Bennée
2016-05-26 16:35 ` [Qemu-devel] [RFC 04/10] cputlb: Introduce tlb_flush_other() Alvise Rigo
2016-05-26 16:35 ` [Qemu-devel] [RFC 05/10] target-arm: End TB after ldrex instruction Alvise Rigo
2016-05-26 16:35 ` [Qemu-devel] [RFC 06/10] cputlb: Add tlb_tables_flush_bitmap() Alvise Rigo
2016-05-26 16:35 ` [Qemu-devel] [RFC 07/10] cputlb: Query tlb_flush_by_mmuidx Alvise Rigo
2016-05-26 16:35 ` [Qemu-devel] [RFC 08/10] cputlb: Query tlb_flush_page_by_mmuidx Alvise Rigo
2016-05-26 16:35 ` [Qemu-devel] [RFC 09/10] cputlb: Query tlb_flush_page_all Alvise Rigo
2016-05-26 16:35 ` [Qemu-devel] [RFC 10/10] cpus: Do not sleep if some work item is pending Alvise Rigo
2016-06-10 15:21 ` [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns Alex Bennée
2016-06-10 15:30   ` alvise rigo
