* [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
@ 2015-08-10 15:26 fred.konrad
  2015-08-10 15:26 ` [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex fred.konrad
                   ` (21 more replies)
  0 siblings, 22 replies; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:26 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This is the 7th round of the MTTCG patch series.


It can be cloned from:
git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.

This patch set tries to address the different issues in the global picture of
MTTCG, as presented on the wiki.

== Needed patch for our work ==

Some preliminaries are needed for our work:
 * current_cpu doesn't make sense in MTTCG, so a tcg_exec_flag is added to
   the CPUState.
 * We need to run some work safely when all VCPUs are outside their execution
   loop. This is done with the async_run_safe_work_on_cpu function introduced
   in this series (sketched in the dirty-tracking section below).
 * A QemuSpin lock is introduced (POSIX only for now) to allow faster handling
   of atomic instructions (a usage sketch follows this list).
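
For reference, the QemuSpin API added later in this series ("add support for
spin lock on POSIX systems exclusively") mirrors the QemuMutex one. A minimal
usage sketch follows; the counter is a made-up example, only the QemuSpin type
and the qemu_spin_* calls come from the series:

    #include "qemu/thread.h"

    static QemuSpin counter_lock;   /* hypothetical example lock */
    static int counter;

    static void counter_setup(void)
    {
        qemu_spin_init(&counter_lock);
    }

    static void counter_add(int n)
    {
        qemu_spin_lock(&counter_lock);
        counter += n;               /* keep the critical section short */
        qemu_spin_unlock(&counter_lock);
    }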

== Code generation and cache ==

As QEMU stands, there is no protection at all against two threads attempting
to generate code at the same time or to modify a TranslationBlock.
The "protect TBContext with tb_lock" patch addresses the issue of code
generation and makes all the tb_* functions thread-safe (except tb_flush).
This raised the question of one or multiple caches. We chose to use one
unified cache because it's easier as a first step, and since QEMU effectively
already has a 'local' cache per CPU in the form of the jump cache, we don't
see the benefit of having two pools of TBs.
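
The lookup pattern this leads to is a classic double-checked scheme: try the
lock-free lookup in the physical hash first, and only take tb_lock when a
translation is actually needed. Condensed from the "protect TBContext with
tb_lock" patch below (details elided):

    static TranslationBlock *tb_find_slow(CPUState *cpu, target_ulong pc,
                                          target_ulong cs_base, uint64_t flags)
    {
        /* Fast path: lock-free lookup in the physical hash. */
        TranslationBlock *tb = tb_find_physical(cpu, pc, cs_base, flags);

        if (!tb) {
            tb_lock();
            /* Retry: another VCPU may have translated it meanwhile. */
            tb = tb_find_physical(cpu, pc, cs_base, flags);
            if (!tb) {
                tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
            }
            tb_unlock();
        }
        /* Add the TB to this VCPU's virtual-pc hash table. */
        cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
        return tb;
    }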

== Dirty tracking ==

Protecting the I/Os:
To allow all VCPU threads to run at the same time we need to drop the
global_mutex as soon as possible. The I/O accesses then need to take the
mutex. This is likely to change when
http://thread.gmane.org/gmane.comp.emulators.qemu/345258 is upstreamed.

Invalidation of TranslationBlocks:
We can have all VCPUs running during an invalidation. Each VCPU is able to
clean its own jump cache, as it lives in CPUState, so that can be handled by
a simple call to async_run_on_cpu. However, tb_invalidate also writes to the
TranslationBlock itself, which is shared since we have only one pool.
Hence this part of the invalidation requires all VCPUs to exit before it can
be done, and async_run_safe_work_on_cpu is introduced to handle this case.
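
A hedged sketch of how such deferred "safe" work can be scheduled with the API
introduced in this series; tb_flush_work and schedule_safe_tb_flush are
hypothetical names made up for this sketch, the series' own tb_flush_safe
(patch 15) is the real version of the idea:

    /* Hypothetical callback: runs once no VCPU is executing code. */
    static void tb_flush_work(void *data)
    {
        CPUState *cpu = data;

        tb_flush(cpu);  /* safe: all VCPUs are outside their execution loop */
    }

    static void schedule_safe_tb_flush(CPUState *cpu)
    {
        async_run_safe_work_on_cpu(cpu, tb_flush_work, cpu);
    }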

== Atomic instructions ==

For now only ARM on x64 is supported, by using a cmpxchg instruction.
Specifically, the limitation of this approach is that it is harder to support
64-bit ARM on a host architecture that is multi-core but only supports 32-bit
cmpxchg (we believe this could be the case for some PPC cores). For now this
case is not correctly handled: the existing atomic patch will attempt to
execute the 64-bit cmpxchg functionality in a non-thread-safe fashion. Our
intention is to provide a new multi-threaded ARM atomic patch for 64-bit ARM
on effectively 32-bit hosts.
This atomic instruction support has been tested with Alexander's atomic stress
test, available here:
https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html

The execution is a little slower than upstream, probably because the different
VCPUs fight for the mutex. Swapping arm_exclusive_lock from a mutex to a
spinlock reduces the difference considerably. The sketch below shows the
cmpxchg idea in isolation.
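
For illustration only, here is a hedged sketch of emulating a guest 32-bit
STREX with a host compare-and-swap. All names here (emulate_strex32 and its
parameters) are made up for this sketch; the real implementation is the "Use
atomic cmpxchg to atomically check the exclusive value in a STREX" patch in
this series.

    #include <stdint.h>

    /* Illustrative only: emulate a 32-bit STREX with a host CAS.
     * exclusive_val is the value loaded earlier by the matching LDREX. */
    static int emulate_strex32(uint32_t *host_addr, uint32_t new_val,
                               uint32_t exclusive_val)
    {
        uint32_t old = __sync_val_compare_and_swap(host_addr, exclusive_val,
                                                   new_val);
        /* STREX writes 0 to its result register on success, 1 on failure. */
        return old == exclusive_val ? 0 : 1;
    }

The store only succeeds if memory still contains the value observed at LDREX
time; the 64-bit guest case needs a 64-bit host cmpxchg, hence the limitation
described above.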

== Testing ==

A simple double-dhrystone test with -smp 2 on vexpress-a15 in a Linux guest
shows a good performance improvement: it takes roughly 18s upstream to
complete vs 10s with MTTCG.

Testing image is available here:
https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3

Then simply:
./configure --target-list=arm-softmmu
make -j8
./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage \
    -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic \
    --append "console=ttyAMA0"

login: root

The dhrystone command is the last one in the history.
"dhrystone 10000000 & dhrystone 10000000"

The atomic spinlock benchmark from Alexander shows that atomics basically
work. Just follow the instructions here:
https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html

== Known issues ==

* GDB stub:
  The GDB stub is not tested right now; it will probably require some changes
  to work.

* deadlock on exit:
  When exiting QEMU with Ctrl-C, some VCPU threads are not able to exit and
  continue execution.
  http://git.greensocs.com/fkonrad/mttcg/issues/1

* memory_region_rom_device_set_romd from pflash01 just crashes the TCG code.
  Strangely this happens only with "-smp 4" and 2 CPUs in the DTB.
  http://git.greensocs.com/fkonrad/mttcg/issues/2

Changes V6 -> V7:
  * global_lock:
     * Don't protect the softmmu read/write helpers as it's now done in
       address_space_rw.
  * tcg_exec_flag:
     * Make the flag atomically tested and set through an API.
  * introduce async_safe_work:
     * move qemu_cpu_kick_thread to avoid prototype declaration.
     * use the work_mutex.
  * async_work:
     * protect it with a mutex (work_mutex) against concurrent access.
  * tb_lock:
     * protect tcg_malloc_internal as well.
  * signal the VCPU even if current_cpu is NULL.
  * added PSCI patch.
  * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899).

Changes V5 -> V6:
  * Introduce async_safe_work to do tb_flush and some parts of tb_invalidate.
  * Introduce QemuSpin from Guillaume, which allows faster atomic instructions
    (6s to pass Alexander's atomic test instead of 30s before).
  * Don't take tb_lock before tb_find_fast.
  * Handle tb_flush with async_safe_work.
  * Handle tb_invalidate with async_work and async_safe_work.
  * Drop the tlb_flush_request mechanism and use async_work as well.
  * Fix the wrong length in the atomic patch.
  * Fix the wrong return address for exceptions in the atomic patch.

Alex Bennée (1):
  target-arm/psci.c: wake up sleeping CPUs (MTTCG)

Guillaume Delbergue (1):
  add support for spin lock on POSIX systems exclusively

KONRAD Frederic (17):
  cpus: protect queued_work_* with work_mutex.
  cpus: add tcg_exec_flag.
  cpus: introduce async_run_safe_work_on_cpu.
  replace spinlock by QemuMutex.
  remove unused spinlock.
  protect TBContext with tb_lock.
  tcg: remove tcg_halt_cond global variable.
  Drop global lock during TCG code execution
  cpu: remove exit_request global.
  tcg: switch on multithread.
  Use atomic cmpxchg to atomically check the exclusive value in a STREX
  add a callback when tb_invalidate is called.
  cpu: introduce tlb_flush*_all.
  arm: use tlb_flush*_all
  translate-all: introduces tb_flush_safe.
  translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
  mttcg: signal the associated cpu anyway.

 cpu-exec.c                  |  98 +++++++++------
 cpus.c                      | 295 +++++++++++++++++++++++++-------------------
 cputlb.c                    |  81 ++++++++++++
 include/exec/exec-all.h     |   8 +-
 include/exec/spinlock.h     |  49 --------
 include/qemu/thread-posix.h |   4 +
 include/qemu/thread-win32.h |   4 +
 include/qemu/thread.h       |   7 ++
 include/qom/cpu.h           |  57 +++++++++
 linux-user/main.c           |   6 +-
 qom/cpu.c                   |  20 +++
 target-arm/cpu.c            |  21 ++++
 target-arm/cpu.h            |   6 +
 target-arm/helper.c         |  58 +++------
 target-arm/helper.h         |   4 +
 target-arm/op_helper.c      | 128 ++++++++++++++++++-
 target-arm/psci.c           |   2 +
 target-arm/translate.c      | 101 +++------------
 target-i386/mem_helper.c    |  16 ++-
 target-i386/misc_helper.c   |  27 +++-
 tcg/i386/tcg-target.c       |   8 ++
 tcg/tcg.h                   |  14 ++-
 translate-all.c             | 217 +++++++++++++++++++++++++++-----
 util/qemu-thread-posix.c    |  45 +++++++
 util/qemu-thread-win32.c    |  30 +++++
 vl.c                        |   6 +
 26 files changed, 934 insertions(+), 378 deletions(-)
 delete mode 100644 include/exec/spinlock.h

-- 
1.9.0


* [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
@ 2015-08-10 15:26 ` fred.konrad
  2015-08-10 15:59   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 02/19] cpus: add tcg_exec_flag fred.konrad
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:26 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This protects queued_work_* as used by async_run_on_cpu, run_on_cpu and
flush_queued_work with a new lock (work_mutex) to prevent concurrent access.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

Changes V1 -> V2:
  * Unlock the mutex while running the callback.
---
 cpus.c            | 11 +++++++++++
 include/qom/cpu.h |  3 +++
 qom/cpu.c         |  1 +
 3 files changed, 15 insertions(+)

diff --git a/cpus.c b/cpus.c
index b00a423..eabd4b1 100644
--- a/cpus.c
+++ b/cpus.c
@@ -845,6 +845,8 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     wi.func = func;
     wi.data = data;
     wi.free = false;
+
+    qemu_mutex_lock(&cpu->work_mutex);
     if (cpu->queued_work_first == NULL) {
         cpu->queued_work_first = &wi;
     } else {
@@ -853,6 +855,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     cpu->queued_work_last = &wi;
     wi.next = NULL;
     wi.done = false;
+    qemu_mutex_unlock(&cpu->work_mutex);
 
     qemu_cpu_kick(cpu);
     while (!wi.done) {
@@ -876,6 +879,8 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     wi->func = func;
     wi->data = data;
     wi->free = true;
+
+    qemu_mutex_lock(&cpu->work_mutex);
     if (cpu->queued_work_first == NULL) {
         cpu->queued_work_first = wi;
     } else {
@@ -884,6 +889,7 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     cpu->queued_work_last = wi;
     wi->next = NULL;
     wi->done = false;
+    qemu_mutex_unlock(&cpu->work_mutex);
 
     qemu_cpu_kick(cpu);
 }
@@ -896,15 +902,20 @@ static void flush_queued_work(CPUState *cpu)
         return;
     }
 
+    qemu_mutex_lock(&cpu->work_mutex);
     while ((wi = cpu->queued_work_first)) {
         cpu->queued_work_first = wi->next;
+        qemu_mutex_unlock(&cpu->work_mutex);
         wi->func(wi->data);
+        qemu_mutex_lock(&cpu->work_mutex);
         wi->done = true;
         if (wi->free) {
             g_free(wi);
         }
     }
     cpu->queued_work_last = NULL;
+    qemu_mutex_unlock(&cpu->work_mutex);
+
     qemu_cond_broadcast(&qemu_work_cond);
 }
 
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 20aabc9..efa9624 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -242,6 +242,8 @@ struct kvm_run;
  * @mem_io_pc: Host Program Counter at which the memory was accessed.
  * @mem_io_vaddr: Target virtual address at which the memory was accessed.
  * @kvm_fd: vCPU file descriptor for KVM.
+ * @work_mutex: Lock to prevent multiple access to queued_work_*.
+ * @queued_work_first: First asynchronous work pending.
  *
  * State of one CPU core or thread.
  */
@@ -262,6 +264,7 @@ struct CPUState {
     uint32_t host_tid;
     bool running;
     struct QemuCond *halt_cond;
+    QemuMutex work_mutex;
     struct qemu_work_item *queued_work_first, *queued_work_last;
     bool thread_kicked;
     bool created;
diff --git a/qom/cpu.c b/qom/cpu.c
index eb9cfec..4e12598 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -316,6 +316,7 @@ static void cpu_common_initfn(Object *obj)
     cpu->gdb_num_regs = cpu->gdb_num_g_regs = cc->gdb_num_core_regs;
     QTAILQ_INIT(&cpu->breakpoints);
     QTAILQ_INIT(&cpu->watchpoints);
+    qemu_mutex_init(&cpu->work_mutex);
 }
 
 static void cpu_common_finalize(Object *obj)
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V7 02/19] cpus: add tcg_exec_flag.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
  2015-08-10 15:26 ` [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-11 10:53   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 03/19] cpus: introduce async_run_safe_work_on_cpu fred.konrad
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This flag indicates the state of the VCPU thread:
  * 0 if the VCPU is allowed to execute code.
  * 1 if the VCPU is currently executing code.
  * -1 if the VCPU is not allowed to execute code.

This allows us to atomically check and run safe work, or check and continue
TCG execution.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
Changes V2 -> V3:
  * introduce a third state which allows or disallows execution.
  * atomically check and set the flag when starting or blocking the code execution.
Changes V1 -> V2:
  * do both tcg_executing = 0 or 1 in cpu_exec().
---
 cpu-exec.c        |  5 +++++
 include/qom/cpu.h | 32 ++++++++++++++++++++++++++++++++
 qom/cpu.c         | 19 +++++++++++++++++++
 3 files changed, 56 insertions(+)

diff --git a/cpu-exec.c b/cpu-exec.c
index 75694f3..e16666a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -371,6 +371,10 @@ int cpu_exec(CPUState *cpu)
         cpu->halted = 0;
     }
 
+    if (!tcg_cpu_try_start_execution(cpu)) {
+        cpu->exit_request = 1;
+        return 0;
+    }
     current_cpu = cpu;
 
     /* As long as current_cpu is null, up to the assignment just above,
@@ -583,5 +587,6 @@ int cpu_exec(CPUState *cpu)
 
     /* fail safe : never use current_cpu outside cpu_exec() */
     current_cpu = NULL;
+    tcg_cpu_allow_execution(cpu);
     return ret;
 }
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index efa9624..de7487e 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -226,6 +226,7 @@ struct kvm_run;
  * @stopped: Indicates the CPU has been artificially stopped.
  * @tcg_exit_req: Set to force TCG to stop executing linked TBs for this
  *           CPU and return to its top level loop.
+ * @tcg_exec_flag: See tcg_cpu_flag_* function.
  * @singlestep_enabled: Flags for single-stepping.
  * @icount_extra: Instructions until next timer event.
  * @icount_decr: Number of cycles left, with interrupt flag in high bit.
@@ -322,6 +323,8 @@ struct CPUState {
        (absolute value) offset as small as possible.  This reduces code
        size, especially for hosts without large memory offsets.  */
     volatile sig_atomic_t tcg_exit_req;
+
+    int tcg_exec_flag;
 };
 
 QTAILQ_HEAD(CPUTailQ, CPUState);
@@ -337,6 +340,35 @@ extern struct CPUTailQ cpus;
 DECLARE_TLS(CPUState *, current_cpu);
 #define current_cpu tls_var(current_cpu)
 
+
+/**
+ * tcg_cpu_try_block_execution
+ * @cpu: The CPU to block the execution
+ *
+ * Try to set the tcg_exec_flag to -1 saying the CPU can't execute code if the
+ * CPU is not executing code.
+ * Returns true if the cpu execution is blocked, false otherwise.
+ */
+bool tcg_cpu_try_block_execution(CPUState *cpu);
+
+/**
+ * tcg_cpu_allow_execution
+ * @cpu: The CPU to allow the execution.
+ *
+ * Just reset the state of tcg_exec_flag, and allow the execution of some code.
+ */
+void tcg_cpu_allow_execution(CPUState *cpu);
+
+/**
+ * tcg_cpu_try_start_execution
+ * @cpu: The CPU to start the execution.
+ *
+ * Just set the tcg_exec_flag to 1 saying the CPU is executing code if the CPU
+ * is allowed to run some code.
+ * Returns true if the cpu can execute, false otherwise.
+ */
+bool tcg_cpu_try_start_execution(CPUState *cpu);
+
 /**
  * cpu_paging_enabled:
  * @cpu: The CPU whose state is to be inspected.
diff --git a/qom/cpu.c b/qom/cpu.c
index 4e12598..e32f90c 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -26,6 +26,23 @@
 #include "qemu/error-report.h"
 #include "sysemu/sysemu.h"
 
+bool tcg_cpu_try_block_execution(CPUState *cpu)
+{
+    return (atomic_cmpxchg(&cpu->tcg_exec_flag, 0, -1)
+           || (cpu->tcg_exec_flag == -1));
+}
+
+void tcg_cpu_allow_execution(CPUState *cpu)
+{
+    cpu->tcg_exec_flag = 0;
+}
+
+bool tcg_cpu_try_start_execution(CPUState *cpu)
+{
+    return (atomic_cmpxchg(&cpu->tcg_exec_flag, 0, 1)
+           || (cpu->tcg_exec_flag == 1));
+}
+
 bool cpu_exists(int64_t id)
 {
     CPUState *cpu;
@@ -249,6 +266,8 @@ static void cpu_common_reset(CPUState *cpu)
     cpu->icount_decr.u32 = 0;
     cpu->can_do_io = 0;
     cpu->exception_index = -1;
+
+    tcg_cpu_allow_execution(cpu);
     memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
 }
 
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V7 03/19] cpus: introduce async_run_safe_work_on_cpu.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
  2015-08-10 15:26 ` [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex fred.konrad
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 02/19] cpus: add tcg_exec_flag fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 04/19] replace spinlock by QemuMutex fred.konrad
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

We already had async_run_on_cpu, but we need all VCPUs to be outside their
execution loop in order to run some tb_flush/invalidate tasks:

async_run_safe_work_on_cpu schedules a work item on a VCPU, but the work only
starts once no VCPUs are executing code anymore.
When a safe work item is pending, cpu_has_work returns true, so cpu_exec
returns and the VCPUs can't enter their execution loop. cpu_thread_is_idle
returns false, so at the moment where all VCPUs are stop || stopped the safe
work queue can be flushed.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

Changes V3 -> V4:
  * Use tcg_cpu_try_block_execution.
  * Use a counter to know how many safe work are pending.
Changes V2 -> V3:
  * Unlock the mutex while executing the callback.
Changes V1 -> V2:
  * Move qemu_cpu_kick_thread to avoid prototype declaration.
  * Use the work_mutex lock to protect the queued_safe_work_* structures.
---
 cpu-exec.c        |   5 ++
 cpus.c            | 149 +++++++++++++++++++++++++++++++++++++++---------------
 include/qom/cpu.h |  24 ++++++++-
 3 files changed, 137 insertions(+), 41 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index e16666a..97805cc 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -363,6 +363,11 @@ int cpu_exec(CPUState *cpu)
     /* This must be volatile so it is not trashed by longjmp() */
     volatile bool have_tb_lock = false;
 
+    if (async_safe_work_pending()) {
+        cpu->exit_request = 1;
+        return 0;
+    }
+
     if (cpu->halted) {
         if (!cpu_has_work(cpu)) {
             return EXCP_HALTED;
diff --git a/cpus.c b/cpus.c
index eabd4b1..2250296 100644
--- a/cpus.c
+++ b/cpus.c
@@ -69,6 +69,8 @@ static CPUState *next_cpu;
 int64_t max_delay;
 int64_t max_advance;
 
+int safe_work_pending; /* Number of safe work pending for all VCPUs. */
+
 bool cpu_is_stopped(CPUState *cpu)
 {
     return cpu->stopped || !runstate_is_running();
@@ -76,7 +78,7 @@ bool cpu_is_stopped(CPUState *cpu)
 
 static bool cpu_thread_is_idle(CPUState *cpu)
 {
-    if (cpu->stop || cpu->queued_work_first) {
+    if (cpu->stop || cpu->queued_work_first || cpu->queued_safe_work_first) {
         return false;
     }
     if (cpu_is_stopped(cpu)) {
@@ -833,6 +835,45 @@ void qemu_init_cpu_loop(void)
     qemu_thread_get_self(&io_thread);
 }
 
+static void qemu_cpu_kick_thread(CPUState *cpu)
+{
+#ifndef _WIN32
+    int err;
+
+    err = pthread_kill(cpu->thread->thread, SIG_IPI);
+    if (err) {
+        fprintf(stderr, "qemu:%s: %s", __func__, strerror(err));
+        exit(1);
+    }
+#else /* _WIN32 */
+    if (!qemu_cpu_is_self(cpu)) {
+        CONTEXT tcgContext;
+
+        if (SuspendThread(cpu->hThread) == (DWORD)-1) {
+            fprintf(stderr, "qemu:%s: GetLastError:%lu\n", __func__,
+                    GetLastError());
+            exit(1);
+        }
+
+        /* On multi-core systems, we are not sure that the thread is actually
+         * suspended until we can get the context.
+         */
+        tcgContext.ContextFlags = CONTEXT_CONTROL;
+        while (GetThreadContext(cpu->hThread, &tcgContext) != 0) {
+            continue;
+        }
+
+        cpu_signal(0);
+
+        if (ResumeThread(cpu->hThread) == (DWORD)-1) {
+            fprintf(stderr, "qemu:%s: GetLastError:%lu\n", __func__,
+                    GetLastError());
+            exit(1);
+        }
+    }
+#endif
+}
+
 void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
 {
     struct qemu_work_item wi;
@@ -894,6 +935,70 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     qemu_cpu_kick(cpu);
 }
 
+void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
+                                void *data)
+{
+    struct qemu_work_item *wi;
+
+    wi = g_malloc0(sizeof(struct qemu_work_item));
+    wi->func = func;
+    wi->data = data;
+    wi->free = true;
+
+    atomic_inc(&safe_work_pending);
+    qemu_mutex_lock(&cpu->work_mutex);
+    if (cpu->queued_safe_work_first == NULL) {
+        cpu->queued_safe_work_first = wi;
+    } else {
+        cpu->queued_safe_work_last->next = wi;
+    }
+    cpu->queued_safe_work_last = wi;
+    wi->next = NULL;
+    wi->done = false;
+    qemu_mutex_unlock(&cpu->work_mutex);
+
+    CPU_FOREACH(cpu) {
+        qemu_cpu_kick_thread(cpu);
+    }
+}
+
+static void flush_queued_safe_work(CPUState *cpu)
+{
+    struct qemu_work_item *wi;
+    CPUState *other_cpu;
+
+    if (cpu->queued_safe_work_first == NULL) {
+        return;
+    }
+
+    CPU_FOREACH(other_cpu) {
+        if (!tcg_cpu_try_block_execution(other_cpu)) {
+            return;
+        }
+    }
+
+    qemu_mutex_lock(&cpu->work_mutex);
+    while ((wi = cpu->queued_safe_work_first)) {
+        cpu->queued_safe_work_first = wi->next;
+        qemu_mutex_unlock(&cpu->work_mutex);
+        wi->func(wi->data);
+        qemu_mutex_lock(&cpu->work_mutex);
+        wi->done = true;
+        if (wi->free) {
+            g_free(wi);
+        }
+        atomic_dec(&safe_work_pending);
+    }
+    cpu->queued_safe_work_last = NULL;
+    qemu_mutex_unlock(&cpu->work_mutex);
+    qemu_cond_broadcast(&qemu_work_cond);
+}
+
+bool async_safe_work_pending(void)
+{
+    return safe_work_pending != 0;
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
     struct qemu_work_item *wi;
@@ -926,6 +1031,9 @@ static void qemu_wait_io_event_common(CPUState *cpu)
         cpu->stopped = true;
         qemu_cond_signal(&qemu_pause_cond);
     }
+    qemu_mutex_unlock_iothread();
+    flush_queued_safe_work(cpu);
+    qemu_mutex_lock_iothread();
     flush_queued_work(cpu);
     cpu->thread_kicked = false;
 }
@@ -1085,45 +1193,6 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     return NULL;
 }
 
-static void qemu_cpu_kick_thread(CPUState *cpu)
-{
-#ifndef _WIN32
-    int err;
-
-    err = pthread_kill(cpu->thread->thread, SIG_IPI);
-    if (err) {
-        fprintf(stderr, "qemu:%s: %s", __func__, strerror(err));
-        exit(1);
-    }
-#else /* _WIN32 */
-    if (!qemu_cpu_is_self(cpu)) {
-        CONTEXT tcgContext;
-
-        if (SuspendThread(cpu->hThread) == (DWORD)-1) {
-            fprintf(stderr, "qemu:%s: GetLastError:%lu\n", __func__,
-                    GetLastError());
-            exit(1);
-        }
-
-        /* On multi-core systems, we are not sure that the thread is actually
-         * suspended until we can get the context.
-         */
-        tcgContext.ContextFlags = CONTEXT_CONTROL;
-        while (GetThreadContext(cpu->hThread, &tcgContext) != 0) {
-            continue;
-        }
-
-        cpu_signal(0);
-
-        if (ResumeThread(cpu->hThread) == (DWORD)-1) {
-            fprintf(stderr, "qemu:%s: GetLastError:%lu\n", __func__,
-                    GetLastError());
-            exit(1);
-        }
-    }
-#endif
-}
-
 void qemu_cpu_kick(CPUState *cpu)
 {
     qemu_cond_broadcast(cpu->halt_cond);
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index de7487e..23418c0 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -243,8 +243,9 @@ struct kvm_run;
  * @mem_io_pc: Host Program Counter at which the memory was accessed.
  * @mem_io_vaddr: Target virtual address at which the memory was accessed.
  * @kvm_fd: vCPU file descriptor for KVM.
- * @work_mutex: Lock to prevent multiple access to queued_work_*.
+ * @work_mutex: Lock to prevent multiple access to queued_* qemu_work_item.
  * @queued_work_first: First asynchronous work pending.
+ * @queued_safe_work_first: First item of safe work pending.
  *
  * State of one CPU core or thread.
  */
@@ -267,6 +268,7 @@ struct CPUState {
     struct QemuCond *halt_cond;
     QemuMutex work_mutex;
     struct qemu_work_item *queued_work_first, *queued_work_last;
+    struct qemu_work_item *queued_safe_work_first, *queued_safe_work_last;
     bool thread_kicked;
     bool created;
     bool stop;
@@ -575,6 +577,26 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
+ * async_run_safe_work_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously
+ * when all the VCPUs are outside their loop.
+ */
+void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
+                                void *data);
+
+/**
+ * async_safe_work_pending:
+ *
+ * Check whether any safe work is pending on any VCPUs.
+ * Returns: @true if a safe work is pending, @false otherwise.
+ */
+bool async_safe_work_pending(void);
+
+/**
  * qemu_get_cpu:
  * @index: The CPUState@cpu_index value of the CPU to obtain.
  *
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V7 04/19] replace spinlock by QemuMutex.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (2 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 03/19] cpus: introduce async_run_safe_work_on_cpu fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 16:09   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 05/19] remove unused spinlock fred.konrad
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

spinlock is only used in two cases:
  * cpu-exec.c: to protect TranslationBlock
  * mem_helper.c: for the lock helper in target-i386 (which seems broken).

It's a pthread_mutex_t in user mode, so it is better to use QemuMutex directly
in this case.
This also allows reusing the tb_lock mutex of TBContext in the multithreaded
TCG case.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpu-exec.c               | 15 +++++++++++----
 include/exec/exec-all.h  |  4 ++--
 linux-user/main.c        |  6 +++---
 target-i386/mem_helper.c | 16 +++++++++++++---
 tcg/i386/tcg-target.c    |  8 ++++++++
 5 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 97805cc..f3358a9 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -361,7 +361,9 @@ int cpu_exec(CPUState *cpu)
     SyncClocks sc;
 
     /* This must be volatile so it is not trashed by longjmp() */
+#if defined(CONFIG_USER_ONLY)
     volatile bool have_tb_lock = false;
+#endif
 
     if (async_safe_work_pending()) {
         cpu->exit_request = 1;
@@ -488,8 +490,10 @@ int cpu_exec(CPUState *cpu)
                     cpu->exception_index = EXCP_INTERRUPT;
                     cpu_loop_exit(cpu);
                 }
-                spin_lock(&tcg_ctx.tb_ctx.tb_lock);
+#if defined(CONFIG_USER_ONLY)
+                qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
                 have_tb_lock = true;
+#endif
                 tb = tb_find_fast(cpu);
                 /* Note: we do it here to avoid a gcc bug on Mac OS X when
                    doing it in tb_find_slow */
@@ -511,9 +515,10 @@ int cpu_exec(CPUState *cpu)
                     tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
                                 next_tb & TB_EXIT_MASK, tb);
                 }
+#if defined(CONFIG_USER_ONLY)
                 have_tb_lock = false;
-                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
-
+                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+#endif
                 /* cpu_interrupt might be called while translating the
                    TB, but before it is linked into a potentially
                    infinite loop and becomes env->current_tb. Avoid
@@ -580,10 +585,12 @@ int cpu_exec(CPUState *cpu)
             x86_cpu = X86_CPU(cpu);
             env = &x86_cpu->env;
 #endif
+#if defined(CONFIG_USER_ONLY)
             if (have_tb_lock) {
-                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
+                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
                 have_tb_lock = false;
             }
+#endif
         }
     } /* for(;;) */
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index a6fce04..55a6ff2 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -176,7 +176,7 @@ struct TranslationBlock {
     struct TranslationBlock *jmp_first;
 };
 
-#include "exec/spinlock.h"
+#include "qemu/thread.h"
 
 typedef struct TBContext TBContext;
 
@@ -186,7 +186,7 @@ struct TBContext {
     TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
-    spinlock_t tb_lock;
+    QemuMutex tb_lock;
 
     /* statistics */
     int tb_flush_count;
diff --git a/linux-user/main.c b/linux-user/main.c
index 05914b1..20e7199 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -107,7 +107,7 @@ static int pending_cpus;
 /* Make sure everything is in a consistent state for calling fork().  */
 void fork_start(void)
 {
-    pthread_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
     pthread_mutex_lock(&exclusive_lock);
     mmap_fork_start();
 }
@@ -129,11 +129,11 @@ void fork_end(int child)
         pthread_mutex_init(&cpu_list_mutex, NULL);
         pthread_cond_init(&exclusive_cond, NULL);
         pthread_cond_init(&exclusive_resume, NULL);
-        pthread_mutex_init(&tcg_ctx.tb_ctx.tb_lock, NULL);
+        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
         gdbserver_fork(thread_cpu);
     } else {
         pthread_mutex_unlock(&exclusive_lock);
-        pthread_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
     }
 }
 
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index 1aec8a5..7106cc3 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -23,17 +23,27 @@
 
 /* broken thread support */
 
-static spinlock_t global_cpu_lock = SPIN_LOCK_UNLOCKED;
+#if defined(CONFIG_USER_ONLY)
+QemuMutex global_cpu_lock;
 
 void helper_lock(void)
 {
-    spin_lock(&global_cpu_lock);
+    qemu_mutex_lock(&global_cpu_lock);
 }
 
 void helper_unlock(void)
 {
-    spin_unlock(&global_cpu_lock);
+    qemu_mutex_unlock(&global_cpu_lock);
 }
+#else
+void helper_lock(void)
+{
+}
+
+void helper_unlock(void)
+{
+}
+#endif
 
 void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
 {
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index ff4d9cf..0d7c99c 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -24,6 +24,10 @@
 
 #include "tcg-be-ldst.h"
 
+#if defined(CONFIG_USER_ONLY)
+extern QemuMutex global_cpu_lock;
+#endif
+
 #ifndef NDEBUG
 static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 #if TCG_TARGET_REG_BITS == 64
@@ -2342,6 +2346,10 @@ static void tcg_target_init(TCGContext *s)
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
 
     tcg_add_target_add_op_defs(x86_op_defs);
+
+#if defined(CONFIG_USER_ONLY)
+    qemu_mutex_init(global_cpu_lock);
+#endif
 }
 
 typedef struct {
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V7 05/19] remove unused spinlock.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (3 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 04/19] replace spinlock by QemuMutex fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 06/19] add support for spin lock on POSIX systems exclusively fred.konrad
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This just removes the spinlock header as it is not used anymore.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

Changes V6 -> V7:
  * Drop the checkpatch part.
---
 include/exec/spinlock.h | 49 -------------------------------------------------
 1 file changed, 49 deletions(-)
 delete mode 100644 include/exec/spinlock.h

diff --git a/include/exec/spinlock.h b/include/exec/spinlock.h
deleted file mode 100644
index a72edda..0000000
--- a/include/exec/spinlock.h
+++ /dev/null
@@ -1,49 +0,0 @@
-/*
- *  Copyright (c) 2003 Fabrice Bellard
- *
- * This library is free software; you can redistribute it and/or
- * modify it under the terms of the GNU Lesser General Public
- * License as published by the Free Software Foundation; either
- * version 2 of the License, or (at your option) any later version.
- *
- * This library is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * Lesser General Public License for more details.
- *
- * You should have received a copy of the GNU Lesser General Public
- * License along with this library; if not, see <http://www.gnu.org/licenses/>
- */
-
-/* configure guarantees us that we have pthreads on any host except
- * mingw32, which doesn't support any of the user-only targets.
- * So we can simply assume we have pthread mutexes here.
- */
-#if defined(CONFIG_USER_ONLY)
-
-#include <pthread.h>
-#define spin_lock pthread_mutex_lock
-#define spin_unlock pthread_mutex_unlock
-#define spinlock_t pthread_mutex_t
-#define SPIN_LOCK_UNLOCKED PTHREAD_MUTEX_INITIALIZER
-
-#else
-
-/* Empty implementations, on the theory that system mode emulation
- * is single-threaded. This means that these functions should only
- * be used from code run in the TCG cpu thread, and cannot protect
- * data structures which might also be accessed from the IO thread
- * or from signal handlers.
- */
-typedef int spinlock_t;
-#define SPIN_LOCK_UNLOCKED 0
-
-static inline void spin_lock(spinlock_t *lock)
-{
-}
-
-static inline void spin_unlock(spinlock_t *lock)
-{
-}
-
-#endif
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V7 06/19] add support for spin lock on POSIX systems exclusively
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (4 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 05/19] remove unused spinlock fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 16:10   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock fred.konrad
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>

WARNING: spin lock is currently not implemented on WIN32

Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
---
 include/qemu/thread-posix.h |  4 ++++
 include/qemu/thread-win32.h |  4 ++++
 include/qemu/thread.h       |  7 +++++++
 util/qemu-thread-posix.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
 util/qemu-thread-win32.c    | 30 ++++++++++++++++++++++++++++++
 5 files changed, 90 insertions(+)

diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index eb5c7a1..8ce8f01 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -7,6 +7,10 @@ struct QemuMutex {
     pthread_mutex_t lock;
 };
 
+struct QemuSpin {
+    pthread_spinlock_t lock;
+};
+
 struct QemuCond {
     pthread_cond_t cond;
 };
diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
index 3d58081..310c8bd 100644
--- a/include/qemu/thread-win32.h
+++ b/include/qemu/thread-win32.h
@@ -7,6 +7,10 @@ struct QemuMutex {
     LONG owner;
 };
 
+struct QemuSpin {
+    PKSPIN_LOCK lock;
+};
+
 struct QemuCond {
     LONG waiters, target;
     HANDLE sema;
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 5114ec8..f5d1259 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -5,6 +5,7 @@
 #include <stdbool.h>
 
 typedef struct QemuMutex QemuMutex;
+typedef struct QemuSpin QemuSpin;
 typedef struct QemuCond QemuCond;
 typedef struct QemuSemaphore QemuSemaphore;
 typedef struct QemuEvent QemuEvent;
@@ -25,6 +26,12 @@ void qemu_mutex_lock(QemuMutex *mutex);
 int qemu_mutex_trylock(QemuMutex *mutex);
 void qemu_mutex_unlock(QemuMutex *mutex);
 
+void qemu_spin_init(QemuSpin *spin);
+void qemu_spin_destroy(QemuSpin *spin);
+void qemu_spin_lock(QemuSpin *spin);
+int qemu_spin_trylock(QemuSpin *spin);
+void qemu_spin_unlock(QemuSpin *spin);
+
 void qemu_cond_init(QemuCond *cond);
 void qemu_cond_destroy(QemuCond *cond);
 
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index ba67cec..224bacc 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -89,6 +89,51 @@ void qemu_mutex_unlock(QemuMutex *mutex)
         error_exit(err, __func__);
 }
 
+void qemu_spin_init(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_init(&spin->lock, 0);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
+void qemu_spin_destroy(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_destroy(&spin->lock);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
+void qemu_spin_lock(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_lock(&spin->lock);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
+int qemu_spin_trylock(QemuSpin *spin)
+{
+    return pthread_spin_trylock(&spin->lock);
+}
+
+void qemu_spin_unlock(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_unlock(&spin->lock);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     int err;
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 406b52f..6fbe6a8 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -80,6 +80,36 @@ void qemu_mutex_unlock(QemuMutex *mutex)
     LeaveCriticalSection(&mutex->lock);
 }
 
+void qemu_spin_init(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
+void qemu_spin_destroy(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
+void qemu_spin_lock(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
+int qemu_spin_trylock(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
+void qemu_spin_unlock(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     memset(cond, 0, sizeof(*cond));
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (5 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 06/19] add support for spin lock on POSIX systems exclusively fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 16:36   ` Paolo Bonzini
  2015-08-12 17:45   ` Frederic Konrad
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 08/19] tcg: remove tcg_halt_cond global variable fred.konrad
                   ` (14 subsequent siblings)
  21 siblings, 2 replies; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This protects TBContext with tb_lock to make the tb_* functions thread-safe.

We can still have an issue with tb_flush in the multithreaded TCG case:
another CPU can be executing code during a flush.

This can be fixed later by making all other TCG threads exit before calling
tb_flush().

tb_find_slow is split into tb_find_slow and tb_find_physical, as the whole of
tb_find_slow doesn't require holding the tb lock.
Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

Changes:
V6 -> V7:
  * Drop a tb_lock in already locked restore_state_to_opc.
V5 -> V6:
  * Drop a tb_lock arround tb_find_fast in cpu-exec.c.
---
 cpu-exec.c              |  58 +++++++++++++-------
 include/exec/exec-all.h |   1 +
 target-arm/translate.c  |   3 ++
 tcg/tcg.h               |  14 ++++-
 translate-all.c         | 137 +++++++++++++++++++++++++++++++++++++-----------
 5 files changed, 162 insertions(+), 51 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index f3358a9..a012e9d 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -131,6 +131,8 @@ static void init_delay_params(SyncClocks *sc, const CPUState *cpu)
 void cpu_loop_exit(CPUState *cpu)
 {
     cpu->current_tb = NULL;
+    /* Release those mutex before long jump so other thread can work. */
+    tb_lock_reset();
     siglongjmp(cpu->jmp_env, 1);
 }
 
@@ -143,6 +145,8 @@ void cpu_resume_from_signal(CPUState *cpu, void *puc)
     /* XXX: restore cpu registers saved in host registers */
 
     cpu->exception_index = -1;
+    /* Release those mutex before long jump so other thread can work. */
+    tb_lock_reset();
     siglongjmp(cpu->jmp_env, 1);
 }
 
@@ -253,10 +257,8 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
     tb_free(tb);
 }
 
-static TranslationBlock *tb_find_slow(CPUState *cpu,
-                                      target_ulong pc,
-                                      target_ulong cs_base,
-                                      uint64_t flags)
+static TranslationBlock *tb_find_physical(CPUState *cpu, target_ulong pc,
+                                          target_ulong cs_base, uint64_t flags)
 {
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb, **ptb1;
@@ -273,8 +275,9 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
     ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     for(;;) {
         tb = *ptb1;
-        if (!tb)
-            goto not_found;
+        if (!tb) {
+            return tb;
+        }
         if (tb->pc == pc &&
             tb->page_addr[0] == phys_page1 &&
             tb->cs_base == cs_base &&
@@ -282,28 +285,42 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
             /* check next page if needed */
             if (tb->page_addr[1] != -1) {
                 tb_page_addr_t phys_page2;
-
                 virt_page2 = (pc & TARGET_PAGE_MASK) +
                     TARGET_PAGE_SIZE;
                 phys_page2 = get_page_addr_code(env, virt_page2);
-                if (tb->page_addr[1] == phys_page2)
-                    goto found;
+                if (tb->page_addr[1] == phys_page2) {
+                    return tb;
+                }
             } else {
-                goto found;
+                return tb;
             }
         }
         ptb1 = &tb->phys_hash_next;
     }
- not_found:
-   /* if no translated code available, then translate it now */
-    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
-
- found:
-    /* Move the last found TB to the head of the list */
-    if (likely(*ptb1)) {
-        *ptb1 = tb->phys_hash_next;
-        tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
-        tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
+    return tb;
+}
+
+static TranslationBlock *tb_find_slow(CPUState *cpu, target_ulong pc,
+                                      target_ulong cs_base, uint64_t flags)
+{
+    /*
+     * First try to get the tb if we don't find it we need to lock and compile
+     * it.
+     */
+    TranslationBlock *tb;
+
+    tb = tb_find_physical(cpu, pc, cs_base, flags);
+    if (!tb) {
+        tb_lock();
+        /*
+         * Retry to get the TB in case a CPU just translate it to avoid having
+         * duplicated TB in the pool.
+         */
+        tb = tb_find_physical(cpu, pc, cs_base, flags);
+        if (!tb) {
+            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+        }
+        tb_unlock();
     }
     /* we add the TB in the virtual pc hash table */
     cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
@@ -326,6 +343,7 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu)
                  tb->flags != flags)) {
         tb = tb_find_slow(cpu, pc, cs_base, flags);
     }
+
     return tb;
 }
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 55a6ff2..9f1c1cb 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -74,6 +74,7 @@ typedef struct TranslationBlock TranslationBlock;
 
 void gen_intermediate_code(CPUArchState *env, struct TranslationBlock *tb);
 void gen_intermediate_code_pc(CPUArchState *env, struct TranslationBlock *tb);
+/* Called with tb_lock held.  */
 void restore_state_to_opc(CPUArchState *env, struct TranslationBlock *tb,
                           int pc_pos);
 
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 69ac18c..960c75e 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -11166,6 +11166,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
 
     dc->tb = tb;
 
+    tb_lock();
+
     dc->is_jmp = DISAS_NEXT;
     dc->pc = pc_start;
     dc->singlestep_enabled = cs->singlestep_enabled;
@@ -11506,6 +11508,7 @@ done_generating:
         tb->size = dc->pc - pc_start;
         tb->icount = num_insns;
     }
+    tb_unlock();
 }
 
 void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 231a781..1932323 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -590,21 +590,33 @@ static inline bool tcg_op_buf_full(void)
 
 /* pool based memory allocation */
 
+/* tb_lock must be help for tcg_malloc_internal. */
 void *tcg_malloc_internal(TCGContext *s, int size);
+
 void tcg_pool_reset(TCGContext *s);
 void tcg_pool_delete(TCGContext *s);
 
+void tb_lock(void);
+void tb_unlock(void);
+void tb_lock_reset(void);
+
 static inline void *tcg_malloc(int size)
 {
     TCGContext *s = &tcg_ctx;
     uint8_t *ptr, *ptr_end;
+    void *ret;
+
+    tb_lock();
     size = (size + sizeof(long) - 1) & ~(sizeof(long) - 1);
     ptr = s->pool_cur;
     ptr_end = ptr + size;
     if (unlikely(ptr_end > s->pool_end)) {
-        return tcg_malloc_internal(&tcg_ctx, size);
+        ret = tcg_malloc_internal(&tcg_ctx, size);
+        tb_unlock();
+        return ret;
     } else {
         s->pool_cur = ptr_end;
+        tb_unlock();
         return ptr;
     }
 }
diff --git a/translate-all.c b/translate-all.c
index 60a3d8b..046565c 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -129,6 +129,34 @@ static void *l1_map[V_L1_SIZE];
 /* code generation context */
 TCGContext tcg_ctx;
 
+/* translation block context */
+__thread volatile int have_tb_lock;
+
+void tb_lock(void)
+{
+    if (!have_tb_lock) {
+        qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    }
+    have_tb_lock++;
+}
+
+void tb_unlock(void)
+{
+    assert(have_tb_lock > 0);
+    have_tb_lock--;
+    if (!have_tb_lock) {
+        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+    }
+}
+
+void tb_lock_reset(void)
+{
+    if (have_tb_lock) {
+        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+    }
+    have_tb_lock = 0;
+}
+
 static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2);
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
@@ -217,6 +245,7 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 #ifdef CONFIG_PROFILER
     ti = profile_getclock();
 #endif
+    tb_lock();
     tcg_func_start(s);
 
     gen_intermediate_code_pc(env, tb);
@@ -230,8 +259,10 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 
     /* find opc index corresponding to search_pc */
     tc_ptr = (uintptr_t)tb->tc_ptr;
-    if (searched_pc < tc_ptr)
+    if (searched_pc < tc_ptr) {
+        tb_unlock();
         return -1;
+    }
 
     s->tb_next_offset = tb->tb_next_offset;
 #ifdef USE_DIRECT_JUMP
@@ -243,8 +274,10 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 #endif
     j = tcg_gen_code_search_pc(s, (tcg_insn_unit *)tc_ptr,
                                searched_pc - tc_ptr);
-    if (j < 0)
+    if (j < 0) {
+        tb_unlock();
         return -1;
+    }
     /* now find start of instruction before */
     while (s->gen_opc_instr_start[j] == 0) {
         j--;
@@ -257,6 +290,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     s->restore_time += profile_getclock() - ti;
     s->restore_count++;
 #endif
+
+    tb_unlock();
     return 0;
 }
 
@@ -675,6 +710,7 @@ static inline void code_gen_alloc(size_t tb_size)
             CODE_GEN_AVG_BLOCK_SIZE;
     tcg_ctx.tb_ctx.tbs =
             g_malloc(tcg_ctx.code_gen_max_blocks * sizeof(TranslationBlock));
+    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
@@ -699,16 +735,22 @@ bool tcg_enabled(void)
     return tcg_ctx.code_gen_buffer != NULL;
 }
 
-/* Allocate a new translation block. Flush the translation buffer if
-   too many translation blocks or too much generated code. */
+/*
+ * Allocate a new translation block. Flush the translation buffer if
+ * too many translation blocks or too much generated code.
+ * tb_alloc is not thread safe but tb_gen_code is protected by a mutex so this
+ * function is called only by one thread.
+ */
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
-    TranslationBlock *tb;
+    TranslationBlock *tb = NULL;
 
     if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
         (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
          tcg_ctx.code_gen_buffer_max_size) {
-        return NULL;
+        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
+        tb->pc = pc;
+        tb->cflags = 0;
     }
     tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
     tb->pc = pc;
@@ -721,11 +763,16 @@ void tb_free(TranslationBlock *tb)
     /* In practice this is mostly used for single use temporary TB
        Ignore the hard cases and just back up if this TB happens to
        be the last one generated.  */
+
+    tb_lock();
+
     if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
             tb == &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
         tcg_ctx.code_gen_ptr = tb->tc_ptr;
         tcg_ctx.tb_ctx.nb_tbs--;
     }
+
+    tb_unlock();
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -774,6 +821,8 @@ static void page_flush_tb(void)
 /* XXX: tb_flush is currently not thread safe */
 void tb_flush(CPUState *cpu)
 {
+    tb_lock();
+
 #if defined(DEBUG_FLUSH)
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
@@ -798,6 +847,8 @@ void tb_flush(CPUState *cpu)
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
     tcg_ctx.tb_ctx.tb_flush_count++;
+
+    tb_unlock();
 }
 
 #ifdef DEBUG_TB_CHECK
@@ -807,6 +858,8 @@ static void tb_invalidate_check(target_ulong address)
     TranslationBlock *tb;
     int i;
 
+    tb_lock();
+
     address &= TARGET_PAGE_MASK;
     for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
         for (tb = tb_ctx.tb_phys_hash[i]; tb != NULL; tb = tb->phys_hash_next) {
@@ -818,6 +871,8 @@ static void tb_invalidate_check(target_ulong address)
             }
         }
     }
+
+    tb_unlock();
 }
 
 /* verify that all the pages have correct rights for code */
@@ -826,6 +881,8 @@ static void tb_page_check(void)
     TranslationBlock *tb;
     int i, flags1, flags2;
 
+    tb_lock();
+
     for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
         for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
                 tb = tb->phys_hash_next) {
@@ -837,6 +894,8 @@ static void tb_page_check(void)
             }
         }
     }
+
+    tb_unlock();
 }
 
 #endif
@@ -917,6 +976,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     tb_page_addr_t phys_pc;
     TranslationBlock *tb1, *tb2;
 
+    tb_lock();
+
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_phys_hash_func(phys_pc);
@@ -964,6 +1025,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
 
     tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
+    tb_unlock();
 }
 
 static void build_page_bitmap(PageDesc *p)
@@ -1005,6 +1067,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     target_ulong virt_page2;
     int code_gen_size;
 
+    tb_lock();
+
     phys_pc = get_page_addr_code(env, pc);
     if (use_icount) {
         cflags |= CF_USE_ICOUNT;
@@ -1033,6 +1097,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
         phys_page2 = get_page_addr_code(env, virt_page2);
     }
     tb_link_page(tb, phys_pc, phys_page2);
+
+    tb_unlock();
     return tb;
 }
 
@@ -1331,13 +1397,15 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 }
 
 /* add a new TB and link it to the physical page tables. phys_page2 is
-   (-1) to indicate that only one page contains the TB. */
+ * (-1) to indicate that only one page contains the TB. */
 static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
     unsigned int h;
     TranslationBlock **ptb;
 
+    tb_lock();
+
     /* Grab the mmap lock to stop another thread invalidating this TB
        before we are done.  */
     mmap_lock();
@@ -1371,6 +1439,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     tb_page_check();
 #endif
     mmap_unlock();
+
+    tb_unlock();
 }
 
 /* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
@@ -1379,31 +1449,34 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 {
     int m_min, m_max, m;
     uintptr_t v;
-    TranslationBlock *tb;
-
-    if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
-        return NULL;
-    }
-    if (tc_ptr < (uintptr_t)tcg_ctx.code_gen_buffer ||
-        tc_ptr >= (uintptr_t)tcg_ctx.code_gen_ptr) {
-        return NULL;
-    }
-    /* binary search (cf Knuth) */
-    m_min = 0;
-    m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
-    while (m_min <= m_max) {
-        m = (m_min + m_max) >> 1;
-        tb = &tcg_ctx.tb_ctx.tbs[m];
-        v = (uintptr_t)tb->tc_ptr;
-        if (v == tc_ptr) {
-            return tb;
-        } else if (tc_ptr < v) {
-            m_max = m - 1;
-        } else {
-            m_min = m + 1;
+    TranslationBlock *tb = NULL;
+
+    tb_lock();
+
+    if ((tcg_ctx.tb_ctx.nb_tbs > 0)
+    && (tc_ptr >= (uintptr_t)tcg_ctx.code_gen_buffer &&
+        tc_ptr < (uintptr_t)tcg_ctx.code_gen_ptr)) {
+        /* binary search (cf Knuth) */
+        m_min = 0;
+        m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
+        while (m_min <= m_max) {
+            m = (m_min + m_max) >> 1;
+            tb = &tcg_ctx.tb_ctx.tbs[m];
+            v = (uintptr_t)tb->tc_ptr;
+            if (v == tc_ptr) {
+                tb_unlock();
+                return tb;
+            } else if (tc_ptr < v) {
+                m_max = m - 1;
+            } else {
+                m_min = m + 1;
+            }
         }
+        tb = &tcg_ctx.tb_ctx.tbs[m_max];
     }
-    return &tcg_ctx.tb_ctx.tbs[m_max];
+
+    tb_unlock();
+    return tb;
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1565,6 +1638,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     int direct_jmp_count, direct_jmp2_count, cross_page;
     TranslationBlock *tb;
 
+    tb_lock();
+
     target_code_size = 0;
     max_target_code_size = 0;
     cross_page = 0;
@@ -1620,6 +1695,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
             tcg_ctx.tb_ctx.tb_phys_invalidate_count);
     cpu_fprintf(f, "TLB flush count     %d\n", tlb_flush_count);
     tcg_dump_info(f, cpu_fprintf);
+
+    tb_unlock();
 }
 
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 08/19] tcg: remove tcg_halt_cond global variable.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (6 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 16:12   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution fred.konrad
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This removes the tcg_halt_cond global variable.
We need one QemuCond per virtual CPU for multithreaded TCG.
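
As a rough illustration of the intent, here is a minimal self-contained
pthread model of the per-vCPU halt condition (hypothetical names, not the
QEMU code): each vCPU thread waits on its own condition variable under the
global mutex, so a kick can wake exactly one vCPU instead of every thread
sharing a single global condition.

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct VCPU {
        pthread_cond_t halt_cond;   /* one condition per vCPU */
        bool halted;
    } VCPU;

    static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void vcpu_wait_io_event(VCPU *cpu)
    {
        pthread_mutex_lock(&global_mutex);
        while (cpu->halted) {
            pthread_cond_wait(&cpu->halt_cond, &global_mutex);
        }
        pthread_mutex_unlock(&global_mutex);
    }

    static void vcpu_kick(VCPU *cpu)
    {
        pthread_mutex_lock(&global_mutex);
        cpu->halted = false;
        pthread_cond_signal(&cpu->halt_cond);  /* wakes only this vCPU */
        pthread_mutex_unlock(&global_mutex);
    }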

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpus.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/cpus.c b/cpus.c
index 2250296..2550be2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -815,7 +815,6 @@ static unsigned iothread_requesting_mutex;
 static QemuThread io_thread;
 
 static QemuThread *tcg_cpu_thread;
-static QemuCond *tcg_halt_cond;
 
 /* cpu creation */
 static QemuCond qemu_cpu_cond;
@@ -1038,15 +1037,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
     cpu->thread_kicked = false;
 }
 
-static void qemu_tcg_wait_io_event(void)
+static void qemu_tcg_wait_io_event(CPUState *cpu)
 {
-    CPUState *cpu;
-
     while (all_cpu_threads_idle()) {
        /* Start accounting real time to the virtual clock if the CPUs
           are idle.  */
         qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
-        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
+        qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
     }
 
     while (iothread_requesting_mutex) {
@@ -1166,7 +1163,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 
     /* wait for initial kick-off after machine start */
     while (first_cpu->stopped) {
-        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
+        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
 
         /* process any pending work */
         CPU_FOREACH(cpu) {
@@ -1187,7 +1184,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
                 qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
             }
         }
-        qemu_tcg_wait_io_event();
+        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
     }
 
     return NULL;
@@ -1328,12 +1325,12 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
 
     tcg_cpu_address_space_init(cpu, cpu->as);
 
+    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
+    qemu_cond_init(cpu->halt_cond);
+
     /* share a single thread for all cpus with TCG */
     if (!tcg_cpu_thread) {
         cpu->thread = g_malloc0(sizeof(QemuThread));
-        cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-        qemu_cond_init(cpu->halt_cond);
-        tcg_halt_cond = cpu->halt_cond;
         snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
                  cpu->cpu_index);
         qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn,
@@ -1347,7 +1344,6 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
         tcg_cpu_thread = cpu->thread;
     } else {
         cpu->thread = tcg_cpu_thread;
-        cpu->halt_cond = tcg_halt_cond;
     }
 }
 
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (7 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 08/19] tcg: remove tcg_halt_cond global variable fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 16:15   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 10/19] cpu: remove exit_request global fred.konrad
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This finally allows TCG to benefit from the iothread introduction: Drop
the global mutex while running pure TCG CPU code. Reacquire the lock
when entering MMIO or PIO emulation, or when leaving the TCG loop.

We have to revert a few optimizations for the current TCG threading
model, namely kicking the TCG thread in qemu_mutex_lock_iothread and not
kicking it in qemu_cpu_kick. We also need to disable RAM block
reordering until we have a more efficient locking mechanism at hand.
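
The resulting locking discipline, modelled here as a self-contained
pthread sketch (illustrative only, not the QEMU code): guest code executes
without the big lock, and every device (MMIO/PIO) access re-takes it.

    #include <pthread.h>

    static pthread_mutex_t iothread_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* vCPU loop slice: called with the lock held, drops it around
     * pure guest-code execution, re-takes it before touching devices */
    static void run_vcpu_slice(void (*exec_guest_code)(void))
    {
        pthread_mutex_unlock(&iothread_mutex);
        exec_guest_code();              /* runs lock-free */
        pthread_mutex_lock(&iothread_mutex);
    }

    /* MMIO/PIO helper: device state is only touched under the lock */
    static unsigned mmio_read(volatile unsigned *reg)
    {
        unsigned val;

        pthread_mutex_lock(&iothread_mutex);
        val = *reg;
        pthread_mutex_unlock(&iothread_mutex);
        return val;
    }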

I'm pretty sure some cases are still broken, definitely SMP (we no
longer perform round-robin scheduling "by chance"). Still, a Linux x86
UP guest and my Musicpal ARM model boot fine here. These numbers
demonstrate where we gain something:

20338 jan       20   0  331m  75m 6904 R   99  0.9   0:50.95 qemu-system-arm
20337 jan       20   0  331m  75m 6904 S   20  0.9   0:26.50 qemu-system-arm

The guest CPU was fully loaded, but the iothread could still run mostly
independently on a second core. Without the patch we don't get beyond

32206 jan       20   0  330m  73m 7036 R   82  0.9   1:06.00 qemu-system-arm
32204 jan       20   0  330m  73m 7036 S   21  0.9   0:17.03 qemu-system-arm

We don't benefit significantly, though, when the guest is not fully
loading a host CPU.

Note that this patch depends on
http://thread.gmane.org/gmane.comp.emulators.qemu/118657

Changes from Fred Konrad:
  * Rebase on the current HEAD.
  * Fixes a deadlock in qemu_devices_reset().
  * Remove the mutex in address_space_*
---
 cpus.c                    | 20 +++-----------------
 cputlb.c                  |  5 +++++
 target-i386/misc_helper.c | 27 ++++++++++++++++++++++++---
 translate-all.c           |  2 ++
 vl.c                      |  6 ++++++
 5 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/cpus.c b/cpus.c
index 2550be2..154a081 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1232,23 +1232,7 @@ bool qemu_mutex_iothread_locked(void)
 
 void qemu_mutex_lock_iothread(void)
 {
-    atomic_inc(&iothread_requesting_mutex);
-    /* In the simple case there is no need to bump the VCPU thread out of
-     * TCG code execution.
-     */
-    if (!tcg_enabled() || qemu_in_vcpu_thread() ||
-        !first_cpu || !first_cpu->thread) {
-        qemu_mutex_lock(&qemu_global_mutex);
-        atomic_dec(&iothread_requesting_mutex);
-    } else {
-        if (qemu_mutex_trylock(&qemu_global_mutex)) {
-            qemu_cpu_kick_thread(first_cpu);
-            qemu_mutex_lock(&qemu_global_mutex);
-        }
-        atomic_dec(&iothread_requesting_mutex);
-        qemu_cond_broadcast(&qemu_io_proceeded_cond);
-    }
-    iothread_locked = true;
+    qemu_mutex_lock(&qemu_global_mutex);
 }
 
 void qemu_mutex_unlock_iothread(void)
@@ -1469,7 +1453,9 @@ static int tcg_cpu_exec(CPUState *cpu)
         cpu->icount_decr.u16.low = decr;
         cpu->icount_extra = count;
     }
+    qemu_mutex_unlock_iothread();
     ret = cpu_exec(cpu);
+    qemu_mutex_lock_iothread();
 #ifdef CONFIG_PROFILER
     tcg_time += profile_getclock() - ti;
 #endif
diff --git a/cputlb.c b/cputlb.c
index a506086..79fff1c 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -30,6 +30,9 @@
 #include "exec/ram_addr.h"
 #include "tcg/tcg.h"
 
+void qemu_mutex_lock_iothread(void);
+void qemu_mutex_unlock_iothread(void);
+
 //#define DEBUG_TLB
 //#define DEBUG_TLB_CHECK
 
@@ -125,8 +128,10 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
    can be detected */
 void tlb_protect_code(ram_addr_t ram_addr)
 {
+    qemu_mutex_lock_iothread();
     cpu_physical_memory_test_and_clear_dirty(ram_addr, TARGET_PAGE_SIZE,
                                              DIRTY_MEMORY_CODE);
+    qemu_mutex_unlock_iothread();
 }
 
 /* update the TLB so that writes in physical page 'phys_addr' are no longer
diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
index 52c5d65..55f63bf 100644
--- a/target-i386/misc_helper.c
+++ b/target-i386/misc_helper.c
@@ -27,8 +27,10 @@ void helper_outb(CPUX86State *env, uint32_t port, uint32_t data)
 #ifdef CONFIG_USER_ONLY
     fprintf(stderr, "outb: port=0x%04x, data=%02x\n", port, data);
 #else
+    qemu_mutex_lock_iothread();
     address_space_stb(&address_space_io, port, data,
                       cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
 #endif
 }
 
@@ -38,8 +40,13 @@ target_ulong helper_inb(CPUX86State *env, uint32_t port)
     fprintf(stderr, "inb: port=0x%04x\n", port);
     return 0;
 #else
-    return address_space_ldub(&address_space_io, port,
+    target_ulong ret;
+
+    qemu_mutex_lock_iothread();
+    ret = address_space_ldub(&address_space_io, port,
                               cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
+    return ret;
 #endif
 }
 
@@ -48,8 +55,10 @@ void helper_outw(CPUX86State *env, uint32_t port, uint32_t data)
 #ifdef CONFIG_USER_ONLY
     fprintf(stderr, "outw: port=0x%04x, data=%04x\n", port, data);
 #else
+    qemu_mutex_lock_iothread();
     address_space_stw(&address_space_io, port, data,
                       cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
 #endif
 }
 
@@ -59,8 +68,13 @@ target_ulong helper_inw(CPUX86State *env, uint32_t port)
     fprintf(stderr, "inw: port=0x%04x\n", port);
     return 0;
 #else
-    return address_space_lduw(&address_space_io, port,
+    target_ulong ret;
+
+    qemu_mutex_lock_iothread();
+    ret = address_space_lduw(&address_space_io, port,
                               cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
+    return ret;
 #endif
 }
 
@@ -69,8 +83,10 @@ void helper_outl(CPUX86State *env, uint32_t port, uint32_t data)
 #ifdef CONFIG_USER_ONLY
     fprintf(stderr, "outw: port=0x%04x, data=%08x\n", port, data);
 #else
+    qemu_mutex_lock_iothread();
     address_space_stl(&address_space_io, port, data,
                       cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
 #endif
 }
 
@@ -80,8 +96,13 @@ target_ulong helper_inl(CPUX86State *env, uint32_t port)
     fprintf(stderr, "inl: port=0x%04x\n", port);
     return 0;
 #else
-    return address_space_ldl(&address_space_io, port,
+    target_ulong ret;
+
+    qemu_mutex_lock_iothread();
+    ret = address_space_ldl(&address_space_io, port,
                              cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
+    return ret;
 #endif
 }
 
diff --git a/translate-all.c b/translate-all.c
index 046565c..954c67a 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1223,6 +1223,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 #endif
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
+        qemu_mutex_unlock_iothread();
         /* we generate a block containing just the instruction
            modifying the memory. It will ensure that it cannot modify
            itself */
@@ -1327,6 +1328,7 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
     p->first_tb = NULL;
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
+        qemu_mutex_unlock_iothread();
         /* we generate a block containing just the instruction
            modifying the memory. It will ensure that it cannot modify
            itself */
diff --git a/vl.c b/vl.c
index 3f269dc..922e969 100644
--- a/vl.c
+++ b/vl.c
@@ -1717,10 +1717,16 @@ void qemu_devices_reset(void)
 {
     QEMUResetEntry *re, *nre;
 
+    /*
+     * Some devices' reset handlers need to grab the global_mutex, so just
+     * release it here.
+     */
+    qemu_mutex_unlock_iothread();
     /* reset all devices */
     QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
         re->func(re->opaque);
     }
+    qemu_mutex_lock_iothread();
 }
 
 void qemu_system_reset(bool report)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 10/19] cpu: remove exit_request global.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (8 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 15:51   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread fred.konrad
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This removes the exit_request global and adds a variable in CPUState for this.
Only the flag of the first CPU is used for the moment, as we still have a
single TCG thread.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpu-exec.c | 15 ---------------
 cpus.c     | 17 ++++++++++++++---
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index a012e9d..21a7b96 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -363,8 +363,6 @@ static void cpu_handle_debug_exception(CPUState *cpu)
 
 /* main execution loop */
 
-volatile sig_atomic_t exit_request;
-
 int cpu_exec(CPUState *cpu)
 {
     CPUClass *cc = CPU_GET_CLASS(cpu);
@@ -402,20 +400,8 @@ int cpu_exec(CPUState *cpu)
     }
     current_cpu = cpu;
 
-    /* As long as current_cpu is null, up to the assignment just above,
-     * requests by other threads to exit the execution loop are expected to
-     * be issued using the exit_request global. We must make sure that our
-     * evaluation of the global value is performed past the current_cpu
-     * value transition point, which requires a memory barrier as well as
-     * an instruction scheduling constraint on modern architectures.  */
-    smp_mb();
-
     rcu_read_lock();
 
-    if (unlikely(exit_request)) {
-        cpu->exit_request = 1;
-    }
-
     cc->cpu_exec_enter(cpu);
 
     /* Calculate difference between guest clock and host clock.
@@ -504,7 +490,6 @@ int cpu_exec(CPUState *cpu)
                     }
                 }
                 if (unlikely(cpu->exit_request)) {
-                    cpu->exit_request = 0;
                     cpu->exception_index = EXCP_INTERRUPT;
                     cpu_loop_exit(cpu);
                 }
diff --git a/cpus.c b/cpus.c
index 154a081..2de9eae 100644
--- a/cpus.c
+++ b/cpus.c
@@ -139,6 +139,8 @@ typedef struct TimersState {
 } TimersState;
 
 static TimersState timers_state;
+/* CPU associated to this thread. */
+static __thread CPUState *tcg_thread_cpu;
 
 int64_t cpu_get_icount_raw(void)
 {
@@ -663,12 +665,18 @@ static void cpu_handle_guest_debug(CPUState *cpu)
     cpu->stopped = true;
 }
 
+/**
+ * cpu_signal
+ * Signal handler when using TCG.
+ */
 static void cpu_signal(int sig)
 {
     if (current_cpu) {
         cpu_exit(current_cpu);
     }
-    exit_request = 1;
+
+    /* FIXME: We might want to check if the cpu is running? */
+    tcg_thread_cpu->exit_request = true;
 }
 
 #ifdef CONFIG_LINUX
@@ -1151,6 +1159,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     CPUState *cpu = arg;
 
     qemu_mutex_lock_iothread();
+    tcg_thread_cpu = cpu;
     qemu_tcg_init_cpu_signals();
     qemu_thread_get_self(cpu->thread);
 
@@ -1480,7 +1489,8 @@ static void tcg_exec_all(void)
     if (next_cpu == NULL) {
         next_cpu = first_cpu;
     }
-    for (; next_cpu != NULL && !exit_request; next_cpu = CPU_NEXT(next_cpu)) {
+    for (; next_cpu != NULL && !first_cpu->exit_request;
+           next_cpu = CPU_NEXT(next_cpu)) {
         CPUState *cpu = next_cpu;
 
         qemu_clock_enable(QEMU_CLOCK_VIRTUAL,
@@ -1496,7 +1506,8 @@ static void tcg_exec_all(void)
             break;
         }
     }
-    exit_request = 0;
+
+    first_cpu->exit_request = 0;
 }
 
 void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (9 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 10/19] cpu: remove exit_request global fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-13 11:17   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 12/19] Use atomic cmpxchg to atomically check the exclusive value in a STREX fred.konrad
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This switches on multithreading.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

Changes V5 -> V6:
  * make qemu_cpu_kick call qemu_cpu_kick_thread in the TCG case.
---
 cpus.c | 93 ++++++++++++++++++++++++------------------------------------------
 1 file changed, 33 insertions(+), 60 deletions(-)

diff --git a/cpus.c b/cpus.c
index 2de9eae..2c5ca72 100644
--- a/cpus.c
+++ b/cpus.c
@@ -65,7 +65,6 @@
 
 #endif /* CONFIG_LINUX */
 
-static CPUState *next_cpu;
 int64_t max_delay;
 int64_t max_advance;
 
@@ -822,8 +821,6 @@ static unsigned iothread_requesting_mutex;
 
 static QemuThread io_thread;
 
-static QemuThread *tcg_cpu_thread;
-
 /* cpu creation */
 static QemuCond qemu_cpu_cond;
 /* system init */
@@ -1047,10 +1044,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
 
 static void qemu_tcg_wait_io_event(CPUState *cpu)
 {
-    while (all_cpu_threads_idle()) {
-       /* Start accounting real time to the virtual clock if the CPUs
-          are idle.  */
-        qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
+    while (cpu_thread_is_idle(cpu)) {
+        /* Start accounting real time to the virtual clock if the CPUs
+         * are idle.
+         */
+        if ((all_cpu_threads_idle()) && (cpu->cpu_index == 0)) {
+            qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
+        }
         qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
     }
 
@@ -1058,9 +1058,7 @@ static void qemu_tcg_wait_io_event(CPUState *cpu)
         qemu_cond_wait(&qemu_io_proceeded_cond, &qemu_global_mutex);
     }
 
-    CPU_FOREACH(cpu) {
-        qemu_wait_io_event_common(cpu);
-    }
+    qemu_wait_io_event_common(cpu);
 }
 
 static void qemu_kvm_wait_io_event(CPUState *cpu)
@@ -1152,7 +1150,7 @@ static void *qemu_dummy_cpu_thread_fn(void *arg)
 #endif
 }
 
-static void tcg_exec_all(void);
+static void tcg_exec_all(CPUState *cpu);
 
 static void *qemu_tcg_cpu_thread_fn(void *arg)
 {
@@ -1163,37 +1161,26 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     qemu_tcg_init_cpu_signals();
     qemu_thread_get_self(cpu->thread);
 
-    CPU_FOREACH(cpu) {
-        cpu->thread_id = qemu_get_thread_id();
-        cpu->created = true;
-        cpu->can_do_io = 1;
-    }
-    qemu_cond_signal(&qemu_cpu_cond);
-
-    /* wait for initial kick-off after machine start */
-    while (first_cpu->stopped) {
-        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
-
-        /* process any pending work */
-        CPU_FOREACH(cpu) {
-            qemu_wait_io_event_common(cpu);
-        }
-    }
+    cpu->thread_id = qemu_get_thread_id();
+    cpu->created = true;
+    cpu->can_do_io = 1;
 
-    /* process any pending work */
-    exit_request = 1;
+    qemu_cond_signal(&qemu_cpu_cond);
 
     while (1) {
-        tcg_exec_all();
+        if (!cpu->stopped) {
+            tcg_exec_all(cpu);
 
-        if (use_icount) {
-            int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
+            if (use_icount) {
+                int64_t deadline =
+                    qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
 
-            if (deadline == 0) {
-                qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
+                if (deadline == 0) {
+                    qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
+                }
             }
         }
-        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
+        qemu_tcg_wait_io_event(cpu);
     }
 
     return NULL;
@@ -1202,7 +1189,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 void qemu_cpu_kick(CPUState *cpu)
 {
     qemu_cond_broadcast(cpu->halt_cond);
-    if (!tcg_enabled() && !cpu->thread_kicked) {
+    if (!cpu->thread_kicked) {
         qemu_cpu_kick_thread(cpu);
         cpu->thread_kicked = true;
     }
@@ -1320,23 +1307,15 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
 
     cpu->halt_cond = g_malloc0(sizeof(QemuCond));
     qemu_cond_init(cpu->halt_cond);
-
-    /* share a single thread for all cpus with TCG */
-    if (!tcg_cpu_thread) {
-        cpu->thread = g_malloc0(sizeof(QemuThread));
-        snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
-                 cpu->cpu_index);
-        qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn,
-                           cpu, QEMU_THREAD_JOINABLE);
+    cpu->thread = g_malloc0(sizeof(QemuThread));
+    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG", cpu->cpu_index);
+    qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn, cpu,
+                       QEMU_THREAD_JOINABLE);
 #ifdef _WIN32
-        cpu->hThread = qemu_thread_get_handle(cpu->thread);
+    cpu->hThread = qemu_thread_get_handle(cpu->thread);
 #endif
-        while (!cpu->created) {
-            qemu_cond_wait(&qemu_cpu_cond, &qemu_global_mutex);
-        }
-        tcg_cpu_thread = cpu->thread;
-    } else {
-        cpu->thread = tcg_cpu_thread;
+    while (!cpu->created) {
+        qemu_cond_wait(&qemu_cpu_cond, &qemu_global_mutex);
     }
 }
 
@@ -1479,20 +1458,14 @@ static int tcg_cpu_exec(CPUState *cpu)
     return ret;
 }
 
-static void tcg_exec_all(void)
+static void tcg_exec_all(CPUState *cpu)
 {
     int r;
 
     /* Account partial waits to QEMU_CLOCK_VIRTUAL.  */
     qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
 
-    if (next_cpu == NULL) {
-        next_cpu = first_cpu;
-    }
-    for (; next_cpu != NULL && !first_cpu->exit_request;
-           next_cpu = CPU_NEXT(next_cpu)) {
-        CPUState *cpu = next_cpu;
-
+    while (!cpu->exit_request) {
         qemu_clock_enable(QEMU_CLOCK_VIRTUAL,
                           (cpu->singlestep_enabled & SSTEP_NOTIMER) == 0);
 
@@ -1507,7 +1480,7 @@ static void tcg_exec_all(void)
         }
     }
 
-    first_cpu->exit_request = 0;
+    cpu->exit_request = 0;
 }
 
 void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 12/19] Use atomic cmpxchg to atomically check the exclusive value in a STREX
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (10 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 13/19] add a callback when tb_invalidate is called fred.konrad
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This mechanism replaces the existing load/store exclusive mechanism, which seems
to be broken for multithreading.
It follows the intention of the existing mechanism: it stores the target address
and data values during a load operation, then checks that they remain unchanged
before a store.

In common with the older approach, this provides weaker semantics than required,
in that a different processor could write the same value with a non-exclusive
write; in practice this seems to be irrelevant.

The old implementation didn't correctly store its values as globals, but rather
kept a local copy per CPU.

This new mechanism stores the values globally and also uses the atomic cmpxchg
macros to ensure atomicity - it is therefore very efficient and thread-safe.
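
The core idea, as a self-contained sketch (single monitor, uint32_t only,
and no spinlock; the patch itself keeps per-CPU state and serializes it
with arm_exclusive_lock): the load records address and value, and the
store succeeds only if a host cmpxchg still sees that value there.

    #include <stdint.h>
    #include <stdbool.h>

    static uintptr_t excl_addr = (uintptr_t)-1;  /* exclusive monitor */
    static uint32_t  excl_val;

    static void load_exclusive(uint32_t *p)
    {
        excl_val  = __atomic_load_n(p, __ATOMIC_SEQ_CST);
        excl_addr = (uintptr_t)p;
    }

    /* returns 0 on success, 1 on failure, like the patch's helper */
    static int store_exclusive(uint32_t *p, uint32_t newval)
    {
        uint32_t expected = excl_val;

        if (excl_addr != (uintptr_t)p) {
            return 1;                /* monitor lost or never taken */
        }
        excl_addr = (uintptr_t)-1;
        /* store happens only if the value is still the one we loaded */
        return __atomic_compare_exchange_n(p, &expected, newval, false,
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST) ? 0 : 1;
    }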

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

Changes:
  V5 -> V6:
    * Use spinlock instead of mutex.
    * Fix the length for address map.
    * Fix the return address for tlb_fill.
  V4 -> V5:
    * Remove atomic_check and atomic_release which were unused.
---
 target-arm/cpu.c       |  21 ++++++++
 target-arm/cpu.h       |   6 +++
 target-arm/helper.c    |  13 +++++
 target-arm/helper.h    |   4 ++
 target-arm/op_helper.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++-
 target-arm/translate.c |  98 +++++++------------------------------
 6 files changed, 188 insertions(+), 82 deletions(-)

diff --git a/target-arm/cpu.c b/target-arm/cpu.c
index 8b4323d..ba0d2a7 100644
--- a/target-arm/cpu.c
+++ b/target-arm/cpu.c
@@ -30,6 +30,26 @@
 #include "sysemu/kvm.h"
 #include "kvm_arm.h"
 
+/* Protect the cpu_exclusive_* variables. */
+__thread bool cpu_have_exclusive_lock;
+QemuSpin cpu_exclusive_lock;
+
+inline void arm_exclusive_lock(void)
+{
+    if (!cpu_have_exclusive_lock) {
+        qemu_spin_lock(&cpu_exclusive_lock);
+        cpu_have_exclusive_lock = true;
+    }
+}
+
+inline void arm_exclusive_unlock(void)
+{
+    if (cpu_have_exclusive_lock) {
+        cpu_have_exclusive_lock = false;
+        qemu_spin_unlock(&cpu_exclusive_lock);
+    }
+}
+
 static void arm_cpu_set_pc(CPUState *cs, vaddr value)
 {
     ARMCPU *cpu = ARM_CPU(cs);
@@ -469,6 +489,7 @@ static void arm_cpu_initfn(Object *obj)
         cpu->psci_version = 2; /* TCG implements PSCI 0.2 */
         if (!inited) {
             inited = true;
+            qemu_spin_init(&cpu_exclusive_lock);
             arm_translate_init();
         }
     }
diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index 7e89152..f8d04fa 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -515,6 +515,9 @@ static inline bool is_a64(CPUARMState *env)
 int cpu_arm_signal_handler(int host_signum, void *pinfo,
                            void *puc);
 
+bool arm_get_phys_addr(CPUARMState *env, target_ulong address, int access_type,
+                       hwaddr *phys_ptr, int *prot, target_ulong *page_size);
+
 /**
  * pmccntr_sync
  * @env: CPUARMState
@@ -1933,4 +1936,7 @@ enum {
     QEMU_PSCI_CONDUIT_HVC = 2,
 };
 
+void arm_exclusive_lock(void);
+void arm_exclusive_unlock(void);
+
 #endif
diff --git a/target-arm/helper.c b/target-arm/helper.c
index b87afe7..34b465c 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -24,6 +24,15 @@ static inline bool get_phys_addr(CPUARMState *env, target_ulong address,
 #define PMCRE   0x1
 #endif
 
+bool arm_get_phys_addr(CPUARMState *env, target_ulong address, int access_type,
+                       hwaddr *phys_ptr, int *prot, target_ulong *page_size)
+{
+    MemTxAttrs attrs = {};
+    uint32_t fsr;
+    return get_phys_addr(env, address, access_type, cpu_mmu_index(env),
+                         phys_ptr, &attrs, prot, page_size, &fsr);
+}
+
 static int vfp_gdb_get_reg(CPUARMState *env, uint8_t *buf, int reg)
 {
     int nregs;
@@ -4824,6 +4833,10 @@ void arm_cpu_do_interrupt(CPUState *cs)
 
     arm_log_exception(cs->exception_index);
 
+    arm_exclusive_lock();
+    env->exclusive_addr = -1;
+    arm_exclusive_unlock();
+
     if (arm_is_psci_call(cpu, cs->exception_index)) {
         arm_handle_psci_call(cpu);
         qemu_log_mask(CPU_LOG_INT, "...handled as PSCI call\n");
diff --git a/target-arm/helper.h b/target-arm/helper.h
index 827b33d..c77bf04 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -530,6 +530,10 @@ DEF_HELPER_2(dc_zva, void, env, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 
+DEF_HELPER_4(atomic_cmpxchg64, i32, env, i32, i64, i32)
+DEF_HELPER_1(atomic_clear, void, env)
+DEF_HELPER_3(atomic_claim, void, env, i32, i64)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #endif
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 663c05d..ba8c5f5 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -30,12 +30,139 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
     CPUState *cs = CPU(arm_env_get_cpu(env));
 
     assert(!excp_is_internal(excp));
+    arm_exclusive_lock();
     cs->exception_index = excp;
     env->exception.syndrome = syndrome;
     env->exception.target_el = target_el;
+    /*
+     * We MAY already have the lock - in which case we are exiting the
+     * instruction due to an exception. Otherwise we better make sure we are not
+     * about to enter a STREX anyway.
+     */
+    env->exclusive_addr = -1;
+    arm_exclusive_unlock();
     cpu_loop_exit(cs);
 }
 
+/* NB return 1 for fail, 0 for pass */
+uint32_t HELPER(atomic_cmpxchg64)(CPUARMState *env, uint32_t addr,
+                                  uint64_t newval, uint32_t size)
+{
+    ARMCPU *cpu = arm_env_get_cpu(env);
+    CPUState *cs = CPU(cpu);
+
+    uintptr_t retaddr = GETRA();
+    bool result = false;
+    hwaddr len = 1 << size;
+
+    hwaddr paddr;
+    target_ulong page_size;
+    int prot;
+
+    arm_exclusive_lock();
+
+    if (env->exclusive_addr != addr) {
+        arm_exclusive_unlock();
+        return 1;
+    }
+
+    if (arm_get_phys_addr(env, addr, 1, &paddr, &prot, &page_size)) {
+        tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, cpu_mmu_index(env),
+                 retaddr);
+        if (arm_get_phys_addr(env, addr, 1, &paddr, &prot, &page_size)) {
+            arm_exclusive_unlock();
+            return 1;
+        }
+    }
+
+    switch (size) {
+    case 0:
+    {
+        uint8_t oldval, *p;
+        p = address_space_map(cs->as, paddr, &len, true);
+        if (len == 1 << size) {
+            oldval = (uint8_t)env->exclusive_val;
+            result = (atomic_cmpxchg(p, oldval, (uint8_t)newval) == oldval);
+        }
+        address_space_unmap(cs->as, p, len, true, result ? len : 0);
+    }
+    break;
+    case 1:
+    {
+        uint16_t oldval, *p;
+        p = address_space_map(cs->as, paddr, &len, true);
+        if (len == 1 << size) {
+            oldval = (uint16_t)env->exclusive_val;
+            result = (atomic_cmpxchg(p, oldval, (uint16_t)newval) == oldval);
+        }
+        address_space_unmap(cs->as, p, len, true, result ? len : 0);
+    }
+    break;
+    case 2:
+    {
+        uint32_t oldval, *p;
+        p = address_space_map(cs->as, paddr, &len, true);
+        if (len == 1 << size) {
+            oldval = (uint32_t)env->exclusive_val;
+            result = (atomic_cmpxchg(p, oldval, (uint32_t)newval) == oldval);
+        }
+        address_space_unmap(cs->as, p, len, true, result ? len : 0);
+    }
+    break;
+    case 3:
+    {
+        uint64_t oldval, *p;
+        p = address_space_map(cs->as, paddr, &len, true);
+        if (len == 1 << size) {
+            oldval = (uint64_t)env->exclusive_val;
+            result = (atomic_cmpxchg(p, oldval, (uint64_t)newval) == oldval);
+        }
+        address_space_unmap(cs->as, p, len, true, result ? len : 0);
+    }
+    break;
+    default:
+        abort();
+    break;
+    }
+
+    env->exclusive_addr = -1;
+    arm_exclusive_unlock();
+    if (result) {
+        return 0;
+    } else {
+        return 1;
+    }
+}
+
+void HELPER(atomic_clear)(CPUARMState *env)
+{
+    /* make sure no STREX is about to start */
+    arm_exclusive_lock();
+    env->exclusive_addr = -1;
+    arm_exclusive_unlock();
+}
+
+void HELPER(atomic_claim)(CPUARMState *env, uint32_t addr, uint64_t val)
+{
+    CPUState *cpu;
+    CPUARMState *current_cpu;
+
+    /* ensure that there are no STREX's executing */
+    arm_exclusive_lock();
+
+    CPU_FOREACH(cpu) {
+        current_cpu = &ARM_CPU(cpu)->env;
+        if (current_cpu->exclusive_addr  == addr) {
+            /* We steal the atomic of this CPU. */
+            current_cpu->exclusive_addr = -1;
+        }
+    }
+
+    env->exclusive_val = val;
+    env->exclusive_addr = addr;
+    arm_exclusive_unlock();
+}
+
 static int exception_target_el(CPUARMState *env)
 {
     int target_el = MAX(1, arm_current_el(env));
@@ -595,7 +722,6 @@ void HELPER(exception_return)(CPUARMState *env)
 
     aarch64_save_sp(env, cur_el);
 
-    env->exclusive_addr = -1;
 
     /* We must squash the PSTATE.SS bit to zero unless both of the
      * following hold:
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 960c75e..325ee4a 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -65,8 +65,8 @@ TCGv_ptr cpu_env;
 static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
 static TCGv_i32 cpu_R[16];
 static TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
-static TCGv_i64 cpu_exclusive_addr;
 static TCGv_i64 cpu_exclusive_val;
+static TCGv_i64 cpu_exclusive_addr;
 #ifdef CONFIG_USER_ONLY
 static TCGv_i64 cpu_exclusive_test;
 static TCGv_i32 cpu_exclusive_info;
@@ -7395,6 +7395,7 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
                                TCGv_i32 addr, int size)
 {
     TCGv_i32 tmp = tcg_temp_new_i32();
+    TCGv_i64 val = tcg_temp_new_i64();
 
     s->is_ldex = true;
 
@@ -7419,20 +7420,20 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
 
         tcg_gen_addi_i32(tmp2, addr, 4);
         gen_aa32_ld32u(tmp3, tmp2, get_mem_index(s));
+        tcg_gen_concat_i32_i64(val, tmp, tmp3);
         tcg_temp_free_i32(tmp2);
-        tcg_gen_concat_i32_i64(cpu_exclusive_val, tmp, tmp3);
         store_reg(s, rt2, tmp3);
     } else {
-        tcg_gen_extu_i32_i64(cpu_exclusive_val, tmp);
+        tcg_gen_extu_i32_i64(val, tmp);
     }
-
+    gen_helper_atomic_claim(cpu_env, addr, val);
+    tcg_temp_free_i64(val);
     store_reg(s, rt, tmp);
-    tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
 }
 
 static void gen_clrex(DisasContext *s)
 {
-    tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+    gen_helper_atomic_clear(cpu_env);
 }
 
 #ifdef CONFIG_USER_ONLY
@@ -7449,84 +7450,19 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
                                 TCGv_i32 addr, int size)
 {
     TCGv_i32 tmp;
-    TCGv_i64 val64, extaddr;
-    TCGLabel *done_label;
-    TCGLabel *fail_label;
-
-    /* if (env->exclusive_addr == addr && env->exclusive_val == [addr]) {
-         [addr] = {Rt};
-         {Rd} = 0;
-       } else {
-         {Rd} = 1;
-       } */
-    fail_label = gen_new_label();
-    done_label = gen_new_label();
-    extaddr = tcg_temp_new_i64();
-    tcg_gen_extu_i32_i64(extaddr, addr);
-    tcg_gen_brcond_i64(TCG_COND_NE, extaddr, cpu_exclusive_addr, fail_label);
-    tcg_temp_free_i64(extaddr);
-
-    tmp = tcg_temp_new_i32();
-    switch (size) {
-    case 0:
-        gen_aa32_ld8u(tmp, addr, get_mem_index(s));
-        break;
-    case 1:
-        gen_aa32_ld16u(tmp, addr, get_mem_index(s));
-        break;
-    case 2:
-    case 3:
-        gen_aa32_ld32u(tmp, addr, get_mem_index(s));
-        break;
-    default:
-        abort();
-    }
-
-    val64 = tcg_temp_new_i64();
-    if (size == 3) {
-        TCGv_i32 tmp2 = tcg_temp_new_i32();
-        TCGv_i32 tmp3 = tcg_temp_new_i32();
-        tcg_gen_addi_i32(tmp2, addr, 4);
-        gen_aa32_ld32u(tmp3, tmp2, get_mem_index(s));
-        tcg_temp_free_i32(tmp2);
-        tcg_gen_concat_i32_i64(val64, tmp, tmp3);
-        tcg_temp_free_i32(tmp3);
-    } else {
-        tcg_gen_extu_i32_i64(val64, tmp);
-    }
-    tcg_temp_free_i32(tmp);
-
-    tcg_gen_brcond_i64(TCG_COND_NE, val64, cpu_exclusive_val, fail_label);
-    tcg_temp_free_i64(val64);
+    TCGv_i32 tmp2;
+    TCGv_i64 val = tcg_temp_new_i64();
+    TCGv_i32 tmp_size = tcg_const_i32(size);
 
     tmp = load_reg(s, rt);
-    switch (size) {
-    case 0:
-        gen_aa32_st8(tmp, addr, get_mem_index(s));
-        break;
-    case 1:
-        gen_aa32_st16(tmp, addr, get_mem_index(s));
-        break;
-    case 2:
-    case 3:
-        gen_aa32_st32(tmp, addr, get_mem_index(s));
-        break;
-    default:
-        abort();
-    }
+    tmp2 = load_reg(s, rt2);
+    tcg_gen_concat_i32_i64(val, tmp, tmp2);
     tcg_temp_free_i32(tmp);
-    if (size == 3) {
-        tcg_gen_addi_i32(addr, addr, 4);
-        tmp = load_reg(s, rt2);
-        gen_aa32_st32(tmp, addr, get_mem_index(s));
-        tcg_temp_free_i32(tmp);
-    }
-    tcg_gen_movi_i32(cpu_R[rd], 0);
-    tcg_gen_br(done_label);
-    gen_set_label(fail_label);
-    tcg_gen_movi_i32(cpu_R[rd], 1);
-    gen_set_label(done_label);
-    tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+    tcg_temp_free_i32(tmp2);
+
+    gen_helper_atomic_cmpxchg64(cpu_R[rd], cpu_env, addr, val, tmp_size);
+    tcg_temp_free_i64(val);
+    tcg_temp_free_i32(tmp_size);
 }
 #endif
 
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 13/19] add a callback when tb_invalidate is called.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (11 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 12/19] Use atomic cmpxchg to atomically check the exclusive value in a STREX fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 16:52   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 14/19] cpu: introduce tlb_flush*_all fred.konrad
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

Instead of doing the jump cache invalidation directly in tb_invalidate, delay it
until after the exit so we don't have another CPU trying to execute the code
being invalidated.
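
A minimal model of the deferral (hypothetical queue API, not the QEMU
code, and with the locking/kick elided): instead of clearing another
vCPU's jump cache directly, a work item is queued and the target vCPU runs
it itself once it is back outside translated code.

    #include <stddef.h>

    typedef struct WorkItem {
        void (*func)(void *data);
        void *data;
        struct WorkItem *next;
    } WorkItem;

    typedef struct VCPU {
        WorkItem *work;      /* would be protected by a work mutex */
    } VCPU;

    /* called by any thread: queue work, then kick the target vCPU */
    static void async_run_on_vcpu(VCPU *cpu, WorkItem *wi)
    {
        wi->next = cpu->work;
        cpu->work = wi;
        /* ... kick cpu so it leaves its execution loop ... */
    }

    /* run by the target vCPU thread between translation blocks */
    static void flush_queued_work(VCPU *cpu)
    {
        WorkItem *wi;

        while ((wi = cpu->work) != NULL) {
            cpu->work = wi->next;
            wi->func(wi->data);
        }
    }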

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 translate-all.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 954c67a..fc5162a 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -62,6 +62,7 @@
 #include "translate-all.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
+#include "sysemu/cpus.h"
 
 //#define DEBUG_TB_INVALIDATE
 //#define DEBUG_FLUSH
@@ -967,14 +968,58 @@ static inline void tb_reset_jump(TranslationBlock *tb, int n)
     tb_set_jmp_target(tb, n, (uintptr_t)(tb->tc_ptr + tb->tb_next_offset[n]));
 }
 
+struct CPUDiscardTBParams {
+    CPUState *cpu;
+    TranslationBlock *tb;
+};
+
+static void cpu_discard_tb_from_jmp_cache(void *opaque)
+{
+    unsigned int h;
+    struct CPUDiscardTBParams *params = opaque;
+
+    h = tb_jmp_cache_hash_func(params->tb->pc);
+    if (params->cpu->tb_jmp_cache[h] == params->tb) {
+        params->cpu->tb_jmp_cache[h] = NULL;
+    }
+
+    g_free(opaque);
+}
+
+static void tb_invalidate_jmp_remove(void *opaque)
+{
+    TranslationBlock *tb = opaque;
+    TranslationBlock *tb1, *tb2;
+    unsigned int n1;
+
+    /* suppress this TB from the two jump lists */
+    tb_jmp_remove(tb, 0);
+    tb_jmp_remove(tb, 1);
+
+    /* suppress any remaining jumps to this TB */
+    tb1 = tb->jmp_first;
+    for (;;) {
+        n1 = (uintptr_t)tb1 & 3;
+        if (n1 == 2) {
+            break;
+        }
+        tb1 = (TranslationBlock *)((uintptr_t)tb1 & ~3);
+        tb2 = tb1->jmp_next[n1];
+        tb_reset_jump(tb1, n1);
+        tb1->jmp_next[n1] = NULL;
+        tb1 = tb2;
+    }
+    tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
+}
+
 /* invalidate one TB */
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 {
     CPUState *cpu;
     PageDesc *p;
-    unsigned int h, n1;
+    unsigned int h;
     tb_page_addr_t phys_pc;
-    TranslationBlock *tb1, *tb2;
+    struct CPUDiscardTBParams *params;
 
     tb_lock();
 
@@ -997,6 +1042,9 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
 
+#if 0 /*MTTCG*/
+    TranslationBlock *tb1, *tb2;
+    unsigned int n1;
     /* remove the TB from the hash list */
     h = tb_jmp_cache_hash_func(tb->pc);
     CPU_FOREACH(cpu) {
@@ -1023,6 +1071,15 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
         tb1 = tb2;
     }
     tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
+#else
+    CPU_FOREACH(cpu) {
+        params = g_malloc(sizeof(struct CPUDiscardTBParams));
+        params->cpu = cpu;
+        params->tb = tb;
+        async_run_on_cpu(cpu, cpu_discard_tb_from_jmp_cache, params);
+    }
+    async_run_safe_work_on_cpu(first_cpu, tb_invalidate_jmp_remove, tb);
+#endif /* MTTCG */
 
     tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
     tb_unlock();
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 14/19] cpu: introduce tlb_flush*_all.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (12 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 13/19] add a callback when tb_invalidate is called fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 15:54   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 15/19] arm: use tlb_flush*_all fred.konrad
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

Some architectures allow flushing the TLB of other VCPUs. This is not a problem
when we have only one thread for all VCPUs, but it definitely needs to be done
as asynchronous work when we are truly multithreaded.

TODO: Add some test cases; I fear bad results in case a VCPU executes a barrier
      or something like that.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cputlb.c                | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/exec/exec-all.h |  2 ++
 2 files changed, 78 insertions(+)

diff --git a/cputlb.c b/cputlb.c
index 79fff1c..e5853fd 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -72,6 +72,45 @@ void tlb_flush(CPUState *cpu, int flush_global)
     tlb_flush_count++;
 }
 
+struct TLBFlushParams {
+    CPUState *cpu;
+    int flush_global;
+};
+
+static void tlb_flush_async_work(void *opaque)
+{
+    struct TLBFlushParams *params = opaque;
+
+    tlb_flush(params->cpu, params->flush_global);
+    g_free(params);
+}
+
+void tlb_flush_all(int flush_global)
+{
+    CPUState *cpu;
+    struct TLBFlushParams *params;
+
+#if 0 /* MTTCG */
+    CPU_FOREACH(cpu) {
+        tlb_flush(cpu, flush_global);
+    }
+#else
+    CPU_FOREACH(cpu) {
+        if (qemu_cpu_is_self(cpu)) {
+            /* async_run_on_cpu handles this case too, but doing it
+             * directly here just avoids a malloc.
+             */
+            tlb_flush(cpu, flush_global);
+        } else {
+            params = g_malloc(sizeof(struct TLBFlushParams));
+            params->cpu = cpu;
+            params->flush_global = flush_global;
+            async_run_on_cpu(cpu, tlb_flush_async_work, params);
+        }
+    }
+#endif /* MTTCG */
+}
+
 static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
 {
     if (addr == (tlb_entry->addr_read &
@@ -124,6 +163,43 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
     tb_flush_jmp_cache(cpu, addr);
 }
 
+struct TLBFlushPageParams {
+    CPUState *cpu;
+    target_ulong addr;
+};
+
+static void tlb_flush_page_async_work(void *opaque)
+{
+    struct TLBFlushPageParams *params = opaque;
+
+    tlb_flush_page(params->cpu, params->addr);
+    g_free(params);
+}
+
+void tlb_flush_page_all(target_ulong addr)
+{
+    CPUState *cpu;
+    struct TLBFlushPageParams *params;
+
+    CPU_FOREACH(cpu) {
+#if 0 /* !MTTCG */
+        tlb_flush_page(cpu, addr);
+#else
+        if (qemu_cpu_is_self(cpu)) {
+            /* async_run_on_cpu handles this case too, but doing it
+             * directly here just avoids a malloc.
+             */
+            tlb_flush_page(cpu, addr);
+        } else {
+            params = g_malloc(sizeof(struct TLBFlushPageParams));
+            params->cpu = cpu;
+            params->addr = addr;
+            async_run_on_cpu(cpu, tlb_flush_page_async_work, params);
+        }
+#endif /* MTTCG */
+    }
+}
+
 /* update the TLBs so that writes to code in the virtual page 'addr'
    can be detected */
 void tlb_protect_code(ram_addr_t ram_addr)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 9f1c1cb..e9512df 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -97,7 +97,9 @@ bool qemu_in_vcpu_thread(void);
 void cpu_reload_memory_map(CPUState *cpu);
 void tcg_cpu_address_space_init(CPUState *cpu, AddressSpace *as);
 /* cputlb.c */
+void tlb_flush_page_all(target_ulong addr);
 void tlb_flush_page(CPUState *cpu, target_ulong addr);
+void tlb_flush_all(int flush_global);
 void tlb_flush(CPUState *cpu, int flush_global);
 void tlb_set_page(CPUState *cpu, target_ulong vaddr,
                   hwaddr paddr, int prot,
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 15/19] arm: use tlb_flush*_all
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (13 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 14/19] cpu: introduce tlb_flush*_all fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe fred.konrad
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This just uses the new mechanism to ensure that each VCPU thread flushes its
own TLB.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 target-arm/helper.c | 45 +++++++--------------------------------------
 1 file changed, 7 insertions(+), 38 deletions(-)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index 34b465c..9acd7e5 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -411,41 +411,25 @@ static void tlbimvaa_write(CPUARMState *env, const ARMCPRegInfo *ri,
 static void tlbiall_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    CPUState *other_cs;
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush(other_cs, 1);
-    }
+    tlb_flush_all(1);
 }
 
 static void tlbiasid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    CPUState *other_cs;
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush(other_cs, value == 0);
-    }
+    tlb_flush_all(value == 0);
 }
 
 static void tlbimva_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    CPUState *other_cs;
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
-    }
+    tlb_flush_page_all(value & TARGET_PAGE_MASK);
 }
 
 static void tlbimvaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    CPUState *other_cs;
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
-    }
+    tlb_flush_page_all(value & TARGET_PAGE_MASK);
 }
 
 static const ARMCPRegInfo cp_reginfo[] = {
@@ -2281,34 +2265,19 @@ static void tlbi_aa64_asid_write(CPUARMState *env, const ARMCPRegInfo *ri,
 static void tlbi_aa64_va_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                   uint64_t value)
 {
-    CPUState *other_cs;
-    uint64_t pageaddr = sextract64(value << 12, 0, 56);
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush_page(other_cs, pageaddr);
-    }
+    tlb_flush_page_all(sextract64(value << 12, 0, 56));
 }
 
 static void tlbi_aa64_vaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                   uint64_t value)
 {
-    CPUState *other_cs;
-    uint64_t pageaddr = sextract64(value << 12, 0, 56);
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush_page(other_cs, pageaddr);
-    }
+    tlb_flush_page_all(sextract64(value << 12, 0, 56));
 }
 
 static void tlbi_aa64_asid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                   uint64_t value)
 {
-    CPUState *other_cs;
-    int asid = extract64(value, 48, 16);
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush(other_cs, asid == 0);
-    }
+    tlb_flush_all(extract64(value, 48, 16) == 0);
 }
 
 static CPAccessResult aa64_zva_access(CPUARMState *env, const ARMCPRegInfo *ri)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (14 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 15/19] arm: use tlb_flush*_all fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 16:26   ` Paolo Bonzini
  2015-08-12 14:09   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 17/19] translate-all: (wip) use tb_flush_safe when we can't alloc more tb fred.konrad
                   ` (5 subsequent siblings)
  21 siblings, 2 replies; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

tb_flush is not thread-safe; we definitely need all VCPUs to exit their
execution loop before doing it. This introduces tb_flush_safe, which just
creates an async safe work item that will do the tb_flush later.
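
A rough model of the "safe work" idea (illustrative only, not the QEMU
code): a count of running vCPUs gates the work, and the item runs only
once every vCPU has left its execution loop.

    #include <pthread.h>

    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  all_out = PTHREAD_COND_INITIALIZER;
    static int running_vcpus;

    /* each vCPU calls this when it leaves its execution loop */
    static void vcpu_exited_loop(void)
    {
        pthread_mutex_lock(&lk);
        if (--running_vcpus == 0) {
            pthread_cond_broadcast(&all_out);
        }
        pthread_mutex_unlock(&lk);
    }

    /* safe work (e.g. a tb_flush) runs only with no vCPU executing */
    static void run_safe_work(void (*func)(void *data), void *data)
    {
        pthread_mutex_lock(&lk);
        while (running_vcpus > 0) {
            pthread_cond_wait(&all_out, &lk);
        }
        func(data);
        pthread_mutex_unlock(&lk);
    }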

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 include/exec/exec-all.h |  1 +
 translate-all.c         | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index e9512df..246df68 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -200,6 +200,7 @@ struct TBContext {
 
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
+void tb_flush_safe(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
 
 #if defined(USE_DIRECT_JUMP)
diff --git a/translate-all.c b/translate-all.c
index fc5162a..7094bf0 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -818,6 +818,21 @@ static void page_flush_tb(void)
     }
 }
 
+static void tb_flush_work(void *opaque)
+{
+    CPUState *cpu = opaque;
+    tb_flush(cpu);
+}
+
+void tb_flush_safe(CPUState *cpu)
+{
+#if 0 /* !MTTCG */
+    tb_flush(cpu);
+#else
+    async_run_safe_work_on_cpu(cpu, tb_flush_work, cpu);
+#endif /* MTTCG */
+}
+
 /* flush all the translation blocks */
 /* XXX: tb_flush is currently not thread safe */
 void tb_flush(CPUState *cpu)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 17/19] translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (15 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 18/19] mttcg: signal the associated cpu anyway fred.konrad
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This changes just the tb_flush call made from tb_alloc.

TODO:
 * change the other tb_flush call sites.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 translate-all.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/translate-all.c b/translate-all.c
index 7094bf0..cabce75 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1148,7 +1148,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb = tb_alloc(pc);
     if (!tb) {
         /* flush must be done */
-        tb_flush(cpu);
+        tb_flush_safe(cpu);
         /* cannot fail at this point */
         tb = tb_alloc(pc);
         /* Don't forget to invalidate previous TB info.  */
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 18/19] mttcg: signal the associated cpu anyway.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (16 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 17/19] translate-all: (wip) use tb_flush_safe when we can't alloc more tb fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 15:51   ` Paolo Bonzini
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 19/19] target-arm/psci.c: wake up sleeping CPUs (MTTCG) fred.konrad
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

We might have a race here: if current_cpu is about to be set, then cpu_exit
won't be called and we don't exit TCG. This was probably an issue with the old
implementation as well.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpus.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/cpus.c b/cpus.c
index 2c5ca72..f61530c 100644
--- a/cpus.c
+++ b/cpus.c
@@ -674,8 +674,7 @@ static void cpu_signal(int sig)
         cpu_exit(current_cpu);
     }
 
-    /* FIXME: We might want to check if the cpu is running? */
-    tcg_thread_cpu->exit_request = true;
+    cpu_exit(tcg_thread_cpu);
 }
 
 #ifdef CONFIG_LINUX
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH V7 19/19] target-arm/psci.c: wake up sleeping CPUs (MTTCG)
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (17 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 18/19] mttcg: signal the associated cpu anyway fred.konrad
@ 2015-08-10 15:27 ` fred.konrad
  2015-08-10 16:41   ` Paolo Bonzini
  2015-08-10 18:34 ` [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG Alex Bennée
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 81+ messages in thread
From: fred.konrad @ 2015-08-10 15:27 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: Alex Bennée <alex.bennee@linaro.org>

Testing with Alexander's bare metal synchronisation tests fails in MTTCG
leaving one CPU spinning forever waiting for the second CPU to wake up.
We simply need to poke the halt_cond once we have processed the PSCI
power on call.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
CC: Alexander Spyridakis <a.spyridakis@virtualopensystems.com>
---
 target-arm/psci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/target-arm/psci.c b/target-arm/psci.c
index 20e4cb6..83e309c 100644
--- a/target-arm/psci.c
+++ b/target-arm/psci.c
@@ -211,6 +211,8 @@ void arm_handle_psci_call(ARMCPU *cpu)
         }
         target_cpu_class->set_pc(target_cpu_state, entry);
 
+        qemu_cond_signal(target_cpu_state->halt_cond);
+
         ret = 0;
         break;
     case QEMU_PSCI_0_1_FN_CPU_OFF:
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 10/19] cpu: remove exit_request global.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 10/19] cpu: remove exit_request global fred.konrad
@ 2015-08-10 15:51   ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 15:51 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>  {
>      if (current_cpu) {
>          cpu_exit(current_cpu);
>      }
> -    exit_request = 1;
> +
> +    /* FIXME: We might want to check if the cpu is running? */
> +    tcg_thread_cpu->exit_request = true;
>  }
>  
>  #ifdef CONFIG_LINUX
> @@ -1151,6 +1159,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>      CPUState *cpu = arg;
>  
>      qemu_mutex_lock_iothread();
> +    tcg_thread_cpu = cpu;
>      qemu_tcg_init_cpu_signals();
>      qemu_thread_get_self(cpu->thread);

This only makes sense for MTTCG, so it should be squashed in patch 11, I
think.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 18/19] mttcg: signal the associated cpu anyway.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 18/19] mttcg: signal the associated cpu anyway fred.konrad
@ 2015-08-10 15:51   ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 15:51 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> diff --git a/cpus.c b/cpus.c
> index 2c5ca72..f61530c 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -674,8 +674,7 @@ static void cpu_signal(int sig)
>          cpu_exit(current_cpu);
>      }
>  
> -    /* FIXME: We might want to check if the cpu is running? */
> -    tcg_thread_cpu->exit_request = true;
> +    cpu_exit(tcg_thread_cpu);

If you do this, you can remove the first "if" too, because current_cpu
is always either tcg_thread_cpu or NULL.

I think it's okay to do that and squash this patch into patch 11 as well.
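
For illustration, the handler would then reduce to something like this (a
sketch only, assuming the tcg_thread_cpu variable introduced in patch 10):

    static void cpu_signal(int sig)
    {
        /* current_cpu is either tcg_thread_cpu or NULL at this point,
         * so a single cpu_exit() on the TCG thread's CPU covers both
         * cases.  */
        cpu_exit(tcg_thread_cpu);
    }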

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 14/19] cpu: introduce tlb_flush*_all.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 14/19] cpu: introduce tlb_flush*_all fred.konrad
@ 2015-08-10 15:54   ` Paolo Bonzini
  2015-08-10 16:00     ` Peter Maydell
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 15:54 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> Some architectures allow flushing the TLB of other VCPUs. This is not a
> problem when we have only one thread for all VCPUs, but it definitely needs
> to be asynchronous work when we are truly multithreaded.
>
> TODO: Some test cases; I fear some bad results in case a VCPU executes a
>       barrier or something like that.
> 
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cputlb.c                | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
>  include/exec/exec-all.h |  2 ++
>  2 files changed, 78 insertions(+)

I still believe this should be a target-specific change.  This would
also make it easier to do the remote TLB flush synchronously, as is the
case on ARM (if I understand correctly).

Paolo

> diff --git a/cputlb.c b/cputlb.c
> index 79fff1c..e5853fd 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -72,6 +72,45 @@ void tlb_flush(CPUState *cpu, int flush_global)
>      tlb_flush_count++;
>  }
>  
> +struct TLBFlushParams {
> +    CPUState *cpu;
> +    int flush_global;
> +};
> +
> +static void tlb_flush_async_work(void *opaque)
> +{
> +    struct TLBFlushParams *params = opaque;
> +
> +    tlb_flush(params->cpu, params->flush_global);
> +    g_free(params);
> +}
> +
> +void tlb_flush_all(int flush_global)
> +{
> +    CPUState *cpu;
> +    struct TLBFlushParams *params;
> +
> +#if 0 /* MTTCG */
> +    CPU_FOREACH(cpu) {
> +        tlb_flush(cpu, flush_global);
> +    }
> +#else
> +    CPU_FOREACH(cpu) {
> +        if (qemu_cpu_is_self(cpu)) {
> +            /* async_run_on_cpu handles this case too, but doing it here
> +             * just avoids a malloc.
> +             */
> +            tlb_flush(cpu, flush_global);
> +        } else {
> +            params = g_malloc(sizeof(struct TLBFlushParams));
> +            params->cpu = cpu;
> +            params->flush_global = flush_global;
> +            async_run_on_cpu(cpu, tlb_flush_async_work, params);
> +        }
> +    }
> +#endif /* MTTCG */
> +}
> +
>  static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
>  {
>      if (addr == (tlb_entry->addr_read &
> @@ -124,6 +163,43 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
>      tb_flush_jmp_cache(cpu, addr);
>  }
>  
> +struct TLBFlushPageParams {
> +    CPUState *cpu;
> +    target_ulong addr;
> +};
> +
> +static void tlb_flush_page_async_work(void *opaque)
> +{
> +    struct TLBFlushPageParams *params = opaque;
> +
> +    tlb_flush_page(params->cpu, params->addr);
> +    g_free(params);
> +}
> +
> +void tlb_flush_page_all(target_ulong addr)
> +{
> +    CPUState *cpu;
> +    struct TLBFlushPageParams *params;
> +
> +    CPU_FOREACH(cpu) {
> +#if 0 /* !MTTCG */
> +        tlb_flush_page(cpu, addr);
> +#else
> +        if (qemu_cpu_is_self(cpu)) {
> +            /* async_run_on_cpu handles this case too, but doing it here
> +             * just avoids a malloc.
> +             */
> +            tlb_flush_page(cpu, addr);
> +        } else {
> +            params = g_malloc(sizeof(struct TLBFlushPageParams));
> +            params->cpu = cpu;
> +            params->addr = addr;
> +            async_run_on_cpu(cpu, tlb_flush_page_async_work, params);
> +        }
> +#endif /* MTTCG */
> +    }
> +}
> +
>  /* update the TLBs so that writes to code in the virtual page 'addr'
>     can be detected */
>  void tlb_protect_code(ram_addr_t ram_addr)
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 9f1c1cb..e9512df 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -97,7 +97,9 @@ bool qemu_in_vcpu_thread(void);
>  void cpu_reload_memory_map(CPUState *cpu);
>  void tcg_cpu_address_space_init(CPUState *cpu, AddressSpace *as);
>  /* cputlb.c */
> +void tlb_flush_page_all(target_ulong addr);
>  void tlb_flush_page(CPUState *cpu, target_ulong addr);
> +void tlb_flush_all(int flush_global);
>  void tlb_flush(CPUState *cpu, int flush_global);
>  void tlb_set_page(CPUState *cpu, target_ulong vaddr,
>                    hwaddr paddr, int prot,
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex.
  2015-08-10 15:26 ` [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex fred.konrad
@ 2015-08-10 15:59   ` Paolo Bonzini
  2015-08-10 16:04     ` Frederic Konrad
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 15:59 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 17:26, fred.konrad@greensocs.com wrote:
>  
> +    qemu_mutex_lock(&cpu->work_mutex);
>      while ((wi = cpu->queued_work_first)) {
>          cpu->queued_work_first = wi->next;
> +        qemu_mutex_unlock(&cpu->work_mutex);
>          wi->func(wi->data);
> +        qemu_mutex_lock(&cpu->work_mutex);
>          wi->done = true;

This should be atomic_mb_set

>          if (wi->free) {
>              g_free(wi);
>          }
>      }
>      cpu->queued_work_last = NULL;

... and I'm a bit afraid of leaving the state of the list inconsistent,
so I'd move this after the cpu->queued_work_first assignment.  Otherwise
the patch looks good, I'm queuing it for 2.5.
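
Concretely, the loop would become something like this (a sketch combining
both suggestions, not the exact queued version):

    qemu_mutex_lock(&cpu->work_mutex);
    while ((wi = cpu->queued_work_first)) {
        cpu->queued_work_first = wi->next;
        if (!cpu->queued_work_first) {
            /* keep the list consistent while the mutex is dropped */
            cpu->queued_work_last = NULL;
        }
        qemu_mutex_unlock(&cpu->work_mutex);
        wi->func(wi->data);
        qemu_mutex_lock(&cpu->work_mutex);
        /* store with barrier, paired with the read in run_on_cpu() */
        atomic_mb_set(&wi->done, true);
        if (wi->free) {
            g_free(wi);
        }
    }
    qemu_mutex_unlock(&cpu->work_mutex);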

Paolo

> +    qemu_mutex_unlock(&cpu->work_mutex);
> +

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 14/19] cpu: introduce tlb_flush*_all.
  2015-08-10 15:54   ` Paolo Bonzini
@ 2015-08-10 16:00     ` Peter Maydell
  0 siblings, 0 replies; 81+ messages in thread
From: Peter Maydell @ 2015-08-10 16:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Mark Burton, Alvise Rigo, QEMU Developers,
	Guillaume Delbergue, Alex Bennée, KONRAD Frédéric

On 10 August 2015 at 16:54, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> Some architectures allow flushing the TLB of other VCPUs. This is not a
>> problem when we have only one thread for all VCPUs, but it definitely needs
>> to be asynchronous work when we are truly multithreaded.
>>
>> TODO: Some test cases; I fear some bad results in case a VCPU executes a
>>       barrier or something like that.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>  cputlb.c                | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/exec/exec-all.h |  2 ++
>>  2 files changed, 78 insertions(+)
>
> I still believe this should be a target-specific change.  This would
> also make it easier to do the remote TLB flush synchronously, as is the
> case on ARM (if I understand correctly).

ARM TLB flushes have to complete by the next barrier instruction
(or equivalent thing); so they're asynchronous but with a guest-controlled
synchronization point.

Also, compare the series I posted recently for adding missing
TLB operations:
https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg00945.html
which adds support for flush-specific-mmuidx operations, which would
increase the number of primitives you're trying to support here.
That might argue for making this target-specific.
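
For example, each new primitive would need its own params struct and async
worker along these lines (a sketch only, assuming a tlb_flush_by_mmuidx()
primitive like the one in that series):

    struct TLBFlushMMUIdxParams {
        CPUState *cpu;
        int mmu_idx;
    };

    static void tlb_flush_by_mmuidx_async_work(void *opaque)
    {
        struct TLBFlushMMUIdxParams *p = opaque;

        /* per-mmuidx flush as proposed in that series */
        tlb_flush_by_mmuidx(p->cpu, p->mmu_idx);
        g_free(p);
    }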

thanks
-- PMM

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex.
  2015-08-10 15:59   ` Paolo Bonzini
@ 2015-08-10 16:04     ` Frederic Konrad
  2015-08-10 16:06       ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-10 16:04 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue

On 10/08/2015 17:59, Paolo Bonzini wrote:
>
> On 10/08/2015 17:26, fred.konrad@greensocs.com wrote:
>>   
>> +    qemu_mutex_lock(&cpu->work_mutex);
>>       while ((wi = cpu->queued_work_first)) {
>>           cpu->queued_work_first = wi->next;
>> +        qemu_mutex_unlock(&cpu->work_mutex);
>>           wi->func(wi->data);
>> +        qemu_mutex_lock(&cpu->work_mutex);
>>           wi->done = true;
> This should be atomic_mb_set

Isn't that protected by the mutex? Or maybe it's used somewhere else?
>
>>           if (wi->free) {
>>               g_free(wi);
>>           }
>>       }
>>       cpu->queued_work_last = NULL;
> ... and I'm a bit afraid of leaving the state of the list inconsistent,
> so I'd move this after the cpu->queued_work_first assignment.  Otherwise
> the patch looks good, I'm queuing it for 2.5.
>
> Paolo
>
>> +    qemu_mutex_unlock(&cpu->work_mutex);
>> +

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex.
  2015-08-10 16:04     ` Frederic Konrad
@ 2015-08-10 16:06       ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:06 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 18:04, Frederic Konrad wrote:
> On 10/08/2015 17:59, Paolo Bonzini wrote:
>>
>> On 10/08/2015 17:26, fred.konrad@greensocs.com wrote:
>>>   +    qemu_mutex_lock(&cpu->work_mutex);
>>>       while ((wi = cpu->queued_work_first)) {
>>>           cpu->queued_work_first = wi->next;
>>> +        qemu_mutex_unlock(&cpu->work_mutex);
>>>           wi->func(wi->data);
>>> +        qemu_mutex_lock(&cpu->work_mutex);
>>>           wi->done = true;
>> This should be atomic_mb_set
> 
> Isn't that protected by the mutex?

This use is not protected by the mutex:

@@ -853,6 +855,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     cpu->queued_work_last = &wi;
     wi.next = NULL;
     wi.done = false;
+    qemu_mutex_unlock(&cpu->work_mutex);
 
     qemu_cpu_kick(cpu);
     while (!wi.done) {
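
The pairing being asked for is roughly this (a sketch using the helpers
from qemu/atomic.h):

    /* consumer side, in flush_queued_work(), instead of a plain store: */
    atomic_mb_set(&wi->done, true);

    /* producer side, in run_on_cpu(), instead of a plain load: */
    while (!atomic_mb_read(&wi.done)) {
        qemu_cond_wait(&qemu_work_cond, &qemu_global_mutex);
    }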

Paolo

> Or maybe it's used somewhere else?
>>
>>>           if (wi->free) {
>>>               g_free(wi);
>>>           }
>>>       }
>>>       cpu->queued_work_last = NULL;
>> ... and I'm a bit afraid of leaving the state of the list inconsistent,
>> so I'd move this after the cpu->queued_work_first assignment.  Otherwise
>> the patch looks good, I'm queuing it for 2.5.
>>
>> Paolo
>>
>>> +    qemu_mutex_unlock(&cpu->work_mutex);
>>> +
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 04/19] replace spinlock by QemuMutex.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 04/19] replace spinlock by QemuMutex fred.konrad
@ 2015-08-10 16:09   ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:09 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> spinlock is only used in two cases:
>   * cpu-exec.c: to protect TranslationBlock
>   * mem_helper.c: for lock helper in target-i386 (which seems broken).
> 
> It's a pthread_mutex_t in user-mode, so it is better to use QemuMutex
> directly in this case.
> This also allows us to reuse the tb_lock mutex of TBContext in the
> multithreaded TCG case.
> 
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpu-exec.c               | 15 +++++++++++----
>  include/exec/exec-all.h  |  4 ++--
>  linux-user/main.c        |  6 +++---
>  target-i386/mem_helper.c | 16 +++++++++++++---
>  tcg/i386/tcg-target.c    |  8 ++++++++
>  5 files changed, 37 insertions(+), 12 deletions(-)
> 
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 97805cc..f3358a9 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -361,7 +361,9 @@ int cpu_exec(CPUState *cpu)
>      SyncClocks sc;
>  
>      /* This must be volatile so it is not trashed by longjmp() */
> +#if defined(CONFIG_USER_ONLY)
>      volatile bool have_tb_lock = false;
> +#endif
>  
>      if (async_safe_work_pending()) {
>          cpu->exit_request = 1;
> @@ -488,8 +490,10 @@ int cpu_exec(CPUState *cpu)
>                      cpu->exception_index = EXCP_INTERRUPT;
>                      cpu_loop_exit(cpu);
>                  }
> -                spin_lock(&tcg_ctx.tb_ctx.tb_lock);
> +#if defined(CONFIG_USER_ONLY)
> +                qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
>                  have_tb_lock = true;
> +#endif
>                  tb = tb_find_fast(cpu);
>                  /* Note: we do it here to avoid a gcc bug on Mac OS X when
>                     doing it in tb_find_slow */
> @@ -511,9 +515,10 @@ int cpu_exec(CPUState *cpu)
>                      tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
>                                  next_tb & TB_EXIT_MASK, tb);
>                  }
> +#if defined(CONFIG_USER_ONLY)
>                  have_tb_lock = false;
> -                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
> -
> +                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +#endif
>                  /* cpu_interrupt might be called while translating the
>                     TB, but before it is linked into a potentially
>                     infinite loop and becomes env->current_tb. Avoid
> @@ -580,10 +585,12 @@ int cpu_exec(CPUState *cpu)
>              x86_cpu = X86_CPU(cpu);
>              env = &x86_cpu->env;
>  #endif
> +#if defined(CONFIG_USER_ONLY)
>              if (have_tb_lock) {
> -                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>                  have_tb_lock = false;
>              }
> +#endif
>          }
>      } /* for(;;) */
>  
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index a6fce04..55a6ff2 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -176,7 +176,7 @@ struct TranslationBlock {
>      struct TranslationBlock *jmp_first;
>  };
>  
> -#include "exec/spinlock.h"
> +#include "qemu/thread.h"
>  
>  typedef struct TBContext TBContext;
>  
> @@ -186,7 +186,7 @@ struct TBContext {
>      TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
>      int nb_tbs;
>      /* any access to the tbs or the page table must use this lock */
> -    spinlock_t tb_lock;
> +    QemuMutex tb_lock;
>  
>      /* statistics */
>      int tb_flush_count;
> diff --git a/linux-user/main.c b/linux-user/main.c
> index 05914b1..20e7199 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -107,7 +107,7 @@ static int pending_cpus;
>  /* Make sure everything is in a consistent state for calling fork().  */
>  void fork_start(void)
>  {
> -    pthread_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
> +    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
>      pthread_mutex_lock(&exclusive_lock);
>      mmap_fork_start();
>  }
> @@ -129,11 +129,11 @@ void fork_end(int child)
>          pthread_mutex_init(&cpu_list_mutex, NULL);
>          pthread_cond_init(&exclusive_cond, NULL);
>          pthread_cond_init(&exclusive_resume, NULL);
> -        pthread_mutex_init(&tcg_ctx.tb_ctx.tb_lock, NULL);
> +        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>          gdbserver_fork(thread_cpu);
>      } else {
>          pthread_mutex_unlock(&exclusive_lock);
> -        pthread_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>      }
>  }
>  
> diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
> index 1aec8a5..7106cc3 100644
> --- a/target-i386/mem_helper.c
> +++ b/target-i386/mem_helper.c
> @@ -23,17 +23,27 @@
>  
>  /* broken thread support */
>  
> -static spinlock_t global_cpu_lock = SPIN_LOCK_UNLOCKED;
> +#if defined(CONFIG_USER_ONLY)
> +QemuMutex global_cpu_lock;
>  
>  void helper_lock(void)
>  {
> -    spin_lock(&global_cpu_lock);
> +    qemu_mutex_lock(&global_cpu_lock);
>  }
>  
>  void helper_unlock(void)
>  {
> -    spin_unlock(&global_cpu_lock);
> +    qemu_mutex_unlock(&global_cpu_lock);
>  }
> +#else
> +void helper_lock(void)
> +{
> +}
> +
> +void helper_unlock(void)
> +{
> +}
> +#endif
>  
>  void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
>  {
> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
> index ff4d9cf..0d7c99c 100644
> --- a/tcg/i386/tcg-target.c
> +++ b/tcg/i386/tcg-target.c
> @@ -24,6 +24,10 @@
>  
>  #include "tcg-be-ldst.h"
>  
> +#if defined(CONFIG_USER_ONLY)
> +extern QemuMutex global_cpu_lock;

This should be in target-i386/, not tcg/i386.
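
i.e. keep both the definition and the init next to the helpers they
protect, something like (a sketch; x86_mem_helper_init is a hypothetical
init hook):

    /* in target-i386/mem_helper.c, not in tcg/i386/tcg-target.c: */
    #if defined(CONFIG_USER_ONLY)
    QemuMutex global_cpu_lock;

    void x86_mem_helper_init(void)
    {
        qemu_mutex_init(&global_cpu_lock);
    }
    #endif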

With this change, I think it's okay to put this patch and patch 5 in,
separately from the rest of the MTTCG work.

Paolo

> +#endif
> +
>  #ifndef NDEBUG
>  static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>  #if TCG_TARGET_REG_BITS == 64
> @@ -2342,6 +2346,10 @@ static void tcg_target_init(TCGContext *s)
>      tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
>  
>      tcg_add_target_add_op_defs(x86_op_defs);
> +
> +#if defined(CONFIG_USER_ONLY)
> +    qemu_mutex_init(&global_cpu_lock);
> +#endif
>  }
>  
>  typedef struct {
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 06/19] add support for spin lock on POSIX systems exclusively
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 06/19] add support for spin lock on POSIX systems exclusively fred.konrad
@ 2015-08-10 16:10   ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:10 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
> 
> WARNING: spin lock is currently not implemented on WIN32
> 
> Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>

Should go before patch 12 (and IIUC Alvise's work should go in first
anyway).

Paolo

> ---
>  include/qemu/thread-posix.h |  4 ++++
>  include/qemu/thread-win32.h |  4 ++++
>  include/qemu/thread.h       |  7 +++++++
>  util/qemu-thread-posix.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  util/qemu-thread-win32.c    | 30 ++++++++++++++++++++++++++++++
>  5 files changed, 90 insertions(+)
> 
> diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
> index eb5c7a1..8ce8f01 100644
> --- a/include/qemu/thread-posix.h
> +++ b/include/qemu/thread-posix.h
> @@ -7,6 +7,10 @@ struct QemuMutex {
>      pthread_mutex_t lock;
>  };
>  
> +struct QemuSpin {
> +    pthread_spinlock_t lock;
> +};
> +
>  struct QemuCond {
>      pthread_cond_t cond;
>  };
> diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
> index 3d58081..310c8bd 100644
> --- a/include/qemu/thread-win32.h
> +++ b/include/qemu/thread-win32.h
> @@ -7,6 +7,10 @@ struct QemuMutex {
>      LONG owner;
>  };
>  
> +struct QemuSpin {
> +    PKSPIN_LOCK lock;
> +};
> +
>  struct QemuCond {
>      LONG waiters, target;
>      HANDLE sema;
> diff --git a/include/qemu/thread.h b/include/qemu/thread.h
> index 5114ec8..f5d1259 100644
> --- a/include/qemu/thread.h
> +++ b/include/qemu/thread.h
> @@ -5,6 +5,7 @@
>  #include <stdbool.h>
>  
>  typedef struct QemuMutex QemuMutex;
> +typedef struct QemuSpin QemuSpin;
>  typedef struct QemuCond QemuCond;
>  typedef struct QemuSemaphore QemuSemaphore;
>  typedef struct QemuEvent QemuEvent;
> @@ -25,6 +26,12 @@ void qemu_mutex_lock(QemuMutex *mutex);
>  int qemu_mutex_trylock(QemuMutex *mutex);
>  void qemu_mutex_unlock(QemuMutex *mutex);
>  
> +void qemu_spin_init(QemuSpin *spin);
> +void qemu_spin_destroy(QemuSpin *spin);
> +void qemu_spin_lock(QemuSpin *spin);
> +int qemu_spin_trylock(QemuSpin *spin);
> +void qemu_spin_unlock(QemuSpin *spin);
> +
>  void qemu_cond_init(QemuCond *cond);
>  void qemu_cond_destroy(QemuCond *cond);
>  
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index ba67cec..224bacc 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -89,6 +89,51 @@ void qemu_mutex_unlock(QemuMutex *mutex)
>          error_exit(err, __func__);
>  }
>  
> +void qemu_spin_init(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_init(&spin->lock, 0);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +}
> +
> +void qemu_spin_destroy(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_destroy(&spin->lock);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +}
> +
> +void qemu_spin_lock(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_lock(&spin->lock);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +}
> +
> +int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    return pthread_spin_trylock(&spin->lock);
> +}
> +
> +void qemu_spin_unlock(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_unlock(&spin->lock);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +}
> +
>  void qemu_cond_init(QemuCond *cond)
>  {
>      int err;
> diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
> index 406b52f..6fbe6a8 100644
> --- a/util/qemu-thread-win32.c
> +++ b/util/qemu-thread-win32.c
> @@ -80,6 +80,36 @@ void qemu_mutex_unlock(QemuMutex *mutex)
>      LeaveCriticalSection(&mutex->lock);
>  }
>  
> +void qemu_spin_init(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
> +void qemu_spin_destroy(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
> +void qemu_spin_lock(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
> +int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
> +void qemu_spin_unlock(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
>  void qemu_cond_init(QemuCond *cond)
>  {
>      memset(cond, 0, sizeof(*cond));
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 08/19] tcg: remove tcg_halt_cond global variable.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 08/19] tcg: remove tcg_halt_cond global variable fred.konrad
@ 2015-08-10 16:12   ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:12 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> This removes the tcg_halt_cond global variable.
> We need one QemuCond per virtual CPU for multithreaded TCG.
> 
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpus.c | 18 +++++++-----------
>  1 file changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index 2250296..2550be2 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -815,7 +815,6 @@ static unsigned iothread_requesting_mutex;
>  static QemuThread io_thread;
>  
>  static QemuThread *tcg_cpu_thread;
> -static QemuCond *tcg_halt_cond;
>  
>  /* cpu creation */
>  static QemuCond qemu_cpu_cond;
> @@ -1038,15 +1037,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
>      cpu->thread_kicked = false;
>  }
>  
> -static void qemu_tcg_wait_io_event(void)
> +static void qemu_tcg_wait_io_event(CPUState *cpu)
>  {
> -    CPUState *cpu;
> -
>      while (all_cpu_threads_idle()) {
>         /* Start accounting real time to the virtual clock if the CPUs
>            are idle.  */
>          qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
> -        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
> +        qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
>      }
>  
>      while (iothread_requesting_mutex) {
> @@ -1166,7 +1163,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>  
>      /* wait for initial kick-off after machine start */
>      while (first_cpu->stopped) {
> -        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
> +        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
>  
>          /* process any pending work */
>          CPU_FOREACH(cpu) {
> @@ -1187,7 +1184,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>                  qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
>              }
>          }
> -        qemu_tcg_wait_io_event();
> +        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
>      }
>  
>      return NULL;
> @@ -1328,12 +1325,12 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
>  
>      tcg_cpu_address_space_init(cpu, cpu->as);
>  
> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> +    qemu_cond_init(cpu->halt_cond);
> +
>      /* share a single thread for all cpus with TCG */
>      if (!tcg_cpu_thread) {
>          cpu->thread = g_malloc0(sizeof(QemuThread));
> -        cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> -        qemu_cond_init(cpu->halt_cond);
> -        tcg_halt_cond = cpu->halt_cond;
>          snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
>                   cpu->cpu_index);
>          qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn,
> @@ -1347,7 +1344,6 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
>          tcg_cpu_thread = cpu->thread;
>      } else {
>          cpu->thread = tcg_cpu_thread;
> -        cpu->halt_cond = tcg_halt_cond;
>      }
>  }
>  
> 

This should be squashed in "tcg: switch on multithread".

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution fred.konrad
@ 2015-08-10 16:15   ` Paolo Bonzini
  2015-08-11  6:55     ` Frederic Konrad
  2015-08-11 20:12     ` Alex Bennée
  0 siblings, 2 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:15 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>  void qemu_mutex_lock_iothread(void)
>  {
> -    atomic_inc(&iothread_requesting_mutex);
> -    /* In the simple case there is no need to bump the VCPU thread out of
> -     * TCG code execution.
> -     */
> -    if (!tcg_enabled() || qemu_in_vcpu_thread() ||
> -        !first_cpu || !first_cpu->thread) {
> -        qemu_mutex_lock(&qemu_global_mutex);
> -        atomic_dec(&iothread_requesting_mutex);
> -    } else {
> -        if (qemu_mutex_trylock(&qemu_global_mutex)) {
> -            qemu_cpu_kick_thread(first_cpu);
> -            qemu_mutex_lock(&qemu_global_mutex);
> -        }
> -        atomic_dec(&iothread_requesting_mutex);
> -        qemu_cond_broadcast(&qemu_io_proceeded_cond);
> -    }
> -    iothread_locked = true;

"iothread_locked = true" must be kept.  Otherwise... yay! :)

> @@ -125,8 +128,10 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
>     can be detected */
>  void tlb_protect_code(ram_addr_t ram_addr)
>  {
> +    qemu_mutex_lock_iothread();
>      cpu_physical_memory_test_and_clear_dirty(ram_addr, TARGET_PAGE_SIZE,
>                                               DIRTY_MEMORY_CODE);
> +    qemu_mutex_unlock_iothread();
>  }
>  

Not needed anymore.

> diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
> index 52c5d65..55f63bf 100644
> --- a/target-i386/misc_helper.c
> +++ b/target-i386/misc_helper.c

None of this is needed anymore either! :)

> +    /*
> +     * Some devices' reset handlers need to grab the global_mutex, so just
> +     * release it here.
> +     */
> +    qemu_mutex_unlock_iothread();
>      /* reset all devices */
>      QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
>          re->func(re->opaque);
>      }
> +    qemu_mutex_lock_iothread();

Should never have been true?  (And, I think, it was pointed out in a
previous version too).

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe fred.konrad
@ 2015-08-10 16:26   ` Paolo Bonzini
  2015-08-12 14:09   ` Paolo Bonzini
  1 sibling, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:26 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> +
> +void tb_flush_safe(CPUState *cpu)
> +{
> +#if 0 /* !MTTCG */
> +    tb_flush(cpu);
> +#else
> +    async_run_safe_work_on_cpu(cpu, tb_flush_work, cpu);
> +#endif /* MTTCG */
> +}
> +

I think this can use first_cpu unconditionally; tb_flush only uses its
argument for an error message.

In fact, I think that by definition async_run_safe_work_on_cpu can use
any CPU for its work; it locks out everyone else, so it does not matter
which thread you're on.  So async_run_safe_work_on_cpu could drop the
@cpu argument (becoming async_run_safe_cpu_work) and use first_cpu
unconditionally.
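
In other words, something like this (a sketch of the suggested
simplification):

    void tb_flush_safe(void)
    {
        /* Safe work locks out every VCPU before the function runs, so
         * any CPU's queue will do; use first_cpu unconditionally.  */
        async_run_safe_work_on_cpu(first_cpu, tb_flush_work, first_cpu);
    }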

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock fred.konrad
@ 2015-08-10 16:36   ` Paolo Bonzini
  2015-08-10 16:50     ` Paolo Bonzini
  2015-08-11  6:46     ` Frederic Konrad
  2015-08-12 17:45   ` Frederic Konrad
  1 sibling, 2 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:36 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> diff --git a/cpu-exec.c b/cpu-exec.c
> index f3358a9..a012e9d 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -131,6 +131,8 @@ static void init_delay_params(SyncClocks *sc, const CPUState *cpu)
>  void cpu_loop_exit(CPUState *cpu)
>  {
>      cpu->current_tb = NULL;
> +    /* Release the mutex before the longjmp so other threads can work. */
> +    tb_lock_reset();
>      siglongjmp(cpu->jmp_env, 1);
>  }
>  
> @@ -143,6 +145,8 @@ void cpu_resume_from_signal(CPUState *cpu, void *puc)
>      /* XXX: restore cpu registers saved in host registers */
>  
>      cpu->exception_index = -1;
> +    /* Release the mutex before the longjmp so other threads can work. */
> +    tb_lock_reset();
>      siglongjmp(cpu->jmp_env, 1);
>  }
>  

I think you should start easy and reuse the existing tb_lock code in
cpu-exec.c:

diff --git a/cpu-exec.c b/cpu-exec.c
index 9305f03..2909ec2 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -307,7 +307,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu, target_ulong pc,
 
     tb = tb_find_physical(cpu, pc, cs_base, flags);
     if (!tb) {
-        tb_lock();
         /*
          * Retry to get the TB in case a CPU just translate it to avoid having
          * duplicated TB in the pool.
@@ -316,7 +315,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu, target_ulong pc,
         if (!tb) {
             tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
         }
-        tb_unlock();
     }
     /* we add the TB in the virtual pc hash table */
     cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
@@ -372,11 +372,6 @@ int cpu_exec(CPUState *cpu)
     uintptr_t next_tb;
     SyncClocks sc;
 
-    /* This must be volatile so it is not trashed by longjmp() */
-#if defined(CONFIG_USER_ONLY)
-    volatile bool have_tb_lock = false;
-#endif
-
     if (cpu->halted) {
         if (!cpu_has_work(cpu)) {
             return EXCP_HALTED;
@@ -480,10 +475,7 @@ int cpu_exec(CPUState *cpu)
                     cpu->exception_index = EXCP_INTERRUPT;
                     cpu_loop_exit(cpu);
                 }
-#if defined(CONFIG_USER_ONLY)
-                qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
-                have_tb_lock = true;
-#endif
+                tb_lock();
                 tb = tb_find_fast(cpu);
                 /* Note: we do it here to avoid a gcc bug on Mac OS X when
                    doing it in tb_find_slow */
@@ -505,10 +497,7 @@ int cpu_exec(CPUState *cpu)
                     tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
                                 next_tb & TB_EXIT_MASK, tb);
                 }
-#if defined(CONFIG_USER_ONLY)
-                have_tb_lock = false;
-                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
-#endif
+                tb_unlock();
                 /* cpu_interrupt might be called while translating the
                    TB, but before it is linked into a potentially
                    infinite loop and becomes env->current_tb. Avoid
@@ -575,12 +564,7 @@ int cpu_exec(CPUState *cpu)
             x86_cpu = X86_CPU(cpu);
             env = &x86_cpu->env;
 #endif
-#if defined(CONFIG_USER_ONLY)
-            if (have_tb_lock) {
-                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
-                have_tb_lock = false;
-            }
-#endif
+            tb_lock_reset();
         }
     } /* for(;;) */
 

Optimizations should then come on top.

> diff --git a/target-arm/translate.c b/target-arm/translate.c
> index 69ac18c..960c75e 100644
> --- a/target-arm/translate.c
> +++ b/target-arm/translate.c
> @@ -11166,6 +11166,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
>  
>      dc->tb = tb;
>  
> +    tb_lock();

This locks twice, I think?  Both cpu_restore_state_from_tb and 
tb_gen_code (which calls cpu_gen_code) take the lock.  How does it work?

> +
>      dc->is_jmp = DISAS_NEXT;
>      dc->pc = pc_start;
>      dc->singlestep_enabled = cs->singlestep_enabled;
> @@ -11506,6 +11508,7 @@ done_generating:
>          tb->size = dc->pc - pc_start;
>          tb->icount = num_insns;
>      }
> +    tb_unlock();
>  }
>  

> +/* tb_lock must be help for tcg_malloc_internal. */

"Held", not "help".

Paolo

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 19/19] target-arm/psci.c: wake up sleeping CPUs (MTTCG)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 19/19] target-arm/psci.c: wake up sleeping CPUs (MTTCG) fred.konrad
@ 2015-08-10 16:41   ` Paolo Bonzini
  2015-08-10 18:38     ` Alex Bennée
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:41 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> From: Alex Bennée <alex.bennee@linaro.org>
> 
>> Testing with Alexander's bare metal synchronisation tests fails in MTTCG
> leaving one CPU spinning forever waiting for the second CPU to wake up.
> We simply need to poke the halt_cond once we have processed the PSCI
> power on call.
> 
> Tested-by: Alex Bennée <alex.bennee@linaro.org>
> CC: Alexander Spyridakis <a.spyridakis@virtualopensystems.com>
> ---
>  target-arm/psci.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/target-arm/psci.c b/target-arm/psci.c
> index 20e4cb6..83e309c 100644
> --- a/target-arm/psci.c
> +++ b/target-arm/psci.c
> @@ -211,6 +211,8 @@ void arm_handle_psci_call(ARMCPU *cpu)
>          }
>          target_cpu_class->set_pc(target_cpu_state, entry);
>  
> +        qemu_cond_signal(target_cpu_state->halt_cond);
> +

qemu_cpu_kick, not qemu_cond_signal.

Paolo

>          ret = 0;
>          break;
>      case QEMU_PSCI_0_1_FN_CPU_OFF:
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-10 16:36   ` Paolo Bonzini
@ 2015-08-10 16:50     ` Paolo Bonzini
  2015-08-10 18:39       ` Alex Bennée
  2015-08-11  6:46     ` Frederic Konrad
  1 sibling, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:50 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 10/08/2015 18:36, Paolo Bonzini wrote:
>> > diff --git a/target-arm/translate.c b/target-arm/translate.c
>> > index 69ac18c..960c75e 100644
>> > --- a/target-arm/translate.c
>> > +++ b/target-arm/translate.c
>> > @@ -11166,6 +11166,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
>> >  
>> >      dc->tb = tb;
>> >  
>> > +    tb_lock();
> This locks twice, I think?  Both cpu_restore_state_from_tb and 
> tb_gen_code (which calls cpu_gen_code) take the lock.  How does it work?
> 

... ah, the lock is recursive!

I think this can be avoided.  Let's look at it next week.
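
One non-recursive alternative is to assert on double-locking instead of
permitting it, and hoist the lock to the outermost caller (a sketch only,
not what the series currently does):

    static __thread bool have_tb_lock;

    void tb_lock(void)
    {
        assert(!have_tb_lock);   /* catch recursion instead of allowing it */
        qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
        have_tb_lock = true;
    }

    void tb_unlock(void)
    {
        assert(have_tb_lock);
        have_tb_lock = false;
        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
    }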

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 13/19] add a callback when tb_invalidate is called.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 13/19] add a callback when tb_invalidate is called fred.konrad
@ 2015-08-10 16:52   ` Paolo Bonzini
  2015-08-10 18:41     ` Alex Bennée
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-10 16:52 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> Instead of doing the jump cache invalidation directly in tb_invalidate,
> delay it until after the exit so we don't have another CPU trying to
> execute the code being invalidated.
> 
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  translate-all.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 59 insertions(+), 2 deletions(-)

If you take the easy way and avoid the optimizations in patch 7, this is
not necessary: tb_find_fast and tb_add_jump are only called from within
tb_lock, so all of tb_jmp_cache/jmp_first/jmp_next are protected by tb_lock.

Let's get everything in and then optimize; the order should be:

- Alvise's LL/SC implementation

- conversion of atomics to LL/SC for all front-ends

- the main MTTCG series, reusing the locking already in-place for
user-mode emulation (with some audit...)

- any further push-downs of tb_lock

Paolo

> diff --git a/translate-all.c b/translate-all.c
> index 954c67a..fc5162a 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -62,6 +62,7 @@
>  #include "translate-all.h"
>  #include "qemu/bitmap.h"
>  #include "qemu/timer.h"
> +#include "sysemu/cpus.h"
>  
>  //#define DEBUG_TB_INVALIDATE
>  //#define DEBUG_FLUSH
> @@ -967,14 +968,58 @@ static inline void tb_reset_jump(TranslationBlock *tb, int n)
>      tb_set_jmp_target(tb, n, (uintptr_t)(tb->tc_ptr + tb->tb_next_offset[n]));
>  }
>  
> +struct CPUDiscardTBParams {
> +    CPUState *cpu;
> +    TranslationBlock *tb;
> +};
> +
> +static void cpu_discard_tb_from_jmp_cache(void *opaque)
> +{
> +    unsigned int h;
> +    struct CPUDiscardTBParams *params = opaque;
> +
> +    h = tb_jmp_cache_hash_func(params->tb->pc);
> +    if (params->cpu->tb_jmp_cache[h] == params->tb) {
> +        params->cpu->tb_jmp_cache[h] = NULL;
> +    }
> +
> +    g_free(opaque);
> +}
> +
> +static void tb_invalidate_jmp_remove(void *opaque)
> +{
> +    TranslationBlock *tb = opaque;
> +    TranslationBlock *tb1, *tb2;
> +    unsigned int n1;
> +
> +    /* suppress this TB from the two jump lists */
> +    tb_jmp_remove(tb, 0);
> +    tb_jmp_remove(tb, 1);
> +
> +    /* suppress any remaining jumps to this TB */
> +    tb1 = tb->jmp_first;
> +    for (;;) {
> +        n1 = (uintptr_t)tb1 & 3;
> +        if (n1 == 2) {
> +            break;
> +        }
> +        tb1 = (TranslationBlock *)((uintptr_t)tb1 & ~3);
> +        tb2 = tb1->jmp_next[n1];
> +        tb_reset_jump(tb1, n1);
> +        tb1->jmp_next[n1] = NULL;
> +        tb1 = tb2;
> +    }
> +    tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
> +}
> +
>  /* invalidate one TB */
>  void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>  {
>      CPUState *cpu;
>      PageDesc *p;
> -    unsigned int h, n1;
> +    unsigned int h;
>      tb_page_addr_t phys_pc;
> -    TranslationBlock *tb1, *tb2;
> +    struct CPUDiscardTBParams *params;
>  
>      tb_lock();
>  
> @@ -997,6 +1042,9 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>  
>      tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
>  
> +#if 0 /*MTTCG*/
> +    TranslationBlock *tb1, *tb2;
> +    unsigned int n1;
>      /* remove the TB from the hash list */
>      h = tb_jmp_cache_hash_func(tb->pc);
>      CPU_FOREACH(cpu) {
> @@ -1023,6 +1071,15 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>          tb1 = tb2;
>      }
>      tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
> +#else
> +    CPU_FOREACH(cpu) {
> +        params = g_malloc(sizeof(struct CPUDiscardTBParams));
> +        params->cpu = cpu;
> +        params->tb = tb;
> +        async_run_on_cpu(cpu, cpu_discard_tb_from_jmp_cache, params);
> +    }
> +    async_run_safe_work_on_cpu(first_cpu, tb_invalidate_jmp_remove, tb);
> +#endif /* MTTCG */
>  
>      tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
>      tb_unlock();
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (18 preceding siblings ...)
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 19/19] target-arm/psci.c: wake up sleeping CPUs (MTTCG) fred.konrad
@ 2015-08-10 18:34 ` Alex Bennée
  2015-08-10 23:02   ` Frederic Konrad
  2015-08-11  6:15 ` Benjamin Herrenschmidt
  2015-08-11 12:45 ` Paolo Bonzini
  21 siblings, 1 reply; 81+ messages in thread
From: Alex Bennée @ 2015-08-10 18:34 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue, pbonzini


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> This is the 7th round of the MTTCG patch series.
>
>
> It can be cloned from:
> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.

I'm not seeing this yet, did you remember to push?


>
> This patch-set try to address the different issues in the global picture of
> MTTCG, presented on the wiki.
>
> == Needed patch for our work ==
>
> Some preliminaries are needed for our work:
>  * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>    the CPUState.
>  * We need to run some work safely when all VCPUs are outside their execution
>    loop. This is done with the async_run_safe_work_on_cpu function introduced
>    in this series.
>  * QemuSpin lock is introduced (on posix only yet) to allow a faster handling of
>    atomic instruction.
>
> == Code generation and cache ==
>
> As Qemu stands, there is no protection at all against two threads attempting to
> generate code at the same time or modifying a TranslationBlock.
> The "protect TBContext with tb_lock" patch address the issue of code generation
> and makes all the tb_* function thread safe (except tb_flush).
> This raised the question of one or multiple caches. We choosed to use one
> unified cache because it's easier as a first step and since the structure of
> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
> don't see the benefit of having two pools of tbs.
>
> == Dirty tracking ==
>
> Protecting the IOs:
> To allows all VCPUs threads to run at the same time we need to drop the
> global_mutex as soon as possible. The io access need to take the mutex. This is
> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
> will be upstreamed.
>
> Invalidation of TranslationBlocks:
> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
> it's jump cache itself as it is in CPUState so that can be handled by a simple
> call to async_run_on_cpu. However tb_invalidate also writes to the
> TranslationBlock which is shared as we have only one pool.
> Hence this part of invalidate requires all VCPUs to exit before it can be done.
> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
>
> == Atomic instruction ==
>
> For now only ARM on x64 is supported by using an cmpxchg instruction.
> Specifically the limitation of this approach is that it is harder to support
> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
> cmpxchg (we believe this could be the case for some PPC cores).  For now this
> case is not correctly handled. The existing atomic patch will attempt to execute
> the 64 bit cmpxchg functionality in a non thread safe fashion. Our intention is
> to provide a new multi-thread ARM atomic patch for 64bit ARM on effective 32bit
> hosts.
> This atomic instruction part has been tested with Alexander's atomic stress repo
> available here:
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>
> The execution is a little slower than upstream probably because of the different
> VCPU fight for the mutex. Swaping arm_exclusive_lock from mutex to spin_lock
> reduce considerably the difference.
>
> == Testing ==
>
> A simple double dhrystone test in SMP 2 with vexpress-a15 in a Linux guest
> shows a good performance progression: it basically takes 18s upstream to
> complete vs 10s with MTTCG.
>
> Testing image is available here:
> https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3
>
> Then simply:
> ./configure --target-list=arm-softmmu
> make -j8
> ./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage
> -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic
> --append "console=ttyAMA0"
>
> login: root
>
> The dhrystone command is the last one in the history.
> "dhrystone 10000000 & dhrystone 10000000"
>
> The atomic spinlock benchmark from Alexander shows that atomics basically work.
> Just follow the instruction here:
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>
> == Known issues ==
>
> * GDB stub:
>   The GDB stub is not tested right now; it will probably require some
>   changes to work.
>
> * deadlock on exit:
>   When exiting QEMU with Ctrl-C, some VCPU threads are not able to exit and
>   continue execution.
>   http://git.greensocs.com/fkonrad/mttcg/issues/1
>
> * memory_region_rom_device_set_romd from pflash01 just crashes the TCG code.
>   Strangely this happens only with "-smp 4" and 2 in the DTB.
>   http://git.greensocs.com/fkonrad/mttcg/issues/2
>
> Changes V6 -> V7:
>   * global_lock:
>      * Don't protect the softmmu read/write helpers as it's now done in
>        address_space_rw.
>   * tcg_exec_flag:
>      * Make the flag atomically tested and set through an API.
>   * introduce async_safe_work:
>      * move qemu_cpu_kick_thread to avoid a prototype declaration.
>      * use the work_mutex.
>   * async_work:
>      * protect it with a mutex (work_mutex) against concurrent access.
>   * tb_lock:
>      * protect tcg_malloc_internal as well.
>   * signal the VCPU even if current_cpu is NULL.
>   * added PSCI patch.
>   * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899).
>
> Changes V5 -> V6:
>   * Introduce async_safe_work to do the tb_flush and some parts of tb_invalidate.
>   * Introduce QemuSpin from Guillaume, which allows faster atomic instructions
>     (6s to pass Alexander's atomic test instead of 30s before).
>   * Don't take tb_lock before tb_find_fast.
>   * Handle tb_flush with async_safe_work.
>   * Handle tb_invalidate with async_work and async_safe_work.
>   * Drop the tlb_flush_request mechanism and use async_work as well.
>   * Fix the wrong length in the atomic patch.
>   * Fix the wrong return address for exceptions in the atomic patch.
>
> Alex Bennée (1):
>   target-arm/psci.c: wake up sleeping CPUs (MTTCG)
>
> Guillaume Delbergue (1):
>   add support for spin lock on POSIX systems exclusively
>
> KONRAD Frederic (17):
>   cpus: protect queued_work_* with work_mutex.
>   cpus: add tcg_exec_flag.
>   cpus: introduce async_run_safe_work_on_cpu.
>   replace spinlock by QemuMutex.
>   remove unused spinlock.
>   protect TBContext with tb_lock.
>   tcg: remove tcg_halt_cond global variable.
>   Drop global lock during TCG code execution
>   cpu: remove exit_request global.
>   tcg: switch on multithread.
>   Use atomic cmpxchg to atomically check the exclusive value in a STREX
>   add a callback when tb_invalidate is called.
>   cpu: introduce tlb_flush*_all.
>   arm: use tlb_flush*_all
>   translate-all: introduces tb_flush_safe.
>   translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
>   mttcg: signal the associated cpu anyway.
>
>  cpu-exec.c                  |  98 +++++++++------
>  cpus.c                      | 295 +++++++++++++++++++++++++-------------------
>  cputlb.c                    |  81 ++++++++++++
>  include/exec/exec-all.h     |   8 +-
>  include/exec/spinlock.h     |  49 --------
>  include/qemu/thread-posix.h |   4 +
>  include/qemu/thread-win32.h |   4 +
>  include/qemu/thread.h       |   7 ++
>  include/qom/cpu.h           |  57 +++++++++
>  linux-user/main.c           |   6 +-
>  qom/cpu.c                   |  20 +++
>  target-arm/cpu.c            |  21 ++++
>  target-arm/cpu.h            |   6 +
>  target-arm/helper.c         |  58 +++------
>  target-arm/helper.h         |   4 +
>  target-arm/op_helper.c      | 128 ++++++++++++++++++-
>  target-arm/psci.c           |   2 +
>  target-arm/translate.c      | 101 +++------------
>  target-i386/mem_helper.c    |  16 ++-
>  target-i386/misc_helper.c   |  27 +++-
>  tcg/i386/tcg-target.c       |   8 ++
>  tcg/tcg.h                   |  14 ++-
>  translate-all.c             | 217 +++++++++++++++++++++++++++-----
>  util/qemu-thread-posix.c    |  45 +++++++
>  util/qemu-thread-win32.c    |  30 +++++
>  vl.c                        |   6 +
>  26 files changed, 934 insertions(+), 378 deletions(-)
>  delete mode 100644 include/exec/spinlock.h

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 19/19] target-arm/psci.c: wake up sleeping CPUs (MTTCG)
  2015-08-10 16:41   ` Paolo Bonzini
@ 2015-08-10 18:38     ` Alex Bennée
  0 siblings, 0 replies; 81+ messages in thread
From: Alex Bennée @ 2015-08-10 18:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue, fred.konrad

[-- Attachment #1: Type: text/plain, Size: 1221 bytes --]


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>> From: Alex Bennée <alex.bennee@linaro.org>
>> 
>> Testing with Alexander's bare metal synchronisation tests fails in MTTCG
>> leaving one CPU spinning forever waiting for the second CPU to wake up.
>> We simply need to poke the halt_cond once we have processed the PSCI
>> power on call.
>> 
>> Tested-by: Alex Bennée <alex.bennee@linaro.org>
>> CC: Alexander Spyridakis <a.spyridakis@virtualopensystems.com>
>> ---
>>  target-arm/psci.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> 
>> diff --git a/target-arm/psci.c b/target-arm/psci.c
>> index 20e4cb6..83e309c 100644
>> --- a/target-arm/psci.c
>> +++ b/target-arm/psci.c
>> @@ -211,6 +211,8 @@ void arm_handle_psci_call(ARMCPU *cpu)
>>          }
>>          target_cpu_class->set_pc(target_cpu_state, entry);
>>  
>> +        qemu_cond_signal(target_cpu_state->halt_cond);
>> +
>
> qemu_cpu_kick, not qemu_cond_signal.

I did a v2 in my branch and didn't realise Fred had picked it up into
his tree. I hadn't sent it to the list because it didn't seem worth it
for non-MTTCG builds. I can send it if you want.

Fred,

Use the attached on your next rebase.


[-- Attachment #2: v2 of PSCI patch --]
[-- Type: text/x-diff, Size: 1151 bytes --]

From bb8aabadc0880a21bfe5821af172c047474841d6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alex=20Benn=C3=A9e?= <alex.bennee@linaro.org>
Date: Tue, 7 Jul 2015 08:28:05 +0100
Subject: [PATCH] target-arm/psci.c: wake up sleeping CPUs (MTTCG)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Testing with Alexander's bare metal synchronisation tests fails in MTTCG
leaving one CPU spinning forever waiting for the second CPU to wake up.
We simply need to poke the halt_cond once we have processed the PSCI
power on call.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
CC: Alexander Spyridakis <a.spyridakis@virtualopensystems.com>

---
v2
  - use qemu_cpu_kick()
---
 target-arm/psci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/target-arm/psci.c b/target-arm/psci.c
index 20e4cb6..4643743 100644
--- a/target-arm/psci.c
+++ b/target-arm/psci.c
@@ -211,6 +211,8 @@ void arm_handle_psci_call(ARMCPU *cpu)
         }
         target_cpu_class->set_pc(target_cpu_state, entry);
 
+        qemu_cpu_kick(target_cpu_state);
+
         ret = 0;
         break;
     case QEMU_PSCI_0_1_FN_CPU_OFF:
-- 
2.5.0


[-- Attachment #3: Type: text/plain, Size: 115 bytes --]



>
> Paolo
>
>>          ret = 0;
>>          break;
>>      case QEMU_PSCI_0_1_FN_CPU_OFF:
>> 

-- 
Alex Bennée

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-10 16:50     ` Paolo Bonzini
@ 2015-08-10 18:39       ` Alex Bennée
  2015-08-11  8:31         ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Bennée @ 2015-08-10 18:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue, fred.konrad


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 10/08/2015 18:36, Paolo Bonzini wrote:
>>> > diff --git a/target-arm/translate.c b/target-arm/translate.c
>>> > index 69ac18c..960c75e 100644
>>> > --- a/target-arm/translate.c
>>> > +++ b/target-arm/translate.c
>>> > @@ -11166,6 +11166,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
>>> >  
>>> >      dc->tb = tb;
>>> >  
>>> > +    tb_lock();
>> This locks twice, I think?  Both cpu_restore_state_from_tb and 
>> tb_gen_code (which calls cpu_gen_code) take the lock.  How does it work?
>> 
>
> ... ah, the lock is recursive!
>
> I think this can be avoided.  Let's look at it next week.

I take it you're around on the Tuesday (Fred and I arrive Monday evening).
Shall we pick a time or hunt for each other in the hacking room?

>
> Paolo

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 81+ messages in thread
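
A sketch of the non-recursive alternative discussed above, assuming the
tb_lock()/tb_unlock()/tb_lock_reset() wrappers and the QemuMutex in
tcg_ctx.tb_ctx from the quoted hunks (an illustration, not the actual
patch):

    static __thread bool have_tb_lock;

    void tb_lock(void)
    {
        assert(!have_tb_lock);      /* catches recursive locking */
        qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
        have_tb_lock = true;
    }

    void tb_unlock(void)
    {
        assert(have_tb_lock);
        have_tb_lock = false;
        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
    }

    /* Safe to call whether or not the lock is held, e.g. just before
     * the siglongjmp() back into cpu_exec(). */
    void tb_lock_reset(void)
    {
        if (have_tb_lock) {
            have_tb_lock = false;
            qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
        }
    }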

* Re: [Qemu-devel] [RFC PATCH V7 13/19] add a callback when tb_invalidate is called.
  2015-08-10 16:52   ` Paolo Bonzini
@ 2015-08-10 18:41     ` Alex Bennée
  0 siblings, 0 replies; 81+ messages in thread
From: Alex Bennée @ 2015-08-10 18:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue, fred.konrad


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>> 
>> Instead of doing the jump cache invalidation directly in tb_invalidate delay it
>> after the exit so we don't have an other CPU trying to execute the code being
>> invalidated.
>> 
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>  translate-all.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 59 insertions(+), 2 deletions(-)
>
> If you take the easy way and avoid the optimizations in patch 7, this is
> not necessary: tb_find_fast and tb_add_jump are only called from within
> tb_lock, so all of tb_jmp_cache/jmp_first/jmp_next are protected by tb_lock.
>
> Let's get everything in and then optimize; the order should be:
>
> - Alvise's LL/SC implementation
>
> - conversion of atomics to LL/SC for all front-ends
>
> - the main MTTCG series, reusing the locking already in-place for
> user-mode emulation (with some audit...)

- including dropping the cmpxchg fix and including Alvise's MTTCG aware patches
  that build on top of LL/SC work.

>
> - any further push-downs of tb_lock
>
> Paolo
>
>> diff --git a/translate-all.c b/translate-all.c
>> index 954c67a..fc5162a 100644
>> --- a/translate-all.c
>> +++ b/translate-all.c
>> @@ -62,6 +62,7 @@
>>  #include "translate-all.h"
>>  #include "qemu/bitmap.h"
>>  #include "qemu/timer.h"
>> +#include "sysemu/cpus.h"
>>  
>>  //#define DEBUG_TB_INVALIDATE
>>  //#define DEBUG_FLUSH
>> @@ -967,14 +968,58 @@ static inline void tb_reset_jump(TranslationBlock *tb, int n)
>>      tb_set_jmp_target(tb, n, (uintptr_t)(tb->tc_ptr + tb->tb_next_offset[n]));
>>  }
>>  
>> +struct CPUDiscardTBParams {
>> +    CPUState *cpu;
>> +    TranslationBlock *tb;
>> +};
>> +
>> +static void cpu_discard_tb_from_jmp_cache(void *opaque)
>> +{
>> +    unsigned int h;
>> +    struct CPUDiscardTBParams *params = opaque;
>> +
>> +    h = tb_jmp_cache_hash_func(params->tb->pc);
>> +    if (params->cpu->tb_jmp_cache[h] == params->tb) {
>> +        params->cpu->tb_jmp_cache[h] = NULL;
>> +    }
>> +
>> +    g_free(opaque);
>> +}
>> +
>> +static void tb_invalidate_jmp_remove(void *opaque)
>> +{
>> +    TranslationBlock *tb = opaque;
>> +    TranslationBlock *tb1, *tb2;
>> +    unsigned int n1;
>> +
>> +    /* suppress this TB from the two jump lists */
>> +    tb_jmp_remove(tb, 0);
>> +    tb_jmp_remove(tb, 1);
>> +
>> +    /* suppress any remaining jumps to this TB */
>> +    tb1 = tb->jmp_first;
>> +    for (;;) {
>> +        n1 = (uintptr_t)tb1 & 3;
>> +        if (n1 == 2) {
>> +            break;
>> +        }
>> +        tb1 = (TranslationBlock *)((uintptr_t)tb1 & ~3);
>> +        tb2 = tb1->jmp_next[n1];
>> +        tb_reset_jump(tb1, n1);
>> +        tb1->jmp_next[n1] = NULL;
>> +        tb1 = tb2;
>> +    }
>> +    tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
>> +}
>> +
>>  /* invalidate one TB */
>>  void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>>  {
>>      CPUState *cpu;
>>      PageDesc *p;
>> -    unsigned int h, n1;
>> +    unsigned int h;
>>      tb_page_addr_t phys_pc;
>> -    TranslationBlock *tb1, *tb2;
>> +    struct CPUDiscardTBParams *params;
>>  
>>      tb_lock();
>>  
>> @@ -997,6 +1042,9 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>>  
>>      tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
>>  
>> +#if 0 /*MTTCG*/
>> +    TranslationBlock *tb1, *tb2;
>> +    unsigned int n1;
>>      /* remove the TB from the hash list */
>>      h = tb_jmp_cache_hash_func(tb->pc);
>>      CPU_FOREACH(cpu) {
>> @@ -1023,6 +1071,15 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>>          tb1 = tb2;
>>      }
>>      tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
>> +#else
>> +    CPU_FOREACH(cpu) {
>> +        params = g_malloc(sizeof(struct CPUDiscardTBParams));
>> +        params->cpu = cpu;
>> +        params->tb = tb;
>> +        async_run_on_cpu(cpu, cpu_discard_tb_from_jmp_cache, params);
>> +    }
>> +    async_run_safe_work_on_cpu(first_cpu, tb_invalidate_jmp_remove, tb);
>> +#endif /* MTTCG */
>>  
>>      tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
>>      tb_unlock();
>> 

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-10 18:34 ` [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG Alex Bennée
@ 2015-08-10 23:02   ` Frederic Konrad
  0 siblings, 0 replies; 81+ messages in thread
From: Frederic Konrad @ 2015-08-10 23:02 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue, pbonzini

On 10/08/2015 20:34, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This is the 7th round of the MTTCG patch series.
>>
>>
>> It can be cloned from:
>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
> I'm not seeing this yet, did you remember to push?
oops sorry done!
>
>> This patch-set try to address the different issues in the global picture of
>> MTTCG, presented on the wiki.
>>
>> == Needed patch for our work ==
>>
>> Some preliminaries are needed for our work:
>>   * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>>     the CPUState.
>>   * We need to run some work safely when all VCPUs are outside their execution
>>     loop. This is done with the async_run_safe_work_on_cpu function introduced
>>     in this series.
>>   * QemuSpin lock is introduced (on posix only yet) to allow a faster handling of
>>     atomic instruction.
>>
>> == Code generation and cache ==
>>
>> As Qemu stands, there is no protection at all against two threads attempting to
>> generate code at the same time or modifying a TranslationBlock.
>> The "protect TBContext with tb_lock" patch address the issue of code generation
>> and makes all the tb_* function thread safe (except tb_flush).
>> This raised the question of one or multiple caches. We choosed to use one
>> unified cache because it's easier as a first step and since the structure of
>> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
>> don't see the benefit of having two pools of tbs.
>>
>> == Dirty tracking ==
>>
>> Protecting the IOs:
>> To allows all VCPUs threads to run at the same time we need to drop the
>> global_mutex as soon as possible. The io access need to take the mutex. This is
>> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>> will be upstreamed.
>>
>> Invalidation of TranslationBlocks:
>> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
>> it's jump cache itself as it is in CPUState so that can be handled by a simple
>> call to async_run_on_cpu. However tb_invalidate also writes to the
>> TranslationBlock which is shared as we have only one pool.
>> Hence this part of invalidate requires all VCPUs to exit before it can be done.
>> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
>>
>> == Atomic instruction ==
>>
>> For now only ARM on x64 is supported by using an cmpxchg instruction.
>> Specifically the limitation of this approach is that it is harder to support
>> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
>> cmpxchg (we believe this could be the case for some PPC cores).  For now this
>> case is not correctly handled. The existing atomic patch will attempt to execute
>> the 64 bit cmpxchg functionality in a non thread safe fashion. Our intention is
>> to provide a new multi-thread ARM atomic patch for 64bit ARM on effective 32bit
>> hosts.
>> This atomic instruction part has been tested with Alexander's atomic stress repo
>> available here:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>>
>> The execution is a little slower than upstream probably because of the different
>> VCPU fight for the mutex. Swaping arm_exclusive_lock from mutex to spin_lock
>> reduce considerably the difference.
>>
>> == Testing ==
>>
>> A simple double dhrystone test in SMP 2 with vexpress-a15 in a linux guest show
>> a good performance progression: it takes basically 18s upstream to complete vs
>> 10s with MTTCG.
>>
>> Testing image is available here:
>> https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3
>>
>> Then simply:
>> ./configure --target-list=arm-softmmu
>> make -j8
>> ./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage
>> -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic
>> --append "console=ttyAMA0"
>>
>> login: root
>>
>> The dhrystone command is the last one in the history.
>> "dhrystone 10000000 & dhrystone 10000000"
>>
>> The atomic spinlock benchmark from Alexander shows that atomic basically work.
>> Just follow the instruction here:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>>
>> == Known issues ==
>>
>> * GDB stub:
>>    GDB stub is not tested right now it will probably requires some changes to
>>    work.
>>
>> * deadlock on exit:
>>    When exiting QEMU Ctrl-C some VCPU's thread are not able to exit and continue
>>    execution.
>>    http://git.greensocs.com/fkonrad/mttcg/issues/1
>>
>> * memory_region_rom_device_set_romd from pflash01 just crashes the TCG code.
>>    Strangely this happen only with "-smp 4" and 2 in the DTB.
>>    http://git.greensocs.com/fkonrad/mttcg/issues/2
>>
>> Changes V6 -> V7:
>>    * global_lock:
>>       * Don't protect softmmu read/write helper as it's now done in
>>         address_space_rw.
>>    * tcg_exec_flag:
>>       * Make the flag atomically test and set through an API.
>>    * introduce async_safe_work:
>>       * move qemu_cpu_kick_thread to avoid prototype declaration.
>>       * use the work_mutex.
>>    * async_work:
>>       * protect it with a mutex (work_mutex) against concurrent access.
>>    * tb_lock:
>>       * protect tcg_malloc_internal as well.
>>    * signal the VCPU even of current_cpu is NULL.
>>    * added PSCI patch.
>>    * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899).
>>
>> Changes V5 -> V6:
>>    * Introduce async_safe_work to do the tb_flush and some part of tb_invalidate.
>>    * Introduce QemuSpin from Guillaume which allow a faster atomic instruction
>>      (6s to pass Alexander's atomic test instead of 30s before).
>>    * Don't take tb_lock before tb_find_fast.
>>    * Handle tb_flush with async_safe_work.
>>    * Handle tb_invalidate with async_work and async_safe_work.
>>    * Drop the tlb_flush_request mechanism and use async_work as well.
>>    * Fix the wrong length in atomic patch.
>>    * Fix the wrong return address for exception in atomic patch.
>>
>> Alex Bennée (1):
>>    target-arm/psci.c: wake up sleeping CPUs (MTTCG)
>>
>> Guillaume Delbergue (1):
>>    add support for spin lock on POSIX systems exclusively
>>
>> KONRAD Frederic (17):
>>    cpus: protect queued_work_* with work_mutex.
>>    cpus: add tcg_exec_flag.
>>    cpus: introduce async_run_safe_work_on_cpu.
>>    replace spinlock by QemuMutex.
>>    remove unused spinlock.
>>    protect TBContext with tb_lock.
>>    tcg: remove tcg_halt_cond global variable.
>>    Drop global lock during TCG code execution
>>    cpu: remove exit_request global.
>>    tcg: switch on multithread.
>>    Use atomic cmpxchg to atomically check the exclusive value in a STREX
>>    add a callback when tb_invalidate is called.
>>    cpu: introduce tlb_flush*_all.
>>    arm: use tlb_flush*_all
>>    translate-all: introduces tb_flush_safe.
>>    translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
>>    mttcg: signal the associated cpu anyway.
>>
>>   cpu-exec.c                  |  98 +++++++++------
>>   cpus.c                      | 295 +++++++++++++++++++++++++-------------------
>>   cputlb.c                    |  81 ++++++++++++
>>   include/exec/exec-all.h     |   8 +-
>>   include/exec/spinlock.h     |  49 --------
>>   include/qemu/thread-posix.h |   4 +
>>   include/qemu/thread-win32.h |   4 +
>>   include/qemu/thread.h       |   7 ++
>>   include/qom/cpu.h           |  57 +++++++++
>>   linux-user/main.c           |   6 +-
>>   qom/cpu.c                   |  20 +++
>>   target-arm/cpu.c            |  21 ++++
>>   target-arm/cpu.h            |   6 +
>>   target-arm/helper.c         |  58 +++------
>>   target-arm/helper.h         |   4 +
>>   target-arm/op_helper.c      | 128 ++++++++++++++++++-
>>   target-arm/psci.c           |   2 +
>>   target-arm/translate.c      | 101 +++------------
>>   target-i386/mem_helper.c    |  16 ++-
>>   target-i386/misc_helper.c   |  27 +++-
>>   tcg/i386/tcg-target.c       |   8 ++
>>   tcg/tcg.h                   |  14 ++-
>>   translate-all.c             | 217 +++++++++++++++++++++++++++-----
>>   util/qemu-thread-posix.c    |  45 +++++++
>>   util/qemu-thread-win32.c    |  30 +++++
>>   vl.c                        |   6 +
>>   26 files changed, 934 insertions(+), 378 deletions(-)
>>   delete mode 100644 include/exec/spinlock.h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (19 preceding siblings ...)
  2015-08-10 18:34 ` [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG Alex Bennée
@ 2015-08-11  6:15 ` Benjamin Herrenschmidt
  2015-08-11  6:27   ` Frederic Konrad
  2015-08-11  7:54   ` Alex Bennée
  2015-08-11 12:45 ` Paolo Bonzini
  21 siblings, 2 replies; 81+ messages in thread
From: Benjamin Herrenschmidt @ 2015-08-11  6:15 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue,
	pbonzini, alex.bennee

On Mon, 2015-08-10 at 17:26 +0200, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> This is the 7th round of the MTTCG patch series.
> 
> 
> It can be cloned from:
> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
> 
> This patch-set try to address the different issues in the global picture of
> MTTCG, presented on the wiki.
> 
> == Needed patch for our work ==
> 
> Some preliminaries are needed for our work:
>  * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>    the CPUState.

Can't you just make it a TLS ?

>  * We need to run some work safely when all VCPUs are outside their execution
>    loop. This is done with the async_run_safe_work_on_cpu function introduced
>    in this series.
>  * QemuSpin lock is introduced (on posix only yet) to allow a faster handling of
>    atomic instruction.

How do you handle the memory model ? IE , ARM and PPC are OO while x86
is (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating
x86 on ARM or PPC will lead to problems unless you generate memory
barriers with every load/store ..

At least on POWER7 and later on PPC we have the possibility of setting
the attribute "Strong Access Ordering" with mremap/mprotect (I dont'
remember which one) which gives us x86-like memory semantics...

I don't know if ARM supports something similar. On the other hand, when
emulating ARM on PPC or vice-versa, we can probably get away with no
barriers.

Do you expose some kind of guest memory model info to the TCG backend so
it can decide how to handle these things ?

> == Code generation and cache ==
> 
> As Qemu stands, there is no protection at all against two threads attempting to
> generate code at the same time or modifying a TranslationBlock.
> The "protect TBContext with tb_lock" patch address the issue of code generation
> and makes all the tb_* function thread safe (except tb_flush).
> This raised the question of one or multiple caches. We choosed to use one
> unified cache because it's easier as a first step and since the structure of
> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
> don't see the benefit of having two pools of tbs.
> 
> == Dirty tracking ==
> 
> Protecting the IOs:
> To allows all VCPUs threads to run at the same time we need to drop the
> global_mutex as soon as possible. The io access need to take the mutex. This is
> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
> will be upstreamed.
> 
> Invalidation of TranslationBlocks:
> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
> it's jump cache itself as it is in CPUState so that can be handled by a simple
> call to async_run_on_cpu. However tb_invalidate also writes to the
> TranslationBlock which is shared as we have only one pool.
> Hence this part of invalidate requires all VCPUs to exit before it can be done.
> Hence the async_run_safe_work_on_cpu is introduced to handle this case.

What about the host MMU emulation ? Is that multithreaded ? It has
potential issues when doing things like dirty bit updates into guest
memory, those need to be done atomically. Also TLB invalidations on ARM
and PPC are global, so they will need to invalidate the remote SW TLBs
as well.

Do you have a mechanism to synchronize with another thread ? IE, make it
pop out of TCG if already in and prevent it from getting in ? That way
you can "remotely" invalidate its TLB...

> == Atomic instruction ==
> 
> For now only ARM on x64 is supported by using an cmpxchg instruction.
> Specifically the limitation of this approach is that it is harder to support
> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
> cmpxchg (we believe this could be the case for some PPC cores).

Right, on the other hand 64-bit will do fine. But then x86 has 2-value
atomics nowadays, doesn't it ? And that will be hard to emulate on
anything. You might need to have some kind of global hashed lock list
used by atomics (hash the physical address) as a fallback if you don't
have a 1:1 match between host and guest capabilities.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11  6:15 ` Benjamin Herrenschmidt
@ 2015-08-11  6:27   ` Frederic Konrad
  2015-10-07 12:46     ` Claudio Fontana
  2015-08-11  7:54   ` Alex Bennée
  1 sibling, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-11  6:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue,
	pbonzini, alex.bennee

On 11/08/2015 08:15, Benjamin Herrenschmidt wrote:
> On Mon, 2015-08-10 at 17:26 +0200, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This is the 7th round of the MTTCG patch series.
>>
>>
>> It can be cloned from:
>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
>>
>> This patch-set try to address the different issues in the global picture of
>> MTTCG, presented on the wiki.
>>
>> == Needed patch for our work ==
>>
>> Some preliminaries are needed for our work:
>>   * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>>     the CPUState.
> Can't you just make it a TLS ?

True, that can be done as well. But tcg_exec_flag has a second meaning:
"you can't start executing code right now because I want to do a safe_work".
>
>>   * We need to run some work safely when all VCPUs are outside their execution
>>     loop. This is done with the async_run_safe_work_on_cpu function introduced
>>     in this series.
>>   * QemuSpin lock is introduced (on posix only yet) to allow a faster handling of
>>     atomic instruction.
> How do you handle the memory model ? IE , ARM and PPC are OO while x86
> is (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating
> x86 on ARM or PPC will lead to problems unless you generate memory
> barriers with every load/store ..

For the moment we are only tackling the first case (ARM/PPC guests on an
x86 host).
>
> At least on POWER7 and later on PPC we have the possibility of setting
> the attribute "Strong Access Ordering" with mremap/mprotect (I dont'
> remember which one) which gives us x86-like memory semantics...
>
> I don't know if ARM supports something similar. On the other hand, when
> emulating ARM on PPC or vice-versa, we can probably get away with no
> barriers.
>
> Do you expose some kind of guest memory model info to the TCG backend so
> it can decide how to handle these things ?
>
>> == Code generation and cache ==
>>
>> As Qemu stands, there is no protection at all against two threads attempting to
>> generate code at the same time or modifying a TranslationBlock.
>> The "protect TBContext with tb_lock" patch address the issue of code generation
>> and makes all the tb_* function thread safe (except tb_flush).
>> This raised the question of one or multiple caches. We choosed to use one
>> unified cache because it's easier as a first step and since the structure of
>> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
>> don't see the benefit of having two pools of tbs.
>>
>> == Dirty tracking ==
>>
>> Protecting the IOs:
>> To allows all VCPUs threads to run at the same time we need to drop the
>> global_mutex as soon as possible. The io access need to take the mutex. This is
>> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>> will be upstreamed.
>>
>> Invalidation of TranslationBlocks:
>> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
>> it's jump cache itself as it is in CPUState so that can be handled by a simple
>> call to async_run_on_cpu. However tb_invalidate also writes to the
>> TranslationBlock which is shared as we have only one pool.
>> Hence this part of invalidate requires all VCPUs to exit before it can be done.
>> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
> What about the host MMU emulation ? Is that multithreaded ? It has
> potential issues when doing things like dirty bit updates into guest
> memory, those need to be done atomically. Also TLB invalidations on ARM
> and PPC are global, so they will need to invalidate the remote SW TLBs
> as well.
>
> Do you have a mechanism to synchronize with another thread ? IE, make it
> pop out of TCG if already in and prevent it from getting in ? That way
> you can "remotely" invalidate its TLB...
Yes, that's what the safe_work is doing: ask everybody to exit, prevent
VCPUs from resuming (tcg_exec_flag), and do the work when everybody is
outside cpu-exec.
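
For reference, this is the call from patch 13 quoted earlier, with that
behaviour spelled out in comments (comments added for illustration; they
are not in the patch):

    /* Queued from tb_phys_invalidate(); the work only runs once every
     * VCPU has left cpu-exec, and tcg_exec_flag keeps them from
     * re-entering, so the shared TranslationBlock can be modified
     * safely. */
    async_run_safe_work_on_cpu(first_cpu, tb_invalidate_jmp_remove, tb);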

>
>> == Atomic instruction ==
>>
>> For now only ARM on x64 is supported by using an cmpxchg instruction.
>> Specifically the limitation of this approach is that it is harder to support
>> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
>> cmpxchg (we believe this could be the case for some PPC cores).
> Right, on the other hand 64-bit will do fine. But then x86 has 2-value
> atomics nowadays, doesn't it ? And that will be hard to emulate on
> anything. You might need to have some kind of global hashed lock list
> used by atomics (hash the physical address) as a fallback if you don't
> have a 1:1 match between host and guest capabilities.
VOS did a "Slow path for atomic instruction translation" series, which you
can find here:
https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg00971.html

That is what will be used in the end.

Thanks,
Fred
>
> Cheers,
> Ben.
>
>
>

^ permalink raw reply	[flat|nested] 81+ messages in thread
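
Ben's hashed-lock fallback could look roughly like the sketch below. All
names are hypothetical apart from QemuSpin, which this series introduces;
aliasing in the hash only costs contention, never correctness:

    #define ATOMIC_HASH_BITS 10
    #define ATOMIC_HASH_SIZE (1u << ATOMIC_HASH_BITS)

    static QemuSpin atomic_locks[ATOMIC_HASH_SIZE];

    static inline QemuSpin *atomic_lock_for(hwaddr paddr)
    {
        /* hash the guest physical address down to one of the locks */
        return &atomic_locks[(paddr >> 4) & (ATOMIC_HASH_SIZE - 1)];
    }

    /* e.g. an emulated 64-bit cmpxchg on a host that only has a 32-bit
     * one: every competing access to the same guest address takes the
     * same spin lock. */
    static bool emulated_cmpxchg64(hwaddr paddr, uint64_t *haddr,
                                   uint64_t cmp, uint64_t newval)
    {
        bool success = false;

        qemu_spin_lock(atomic_lock_for(paddr));
        if (*haddr == cmp) {
            *haddr = newval;
            success = true;
        }
        qemu_spin_unlock(atomic_lock_for(paddr));
        return success;
    }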

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-10 16:36   ` Paolo Bonzini
  2015-08-10 16:50     ` Paolo Bonzini
@ 2015-08-11  6:46     ` Frederic Konrad
  2015-08-11  8:34       ` Paolo Bonzini
  1 sibling, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-11  6:46 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue

On 10/08/2015 18:36, Paolo Bonzini wrote:
>
> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index f3358a9..a012e9d 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -131,6 +131,8 @@ static void init_delay_params(SyncClocks *sc, const CPUState *cpu)
>>   void cpu_loop_exit(CPUState *cpu)
>>   {
>>       cpu->current_tb = NULL;
>> +    /* Release those mutex before long jump so other thread can work. */
>> +    tb_lock_reset();
>>       siglongjmp(cpu->jmp_env, 1);
>>   }
>>   
>> @@ -143,6 +145,8 @@ void cpu_resume_from_signal(CPUState *cpu, void *puc)
>>       /* XXX: restore cpu registers saved in host registers */
>>   
>>       cpu->exception_index = -1;
>> +    /* Release those mutex before long jump so other thread can work. */
>> +    tb_lock_reset();
>>       siglongjmp(cpu->jmp_env, 1);
>>   }
>>   
> I think you should start easy and reuse the existing tb_lock code in
> cpu-exec.c:

I think it's definitely not sufficient. Is user-mode multithread still 
working today?
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 9305f03..2909ec2 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -307,7 +307,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu, target_ulong pc,
>   
>       tb = tb_find_physical(cpu, pc, cs_base, flags);
>       if (!tb) {
> -        tb_lock();
>           /*
>            * Retry to get the TB in case a CPU just translate it to avoid having
>            * duplicated TB in the pool.
> @@ -316,7 +315,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu, target_ulong pc,
>           if (!tb) {
>               tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>           }
> -        tb_unlock();
>       }
>       /* we add the TB in the virtual pc hash table */
>       cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
> @@ -372,11 +372,6 @@ int cpu_exec(CPUState *cpu)
>       uintptr_t next_tb;
>       SyncClocks sc;
>   
> -    /* This must be volatile so it is not trashed by longjmp() */
> -#if defined(CONFIG_USER_ONLY)
> -    volatile bool have_tb_lock = false;
> -#endif
> -
>       if (cpu->halted) {
>           if (!cpu_has_work(cpu)) {
>               return EXCP_HALTED;
> @@ -480,10 +475,7 @@ int cpu_exec(CPUState *cpu)
>                       cpu->exception_index = EXCP_INTERRUPT;
>                       cpu_loop_exit(cpu);
>                   }
> -#if defined(CONFIG_USER_ONLY)
> -                qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
> -                have_tb_lock = true;
> -#endif
> +                tb_lock();
>                   tb = tb_find_fast(cpu);
>                   /* Note: we do it here to avoid a gcc bug on Mac OS X when
>                      doing it in tb_find_slow */
> @@ -505,10 +497,7 @@ int cpu_exec(CPUState *cpu)
>                       tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
>                                   next_tb & TB_EXIT_MASK, tb);
>                   }
> -#if defined(CONFIG_USER_ONLY)
> -                have_tb_lock = false;
> -                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> -#endif
> +                tb_unlock();
>                   /* cpu_interrupt might be called while translating the
>                      TB, but before it is linked into a potentially
>                      infinite loop and becomes env->current_tb. Avoid
> @@ -575,12 +564,7 @@ int cpu_exec(CPUState *cpu)
>               x86_cpu = X86_CPU(cpu);
>               env = &x86_cpu->env;
>   #endif
> -#if defined(CONFIG_USER_ONLY)
> -            if (have_tb_lock) {
> -                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> -                have_tb_lock = false;
> -            }
> -#endif
> +            tb_lock_reset();
>           }
>       } /* for(;;) */
>   
>
> Optimizations should then come on top.
>
>> diff --git a/target-arm/translate.c b/target-arm/translate.c
>> index 69ac18c..960c75e 100644
>> --- a/target-arm/translate.c
>> +++ b/target-arm/translate.c
>> @@ -11166,6 +11166,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
>>   
>>       dc->tb = tb;
>>   
>> +    tb_lock();
> This locks twice, I think?  Both cpu_restore_state_from_tb and
> tb_gen_code (which calls cpu_gen_code) take the lock.  How does it work?
>

Yes, it's recursive; we might not need that, though. I probably locked too
many functions.

Thanks,
Fred
>> +
>>       dc->is_jmp = DISAS_NEXT;
>>       dc->pc = pc_start;
>>       dc->singlestep_enabled = cs->singlestep_enabled;
>> @@ -11506,6 +11508,7 @@ done_generating:
>>           tb->size = dc->pc - pc_start;
>>           tb->icount = num_insns;
>>       }
>> +    tb_unlock();
>>   }
>>   
>> +/* tb_lock must be help for tcg_malloc_internal. */
> "Held", not "help".
>
> Paolo
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution
  2015-08-10 16:15   ` Paolo Bonzini
@ 2015-08-11  6:55     ` Frederic Konrad
  2015-08-11 20:12     ` Alex Bennée
  1 sibling, 0 replies; 81+ messages in thread
From: Frederic Konrad @ 2015-08-11  6:55 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue

On 10/08/2015 18:15, Paolo Bonzini wrote:
>
> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>>   void qemu_mutex_lock_iothread(void)
>>   {
>> -    atomic_inc(&iothread_requesting_mutex);
>> -    /* In the simple case there is no need to bump the VCPU thread out of
>> -     * TCG code execution.
>> -     */
>> -    if (!tcg_enabled() || qemu_in_vcpu_thread() ||
>> -        !first_cpu || !first_cpu->thread) {
>> -        qemu_mutex_lock(&qemu_global_mutex);
>> -        atomic_dec(&iothread_requesting_mutex);
>> -    } else {
>> -        if (qemu_mutex_trylock(&qemu_global_mutex)) {
>> -            qemu_cpu_kick_thread(first_cpu);
>> -            qemu_mutex_lock(&qemu_global_mutex);
>> -        }
>> -        atomic_dec(&iothread_requesting_mutex);
>> -        qemu_cond_broadcast(&qemu_io_proceeded_cond);
>> -    }
>> -    iothread_locked = true;
> "iothread_locked = true" must be kept.  Otherwise... yay! :)

oops :).
>
>> @@ -125,8 +128,10 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
>>      can be detected */
>>   void tlb_protect_code(ram_addr_t ram_addr)
>>   {
>> +    qemu_mutex_lock_iothread();
>>       cpu_physical_memory_test_and_clear_dirty(ram_addr, TARGET_PAGE_SIZE,
>>                                                DIRTY_MEMORY_CODE);
>> +    qemu_mutex_unlock_iothread();
>>   }
>>   
> Not needed anymore.
>
>> diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
>> index 52c5d65..55f63bf 100644
>> --- a/target-i386/misc_helper.c
>> +++ b/target-i386/misc_helper.c
> None of this is needed anymore either! :)
>
>> +    /*
>> +     * Some device's reset needs to grab the global_mutex. So just release it
>> +     * here.
>> +     */
>> +    qemu_mutex_unlock_iothread();
>>       /* reset all devices */
>>       QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
>>           re->func(re->opaque);
>>       }
>> +    qemu_mutex_lock_iothread();
> Should never have been true?  (And, I think, it was pointed out in a
> previous version too).
I had a double lock with the reset handler from vexpress-a15. I don't really
remember why. But I hacked that. It's fixed now :)

Thanks,
Fred

> Paolo
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11  6:15 ` Benjamin Herrenschmidt
  2015-08-11  6:27   ` Frederic Konrad
@ 2015-08-11  7:54   ` Alex Bennée
  2015-08-11  9:22     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 81+ messages in thread
From: Alex Bennée @ 2015-08-11  7:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, fred.konrad


Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Mon, 2015-08-10 at 17:26 +0200, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>> 
>> This is the 7th round of the MTTCG patch series.
>> 
>> 
>> It can be cloned from:
>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
>> 
>> This patch-set try to address the different issues in the global picture of
>> MTTCG, presented on the wiki.
>> 
>> == Needed patch for our work ==
>> 
>> Some preliminaries are needed for our work:
>>  * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>>    the CPUState.
>
> Can't you just make it a TLS ?
>
>>  * We need to run some work safely when all VCPUs are outside their execution
>>    loop. This is done with the async_run_safe_work_on_cpu function introduced
>>    in this series.
>>  * QemuSpin lock is introduced (on posix only yet) to allow a faster handling of
>>    atomic instruction.
>
> How do you handle the memory model ? IE , ARM and PPC are OO while x86
> is (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating
> x86 on ARM or PPC will lead to problems unless you generate memory
> barriers with every load/store ..

This is the next chunk of work. We have Alvise's LL/SC patches which
allow us to do proper emulation of ARM's load/store exclusive behaviour,
and any weakly ordered target will have to use such constructs.

Currently the plan is to introduce a barrier TCG op which will translate
to the strongest backend barrier available. Even x86 should be using
barriers to ensure cross-core visibility, which then leaves load/store
re-ordering on the same core.
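
Very roughly, the barrier op idea (a hypothetical sketch; no such opcode
exists in TCG at this point, and the names are invented):

    /* front end: emit the new op wherever the guest has a barrier,
     * e.g. when translating an ARM dmb instruction */
    tcg_gen_op0(INDEX_op_mb);             /* hypothetical opcode */

    /* i386 backend: lower it to the strongest barrier available */
    static void tcg_out_mb(TCGContext *s)
    {
        tcg_out8(s, 0x0f);                /* mfence is 0f ae f0 */
        tcg_out8(s, 0xae);
        tcg_out8(s, 0xf0);
    }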

> At least on POWER7 and later on PPC we have the possibility of setting
> the attribute "Strong Access Ordering" with mremap/mprotect (I dont'
> remember which one) which gives us x86-like memory semantics...
>
> I don't know if ARM supports something similar. On the other hand, when
> emulating ARM on PPC or vice-versa, we can probably get away with no
> barriers.
>
> Do you expose some kind of guest memory model info to the TCG backend so
> it can decide how to handle these things ?
>
>> == Code generation and cache ==
>> 
>> As Qemu stands, there is no protection at all against two threads attempting to
>> generate code at the same time or modifying a TranslationBlock.
>> The "protect TBContext with tb_lock" patch address the issue of code generation
>> and makes all the tb_* function thread safe (except tb_flush).
>> This raised the question of one or multiple caches. We choosed to use one
>> unified cache because it's easier as a first step and since the structure of
>> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
>> don't see the benefit of having two pools of tbs.
>> 
>> == Dirty tracking ==
>> 
>> Protecting the IOs:
>> To allows all VCPUs threads to run at the same time we need to drop the
>> global_mutex as soon as possible. The io access need to take the mutex. This is
>> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>> will be upstreamed.
>> 
>> Invalidation of TranslationBlocks:
>> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
>> it's jump cache itself as it is in CPUState so that can be handled by a simple
>> call to async_run_on_cpu. However tb_invalidate also writes to the
>> TranslationBlock which is shared as we have only one pool.
>> Hence this part of invalidate requires all VCPUs to exit before it can be done.
>> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
>
> What about the host MMU emulation ? Is that multithreaded ? It has
> potential issues when doing things like dirty bit updates into guest
> memory, those need to be done atomically. Also TLB invalidations on ARM
> and PPC are global, so they will need to invalidate the remote SW TLBs
> as well.
>
> Do you have a mechanism to synchronize with another thread ? IE, make it
> pop out of TCG if already in and prevent it from getting in ? That way
> you can "remotely" invalidate its TLB...
>
>> == Atomic instruction ==
>> 
>> For now only ARM on x64 is supported by using an cmpxchg instruction.
>> Specifically the limitation of this approach is that it is harder to support
>> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
>> cmpxchg (we believe this could be the case for some PPC cores).
>
> Right, on the other hand 64-bit will do fine. But then x86 has 2-value
> atomics nowadays, doesn't it ? And that will be hard to emulate on
> anything. You might need to have some kind of global hashed lock list
> used by atomics (hash the physical address) as a fallback if you don't
> have a 1:1 match between host and guest capabilities.
>
> Cheers,
> Ben.

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-10 18:39       ` Alex Bennée
@ 2015-08-11  8:31         ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-11  8:31 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue, fred.konrad



On 10/08/2015 20:39, Alex Bennée wrote:
> > ... ah, the lock is recursive!
> >
> > I think this can be avoided.  Let's look at it next week.
> 
> I take it your around on the Tuesday (Fred and I arrive Monday evening).
> Shall we pick a time or hunt for each other in the hacking room?

Yes, I will.  Early afternoon, since I will be at CloudOpen at 4pm and
then at QEMU Summit.

I hope to post something this week too.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-11  6:46     ` Frederic Konrad
@ 2015-08-11  8:34       ` Paolo Bonzini
  2015-08-11  9:21         ` Peter Maydell
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-11  8:34 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 11/08/2015 08:46, Frederic Konrad wrote:
>> I think you should start easy and reuse the existing tb_lock code in
>> cpu-exec.c:
> 
> I think it's definitely not sufficient. Is user-mode multithread still
> working today?

For some definition of "working", yes.  It's not sufficient, but it's a
good start.

The main problem with user-mode multithreading is that there is no clear
lock hierarchy between mmap_lock and tb_lock.  But this is not a problem
for softmmu.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-11  8:34       ` Paolo Bonzini
@ 2015-08-11  9:21         ` Peter Maydell
  2015-08-11  9:59           ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Maydell @ 2015-08-11  9:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Mark Burton, Alvise Rigo, QEMU Developers,
	Guillaume Delbergue, Alex Bennée, Frederic Konrad

On 11 August 2015 at 09:34, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 11/08/2015 08:46, Frederic Konrad wrote:
>>> I think you should start easy and reuse the existing tb_lock code in
>>> cpu-exec.c:
>>
>> I think it's definitely not sufficient. Is user-mode multithread still
>> working today?
>
> For some definition of "working", yes.  It's not sufficient, but it's a
> good start.
>
> The main problem with user-mode multithreading is that there is no clear
> lock hierarchy between mmap_lock and tb_lock.  But this is not a problem
> for softmmu.

And also that we don't have a serious design for the locking at all.
I was hoping this would be something that would come out of the
multithreaded-TCG work...

-- PMM

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11  7:54   ` Alex Bennée
@ 2015-08-11  9:22     ` Benjamin Herrenschmidt
  2015-08-11  9:29       ` Peter Maydell
  2015-08-11 19:22       ` Alex Bennée
  0 siblings, 2 replies; 81+ messages in thread
From: Benjamin Herrenschmidt @ 2015-08-11  9:22 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, fred.konrad

On Tue, 2015-08-11 at 08:54 +0100, Alex Bennée wrote:
> 
> > How do you handle the memory model ? IE , ARM and PPC are OO while x86
> > is (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating
> > x86 on ARM or PPC will lead to problems unless you generate memory
> > barriers with every load/store ..
> 
> This is the next chunk of work. We have Alvise's LL/SC patches which
> allow us to do proper emulation of ARMs Load/store exclusive behaviour
> and any weak order target will have to use such constructs.

God no! You don't want to use ll/sc for dealing with weak ordering, you
want to use barriers... ll/sc will allow you to deal with front-end
things such as atomic inc/dec etc...  

> Currently the plan is to introduce a barrier TCG op which will translate
> to the strongest backend barrier available.

I would advocate at least two barriers, a full barrier and a write barrier,
so at least when emulating ARM or PPC on x86, we don't actually send
fences on every load/store.

IE. the x86 memory model is *not* fully ordered, so an ARM or PPC full
barrier must translate into an x86 fence afaik (or whatever is the x86
name of its full barrier), but you don't want to translate all ARM/PPC
weaker barriers into those.

>  Even x86 should be using barriers to ensure cross-core visibility which
> then leaves LS re-ordering on the same core.

Only for store + load, which is afaik the only case where x86 re-orders.
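
Concretely, the store + load case (a sketch in C11 atomics; without the
fence, r1 == r2 == 0 is a legal outcome on x86, which is why a guest full
barrier needs a real fence when emulated there):

    #include <stdatomic.h>

    atomic_int x, y;

    void thread1(int *r1)
    {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        /* a full barrier (mfence on x86) stops the load below from
         * completing before the store above is globally visible */
        atomic_thread_fence(memory_order_seq_cst);
        *r1 = atomic_load_explicit(&y, memory_order_relaxed);
    }

    /* thread2 is symmetric, with x and y swapped */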

But in any case, expose to the target (TCG target) the ordering
expectations of the source so that we can use whatever facilities might
be at hand to avoid some of those barriers, for example the SAO mapping
attribute I mentioned.

I'll try to look at your patch more closely when I get a chance and see
if I can produce a ppc target but don't hold your breath, I'm a bit
swamped at the moment.

Ben.

> > At least on POWER7 and later on PPC we have the possibility of setting
> > the attribute "Strong Access Ordering" with mremap/mprotect (I dont'
> > remember which one) which gives us x86-like memory semantics...
> >
> > I don't know if ARM supports something similar. On the other hand, when
> > emulating ARM on PPC or vice-versa, we can probably get away with no
> > barriers.
> >
> > Do you expose some kind of guest memory model info to the TCG backend so
> > it can decide how to handle these things ?
> >
> >> == Code generation and cache ==
> >> 
> >> As Qemu stands, there is no protection at all against two threads attempting to
> >> generate code at the same time or modifying a TranslationBlock.
> >> The "protect TBContext with tb_lock" patch address the issue of code generation
> >> and makes all the tb_* function thread safe (except tb_flush).
> >> This raised the question of one or multiple caches. We choosed to use one
> >> unified cache because it's easier as a first step and since the structure of
> >> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
> >> don't see the benefit of having two pools of tbs.
> >> 
> >> == Dirty tracking ==
> >> 
> >> Protecting the IOs:
> >> To allows all VCPUs threads to run at the same time we need to drop the
> >> global_mutex as soon as possible. The io access need to take the mutex. This is
> >> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
> >> will be upstreamed.
> >> 
> >> Invalidation of TranslationBlocks:
> >> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
> >> it's jump cache itself as it is in CPUState so that can be handled by a simple
> >> call to async_run_on_cpu. However tb_invalidate also writes to the
> >> TranslationBlock which is shared as we have only one pool.
> >> Hence this part of invalidate requires all VCPUs to exit before it can be done.
> >> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
> >
> > What about the host MMU emulation ? Is that multithreaded ? It has
> > potential issues when doing things like dirty bit updates into guest
> > memory, those need to be done atomically. Also TLB invalidations on ARM
> > and PPC are global, so they will need to invalidate the remote SW TLBs
> > as well.
> >
> > Do you have a mechanism to synchronize with another thread ? IE, make it
> > pop out of TCG if already in and prevent it from getting in ? That way
> > you can "remotely" invalidate its TLB...
> >
> >> == Atomic instruction ==
> >> 
> >> For now only ARM on x64 is supported by using an cmpxchg instruction.
> >> Specifically the limitation of this approach is that it is harder to support
> >> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
> >> cmpxchg (we believe this could be the case for some PPC cores).
> >
> > Right, on the other hand 64-bit will do fine. But then x86 has 2-value
> > atomics nowadays, doesn't it ? And that will be hard to emulate on
> > anything. You might need to have some kind of global hashed lock list
> > used by atomics (hash the physical address) as a fallback if you don't
> > have a 1:1 match between host and guest capabilities.
> >
> > Cheers,
> > Ben.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11  9:22     ` Benjamin Herrenschmidt
@ 2015-08-11  9:29       ` Peter Maydell
  2015-08-11 10:09         ` Benjamin Herrenschmidt
  2015-08-11 19:22       ` Alex Bennée
  1 sibling, 1 reply; 81+ messages in thread
From: Peter Maydell @ 2015-08-11  9:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: mttcg, Mark Burton, Alvise Rigo, QEMU Developers,
	Guillaume Delbergue, Paolo Bonzini, Alex Bennée,
	KONRAD Frédéric

On 11 August 2015 at 10:22, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2015-08-11 at 08:54 +0100, Alex Bennée wrote:
>>
>> > How do you handle the memory model ? IE , ARM and PPC are OO while x86
>> > is (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating
>> > x86 on ARM or PPC will lead to problems unless you generate memory
>> > barriers with every load/store ..
>>
>> This is the next chunk of work. We have Alvise's LL/SC patches which
>> allow us to do proper emulation of ARMs Load/store exclusive behaviour
>> and any weak order target will have to use such constructs.
>
> God no ! You don't want to use ll/sc for dealing with weak ordering, you
> want to use barriers... ll/sc will allow you to deal with front-end
> things such as atomic inc/dec etc...

Agreed. ll/sc is atomicity, not ordering. (Conversely, barriers and
load-acquire/store-release are ordering, not atomicity...)

>> Currently the plan is to introduce a barrier TCG op which will translate
>> to the strongest backend barrier available.
>
> I would advocate at least two barriers, full barrier and write barrier,
> so at least when emulating ARM or PPC on x86, we don't actually send
> fences on every load/stores.

Is it possible in some of these combinations to use the load-acquire
and store-release instructions rather than explicit barriers?
(ARMv8 has those, which I think should be slightly less heavyweight
than explicit barriers everywhere, if the semantics line up.)
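
In C11 terms the distinction looks like this (a sketch; on ARMv8 the
acquire/release forms can compile to ldar/stlr rather than full dmb
barriers):

    #include <assert.h>
    #include <stdatomic.h>

    atomic_int flag;
    int data;

    void producer(void)
    {
        data = 42;
        /* release: earlier writes are visible once flag reads 1 */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void consumer(void)
    {
        /* acquire: later accesses cannot be reordered before this load */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0) {
        }
        assert(data == 42);   /* ordering guaranteed, atomicity not */
    }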

thanks
-- PMM

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-11  9:21         ` Peter Maydell
@ 2015-08-11  9:59           ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-11  9:59 UTC (permalink / raw)
  To: Peter Maydell
  Cc: mttcg, Mark Burton, Alvise Rigo, QEMU Developers,
	Guillaume Delbergue, Alex Bennée, Frederic Konrad



On 11/08/2015 11:21, Peter Maydell wrote:
> > > I think it's definitely not sufficient. Is user-mode multithread still
> > > working today?
> >
> > For some definition of "working", yes.  It's not sufficient, but it's a
> > good start.
> >
> > The main problem with user-mode multithreading is that there is no clear
> > lock hierarchy between mmap_lock and tb_lock.  But this is not a problem
> > for softmmu.
>
> And also that we don't have a serious design for the locking at all.
> I was hoping this would be something that would come out of the
> multithreaded-TCG work...

Indeed.  Fred made an effort to identify the translate-all.c functions
that need tb_lock, and there are only a few that need mmap_lock.  I'm
now trying to document it so that the patches are actually reviewable...

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11  9:29       ` Peter Maydell
@ 2015-08-11 10:09         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 81+ messages in thread
From: Benjamin Herrenschmidt @ 2015-08-11 10:09 UTC (permalink / raw)
  To: Peter Maydell
  Cc: mttcg, Mark Burton, Alvise Rigo, QEMU Developers,
	Guillaume Delbergue, Paolo Bonzini, Alex Bennée,
	KONRAD Frédéric

On Tue, 2015-08-11 at 10:29 +0100, Peter Maydell wrote:
> Is it possible in some of these combinations to use the load-acquire
> and store-release instructions rather than explicit barriers?
> (ARMv8 has those, which I think should be slightly less heavyweight
> than explicit barriers everywhere, if the semantics line up.)

Possibly though I suppose starting with a simple barrier and then adding
more fine-grained ones to TCG might be a workable path. IE, it shouldn't
be too hard.

The problem of course is that no two architectures have exactly the same
semantics :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 02/19] cpus: add tcg_exec_flag.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 02/19] cpus: add tcg_exec_flag fred.konrad
@ 2015-08-11 10:53   ` Paolo Bonzini
  2015-08-11 11:11     ` Frederic Konrad
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-11 10:53 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> @@ -583,5 +587,6 @@ int cpu_exec(CPUState *cpu)
>  
>      /* fail safe : never use current_cpu outside cpu_exec() */
>      current_cpu = NULL;
> +    tcg_cpu_allow_execution(cpu);

I don't think this is correct; safe_work_pending() is a much clearer
test.  I'll revert locally to the previous version to play more with the
code.

Paolo

>      return ret;
>  }
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 02/19] cpus: add tcg_exec_flag.
  2015-08-11 10:53   ` Paolo Bonzini
@ 2015-08-11 11:11     ` Frederic Konrad
  2015-08-11 12:57       ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-11 11:11 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue

On 11/08/2015 12:53, Paolo Bonzini wrote:
>
> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>> @@ -583,5 +587,6 @@ int cpu_exec(CPUState *cpu)
>>   
>>       /* fail safe : never use current_cpu outside cpu_exec() */
>>       current_cpu = NULL;
>> +    tcg_cpu_allow_execution(cpu);
> I don't think this is correct; safe_work_pending() is a much clearer
> test.  I'll revert locally to the previous version to play more with the
> code.
>
> Paolo

Yes definitely, but we might have a race if we just use safe_work_pending().

Fred
>
>>       return ret;
>>   }
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
                   ` (20 preceding siblings ...)
  2015-08-11  6:15 ` Benjamin Herrenschmidt
@ 2015-08-11 12:45 ` Paolo Bonzini
  2015-08-11 13:59   ` Frederic Konrad
  21 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-11 12:45 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue

On 10/08/2015 17:26, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> This is the 7th round of the MTTCG patch series.

Here is a list of issues that I found:

- tb_lock usage in tb_find_fast is complicated and introduces the need
for other complicated code such as the tb_invalidate callback.  Instead,
the tb locking should reuse the cpu-exec.c code for user-mode emulation,
with additional locking in the spots identified by Fred.

- tb_lock uses a recursive lock, but this is not necessary.  Did I ever
say I dislike recursive mutexes? :)  The wrappers
tb_lock()/tb_unlock()/tb_lock_reset() can catch recursive locking for
us, so it's not hard to do without it.

- code_bitmap is not protected by any mutex
(tb_invalidate_phys_page_fast is called with the iothread mutex taken,
but other users of code_bitmap do not use it).  Writes should be
protected by the tb_lock, reads by either tb_lock or RCU.

- memory barriers are probably required around accesses to
->exit_request (see the sketch after this list).  ->thread_kicked also
needs to be accessed with atomics, because async_run_{,safe_}on_cpu can
be called outside the big QEMU lock.

- the whole signal-based qemu_cpu_kick can just go away.  Just setting
tcg_exit_req and exit_request will kick the TCG thread.  The hairy Win32
SuspendThread/ResumeThread goes away too.  I suggest doing it now,
because proving it unnecessary is easier than proving it correct.

- user-mode emulation is broken (does not compile)

- the big QEMU lock is not taken anywhere for MMIO accesses that require
it (i.e. basically all of them)

- some code wants to be called _outside_ the big QEMU lock, for example
because it longjmps back to cpu_exec.  For example, I suspect that the
notdirty callbacks must be marked with memory_region_clear_global_locking.
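
For the ->exit_request/->thread_kicked item, the shape of the change is
roughly this (a sketch using the existing helpers from qemu/atomic.h;
exact placement is left for the next respin):

    /* kicker side, callable from any thread without the BQL: */
    atomic_mb_set(&cpu->exit_request, 1);
    atomic_mb_set(&cpu->tcg_exit_req, 1);

    /* TCG thread side, inside the execution loop: */
    if (atomic_read(&cpu->exit_request)) {
        atomic_set(&cpu->exit_request, 0);
        cpu->exception_index = EXCP_INTERRUPT;
        cpu_loop_exit(cpu);
    }

    /* and kick a halted VCPU at most once: */
    if (!atomic_xchg(&cpu->thread_kicked, 1)) {
        qemu_cpu_kick_thread(cpu);
    }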

I've started looking at them (and documenting the locking conventions
for functions), and I hope to post it to some git repo later this week.

Paolo

> [...]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 02/19] cpus: add tcg_exec_flag.
  2015-08-11 11:11     ` Frederic Konrad
@ 2015-08-11 12:57       ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-11 12:57 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 11/08/2015 13:11, Frederic Konrad wrote:
> On 11/08/2015 12:53, Paolo Bonzini wrote:
>>
>> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>>> @@ -583,5 +587,6 @@ int cpu_exec(CPUState *cpu)
>>>         /* fail safe : never use current_cpu outside cpu_exec() */
>>>       current_cpu = NULL;
>>> +    tcg_cpu_allow_execution(cpu);
>> I don't think this is correct; safe_work_pending() is a much clearer
>> test.  I'll revert locally to the previous version to play more with the
>> code.
>>
>> Paolo
> 
> Yes definitely, but we might have a race if we just use safe_work_pending().

The trick is to order the accesses correctly.  For example, cpu_exec
will check tcg_exit_req, then clear exit_request, then check
queued_work_first.  On the write side the order is the opposite:
queued_work_first must be written first, then exit_request, then
tcg_exit_req.

Here it is the same.  safe_work_pending must be incremented first to
prevent threads from entering cpu-exec.c; for those that are already in
there you write queued_safe_work_first, then exit_request, then
tcg_exit_req.  Similarly safe_work_pending must be decremented last.
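
A minimal sketch of that ordering (safe_work_pending as a counter and
queue_safe_work() are invented names here, not the series' actual code;
atomic_*/smp_* as in include/qemu/atomic.h):

    /* Write side: publish the work before requesting the exit. */
    atomic_inc(&safe_work_pending);   /* gate re-entry into cpu-exec.c */
    queue_safe_work(cpu, func, data); /* fill queued_safe_work_first   */
    smp_wmb();
    cpu->exit_request = 1;
    smp_wmb();
    cpu->tcg_exit_req = 1;

    /* Read side (cpu_exec): load tcg_exit_req, then exit_request, then
     * the work queue, with smp_rmb() between the loads; decrement
     * safe_work_pending only after the work has run. */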

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11 12:45 ` Paolo Bonzini
@ 2015-08-11 13:59   ` Frederic Konrad
  2015-08-11 14:10     ` Paolo Bonzini
  2015-08-12 15:19     ` Frederic Konrad
  0 siblings, 2 replies; 81+ messages in thread
From: Frederic Konrad @ 2015-08-11 13:59 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue

On 11/08/2015 14:45, Paolo Bonzini wrote:
> On 10/08/2015 17:26, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This is the 7th round of the MTTCG patch series.
Thanks for looking at this.
> Here is a list of issues that I found:
>
> - tb_lock usage in tb_find_fast is complicated and introduces the need
> for other complicated code such as the tb_invalidate callback.  Instead,
> the tb locking should reuse the cpu-exec.c code for user-mode emulation,
> with additional locking in the spots identified by Fred.
The reason for this is that locking around tb_find_fast just kills the 
performance.

>
> - tb_lock uses a recursive lock, but this is not necessary.  Did I ever
> say I dislike recursive mutexes? :)  The wrappers
> tb_lock()/tb_unlock()/tb_lock_reset() can catch recursive locking for
> us, so it's not hard to do without it.
True, this is definitely not nice, but it seems some code, e.g.
tb_invalidate, is called from different places with or without the
tb_lock held. We can probably take the lock in the caller instead.

>
> - code_bitmap is not protected by any mutex
> (tb_invalidate_phys_page_fast is called with the iothread mutex taken,
> but other users of code_bitmap do not use it).  Writes should be
> protected by the tb_lock, reads by either tb_lock or RCU.
Yes, good point; I missed this one.

> - memory barriers are probably required around accesses to
> ->exit_request.  ->thread_kicked also needs to be accessed with atomics,
> because async_run_{,safe_}on_cpu can be called outside the big QEMU lock.
>
> - the whole signal-based qemu_cpu_kick can just go away.  Just setting
> tcg_exit_req and exit_request will kick the TCG thread.  The hairy Win32
> SuspendThread/ResumeThread goes away too.  I suggest doing it now,
> because proving it unnecessary is easier than proving it correct.
Just setting tcg_exit_req and exit_request and signalling cpu->halt_cond,
I guess?

> - user-mode emulation is broken (does not compile)
>
> - the big QEMU lock is not taken anywhere for MMIO accesses that require
> it (i.e. basically all of them)
Isn't that handled by prepare_mmio_access?

>
> - some code wants to be called _outside_ the big QEMU lock, for example
> because it longjmps back to cpu_exec.  For example, I suspect that the
> notdirty callbacks must be marked with memory_region_clear_global_locking.

Fred
>
> I've started looking at them (and documenting the locking conventions
> for functions), and I hope to post it to some git repo later this week.
>
> Paolo
>
>> [...]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11 13:59   ` Frederic Konrad
@ 2015-08-11 14:10     ` Paolo Bonzini
  2015-08-12 15:19     ` Frederic Konrad
  1 sibling, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-11 14:10 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 11/08/2015 15:59, Frederic Konrad wrote:
>> - tb_lock usage in tb_find_fast is complicated and introduces the need
>> for other complicated code such as the tb_invalidate callback.  Instead,
>> the tb locking should reuse the cpu-exec.c code for user-mode emulation,
>> with additional locking in the spots identified by Fred.
> 
> The reason for this is that locking around tb_find_fast just kills the
> performance.

Let's make it correct first. :)

>> - the whole signal-based qemu_cpu_kick can just go away.  Just setting
>> tcg_exit_req and exit_request will kick the TCG thread.  The hairy Win32
>> SuspendThread/ResumeThread goes away too.  I suggest doing it now,
>> because proving it unnecessary is easier than proving it correct.
> 
> Just setting tcg_exit_req and exit_request and signalling cpu->halt_cond,
> I guess?

Yes.
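
As a rough sketch, once the signals are gone the kick could reduce to
something like this (halt_cond being per-CPU in this series;
illustrative only, not tested):

    static void kick_tcg_thread(CPUState *cpu)
    {
        cpu->exit_request = 1;                /* leave the exec loop      */
        cpu->tcg_exit_req = 1;                /* break out of chained TBs */
        qemu_cond_broadcast(cpu->halt_cond);  /* wake a halted VCPU       */
    }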

>> - the big QEMU lock is not taken anywhere for MMIO accesses that require
>> it (i.e. basically all of them)
> Isn't that handled by prepare_mmio_access?

That's not used on the TCG path (which calls
memory_region_dispatch_read/write directly).
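
So the TCG I/O helpers would need something along these lines (sketch
only, using the global_locking flag from the series referenced above):

    if (mr->global_locking) {
        qemu_mutex_lock_iothread();
        memory_region_dispatch_read(mr, addr, &val, size, attrs);
        qemu_mutex_unlock_iothread();
    } else {
        /* device opted out via memory_region_clear_global_locking() */
        memory_region_dispatch_read(mr, addr, &val, size, attrs);
    }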

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11  9:22     ` Benjamin Herrenschmidt
  2015-08-11  9:29       ` Peter Maydell
@ 2015-08-11 19:22       ` Alex Bennée
  1 sibling, 0 replies; 81+ messages in thread
From: Alex Bennée @ 2015-08-11 19:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, fred.konrad


Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Tue, 2015-08-11 at 08:54 +0100, Alex Bennée wrote:
>> 
>> > How do you handle the memory model ? IE , ARM and PPC are OO while x86
>> > is (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating
>> > x86 on ARM or PPC will lead to problems unless you generate memory
>> > barriers with every load/store ..
>> 
>> This is the next chunk of work. We have Alvise's LL/SC patches which
>> allow us to do proper emulation of ARMs Load/store exclusive behaviour
>> and any weak order target will have to use such constructs.
>
> God no ! You don't want to use ll/sc for dealing with weak ordering, you
> want to use barriers... ll/sc will allow you to deal with front-end
> things such as atomic inc/dec etc...

Sorry I wasn't clear - ll/sc is required to properly support atomic-like
ops on weakly ordered systems. So while it doesn't offer guarantees on
memory ordering, it does ensure you can safely do atomic operations.
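
Reduced to a sketch, the emulation is a compare-and-swap against the
value remembered at LDREX time (generic illustration, not the series'
actual helper):

    /* Succeeds only if memory still holds the value LDREX observed. */
    static bool strex_succeeds(uint32_t *haddr, uint32_t seen,
                               uint32_t newval)
    {
        return __atomic_compare_exchange_n(haddr, &seen, newval, false,
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST);
    }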

>> Currently the plan is to introduce a barrier TCG op which will translate
>> to the strongest backend barrier available.
>
> I would advocate at least two barriers, full barrier and write barrier,
> so at least when emulating ARM or PPC on x86, we don't actually send
> fences on every load/stores.

I was considering finer grained barriers as an optimisation step.
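
With invented names, the idea is a single front-end op that each
backend lowers as strongly as it needs to:

    tcg_gen_mb(TCG_MO_ALL);    /* guest full barrier: ARM dmb, PPC sync */
    tcg_gen_mb(TCG_MO_ST_ST);  /* write barrier: ARM dmb st, PPC lwsync */

    /* On the x86 backend only a store->load ordering requirement needs
     * an mfence; store->store order is already guaranteed, so the write
     * barrier can lower to nothing. */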

>
> IE. the x86 memory model is *not* fully ordered, so a ARM or PPC full
> barrier must translate into a x86 fence afaik (or whatever is the x86
> name of its full barrier), but you don't want to translate all ARM/PPC
> weaker barriers into those.
>
>>  Even x86 should be using barriers to ensure cross-core visibility which
>> then leaves LS re-ordering on the same core.
>
> Only for store + load, which is afaik the only case where x86
> re-orders.

To be clear, this is about non-dependent stores; a store to an address
that is then loaded on the same CPU shouldn't overtake that load.

I note that ARM, PPC and SPARC RMO are all broadly similar in their
models. Alpha seems to be the funky one but as we don't have a backend
for it I think we are OK.

> But in any case, expose to the target (TCG target) the ordering
> expectations of the source so that we can use whatever facilities might
> be at hand to avoid some of those barriers, for example the SAO mapping
> attribute I mentioned.
>
> I'll try to look at your patch more closely when I get a chance and see
> if I can produce a ppc target but don't hold your breath, I'm a bit
> swamped at the moment.

We haven't done anything on barriers yet. I've mostly been concentrating
on writing up the test cases to demonstrate the failures, but given the
x86 backend it is surprisingly hard to come up with a test case that will
fail. I suspect I need to port my ARM tests to x86 and run on the ARM
backend.

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution
  2015-08-10 16:15   ` Paolo Bonzini
  2015-08-11  6:55     ` Frederic Konrad
@ 2015-08-11 20:12     ` Alex Bennée
  2015-08-11 21:34       ` Frederic Konrad
  1 sibling, 1 reply; 81+ messages in thread
From: Alex Bennée @ 2015-08-11 20:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue, fred.konrad


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>>  void qemu_mutex_lock_iothread(void)
>>  {
>> -    atomic_inc(&iothread_requesting_mutex);
>> -    /* In the simple case there is no need to bump the VCPU thread out of
>> -     * TCG code execution.
>> -     */
>> -    if (!tcg_enabled() || qemu_in_vcpu_thread() ||
>> -        !first_cpu || !first_cpu->thread) {
>> -        qemu_mutex_lock(&qemu_global_mutex);
>> -        atomic_dec(&iothread_requesting_mutex);
>> -    } else {
>> -        if (qemu_mutex_trylock(&qemu_global_mutex)) {
>> -            qemu_cpu_kick_thread(first_cpu);
>> -            qemu_mutex_lock(&qemu_global_mutex);
>> -        }
>> -        atomic_dec(&iothread_requesting_mutex);
>> -        qemu_cond_broadcast(&qemu_io_proceeded_cond);
>> -    }
>> -    iothread_locked = true;
>
> "iothread_locked = true" must be kept.  Otherwise... yay! :)
>

Also if qemu_cond_broadcast(&qemu_io_proceeded_cond) is being dropped
there is no point keeping the guff around in qemu_tcg_wait_io_event.
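
With the kick dance and qemu_io_proceeded_cond both gone, the function
reduces to something like this (keeping the flag, as noted above):

    void qemu_mutex_lock_iothread(void)
    {
        qemu_mutex_lock(&qemu_global_mutex);
        iothread_locked = true;   /* must be kept */
    }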

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution
  2015-08-11 20:12     ` Alex Bennée
@ 2015-08-11 21:34       ` Frederic Konrad
  2015-08-12  9:58         ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-11 21:34 UTC (permalink / raw)
  To: Alex Bennée, Paolo Bonzini
  Cc: mttcg, guillaume.delbergue, mark.burton, qemu-devel, a.rigo

On 11/08/2015 22:12, Alex Bennée wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>
>> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>>>   void qemu_mutex_lock_iothread(void)
>>>   {
>>> -    atomic_inc(&iothread_requesting_mutex);
>>> -    /* In the simple case there is no need to bump the VCPU thread out of
>>> -     * TCG code execution.
>>> -     */
>>> -    if (!tcg_enabled() || qemu_in_vcpu_thread() ||
>>> -        !first_cpu || !first_cpu->thread) {
>>> -        qemu_mutex_lock(&qemu_global_mutex);
>>> -        atomic_dec(&iothread_requesting_mutex);
>>> -    } else {
>>> -        if (qemu_mutex_trylock(&qemu_global_mutex)) {
>>> -            qemu_cpu_kick_thread(first_cpu);
>>> -            qemu_mutex_lock(&qemu_global_mutex);
>>> -        }
>>> -        atomic_dec(&iothread_requesting_mutex);
>>> -        qemu_cond_broadcast(&qemu_io_proceeded_cond);
>>> -    }
>>> -    iothread_locked = true;
>> "iothread_locked = true" must be kept.  Otherwise... yay! :)
>>
> Also if qemu_cond_broadcast(&qemu_io_proceeded_cond) is being dropped
> there is no point keeping the guff around in qemu_tcg_wait_io_event.
>
Yes good point.

BTW this leads to high host CPU consumption, e.g. 100% per VCPU thread,
as the VCPU threads are no longer waiting for qemu_io_proceeded_cond.

Fred

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution
  2015-08-11 21:34       ` Frederic Konrad
@ 2015-08-12  9:58         ` Paolo Bonzini
  2015-08-12 12:32           ` Frederic Konrad
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-12  9:58 UTC (permalink / raw)
  To: Frederic Konrad, Alex Bennée
  Cc: mttcg, a.rigo, mark.burton, guillaume.delbergue, qemu-devel



On 11/08/2015 23:34, Frederic Konrad wrote:
>>>
>> Also if qemu_cond_broadcast(&qemu_io_proceeded_cond) is being dropped
>> there is no point keeping the guff around in qemu_tcg_wait_io_event.
>>
> Yes good point.
> 
> BTW this leads to high host CPU consumption, e.g. 100% per VCPU thread,
> as the VCPU threads are no longer waiting for qemu_io_proceeded_cond.

If the guest CPU is busy waiting, that's expected.  But if the guest CPU
is halted, it should not have 100% host CPU consumption.
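
That is, a halted VCPU thread should still block in its wait loop,
along the lines of (cpus.c-style sketch, halt_cond per-CPU here):

    while (cpu_thread_is_idle(cpu)) {
        qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
    }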

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution
  2015-08-12  9:58         ` Paolo Bonzini
@ 2015-08-12 12:32           ` Frederic Konrad
  0 siblings, 0 replies; 81+ messages in thread
From: Frederic Konrad @ 2015-08-12 12:32 UTC (permalink / raw)
  To: Paolo Bonzini, Alex Bennée
  Cc: mttcg, qemu-devel, mark.burton, a.rigo, guillaume.delbergue

On 12/08/2015 11:58, Paolo Bonzini wrote:
>
> On 11/08/2015 23:34, Frederic Konrad wrote:
>>> Also if qemu_cond_broadcast(&qemu_io_proceeded_cond) is being dropped
>>> there is no point keeping the guff around in qemu_tcg_wait_io_event.
>>>
>> Yes good point.
>>
>> BTW this leads to high host CPU consumption, e.g. 100% per VCPU thread,
>> as the VCPU threads are no longer waiting for qemu_io_proceeded_cond.
> If the guest CPU is busy waiting, that's expected.  But if the guest CPU
> is halted, it should not have 100% host CPU consumption.
>
> Paolo
>
Hmm so that's definitely strange... I mean theoretically it's the same
as before?

Another thing: it seems that we need to signal the VCPUs when the
iothread takes the lock, e.g.:

     /* kick every VCPU out of its execution loop so the iothread runs */
     if (tcg_enabled() && qemu_thread_is_self(&io_thread)) {
         CPU_FOREACH(cpu) {
             cpu_exit(cpu);
         }
     }

to make this patch work without MTTCG.

Fred

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe fred.konrad
  2015-08-10 16:26   ` Paolo Bonzini
@ 2015-08-12 14:09   ` Paolo Bonzini
  2015-08-12 14:11     ` Frederic Konrad
  1 sibling, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-12 14:09 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> tb_flush is not thread safe; we definitely need to exit VCPUs to do that.
> This introduces tb_flush_safe which just creates an async safe work which will
> do a tb_flush later.
> 
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

You could also allocate a new code buffer and free the old one with
call_rcu.  This should simplify things a lot.
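
A minimal sketch of that idea (CodeBuffer and the helpers are invented
names; call_rcu/rcu_head as in include/qemu/rcu.h):

    struct CodeBuffer {
        struct rcu_head rcu;
        void *ptr;                       /* the executable buffer */
    };

    static struct CodeBuffer *code_buffer;

    static void code_buffer_free(struct CodeBuffer *buf)
    {
        qemu_vfree(buf->ptr);            /* no VCPU can reach it now */
        g_free(buf);
    }

    static void code_buffer_replace(struct CodeBuffer *new_buf)
    {
        struct CodeBuffer *old = atomic_xchg(&code_buffer, new_buf);
        /* deferred until all readers leave their RCU read sections */
        call_rcu(old, code_buffer_free, rcu);
    }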

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe.
  2015-08-12 14:09   ` Paolo Bonzini
@ 2015-08-12 14:11     ` Frederic Konrad
  2015-08-12 14:14       ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-12 14:11 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue

On 12/08/2015 16:09, Paolo Bonzini wrote:
>
> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> tb_flush is not thread safe; we definitely need to exit VCPUs to do that.
>> This introduces tb_flush_safe which just creates an async safe work which will
>> do a tb_flush later.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> You could also allocate a new code buffer and free the old one with
> call_rcu.  This should simplify things a lot.
>
> Paolo
Depending on the size of the code buffer this might be a good idea. :)

Fred

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe.
  2015-08-12 14:11     ` Frederic Konrad
@ 2015-08-12 14:14       ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-12 14:14 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 12/08/2015 16:11, Frederic Konrad wrote:
>> You could also allocate a new code buffer and free the old one with
>> call_rcu.  This should simplify things a lot.
>
> Depending on the size of the code buffer this might be a good idea. :)

32 megabytes.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11 13:59   ` Frederic Konrad
  2015-08-11 14:10     ` Paolo Bonzini
@ 2015-08-12 15:19     ` Frederic Konrad
  2015-08-12 15:39       ` Paolo Bonzini
  1 sibling, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-12 15:19 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue

On 11/08/2015 15:59, Frederic Konrad wrote:
> On 11/08/2015 14:45, Paolo Bonzini wrote:
>> On 10/08/2015 17:26, fred.konrad@greensocs.com wrote:
>>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>>
>>> This is the 7th round of the MTTCG patch series.
> Thanks for looking at this.
>> Here is a list of issues that I found:
>>
>> - tb_lock usage in tb_find_fast is complicated and introduces the need
>> for other complicated code such as the tb_invalidate callback. Instead,
>> the tb locking should reuse the cpu-exec.c code for user-mode emulation,
>> with additional locking in the spots identified by Fred.
> The reason for this is that locking around tb_find_fast just kills the 
> performance.
>
>>
>> - tb_lock uses a recursive lock, but this is not necessary.  Did I ever
>> say I dislike recursive mutexes? :)  The wrappers
>> tb_lock()/tb_unlock()/tb_lock_reset() can catch recursive locking for
>> us, so it's not hard to do without it.
> True, this is definitely not nice, but it seems some code, e.g.
> tb_invalidate, is called from different places with or without the
> tb_lock held. We can probably take the lock in the caller instead.
>
>>
>> - code_bitmap is not protected by any mutex
>> (tb_invalidate_phys_page_fast is called with the iothread mutex taken,
>> but other users of code_bitmap do not use it).  Writes should be
>> protected by the tb_lock, reads by either tb_lock or RCU.
> Yes, good point; I missed this one.
>
>> - memory barriers are probably required around accesses to
>> ->exit_request.  ->thread_kicked also needs to be accessed with atomics,
>> because async_run_{,safe_}on_cpu can be called outside the big QEMU 
>> lock.
>>
>> - the whole signal-based qemu_cpu_kick can just go away.  Just setting
>> tcg_exit_req and exit_request will kick the TCG thread.  The hairy Win32
>> SuspendThread/ResumeThread goes away too.  I suggest doing it now,
>> because proving it unnecessary is easier than proving it correct.
> Just setting tcg_exit_req and exit_request and signalling cpu->halt_cond,
> I guess?
BTW that affects KVM as well. It seems this mechanism is also used with
qemu_cpu_kick_self(), which is a little strange, as SIG_IPI apparently
triggers only a dummy signal handler:

     memset(&sigact, 0, sizeof(sigact));
     sigact.sa_handler = dummy_signal;
     sigaction(SIG_IPI, &sigact, NULL);


>
>> - user-mode emulation is broken (does not compile)
>>
>> - the big QEMU lock is not taken anywhere for MMIO accesses that require
>> it (i.e. basically all of them)
> Isn't that handled by prepare_mmio_access?
>
>>
>> - some code wants to be called _outside_ the big QEMU lock, for example
>> because it longjmps back to cpu_exec.  For example, I suspect that the
>> notdirty callbacks must be marked with 
>> memory_region_clear_global_locking.
>
> Fred
>>
>> I've started looking at them (and documenting the locking conventions
>> for functions), and I hope to post it to some git repo later this week.
>>
>> Paolo
>>
>>> [...]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-12 15:19     ` Frederic Konrad
@ 2015-08-12 15:39       ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-12 15:39 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 12/08/2015 17:19, Frederic Konrad wrote:
> BTW that affects KVM as well. It seems this mechanism is also used with
> qemu_cpu_kick_self(), which is a little strange, as SIG_IPI apparently
> triggers only a dummy signal handler:
> 
>     memset(&sigact, 0, sizeof(sigact));
>     sigact.sa_handler = dummy_signal;
>     sigaction(SIG_IPI, &sigact, NULL);

KVM is different, the signal handler is used to kick the VM out of
KVM_RUN.  We're going to add another path (an ioctl) but it cannot use
the same code as TCG.

qemu_cpu_kick_self is needed in some special cases where KVM tells you
"call KVM_RUN asap" but you know you have more work to do in userspace.
Calling qemu_cpu_kick_self lets you call KVM_RUN and immediately return
to do the userspace work.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock fred.konrad
  2015-08-10 16:36   ` Paolo Bonzini
@ 2015-08-12 17:45   ` Frederic Konrad
  2015-08-12 18:20     ` Alex Bennée
  1 sibling, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-12 17:45 UTC (permalink / raw)
  To: qemu-devel, mttcg, alex.bennee
  Cc: pbonzini, mark.burton, a.rigo, guillaume.delbergue

On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> This protects TBContext with tb_lock to make tb_* thread safe.
>
> We can still have issue with tb_flush in case of multithread TCG:
>    An other CPU can be executing code during a flush.
>
> This can be fixed later by making all other TCG thread exiting before calling
> tb_flush().
>
> tb_find_slow is separated into tb_find_slow and tb_find_physical as the whole
> tb_find_slow doesn't require to lock the tb.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>
> Changes:
[...]
>   
> @@ -675,6 +710,7 @@ static inline void code_gen_alloc(size_t tb_size)
>               CODE_GEN_AVG_BLOCK_SIZE;
>       tcg_ctx.tb_ctx.tbs =
>               g_malloc(tcg_ctx.code_gen_max_blocks * sizeof(TranslationBlock));
> +    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>   }
>   
>   /* Must be called before using the QEMU cpus. 'tb_size' is the size
> @@ -699,16 +735,22 @@ bool tcg_enabled(void)
>       return tcg_ctx.code_gen_buffer != NULL;
>   }
>   
> -/* Allocate a new translation block. Flush the translation buffer if
> -   too many translation blocks or too much generated code. */
> +/*
> + * Allocate a new translation block. Flush the translation buffer if
> + * too many translation blocks or too much generated code.
> + * tb_alloc is not thread safe but tb_gen_code is protected by a mutex so this
> + * function is called only by one thread.
> + */
>   static TranslationBlock *tb_alloc(target_ulong pc)
>   {
> -    TranslationBlock *tb;
> +    TranslationBlock *tb = NULL;
>   
>       if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
>           (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
>            tcg_ctx.code_gen_buffer_max_size) {
> -        return NULL;
> +        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
> +        tb->pc = pc;
> +        tb->cflags = 0;

I missed this wrongly unreverted part, which in the end doesn't do a
tb_flush when required and crashes!
Fixing that allows me to boot with jessie and virt.
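
For reference, the intended behaviour (restoring the upstream logic) is
to return NULL when the limits are hit so that the caller can tb_flush:

    static TranslationBlock *tb_alloc(target_ulong pc)
    {
        TranslationBlock *tb;

        if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
            (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
             tcg_ctx.code_gen_buffer_max_size) {
            return NULL;          /* caller must flush and retry */
        }
        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
        tb->pc = pc;
        tb->cflags = 0;
        return tb;
    }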

Fred

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-12 17:45   ` Frederic Konrad
@ 2015-08-12 18:20     ` Alex Bennée
  2015-08-12 18:22       ` Paolo Bonzini
  2015-08-14  8:38       ` Frederic Konrad
  0 siblings, 2 replies; 81+ messages in thread
From: Alex Bennée @ 2015-08-12 18:20 UTC (permalink / raw)
  To: Frederic Konrad
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue, pbonzini


Frederic Konrad <fred.konrad@greensocs.com> writes:

> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This protects TBContext with tb_lock to make tb_* thread safe.
>>
>> We can still have issue with tb_flush in case of multithread TCG:
>>    An other CPU can be executing code during a flush.
>>
>> This can be fixed later by making all other TCG thread exiting before calling
>> tb_flush().
>>
>> tb_find_slow is separated into tb_find_slow and tb_find_physical as the whole
>> tb_find_slow doesn't require to lock the tb.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> Changes:
> [...]
>>   
>> @@ -675,6 +710,7 @@ static inline void code_gen_alloc(size_t tb_size)
>>               CODE_GEN_AVG_BLOCK_SIZE;
>>       tcg_ctx.tb_ctx.tbs =
>>               g_malloc(tcg_ctx.code_gen_max_blocks * sizeof(TranslationBlock));
>> +    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>>   }
>>   
>>   /* Must be called before using the QEMU cpus. 'tb_size' is the size
>> @@ -699,16 +735,22 @@ bool tcg_enabled(void)
>>       return tcg_ctx.code_gen_buffer != NULL;
>>   }
>>   
>> -/* Allocate a new translation block. Flush the translation buffer if
>> -   too many translation blocks or too much generated code. */
>> +/*
>> + * Allocate a new translation block. Flush the translation buffer if
>> + * too many translation blocks or too much generated code.
>> + * tb_alloc is not thread safe but tb_gen_code is protected by a mutex so this
>> + * function is called only by one thread.
>> + */
>>   static TranslationBlock *tb_alloc(target_ulong pc)
>>   {
>> -    TranslationBlock *tb;
>> +    TranslationBlock *tb = NULL;
>>   
>>       if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
>>           (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
>>            tcg_ctx.code_gen_buffer_max_size) {
>> -        return NULL;
>> +        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
>> +        tb->pc = pc;
>> +        tb->cflags = 0;
>
> I missed this wrongly unreverted part, which in the end doesn't do a
> tb_flush when required and crashes!
> Fixing that allows me to boot with jessie and virt.

\o/

Do you see crashes while it is running?

It's interesting that I've not had a problem booting jessie with virt
though - just crashes while hanging.

Are you likely to push a v8 this week (or a temp branch?) with this and
any other obvious fixes? I appreciate Paolo has given you a not-so-small
pile of review comments as well so I wasn't looking for a complete new
patch set!


>
> Fred

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-12 18:20     ` Alex Bennée
@ 2015-08-12 18:22       ` Paolo Bonzini
  2015-08-14  8:38       ` Frederic Konrad
  1 sibling, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-12 18:22 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark burton, a rigo, qemu-devel, guillaume delbergue,
	Frederic Konrad


> Are you likely to push a v8 this week (or a temp branch?) with this and
> any other obvious fixes? I appreciate Paolo has given you a not-so-small
> pile of review comments as well so I wasn't looking for a complete new
> patch set!

FWIW, reviews of the patches I posted an hour or two ago are welcome.
Most of them are either very simple, just documentation, or straight
out of v7.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread.
  2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread fred.konrad
@ 2015-08-13 11:17   ` Paolo Bonzini
  2015-08-13 14:41     ` Frederic Konrad
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-13 11:17 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: alex.bennee, mark.burton, a.rigo, guillaume.delbergue



On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
> +    while (!cpu->exit_request) {
>          qemu_clock_enable(QEMU_CLOCK_VIRTUAL,
>                            (cpu->singlestep_enabled & SSTEP_NOTIMER) == 0);
>  
> @@ -1507,7 +1480,7 @@ static void tcg_exec_all(void)
>          }
>      }
>  
> -    first_cpu->exit_request = 0;
> +    cpu->exit_request = 0;

One issue here is that when tcg_cpu_exec returns EXCP_HALTED, the
function keeps looping.  There is no need to set cpu->exit_request in
that case, since in fact there is no request pending, so the while loop
probably should be an "if".

Also, cpu->interrupt_request is not protected by any mutex, so
everything apart from the non-zero test must take the iothread mutex.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread.
  2015-08-13 11:17   ` Paolo Bonzini
@ 2015-08-13 14:41     ` Frederic Konrad
  2015-08-13 14:58       ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-13 14:41 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue

On 13/08/2015 13:17, Paolo Bonzini wrote:
>
> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>> +    while (!cpu->exit_request) {
>>           qemu_clock_enable(QEMU_CLOCK_VIRTUAL,
>>                             (cpu->singlestep_enabled & SSTEP_NOTIMER) == 0);
>>   
>> @@ -1507,7 +1480,7 @@ static void tcg_exec_all(void)
>>           }
>>       }
>>   
>> -    first_cpu->exit_request = 0;
>> +    cpu->exit_request = 0;
> One issue here is that when tcg_cpu_exec returns EXCP_HALTED, the
> function keeps looping.  There is no need to set cpu->exit_request in
> that case, since in fact there is no request pending, so the while loop
> probably should be an "if".
Nice catch thanks!

I missed the fact that it was running through the list of VCPUs and 
exited the
for(;;) loop.

I should rework this patch a little. Maybe it's better to keep this
loop and exit it when necessary, e.g. when the icount budget elapses or
the CPU is halted.

Fred

>
> Also, cpu->interrupt_request is not protected by any mutex, so
> everything apart from the non-zero test must take the iothread mutex.
>
> Paolo
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread.
  2015-08-13 14:41     ` Frederic Konrad
@ 2015-08-13 14:58       ` Paolo Bonzini
  2015-08-13 15:18         ` Frederic Konrad
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-13 14:58 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue



On 13/08/2015 16:41, Frederic Konrad wrote:
>>>
>> One issue here is that when tcg_cpu_exec returns EXCP_HALTED, the
>> function keeps looping.  There is no need to set cpu->exit_request in
>> that case, since in fact there is no request pending, so the while loop
>> probably should be an "if".
> Nice catch thanks!
> 
> I missed the fact that it was running through the list of VCPUs and
> exited the
> for(;;) loop.
> 
> I should rework this patch a little. Maybe it's better to keep this
> loop and exit it when necessary, e.g. when the icount budget elapses or
> the CPU is halted.

Yeah, I don't have a particularly strong opinion on that.  You can look
at my mttcg github branch for my rebase on top of yesterday's series.
It seems to work at least on the small GreenSoCs buildroot image.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread.
  2015-08-13 14:58       ` Paolo Bonzini
@ 2015-08-13 15:18         ` Frederic Konrad
  0 siblings, 0 replies; 81+ messages in thread
From: Frederic Konrad @ 2015-08-13 15:18 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: mark.burton, alex.bennee, a.rigo, guillaume.delbergue

On 13/08/2015 16:58, Paolo Bonzini wrote:
>
> On 13/08/2015 16:41, Frederic Konrad wrote:
>>> One issue here is that when tcg_cpu_exec returns EXCP_HALTED, the
>>> function keeps looping.  There is no need to set cpu->exit_request in
>>> that case, since in fact there is no request pending, so the while loop
>>> probably should be an "if".
>> Nice catch, thanks!
>>
>> I missed the fact that it was running through the list of VCPUs and
>> exited the for(;;) loop.
>>
>> I should rework this patch a little. Maybe it's better to keep this
>> loop and exit it when necessary, e.g. when the icount budget expires or
>> the CPU is halted.
> Yeah, I don't have a particularly strong opinion on that.  You can look
> at my mttcg github branch for my rebase on top of yesterday's series.
> It seems to work at least on the small GreenSoCs buildroot image.
>
> Paolo

There still seems to be something wrong with
memory_region_rom_device_set_romd, or something more general.
I'm trying to track it down.

Fred

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-12 18:20     ` Alex Bennée
  2015-08-12 18:22       ` Paolo Bonzini
@ 2015-08-14  8:38       ` Frederic Konrad
  2015-08-15  0:04         ` Paolo Bonzini
  1 sibling, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-08-14  8:38 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue, pbonzini

On 12/08/2015 20:20, Alex Bennée wrote:
> Frederic Konrad <fred.konrad@greensocs.com> writes:
>
>> On 10/08/2015 17:27, fred.konrad@greensocs.com wrote:
>>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>>
>>> This protects TBContext with tb_lock to make tb_* thread safe.
>>>
>>> We can still have issue with tb_flush in case of multithread TCG:
>>>     An other CPU can be executing code during a flush.
>>>
>>> This can be fixed later by making all other TCG thread exiting before calling
>>> tb_flush().
>>>
>>> tb_find_slow is separated into tb_find_slow and tb_find_physical as the whole
>>> tb_find_slow doesn't require to lock the tb.
>>>
>>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>>>
>>> Changes:
>> [...]
>>>    
>>> @@ -675,6 +710,7 @@ static inline void code_gen_alloc(size_t tb_size)
>>>                CODE_GEN_AVG_BLOCK_SIZE;
>>>        tcg_ctx.tb_ctx.tbs =
>>>                g_malloc(tcg_ctx.code_gen_max_blocks * sizeof(TranslationBlock));
>>> +    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>>>    }
>>>    
>>>    /* Must be called before using the QEMU cpus. 'tb_size' is the size
>>> @@ -699,16 +735,22 @@ bool tcg_enabled(void)
>>>        return tcg_ctx.code_gen_buffer != NULL;
>>>    }
>>>    
>>> -/* Allocate a new translation block. Flush the translation buffer if
>>> -   too many translation blocks or too much generated code. */
>>> +/*
>>> + * Allocate a new translation block. Flush the translation buffer if
>>> + * too many translation blocks or too much generated code.
>>> + * tb_alloc is not thread safe but tb_gen_code is protected by a mutex so this
>>> + * function is called only by one thread.
>>> + */
>>>    static TranslationBlock *tb_alloc(target_ulong pc)
>>>    {
>>> -    TranslationBlock *tb;
>>> +    TranslationBlock *tb = NULL;
>>>    
>>>        if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
>>>            (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
>>>             tcg_ctx.code_gen_buffer_max_size) {
>>> -        return NULL;
>>> +        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
>>> +        tb->pc = pc;
>>> +        tb->cflags = 0;
>> I missed this wrongly unreverted hunk, which in the end doesn't do a
>> tb_flush when required and crashes!
>> Fixing that allows me to boot with jessie and virt.
> \o/
>
> Do you see crashes while it is running?
>
> It's interesting that I've not had a problem booting jessie with virt
> though - just crashes while hanging.
>
> Are you likely to push a v8 this week (or a temp branch?) with this and
> any other obvious fixes? I appreciate Paolo has given you a not-so-small
> pile of review comments as well so I wasn't looking for a complete new
> patch set!
Here is something I did yesterday: the multi_tcg_v7_bugfixed branch.

The patch-set is a mess and not rebased on the series Paolo sent.
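
For reference, the fix to the hunk above is essentially to restore the
early return, so that tb_gen_code() (which holds tb_lock) sees NULL,
does a tb_flush() and retries. A sketch reconstructed from the '-'
lines of the quoted diff:

    /* Return NULL when the TB pool or code buffer is exhausted,
     * instead of writing past code_gen_max_blocks. */
    static TranslationBlock *tb_alloc(target_ulong pc)
    {
        TranslationBlock *tb;

        if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
            (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
             tcg_ctx.code_gen_buffer_max_size) {
            return NULL;
        }
        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
        tb->pc = pc;
        tb->cflags = 0;
        return tb;
    }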

Fred


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock.
  2015-08-14  8:38       ` Frederic Konrad
@ 2015-08-15  0:04         ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-08-15  0:04 UTC (permalink / raw)
  To: Frederic Konrad, Alex Bennée
  Cc: mttcg, guillaume.delbergue, mark.burton, qemu-devel, a.rigo



On 14/08/2015 10:38, Frederic Konrad wrote:
>> Are you likely to push a v8 this week (or a temp branch?) with this and
>> any other obvious fixes? I appreciate Paolo has given you a not-so-small
>> pile of review comments as well so I wasn't looking for a complete new
>> patch set!
> Here is something I did yesterday: the multi_tcg_v7_bugfixed branch.
> 
> The patch-set is a mess and not rebased on the series Paolo sent.

My own mttcg branch is probably pretty close to a v8, except that some
of the TODO/FIXMEs I added should not be there. :)

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-08-11  6:27   ` Frederic Konrad
@ 2015-10-07 12:46     ` Claudio Fontana
  2015-10-07 14:52       ` Frederic Konrad
  0 siblings, 1 reply; 81+ messages in thread
From: Claudio Fontana @ 2015-10-07 12:46 UTC (permalink / raw)
  To: Frederic Konrad, Benjamin Herrenschmidt
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue,
	pbonzini, alex.bennee

Hello Frederic,

On 11.08.2015 08:27, Frederic Konrad wrote:
> On 11/08/2015 08:15, Benjamin Herrenschmidt wrote:
>> On Mon, 2015-08-10 at 17:26 +0200, fred.konrad@greensocs.com wrote:
>>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>>
>>> This is the 7th round of the MTTCG patch series.
>>>
>>>
>>> It can be cloned from:
>>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.

Would it be possible to rebase on the latest QEMU? I wonder if mttcg is diverging a bit too much from
mainline, which will make it more difficult to rebase later. (Or did I get confused about all these repos?)

Thank you!

Claudio

>>>
>>> This patch-set try to address the different issues in the global picture of
>>> MTTCG, presented on the wiki.
>>>
>>> == Needed patch for our work ==
>>>
>>> Some preliminaries are needed for our work:
>>>   * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>>>     the CPUState.
>> Can't you just make it a TLS ?
> 
> True that can be done as well. But the tcg_exec_flags has a second meaning saying
> "you can't start executing code right now because I want to do a safe_work".
>>
>>>   * We need to run some work safely when all VCPUs are outside their execution
>>>     loop. This is done with the async_run_safe_work_on_cpu function introduced
>>>     in this series.
>>>   * QemuSpin lock is introduced (on posix only yet) to allow a faster handling of
>>>     atomic instruction.
>> How do you handle the memory model ? IE , ARM and PPC are OO while x86
>> is (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating
>> x86 on ARM or PPC will lead to problems unless you generate memory
>> barriers with every load/store ..
> 
> For the moment we are trying to do the first case.
>>
>> At least on POWER7 and later on PPC we have the possibility of setting
>> the attribute "Strong Access Ordering" with mremap/mprotect (I dont'
>> remember which one) which gives us x86-like memory semantics...
>>
>> I don't know if ARM supports something similar. On the other hand, when
>> emulating ARM on PPC or vice-versa, we can probably get away with no
>> barriers.
>>
>> Do you expose some kind of guest memory model info to the TCG backend so
>> it can decide how to handle these things ?
>>
>>> == Code generation and cache ==
>>>
>>> As Qemu stands, there is no protection at all against two threads attempting to
>>> generate code at the same time or modifying a TranslationBlock.
>>> The "protect TBContext with tb_lock" patch address the issue of code generation
>>> and makes all the tb_* function thread safe (except tb_flush).
>>> This raised the question of one or multiple caches. We choosed to use one
>>> unified cache because it's easier as a first step and since the structure of
>>> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
>>> don't see the benefit of having two pools of tbs.
>>>
>>> == Dirty tracking ==
>>>
>>> Protecting the IOs:
>>> To allows all VCPUs threads to run at the same time we need to drop the
>>> global_mutex as soon as possible. The io access need to take the mutex. This is
>>> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>>> will be upstreamed.
>>>
>>> Invalidation of TranslationBlocks:
>>> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
>>> it's jump cache itself as it is in CPUState so that can be handled by a simple
>>> call to async_run_on_cpu. However tb_invalidate also writes to the
>>> TranslationBlock which is shared as we have only one pool.
>>> Hence this part of invalidate requires all VCPUs to exit before it can be done.
>>> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
>> What about the host MMU emulation ? Is that multithreaded ? It has
>> potential issues when doing things like dirty bit updates into guest
>> memory, those need to be done atomically. Also TLB invalidations on ARM
>> and PPC are global, so they will need to invalidate the remote SW TLBs
>> as well.
>>
>> Do you have a mechanism to synchronize with another thread ? IE, make it
>> pop out of TCG if already in and prevent it from getting in ? That way
>> you can "remotely" invalidate its TLB...
> Yes, that's what the safe_work is doing: ask everybody to exit, prevent VCPUs
> from resuming (tcg_exec_flag), and do the work when everybody is outside cpu-exec.
> 
>>
>>> == Atomic instruction ==
>>>
>>> For now only ARM on x64 is supported by using an cmpxchg instruction.
>>> Specifically the limitation of this approach is that it is harder to support
>>> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
>>> cmpxchg (we believe this could be the case for some PPC cores).
>> Right, on the other hand 64-bit will do fine. But then x86 has 2-value
>> atomics nowadays, doesn't it ? And that will be hard to emulate on
>> anything. You might need to have some kind of global hashed lock list
>> used by atomics (hash the physical address) as a fallback if you don't
>> have a 1:1 match between host and guest capabilities.
> VOS did a "Slow path for atomic instruction translation" series you can find here:
> https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg00971.html
> 
> Which will be used in the end.
> 
> Thanks,
> Fred
>>
>> Cheers,
>> Ben.
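
As a purely illustrative sketch of the hashed-lock fallback Ben
describes above (all names here are hypothetical; the real design is in
the "Slow path for atomic instruction translation" series linked above,
and QemuMutex/hwaddr are the existing QEMU types):

    #define EXCL_LOCK_BITS 8
    /* one-time initialized pool of locks (init code not shown) */
    static QemuMutex excl_locks[1 << EXCL_LOCK_BITS];

    static inline QemuMutex *excl_lock_for(hwaddr paddr)
    {
        /* hash the guest physical address down to one lock; dropping
         * the low bits maps a small region onto the same lock */
        return &excl_locks[(paddr >> 4) & ((1 << EXCL_LOCK_BITS) - 1)];
    }

    /* usage sketch: a 64-bit guest cmpxchg on a host without a native
     * 64-bit cmpxchg */
    static bool cmpxchg64_fallback(hwaddr pa, uint64_t *mem,
                                   uint64_t cmp, uint64_t newval)
    {
        bool ok;

        qemu_mutex_lock(excl_lock_for(pa));
        ok = (*mem == cmp);
        if (ok) {
            *mem = newval;
        }
        qemu_mutex_unlock(excl_lock_for(pa));
        return ok;
    }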

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-10-07 12:46     ` Claudio Fontana
@ 2015-10-07 14:52       ` Frederic Konrad
  2015-10-21 15:09         ` Claudio Fontana
  0 siblings, 1 reply; 81+ messages in thread
From: Frederic Konrad @ 2015-10-07 14:52 UTC (permalink / raw)
  To: Claudio Fontana, Benjamin Herrenschmidt
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue,
	pbonzini, alex.bennee

Hi Claudio,

I'll rebase soon, tomorrow with a bit of luck ;).

Thanks,
Fred

On 07/10/2015 14:46, Claudio Fontana wrote:
> Hello Frederic,
>
> On 11.08.2015 08:27, Frederic Konrad wrote:
>> On 11/08/2015 08:15, Benjamin Herrenschmidt wrote:
>>> On Mon, 2015-08-10 at 17:26 +0200, fred.konrad@greensocs.com wrote:
>>>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>>>
>>>> This is the 7th round of the MTTCG patch series.
>>>>
>>>>
>>>> It can be cloned from:
>>>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
> Would it be possible to rebase on the latest QEMU? I wonder if mttcg is diverging a bit too much from
> mainline, which will make it more difficult to rebase later. (Or did I get confused about all these repos?)
>
> Thank you!
>
> Claudio
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
  2015-10-07 14:52       ` Frederic Konrad
@ 2015-10-21 15:09         ` Claudio Fontana
  0 siblings, 0 replies; 81+ messages in thread
From: Claudio Fontana @ 2015-10-21 15:09 UTC (permalink / raw)
  To: Frederic Konrad, Benjamin Herrenschmidt
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue,
	pbonzini, alex.bennee

On 07.10.2015 16:52, Frederic Konrad wrote:
> Hi Claudio,
> 
> I'll rebase soon, tomorrow with a bit of luck ;).
> 
> Thanks,
> Fred

A respectful ping on this one :-)

I am looking at http://git.greensocs.com/fkonrad/mttcg.git,
branch multi_tcg_v7_bugfixed; is there anything new?

Ciao,

Claudio


-- 
Claudio Fontana
Server Virtualization Architect
Huawei Technologies Duesseldorf GmbH
Riesstraße 25 - 80992 München

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2015-10-21 15:10 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-10 15:26 [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG fred.konrad
2015-08-10 15:26 ` [Qemu-devel] [RFC PATCH V7 01/19] cpus: protect queued_work_* with work_mutex fred.konrad
2015-08-10 15:59   ` Paolo Bonzini
2015-08-10 16:04     ` Frederic Konrad
2015-08-10 16:06       ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 02/19] cpus: add tcg_exec_flag fred.konrad
2015-08-11 10:53   ` Paolo Bonzini
2015-08-11 11:11     ` Frederic Konrad
2015-08-11 12:57       ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 03/19] cpus: introduce async_run_safe_work_on_cpu fred.konrad
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 04/19] replace spinlock by QemuMutex fred.konrad
2015-08-10 16:09   ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 05/19] remove unused spinlock fred.konrad
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 06/19] add support for spin lock on POSIX systems exclusively fred.konrad
2015-08-10 16:10   ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 07/19] protect TBContext with tb_lock fred.konrad
2015-08-10 16:36   ` Paolo Bonzini
2015-08-10 16:50     ` Paolo Bonzini
2015-08-10 18:39       ` Alex Bennée
2015-08-11  8:31         ` Paolo Bonzini
2015-08-11  6:46     ` Frederic Konrad
2015-08-11  8:34       ` Paolo Bonzini
2015-08-11  9:21         ` Peter Maydell
2015-08-11  9:59           ` Paolo Bonzini
2015-08-12 17:45   ` Frederic Konrad
2015-08-12 18:20     ` Alex Bennée
2015-08-12 18:22       ` Paolo Bonzini
2015-08-14  8:38       ` Frederic Konrad
2015-08-15  0:04         ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 08/19] tcg: remove tcg_halt_cond global variable fred.konrad
2015-08-10 16:12   ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 09/19] Drop global lock during TCG code execution fred.konrad
2015-08-10 16:15   ` Paolo Bonzini
2015-08-11  6:55     ` Frederic Konrad
2015-08-11 20:12     ` Alex Bennée
2015-08-11 21:34       ` Frederic Konrad
2015-08-12  9:58         ` Paolo Bonzini
2015-08-12 12:32           ` Frederic Konrad
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 10/19] cpu: remove exit_request global fred.konrad
2015-08-10 15:51   ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 11/19] tcg: switch on multithread fred.konrad
2015-08-13 11:17   ` Paolo Bonzini
2015-08-13 14:41     ` Frederic Konrad
2015-08-13 14:58       ` Paolo Bonzini
2015-08-13 15:18         ` Frederic Konrad
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 12/19] Use atomic cmpxchg to atomically check the exclusive value in a STREX fred.konrad
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 13/19] add a callback when tb_invalidate is called fred.konrad
2015-08-10 16:52   ` Paolo Bonzini
2015-08-10 18:41     ` Alex Bennée
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 14/19] cpu: introduce tlb_flush*_all fred.konrad
2015-08-10 15:54   ` Paolo Bonzini
2015-08-10 16:00     ` Peter Maydell
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 15/19] arm: use tlb_flush*_all fred.konrad
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 16/19] translate-all: introduces tb_flush_safe fred.konrad
2015-08-10 16:26   ` Paolo Bonzini
2015-08-12 14:09   ` Paolo Bonzini
2015-08-12 14:11     ` Frederic Konrad
2015-08-12 14:14       ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 17/19] translate-all: (wip) use tb_flush_safe when we can't alloc more tb fred.konrad
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 18/19] mttcg: signal the associated cpu anyway fred.konrad
2015-08-10 15:51   ` Paolo Bonzini
2015-08-10 15:27 ` [Qemu-devel] [RFC PATCH V7 19/19] target-arm/psci.c: wake up sleeping CPUs (MTTCG) fred.konrad
2015-08-10 16:41   ` Paolo Bonzini
2015-08-10 18:38     ` Alex Bennée
2015-08-10 18:34 ` [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG Alex Bennée
2015-08-10 23:02   ` Frederic Konrad
2015-08-11  6:15 ` Benjamin Herrenschmidt
2015-08-11  6:27   ` Frederic Konrad
2015-10-07 12:46     ` Claudio Fontana
2015-10-07 14:52       ` Frederic Konrad
2015-10-21 15:09         ` Claudio Fontana
2015-08-11  7:54   ` Alex Bennée
2015-08-11  9:22     ` Benjamin Herrenschmidt
2015-08-11  9:29       ` Peter Maydell
2015-08-11 10:09         ` Benjamin Herrenschmidt
2015-08-11 19:22       ` Alex Bennée
2015-08-11 12:45 ` Paolo Bonzini
2015-08-11 13:59   ` Frederic Konrad
2015-08-11 14:10     ` Paolo Bonzini
2015-08-12 15:19     ` Frederic Konrad
2015-08-12 15:39       ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.