* [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG.
@ 2015-06-26 14:47 fred.konrad
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 01/18] cpu: make cpu_thread_is_idle public fred.konrad
                   ` (17 more replies)
  0 siblings, 18 replies; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This is the 6th round of the MTTCG patch series, with hopefully a lot of
improvements since the last version. Basically, the atomic patch has been
significantly improved, some issues have been fixed, and the speed has
increased.

It can be cloned from:
git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v6.

This patch set tries to address the different issues in the global picture of
MTTCG presented on the wiki.

== Needed patch for our work ==

Some preliminaries are needed for our work:
 * current_cpu doesn't make sense in MTTCG, so a tcg_executing flag is added
   to the CPUState.
 * We need to run some work safely while all VCPUs are outside their execution
   loop. This is done with the async_run_safe_work_on_cpu function introduced
   in this series; a short sketch of its use follows this list.
 * A QemuSpin lock is introduced (POSIX only for now) to allow faster handling
   of atomic instructions.
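
As an illustration, here is a minimal sketch of how the safe-work mechanism is
meant to be used; the signature is assumed to mirror async_run_on_cpu, so see
the actual patches for the authoritative interface:

  /* Runs only once every VCPU has left its execution loop. */
  static void do_tb_flush_safe(void *data)
  {
      tb_flush((CPUArchState *)data);
  }

  /* ... somewhere with a CPUState *cpu and CPUArchState *env at hand: */
  async_run_safe_work_on_cpu(cpu, do_tb_flush_safe, env);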

== Code generation and cache ==

As QEMU stands, there is no protection at all against two threads generating
code at the same time or modifying the same TranslationBlock.
The "protect TBContext with tb_lock" patch addresses the issue of code
generation and makes all the tb_* functions thread safe (except tb_flush).
This raised the question of one or multiple caches. We chose to use one
unified cache because it's easier as a first step, and since the structure of
QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
don't see the benefit of having two pools of TBs.
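
The resulting lookup pattern (from the "protect TBContext with tb_lock" patch
below) probes the shared pool without the lock first, and only takes tb_lock
to re-check and translate when the TB is missing:

  tb = tb_find_physical(env, pc, cs_base, flags);
  if (!tb) {
      tb_lock();
      /* Re-check: another VCPU may have translated it meanwhile. */
      tb = tb_find_physical(env, pc, cs_base, flags);
      if (!tb) {
          tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
      }
      tb_unlock();
  }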

== Dirty tracking ==

Protecting the IOs:
To allow all VCPU threads to run at the same time we need to drop the
global_mutex as soon as possible. The IO accesses then need to take the mutex.
This is likely to change once
http://thread.gmane.org/gmane.comp.emulators.qemu/345258 is upstreamed.
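
Concretely (see the "Drop global lock during TCG code execution" patch below),
the lock is released around the TCG execution loop and re-taken only while
dispatching to devices:

  qemu_mutex_unlock_iothread();
  ret = cpu_exec(env);
  qemu_mutex_lock_iothread();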

Invalidation of TranslationBlocks:
We can have all VCPUs running during an invalidation. Each VCPU is able to
clean its own jump cache, as it lives in CPUState, so that part can be handled
by a simple call to async_run_on_cpu. However tb_invalidate also writes to the
TranslationBlock, which is shared as we have only one pool.
This part of the invalidation therefore requires all VCPUs to exit their
execution loop before it can be done, which is why async_run_safe_work_on_cpu
is introduced.
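
A rough sketch of the split (the helper shown here is illustrative, not the
series' exact code):

  /* Per-CPU part: each VCPU scrubs its own jump cache asynchronously. */
  static void clear_jmp_cache(void *data)
  {
      CPUState *cpu = data;

      memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
  }

  CPUState *cpu;

  CPU_FOREACH(cpu) {
      async_run_on_cpu(cpu, clear_jmp_cache, cpu);
  }
  /* The write to the shared TranslationBlock itself is queued through
   * async_run_safe_work_on_cpu instead. */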

== Atomic instructions ==

For now only ARM on x64 is supported, by using a cmpxchg instruction.
Specifically, the limitation of this approach is that it is harder to support
64-bit ARM on a host architecture that is multi-core but only supports 32-bit
cmpxchg (we believe this could be the case for some PPC cores). For now this
case is not correctly handled: the existing atomic patch will attempt to
execute the 64-bit cmpxchg functionality in a non-thread-safe fashion. Our
intention is to provide a new multithreaded ARM atomic patch for 64-bit ARM on
effectively 32-bit hosts.
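
The idea behind the cmpxchg-based helper, reduced to its core (a sketch, not
the series' exact code; 'haddr' stands for a host pointer resolved from the
guest exclusive address):

  static uint32_t strex_w_sketch(CPUARMState *env, uint32_t *haddr,
                                 uint32_t newval)
  {
      uint32_t expected = env->exclusive_val;

      /* The store succeeds only if memory still holds the value read
       * by LDREX, checked in a single atomic operation. */
      if (atomic_cmpxchg(haddr, expected, newval) == expected) {
          return 0;  /* store succeeded */
      }
      return 1;      /* another CPU raced us; the guest retries */
  }
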
This atomic instruction part has been tested with Alexander's atomic stress
test, available here:
https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html

The execution is a little slower than upstream, probably because of the
different VCPUs fighting for the mutex. Swapping arm_exclusive_lock from a
mutex to a spin lock considerably reduces the difference.
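
Because the critical section around the exclusive monitor is only a few loads
and stores, a spin lock avoids the sleep/wake cost a contended mutex pays. A
sketch of the result using the QemuSpin API from patch 04 (the monitor fields
touched here are illustrative):

  static QemuSpin arm_exclusive_lock;

  static void set_exclusive(CPUARMState *env, uint64_t addr, uint64_t val)
  {
      qemu_spin_lock(&arm_exclusive_lock);
      env->exclusive_addr = addr;   /* illustrative monitor update */
      env->exclusive_val = val;
      qemu_spin_unlock(&arm_exclusive_lock);
  }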

== Testing ==

A simple double-dhrystone test in SMP 2 with vexpress-a15 in a Linux guest
shows a good performance improvement: it takes roughly 18s upstream to
complete vs 10s with MTTCG.

Testing image is available here:
https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3

Then simply:
./configure --target-list=arm-softmmu
make -j8
./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage \
    -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic \
    --append "console=ttyAMA0"

login: root

The dhrystone command is the last one in the history.
"dhrystone 10000000 & dhrystone 10000000"

The atomic spinlock benchmark from Alexander shows that the atomics basically
work. Just follow the instructions here:
https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html

== Known issues ==

* Virtio double lock:
  Virtio accidentally takes qemu_mutex_iothread_lock twice.

* GDB stub:
  The GDB stub is not tested right now; it will probably require some changes
  to work.

Changes:
  * Introduce async_safe_work to do the tb_flush and some parts of
    tb_invalidate.
  * Introduce QemuSpin from Guillaume, which allows faster atomic instructions
    (6s to pass Alexander's atomic test instead of 30s before).
  * Don't take tb_lock before tb_find_fast.
  * Handle tb_flush with async_safe_work.
  * Handle tb_invalidate with async_work and async_safe_work.
  * Drop the tlb_flush_request mechanism and use async_work as well.
  * Fix the wrong length in the atomic patch.
  * Fix the wrong return address for exceptions in the atomic patch.

Guillaume Delbergue (1):
  add support for spin lock on POSIX systems exclusively

Jan Kiszka (1):
  Drop global lock during TCG code execution

KONRAD Frederic (16):
  cpu: make cpu_thread_is_idle public.
  replace spinlock by QemuMutex.
  remove unused spinlock.
  protect TBContext with tb_lock.
  tcg: remove tcg_halt_cond global variable.
  cpu: remove exit_request global.
  cpu: add a tcg_executing flag.
  tcg: switch on multithread.
  cpus: make qemu_cpu_kick_thread public.
  Use atomic cmpxchg to atomically check the exclusive value in a STREX
  cpu: introduce async_run_safe_work_on_cpu.
  add a callback when tb_invalidate is called.
  cpu: introduce tlb_flush*_all.
  arm: use tlb_flush*_all
  translate-all: introduces tb_flush_safe.
  translate-all: (wip) use tb_flush_safe when we can't alloc more tb.

 cpu-exec.c                  |  96 ++++++++++++--------
 cpus.c                      | 208 +++++++++++++++++++++++++-----------------
 cputlb.c                    |  81 +++++++++++++++++
 exec.c                      |  25 +++++
 include/exec/exec-all.h     |   7 +-
 include/exec/spinlock.h     |  49 ----------
 include/qemu/thread-posix.h |   4 +
 include/qemu/thread-win32.h |   4 +
 include/qemu/thread.h       |   7 ++
 include/qom/cpu.h           |  35 +++++++
 include/sysemu/cpus.h       |   1 +
 linux-user/main.c           |   6 +-
 qom/cpu.c                   |   1 +
 scripts/checkpatch.pl       |   9 +-
 softmmu_template.h          |   5 +
 target-arm/cpu.c            |  21 +++++
 target-arm/cpu.h            |   6 ++
 target-arm/helper.c         |  58 ++++--------
 target-arm/helper.h         |   4 +
 target-arm/op_helper.c      | 128 +++++++++++++++++++++++++-
 target-arm/translate.c      | 103 +++++----------------
 target-i386/mem_helper.c    |  16 +++-
 target-i386/misc_helper.c   |  27 +++++-
 tcg/i386/tcg-target.c       |   8 ++
 tcg/tcg.h                   |   7 ++
 translate-all.c             | 217 +++++++++++++++++++++++++++++++++++++-------
 util/qemu-thread-posix.c    |  45 +++++++++
 util/qemu-thread-win32.c    |  30 ++++++
 vl.c                        |   6 ++
 29 files changed, 869 insertions(+), 345 deletions(-)
 delete mode 100644 include/exec/spinlock.h

-- 
1.9.0


* [Qemu-devel] [RFC PATCH V6 01/18] cpu: make cpu_thread_is_idle public.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-07-07  9:47   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex fred.konrad
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpus.c            |  2 +-
 include/qom/cpu.h | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/cpus.c b/cpus.c
index 4f0e54d..2d62a35 100644
--- a/cpus.c
+++ b/cpus.c
@@ -74,7 +74,7 @@ bool cpu_is_stopped(CPUState *cpu)
     return cpu->stopped || !runstate_is_running();
 }
 
-static bool cpu_thread_is_idle(CPUState *cpu)
+bool cpu_thread_is_idle(CPUState *cpu)
 {
     if (cpu->stop || cpu->queued_work_first) {
         return false;
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 39f0f19..af3c9e4 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -514,6 +514,17 @@ void qemu_cpu_kick(CPUState *cpu);
 bool cpu_is_stopped(CPUState *cpu);
 
 /**
+ * cpu_thread_is_idle:
+ * @cpu: The CPU to check.
+ *
+ * Checks whether the CPU thread is idle.
+ *
+ * Returns: %true if the thread is idle;
+ * %false otherwise.
+ */
+bool cpu_thread_is_idle(CPUState *cpu);
+
+/**
  * run_on_cpu:
  * @cpu: The vCPU to run on.
  * @func: The function to be executed.
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 01/18] cpu: make cpu_thread_is_idle public fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-07-07 10:15   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 03/18] remove unused spinlock fred.konrad
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

spinlock is only used in two cases:
  * cpu-exec.c: to protect TranslationBlock
  * mem_helper.c: for the lock helper in target-i386 (which seems broken).

It's a pthread_mutex_t in user mode, so it is better to use QemuMutex directly
in this case.
This also allows reusing the tb_lock mutex of TBContext in the multithreaded
TCG case.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpu-exec.c               | 15 +++++++++++----
 include/exec/exec-all.h  |  4 ++--
 linux-user/main.c        |  6 +++---
 target-i386/mem_helper.c | 16 +++++++++++++---
 tcg/i386/tcg-target.c    |  8 ++++++++
 5 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 2ffeb6e..d6336d9 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -362,7 +362,9 @@ int cpu_exec(CPUArchState *env)
     SyncClocks sc;
 
     /* This must be volatile so it is not trashed by longjmp() */
+#if defined(CONFIG_USER_ONLY)
     volatile bool have_tb_lock = false;
+#endif
 
     if (cpu->halted) {
         if (!cpu_has_work(cpu)) {
@@ -480,8 +482,10 @@ int cpu_exec(CPUArchState *env)
                     cpu->exception_index = EXCP_INTERRUPT;
                     cpu_loop_exit(cpu);
                 }
-                spin_lock(&tcg_ctx.tb_ctx.tb_lock);
+#if defined(CONFIG_USER_ONLY)
+                qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
                 have_tb_lock = true;
+#endif
                 tb = tb_find_fast(env);
                 /* Note: we do it here to avoid a gcc bug on Mac OS X when
                    doing it in tb_find_slow */
@@ -503,9 +507,10 @@ int cpu_exec(CPUArchState *env)
                     tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
                                 next_tb & TB_EXIT_MASK, tb);
                 }
+#if defined(CONFIG_USER_ONLY)
                 have_tb_lock = false;
-                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
-
+                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+#endif
                 /* cpu_interrupt might be called while translating the
                    TB, but before it is linked into a potentially
                    infinite loop and becomes env->current_tb. Avoid
@@ -572,10 +577,12 @@ int cpu_exec(CPUArchState *env)
 #ifdef TARGET_I386
             x86_cpu = X86_CPU(cpu);
 #endif
+#if defined(CONFIG_USER_ONLY)
             if (have_tb_lock) {
-                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
+                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
                 have_tb_lock = false;
             }
+#endif
         }
     } /* for(;;) */
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 2573e8c..44f3336 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -176,7 +176,7 @@ struct TranslationBlock {
     struct TranslationBlock *jmp_first;
 };
 
-#include "exec/spinlock.h"
+#include "qemu/thread.h"
 
 typedef struct TBContext TBContext;
 
@@ -186,7 +186,7 @@ struct TBContext {
     TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
-    spinlock_t tb_lock;
+    QemuMutex tb_lock;
 
     /* statistics */
     int tb_flush_count;
diff --git a/linux-user/main.c b/linux-user/main.c
index c855bcc..bce3a98 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -107,7 +107,7 @@ static int pending_cpus;
 /* Make sure everything is in a consistent state for calling fork().  */
 void fork_start(void)
 {
-    pthread_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
     pthread_mutex_lock(&exclusive_lock);
     mmap_fork_start();
 }
@@ -129,11 +129,11 @@ void fork_end(int child)
         pthread_mutex_init(&cpu_list_mutex, NULL);
         pthread_cond_init(&exclusive_cond, NULL);
         pthread_cond_init(&exclusive_resume, NULL);
-        pthread_mutex_init(&tcg_ctx.tb_ctx.tb_lock, NULL);
+        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
         gdbserver_fork((CPUArchState *)thread_cpu->env_ptr);
     } else {
         pthread_mutex_unlock(&exclusive_lock);
-        pthread_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
     }
 }
 
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index 1aec8a5..7106cc3 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -23,17 +23,27 @@
 
 /* broken thread support */
 
-static spinlock_t global_cpu_lock = SPIN_LOCK_UNLOCKED;
+#if defined(CONFIG_USER_ONLY)
+QemuMutex global_cpu_lock;
 
 void helper_lock(void)
 {
-    spin_lock(&global_cpu_lock);
+    qemu_mutex_lock(&global_cpu_lock);
 }
 
 void helper_unlock(void)
 {
-    spin_unlock(&global_cpu_lock);
+    qemu_mutex_unlock(&global_cpu_lock);
 }
+#else
+void helper_lock(void)
+{
+}
+
+void helper_unlock(void)
+{
+}
+#endif
 
 void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
 {
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index ff4d9cf..0d7c99c 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -24,6 +24,10 @@
 
 #include "tcg-be-ldst.h"
 
+#if defined(CONFIG_USER_ONLY)
+extern QemuMutex global_cpu_lock;
+#endif
+
 #ifndef NDEBUG
 static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 #if TCG_TARGET_REG_BITS == 64
@@ -2342,6 +2346,10 @@ static void tcg_target_init(TCGContext *s)
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
 
     tcg_add_target_add_op_defs(x86_op_defs);
+
+#if defined(CONFIG_USER_ONLY)
+    qemu_mutex_init(global_cpu_lock);
+#endif
 }
 
 typedef struct {
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V6 03/18] remove unused spinlock.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 01/18] cpu: make cpu_thread_is_idle public fred.konrad
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 14:53   ` Paolo Bonzini
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 04/18] add support for spin lock on POSIX systems exclusively fred.konrad
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This just removes the spinlock header, as it is not used anymore.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 include/exec/spinlock.h | 49 -------------------------------------------------
 scripts/checkpatch.pl   |  9 ++-------
 2 files changed, 2 insertions(+), 56 deletions(-)
 delete mode 100644 include/exec/spinlock.h

diff --git a/include/exec/spinlock.h b/include/exec/spinlock.h
deleted file mode 100644
index a72edda..0000000
--- a/include/exec/spinlock.h
+++ /dev/null
@@ -1,49 +0,0 @@
-/*
- *  Copyright (c) 2003 Fabrice Bellard
- *
- * This library is free software; you can redistribute it and/or
- * modify it under the terms of the GNU Lesser General Public
- * License as published by the Free Software Foundation; either
- * version 2 of the License, or (at your option) any later version.
- *
- * This library is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * Lesser General Public License for more details.
- *
- * You should have received a copy of the GNU Lesser General Public
- * License along with this library; if not, see <http://www.gnu.org/licenses/>
- */
-
-/* configure guarantees us that we have pthreads on any host except
- * mingw32, which doesn't support any of the user-only targets.
- * So we can simply assume we have pthread mutexes here.
- */
-#if defined(CONFIG_USER_ONLY)
-
-#include <pthread.h>
-#define spin_lock pthread_mutex_lock
-#define spin_unlock pthread_mutex_unlock
-#define spinlock_t pthread_mutex_t
-#define SPIN_LOCK_UNLOCKED PTHREAD_MUTEX_INITIALIZER
-
-#else
-
-/* Empty implementations, on the theory that system mode emulation
- * is single-threaded. This means that these functions should only
- * be used from code run in the TCG cpu thread, and cannot protect
- * data structures which might also be accessed from the IO thread
- * or from signal handlers.
- */
-typedef int spinlock_t;
-#define SPIN_LOCK_UNLOCKED 0
-
-static inline void spin_lock(spinlock_t *lock)
-{
-}
-
-static inline void spin_unlock(spinlock_t *lock)
-{
-}
-
-#endif
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 7f0aae9..d1e482a 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2664,11 +2664,6 @@ sub process {
 			WARN("Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt\n" . $herecurr);
 		}
 
-# SPIN_LOCK_UNLOCKED & RW_LOCK_UNLOCKED are deprecated
-		if ($line =~ /\b(SPIN_LOCK_UNLOCKED|RW_LOCK_UNLOCKED)/) {
-			ERROR("Use of $1 is deprecated: see Documentation/spinlocks.txt\n" . $herecurr);
-		}
-
 # warn about #if 0
 		if ($line =~ /^.\s*\#\s*if\s+0\b/) {
 			CHK("if this code is redundant consider removing it\n" .
@@ -2717,8 +2712,8 @@ sub process {
 			ERROR("exactly one space required after that #$1\n" . $herecurr);
 		}
 
-# check for spinlock_t definitions without a comment.
-		if ($line =~ /^.\s*(struct\s+mutex|spinlock_t)\s+\S+;/ ||
+# check for mutex definitions without a comment.
+		if ($line =~ /^.\s*(struct\s+mutex)\s+\S+;/ ||
 		    $line =~ /^.\s*(DEFINE_MUTEX)\s*\(/) {
 			my $which = $1;
 			if (!ctx_has_comment($first_line, $linenr)) {
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V6 04/18] add support for spin lock on POSIX systems exclusively
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (2 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 03/18] remove unused spinlock fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 14:55   ` Paolo Bonzini
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock fred.konrad
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>

WARNING: spin lock is currently not implemented on WIN32

Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
---
 include/qemu/thread-posix.h |  4 ++++
 include/qemu/thread-win32.h |  4 ++++
 include/qemu/thread.h       |  7 +++++++
 util/qemu-thread-posix.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
 util/qemu-thread-win32.c    | 30 ++++++++++++++++++++++++++++++
 5 files changed, 90 insertions(+)

diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index eb5c7a1..8ce8f01 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -7,6 +7,10 @@ struct QemuMutex {
     pthread_mutex_t lock;
 };
 
+struct QemuSpin {
+    pthread_spinlock_t lock;
+};
+
 struct QemuCond {
     pthread_cond_t cond;
 };
diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
index 3d58081..310c8bd 100644
--- a/include/qemu/thread-win32.h
+++ b/include/qemu/thread-win32.h
@@ -7,6 +7,10 @@ struct QemuMutex {
     LONG owner;
 };
 
+struct QemuSpin {
+    PKSPIN_LOCK lock;
+};
+
 struct QemuCond {
     LONG waiters, target;
     HANDLE sema;
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 5114ec8..f5d1259 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -5,6 +5,7 @@
 #include <stdbool.h>
 
 typedef struct QemuMutex QemuMutex;
+typedef struct QemuSpin QemuSpin;
 typedef struct QemuCond QemuCond;
 typedef struct QemuSemaphore QemuSemaphore;
 typedef struct QemuEvent QemuEvent;
@@ -25,6 +26,12 @@ void qemu_mutex_lock(QemuMutex *mutex);
 int qemu_mutex_trylock(QemuMutex *mutex);
 void qemu_mutex_unlock(QemuMutex *mutex);
 
+void qemu_spin_init(QemuSpin *spin);
+void qemu_spin_destroy(QemuSpin *spin);
+void qemu_spin_lock(QemuSpin *spin);
+int qemu_spin_trylock(QemuSpin *spin);
+void qemu_spin_unlock(QemuSpin *spin);
+
 void qemu_cond_init(QemuCond *cond);
 void qemu_cond_destroy(QemuCond *cond);
 
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index ba67cec..224bacc 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -89,6 +89,51 @@ void qemu_mutex_unlock(QemuMutex *mutex)
         error_exit(err, __func__);
 }
 
+void qemu_spin_init(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_init(&spin->lock, 0);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
+void qemu_spin_destroy(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_destroy(&spin->lock);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
+void qemu_spin_lock(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_lock(&spin->lock);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
+int qemu_spin_trylock(QemuSpin *spin)
+{
+    return pthread_spin_trylock(&spin->lock);
+}
+
+void qemu_spin_unlock(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_unlock(&spin->lock);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     int err;
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 406b52f..6fbe6a8 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -80,6 +80,36 @@ void qemu_mutex_unlock(QemuMutex *mutex)
     LeaveCriticalSection(&mutex->lock);
 }
 
+void qemu_spin_init(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
+void qemu_spin_destroy(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
+void qemu_spin_lock(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
+int qemu_spin_trylock(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
+void qemu_spin_unlock(QemuSpin *spin)
+{
+    printf("spinlock not implemented");
+    abort();
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     memset(cond, 0, sizeof(*cond));
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (3 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 04/18] add support for spin lock on POSIX systems exclusively fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 14:56   ` Paolo Bonzini
                     ` (2 more replies)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 06/18] tcg: remove tcg_halt_cond global variable fred.konrad
                   ` (12 subsequent siblings)
  17 siblings, 3 replies; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This protects TBContext with tb_lock to make the tb_* functions thread safe.

We can still have an issue with tb_flush in the multithreaded TCG case:
  another CPU can be executing code during a flush.

This can be fixed later by making all other TCG threads exit before calling
tb_flush().

tb_find_slow is split into tb_find_slow and tb_find_physical, as the whole of
tb_find_slow doesn't require holding the tb lock.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

Changes:
V1 -> V2:
  * Drop a tb_lock around tb_find_fast in cpu-exec.c.
---
 cpu-exec.c             |  60 ++++++++++++++--------
 target-arm/translate.c |   5 ++
 tcg/tcg.h              |   7 +++
 translate-all.c        | 137 ++++++++++++++++++++++++++++++++++++++-----------
 4 files changed, 158 insertions(+), 51 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index d6336d9..5d9b518 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -130,6 +130,8 @@ static void init_delay_params(SyncClocks *sc, const CPUState *cpu)
 void cpu_loop_exit(CPUState *cpu)
 {
     cpu->current_tb = NULL;
+    /* Release those mutex before long jump so other thread can work. */
+    tb_lock_reset();
     siglongjmp(cpu->jmp_env, 1);
 }
 
@@ -142,6 +144,8 @@ void cpu_resume_from_signal(CPUState *cpu, void *puc)
     /* XXX: restore cpu registers saved in host registers */
 
     cpu->exception_index = -1;
+    /* Release those mutex before long jump so other thread can work. */
+    tb_lock_reset();
     siglongjmp(cpu->jmp_env, 1);
 }
 
@@ -253,12 +257,9 @@ static void cpu_exec_nocache(CPUArchState *env, int max_cycles,
     tb_free(tb);
 }
 
-static TranslationBlock *tb_find_slow(CPUArchState *env,
-                                      target_ulong pc,
-                                      target_ulong cs_base,
-                                      uint64_t flags)
+static TranslationBlock *tb_find_physical(CPUArchState *env, target_ulong pc,
+                                          target_ulong cs_base, uint64_t flags)
 {
-    CPUState *cpu = ENV_GET_CPU(env);
     TranslationBlock *tb, **ptb1;
     unsigned int h;
     tb_page_addr_t phys_pc, phys_page1;
@@ -273,8 +274,9 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
     ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     for(;;) {
         tb = *ptb1;
-        if (!tb)
-            goto not_found;
+        if (!tb) {
+            return tb;
+        }
         if (tb->pc == pc &&
             tb->page_addr[0] == phys_page1 &&
             tb->cs_base == cs_base &&
@@ -282,28 +284,43 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
             /* check next page if needed */
             if (tb->page_addr[1] != -1) {
                 tb_page_addr_t phys_page2;
-
                 virt_page2 = (pc & TARGET_PAGE_MASK) +
                     TARGET_PAGE_SIZE;
                 phys_page2 = get_page_addr_code(env, virt_page2);
-                if (tb->page_addr[1] == phys_page2)
-                    goto found;
+                if (tb->page_addr[1] == phys_page2) {
+                    return tb;
+                }
             } else {
-                goto found;
+                return tb;
             }
         }
         ptb1 = &tb->phys_hash_next;
     }
- not_found:
-   /* if no translated code available, then translate it now */
-    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
-
- found:
-    /* Move the last found TB to the head of the list */
-    if (likely(*ptb1)) {
-        *ptb1 = tb->phys_hash_next;
-        tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
-        tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
+    return tb;
+}
+
+static TranslationBlock *tb_find_slow(CPUArchState *env, target_ulong pc,
+                                      target_ulong cs_base, uint64_t flags)
+{
+    /*
+     * First try to get the tb if we don't find it we need to lock and compile
+     * it.
+     */
+    CPUState *cpu = ENV_GET_CPU(env);
+    TranslationBlock *tb;
+
+    tb = tb_find_physical(env, pc, cs_base, flags);
+    if (!tb) {
+        tb_lock();
+        /*
+         * Retry to get the TB in case a CPU just translate it to avoid having
+         * duplicated TB in the pool.
+         */
+        tb = tb_find_physical(env, pc, cs_base, flags);
+        if (!tb) {
+            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+        }
+        tb_unlock();
     }
     /* we add the TB in the virtual pc hash table */
     cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
@@ -326,6 +343,7 @@ static inline TranslationBlock *tb_find_fast(CPUArchState *env)
                  tb->flags != flags)) {
         tb = tb_find_slow(env, pc, cs_base, flags);
     }
+
     return tb;
 }
 
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 971b6db..47345aa 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -11162,6 +11162,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
 
     dc->tb = tb;
 
+    tb_lock();
+
     dc->is_jmp = DISAS_NEXT;
     dc->pc = pc_start;
     dc->singlestep_enabled = cs->singlestep_enabled;
@@ -11499,6 +11501,7 @@ done_generating:
         tb->size = dc->pc - pc_start;
         tb->icount = num_insns;
     }
+    tb_unlock();
 }
 
 void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
@@ -11567,6 +11570,7 @@ void arm_cpu_dump_state(CPUState *cs, FILE *f, fprintf_function cpu_fprintf,
 
 void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
 {
+    tb_lock();
     if (is_a64(env)) {
         env->pc = tcg_ctx.gen_opc_pc[pc_pos];
         env->condexec_bits = 0;
@@ -11574,4 +11578,5 @@ void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
         env->regs[15] = tcg_ctx.gen_opc_pc[pc_pos];
         env->condexec_bits = gen_opc_condexec_bits[pc_pos];
     }
+    tb_unlock();
 }
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 41e4869..032fe10 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -592,17 +592,24 @@ void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
 void tcg_pool_delete(TCGContext *s);
 
+void tb_lock(void);
+void tb_unlock(void);
+void tb_lock_reset(void);
+
 static inline void *tcg_malloc(int size)
 {
     TCGContext *s = &tcg_ctx;
     uint8_t *ptr, *ptr_end;
+    tb_lock();
     size = (size + sizeof(long) - 1) & ~(sizeof(long) - 1);
     ptr = s->pool_cur;
     ptr_end = ptr + size;
     if (unlikely(ptr_end > s->pool_end)) {
+        tb_unlock();
         return tcg_malloc_internal(&tcg_ctx, size);
     } else {
         s->pool_cur = ptr_end;
+        tb_unlock();
         return ptr;
     }
 }
diff --git a/translate-all.c b/translate-all.c
index b6b0e1c..c25b79b 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -127,6 +127,34 @@ static void *l1_map[V_L1_SIZE];
 /* code generation context */
 TCGContext tcg_ctx;
 
+/* translation block context */
+__thread volatile int have_tb_lock;
+
+void tb_lock(void)
+{
+    if (!have_tb_lock) {
+        qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    }
+    have_tb_lock++;
+}
+
+void tb_unlock(void)
+{
+    assert(have_tb_lock > 0);
+    have_tb_lock--;
+    if (!have_tb_lock) {
+        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+    }
+}
+
+void tb_lock_reset(void)
+{
+    if (have_tb_lock) {
+        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+    }
+    have_tb_lock = 0;
+}
+
 static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2);
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
@@ -215,6 +243,7 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 #ifdef CONFIG_PROFILER
     ti = profile_getclock();
 #endif
+    tb_lock();
     tcg_func_start(s);
 
     gen_intermediate_code_pc(env, tb);
@@ -228,8 +257,10 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 
     /* find opc index corresponding to search_pc */
     tc_ptr = (uintptr_t)tb->tc_ptr;
-    if (searched_pc < tc_ptr)
+    if (searched_pc < tc_ptr) {
+        tb_unlock();
         return -1;
+    }
 
     s->tb_next_offset = tb->tb_next_offset;
 #ifdef USE_DIRECT_JUMP
@@ -241,8 +272,10 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 #endif
     j = tcg_gen_code_search_pc(s, (tcg_insn_unit *)tc_ptr,
                                searched_pc - tc_ptr);
-    if (j < 0)
+    if (j < 0) {
+        tb_unlock();
         return -1;
+    }
     /* now find start of instruction before */
     while (s->gen_opc_instr_start[j] == 0) {
         j--;
@@ -255,6 +288,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     s->restore_time += profile_getclock() - ti;
     s->restore_count++;
 #endif
+
+    tb_unlock();
     return 0;
 }
 
@@ -672,6 +707,7 @@ static inline void code_gen_alloc(size_t tb_size)
             CODE_GEN_AVG_BLOCK_SIZE;
     tcg_ctx.tb_ctx.tbs =
             g_malloc(tcg_ctx.code_gen_max_blocks * sizeof(TranslationBlock));
+    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
@@ -696,16 +732,22 @@ bool tcg_enabled(void)
     return tcg_ctx.code_gen_buffer != NULL;
 }
 
-/* Allocate a new translation block. Flush the translation buffer if
-   too many translation blocks or too much generated code. */
+/*
+ * Allocate a new translation block. Flush the translation buffer if
+ * too many translation blocks or too much generated code.
+ * tb_alloc is not thread safe but tb_gen_code is protected by a mutex so this
+ * function is called only by one thread.
+ */
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
-    TranslationBlock *tb;
+    TranslationBlock *tb = NULL;
 
     if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
         (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
          tcg_ctx.code_gen_buffer_max_size) {
-        return NULL;
+        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
+        tb->pc = pc;
+        tb->cflags = 0;
     }
     tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
     tb->pc = pc;
@@ -718,11 +760,16 @@ void tb_free(TranslationBlock *tb)
     /* In practice this is mostly used for single use temporary TB
        Ignore the hard cases and just back up if this TB happens to
        be the last one generated.  */
+
+    tb_lock();
+
     if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
             tb == &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
         tcg_ctx.code_gen_ptr = tb->tc_ptr;
         tcg_ctx.tb_ctx.nb_tbs--;
     }
+
+    tb_unlock();
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -773,6 +820,8 @@ void tb_flush(CPUArchState *env1)
 {
     CPUState *cpu = ENV_GET_CPU(env1);
 
+    tb_lock();
+
 #if defined(DEBUG_FLUSH)
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
@@ -797,6 +846,8 @@ void tb_flush(CPUArchState *env1)
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
     tcg_ctx.tb_ctx.tb_flush_count++;
+
+    tb_unlock();
 }
 
 #ifdef DEBUG_TB_CHECK
@@ -806,6 +857,8 @@ static void tb_invalidate_check(target_ulong address)
     TranslationBlock *tb;
     int i;
 
+    tb_lock();
+
     address &= TARGET_PAGE_MASK;
     for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
         for (tb = tb_ctx.tb_phys_hash[i]; tb != NULL; tb = tb->phys_hash_next) {
@@ -817,6 +870,8 @@ static void tb_invalidate_check(target_ulong address)
             }
         }
     }
+
+    tb_unlock();
 }
 
 /* verify that all the pages have correct rights for code */
@@ -825,6 +880,8 @@ static void tb_page_check(void)
     TranslationBlock *tb;
     int i, flags1, flags2;
 
+    tb_lock();
+
     for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
         for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
                 tb = tb->phys_hash_next) {
@@ -836,6 +893,8 @@ static void tb_page_check(void)
             }
         }
     }
+
+    tb_unlock();
 }
 
 #endif
@@ -916,6 +975,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     tb_page_addr_t phys_pc;
     TranslationBlock *tb1, *tb2;
 
+    tb_lock();
+
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_phys_hash_func(phys_pc);
@@ -963,6 +1024,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
 
     tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
+    tb_unlock();
 }
 
 static void build_page_bitmap(PageDesc *p)
@@ -1004,6 +1066,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     target_ulong virt_page2;
     int code_gen_size;
 
+    tb_lock();
+
     phys_pc = get_page_addr_code(env, pc);
     if (use_icount) {
         cflags |= CF_USE_ICOUNT;
@@ -1032,6 +1096,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
         phys_page2 = get_page_addr_code(env, virt_page2);
     }
     tb_link_page(tb, phys_pc, phys_page2);
+
+    tb_unlock();
     return tb;
 }
 
@@ -1330,13 +1396,15 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 }
 
 /* add a new TB and link it to the physical page tables. phys_page2 is
-   (-1) to indicate that only one page contains the TB. */
+ * (-1) to indicate that only one page contains the TB. */
 static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
     unsigned int h;
     TranslationBlock **ptb;
 
+    tb_lock();
+
     /* Grab the mmap lock to stop another thread invalidating this TB
        before we are done.  */
     mmap_lock();
@@ -1370,6 +1438,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     tb_page_check();
 #endif
     mmap_unlock();
+
+    tb_unlock();
 }
 
 /* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
@@ -1378,31 +1448,34 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 {
     int m_min, m_max, m;
     uintptr_t v;
-    TranslationBlock *tb;
-
-    if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
-        return NULL;
-    }
-    if (tc_ptr < (uintptr_t)tcg_ctx.code_gen_buffer ||
-        tc_ptr >= (uintptr_t)tcg_ctx.code_gen_ptr) {
-        return NULL;
-    }
-    /* binary search (cf Knuth) */
-    m_min = 0;
-    m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
-    while (m_min <= m_max) {
-        m = (m_min + m_max) >> 1;
-        tb = &tcg_ctx.tb_ctx.tbs[m];
-        v = (uintptr_t)tb->tc_ptr;
-        if (v == tc_ptr) {
-            return tb;
-        } else if (tc_ptr < v) {
-            m_max = m - 1;
-        } else {
-            m_min = m + 1;
+    TranslationBlock *tb = NULL;
+
+    tb_lock();
+
+    if ((tcg_ctx.tb_ctx.nb_tbs > 0)
+    && (tc_ptr >= (uintptr_t)tcg_ctx.code_gen_buffer &&
+        tc_ptr < (uintptr_t)tcg_ctx.code_gen_ptr)) {
+        /* binary search (cf Knuth) */
+        m_min = 0;
+        m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
+        while (m_min <= m_max) {
+            m = (m_min + m_max) >> 1;
+            tb = &tcg_ctx.tb_ctx.tbs[m];
+            v = (uintptr_t)tb->tc_ptr;
+            if (v == tc_ptr) {
+                tb_unlock();
+                return tb;
+            } else if (tc_ptr < v) {
+                m_max = m - 1;
+            } else {
+                m_min = m + 1;
+            }
         }
+        tb = &tcg_ctx.tb_ctx.tbs[m_max];
     }
-    return &tcg_ctx.tb_ctx.tbs[m_max];
+
+    tb_unlock();
+    return tb;
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1564,6 +1637,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     int direct_jmp_count, direct_jmp2_count, cross_page;
     TranslationBlock *tb;
 
+    tb_lock();
+
     target_code_size = 0;
     max_target_code_size = 0;
     cross_page = 0;
@@ -1619,6 +1694,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
             tcg_ctx.tb_ctx.tb_phys_invalidate_count);
     cpu_fprintf(f, "TLB flush count     %d\n", tlb_flush_count);
     tcg_dump_info(f, cpu_fprintf);
+
+    tb_unlock();
 }
 
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V6 06/18] tcg: remove tcg_halt_cond global variable.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (4 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 15:02   ` Paolo Bonzini
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution fred.konrad
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This removes the tcg_halt_cond global variable.
We need one QemuCond per virtual CPU for multithreaded TCG.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpus.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/cpus.c b/cpus.c
index 2d62a35..79383df 100644
--- a/cpus.c
+++ b/cpus.c
@@ -813,7 +813,6 @@ static unsigned iothread_requesting_mutex;
 static QemuThread io_thread;
 
 static QemuThread *tcg_cpu_thread;
-static QemuCond *tcg_halt_cond;
 
 /* cpu creation */
 static QemuCond qemu_cpu_cond;
@@ -919,15 +918,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
     cpu->thread_kicked = false;
 }
 
-static void qemu_tcg_wait_io_event(void)
+static void qemu_tcg_wait_io_event(CPUState *cpu)
 {
-    CPUState *cpu;
-
     while (all_cpu_threads_idle()) {
        /* Start accounting real time to the virtual clock if the CPUs
           are idle.  */
         qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
-        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
+        qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
     }
 
     while (iothread_requesting_mutex) {
@@ -1047,7 +1044,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 
     /* wait for initial kick-off after machine start */
     while (first_cpu->stopped) {
-        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
+        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
 
         /* process any pending work */
         CPU_FOREACH(cpu) {
@@ -1068,7 +1065,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
                 qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
             }
         }
-        qemu_tcg_wait_io_event();
+        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
     }
 
     return NULL;
@@ -1235,12 +1232,12 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
 
     tcg_cpu_address_space_init(cpu, cpu->as);
 
+    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
+    qemu_cond_init(cpu->halt_cond);
+
     /* share a single thread for all cpus with TCG */
     if (!tcg_cpu_thread) {
         cpu->thread = g_malloc0(sizeof(QemuThread));
-        cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-        qemu_cond_init(cpu->halt_cond);
-        tcg_halt_cond = cpu->halt_cond;
         snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
                  cpu->cpu_index);
         qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn,
@@ -1254,7 +1251,6 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
         tcg_cpu_thread = cpu->thread;
     } else {
         cpu->thread = tcg_cpu_thread;
-        cpu->halt_cond = tcg_halt_cond;
     }
 }
 
-- 
1.9.0


* [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (5 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 06/18] tcg: remove tcg_halt_cond global variable fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 14:56   ` Jan Kiszka
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 08/18] cpu: remove exit_request global fred.konrad
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, Jan Kiszka, pbonzini,
	alex.bennee, fred.konrad

From: Jan Kiszka <jan.kiszka@siemens.com>

This finally allows TCG to benefit from the iothread introduction: Drop
the global mutex while running pure TCG CPU code. Reacquire the lock
when entering MMIO or PIO emulation, or when leaving the TCG loop.

We have to revert a few optimizations for the current TCG threading
model, namely kicking the TCG thread in qemu_mutex_lock_iothread and not
kicking it in qemu_cpu_kick. We also need to disable RAM block
reordering until we have a more efficient locking mechanism at hand.

I'm pretty sure some cases are still broken, definitely SMP (we no
longer perform round-robin scheduling "by chance"). Still, a Linux x86
UP guest and my Musicpal ARM model boot fine here. These numbers
demonstrate where we gain something:

20338 jan       20   0  331m  75m 6904 R   99  0.9   0:50.95 qemu-system-arm
20337 jan       20   0  331m  75m 6904 S   20  0.9   0:26.50 qemu-system-arm

The guest CPU was fully loaded, but the iothread could still run mostly
independently on a second core. Without the patch we don't get beyond

32206 jan       20   0  330m  73m 7036 R   82  0.9   1:06.00 qemu-system-arm
32204 jan       20   0  330m  73m 7036 S   21  0.9   0:17.03 qemu-system-arm

We don't benefit significantly, though, when the guest is not fully
loading a host CPU.

Note that this patch depends on
http://thread.gmane.org/gmane.comp.emulators.qemu/118657

Changes from Fred Konrad:
  * Rebase on the current HEAD.
  * Fix a deadlock in qemu_devices_reset().
---
 cpus.c                    | 17 ++++-------------
 cputlb.c                  |  5 +++++
 exec.c                    | 25 +++++++++++++++++++++++++
 softmmu_template.h        |  5 +++++
 target-i386/misc_helper.c | 27 ++++++++++++++++++++++++---
 translate-all.c           |  2 ++
 vl.c                      |  6 ++++++
 7 files changed, 71 insertions(+), 16 deletions(-)

diff --git a/cpus.c b/cpus.c
index 79383df..23c316c 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1034,7 +1034,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     qemu_tcg_init_cpu_signals();
     qemu_thread_get_self(cpu->thread);
 
-    qemu_mutex_lock(&qemu_global_mutex);
+    qemu_mutex_lock_iothread();
     CPU_FOREACH(cpu) {
         cpu->thread_id = qemu_get_thread_id();
         cpu->created = true;
@@ -1145,18 +1145,7 @@ bool qemu_in_vcpu_thread(void)
 
 void qemu_mutex_lock_iothread(void)
 {
-    atomic_inc(&iothread_requesting_mutex);
-    if (!tcg_enabled() || !first_cpu || !first_cpu->thread) {
-        qemu_mutex_lock(&qemu_global_mutex);
-        atomic_dec(&iothread_requesting_mutex);
-    } else {
-        if (qemu_mutex_trylock(&qemu_global_mutex)) {
-            qemu_cpu_kick_thread(first_cpu);
-            qemu_mutex_lock(&qemu_global_mutex);
-        }
-        atomic_dec(&iothread_requesting_mutex);
-        qemu_cond_broadcast(&qemu_io_proceeded_cond);
-    }
+    qemu_mutex_lock(&qemu_global_mutex);
 }
 
 void qemu_mutex_unlock_iothread(void)
@@ -1377,7 +1366,9 @@ static int tcg_cpu_exec(CPUArchState *env)
         cpu->icount_decr.u16.low = decr;
         cpu->icount_extra = count;
     }
+    qemu_mutex_unlock_iothread();
     ret = cpu_exec(env);
+    qemu_mutex_lock_iothread();
 #ifdef CONFIG_PROFILER
     tcg_time += profile_getclock() - ti;
 #endif
diff --git a/cputlb.c b/cputlb.c
index a506086..79fff1c 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -30,6 +30,9 @@
 #include "exec/ram_addr.h"
 #include "tcg/tcg.h"
 
+void qemu_mutex_lock_iothread(void);
+void qemu_mutex_unlock_iothread(void);
+
 //#define DEBUG_TLB
 //#define DEBUG_TLB_CHECK
 
@@ -125,8 +128,10 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
    can be detected */
 void tlb_protect_code(ram_addr_t ram_addr)
 {
+    qemu_mutex_lock_iothread();
     cpu_physical_memory_test_and_clear_dirty(ram_addr, TARGET_PAGE_SIZE,
                                              DIRTY_MEMORY_CODE);
+    qemu_mutex_unlock_iothread();
 }
 
 /* update the TLB so that writes in physical page 'phys_addr' are no longer
diff --git a/exec.c b/exec.c
index f7883d2..964e922 100644
--- a/exec.c
+++ b/exec.c
@@ -1881,6 +1881,7 @@ static void check_watchpoint(int offset, int len, MemTxAttrs attrs, int flags)
             wp->hitaddr = vaddr;
             wp->hitattrs = attrs;
             if (!cpu->watchpoint_hit) {
+                qemu_mutex_unlock_iothread();
                 cpu->watchpoint_hit = wp;
                 tb_check_watchpoint(cpu);
                 if (wp->flags & BP_STOP_BEFORE_ACCESS) {
@@ -2740,6 +2741,7 @@ static inline uint32_t address_space_ldl_internal(AddressSpace *as, hwaddr addr,
     mr = address_space_translate(as, addr, &addr1, &l, false);
     if (l < 4 || !memory_access_is_direct(mr, false)) {
         /* I/O case */
+        qemu_mutex_lock_iothread();
         r = memory_region_dispatch_read(mr, addr1, &val, 4, attrs);
 #if defined(TARGET_WORDS_BIGENDIAN)
         if (endian == DEVICE_LITTLE_ENDIAN) {
@@ -2750,6 +2752,7 @@ static inline uint32_t address_space_ldl_internal(AddressSpace *as, hwaddr addr,
             val = bswap32(val);
         }
 #endif
+        qemu_mutex_unlock_iothread();
     } else {
         /* RAM case */
         ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
@@ -2829,6 +2832,7 @@ static inline uint64_t address_space_ldq_internal(AddressSpace *as, hwaddr addr,
                                  false);
     if (l < 8 || !memory_access_is_direct(mr, false)) {
         /* I/O case */
+        qemu_mutex_lock_iothread();
         r = memory_region_dispatch_read(mr, addr1, &val, 8, attrs);
 #if defined(TARGET_WORDS_BIGENDIAN)
         if (endian == DEVICE_LITTLE_ENDIAN) {
@@ -2839,6 +2843,7 @@ static inline uint64_t address_space_ldq_internal(AddressSpace *as, hwaddr addr,
             val = bswap64(val);
         }
 #endif
+        qemu_mutex_unlock_iothread();
     } else {
         /* RAM case */
         ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
@@ -2938,7 +2943,9 @@ static inline uint32_t address_space_lduw_internal(AddressSpace *as,
                                  false);
     if (l < 2 || !memory_access_is_direct(mr, false)) {
         /* I/O case */
+        qemu_mutex_lock_iothread();
         r = memory_region_dispatch_read(mr, addr1, &val, 2, attrs);
+        qemu_mutex_unlock_iothread();
 #if defined(TARGET_WORDS_BIGENDIAN)
         if (endian == DEVICE_LITTLE_ENDIAN) {
             val = bswap16(val);
@@ -3026,15 +3033,19 @@ void address_space_stl_notdirty(AddressSpace *as, hwaddr addr, uint32_t val,
     mr = address_space_translate(as, addr, &addr1, &l,
                                  true);
     if (l < 4 || !memory_access_is_direct(mr, true)) {
+        qemu_mutex_lock_iothread();
         r = memory_region_dispatch_write(mr, addr1, val, 4, attrs);
+        qemu_mutex_unlock_iothread();
     } else {
         addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
         ptr = qemu_get_ram_ptr(addr1);
         stl_p(ptr, val);
 
+        qemu_mutex_lock_iothread();
         dirty_log_mask = memory_region_get_dirty_log_mask(mr);
         dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
         cpu_physical_memory_set_dirty_range(addr1, 4, dirty_log_mask);
+        qemu_mutex_unlock_iothread();
         r = MEMTX_OK;
     }
     if (result) {
@@ -3074,7 +3085,9 @@ static inline void address_space_stl_internal(AddressSpace *as,
             val = bswap32(val);
         }
 #endif
+        qemu_mutex_lock_iothread();
         r = memory_region_dispatch_write(mr, addr1, val, 4, attrs);
+        qemu_mutex_unlock_iothread();
     } else {
         /* RAM case */
         addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
@@ -3090,7 +3103,9 @@ static inline void address_space_stl_internal(AddressSpace *as,
             stl_p(ptr, val);
             break;
         }
+        qemu_mutex_lock_iothread();
         invalidate_and_set_dirty(mr, addr1, 4);
+        qemu_mutex_unlock_iothread();
         r = MEMTX_OK;
     }
     if (result) {
@@ -3178,7 +3193,9 @@ static inline void address_space_stw_internal(AddressSpace *as,
             val = bswap16(val);
         }
 #endif
+        qemu_mutex_lock_iothread();
         r = memory_region_dispatch_write(mr, addr1, val, 2, attrs);
+        qemu_mutex_unlock_iothread();
     } else {
         /* RAM case */
         addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
@@ -3194,7 +3211,9 @@ static inline void address_space_stw_internal(AddressSpace *as,
             stw_p(ptr, val);
             break;
         }
+        qemu_mutex_lock_iothread();
         invalidate_and_set_dirty(mr, addr1, 2);
+        qemu_mutex_unlock_iothread();
         r = MEMTX_OK;
     }
     if (result) {
@@ -3245,7 +3264,9 @@ void address_space_stq(AddressSpace *as, hwaddr addr, uint64_t val,
 {
     MemTxResult r;
     val = tswap64(val);
+    qemu_mutex_lock_iothread();
     r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
+    qemu_mutex_unlock_iothread();
     if (result) {
         *result = r;
     }
@@ -3256,7 +3277,9 @@ void address_space_stq_le(AddressSpace *as, hwaddr addr, uint64_t val,
 {
     MemTxResult r;
     val = cpu_to_le64(val);
+    qemu_mutex_lock_iothread();
     r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
+    qemu_mutex_unlock_iothread();
     if (result) {
         *result = r;
     }
@@ -3266,7 +3289,9 @@ void address_space_stq_be(AddressSpace *as, hwaddr addr, uint64_t val,
 {
     MemTxResult r;
     val = cpu_to_be64(val);
+    qemu_mutex_lock_iothread();
     r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
+    qemu_mutex_unlock_iothread();
     if (result) {
         *result = r;
     }
diff --git a/softmmu_template.h b/softmmu_template.h
index d42d89d..18871f5 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -158,9 +158,12 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
         cpu_io_recompile(cpu, retaddr);
     }
 
+    qemu_mutex_lock_iothread();
+
     cpu->mem_io_vaddr = addr;
     memory_region_dispatch_read(mr, physaddr, &val, 1 << SHIFT,
                                 iotlbentry->attrs);
+    qemu_mutex_unlock_iothread();
     return val;
 }
 #endif
@@ -378,10 +381,12 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
         cpu_io_recompile(cpu, retaddr);
     }
 
+    qemu_mutex_lock_iothread();
     cpu->mem_io_vaddr = addr;
     cpu->mem_io_pc = retaddr;
     memory_region_dispatch_write(mr, physaddr, val, 1 << SHIFT,
                                  iotlbentry->attrs);
+    qemu_mutex_unlock_iothread();
 }
 
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
index 52c5d65..55f63bf 100644
--- a/target-i386/misc_helper.c
+++ b/target-i386/misc_helper.c
@@ -27,8 +27,10 @@ void helper_outb(CPUX86State *env, uint32_t port, uint32_t data)
 #ifdef CONFIG_USER_ONLY
     fprintf(stderr, "outb: port=0x%04x, data=%02x\n", port, data);
 #else
+    qemu_mutex_lock_iothread();
     address_space_stb(&address_space_io, port, data,
                       cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
 #endif
 }
 
@@ -38,8 +40,13 @@ target_ulong helper_inb(CPUX86State *env, uint32_t port)
     fprintf(stderr, "inb: port=0x%04x\n", port);
     return 0;
 #else
-    return address_space_ldub(&address_space_io, port,
+    target_ulong ret;
+
+    qemu_mutex_lock_iothread();
+    ret = address_space_ldub(&address_space_io, port,
                               cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
+    return ret;
 #endif
 }
 
@@ -48,8 +55,10 @@ void helper_outw(CPUX86State *env, uint32_t port, uint32_t data)
 #ifdef CONFIG_USER_ONLY
     fprintf(stderr, "outw: port=0x%04x, data=%04x\n", port, data);
 #else
+    qemu_mutex_lock_iothread();
     address_space_stw(&address_space_io, port, data,
                       cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
 #endif
 }
 
@@ -59,8 +68,13 @@ target_ulong helper_inw(CPUX86State *env, uint32_t port)
     fprintf(stderr, "inw: port=0x%04x\n", port);
     return 0;
 #else
-    return address_space_lduw(&address_space_io, port,
+    target_ulong ret;
+
+    qemu_mutex_lock_iothread();
+    ret = address_space_lduw(&address_space_io, port,
                               cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
+    return ret;
 #endif
 }
 
@@ -69,8 +83,10 @@ void helper_outl(CPUX86State *env, uint32_t port, uint32_t data)
 #ifdef CONFIG_USER_ONLY
     fprintf(stderr, "outw: port=0x%04x, data=%08x\n", port, data);
 #else
+    qemu_mutex_lock_iothread();
     address_space_stl(&address_space_io, port, data,
                       cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
 #endif
 }
 
@@ -80,8 +96,13 @@ target_ulong helper_inl(CPUX86State *env, uint32_t port)
     fprintf(stderr, "inl: port=0x%04x\n", port);
     return 0;
 #else
-    return address_space_ldl(&address_space_io, port,
+    target_ulong ret;
+
+    qemu_mutex_lock_iothread();
+    ret = address_space_ldl(&address_space_io, port,
                              cpu_get_mem_attrs(env), NULL);
+    qemu_mutex_unlock_iothread();
+    return ret;
 #endif
 }
 
diff --git a/translate-all.c b/translate-all.c
index c25b79b..ade2269 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1222,6 +1222,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 #endif
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
+        qemu_mutex_unlock_iothread();
         /* we generate a block containing just the instruction
            modifying the memory. It will ensure that it cannot modify
            itself */
@@ -1326,6 +1327,7 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
     p->first_tb = NULL;
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
+        qemu_mutex_unlock_iothread();
         /* we generate a block containing just the instruction
            modifying the memory. It will ensure that it cannot modify
            itself */
diff --git a/vl.c b/vl.c
index 69ad90c..2983d44 100644
--- a/vl.c
+++ b/vl.c
@@ -1698,10 +1698,16 @@ void qemu_devices_reset(void)
 {
     QEMUResetEntry *re, *nre;
 
+    /*
+     * Some devices' reset handlers need to grab the global_mutex, so just
+     * release it here.
+     */
+    qemu_mutex_unlock_iothread();
     /* reset all devices */
     QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
         re->func(re->opaque);
     }
+    qemu_mutex_lock_iothread();
 }
 
 void qemu_system_reset(bool report)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 08/18] cpu: remove exit_request global.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (6 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 15:03   ` Paolo Bonzini
  2015-07-07 13:04   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 09/18] cpu: add a tcg_executing flag fred.konrad
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This removes the exit_request global and adds a per-CPU flag to CPUState
instead.
Only the flag of the first CPU is used for the moment, as we are still
running with a single TCG thread.
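
The pattern, reduced to a minimal sketch outside QEMU (CPUState is cut down
to the one field that matters here; cpu_signal mirrors the patch, while the
thread function name and everything else is illustrative):

#include <signal.h>

typedef struct CPUState {
    volatile sig_atomic_t exit_request;  /* stand-in for the real field */
} CPUState;

/* Each TCG thread publishes "its" CPU here, so an asynchronous signal
 * delivered to that thread can flag exactly that CPU. */
static __thread CPUState *tcg_thread_cpu;

static void cpu_signal(int sig)
{
    /* async-signal context: nothing but a flag write */
    tcg_thread_cpu->exit_request = 1;
}

static void *vcpu_thread_fn(void *arg)
{
    tcg_thread_cpu = arg;   /* first thing each VCPU thread does */
    /* ... install cpu_signal, then run the execution loop, which
     * polls tcg_thread_cpu->exit_request ... */
    return NULL;
}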

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpu-exec.c | 15 ---------------
 cpus.c     | 17 ++++++++++++++---
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 5d9b518..0644383 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -364,8 +364,6 @@ static void cpu_handle_debug_exception(CPUArchState *env)
 
 /* main execution loop */
 
-volatile sig_atomic_t exit_request;
-
 int cpu_exec(CPUArchState *env)
 {
     CPUState *cpu = ENV_GET_CPU(env);
@@ -394,20 +392,8 @@ int cpu_exec(CPUArchState *env)
 
     current_cpu = cpu;
 
-    /* As long as current_cpu is null, up to the assignment just above,
-     * requests by other threads to exit the execution loop are expected to
-     * be issued using the exit_request global. We must make sure that our
-     * evaluation of the global value is performed past the current_cpu
-     * value transition point, which requires a memory barrier as well as
-     * an instruction scheduling constraint on modern architectures.  */
-    smp_mb();
-
     rcu_read_lock();
 
-    if (unlikely(exit_request)) {
-        cpu->exit_request = 1;
-    }
-
     cc->cpu_exec_enter(cpu);
 
     /* Calculate difference between guest clock and host clock.
@@ -496,7 +482,6 @@ int cpu_exec(CPUArchState *env)
                     }
                 }
                 if (unlikely(cpu->exit_request)) {
-                    cpu->exit_request = 0;
                     cpu->exception_index = EXCP_INTERRUPT;
                     cpu_loop_exit(cpu);
                 }
diff --git a/cpus.c b/cpus.c
index 23c316c..2541c56 100644
--- a/cpus.c
+++ b/cpus.c
@@ -137,6 +137,8 @@ typedef struct TimersState {
 } TimersState;
 
 static TimersState timers_state;
+/* CPU associated with this thread. */
+static __thread CPUState *tcg_thread_cpu;
 
 int64_t cpu_get_icount_raw(void)
 {
@@ -661,12 +663,18 @@ static void cpu_handle_guest_debug(CPUState *cpu)
     cpu->stopped = true;
 }
 
+/**
+ * cpu_signal
+ * Signal handler when using TCG.
+ */
 static void cpu_signal(int sig)
 {
     if (current_cpu) {
         cpu_exit(current_cpu);
     }
-    exit_request = 1;
+
+    /* FIXME: We might want to check if the cpu is running? */
+    tcg_thread_cpu->exit_request = true;
 }
 
 #ifdef CONFIG_LINUX
@@ -1031,6 +1039,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 {
     CPUState *cpu = arg;
 
+    tcg_thread_cpu = cpu;
     qemu_tcg_init_cpu_signals();
     qemu_thread_get_self(cpu->thread);
 
@@ -1393,7 +1402,8 @@ static void tcg_exec_all(void)
     if (next_cpu == NULL) {
         next_cpu = first_cpu;
     }
-    for (; next_cpu != NULL && !exit_request; next_cpu = CPU_NEXT(next_cpu)) {
+    for (; next_cpu != NULL && !first_cpu->exit_request;
+           next_cpu = CPU_NEXT(next_cpu)) {
         CPUState *cpu = next_cpu;
         CPUArchState *env = cpu->env_ptr;
 
@@ -1410,7 +1420,8 @@ static void tcg_exec_all(void)
             break;
         }
     }
-    exit_request = 0;
+
+    first_cpu->exit_request = 0;
 }
 
 void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 09/18] cpu: add a tcg_executing flag.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (7 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 08/18] cpu: remove exit_request global fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-07-07 13:23   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 10/18] tcg: switch on multithread fred.konrad
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

We need to know whether any other VCPU is currently executing code; this flag
makes that check possible.
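
A hedged sketch of the check this enables (CPU_FOREACH and CPUState are
QEMU's; the helper name is invented for illustration):

static bool any_vcpu_executing(void)
{
    CPUState *cpu;

    CPU_FOREACH(cpu) {
        if (cpu->tcg_executing) {
            return true;    /* someone is still inside cpu_exec() */
        }
    }
    return false;           /* safe point: no VCPU runs TCG code */
}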

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpu-exec.c        | 1 +
 cpus.c            | 1 +
 include/qom/cpu.h | 3 +++
 qom/cpu.c         | 1 +
 4 files changed, 6 insertions(+)

diff --git a/cpu-exec.c b/cpu-exec.c
index 0644383..de256d6 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -390,6 +390,7 @@ int cpu_exec(CPUArchState *env)
         cpu->halted = 0;
     }
 
+    cpu->tcg_executing = 1;
     current_cpu = cpu;
 
     rcu_read_lock();
diff --git a/cpus.c b/cpus.c
index 2541c56..0291620 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1377,6 +1377,7 @@ static int tcg_cpu_exec(CPUArchState *env)
     }
     qemu_mutex_unlock_iothread();
     ret = cpu_exec(env);
+    cpu->tcg_executing = 0;
     qemu_mutex_lock_iothread();
 #ifdef CONFIG_PROFILER
     tcg_time += profile_getclock() - ti;
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index af3c9e4..1464afa 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -222,6 +222,7 @@ struct kvm_run;
  * @stopped: Indicates the CPU has been artificially stopped.
  * @tcg_exit_req: Set to force TCG to stop executing linked TBs for this
  *           CPU and return to its top level loop.
+ * @tcg_executing: This TCG thread is in cpu_exec().
  * @singlestep_enabled: Flags for single-stepping.
  * @icount_extra: Instructions until next timer event.
  * @icount_decr: Number of cycles left, with interrupt flag in high bit.
@@ -315,6 +316,8 @@ struct CPUState {
        (absolute value) offset as small as possible.  This reduces code
        size, especially for hosts without large memory offsets.  */
     volatile sig_atomic_t tcg_exit_req;
+
+    volatile int tcg_executing;
 };
 
 QTAILQ_HEAD(CPUTailQ, CPUState);
diff --git a/qom/cpu.c b/qom/cpu.c
index 108bfa2..ff41a4c 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -249,6 +249,7 @@ static void cpu_common_reset(CPUState *cpu)
     cpu->icount_decr.u32 = 0;
     cpu->can_do_io = 0;
     cpu->exception_index = -1;
+    cpu->tcg_executing = 0;
     memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
 }
 
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 10/18] tcg: switch on multithread.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (8 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 09/18] cpu: add a tcg_executing flag fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-07-07 13:40   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 11/18] cpus: make qemu_cpu_kick_thread public fred.konrad
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This switches on multithreaded TCG: each VCPU now runs in its own thread (a
sketch of the resulting per-VCPU thread loop follows the changelog below).

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

Changes V5 -> V6:
  * make qemu_cpu_kick call qemu_cpu_kick_thread in the TCG case as well.
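
The shape each per-VCPU thread takes after this patch, as a rough sketch
under the names the patch actually uses (setup and icount handling omitted):

static void *qemu_tcg_cpu_thread_fn(void *arg)   /* one thread per VCPU */
{
    CPUState *cpu = arg;

    /* per-thread setup: signals, thread id, created flag ... */
    while (1) {
        if (!cpu->stopped) {
            tcg_exec_all(cpu);           /* runs only this VCPU */
        }
        qemu_tcg_wait_io_event(cpu);     /* sleeps on cpu->halt_cond */
    }
    return NULL;
}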
---
 cpus.c | 95 ++++++++++++++++++++++++------------------------------------------
 1 file changed, 34 insertions(+), 61 deletions(-)

diff --git a/cpus.c b/cpus.c
index 0291620..08267ed 100644
--- a/cpus.c
+++ b/cpus.c
@@ -65,7 +65,6 @@
 
 #endif /* CONFIG_LINUX */
 
-static CPUState *next_cpu;
 int64_t max_delay;
 int64_t max_advance;
 
@@ -820,8 +819,6 @@ static unsigned iothread_requesting_mutex;
 
 static QemuThread io_thread;
 
-static QemuThread *tcg_cpu_thread;
-
 /* cpu creation */
 static QemuCond qemu_cpu_cond;
 /* system init */
@@ -928,10 +925,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
 
 static void qemu_tcg_wait_io_event(CPUState *cpu)
 {
-    while (all_cpu_threads_idle()) {
-       /* Start accounting real time to the virtual clock if the CPUs
-          are idle.  */
-        qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
+    while (cpu_thread_is_idle(cpu)) {
+        /* Start accounting real time to the virtual clock if the CPUs
+         * are idle.
+         */
+        if ((all_cpu_threads_idle()) && (cpu->cpu_index == 0)) {
+            qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
+        }
         qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
     }
 
@@ -939,9 +939,7 @@ static void qemu_tcg_wait_io_event(CPUState *cpu)
         qemu_cond_wait(&qemu_io_proceeded_cond, &qemu_global_mutex);
     }
 
-    CPU_FOREACH(cpu) {
-        qemu_wait_io_event_common(cpu);
-    }
+    qemu_wait_io_event_common(cpu);
 }
 
 static void qemu_kvm_wait_io_event(CPUState *cpu)
@@ -1033,7 +1031,7 @@ static void *qemu_dummy_cpu_thread_fn(void *arg)
 #endif
 }
 
-static void tcg_exec_all(void);
+static void tcg_exec_all(CPUState *cpu);
 
 static void *qemu_tcg_cpu_thread_fn(void *arg)
 {
@@ -1044,37 +1042,26 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     qemu_thread_get_self(cpu->thread);
 
     qemu_mutex_lock_iothread();
-    CPU_FOREACH(cpu) {
-        cpu->thread_id = qemu_get_thread_id();
-        cpu->created = true;
-        cpu->can_do_io = 1;
-    }
-    qemu_cond_signal(&qemu_cpu_cond);
-
-    /* wait for initial kick-off after machine start */
-    while (first_cpu->stopped) {
-        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
-
-        /* process any pending work */
-        CPU_FOREACH(cpu) {
-            qemu_wait_io_event_common(cpu);
-        }
-    }
+    cpu->thread_id = qemu_get_thread_id();
+    cpu->created = true;
+    cpu->can_do_io = 1;
 
-    /* process any pending work */
-    exit_request = 1;
+    qemu_cond_signal(&qemu_cpu_cond);
 
     while (1) {
-        tcg_exec_all();
+        if (!cpu->stopped) {
+            tcg_exec_all(cpu);
 
-        if (use_icount) {
-            int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
+            if (use_icount) {
+                int64_t deadline =
+                    qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
 
-            if (deadline == 0) {
-                qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
+                if (deadline == 0) {
+                    qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
+                }
             }
         }
-        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
+        qemu_tcg_wait_io_event(cpu);
     }
 
     return NULL;
@@ -1122,7 +1109,7 @@ static void qemu_cpu_kick_thread(CPUState *cpu)
 void qemu_cpu_kick(CPUState *cpu)
 {
     qemu_cond_broadcast(cpu->halt_cond);
-    if (!tcg_enabled() && !cpu->thread_kicked) {
+    if (!cpu->thread_kicked) {
         qemu_cpu_kick_thread(cpu);
         cpu->thread_kicked = true;
     }
@@ -1232,23 +1219,15 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
 
     cpu->halt_cond = g_malloc0(sizeof(QemuCond));
     qemu_cond_init(cpu->halt_cond);
-
-    /* share a single thread for all cpus with TCG */
-    if (!tcg_cpu_thread) {
-        cpu->thread = g_malloc0(sizeof(QemuThread));
-        snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
-                 cpu->cpu_index);
-        qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn,
-                           cpu, QEMU_THREAD_JOINABLE);
+    cpu->thread = g_malloc0(sizeof(QemuThread));
+    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG", cpu->cpu_index);
+    qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn, cpu,
+                       QEMU_THREAD_JOINABLE);
 #ifdef _WIN32
-        cpu->hThread = qemu_thread_get_handle(cpu->thread);
+    cpu->hThread = qemu_thread_get_handle(cpu->thread);
 #endif
-        while (!cpu->created) {
-            qemu_cond_wait(&qemu_cpu_cond, &qemu_global_mutex);
-        }
-        tcg_cpu_thread = cpu->thread;
-    } else {
-        cpu->thread = tcg_cpu_thread;
+    while (!cpu->created) {
+        qemu_cond_wait(&qemu_cpu_cond, &qemu_global_mutex);
     }
 }
 
@@ -1393,21 +1372,15 @@ static int tcg_cpu_exec(CPUArchState *env)
     return ret;
 }
 
-static void tcg_exec_all(void)
+static void tcg_exec_all(CPUState *cpu)
 {
     int r;
+    CPUArchState *env = cpu->env_ptr;
 
     /* Account partial waits to QEMU_CLOCK_VIRTUAL.  */
     qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
 
-    if (next_cpu == NULL) {
-        next_cpu = first_cpu;
-    }
-    for (; next_cpu != NULL && !first_cpu->exit_request;
-           next_cpu = CPU_NEXT(next_cpu)) {
-        CPUState *cpu = next_cpu;
-        CPUArchState *env = cpu->env_ptr;
-
+    while (!cpu->exit_request) {
         qemu_clock_enable(QEMU_CLOCK_VIRTUAL,
                           (cpu->singlestep_enabled & SSTEP_NOTIMER) == 0);
 
@@ -1422,7 +1395,7 @@ static void tcg_exec_all(void)
         }
     }
 
-    first_cpu->exit_request = 0;
+    cpu->exit_request = 0;
 }
 
 void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 11/18] cpus: make qemu_cpu_kick_thread public.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (9 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 10/18] tcg: switch on multithread fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-07-07 15:11   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 12/18] Use atomic cmpxchg to atomically check the exclusive value in a STREX fred.konrad
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This makes qemu_cpu_kick_thread public.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpus.c                | 2 +-
 include/sysemu/cpus.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/cpus.c b/cpus.c
index 08267ed..5f13d73 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1067,7 +1067,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     return NULL;
 }
 
-static void qemu_cpu_kick_thread(CPUState *cpu)
+void qemu_cpu_kick_thread(CPUState *cpu)
 {
 #ifndef _WIN32
     int err;
diff --git a/include/sysemu/cpus.h b/include/sysemu/cpus.h
index 3f162a9..4f95b72 100644
--- a/include/sysemu/cpus.h
+++ b/include/sysemu/cpus.h
@@ -6,6 +6,7 @@ void qemu_init_cpu_loop(void);
 void resume_all_vcpus(void);
 void pause_all_vcpus(void);
 void cpu_stop_current(void);
+void qemu_cpu_kick_thread(CPUState *cpu);
 
 void cpu_synchronize_all_states(void);
 void cpu_synchronize_all_post_reset(void);
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 12/18] Use atomic cmpxchg to atomically check the exclusive value in a STREX
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (10 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 11/18] cpus: make qemu_cpu_kick_thread public fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 13/18] cpu: introduce async_run_safe_work_on_cpu fred.konrad
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This mechanism replaces the existing load/store exclusive mechanism, which
appears to be broken for multithreaded execution.
It follows the intention of the existing mechanism: it stores the target
address and data values during a load operation and checks that they remain
unchanged before a store.

In common with the older approach, this provides weaker semantics than
required, in that a different processor could write the same value back as a
non-exclusive write; in practice, however, this seems to be irrelevant.

The old implementation didn't correctly store its values as globals, but
rather kept a local copy per CPU.

This new mechanism stores the values globally and also uses the atomic
cmpxchg macros to ensure atomicity - it is therefore very efficient and
thread-safe.
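
The core of the check, restated as a self-contained C11 fragment for the
32-bit case (QEMU uses its own atomic_cmpxchg macro; this only illustrates
the intended semantics):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* STREX succeeds (returns 0) only if memory still holds the value the
 * LDREX observed; any intervening conflicting store makes it fail. */
static bool strex_word(_Atomic uint32_t *mem, uint32_t ldrex_val,
                       uint32_t new_val)
{
    uint32_t expected = ldrex_val;
    return atomic_compare_exchange_strong(mem, &expected, new_val);
}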

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

Changes:
  V5 -> V6:
    * Use spinlock instead of mutex.
    * Fix the length for address map.
    * Fix the return address for tlb_fill.
  V4 -> V5:
    * Remove atomic_check and atomic_release which were unused.
---
 target-arm/cpu.c       |  21 ++++++++
 target-arm/cpu.h       |   6 +++
 target-arm/helper.c    |  13 +++++
 target-arm/helper.h    |   4 ++
 target-arm/op_helper.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++-
 target-arm/translate.c |  98 +++++++------------------------------
 6 files changed, 188 insertions(+), 82 deletions(-)

diff --git a/target-arm/cpu.c b/target-arm/cpu.c
index 80669a6..817ba6b 100644
--- a/target-arm/cpu.c
+++ b/target-arm/cpu.c
@@ -30,6 +30,26 @@
 #include "sysemu/kvm.h"
 #include "kvm_arm.h"
 
+/* Protect the cpu_exclusive_* variables. */
+__thread bool cpu_have_exclusive_lock;
+QemuSpin cpu_exclusive_lock;
+
+inline void arm_exclusive_lock(void)
+{
+    if (!cpu_have_exclusive_lock) {
+        qemu_spin_lock(&cpu_exclusive_lock);
+        cpu_have_exclusive_lock = true;
+    }
+}
+
+inline void arm_exclusive_unlock(void)
+{
+    if (cpu_have_exclusive_lock) {
+        cpu_have_exclusive_lock = false;
+        qemu_spin_unlock(&cpu_exclusive_lock);
+    }
+}
+
 static void arm_cpu_set_pc(CPUState *cs, vaddr value)
 {
     ARMCPU *cpu = ARM_CPU(cs);
@@ -436,6 +456,7 @@ static void arm_cpu_initfn(Object *obj)
         cpu->psci_version = 2; /* TCG implements PSCI 0.2 */
         if (!inited) {
             inited = true;
+            qemu_spin_init(&cpu_exclusive_lock);
             arm_translate_init();
         }
     }
diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index 80297b3..fbbb396 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -515,6 +515,9 @@ static inline bool is_a64(CPUARMState *env)
 int cpu_arm_signal_handler(int host_signum, void *pinfo,
                            void *puc);
 
+bool arm_get_phys_addr(CPUARMState *env, target_ulong address, int access_type,
+                       hwaddr *phys_ptr, int *prot, target_ulong *page_size);
+
 /**
  * pmccntr_sync
  * @env: CPUARMState
@@ -1933,4 +1936,7 @@ enum {
     QEMU_PSCI_CONDUIT_HVC = 2,
 };
 
+void arm_exclusive_lock(void);
+void arm_exclusive_unlock(void);
+
 #endif
diff --git a/target-arm/helper.c b/target-arm/helper.c
index aa34159..ad3d5da 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -24,6 +24,15 @@ static inline bool get_phys_addr(CPUARMState *env, target_ulong address,
 #define PMCRE   0x1
 #endif
 
+bool arm_get_phys_addr(CPUARMState *env, target_ulong address, int access_type,
+                       hwaddr *phys_ptr, int *prot, target_ulong *page_size)
+{
+    MemTxAttrs attrs = {};
+    uint32_t fsr;
+    return get_phys_addr(env, address, access_type, cpu_mmu_index(env),
+                         phys_ptr, &attrs, prot, page_size, &fsr);
+}
+
 static int vfp_gdb_get_reg(CPUARMState *env, uint8_t *buf, int reg)
 {
     int nregs;
@@ -4824,6 +4833,10 @@ void arm_cpu_do_interrupt(CPUState *cs)
 
     arm_log_exception(cs->exception_index);
 
+    arm_exclusive_lock();
+    env->exclusive_addr = -1;
+    arm_exclusive_unlock();
+
     if (arm_is_psci_call(cpu, cs->exception_index)) {
         arm_handle_psci_call(cpu);
         qemu_log_mask(CPU_LOG_INT, "...handled as PSCI call\n");
diff --git a/target-arm/helper.h b/target-arm/helper.h
index fc885de..94a6744 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -529,6 +529,10 @@ DEF_HELPER_2(dc_zva, void, env, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 
+DEF_HELPER_4(atomic_cmpxchg64, i32, env, i32, i64, i32)
+DEF_HELPER_1(atomic_clear, void, env)
+DEF_HELPER_3(atomic_claim, void, env, i32, i64)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #endif
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 7fa32c4..ae7ceab 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -30,12 +30,139 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
     CPUState *cs = CPU(arm_env_get_cpu(env));
 
     assert(!excp_is_internal(excp));
+    arm_exclusive_lock();
     cs->exception_index = excp;
     env->exception.syndrome = syndrome;
     env->exception.target_el = target_el;
+    /*
+     * We MAY already have the lock - in which case we are exiting the
+     * instruction due to an exception. Otherwise we better make sure we are not
+     * about to enter a STREX anyway.
+     */
+    env->exclusive_addr = -1;
+    arm_exclusive_unlock();
     cpu_loop_exit(cs);
 }
 
+/* NB return 1 for fail, 0 for pass */
+uint32_t HELPER(atomic_cmpxchg64)(CPUARMState *env, uint32_t addr,
+                                  uint64_t newval, uint32_t size)
+{
+    ARMCPU *cpu = arm_env_get_cpu(env);
+    CPUState *cs = CPU(cpu);
+
+    uintptr_t retaddr = GETRA();
+    bool result = false;
+    hwaddr len = 1 << size;
+
+    hwaddr paddr;
+    target_ulong page_size;
+    int prot;
+
+    arm_exclusive_lock();
+
+    if (env->exclusive_addr != addr) {
+        arm_exclusive_unlock();
+        return 1;
+    }
+
+    if (arm_get_phys_addr(env, addr, 1, &paddr, &prot, &page_size)) {
+        tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, cpu_mmu_index(env),
+                 retaddr);
+        if (arm_get_phys_addr(env, addr, 1, &paddr, &prot, &page_size)) {
+            arm_exclusive_unlock();
+            return 1;
+        }
+    }
+
+    switch (size) {
+    case 0:
+    {
+        uint8_t oldval, *p;
+        p = address_space_map(cs->as, paddr, &len, true);
+        if (len == 1 << size) {
+            oldval = (uint8_t)env->exclusive_val;
+            result = (atomic_cmpxchg(p, oldval, (uint8_t)newval) == oldval);
+        }
+        address_space_unmap(cs->as, p, len, true, result ? len : 0);
+    }
+    break;
+    case 1:
+    {
+        uint16_t oldval, *p;
+        p = address_space_map(cs->as, paddr, &len, true);
+        if (len == 1 << size) {
+            oldval = (uint16_t)env->exclusive_val;
+            result = (atomic_cmpxchg(p, oldval, (uint16_t)newval) == oldval);
+        }
+        address_space_unmap(cs->as, p, len, true, result ? len : 0);
+    }
+    break;
+    case 2:
+    {
+        uint32_t oldval, *p;
+        p = address_space_map(cs->as, paddr, &len, true);
+        if (len == 1 << size) {
+            oldval = (uint32_t)env->exclusive_val;
+            result = (atomic_cmpxchg(p, oldval, (uint32_t)newval) == oldval);
+        }
+        address_space_unmap(cs->as, p, len, true, result ? len : 0);
+    }
+    break;
+    case 3:
+    {
+        uint64_t oldval, *p;
+        p = address_space_map(cs->as, paddr, &len, true);
+        if (len == 1 << size) {
+            oldval = (uint64_t)env->exclusive_val;
+            result = (atomic_cmpxchg(p, oldval, (uint64_t)newval) == oldval);
+        }
+        address_space_unmap(cs->as, p, len, true, result ? len : 0);
+    }
+    break;
+    default:
+        abort();
+    break;
+    }
+
+    env->exclusive_addr = -1;
+    arm_exclusive_unlock();
+    if (result) {
+        return 0;
+    } else {
+        return 1;
+    }
+}
+
+void HELPER(atomic_clear)(CPUARMState *env)
+{
+    /* make sure no STREX is about to start */
+    arm_exclusive_lock();
+    env->exclusive_addr = -1;
+    arm_exclusive_unlock();
+}
+
+void HELPER(atomic_claim)(CPUARMState *env, uint32_t addr, uint64_t val)
+{
+    CPUState *cpu;
+    CPUARMState *current_cpu;
+
+    /* ensure that no STREX is currently executing */
+    arm_exclusive_lock();
+
+    CPU_FOREACH(cpu) {
+        current_cpu = &ARM_CPU(cpu)->env;
+        if (current_cpu->exclusive_addr == addr) {
+            /* We steal the atomic of this CPU. */
+            current_cpu->exclusive_addr = -1;
+        }
+    }
+
+    env->exclusive_val = val;
+    env->exclusive_addr = addr;
+    arm_exclusive_unlock();
+}
+
 static int exception_target_el(CPUARMState *env)
 {
     int target_el = MAX(1, arm_current_el(env));
@@ -583,7 +710,6 @@ void HELPER(exception_return)(CPUARMState *env)
 
     aarch64_save_sp(env, cur_el);
 
-    env->exclusive_addr = -1;
 
     /* We must squash the PSTATE.SS bit to zero unless both of the
      * following hold:
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 47345aa..80302cd 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -65,8 +65,8 @@ TCGv_ptr cpu_env;
 static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
 static TCGv_i32 cpu_R[16];
 static TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
-static TCGv_i64 cpu_exclusive_addr;
 static TCGv_i64 cpu_exclusive_val;
+static TCGv_i64 cpu_exclusive_addr;
 #ifdef CONFIG_USER_ONLY
 static TCGv_i64 cpu_exclusive_test;
 static TCGv_i32 cpu_exclusive_info;
@@ -7391,6 +7391,7 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
                                TCGv_i32 addr, int size)
 {
     TCGv_i32 tmp = tcg_temp_new_i32();
+    TCGv_i64 val = tcg_temp_new_i64();
 
     s->is_ldex = true;
 
@@ -7415,20 +7416,20 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
 
         tcg_gen_addi_i32(tmp2, addr, 4);
         gen_aa32_ld32u(tmp3, tmp2, get_mem_index(s));
+        tcg_gen_concat_i32_i64(val, tmp, tmp3);
         tcg_temp_free_i32(tmp2);
-        tcg_gen_concat_i32_i64(cpu_exclusive_val, tmp, tmp3);
         store_reg(s, rt2, tmp3);
     } else {
-        tcg_gen_extu_i32_i64(cpu_exclusive_val, tmp);
+        tcg_gen_extu_i32_i64(val, tmp);
     }
-
+    gen_helper_atomic_claim(cpu_env, addr, val);
+    tcg_temp_free_i64(val);
     store_reg(s, rt, tmp);
-    tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
 }
 
 static void gen_clrex(DisasContext *s)
 {
-    tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+    gen_helper_atomic_clear(cpu_env);
 }
 
 #ifdef CONFIG_USER_ONLY
@@ -7445,84 +7446,19 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
                                 TCGv_i32 addr, int size)
 {
     TCGv_i32 tmp;
-    TCGv_i64 val64, extaddr;
-    TCGLabel *done_label;
-    TCGLabel *fail_label;
-
-    /* if (env->exclusive_addr == addr && env->exclusive_val == [addr]) {
-         [addr] = {Rt};
-         {Rd} = 0;
-       } else {
-         {Rd} = 1;
-       } */
-    fail_label = gen_new_label();
-    done_label = gen_new_label();
-    extaddr = tcg_temp_new_i64();
-    tcg_gen_extu_i32_i64(extaddr, addr);
-    tcg_gen_brcond_i64(TCG_COND_NE, extaddr, cpu_exclusive_addr, fail_label);
-    tcg_temp_free_i64(extaddr);
-
-    tmp = tcg_temp_new_i32();
-    switch (size) {
-    case 0:
-        gen_aa32_ld8u(tmp, addr, get_mem_index(s));
-        break;
-    case 1:
-        gen_aa32_ld16u(tmp, addr, get_mem_index(s));
-        break;
-    case 2:
-    case 3:
-        gen_aa32_ld32u(tmp, addr, get_mem_index(s));
-        break;
-    default:
-        abort();
-    }
-
-    val64 = tcg_temp_new_i64();
-    if (size == 3) {
-        TCGv_i32 tmp2 = tcg_temp_new_i32();
-        TCGv_i32 tmp3 = tcg_temp_new_i32();
-        tcg_gen_addi_i32(tmp2, addr, 4);
-        gen_aa32_ld32u(tmp3, tmp2, get_mem_index(s));
-        tcg_temp_free_i32(tmp2);
-        tcg_gen_concat_i32_i64(val64, tmp, tmp3);
-        tcg_temp_free_i32(tmp3);
-    } else {
-        tcg_gen_extu_i32_i64(val64, tmp);
-    }
-    tcg_temp_free_i32(tmp);
-
-    tcg_gen_brcond_i64(TCG_COND_NE, val64, cpu_exclusive_val, fail_label);
-    tcg_temp_free_i64(val64);
+    TCGv_i32 tmp2;
+    TCGv_i64 val = tcg_temp_new_i64();
+    TCGv_i32 tmp_size = tcg_const_i32(size);
 
     tmp = load_reg(s, rt);
-    switch (size) {
-    case 0:
-        gen_aa32_st8(tmp, addr, get_mem_index(s));
-        break;
-    case 1:
-        gen_aa32_st16(tmp, addr, get_mem_index(s));
-        break;
-    case 2:
-    case 3:
-        gen_aa32_st32(tmp, addr, get_mem_index(s));
-        break;
-    default:
-        abort();
-    }
+    tmp2 = load_reg(s, rt2);
+    tcg_gen_concat_i32_i64(val, tmp, tmp2);
     tcg_temp_free_i32(tmp);
-    if (size == 3) {
-        tcg_gen_addi_i32(addr, addr, 4);
-        tmp = load_reg(s, rt2);
-        gen_aa32_st32(tmp, addr, get_mem_index(s));
-        tcg_temp_free_i32(tmp);
-    }
-    tcg_gen_movi_i32(cpu_R[rd], 0);
-    tcg_gen_br(done_label);
-    gen_set_label(fail_label);
-    tcg_gen_movi_i32(cpu_R[rd], 1);
-    gen_set_label(done_label);
-    tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+    tcg_temp_free_i32(tmp2);
+
+    gen_helper_atomic_cmpxchg64(cpu_R[rd], cpu_env, addr, val, tmp_size);
+    tcg_temp_free_i64(val);
+    tcg_temp_free_i32(tmp_size);
 }
 #endif
 
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 13/18] cpu: introduce async_run_safe_work_on_cpu.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (11 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 12/18] Use atomic cmpxchg to atomically check the exclusive value in a STREX fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 15:35   ` Paolo Bonzini
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 14/18] add a callback when tb_invalidate is called fred.konrad
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

We already had async_run_on_cpu, but some tb_flush/invalidate tasks need all
VCPUs to be outside their execution loop:

async_run_safe_work_on_cpu schedules work on a VCPU, but that work only
starts once no VCPU is executing code any more.
While safe work is pending, cpu_has_work returns true, so cpu_exec returns
and the VCPUs cannot enter their execution loop. cpu_thread_is_idle returns
false, so as soon as all VCPUs are stop || stopped, the safe work queue can
be flushed.
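
A hedged usage sketch (do_flush_work is an invented wrapper; tb_flush and
first_cpu are QEMU's):

static void do_flush_work(void *opaque)
{
    /* By the time this runs, no VCPU is inside its execution loop, so
     * the flush cannot pull code out from under a running CPU. */
    tb_flush(opaque);
}

/* from any thread: */
async_run_safe_work_on_cpu(first_cpu, do_flush_work, env);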

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cpu-exec.c        |  5 ++++
 cpus.c            | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/qom/cpu.h | 21 +++++++++++++++++
 3 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index de256d6..d6442cd 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -382,6 +382,11 @@ int cpu_exec(CPUArchState *env)
     volatile bool have_tb_lock = false;
 #endif
 
+    if (async_safe_work_pending()) {
+        cpu->exit_request = 1;
+        return 0;
+    }
+
     if (cpu->halted) {
         if (!cpu_has_work(cpu)) {
             return EXCP_HALTED;
diff --git a/cpus.c b/cpus.c
index 5f13d73..aee445a 100644
--- a/cpus.c
+++ b/cpus.c
@@ -75,7 +75,7 @@ bool cpu_is_stopped(CPUState *cpu)
 
 bool cpu_thread_is_idle(CPUState *cpu)
 {
-    if (cpu->stop || cpu->queued_work_first) {
+    if (cpu->stop || cpu->queued_work_first || cpu->queued_safe_work_first) {
         return false;
     }
     if (cpu_is_stopped(cpu)) {
@@ -892,6 +892,69 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     qemu_cpu_kick(cpu);
 }
 
+void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
+                                void *data)
+{
+    struct qemu_work_item *wi;
+
+    wi = g_malloc0(sizeof(struct qemu_work_item));
+    wi->func = func;
+    wi->data = data;
+    wi->free = true;
+    if (cpu->queued_safe_work_first == NULL) {
+        cpu->queued_safe_work_first = wi;
+    } else {
+        cpu->queued_safe_work_last->next = wi;
+    }
+    cpu->queued_safe_work_last = wi;
+    wi->next = NULL;
+    wi->done = false;
+
+    CPU_FOREACH(cpu) {
+        qemu_cpu_kick_thread(cpu);
+    }
+}
+
+static void flush_queued_safe_work(CPUState *cpu)
+{
+    struct qemu_work_item *wi;
+    CPUState *other_cpu;
+
+    if (cpu->queued_safe_work_first == NULL) {
+        return;
+    }
+
+    CPU_FOREACH(other_cpu) {
+        if (other_cpu->tcg_executing != 0) {
+            return;
+        }
+    }
+
+    while ((wi = cpu->queued_safe_work_first)) {
+        cpu->queued_safe_work_first = wi->next;
+        wi->func(wi->data);
+        wi->done = true;
+        if (wi->free) {
+            g_free(wi);
+        }
+    }
+    cpu->queued_safe_work_last = NULL;
+    qemu_cond_broadcast(&qemu_work_cond);
+}
+
+bool async_safe_work_pending(void)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        if (cpu->queued_safe_work_first) {
+            return true;
+        }
+    }
+
+    return false;
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
     struct qemu_work_item *wi;
@@ -919,6 +982,9 @@ static void qemu_wait_io_event_common(CPUState *cpu)
         cpu->stopped = true;
         qemu_cond_signal(&qemu_pause_cond);
     }
+    qemu_mutex_unlock_iothread();
+    flush_queued_safe_work(cpu);
+    qemu_mutex_lock_iothread();
     flush_queued_work(cpu);
     cpu->thread_kicked = false;
 }
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 1464afa..8f3fe56 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -260,6 +260,7 @@ struct CPUState {
     bool running;
     struct QemuCond *halt_cond;
     struct qemu_work_item *queued_work_first, *queued_work_last;
+    struct qemu_work_item *queued_safe_work_first, *queued_safe_work_last;
     bool thread_kicked;
     bool created;
     bool stop;
@@ -548,6 +549,26 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
+ * async_run_safe_work_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously
+ * when all the VCPUs are outside their loop.
+ */
+void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
+                                void *data);
+
+/**
+ * async_safe_work_pending:
+ *
+ * Check whether any safe work is pending on any VCPUs.
+ * Returns: @true if a safe work is pending, @false otherwise.
+ */
+bool async_safe_work_pending(void);
+
+/**
  * qemu_get_cpu:
  * @index: The CPUState@cpu_index value of the CPU to obtain.
  *
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 14/18] add a callback when tb_invalidate is called.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (12 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 13/18] cpu: introduce async_run_safe_work_on_cpu fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 16:20   ` Paolo Bonzini
  2015-07-07 15:32   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all fred.konrad
                   ` (3 subsequent siblings)
  17 siblings, 2 replies; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

Instead of doing the jump cache invalidation directly in tb_invalidate, delay
it until after the exit, so that no other CPU ends up trying to execute the
code being invalidated.
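
In outline, the invalidation is now split in two (make_params() is an
invented helper; the two async calls are the ones this patch adds):

CPU_FOREACH(cpu) {
    /* per-CPU state: each VCPU purges its own tb_jmp_cache entry */
    async_run_on_cpu(cpu, cpu_discard_tb_from_jmp_cache,
                     make_params(cpu, tb));
}
/* the shared jump lists are only unlinked once every VCPU has left
 * its execution loop */
async_run_safe_work_on_cpu(first_cpu, tb_invalidate_jmp_remove, tb);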

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 translate-all.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index ade2269..468648d 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -61,6 +61,7 @@
 #include "translate-all.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
+#include "sysemu/cpus.h"
 
 //#define DEBUG_TB_INVALIDATE
 //#define DEBUG_FLUSH
@@ -966,14 +967,58 @@ static inline void tb_reset_jump(TranslationBlock *tb, int n)
     tb_set_jmp_target(tb, n, (uintptr_t)(tb->tc_ptr + tb->tb_next_offset[n]));
 }
 
+struct CPUDiscardTBParams {
+    CPUState *cpu;
+    TranslationBlock *tb;
+};
+
+static void cpu_discard_tb_from_jmp_cache(void *opaque)
+{
+    unsigned int h;
+    struct CPUDiscardTBParams *params = opaque;
+
+    h = tb_jmp_cache_hash_func(params->tb->pc);
+    if (params->cpu->tb_jmp_cache[h] == params->tb) {
+        params->cpu->tb_jmp_cache[h] = NULL;
+    }
+
+    g_free(opaque);
+}
+
+static void tb_invalidate_jmp_remove(void *opaque)
+{
+    TranslationBlock *tb = opaque;
+    TranslationBlock *tb1, *tb2;
+    unsigned int n1;
+
+    /* suppress this TB from the two jump lists */
+    tb_jmp_remove(tb, 0);
+    tb_jmp_remove(tb, 1);
+
+    /* suppress any remaining jumps to this TB */
+    tb1 = tb->jmp_first;
+    for (;;) {
+        n1 = (uintptr_t)tb1 & 3;
+        if (n1 == 2) {
+            break;
+        }
+        tb1 = (TranslationBlock *)((uintptr_t)tb1 & ~3);
+        tb2 = tb1->jmp_next[n1];
+        tb_reset_jump(tb1, n1);
+        tb1->jmp_next[n1] = NULL;
+        tb1 = tb2;
+    }
+    tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
+}
+
 /* invalidate one TB */
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 {
     CPUState *cpu;
     PageDesc *p;
-    unsigned int h, n1;
+    unsigned int h;
     tb_page_addr_t phys_pc;
-    TranslationBlock *tb1, *tb2;
+    struct CPUDiscardTBParams *params;
 
     tb_lock();
 
@@ -996,6 +1041,9 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
 
+#if 0 /*MTTCG*/
+    TranslationBlock *tb1, *tb2;
+    unsigned int n1;
     /* remove the TB from the hash list */
     h = tb_jmp_cache_hash_func(tb->pc);
     CPU_FOREACH(cpu) {
@@ -1022,6 +1070,15 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
         tb1 = tb2;
     }
     tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
+#else
+    CPU_FOREACH(cpu) {
+        params = g_malloc(sizeof(struct CPUDiscardTBParams));
+        params->cpu = cpu;
+        params->tb = tb;
+        async_run_on_cpu(cpu, cpu_discard_tb_from_jmp_cache, params);
+    }
+    async_run_safe_work_on_cpu(first_cpu, tb_invalidate_jmp_remove, tb);
+#endif /* MTTCG */
 
     tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
     tb_unlock();
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (13 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 14/18] add a callback when tb_invalidate is called fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 15:15   ` Paolo Bonzini
  2015-07-07 15:52   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 16/18] arm: use tlb_flush*_all fred.konrad
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

Some architectures allow flushing the TLB of other VCPUs. This is not a
problem when we have only one thread for all VCPUs, but it definitely needs
to be done as asynchronous work once we run truly multithreaded.

TODO: Add some test cases; I fear bad results in case a VCPU executes a
      barrier or something like that.
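
As a usage sketch, the ARM "invalidate all, inner shareable" hook from the
next patch reduces to a single call (code as in patch 16):

static void tlbiall_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                             uint64_t value)
{
    /* self: flushed inline; other VCPUs: flushed via async_run_on_cpu */
    tlb_flush_all(1);
}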

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 cputlb.c                | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/exec/exec-all.h |  2 ++
 2 files changed, 78 insertions(+)

diff --git a/cputlb.c b/cputlb.c
index 79fff1c..e5853fd 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -72,6 +72,45 @@ void tlb_flush(CPUState *cpu, int flush_global)
     tlb_flush_count++;
 }
 
+struct TLBFlushParams {
+    CPUState *cpu;
+    int flush_global;
+};
+
+static void tlb_flush_async_work(void *opaque)
+{
+    struct TLBFlushParams *params = opaque;
+
+    tlb_flush(params->cpu, params->flush_global);
+    g_free(params);
+}
+
+void tlb_flush_all(int flush_global)
+{
+    CPUState *cpu;
+    struct TLBFlushParams *params;
+
+#if 0 /* MTTCG */
+    CPU_FOREACH(cpu) {
+        tlb_flush(cpu, flush_global);
+    }
+#else
+    CPU_FOREACH(cpu) {
+        if (qemu_cpu_is_self(cpu)) {
+            /* async_run_on_cpu would handle this case too, but calling
+             * tlb_flush directly just avoids a malloc here.
+             */
+            tlb_flush(cpu, flush_global);
+        } else {
+            params = g_malloc(sizeof(struct TLBFlushParams));
+            params->cpu = cpu;
+            params->flush_global = flush_global;
+            async_run_on_cpu(cpu, tlb_flush_async_work, params);
+        }
+    }
+#endif /* MTTCG */
+}
+
 static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
 {
     if (addr == (tlb_entry->addr_read &
@@ -124,6 +163,43 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
     tb_flush_jmp_cache(cpu, addr);
 }
 
+struct TLBFlushPageParams {
+    CPUState *cpu;
+    target_ulong addr;
+};
+
+static void tlb_flush_page_async_work(void *opaque)
+{
+    struct TLBFlushPageParams *params = opaque;
+
+    tlb_flush_page(params->cpu, params->addr);
+    g_free(params);
+}
+
+void tlb_flush_page_all(target_ulong addr)
+{
+    CPUState *cpu;
+    struct TLBFlushPageParams *params;
+
+    CPU_FOREACH(cpu) {
+#if 0 /* !MTTCG */
+        tlb_flush_page(cpu, addr);
+#else
+        if (qemu_cpu_is_self(cpu)) {
+            /* async_run_on_cpu would handle this case too, but calling
+             * tlb_flush_page directly just avoids a malloc here.
+             */
+            tlb_flush_page(cpu, addr);
+        } else {
+            params = g_malloc(sizeof(struct TLBFlushPageParams));
+            params->cpu = cpu;
+            params->addr = addr;
+            async_run_on_cpu(cpu, tlb_flush_page_async_work, params);
+        }
+#endif /* MTTCG */
+    }
+}
+
 /* update the TLBs so that writes to code in the virtual page 'addr'
    can be detected */
 void tlb_protect_code(ram_addr_t ram_addr)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 44f3336..484c351 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -96,7 +96,9 @@ bool qemu_in_vcpu_thread(void);
 void cpu_reload_memory_map(CPUState *cpu);
 void tcg_cpu_address_space_init(CPUState *cpu, AddressSpace *as);
 /* cputlb.c */
+void tlb_flush_page_all(target_ulong addr);
 void tlb_flush_page(CPUState *cpu, target_ulong addr);
+void tlb_flush_all(int flush_global);
 void tlb_flush(CPUState *cpu, int flush_global);
 void tlb_set_page(CPUState *cpu, target_ulong vaddr,
                   hwaddr paddr, int prot,
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 16/18] arm: use tlb_flush*_all
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (14 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-07-07 16:14   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 17/18] translate-all: introduces tb_flush_safe fred.konrad
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb fred.konrad
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This just uses the new mechanism to ensure that each VCPU thread flushes its
own TLB.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 target-arm/helper.c | 45 +++++++--------------------------------------
 1 file changed, 7 insertions(+), 38 deletions(-)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index ad3d5da..1995439 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -411,41 +411,25 @@ static void tlbimvaa_write(CPUARMState *env, const ARMCPRegInfo *ri,
 static void tlbiall_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    CPUState *other_cs;
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush(other_cs, 1);
-    }
+    tlb_flush_all(1);
 }
 
 static void tlbiasid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    CPUState *other_cs;
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush(other_cs, value == 0);
-    }
+    tlb_flush_all(value == 0);
 }
 
 static void tlbimva_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    CPUState *other_cs;
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
-    }
+    tlb_flush_page_all(value & TARGET_PAGE_MASK);
 }
 
 static void tlbimvaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
-    CPUState *other_cs;
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
-    }
+    tlb_flush_page_all(value & TARGET_PAGE_MASK);
 }
 
 static const ARMCPRegInfo cp_reginfo[] = {
@@ -2281,34 +2265,19 @@ static void tlbi_aa64_asid_write(CPUARMState *env, const ARMCPRegInfo *ri,
 static void tlbi_aa64_va_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                   uint64_t value)
 {
-    CPUState *other_cs;
-    uint64_t pageaddr = sextract64(value << 12, 0, 56);
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush_page(other_cs, pageaddr);
-    }
+    tlb_flush_page_all(sextract64(value << 12, 0, 56));
 }
 
 static void tlbi_aa64_vaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                   uint64_t value)
 {
-    CPUState *other_cs;
-    uint64_t pageaddr = sextract64(value << 12, 0, 56);
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush_page(other_cs, pageaddr);
-    }
+    tlb_flush_page_all(sextract64(value << 12, 0, 56));
 }
 
 static void tlbi_aa64_asid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                   uint64_t value)
 {
-    CPUState *other_cs;
-    int asid = extract64(value, 48, 16);
-
-    CPU_FOREACH(other_cs) {
-        tlb_flush(other_cs, asid == 0);
-    }
+    tlb_flush_all(extract64(value, 48, 16) == 0);
 }
 
 static CPAccessResult aa64_zva_access(CPUARMState *env, const ARMCPRegInfo *ri)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 17/18] translate-all: introduces tb_flush_safe.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (15 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 16/18] arm: use tlb_flush*_all fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-07-07 16:16   ` Alex Bennée
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb fred.konrad
  17 siblings, 1 reply; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

tb_flush is not thread safe; we definitely need the VCPUs to exit their
execution loop before doing it. This introduces tb_flush_safe, which simply
schedules an async safe work item that will do the tb_flush later.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 include/exec/exec-all.h |  1 +
 translate-all.c         | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 484c351..b5e4fb3 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -219,6 +219,7 @@ static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
 
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUArchState *env);
+void tb_flush_safe(CPUArchState *env);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
 
 #if defined(USE_DIRECT_JUMP)
diff --git a/translate-all.c b/translate-all.c
index 468648d..8bd8fe8 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -815,6 +815,21 @@ static void page_flush_tb(void)
     }
 }
 
+static void tb_flush_work(void *opaque)
+{
+    CPUArchState *env = opaque;
+    tb_flush(env);
+}
+
+void tb_flush_safe(CPUArchState *env)
+{
+#if 0 /* !MTTCG */
+    tb_flush(env);
+#else
+    async_run_safe_work_on_cpu(ENV_GET_CPU(env), tb_flush_work, env);
+#endif /* MTTCG */
+}
+
 /* flush all the translation blocks */
 /* XXX: tb_flush is currently not thread safe */
 void tb_flush(CPUArchState *env1)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
  2015-06-26 14:47 [Qemu-devel] [RFC PATCH V6 00/18] Multithread TCG fred.konrad
                   ` (16 preceding siblings ...)
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 17/18] translate-all: introduces tb_flush_safe fred.konrad
@ 2015-06-26 14:47 ` fred.konrad
  2015-06-26 16:21   ` Paolo Bonzini
  2015-07-07 16:17   ` Alex Bennée
  17 siblings, 2 replies; 82+ messages in thread
From: fred.konrad @ 2015-06-26 14:47 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee,
	fred.konrad

From: KONRAD Frederic <fred.konrad@greensocs.com>

This changes just the tb_flush call made from tb_alloc.

TODO:
 * change the other tb_flush call sites.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
---
 translate-all.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/translate-all.c b/translate-all.c
index 8bd8fe8..9adaffa 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1147,7 +1147,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb = tb_alloc(pc);
     if (!tb) {
         /* flush must be done */
-        tb_flush(env);
+        tb_flush_safe(env);
         /* cannot fail at this point */
         tb = tb_alloc(pc);
         /* Don't forget to invalidate previous TB info.  */
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 03/18] remove unused spinlock.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 03/18] remove unused spinlock fred.konrad
@ 2015-06-26 14:53   ` Paolo Bonzini
  2015-06-26 15:29     ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 14:53 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 7f0aae9..d1e482a 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -2664,11 +2664,6 @@ sub process {
>  			WARN("Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt\n" . $herecurr);
>  		}
>  
> -# SPIN_LOCK_UNLOCKED & RW_LOCK_UNLOCKED are deprecated
> -		if ($line =~ /\b(SPIN_LOCK_UNLOCKED|RW_LOCK_UNLOCKED)/) {
> -			ERROR("Use of $1 is deprecated: see Documentation/spinlocks.txt\n" . $herecurr);
> -		}
> -
>  # warn about #if 0
>  		if ($line =~ /^.\s*\#\s*if\s+0\b/) {
>  			CHK("if this code is redundant consider removing it\n" .
> @@ -2717,8 +2712,8 @@ sub process {
>  			ERROR("exactly one space required after that #$1\n" . $herecurr);
>  		}
>  
> -# check for spinlock_t definitions without a comment.
> -		if ($line =~ /^.\s*(struct\s+mutex|spinlock_t)\s+\S+;/ ||
> +# check for mutex definitions without a comment.
> +		if ($line =~ /^.\s*(struct\s+mutex)\s+\S+;/ ||
>  		    $line =~ /^.\s*(DEFINE_MUTEX)\s*\(/) {
>  			my $which = $1;
>  			if (!ctx_has_comment($first_line, $linenr)) {

The checkpatch.pl parts simply come from Linux.  They don't matter for
QEMU, but we keep the changes to this script to a minimum.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 04/18] add support for spin lock on POSIX systems exclusively
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 04/18] add support for spin lock on POSIX systems exclusively fred.konrad
@ 2015-06-26 14:55   ` Paolo Bonzini
  2015-06-26 15:31     ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 14:55 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
> 
> WARNING: spin lock is currently not implemented on WIN32

The Windows KSPIN_LOCK is a kernel data structure.  You can implement a
simple, portable test-and-test-and-set spinlock using atomics, and use
it on both POSIX and Win32.
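
For example, a minimal untested sketch using C11 atomics (QEMU would use
its own atomic_* wrappers, but the idea is the same):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct QemuSpin {
        atomic_bool locked;
    } QemuSpin;

    static inline void qemu_spin_init(QemuSpin *spin)
    {
        atomic_store_explicit(&spin->locked, false, memory_order_relaxed);
    }

    static inline void qemu_spin_lock(QemuSpin *spin)
    {
        for (;;) {
            if (!atomic_exchange_explicit(&spin->locked, true,
                                          memory_order_acquire)) {
                return;             /* got the lock */
            }
            /* Test with plain loads before the next test-and-set so
             * waiters do not keep stealing the cache line. */
            while (atomic_load_explicit(&spin->locked,
                                        memory_order_relaxed)) {
                /* busy wait */
            }
        }
    }

    static inline void qemu_spin_unlock(QemuSpin *spin)
    {
        atomic_store_explicit(&spin->locked, false, memory_order_release);
    }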

Paolo

> Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
> ---
>  include/qemu/thread-posix.h |  4 ++++
>  include/qemu/thread-win32.h |  4 ++++
>  include/qemu/thread.h       |  7 +++++++
>  util/qemu-thread-posix.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  util/qemu-thread-win32.c    | 30 ++++++++++++++++++++++++++++++
>  5 files changed, 90 insertions(+)
> 
> diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
> index eb5c7a1..8ce8f01 100644
> --- a/include/qemu/thread-posix.h
> +++ b/include/qemu/thread-posix.h
> @@ -7,6 +7,10 @@ struct QemuMutex {
>      pthread_mutex_t lock;
>  };
>  
> +struct QemuSpin {
> +    pthread_spinlock_t lock;
> +};
> +
>  struct QemuCond {
>      pthread_cond_t cond;
>  };
> diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
> index 3d58081..310c8bd 100644
> --- a/include/qemu/thread-win32.h
> +++ b/include/qemu/thread-win32.h
> @@ -7,6 +7,10 @@ struct QemuMutex {
>      LONG owner;
>  };
>  
> +struct QemuSpin {
> +    PKSPIN_LOCK lock;
> +};
> +
>  struct QemuCond {
>      LONG waiters, target;
>      HANDLE sema;
> diff --git a/include/qemu/thread.h b/include/qemu/thread.h
> index 5114ec8..f5d1259 100644
> --- a/include/qemu/thread.h
> +++ b/include/qemu/thread.h
> @@ -5,6 +5,7 @@
>  #include <stdbool.h>
>  
>  typedef struct QemuMutex QemuMutex;
> +typedef struct QemuSpin QemuSpin;
>  typedef struct QemuCond QemuCond;
>  typedef struct QemuSemaphore QemuSemaphore;
>  typedef struct QemuEvent QemuEvent;
> @@ -25,6 +26,12 @@ void qemu_mutex_lock(QemuMutex *mutex);
>  int qemu_mutex_trylock(QemuMutex *mutex);
>  void qemu_mutex_unlock(QemuMutex *mutex);
>  
> +void qemu_spin_init(QemuSpin *spin);
> +void qemu_spin_destroy(QemuSpin *spin);
> +void qemu_spin_lock(QemuSpin *spin);
> +int qemu_spin_trylock(QemuSpin *spin);
> +void qemu_spin_unlock(QemuSpin *spin);
> +
>  void qemu_cond_init(QemuCond *cond);
>  void qemu_cond_destroy(QemuCond *cond);
>  
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index ba67cec..224bacc 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -89,6 +89,51 @@ void qemu_mutex_unlock(QemuMutex *mutex)
>          error_exit(err, __func__);
>  }
>  
> +void qemu_spin_init(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_init(&spin->lock, 0);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +}
> +
> +void qemu_spin_destroy(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_destroy(&spin->lock);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +}
> +
> +void qemu_spin_lock(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_lock(&spin->lock);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +}
> +
> +int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    return pthread_spin_trylock(&spin->lock);
> +}
> +
> +void qemu_spin_unlock(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_unlock(&spin->lock);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +}
> +
>  void qemu_cond_init(QemuCond *cond)
>  {
>      int err;
> diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
> index 406b52f..6fbe6a8 100644
> --- a/util/qemu-thread-win32.c
> +++ b/util/qemu-thread-win32.c
> @@ -80,6 +80,36 @@ void qemu_mutex_unlock(QemuMutex *mutex)
>      LeaveCriticalSection(&mutex->lock);
>  }
>  
> +void qemu_spin_init(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
> +void qemu_spin_destroy(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
> +void qemu_spin_lock(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
> +int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
> +void qemu_spin_unlock(QemuSpin *spin)
> +{
> +    printf("spinlock not implemented");
> +    abort();
> +}
> +
>  void qemu_cond_init(QemuCond *cond)
>  {
>      memset(cond, 0, sizeof(*cond));
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution fred.konrad
@ 2015-06-26 14:56   ` Jan Kiszka
  2015-06-26 15:08     ` Paolo Bonzini
  2015-06-26 15:36     ` Frederic Konrad
  0 siblings, 2 replies; 82+ messages in thread
From: Jan Kiszka @ 2015-06-26 14:56 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee

On 2015-06-26 16:47, fred.konrad@greensocs.com wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> This finally allows TCG to benefit from the iothread introduction: Drop
> the global mutex while running pure TCG CPU code. Reacquire the lock
> when entering MMIO or PIO emulation, or when leaving the TCG loop.
> 
> We have to revert a few optimizations for the current TCG threading
> model, namely kicking the TCG thread in qemu_mutex_lock_iothread and not
> kicking it in qemu_cpu_kick. We also need to disable RAM block
> reordering until we have a more efficient locking mechanism at hand.
> 
> I'm pretty sure some cases are still broken, definitely SMP (we no
> longer perform round-robin scheduling "by chance"). Still, a Linux x86
> UP guest and my Musicpal ARM model boot fine here. These numbers
> demonstrate where we gain something:
> 
> 20338 jan       20   0  331m  75m 6904 R   99  0.9   0:50.95 qemu-system-arm
> 20337 jan       20   0  331m  75m 6904 S   20  0.9   0:26.50 qemu-system-arm
> 
> The guest CPU was fully loaded, but the iothread could still run mostly
> independently on a second core. Without the patch we don't get beyond
> 
> 32206 jan       20   0  330m  73m 7036 R   82  0.9   1:06.00 qemu-system-arm
> 32204 jan       20   0  330m  73m 7036 S   21  0.9   0:17.03 qemu-system-arm
> 
> We don't benefit significantly, though, when the guest is not fully
> loading a host CPU.
> 
> Note that this patch depends on
> http://thread.gmane.org/gmane.comp.emulators.qemu/118657
> 
> Changes from Fred Konrad:
>   * Rebase on the current HEAD.
>   * Fixes a deadlock in qemu_devices_reset().
> ---
>  cpus.c                    | 17 ++++-------------
>  cputlb.c                  |  5 +++++
>  exec.c                    | 25 +++++++++++++++++++++++++
>  softmmu_template.h        |  5 +++++
>  target-i386/misc_helper.c | 27 ++++++++++++++++++++++++---
>  translate-all.c           |  2 ++
>  vl.c                      |  6 ++++++
>  7 files changed, 71 insertions(+), 16 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index 79383df..23c316c 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1034,7 +1034,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>      qemu_tcg_init_cpu_signals();
>      qemu_thread_get_self(cpu->thread);
>  
> -    qemu_mutex_lock(&qemu_global_mutex);
> +    qemu_mutex_lock_iothread();
>      CPU_FOREACH(cpu) {
>          cpu->thread_id = qemu_get_thread_id();
>          cpu->created = true;
> @@ -1145,18 +1145,7 @@ bool qemu_in_vcpu_thread(void)
>  
>  void qemu_mutex_lock_iothread(void)
>  {
> -    atomic_inc(&iothread_requesting_mutex);
> -    if (!tcg_enabled() || !first_cpu || !first_cpu->thread) {
> -        qemu_mutex_lock(&qemu_global_mutex);
> -        atomic_dec(&iothread_requesting_mutex);
> -    } else {
> -        if (qemu_mutex_trylock(&qemu_global_mutex)) {
> -            qemu_cpu_kick_thread(first_cpu);
> -            qemu_mutex_lock(&qemu_global_mutex);
> -        }
> -        atomic_dec(&iothread_requesting_mutex);
> -        qemu_cond_broadcast(&qemu_io_proceeded_cond);
> -    }
> +    qemu_mutex_lock(&qemu_global_mutex);
>  }
>  
>  void qemu_mutex_unlock_iothread(void)
> @@ -1377,7 +1366,9 @@ static int tcg_cpu_exec(CPUArchState *env)
>          cpu->icount_decr.u16.low = decr;
>          cpu->icount_extra = count;
>      }
> +    qemu_mutex_unlock_iothread();
>      ret = cpu_exec(env);
> +    qemu_mutex_lock_iothread();
>  #ifdef CONFIG_PROFILER
>      tcg_time += profile_getclock() - ti;
>  #endif
> diff --git a/cputlb.c b/cputlb.c
> index a506086..79fff1c 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -30,6 +30,9 @@
>  #include "exec/ram_addr.h"
>  #include "tcg/tcg.h"
>  
> +void qemu_mutex_lock_iothread(void);
> +void qemu_mutex_unlock_iothread(void);
> +
>  //#define DEBUG_TLB
>  //#define DEBUG_TLB_CHECK
>  
> @@ -125,8 +128,10 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
>     can be detected */
>  void tlb_protect_code(ram_addr_t ram_addr)
>  {
> +    qemu_mutex_lock_iothread();
>      cpu_physical_memory_test_and_clear_dirty(ram_addr, TARGET_PAGE_SIZE,
>                                               DIRTY_MEMORY_CODE);
> +    qemu_mutex_unlock_iothread();
>  }
>  
>  /* update the TLB so that writes in physical page 'phys_addr' are no longer
> diff --git a/exec.c b/exec.c
> index f7883d2..964e922 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1881,6 +1881,7 @@ static void check_watchpoint(int offset, int len, MemTxAttrs attrs, int flags)
>              wp->hitaddr = vaddr;
>              wp->hitattrs = attrs;
>              if (!cpu->watchpoint_hit) {
> +                qemu_mutex_unlock_iothread();
>                  cpu->watchpoint_hit = wp;
>                  tb_check_watchpoint(cpu);
>                  if (wp->flags & BP_STOP_BEFORE_ACCESS) {
> @@ -2740,6 +2741,7 @@ static inline uint32_t address_space_ldl_internal(AddressSpace *as, hwaddr addr,
>      mr = address_space_translate(as, addr, &addr1, &l, false);
>      if (l < 4 || !memory_access_is_direct(mr, false)) {
>          /* I/O case */
> +        qemu_mutex_lock_iothread();
>          r = memory_region_dispatch_read(mr, addr1, &val, 4, attrs);
>  #if defined(TARGET_WORDS_BIGENDIAN)
>          if (endian == DEVICE_LITTLE_ENDIAN) {
> @@ -2750,6 +2752,7 @@ static inline uint32_t address_space_ldl_internal(AddressSpace *as, hwaddr addr,
>              val = bswap32(val);
>          }
>  #endif
> +        qemu_mutex_unlock_iothread();
>      } else {
>          /* RAM case */
>          ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
> @@ -2829,6 +2832,7 @@ static inline uint64_t address_space_ldq_internal(AddressSpace *as, hwaddr addr,
>                                   false);
>      if (l < 8 || !memory_access_is_direct(mr, false)) {
>          /* I/O case */
> +        qemu_mutex_lock_iothread();
>          r = memory_region_dispatch_read(mr, addr1, &val, 8, attrs);
>  #if defined(TARGET_WORDS_BIGENDIAN)
>          if (endian == DEVICE_LITTLE_ENDIAN) {
> @@ -2839,6 +2843,7 @@ static inline uint64_t address_space_ldq_internal(AddressSpace *as, hwaddr addr,
>              val = bswap64(val);
>          }
>  #endif
> +        qemu_mutex_unlock_iothread();
>      } else {
>          /* RAM case */
>          ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
> @@ -2938,7 +2943,9 @@ static inline uint32_t address_space_lduw_internal(AddressSpace *as,
>                                   false);
>      if (l < 2 || !memory_access_is_direct(mr, false)) {
>          /* I/O case */
> +        qemu_mutex_lock_iothread();
>          r = memory_region_dispatch_read(mr, addr1, &val, 2, attrs);
> +        qemu_mutex_unlock_iothread();
>  #if defined(TARGET_WORDS_BIGENDIAN)
>          if (endian == DEVICE_LITTLE_ENDIAN) {
>              val = bswap16(val);
> @@ -3026,15 +3033,19 @@ void address_space_stl_notdirty(AddressSpace *as, hwaddr addr, uint32_t val,
>      mr = address_space_translate(as, addr, &addr1, &l,
>                                   true);
>      if (l < 4 || !memory_access_is_direct(mr, true)) {
> +        qemu_mutex_lock_iothread();
>          r = memory_region_dispatch_write(mr, addr1, val, 4, attrs);
> +        qemu_mutex_unlock_iothread();
>      } else {
>          addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
>          ptr = qemu_get_ram_ptr(addr1);
>          stl_p(ptr, val);
>  
> +        qemu_mutex_lock_iothread();
>          dirty_log_mask = memory_region_get_dirty_log_mask(mr);
>          dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
>          cpu_physical_memory_set_dirty_range(addr1, 4, dirty_log_mask);
> +        qemu_mutex_unlock_iothread();
>          r = MEMTX_OK;
>      }
>      if (result) {
> @@ -3074,7 +3085,9 @@ static inline void address_space_stl_internal(AddressSpace *as,
>              val = bswap32(val);
>          }
>  #endif
> +        qemu_mutex_lock_iothread();
>          r = memory_region_dispatch_write(mr, addr1, val, 4, attrs);
> +        qemu_mutex_unlock_iothread();
>      } else {
>          /* RAM case */
>          addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
> @@ -3090,7 +3103,9 @@ static inline void address_space_stl_internal(AddressSpace *as,
>              stl_p(ptr, val);
>              break;
>          }
> +        qemu_mutex_lock_iothread();
>          invalidate_and_set_dirty(mr, addr1, 4);
> +        qemu_mutex_unlock_iothread();
>          r = MEMTX_OK;
>      }
>      if (result) {
> @@ -3178,7 +3193,9 @@ static inline void address_space_stw_internal(AddressSpace *as,
>              val = bswap16(val);
>          }
>  #endif
> +        qemu_mutex_lock_iothread();
>          r = memory_region_dispatch_write(mr, addr1, val, 2, attrs);
> +        qemu_mutex_unlock_iothread();
>      } else {
>          /* RAM case */
>          addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
> @@ -3194,7 +3211,9 @@ static inline void address_space_stw_internal(AddressSpace *as,
>              stw_p(ptr, val);
>              break;
>          }
> +        qemu_mutex_lock_iothread();
>          invalidate_and_set_dirty(mr, addr1, 2);
> +        qemu_mutex_unlock_iothread();
>          r = MEMTX_OK;
>      }
>      if (result) {
> @@ -3245,7 +3264,9 @@ void address_space_stq(AddressSpace *as, hwaddr addr, uint64_t val,
>  {
>      MemTxResult r;
>      val = tswap64(val);
> +    qemu_mutex_lock_iothread();
>      r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
> +    qemu_mutex_unlock_iothread();
>      if (result) {
>          *result = r;
>      }
> @@ -3256,7 +3277,9 @@ void address_space_stq_le(AddressSpace *as, hwaddr addr, uint64_t val,
>  {
>      MemTxResult r;
>      val = cpu_to_le64(val);
> +    qemu_mutex_lock_iothread();
>      r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
> +    qemu_mutex_unlock_iothread();
>      if (result) {
>          *result = r;
>      }
> @@ -3266,7 +3289,9 @@ void address_space_stq_be(AddressSpace *as, hwaddr addr, uint64_t val,
>  {
>      MemTxResult r;
>      val = cpu_to_be64(val);
> +    qemu_mutex_lock_iothread();
>      r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
> +    qemu_mutex_unlock_iothread();
>      if (result) {
>          *result = r;
>      }
> diff --git a/softmmu_template.h b/softmmu_template.h
> index d42d89d..18871f5 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -158,9 +158,12 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
>          cpu_io_recompile(cpu, retaddr);
>      }
>  
> +    qemu_mutex_lock_iothread();
> +
>      cpu->mem_io_vaddr = addr;
>      memory_region_dispatch_read(mr, physaddr, &val, 1 << SHIFT,
>                                  iotlbentry->attrs);
> +    qemu_mutex_unlock_iothread();
>      return val;
>  }
>  #endif
> @@ -378,10 +381,12 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>          cpu_io_recompile(cpu, retaddr);
>      }
>  
> +    qemu_mutex_lock_iothread();
>      cpu->mem_io_vaddr = addr;
>      cpu->mem_io_pc = retaddr;
>      memory_region_dispatch_write(mr, physaddr, val, 1 << SHIFT,
>                                   iotlbentry->attrs);
> +    qemu_mutex_unlock_iothread();
>  }
>  
>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
> diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
> index 52c5d65..55f63bf 100644
> --- a/target-i386/misc_helper.c
> +++ b/target-i386/misc_helper.c
> @@ -27,8 +27,10 @@ void helper_outb(CPUX86State *env, uint32_t port, uint32_t data)
>  #ifdef CONFIG_USER_ONLY
>      fprintf(stderr, "outb: port=0x%04x, data=%02x\n", port, data);
>  #else
> +    qemu_mutex_lock_iothread();
>      address_space_stb(&address_space_io, port, data,
>                        cpu_get_mem_attrs(env), NULL);
> +    qemu_mutex_unlock_iothread();
>  #endif
>  }
>  
> @@ -38,8 +40,13 @@ target_ulong helper_inb(CPUX86State *env, uint32_t port)
>      fprintf(stderr, "inb: port=0x%04x\n", port);
>      return 0;
>  #else
> -    return address_space_ldub(&address_space_io, port,
> +    target_ulong ret;
> +
> +    qemu_mutex_lock_iothread();
> +    ret = address_space_ldub(&address_space_io, port,
>                                cpu_get_mem_attrs(env), NULL);
> +    qemu_mutex_unlock_iothread();
> +    return ret;
>  #endif
>  }
>  
> @@ -48,8 +55,10 @@ void helper_outw(CPUX86State *env, uint32_t port, uint32_t data)
>  #ifdef CONFIG_USER_ONLY
>      fprintf(stderr, "outw: port=0x%04x, data=%04x\n", port, data);
>  #else
> +    qemu_mutex_lock_iothread();
>      address_space_stw(&address_space_io, port, data,
>                        cpu_get_mem_attrs(env), NULL);
> +    qemu_mutex_unlock_iothread();
>  #endif
>  }
>  
> @@ -59,8 +68,13 @@ target_ulong helper_inw(CPUX86State *env, uint32_t port)
>      fprintf(stderr, "inw: port=0x%04x\n", port);
>      return 0;
>  #else
> -    return address_space_lduw(&address_space_io, port,
> +    target_ulong ret;
> +
> +    qemu_mutex_lock_iothread();
> +    ret = address_space_lduw(&address_space_io, port,
>                                cpu_get_mem_attrs(env), NULL);
> +    qemu_mutex_unlock_iothread();
> +    return ret;
>  #endif
>  }
>  
> @@ -69,8 +83,10 @@ void helper_outl(CPUX86State *env, uint32_t port, uint32_t data)
>  #ifdef CONFIG_USER_ONLY
>      fprintf(stderr, "outw: port=0x%04x, data=%08x\n", port, data);
>  #else
> +    qemu_mutex_lock_iothread();
>      address_space_stl(&address_space_io, port, data,
>                        cpu_get_mem_attrs(env), NULL);
> +    qemu_mutex_unlock_iothread();
>  #endif
>  }
>  
> @@ -80,8 +96,13 @@ target_ulong helper_inl(CPUX86State *env, uint32_t port)
>      fprintf(stderr, "inl: port=0x%04x\n", port);
>      return 0;
>  #else
> -    return address_space_ldl(&address_space_io, port,
> +    target_ulong ret;
> +
> +    qemu_mutex_lock_iothread();
> +    ret = address_space_ldl(&address_space_io, port,
>                               cpu_get_mem_attrs(env), NULL);
> +    qemu_mutex_unlock_iothread();
> +    return ret;
>  #endif
>  }
>  
> diff --git a/translate-all.c b/translate-all.c
> index c25b79b..ade2269 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -1222,6 +1222,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>  #endif
>  #ifdef TARGET_HAS_PRECISE_SMC
>      if (current_tb_modified) {
> +        qemu_mutex_unlock_iothread();
>          /* we generate a block containing just the instruction
>             modifying the memory. It will ensure that it cannot modify
>             itself */
> @@ -1326,6 +1327,7 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
>      p->first_tb = NULL;
>  #ifdef TARGET_HAS_PRECISE_SMC
>      if (current_tb_modified) {
> +        qemu_mutex_unlock_iothread();
>          /* we generate a block containing just the instruction
>             modifying the memory. It will ensure that it cannot modify
>             itself */
> diff --git a/vl.c b/vl.c
> index 69ad90c..2983d44 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -1698,10 +1698,16 @@ void qemu_devices_reset(void)
>  {
>      QEMUResetEntry *re, *nre;
>  
> +    /*
> +     * Some device's reset needs to grab the global_mutex. So just release it
> +     * here.

That's a property newly introduced by the patch, or how does this
happen? In turn, are all reset handlers now fine to be called outside of
the BQL? This looks suspicious, but it's been quite a while since I last
stared at this.

Jan

> +     */
> +    qemu_mutex_unlock_iothread();
>      /* reset all devices */
>      QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
>          re->func(re->opaque);
>      }
> +    qemu_mutex_lock_iothread();
>  }
>  
>  void qemu_system_reset(bool report)
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock fred.konrad
@ 2015-06-26 14:56   ` Paolo Bonzini
  2015-06-26 15:39     ` Frederic Konrad
  2015-06-26 16:20   ` Paolo Bonzini
  2015-07-07 12:22   ` Alex Bennée
  2 siblings, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 14:56 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>  
> diff --git a/target-arm/translate.c b/target-arm/translate.c
> index 971b6db..47345aa 100644
> --- a/target-arm/translate.c
> +++ b/target-arm/translate.c
> @@ -11162,6 +11162,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
>  
>      dc->tb = tb;
>  
> +    tb_lock();
> +
>      dc->is_jmp = DISAS_NEXT;
>      dc->pc = pc_start;
>      dc->singlestep_enabled = cs->singlestep_enabled;
> @@ -11499,6 +11501,7 @@ done_generating:
>          tb->size = dc->pc - pc_start;
>          tb->icount = num_insns;
>      }
> +    tb_unlock();
>  }
>  
>  void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
> @@ -11567,6 +11570,7 @@ void arm_cpu_dump_state(CPUState *cs, FILE *f, fprintf_function cpu_fprintf,
>  
>  void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
>  {
> +    tb_lock();
>      if (is_a64(env)) {
>          env->pc = tcg_ctx.gen_opc_pc[pc_pos];
>          env->condexec_bits = 0;
> @@ -11574,4 +11578,5 @@ void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
>          env->regs[15] = tcg_ctx.gen_opc_pc[pc_pos];
>          env->condexec_bits = gen_opc_condexec_bits[pc_pos];
>      }
> +    tb_unlock();
>  }

Should these instead be added to the callers?

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 06/18] tcg: remove tcg_halt_cond global variable.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 06/18] tcg: remove tcg_halt_cond global variable fred.konrad
@ 2015-06-26 15:02   ` Paolo Bonzini
  2015-06-26 15:41     ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 15:02 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> This removes the tcg_halt_cond global variable.
> We need one QemuCond per virtual cpu for multithread TCG.
> 
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpus.c | 18 +++++++-----------
>  1 file changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index 2d62a35..79383df 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -813,7 +813,6 @@ static unsigned iothread_requesting_mutex;
>  static QemuThread io_thread;
>  
>  static QemuThread *tcg_cpu_thread;
> -static QemuCond *tcg_halt_cond;
>  
>  /* cpu creation */
>  static QemuCond qemu_cpu_cond;
> @@ -919,15 +918,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
>      cpu->thread_kicked = false;
>  }
>  
> -static void qemu_tcg_wait_io_event(void)
> +static void qemu_tcg_wait_io_event(CPUState *cpu)
>  {
> -    CPUState *cpu;
> -
>      while (all_cpu_threads_idle()) {
>         /* Start accounting real time to the virtual clock if the CPUs
>            are idle.  */
>          qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
> -        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
> +        qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
>      }
>  
>      while (iothread_requesting_mutex) {
> @@ -1047,7 +1044,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>  
>      /* wait for initial kick-off after machine start */
>      while (first_cpu->stopped) {
> -        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
> +        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
>  
>          /* process any pending work */
>          CPU_FOREACH(cpu) {
> @@ -1068,7 +1065,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>                  qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
>              }
>          }
> -        qemu_tcg_wait_io_event();
> +        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));

Does this work (for non-multithreaded TCG) if tcg_thread_fn is waiting
on the "wrong" condition variable?  For example if all CPUs are idle and
the second CPU wakes up, qemu_tcg_wait_io_event won't be kicked out of
the wait.

I think you need to have a CPUThread struct like this:

   struct CPUThread {
       QemuThread thread;
       QemuCond halt_cond;
   };

and in CPUState have a CPUThread * field instead of the thread and
halt_cond fields.

Then single-threaded TCG can point all CPUStates to the same instance of
the struct, while multi-threaded TCG can point each CPUState to a
different struct.
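
For example (illustrative only; the cpu_thread field and the function
name are made up):

    /* single-threaded TCG: all vCPUs share one CPUThread */
    static struct CPUThread *tcg_shared_thread;

    static void qemu_tcg_init_cpu_thread(CPUState *cpu, bool mttcg)
    {
        if (mttcg) {
            cpu->cpu_thread = g_new0(struct CPUThread, 1);
            qemu_cond_init(&cpu->cpu_thread->halt_cond);
        } else {
            if (!tcg_shared_thread) {
                tcg_shared_thread = g_new0(struct CPUThread, 1);
                qemu_cond_init(&tcg_shared_thread->halt_cond);
            }
            cpu->cpu_thread = tcg_shared_thread;
        }
    }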

Paolo

>      }
>  
>      return NULL;
> @@ -1235,12 +1232,12 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
>  
>      tcg_cpu_address_space_init(cpu, cpu->as);
>  
> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> +    qemu_cond_init(cpu->halt_cond);
> +
>      /* share a single thread for all cpus with TCG */
>      if (!tcg_cpu_thread) {
>          cpu->thread = g_malloc0(sizeof(QemuThread));
> -        cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> -        qemu_cond_init(cpu->halt_cond);
> -        tcg_halt_cond = cpu->halt_cond;
>          snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
>                   cpu->cpu_index);
>          qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn,
> @@ -1254,7 +1251,6 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
>          tcg_cpu_thread = cpu->thread;
>      } else {
>          cpu->thread = tcg_cpu_thread;
> -        cpu->halt_cond = tcg_halt_cond;
>      }

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 08/18] cpu: remove exit_request global.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 08/18] cpu: remove exit_request global fred.konrad
@ 2015-06-26 15:03   ` Paolo Bonzini
  2015-07-07 13:04   ` Alex Bennée
  1 sibling, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 15:03 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> This removes the exit_request global and adds a variable in CPUState for it.
> Only the flag of the first CPU is used for the moment, as we still run with one
> TCG thread.

I think this should also be added to CPUThread.
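
I.e., something like this (a sketch, extending the struct I suggested
for patch 06/18):

    struct CPUThread {
        QemuThread thread;
        QemuCond halt_cond;
        bool exit_request;   /* replaces the old global */
    };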

Paolo

> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpu-exec.c | 15 ---------------
>  cpus.c     | 17 ++++++++++++++---
>  2 files changed, 14 insertions(+), 18 deletions(-)
> 
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 5d9b518..0644383 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -364,8 +364,6 @@ static void cpu_handle_debug_exception(CPUArchState *env)
>  
>  /* main execution loop */
>  
> -volatile sig_atomic_t exit_request;
> -
>  int cpu_exec(CPUArchState *env)
>  {
>      CPUState *cpu = ENV_GET_CPU(env);
> @@ -394,20 +392,8 @@ int cpu_exec(CPUArchState *env)
>  
>      current_cpu = cpu;
>  
> -    /* As long as current_cpu is null, up to the assignment just above,
> -     * requests by other threads to exit the execution loop are expected to
> -     * be issued using the exit_request global. We must make sure that our
> -     * evaluation of the global value is performed past the current_cpu
> -     * value transition point, which requires a memory barrier as well as
> -     * an instruction scheduling constraint on modern architectures.  */
> -    smp_mb();
> -
>      rcu_read_lock();
>  
> -    if (unlikely(exit_request)) {
> -        cpu->exit_request = 1;
> -    }
> -
>      cc->cpu_exec_enter(cpu);
>  
>      /* Calculate difference between guest clock and host clock.
> @@ -496,7 +482,6 @@ int cpu_exec(CPUArchState *env)
>                      }
>                  }
>                  if (unlikely(cpu->exit_request)) {
> -                    cpu->exit_request = 0;
>                      cpu->exception_index = EXCP_INTERRUPT;
>                      cpu_loop_exit(cpu);
>                  }
> diff --git a/cpus.c b/cpus.c
> index 23c316c..2541c56 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -137,6 +137,8 @@ typedef struct TimersState {
>  } TimersState;
>  
>  static TimersState timers_state;
> +/* CPU associated to this thread. */
> +static __thread CPUState *tcg_thread_cpu;
>  
>  int64_t cpu_get_icount_raw(void)
>  {
> @@ -661,12 +663,18 @@ static void cpu_handle_guest_debug(CPUState *cpu)
>      cpu->stopped = true;
>  }
>  
> +/**
> + * cpu_signal
> + * Signal handler when using TCG.
> + */
>  static void cpu_signal(int sig)
>  {
>      if (current_cpu) {
>          cpu_exit(current_cpu);
>      }
> -    exit_request = 1;
> +
> +    /* FIXME: We might want to check if the cpu is running? */
> +    tcg_thread_cpu->exit_request = true;
>  }
>  
>  #ifdef CONFIG_LINUX
> @@ -1031,6 +1039,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>  {
>      CPUState *cpu = arg;
>  
> +    tcg_thread_cpu = cpu;
>      qemu_tcg_init_cpu_signals();
>      qemu_thread_get_self(cpu->thread);
>  
> @@ -1393,7 +1402,8 @@ static void tcg_exec_all(void)
>      if (next_cpu == NULL) {
>          next_cpu = first_cpu;
>      }
> -    for (; next_cpu != NULL && !exit_request; next_cpu = CPU_NEXT(next_cpu)) {
> +    for (; next_cpu != NULL && !first_cpu->exit_request;
> +           next_cpu = CPU_NEXT(next_cpu)) {
>          CPUState *cpu = next_cpu;
>          CPUArchState *env = cpu->env_ptr;
>  
> @@ -1410,7 +1420,8 @@ static void tcg_exec_all(void)
>              break;
>          }
>      }
> -    exit_request = 0;
> +
> +    first_cpu->exit_request = 0;
>  }
>  
>  void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution
  2015-06-26 14:56   ` Jan Kiszka
@ 2015-06-26 15:08     ` Paolo Bonzini
  2015-06-26 15:36     ` Frederic Konrad
  1 sibling, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 15:08 UTC (permalink / raw)
  To: Jan Kiszka, fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:56, Jan Kiszka wrote:
>> > +    /*
>> > +     * Some device's reset needs to grab the global_mutex. So just release it
>> > +     * here.
> That's a property newly introduced by the patch, or how does this
> happen? In turn, are all reset handlers now fine to be called outside of
> BQL? This looks suspicious, but it's been quite a while since I last
> starred at this.

Yes, this looks weird...  I guess this goes with the "RFC" part of the
patch.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all fred.konrad
@ 2015-06-26 15:15   ` Paolo Bonzini
  2015-06-26 15:54     ` Frederic Konrad
  2015-07-07 15:52   ` Alex Bennée
  1 sibling, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 15:15 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> +    CPU_FOREACH(cpu) {
> +        if (qemu_cpu_is_self(cpu)) {
> +            /* async_run_on_cpu would handle this case too, but calling
> +             * tlb_flush directly just avoids a malloc here.
> +             */
> +            tlb_flush(cpu, flush_global);
> +        } else {
> +            params = g_malloc(sizeof(struct TLBFlushParams));
> +            params->cpu = cpu;
> +            params->flush_global = flush_global;
> +            async_run_on_cpu(cpu, tlb_flush_async_work, params);

Shouldn't this be synchronous (which you cannot do straightforwardly
because of deadlocks---hence the need to hook cpu_has_work as discussed
earlier)?
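
A synchronous variant would look roughly like this (my sketch, not the
patch; it shows only the completion counting and ignores the deadlock,
which is exactly why the cpu_has_work hook would be needed):

    static int tlb_flushes_pending;

    static void tlb_flush_sync_work(void *opaque)
    {
        struct TLBFlushParams *params = opaque;

        tlb_flush(params->cpu, params->flush_global);
        atomic_dec(&tlb_flushes_pending);
        g_free(params);
    }

    /* caller: atomic_inc(&tlb_flushes_pending) once per queued item,
     * then wait for the remote vCPUs to run the work: */
    while (atomic_read(&tlb_flushes_pending)) {
        /* deadlocks if a target vCPU is itself waiting for us */
    }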

Paolo

> +        }
> +    }

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 03/18] remove unused spinlock.
  2015-06-26 14:53   ` Paolo Bonzini
@ 2015-06-26 15:29     ` Frederic Konrad
  2015-06-26 15:46       ` Paolo Bonzini
  0 siblings, 1 reply; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 15:29 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 16:53, Paolo Bonzini wrote:
>
> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>> index 7f0aae9..d1e482a 100755
>> --- a/scripts/checkpatch.pl
>> +++ b/scripts/checkpatch.pl
>> @@ -2664,11 +2664,6 @@ sub process {
>>   			WARN("Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt\n" . $herecurr);
>>   		}
>>   
>> -# SPIN_LOCK_UNLOCKED & RW_LOCK_UNLOCKED are deprecated
>> -		if ($line =~ /\b(SPIN_LOCK_UNLOCKED|RW_LOCK_UNLOCKED)/) {
>> -			ERROR("Use of $1 is deprecated: see Documentation/spinlocks.txt\n" . $herecurr);
>> -		}
>> -
>>   # warn about #if 0
>>   		if ($line =~ /^.\s*\#\s*if\s+0\b/) {
>>   			CHK("if this code is redundant consider removing it\n" .
>> @@ -2717,8 +2712,8 @@ sub process {
>>   			ERROR("exactly one space required after that #$1\n" . $herecurr);
>>   		}
>>   
>> -# check for spinlock_t definitions without a comment.
>> -		if ($line =~ /^.\s*(struct\s+mutex|spinlock_t)\s+\S+;/ ||
>> +# check for mutex definitions without a comment.
>> +		if ($line =~ /^.\s*(struct\s+mutex)\s+\S+;/ ||
>>   		    $line =~ /^.\s*(DEFINE_MUTEX)\s*\(/) {
>>   			my $which = $1;
>>   			if (!ctx_has_comment($first_line, $linenr)) {
> The checkpatch.pl parts simply come from Linux.  They don't matter for
> QEMU, but we're limiting the changes to the minimum in this script.
>
> Paolo
Ok so I can drop this part from the patch?

Thanks,
Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 04/18] add support for spin lock on POSIX systems exclusively
  2015-06-26 14:55   ` Paolo Bonzini
@ 2015-06-26 15:31     ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 15:31 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 16:55, Paolo Bonzini wrote:
>
> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>> From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
>>
>> WARNING: spin lock is currently not implemented on WIN32
> The Windows KSPIN_LOCK is a kernel data structure.  You can implement a
> simple, portable test-and-test-and-set spinlock using atomics, and use
> it on both POSIX and Win32.
>
> Paolo

ok we will take a look at atomic instruction.

Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 13/18] cpu: introduce async_run_safe_work_on_cpu.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 13/18] cpu: introduce async_run_safe_work_on_cpu fred.konrad
@ 2015-06-26 15:35   ` Paolo Bonzini
  2015-06-26 16:09     ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 15:35 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> diff --git a/cpu-exec.c b/cpu-exec.c
> index de256d6..d6442cd 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c

Nice solution.  However I still have a few questions that need
clarification.

> @@ -382,6 +382,11 @@ int cpu_exec(CPUArchState *env)
>      volatile bool have_tb_lock = false;
>  #endif
>  
> +    if (async_safe_work_pending()) {
> +        cpu->exit_request = 1;
> +        return 0;
> +    }

Perhaps move this to cpu_can_run()?
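
I.e., roughly (a sketch; the existing stop/stopped checks in
cpu_can_run stay as they are):

    static bool cpu_can_run(CPUState *cpu)
    {
        if (cpu->stop || cpu_is_stopped(cpu)) {
            return false;
        }
        if (async_safe_work_pending()) {
            cpu->exit_request = 1;   /* make the loop exit early */
            return false;
        }
        return true;
    }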

>      if (cpu->halted) {
>          if (!cpu_has_work(cpu)) {
>              return EXCP_HALTED;
> diff --git a/cpus.c b/cpus.c
> index 5f13d73..aee445a 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -75,7 +75,7 @@ bool cpu_is_stopped(CPUState *cpu)
>  
>  bool cpu_thread_is_idle(CPUState *cpu)
>  {
> -    if (cpu->stop || cpu->queued_work_first) {
> +    if (cpu->stop || cpu->queued_work_first || cpu->queued_safe_work_first) {
>          return false;
>      }
>      if (cpu_is_stopped(cpu)) {
> @@ -892,6 +892,69 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>      qemu_cpu_kick(cpu);
>  }
>  
> +void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
> +                                void *data)
> +{

Do you need a mutex to protect this data structure?  I would use one
even if not strictly necessary, to avoid introducing new BQL-protected
structures.

Also, can you add a count of how many such work items exist in the whole
system, in order to speed up async_safe_work_pending?
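
For instance (a sketch, using the existing atomic helpers):

    static int safe_work_pending_count;  /* all queued safe work items */

    /* in async_run_safe_work_on_cpu(), after queuing wi: */
    atomic_inc(&safe_work_pending_count);

    /* in flush_queued_safe_work(), after running each wi: */
    atomic_dec(&safe_work_pending_count);

    bool async_safe_work_pending(void)
    {
        return atomic_read(&safe_work_pending_count) != 0;
    }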

> +    struct qemu_work_item *wi;
> +
> +    wi = g_malloc0(sizeof(struct qemu_work_item));
> +    wi->func = func;
> +    wi->data = data;
> +    wi->free = true;
> +    if (cpu->queued_safe_work_first == NULL) {
> +        cpu->queued_safe_work_first = wi;
> +    } else {
> +        cpu->queued_safe_work_last->next = wi;
> +    }
> +    cpu->queued_safe_work_last = wi;
> +    wi->next = NULL;
> +    wi->done = false;
> +
> +    CPU_FOREACH(cpu) {
> +        qemu_cpu_kick_thread(cpu);
> +    }
> +}
> +
> +static void flush_queued_safe_work(CPUState *cpu)
> +{
> +    struct qemu_work_item *wi;
> +    CPUState *other_cpu;
> +
> +    if (cpu->queued_safe_work_first == NULL) {
> +        return;
> +    }
> +
> +    CPU_FOREACH(other_cpu) {
> +        if (other_cpu->tcg_executing != 0) {

This causes the thread to busy wait until everyone has exited, right?
Not a big deal, but worth a comment.
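
Something along these lines, perhaps:

    CPU_FOREACH(other_cpu) {
        if (other_cpu->tcg_executing != 0) {
            /* Another vCPU is still in its execution loop; give up
             * for now.  cpu_exec() checks async_safe_work_pending()
             * on entry, so the last vCPU to leave the loop will end
             * up running the queued work. */
            return;
        }
    }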

Paolo

> +            return;
> +        }
> +    }
> +
> +    while ((wi = cpu->queued_safe_work_first)) {
> +        cpu->queued_safe_work_first = wi->next;
> +        wi->func(wi->data);
> +        wi->done = true;
> +        if (wi->free) {
> +            g_free(wi);
> +        }
> +    }
> +    cpu->queued_safe_work_last = NULL;
> +    qemu_cond_broadcast(&qemu_work_cond);
> +}
> +
> +bool async_safe_work_pending(void)
> +{
> +    CPUState *cpu;
> +
> +    CPU_FOREACH(cpu) {
> +        if (cpu->queued_safe_work_first) {
> +            return true;
> +        }
> +    }
> +
> +    return false;
> +}
> +

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution
  2015-06-26 14:56   ` Jan Kiszka
  2015-06-26 15:08     ` Paolo Bonzini
@ 2015-06-26 15:36     ` Frederic Konrad
  2015-06-26 15:42       ` Jan Kiszka
  2015-07-07 12:33       ` Alex Bennée
  1 sibling, 2 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 15:36 UTC (permalink / raw)
  To: Jan Kiszka, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee

On 26/06/2015 16:56, Jan Kiszka wrote:
> On 2015-06-26 16:47, fred.konrad@greensocs.com wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> This finally allows TCG to benefit from the iothread introduction: Drop
>> the global mutex while running pure TCG CPU code. Reacquire the lock
>> when entering MMIO or PIO emulation, or when leaving the TCG loop.
>>
>> We have to revert a few optimizations for the current TCG threading
>> model, namely kicking the TCG thread in qemu_mutex_lock_iothread and not
>> kicking it in qemu_cpu_kick. We also need to disable RAM block
>> reordering until we have a more efficient locking mechanism at hand.
>>
>> I'm pretty sure some cases are still broken, definitely SMP (we no
>> longer perform round-robin scheduling "by chance"). Still, a Linux x86
>> UP guest and my Musicpal ARM model boot fine here. These numbers
>> demonstrate where we gain something:
>>
>> 20338 jan       20   0  331m  75m 6904 R   99  0.9   0:50.95 qemu-system-arm
>> 20337 jan       20   0  331m  75m 6904 S   20  0.9   0:26.50 qemu-system-arm
>>
>> The guest CPU was fully loaded, but the iothread could still run mostly
>> independently on a second core. Without the patch we don't get beyond
>>
>> 32206 jan       20   0  330m  73m 7036 R   82  0.9   1:06.00 qemu-system-arm
>> 32204 jan       20   0  330m  73m 7036 S   21  0.9   0:17.03 qemu-system-arm
>>
>> We don't benefit significantly, though, when the guest is not fully
>> loading a host CPU.
>>
>> Note that this patch depends on
>> http://thread.gmane.org/gmane.comp.emulators.qemu/118657
>>
>> Changes from Fred Konrad:
>>    * Rebase on the current HEAD.
>>    * Fixes a deadlock in qemu_devices_reset().
>> ---
>>   cpus.c                    | 17 ++++-------------
>>   cputlb.c                  |  5 +++++
>>   exec.c                    | 25 +++++++++++++++++++++++++
>>   softmmu_template.h        |  5 +++++
>>   target-i386/misc_helper.c | 27 ++++++++++++++++++++++++---
>>   translate-all.c           |  2 ++
>>   vl.c                      |  6 ++++++
>>   7 files changed, 71 insertions(+), 16 deletions(-)
>>
>> diff --git a/cpus.c b/cpus.c
>> index 79383df..23c316c 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -1034,7 +1034,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>>       qemu_tcg_init_cpu_signals();
>>       qemu_thread_get_self(cpu->thread);
>>   
>> -    qemu_mutex_lock(&qemu_global_mutex);
>> +    qemu_mutex_lock_iothread();
>>       CPU_FOREACH(cpu) {
>>           cpu->thread_id = qemu_get_thread_id();
>>           cpu->created = true;
>> @@ -1145,18 +1145,7 @@ bool qemu_in_vcpu_thread(void)
>>   
>>   void qemu_mutex_lock_iothread(void)
>>   {
>> -    atomic_inc(&iothread_requesting_mutex);
>> -    if (!tcg_enabled() || !first_cpu || !first_cpu->thread) {
>> -        qemu_mutex_lock(&qemu_global_mutex);
>> -        atomic_dec(&iothread_requesting_mutex);
>> -    } else {
>> -        if (qemu_mutex_trylock(&qemu_global_mutex)) {
>> -            qemu_cpu_kick_thread(first_cpu);
>> -            qemu_mutex_lock(&qemu_global_mutex);
>> -        }
>> -        atomic_dec(&iothread_requesting_mutex);
>> -        qemu_cond_broadcast(&qemu_io_proceeded_cond);
>> -    }
>> +    qemu_mutex_lock(&qemu_global_mutex);
>>   }
>>   
>>   void qemu_mutex_unlock_iothread(void)
>> @@ -1377,7 +1366,9 @@ static int tcg_cpu_exec(CPUArchState *env)
>>           cpu->icount_decr.u16.low = decr;
>>           cpu->icount_extra = count;
>>       }
>> +    qemu_mutex_unlock_iothread();
>>       ret = cpu_exec(env);
>> +    qemu_mutex_lock_iothread();
>>   #ifdef CONFIG_PROFILER
>>       tcg_time += profile_getclock() - ti;
>>   #endif
>> diff --git a/cputlb.c b/cputlb.c
>> index a506086..79fff1c 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -30,6 +30,9 @@
>>   #include "exec/ram_addr.h"
>>   #include "tcg/tcg.h"
>>   
>> +void qemu_mutex_lock_iothread(void);
>> +void qemu_mutex_unlock_iothread(void);
>> +
>>   //#define DEBUG_TLB
>>   //#define DEBUG_TLB_CHECK
>>   
>> @@ -125,8 +128,10 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
>>      can be detected */
>>   void tlb_protect_code(ram_addr_t ram_addr)
>>   {
>> +    qemu_mutex_lock_iothread();
>>       cpu_physical_memory_test_and_clear_dirty(ram_addr, TARGET_PAGE_SIZE,
>>                                                DIRTY_MEMORY_CODE);
>> +    qemu_mutex_unlock_iothread();
>>   }
>>   
>>   /* update the TLB so that writes in physical page 'phys_addr' are no longer
>> diff --git a/exec.c b/exec.c
>> index f7883d2..964e922 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -1881,6 +1881,7 @@ static void check_watchpoint(int offset, int len, MemTxAttrs attrs, int flags)
>>               wp->hitaddr = vaddr;
>>               wp->hitattrs = attrs;
>>               if (!cpu->watchpoint_hit) {
>> +                qemu_mutex_unlock_iothread();
>>                   cpu->watchpoint_hit = wp;
>>                   tb_check_watchpoint(cpu);
>>                   if (wp->flags & BP_STOP_BEFORE_ACCESS) {
>> @@ -2740,6 +2741,7 @@ static inline uint32_t address_space_ldl_internal(AddressSpace *as, hwaddr addr,
>>       mr = address_space_translate(as, addr, &addr1, &l, false);
>>       if (l < 4 || !memory_access_is_direct(mr, false)) {
>>           /* I/O case */
>> +        qemu_mutex_lock_iothread();
>>           r = memory_region_dispatch_read(mr, addr1, &val, 4, attrs);
>>   #if defined(TARGET_WORDS_BIGENDIAN)
>>           if (endian == DEVICE_LITTLE_ENDIAN) {
>> @@ -2750,6 +2752,7 @@ static inline uint32_t address_space_ldl_internal(AddressSpace *as, hwaddr addr,
>>               val = bswap32(val);
>>           }
>>   #endif
>> +        qemu_mutex_unlock_iothread();
>>       } else {
>>           /* RAM case */
>>           ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
>> @@ -2829,6 +2832,7 @@ static inline uint64_t address_space_ldq_internal(AddressSpace *as, hwaddr addr,
>>                                    false);
>>       if (l < 8 || !memory_access_is_direct(mr, false)) {
>>           /* I/O case */
>> +        qemu_mutex_lock_iothread();
>>           r = memory_region_dispatch_read(mr, addr1, &val, 8, attrs);
>>   #if defined(TARGET_WORDS_BIGENDIAN)
>>           if (endian == DEVICE_LITTLE_ENDIAN) {
>> @@ -2839,6 +2843,7 @@ static inline uint64_t address_space_ldq_internal(AddressSpace *as, hwaddr addr,
>>               val = bswap64(val);
>>           }
>>   #endif
>> +        qemu_mutex_unlock_iothread();
>>       } else {
>>           /* RAM case */
>>           ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
>> @@ -2938,7 +2943,9 @@ static inline uint32_t address_space_lduw_internal(AddressSpace *as,
>>                                    false);
>>       if (l < 2 || !memory_access_is_direct(mr, false)) {
>>           /* I/O case */
>> +        qemu_mutex_lock_iothread();
>>           r = memory_region_dispatch_read(mr, addr1, &val, 2, attrs);
>> +        qemu_mutex_unlock_iothread();
>>   #if defined(TARGET_WORDS_BIGENDIAN)
>>           if (endian == DEVICE_LITTLE_ENDIAN) {
>>               val = bswap16(val);
>> @@ -3026,15 +3033,19 @@ void address_space_stl_notdirty(AddressSpace *as, hwaddr addr, uint32_t val,
>>       mr = address_space_translate(as, addr, &addr1, &l,
>>                                    true);
>>       if (l < 4 || !memory_access_is_direct(mr, true)) {
>> +        qemu_mutex_lock_iothread();
>>           r = memory_region_dispatch_write(mr, addr1, val, 4, attrs);
>> +        qemu_mutex_unlock_iothread();
>>       } else {
>>           addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
>>           ptr = qemu_get_ram_ptr(addr1);
>>           stl_p(ptr, val);
>>   
>> +        qemu_mutex_lock_iothread();
>>           dirty_log_mask = memory_region_get_dirty_log_mask(mr);
>>           dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
>>           cpu_physical_memory_set_dirty_range(addr1, 4, dirty_log_mask);
>> +        qemu_mutex_unlock_iothread();
>>           r = MEMTX_OK;
>>       }
>>       if (result) {
>> @@ -3074,7 +3085,9 @@ static inline void address_space_stl_internal(AddressSpace *as,
>>               val = bswap32(val);
>>           }
>>   #endif
>> +        qemu_mutex_lock_iothread();
>>           r = memory_region_dispatch_write(mr, addr1, val, 4, attrs);
>> +        qemu_mutex_unlock_iothread();
>>       } else {
>>           /* RAM case */
>>           addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
>> @@ -3090,7 +3103,9 @@ static inline void address_space_stl_internal(AddressSpace *as,
>>               stl_p(ptr, val);
>>               break;
>>           }
>> +        qemu_mutex_lock_iothread();
>>           invalidate_and_set_dirty(mr, addr1, 4);
>> +        qemu_mutex_unlock_iothread();
>>           r = MEMTX_OK;
>>       }
>>       if (result) {
>> @@ -3178,7 +3193,9 @@ static inline void address_space_stw_internal(AddressSpace *as,
>>               val = bswap16(val);
>>           }
>>   #endif
>> +        qemu_mutex_lock_iothread();
>>           r = memory_region_dispatch_write(mr, addr1, val, 2, attrs);
>> +        qemu_mutex_unlock_iothread();
>>       } else {
>>           /* RAM case */
>>           addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
>> @@ -3194,7 +3211,9 @@ static inline void address_space_stw_internal(AddressSpace *as,
>>               stw_p(ptr, val);
>>               break;
>>           }
>> +        qemu_mutex_lock_iothread();
>>           invalidate_and_set_dirty(mr, addr1, 2);
>> +        qemu_mutex_unlock_iothread();
>>           r = MEMTX_OK;
>>       }
>>       if (result) {
>> @@ -3245,7 +3264,9 @@ void address_space_stq(AddressSpace *as, hwaddr addr, uint64_t val,
>>   {
>>       MemTxResult r;
>>       val = tswap64(val);
>> +    qemu_mutex_lock_iothread();
>>       r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
>> +    qemu_mutex_unlock_iothread();
>>       if (result) {
>>           *result = r;
>>       }
>> @@ -3256,7 +3277,9 @@ void address_space_stq_le(AddressSpace *as, hwaddr addr, uint64_t val,
>>   {
>>       MemTxResult r;
>>       val = cpu_to_le64(val);
>> +    qemu_mutex_lock_iothread();
>>       r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
>> +    qemu_mutex_unlock_iothread();
>>       if (result) {
>>           *result = r;
>>       }
>> @@ -3266,7 +3289,9 @@ void address_space_stq_be(AddressSpace *as, hwaddr addr, uint64_t val,
>>   {
>>       MemTxResult r;
>>       val = cpu_to_be64(val);
>> +    qemu_mutex_lock_iothread();
>>       r = address_space_rw(as, addr, attrs, (void *) &val, 8, 1);
>> +    qemu_mutex_unlock_iothread();
>>       if (result) {
>>           *result = r;
>>       }
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index d42d89d..18871f5 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -158,9 +158,12 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
>>           cpu_io_recompile(cpu, retaddr);
>>       }
>>   
>> +    qemu_mutex_lock_iothread();
>> +
>>       cpu->mem_io_vaddr = addr;
>>       memory_region_dispatch_read(mr, physaddr, &val, 1 << SHIFT,
>>                                   iotlbentry->attrs);
>> +    qemu_mutex_unlock_iothread();
>>       return val;
>>   }
>>   #endif
>> @@ -378,10 +381,12 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>>           cpu_io_recompile(cpu, retaddr);
>>       }
>>   
>> +    qemu_mutex_lock_iothread();
>>       cpu->mem_io_vaddr = addr;
>>       cpu->mem_io_pc = retaddr;
>>       memory_region_dispatch_write(mr, physaddr, val, 1 << SHIFT,
>>                                    iotlbentry->attrs);
>> +    qemu_mutex_unlock_iothread();
>>   }
>>   
>>   void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>> diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
>> index 52c5d65..55f63bf 100644
>> --- a/target-i386/misc_helper.c
>> +++ b/target-i386/misc_helper.c
>> @@ -27,8 +27,10 @@ void helper_outb(CPUX86State *env, uint32_t port, uint32_t data)
>>   #ifdef CONFIG_USER_ONLY
>>       fprintf(stderr, "outb: port=0x%04x, data=%02x\n", port, data);
>>   #else
>> +    qemu_mutex_lock_iothread();
>>       address_space_stb(&address_space_io, port, data,
>>                         cpu_get_mem_attrs(env), NULL);
>> +    qemu_mutex_unlock_iothread();
>>   #endif
>>   }
>>   
>> @@ -38,8 +40,13 @@ target_ulong helper_inb(CPUX86State *env, uint32_t port)
>>       fprintf(stderr, "inb: port=0x%04x\n", port);
>>       return 0;
>>   #else
>> -    return address_space_ldub(&address_space_io, port,
>> +    target_ulong ret;
>> +
>> +    qemu_mutex_lock_iothread();
>> +    ret = address_space_ldub(&address_space_io, port,
>>                                 cpu_get_mem_attrs(env), NULL);
>> +    qemu_mutex_unlock_iothread();
>> +    return ret;
>>   #endif
>>   }
>>   
>> @@ -48,8 +55,10 @@ void helper_outw(CPUX86State *env, uint32_t port, uint32_t data)
>>   #ifdef CONFIG_USER_ONLY
>>       fprintf(stderr, "outw: port=0x%04x, data=%04x\n", port, data);
>>   #else
>> +    qemu_mutex_lock_iothread();
>>       address_space_stw(&address_space_io, port, data,
>>                         cpu_get_mem_attrs(env), NULL);
>> +    qemu_mutex_unlock_iothread();
>>   #endif
>>   }
>>   
>> @@ -59,8 +68,13 @@ target_ulong helper_inw(CPUX86State *env, uint32_t port)
>>       fprintf(stderr, "inw: port=0x%04x\n", port);
>>       return 0;
>>   #else
>> -    return address_space_lduw(&address_space_io, port,
>> +    target_ulong ret;
>> +
>> +    qemu_mutex_lock_iothread();
>> +    ret = address_space_lduw(&address_space_io, port,
>>                                 cpu_get_mem_attrs(env), NULL);
>> +    qemu_mutex_unlock_iothread();
>> +    return ret;
>>   #endif
>>   }
>>   
>> @@ -69,8 +83,10 @@ void helper_outl(CPUX86State *env, uint32_t port, uint32_t data)
>>   #ifdef CONFIG_USER_ONLY
>>       fprintf(stderr, "outw: port=0x%04x, data=%08x\n", port, data);
>>   #else
>> +    qemu_mutex_lock_iothread();
>>       address_space_stl(&address_space_io, port, data,
>>                         cpu_get_mem_attrs(env), NULL);
>> +    qemu_mutex_unlock_iothread();
>>   #endif
>>   }
>>   
>> @@ -80,8 +96,13 @@ target_ulong helper_inl(CPUX86State *env, uint32_t port)
>>       fprintf(stderr, "inl: port=0x%04x\n", port);
>>       return 0;
>>   #else
>> -    return address_space_ldl(&address_space_io, port,
>> +    target_ulong ret;
>> +
>> +    qemu_mutex_lock_iothread();
>> +    ret = address_space_ldl(&address_space_io, port,
>>                                cpu_get_mem_attrs(env), NULL);
>> +    qemu_mutex_unlock_iothread();
>> +    return ret;
>>   #endif
>>   }
>>   
>> diff --git a/translate-all.c b/translate-all.c
>> index c25b79b..ade2269 100644
>> --- a/translate-all.c
>> +++ b/translate-all.c
>> @@ -1222,6 +1222,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>>   #endif
>>   #ifdef TARGET_HAS_PRECISE_SMC
>>       if (current_tb_modified) {
>> +        qemu_mutex_unlock_iothread();
>>           /* we generate a block containing just the instruction
>>              modifying the memory. It will ensure that it cannot modify
>>              itself */
>> @@ -1326,6 +1327,7 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
>>       p->first_tb = NULL;
>>   #ifdef TARGET_HAS_PRECISE_SMC
>>       if (current_tb_modified) {
>> +        qemu_mutex_unlock_iothread();
>>           /* we generate a block containing just the instruction
>>              modifying the memory. It will ensure that it cannot modify
>>              itself */
>> diff --git a/vl.c b/vl.c
>> index 69ad90c..2983d44 100644
>> --- a/vl.c
>> +++ b/vl.c
>> @@ -1698,10 +1698,16 @@ void qemu_devices_reset(void)
>>   {
>>       QEMUResetEntry *re, *nre;
>>   
>> +    /*
>> +     * Some devices' resets need to grab the global_mutex. So just release it
>> +     * here.
> That's a property newly introduced by the patch, or how does this
> happen? In turn, are all reset handlers now fine to be called outside of
> BQL? This looks suspicious, but it's been quite a while since I last
> stared at this.
>
> Jan
Hi Jan,

Sorry for that, it's a dirty hack :).
Some reset handlers probably load stuff into memory, hence a double lock.
It will probably disappear with:

http://thread.gmane.org/gmane.comp.emulators.qemu/345258
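
To make the double lock concrete, here is a stand-alone pthreads sketch
(plain C, not QEMU code) of a reset handler re-entering the memory path
once that path takes the lock itself:

#include <pthread.h>
#include <stdio.h>

/* stands in for qemu_global_mutex; not recursive */
static pthread_mutex_t bql = PTHREAD_MUTEX_INITIALIZER;

static void device_reset(void)
{
    /* the handler loads something into guest memory; with this series
     * the memory-access path now takes the "BQL" on its own */
    pthread_mutex_lock(&bql);    /* second lock by the same thread: hangs */
    pthread_mutex_unlock(&bql);
}

int main(void)
{
    pthread_mutex_lock(&bql);    /* qemu_devices_reset() runs with BQL held */
    device_reset();              /* never returns */
    pthread_mutex_unlock(&bql);
    printf("not reached\n");
    return 0;
}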

Thanks,
Fred

>> +     */
>> +    qemu_mutex_unlock_iothread();
>>       /* reset all devices */
>>       QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
>>           re->func(re->opaque);
>>       }
>> +    qemu_mutex_lock_iothread();
>>   }
>>   
>>   void qemu_system_reset(bool report)
>>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock.
  2015-06-26 14:56   ` Paolo Bonzini
@ 2015-06-26 15:39     ` Frederic Konrad
  2015-06-26 15:45       ` Paolo Bonzini
  0 siblings, 1 reply; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 15:39 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 16:56, Paolo Bonzini wrote:
>
> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>>   
>> diff --git a/target-arm/translate.c b/target-arm/translate.c
>> index 971b6db..47345aa 100644
>> --- a/target-arm/translate.c
>> +++ b/target-arm/translate.c
>> @@ -11162,6 +11162,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
>>   
>>       dc->tb = tb;
>>   
>> +    tb_lock();
>> +
>>       dc->is_jmp = DISAS_NEXT;
>>       dc->pc = pc_start;
>>       dc->singlestep_enabled = cs->singlestep_enabled;
>> @@ -11499,6 +11501,7 @@ done_generating:
>>           tb->size = dc->pc - pc_start;
>>           tb->icount = num_insns;
>>       }
>> +    tb_unlock();
>>   }
>>   
>>   void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
>> @@ -11567,6 +11570,7 @@ void arm_cpu_dump_state(CPUState *cs, FILE *f, fprintf_function cpu_fprintf,
>>   
>>   void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
>>   {
>> +    tb_lock();
>>       if (is_a64(env)) {
>>           env->pc = tcg_ctx.gen_opc_pc[pc_pos];
>>           env->condexec_bits = 0;
>> @@ -11574,4 +11578,5 @@ void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
>>           env->regs[15] = tcg_ctx.gen_opc_pc[pc_pos];
>>           env->condexec_bits = gen_opc_condexec_bits[pc_pos];
>>       }
>> +    tb_unlock();
>>   }
> Should these instead be added to the callers?
>
> Paolo
Good point,
I see only one caller and the mutex is already locked.

Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 06/18] tcg: remove tcg_halt_cond global variable.
  2015-06-26 15:02   ` Paolo Bonzini
@ 2015-06-26 15:41     ` Frederic Konrad
  2015-07-07 12:27       ` Alex Bennée
  0 siblings, 1 reply; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 15:41 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 17:02, Paolo Bonzini wrote:
>
> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This removes tcg_halt_cond global variable.
>> We need one QemuCond per virtual cpu for multithread TCG.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>   cpus.c | 18 +++++++-----------
>>   1 file changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/cpus.c b/cpus.c
>> index 2d62a35..79383df 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -813,7 +813,6 @@ static unsigned iothread_requesting_mutex;
>>   static QemuThread io_thread;
>>   
>>   static QemuThread *tcg_cpu_thread;
>> -static QemuCond *tcg_halt_cond;
>>   
>>   /* cpu creation */
>>   static QemuCond qemu_cpu_cond;
>> @@ -919,15 +918,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
>>       cpu->thread_kicked = false;
>>   }
>>   
>> -static void qemu_tcg_wait_io_event(void)
>> +static void qemu_tcg_wait_io_event(CPUState *cpu)
>>   {
>> -    CPUState *cpu;
>> -
>>       while (all_cpu_threads_idle()) {
>>          /* Start accounting real time to the virtual clock if the CPUs
>>             are idle.  */
>>           qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
>> -        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
>> +        qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
>>       }
>>   
>>       while (iothread_requesting_mutex) {
>> @@ -1047,7 +1044,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>>   
>>       /* wait for initial kick-off after machine start */
>>       while (first_cpu->stopped) {
>> -        qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
>> +        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
>>   
>>           /* process any pending work */
>>           CPU_FOREACH(cpu) {
>> @@ -1068,7 +1065,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>>                   qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
>>               }
>>           }
>> -        qemu_tcg_wait_io_event();
>> +        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
> Does this work (for non-multithreaded TCG) if tcg_thread_fn is waiting
> on the "wrong" condition variable?  For example if all CPUs are idle and
> the second CPU wakes up, qemu_tcg_wait_io_event won't be kicked out of
> the wait.
>
> I think you need to have a CPUThread struct like this:
>
>     struct CPUThread {
>         QemuThread thread;
>         QemuCond halt_cond;
>     };
>
> and in CPUState have a CPUThread * field instead of the thread and
> halt_cond fields.
>
> Then single-threaded TCG can point all CPUStates to the same instance of
> the struct, while multi-threaded TCG can point each CPUState to a
> different struct.
>
> Paolo

Hmm, probably not, though we didn't pay attention to keeping the non-MTTCG
case working (which is probably not good).
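
For reference, a sketch of the sharing you describe -- the tcg_thread
field, the mttcg flag and the helper name are hypothetical, not from this
series:

typedef struct CPUThread {
    QemuThread thread;
    QemuCond halt_cond;
} CPUThread;

static CPUThread *single_tcg_instance;

static void tcg_assign_cpu_thread(CPUState *cpu, bool mttcg)
{
    if (mttcg) {
        /* multi-threaded TCG: one thread/halt_cond pair per vCPU */
        cpu->tcg_thread = g_new0(CPUThread, 1);
        qemu_cond_init(&cpu->tcg_thread->halt_cond);
    } else {
        /* single-threaded TCG: every vCPU points at the same pair */
        if (!single_tcg_instance) {
            single_tcg_instance = g_new0(CPUThread, 1);
            qemu_cond_init(&single_tcg_instance->halt_cond);
        }
        cpu->tcg_thread = single_tcg_instance;
    }
}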

>
>>       }
>>   
>>       return NULL;
>> @@ -1235,12 +1232,12 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
>>   
>>       tcg_cpu_address_space_init(cpu, cpu->as);
>>   
>> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> +    qemu_cond_init(cpu->halt_cond);
>> +
>>       /* share a single thread for all cpus with TCG */
>>       if (!tcg_cpu_thread) {
>>           cpu->thread = g_malloc0(sizeof(QemuThread));
>> -        cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> -        qemu_cond_init(cpu->halt_cond);
>> -        tcg_halt_cond = cpu->halt_cond;
>>           snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
>>                    cpu->cpu_index);
>>           qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn,
>> @@ -1254,7 +1251,6 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
>>           tcg_cpu_thread = cpu->thread;
>>       } else {
>>           cpu->thread = tcg_cpu_thread;
>> -        cpu->halt_cond = tcg_halt_cond;
>>       }

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution
  2015-06-26 15:36     ` Frederic Konrad
@ 2015-06-26 15:42       ` Jan Kiszka
  2015-06-26 16:11         ` Frederic Konrad
  2015-07-07 12:33       ` Alex Bennée
  1 sibling, 1 reply; 82+ messages in thread
From: Jan Kiszka @ 2015-06-26 15:42 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee

On 2015-06-26 17:36, Frederic Konrad wrote:
> On 26/06/2015 16:56, Jan Kiszka wrote:
>> On 2015-06-26 16:47, fred.konrad@greensocs.com wrote:
>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> This finally allows TCG to benefit from the iothread introduction: Drop
>>> the global mutex while running pure TCG CPU code. Reacquire the lock
>>> when entering MMIO or PIO emulation, or when leaving the TCG loop.
>>>
>>> We have to revert a few optimizations for the current TCG threading
>>> model, namely kicking the TCG thread in qemu_mutex_lock_iothread and not
>>> kicking it in qemu_cpu_kick. We also need to disable RAM block
>>> reordering until we have a more efficient locking mechanism at hand.
>>>
>>> I'm pretty sure some cases are still broken, definitely SMP (we no
>>> longer perform round-robin scheduling "by chance"). Still, a Linux x86
>>> UP guest and my Musicpal ARM model boot fine here. These numbers
>>> demonstrate where we gain something:
>>>
>>> 20338 jan       20   0  331m  75m 6904 R   99  0.9   0:50.95
>>> qemu-system-arm
>>> 20337 jan       20   0  331m  75m 6904 S   20  0.9   0:26.50
>>> qemu-system-arm
>>>
>>> The guest CPU was fully loaded, but the iothread could still run mostly
>>> independent on a second core. Without the patch we don't get beyond
>>>
>>> 32206 jan       20   0  330m  73m 7036 R   82  0.9   1:06.00
>>> qemu-system-arm
>>> 32204 jan       20   0  330m  73m 7036 S   21  0.9   0:17.03
>>> qemu-system-arm
>>>
>>> We don't benefit significantly, though, when the guest is not fully
>>> loading a host CPU.
>>>
>>> Note that this patch depends on
>>> http://thread.gmane.org/gmane.comp.emulators.qemu/118657
>>>
>>> Changes from Fred Konrad:
>>>    * Rebase on the current HEAD.
>>>    * Fixes a deadlock in qemu_devices_reset().
>>> ---
>>>   cpus.c                    | 17 ++++-------------
>>>   cputlb.c                  |  5 +++++
>>>   exec.c                    | 25 +++++++++++++++++++++++++
>>>   softmmu_template.h        |  5 +++++
>>>   target-i386/misc_helper.c | 27 ++++++++++++++++++++++++---
>>>   translate-all.c           |  2 ++
>>>   vl.c                      |  6 ++++++
>>>   7 files changed, 71 insertions(+), 16 deletions(-)
>>>
>>> [hunks for cpus.c, cputlb.c, exec.c, softmmu_template.h,
>>> target-i386/misc_helper.c and translate-all.c snipped; quoted in full
>>> earlier in the thread]
>>> diff --git a/vl.c b/vl.c
>>> index 69ad90c..2983d44 100644
>>> --- a/vl.c
>>> +++ b/vl.c
>>> @@ -1698,10 +1698,16 @@ void qemu_devices_reset(void)
>>>   {
>>>       QEMUResetEntry *re, *nre;
>>>
>>> +    /*
>>> +     * Some devices' resets need to grab the global_mutex. So just release it
>>> +     * here.
>> That's a property newly introduced by the patch, or how does this
>> happen? In turn, are all reset handlers now fine to be called outside of
>> BQL? This looks suspicious, but it's been quite a while since I last
>> stared at this.
>>
>> Jan
> Hi Jan,
> 
> Sorry for that, it's a dirty hack :).
> Some reset handlers probably load stuff into memory, hence a double lock.
> It will probably disappear with:
> 
> http://thread.gmane.org/gmane.comp.emulators.qemu/345258

Hmm, skeptical, at least as long as most devices work under BQL.

Do you have some backtraces from lockups?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock.
  2015-06-26 15:39     ` Frederic Konrad
@ 2015-06-26 15:45       ` Paolo Bonzini
  0 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 15:45 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 17:39, Frederic Konrad wrote:
>>>
>>> @@ -11567,6 +11570,7 @@ void arm_cpu_dump_state(CPUState *cs, FILE
>>> *f, fprintf_function cpu_fprintf,
>>>     void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb,
>>> int pc_pos)
>>>   {
>>> +    tb_lock();
>>>       if (is_a64(env)) {
>>>           env->pc = tcg_ctx.gen_opc_pc[pc_pos];
>>>           env->condexec_bits = 0;
>>> @@ -11574,4 +11578,5 @@ void restore_state_to_opc(CPUARMState *env,
>>> TranslationBlock *tb, int pc_pos)
>>>           env->regs[15] = tcg_ctx.gen_opc_pc[pc_pos];
>>>           env->condexec_bits = gen_opc_condexec_bits[pc_pos];
>>>       }
>>> +    tb_unlock();
>>>   }
>> Should these instead be added to the callers?
>>
>> Paolo
> Good point,
> I see only one caller and the mutex is already locked.

Good, then add a comment in include/exec/exec-all.h ("/* Called with
tb_lock held.  */") please!
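
Something like this, next to the declaration (exact placement and
prototype assumed, not checked against the tree):

/* include/exec/exec-all.h */

/* Called with tb_lock held.  */
void restore_state_to_opc(CPUArchState *env, struct TranslationBlock *tb,
                          int pc_pos);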

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 03/18] remove unused spinlock.
  2015-06-26 15:29     ` Frederic Konrad
@ 2015-06-26 15:46       ` Paolo Bonzini
  0 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 15:46 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 17:29, Frederic Konrad wrote:
>> The checkpatch.pl parts simply come from Linux.  They don't matter for
>> QEMU, but we're limiting the changes to the minimum in this script.
>
> OK, so I can drop this part from the patch?

Yes, please!

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 15:15   ` Paolo Bonzini
@ 2015-06-26 15:54     ` Frederic Konrad
  2015-06-26 16:01       ` Paolo Bonzini
  0 siblings, 1 reply; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 15:54 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 17:15, Paolo Bonzini wrote:
>
> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>> +    CPU_FOREACH(cpu) {
>> +        if (qemu_cpu_is_self(cpu)) {
>> +            /* async_run_on_cpu handles this case but this just avoids a malloc
>> +             * here.
>> +             */
>> +            tlb_flush(cpu, flush_global);
>> +        } else {
>> +            params = g_malloc(sizeof(struct TLBFlushParams));
>> +            params->cpu = cpu;
>> +            params->flush_global = flush_global;
>> +            async_run_on_cpu(cpu, tlb_flush_async_work, params);
> Shouldn't this be synchronous (which you cannot do straightforwardly
> because of deadlocks---hence the need to hook cpu_has_work as discussed
> earlier)?
>
> Paolo
>
I think it doesn't require being synchronous, as each VCPU only clears its
own TLB here:

void tlb_flush(CPUState *cpu, int flush_global)
{
     CPUArchState *env = cpu->env_ptr;

#if defined(DEBUG_TLB)
     printf("tlb_flush:\n");
#endif
     /* must reset current TB so that interrupts cannot modify the
        links while we are modifying them */
     cpu->current_tb = NULL;

     memset(env->tlb_table, -1, sizeof(env->tlb_table));
     memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
     memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));

     env->vtlb_index = 0;
     env->tlb_flush_addr = -1;
     env->tlb_flush_mask = 0;
     tlb_flush_count++;
}

So what happens is:
An ARM instruction wants to clear the TLB of all VCPUs, e.g. the IS version
of TLBIALL. The VCPU which executes TLBIALL_IS can't flush the TLB of
another VCPU. It will just ask all VCPU threads to exit and to do tlb_flush,
hence the async work.

Maybe the big issue is the memory barrier instructions here, which I didn't
check.

Fred
>> +        }
>> +    }

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 15:54     ` Frederic Konrad
@ 2015-06-26 16:01       ` Paolo Bonzini
  2015-06-26 16:08         ` Peter Maydell
  0 siblings, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 16:01 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 17:54, Frederic Konrad wrote:
>>
> I think it doesn't require being synchronous, as each VCPU only clears
> its own TLB here:
> 
> void tlb_flush(CPUState *cpu, int flush_global)
> {
>     CPUArchState *env = cpu->env_ptr;
> 
> #if defined(DEBUG_TLB)
>     printf("tlb_flush:\n");
> #endif
>     /* must reset current TB so that interrupts cannot modify the
>        links while we are modifying them */
>     cpu->current_tb = NULL;
> 
>     memset(env->tlb_table, -1, sizeof(env->tlb_table));
>     memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
>     memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
> 
>     env->vtlb_index = 0;
>     env->tlb_flush_addr = -1;
>     env->tlb_flush_mask = 0;
>     tlb_flush_count++;
> }
> 
> So what happens is:
> An ARM instruction wants to clear the TLB of all VCPUs, e.g. the IS
> version of TLBIALL. The VCPU which executes TLBIALL_IS can't flush the
> TLB of another VCPU. It will just ask all VCPU threads to exit and to do
> tlb_flush, hence the async work.
> 
> Maybe the big issue is the memory barrier instructions here, which I
> didn't check.

Yeah, ISTR that in some cases you have to wait for other CPUs to
invalidate the TLB before proceeding.  Maybe it's only when you have a
dmb instruction, but it's probably simpler for QEMU to always do it
synchronously.
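
For illustration, a sketch of a synchronous variant (names hypothetical,
not from the series) -- queue the per-CPU flush as async work, then wait
until every vCPU has run it:

typedef struct {
    int pending;        /* remaining vCPUs; protected by qemu_global_mutex */
    QemuCond done;
} TLBFlushSync;

typedef struct {
    TLBFlushSync *sync;
    CPUState *cpu;
    int flush_global;
} TLBFlushWork;

static void tlb_flush_sync_work(void *data)
{
    TLBFlushWork *w = data;

    tlb_flush(w->cpu, w->flush_global);
    if (--w->sync->pending == 0) {   /* work items run with the BQL held */
        qemu_cond_signal(&w->sync->done);
    }
    g_free(w);
}

void tlb_flush_all_sync(CPUState *self, int flush_global)
{
    TLBFlushSync sync = { .pending = 0 };
    CPUState *cpu;

    qemu_cond_init(&sync.done);
    CPU_FOREACH(cpu) {
        if (cpu != self) {
            TLBFlushWork *w = g_new0(TLBFlushWork, 1);

            w->sync = &sync;
            w->cpu = cpu;
            w->flush_global = flush_global;
            sync.pending++;
            async_run_on_cpu(cpu, tlb_flush_sync_work, w);
        }
    }
    tlb_flush(self, flush_global);
    /* Caveat: sleeping with the BQL held deadlocks if another vCPU needs
     * the lock before servicing its work queue -- hence the cpu_has_work
     * hook discussed earlier in the thread. */
    while (sync.pending) {
        qemu_cond_wait(&sync.done, &qemu_global_mutex);
    }
}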

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 16:01       ` Paolo Bonzini
@ 2015-06-26 16:08         ` Peter Maydell
  2015-06-26 16:30           ` Frederic Konrad
                             ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Peter Maydell @ 2015-06-26 16:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Alexander Spyridakis, Mark Burton, Alexander Graf,
	QEMU Developers, Guillaume Delbergue, Alistair Francis,
	Alex Bennée, Frederic Konrad

On 26 June 2015 at 17:01, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 26/06/2015 17:54, Frederic Konrad wrote:
>> So what happens is:
>> An ARM instruction wants to clear the TLB of all VCPUs, e.g. the IS
>> version of TLBIALL. The VCPU which executes TLBIALL_IS can't flush the
>> TLB of another VCPU. It will just ask all VCPU threads to exit and to do
>> tlb_flush, hence the async work.
>>
>> Maybe the big issue is the memory barrier instructions here, which I
>> didn't check.
>
> Yeah, ISTR that in some cases you have to wait for other CPUs to
> invalidate the TLB before proceeding.  Maybe it's only when you have a
> dmb instruction, but it's probably simpler for QEMU to always do it
> synchronously.

Yeah, the ARM architectural requirement here is that the TLB
operation is complete after a DSB instruction executes. (True for
any TLB op, not just the all-CPUs ones). NB that we also call
tlb_flush() from target-arm/ code for some things like "we just
updated a system register"; some of those have "must take effect
immediately" semantics.

In any case, for generic code we have to also consider the
semantics of non-ARM guests...

thanks
-- PMM

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 13/18] cpu: introduce async_run_safe_work_on_cpu.
  2015-06-26 15:35   ` Paolo Bonzini
@ 2015-06-26 16:09     ` Frederic Konrad
  2015-06-26 16:23       ` Paolo Bonzini
  0 siblings, 1 reply; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 16:09 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 17:35, Paolo Bonzini wrote:
> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index de256d6..d6442cd 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
> Nice solution.  However I still have a few questions that need
> clarification.
>
>> @@ -382,6 +382,11 @@ int cpu_exec(CPUArchState *env)
>>       volatile bool have_tb_lock = false;
>>   #endif
>>   
>> +    if (async_safe_work_pending()) {
>> +        cpu->exit_request = 1;
>> +        return 0;
>> +    }
> Perhaps move this to cpu_can_run()?
Yes, why not.

>
>>       if (cpu->halted) {
>>           if (!cpu_has_work(cpu)) {
>>               return EXCP_HALTED;
>> diff --git a/cpus.c b/cpus.c
>> index 5f13d73..aee445a 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -75,7 +75,7 @@ bool cpu_is_stopped(CPUState *cpu)
>>   
>>   bool cpu_thread_is_idle(CPUState *cpu)
>>   {
>> -    if (cpu->stop || cpu->queued_work_first) {
>> +    if (cpu->stop || cpu->queued_work_first || cpu->queued_safe_work_first) {
>>           return false;
>>       }
>>       if (cpu_is_stopped(cpu)) {
>> @@ -892,6 +892,69 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>>       qemu_cpu_kick(cpu);
>>   }
>>   
>> +void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
>> +                                void *data)
>> +{
> Do you need a mutex to protect this data structure?  I would use one
> even if not strictly necessary, to avoid introducing new BQL-protected
> structures.

For the moment it's called by tb_invalidate and tb_flush_safe; the second
lacks a tb_lock/unlock pair, which should be added. I don't need another
mutex unless this is used elsewhere?

> Also, can you add a count of how many such work items exist in the whole
> system, in order to speed up async_safe_work_pending?

Yes, that makes sense.
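
A sketch of that counter, reusing the existing atomic helpers (placement
hypothetical):

static int safe_work_pending_count;

/* in async_run_safe_work_on_cpu(), when queuing an item: */
    atomic_inc(&safe_work_pending_count);

/* in flush_queued_safe_work(), once per completed item: */
    atomic_dec(&safe_work_pending_count);

bool async_safe_work_pending(void)
{
    /* O(1) instead of walking every CPU's queue */
    return atomic_read(&safe_work_pending_count) != 0;
}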
>> +    struct qemu_work_item *wi;
>> +
>> +    wi = g_malloc0(sizeof(struct qemu_work_item));
>> +    wi->func = func;
>> +    wi->data = data;
>> +    wi->free = true;
>> +    if (cpu->queued_safe_work_first == NULL) {
>> +        cpu->queued_safe_work_first = wi;
>> +    } else {
>> +        cpu->queued_safe_work_last->next = wi;
>> +    }
>> +    cpu->queued_safe_work_last = wi;
>> +    wi->next = NULL;
>> +    wi->done = false;
>> +
>> +    CPU_FOREACH(cpu) {
>> +        qemu_cpu_kick_thread(cpu);
>> +    }
>> +}
>> +
>> +static void flush_queued_safe_work(CPUState *cpu)
>> +{
>> +    struct qemu_work_item *wi;
>> +    CPUState *other_cpu;
>> +
>> +    if (cpu->queued_safe_work_first == NULL) {
>> +        return;
>> +    }
>> +
>> +    CPU_FOREACH(other_cpu) {
>> +        if (other_cpu->tcg_executing != 0) {
> This causes the thread to busy wait until everyone has exited, right?
> Not a big deal, but worth a comment.

Right.

Fred
> Paolo
>
>> +            return;
>> +        }
>> +    }
>> +
>> +    while ((wi = cpu->queued_safe_work_first)) {
>> +        cpu->queued_safe_work_first = wi->next;
>> +        wi->func(wi->data);
>> +        wi->done = true;
>> +        if (wi->free) {
>> +            g_free(wi);
>> +        }
>> +    }
>> +    cpu->queued_safe_work_last = NULL;
>> +    qemu_cond_broadcast(&qemu_work_cond);
>> +}
>> +
>> +bool async_safe_work_pending(void)
>> +{
>> +    CPUState *cpu;
>> +
>> +    CPU_FOREACH(cpu) {
>> +        if (cpu->queued_safe_work_first) {
>> +            return true;
>> +        }
>> +    }
>> +
>> +    return false;
>> +}
>> +

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution
  2015-06-26 15:42       ` Jan Kiszka
@ 2015-06-26 16:11         ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 16:11 UTC (permalink / raw)
  To: Jan Kiszka, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, pbonzini, alex.bennee

On 26/06/2015 17:42, Jan Kiszka wrote:
> On 2015-06-26 17:36, Frederic Konrad wrote:
>> On 26/06/2015 16:56, Jan Kiszka wrote:
>>> On 2015-06-26 16:47, fred.konrad@greensocs.com wrote:
>>>> [patch description and diff hunks snipped; quoted in full above]
>>>> diff --git a/vl.c b/vl.c
>>>> index 69ad90c..2983d44 100644
>>>> --- a/vl.c
>>>> +++ b/vl.c
>>>> @@ -1698,10 +1698,16 @@ void qemu_devices_reset(void)
>>>>    {
>>>>        QEMUResetEntry *re, *nre;
>>>>    +    /*
>>>> +     * Some device's reset needs to grab the global_mutex. So just
>>>> +     * release it here.
>>> Is that a property newly introduced by the patch, or how does this
>>> happen? In turn, are all reset handlers now fine to be called outside of
>>> the BQL? This looks suspicious, but it's been quite a while since I last
>>> stared at this.
>>>
>>> Jan
>> Hi Jan,
>>
>> Sorry about that, it's a dirty hack :).
>> Some reset handlers probably load stuff into memory, hence a double lock.
>> It will probably disappear with:
>>
>> http://thread.gmane.org/gmane.comp.emulators.qemu/345258
> Hmm, skeptical, at least as long as most devices work under BQL.
>
> Do you have some backtraces from lockups?
>
> Jan
>
I can try to reproduce it, but this hack was introduced a long time ago;
maybe it has already been fixed?

Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock fred.konrad
  2015-06-26 14:56   ` Paolo Bonzini
@ 2015-06-26 16:20   ` Paolo Bonzini
  2015-07-07 12:22   ` Alex Bennée
  2 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 16:20 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> @@ -273,8 +274,9 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
>      ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
>      for(;;) {
>          tb = *ptb1;
> -        if (!tb)
> -            goto not_found;
> +        if (!tb) {
> +            return tb;
> +        }

You are dereferencing tb outside the lock. You need a
smp_read_barrier_depends() here, and a smp_wmb() at the beginning of
tb_link_page.

Paolo

>          if (tb->pc == pc &&
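A minimal sketch of the barrier pairing described above (illustrative only,
not from the series; smp_wmb() and smp_read_barrier_depends() are QEMU's
existing primitives from qemu/atomic.h):

    /* Writer side, at the beginning of tb_link_page(): make the TB's fields
     * visible before the TB becomes reachable through tb_phys_hash. */
    smp_wmb();
    /* ... existing insertion into tcg_ctx.tb_ctx.tb_phys_hash ... */

    /* Reader side, in the lock-free lookup loop: order the pointer load
     * against the later dereference of tb->pc and friends. */
    tb = *ptb1;
    smp_read_barrier_depends();   /* pairs with the smp_wmb() above */
    if (tb && tb->pc == pc) {
        /* ... */
    }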

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 14/18] add a callback when tb_invalidate is called.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 14/18] add a callback when tb_invalidate is called fred.konrad
@ 2015-06-26 16:20   ` Paolo Bonzini
  2015-06-26 16:40     ` Frederic Konrad
  2015-07-07 15:32   ` Alex Bennée
  1 sibling, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 16:20 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> Instead of doing the jump cache invalidation directly in tb_invalidate, delay it
> until after the exit so we don't have another CPU trying to execute the code being
> invalidated.
> 
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  translate-all.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/translate-all.c b/translate-all.c
> index ade2269..468648d 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -61,6 +61,7 @@
>  #include "translate-all.h"
>  #include "qemu/bitmap.h"
>  #include "qemu/timer.h"
> +#include "sysemu/cpus.h"
>  
>  //#define DEBUG_TB_INVALIDATE
>  //#define DEBUG_FLUSH
> @@ -966,14 +967,58 @@ static inline void tb_reset_jump(TranslationBlock *tb, int n)
>      tb_set_jmp_target(tb, n, (uintptr_t)(tb->tc_ptr + tb->tb_next_offset[n]));
>  }
>  
> +struct CPUDiscardTBParams {
> +    CPUState *cpu;
> +    TranslationBlock *tb;
> +};
> +
> +static void cpu_discard_tb_from_jmp_cache(void *opaque)
> +{
> +    unsigned int h;
> +    struct CPUDiscardTBParams *params = opaque;
> +
> +    h = tb_jmp_cache_hash_func(params->tb->pc);
> +    if (params->cpu->tb_jmp_cache[h] == params->tb) {
> +        params->cpu->tb_jmp_cache[h] = NULL;
> +    }

It is a bit more tricky, but I think you can avoid async_run_on_cpu by
doing this:

1) introduce a QemuSeqLock in TBContext, e.g. invalidate_seqlock.

2) wrap this "if" with seqlock_write_lock/unlock

3) in cpu-exec.c do this:

     /* we add the TB in the virtual pc hash table */
+    idx = seqlock_read_begin(&tcg_ctx.tb_ctx.invalidate_seqlock);
     cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
+    if (seqlock_read_retry(&tcg_ctx.tb_ctx.invalidate_seqlock)) {
+        /* Another CPU invalidated a tb in the meanwhile.  We do not
+         * know if it's this one, but play it safe and avoid caching
+         * it.
+         */
+        cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = NULL;
+    }

> +    /* suppress this TB from the two jump lists */
> +    tb_jmp_remove(tb, 0);
> +    tb_jmp_remove(tb, 1);

If you do the above synchronously, this part doesn't need to be deferred
either.

Then, immediately after the two tb_jmp_remove calls you can also check
whether "(tb->jmp_first & 3) == 2": if so, the expensive expensive
async_run_safe_work_on_cpu can be skipped.

Paolo

> +#endif /* MTTCG */
>  
>      tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
>      tb_unlock();
> 
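A sketch of the write side of step 2, assuming TBContext grows the
QemuSeqLock field invalidate_seqlock named above (the helper function below
is hypothetical):

    static void tb_jmp_cache_clear(TranslationBlock *tb)
    {
        unsigned int h = tb_jmp_cache_hash_func(tb->pc);
        CPUState *cpu;

        seqlock_write_lock(&tcg_ctx.tb_ctx.invalidate_seqlock);
        CPU_FOREACH(cpu) {
            if (cpu->tb_jmp_cache[h] == tb) {
                cpu->tb_jmp_cache[h] = NULL;
            }
        }
        seqlock_write_unlock(&tcg_ctx.tb_ctx.invalidate_seqlock);
    }

A reader that races with this simply sees seqlock_read_retry() fire (step 3)
and drops its cached entry, so no cross-CPU async work is needed for the
jump cache itself.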

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb fred.konrad
@ 2015-06-26 16:21   ` Paolo Bonzini
  2015-06-26 16:38     ` Frederic Konrad
  2015-07-07 16:17   ` Alex Bennée
  1 sibling, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 16:21 UTC (permalink / raw)
  To: fred.konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
> @@ -1147,7 +1147,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      tb = tb_alloc(pc);
>      if (!tb) {
>          /* flush must be done */
> -        tb_flush(env);
> +        tb_flush_safe(env);

Should you just call cpu_loop_exit() here, instead of redoing the
tb_alloc etc.?

Paolo

>          /* cannot fail at this point */
>          tb = tb_alloc(pc);
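A sketch of that alternative, assuming tb_flush_safe() queues the flush as
safe work and cpu_loop_exit() restarts the execution loop:

    tb = tb_alloc(pc);
    if (!tb) {
        /* Buffer full: schedule a safe flush and leave the CPU loop; we
         * come back through tb_gen_code() once the flush has run. */
        tb_flush_safe(env);
        cpu_loop_exit(cpu);
    }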

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 13/18] cpu: introduce async_run_safe_work_on_cpu.
  2015-06-26 16:09     ` Frederic Konrad
@ 2015-06-26 16:23       ` Paolo Bonzini
  2015-06-26 16:36         ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 16:23 UTC (permalink / raw)
  To: Frederic Konrad, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee



On 26/06/2015 18:09, Frederic Konrad wrote:
>>>
>>>   +void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void
>>> *data),
>>> +                                void *data)
>>> +{
>> Do you need a mutex to protect this data structure?  I would use one
>> even if not strictly necessary, to avoid introducing new BQL-protected
>> structures.
> 
> For the moment it's called by tb_invalidate and tb_flush_safe; the second
> lacks a tb_lock/unlock, which should be added. I don't need another mutex
> except if this is used elsewhere?

In any case, the locking policy should be documented.

At which point you realize that protecting a CPU's
queued_safe_work_{first,next} fields with the tb_lock is a bit weird. :)
 I would add a mutex inside CPUState, and then later we could also use
it for regular run_on_cpu/async_run_on_cpu.

Paolo
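A sketch of such a CPUState mutex (the work_mutex name and the *_last
pointer are hypothetical; queued_safe_work_first is the field from the
patch):

    void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
                                    void *data)
    {
        struct qemu_work_item *wi = g_malloc0(sizeof(*wi));

        wi->func = func;
        wi->data = data;

        qemu_mutex_lock(&cpu->work_mutex);     /* hypothetical per-CPU mutex */
        if (cpu->queued_safe_work_first == NULL) {
            cpu->queued_safe_work_first = wi;
        } else {
            cpu->queued_safe_work_last->next = wi;
        }
        cpu->queued_safe_work_last = wi;
        wi->next = NULL;
        qemu_mutex_unlock(&cpu->work_mutex);

        qemu_cpu_kick(cpu);
    }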

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 16:08         ` Peter Maydell
@ 2015-06-26 16:30           ` Frederic Konrad
  2015-06-26 16:31             ` Paolo Bonzini
  2015-07-06 14:29             ` Mark Burton
  2015-06-26 16:54           ` Paolo Bonzini
  2015-07-08 15:35           ` Frederic Konrad
  2 siblings, 2 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 16:30 UTC (permalink / raw)
  To: Peter Maydell, Paolo Bonzini
  Cc: mttcg, Alexander Graf, Alexander Spyridakis, Mark Burton,
	QEMU Developers, Alistair Francis, Guillaume Delbergue,
	Alex Bennée

On 26/06/2015 18:08, Peter Maydell wrote:
> On 26 June 2015 at 17:01, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 26/06/2015 17:54, Frederic Konrad wrote:
>>> So what happens is:
>>> An ARM instruction wants to clear the TLB of all VCPUs, e.g. the IS
>>> version of TLBIALL.
>>> The VCPU which executes TLBIALL_IS can't flush the TLBs of the other VCPUs.
>>> It will just ask all VCPU threads to exit and to do a tlb_flush, hence the
>>> async_work.
>>>
>>> Maybe the big issue might be memory barrier instructions here, which I
>>> didn't check.
>> Yeah, ISTR that in some cases you have to wait for other CPUs to
>> invalidate the TLB before proceeding.  Maybe it's only when you have a
>> dmb instruction, but it's probably simpler for QEMU to always do it
>> synchronously.
> Yeah, the ARM architectural requirement here is that the TLB
> operation is complete after a DSB instruction executes. (True for
> any TLB op, not just the all-CPUs ones). NB that we also call
> tlb_flush() from target-arm/ code for some things like "we just
> updated a system register"; some of those have "must take effect
> immediately" semantics.
>
> In any case, for generic code we have to also consider the
> semantics of non-ARM guests...
>
> thanks
> -- PMM
Yes, this is not the case as I implemented it.

The rest of the TB will be executed before the tlb_flush work really happens.
The old version did this; it was slow and was a mess (if two VCPUs want to
tlb_flush at the same time plus another tlb_flush_page, it becomes tricky).

I think it's not really terrible if the other VCPU executes some stuff before
doing the tlb_flush. So the solution would just be to cut the TranslationBlock
after instructions which require a tlb_flush?

Thanks,
Fred
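As a concrete illustration of cutting the TB, a sketch only, assuming the
existing DISAS_UPDATE machinery in target-arm/translate.c; the flush helper
name is hypothetical:

    static void gen_tlbiall_is(DisasContext *s)
    {
        gen_helper_tlb_flush_all(cpu_env);   /* hypothetical flush helper */
        /* End the TB after this instruction so the queued flush work runs
         * before any further guest code from this vCPU. */
        s->is_jmp = DISAS_UPDATE;
    }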

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 16:30           ` Frederic Konrad
@ 2015-06-26 16:31             ` Paolo Bonzini
  2015-06-26 16:35               ` Frederic Konrad
  2015-07-06 14:29             ` Mark Burton
  1 sibling, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 16:31 UTC (permalink / raw)
  To: Frederic Konrad, Peter Maydell
  Cc: mttcg, Alexander Graf, Alexander Spyridakis, Mark Burton,
	QEMU Developers, Alistair Francis, Guillaume Delbergue,
	Alex Bennée



On 26/06/2015 18:30, Frederic Konrad wrote:
> Yes, this is not the case as I implemented it.
> 
> The rest of the TB will be executed before the tlb_flush work really
> happens. The old version did this; it was slow and was a mess (if two
> VCPUs want to tlb_flush at the same time plus another
> tlb_flush_page, it becomes tricky).

Have you tried implementing the solution based on cpu->halted?

> I think it's not really terrible if the other VCPU executes some
> stuff before doing the tlb_flush. So the solution would just be to
> cut the TranslationBlock after instructions which require a
> tlb_flush?

Yes, this is required too.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 16:31             ` Paolo Bonzini
@ 2015-06-26 16:35               ` Frederic Konrad
  2015-06-26 16:39                 ` Paolo Bonzini
  0 siblings, 1 reply; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 16:35 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Maydell
  Cc: mttcg, Alexander Graf, Alexander Spyridakis, Mark Burton,
	QEMU Developers, Alistair Francis, Guillaume Delbergue,
	Alex Bennée

On 26/06/2015 18:31, Paolo Bonzini wrote:
>
> On 26/06/2015 18:30, Frederic Konrad wrote:
>> Yes, this is not the case as I implemented it.
>>
>> The rest of the TB will be executed before the tlb_flush work really
>> happens. The old version did this; it was slow and was a mess (if two
>> VCPUs want to tlb_flush at the same time plus another
>> tlb_flush_page, it becomes tricky).
> Have you tried implementing the solution based on cpu->halted?
You mean based on cpu_has_work?

Yes, it was a little painful (e.g. it required the CPU to be halted, but
maybe that's what you were suggesting?)

>> I think it's not really terrible if the other VCPU executes some
>> stuff before doing the tlb_flush. So the solution would just be to
>> cut the TranslationBlock after instructions which require a
>> tlb_flush?
> Yes, this is required too.
>
> Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 13/18] cpu: introduce async_run_safe_work_on_cpu.
  2015-06-26 16:23       ` Paolo Bonzini
@ 2015-06-26 16:36         ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 16:36 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 18:23, Paolo Bonzini wrote:
>
> On 26/06/2015 18:09, Frederic Konrad wrote:
>>>>    +void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void
>>>> *data),
>>>> +                                void *data)
>>>> +{
>>> Do you need a mutex to protect this data structure?  I would use one
>>> even if not strictly necessary, to avoid introducing new BQL-protected
>>> structures.
>> For the moment it's called by tb_invalidate and tb_flush_safe the second
>> lacks a
>> tb_lock/unlock which should be added. I don't need an other mutex expect
>> if this is
>> used elsewhere?
> In any case, the locking policy should be documented.
>
> At which point you realize that protecting a CPU's
> queued_safe_work_{first,next} fields with the tb_lock is a bit weird. :)
>   I would add a mutex inside CPUState, and then later we could also use
> it for regular run_on_cpu/async_run_on_cpu.
>
> Paolo
Ok that makes sense :).

Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
  2015-06-26 16:21   ` Paolo Bonzini
@ 2015-06-26 16:38     ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 16:38 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 18:21, Paolo Bonzini wrote:
>
> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>> @@ -1147,7 +1147,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>>       tb = tb_alloc(pc);
>>       if (!tb) {
>>           /* flush must be done */
>> -        tb_flush(env);
>> +        tb_flush_safe(env);
> Should you just call cpu_loop_exit() here, instead of redoing the
> tb_alloc etc.?
>
> Paolo
Ah yes good point!

Thanks,
Fred

>>           /* cannot fail at this point */
>>           tb = tb_alloc(pc);

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 16:35               ` Frederic Konrad
@ 2015-06-26 16:39                 ` Paolo Bonzini
  0 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 16:39 UTC (permalink / raw)
  To: Frederic Konrad, Peter Maydell
  Cc: mttcg, Alexander Graf, Alexander Spyridakis, Mark Burton,
	QEMU Developers, Alistair Francis, Guillaume Delbergue,
	Alex Bennée



On 26/06/2015 18:35, Frederic Konrad wrote:
>>>
>> Have you tried implementing the solution based on cpu->halted?
> You mean based on cpu_has_work?
> 
> Yes it was a little painfull (eg: it required cpu to be halted.. but
> maybe it's what you were suggesting?)

Yes. :)  No problem, we can discuss it at KVM Forum.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 14/18] add a callback when tb_invalidate is called.
  2015-06-26 16:20   ` Paolo Bonzini
@ 2015-06-26 16:40     ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-06-26 16:40 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, mttcg
  Cc: peter.maydell, a.spyridakis, mark.burton, agraf,
	alistair.francis, guillaume.delbergue, alex.bennee

On 26/06/2015 18:20, Paolo Bonzini wrote:
>
> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> Instead of doing the jump cache invalidation directly in tb_invalidate, delay it
>> until after the exit so we don't have another CPU trying to execute the code being
>> invalidated.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>   translate-all.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 59 insertions(+), 2 deletions(-)
>>
>> diff --git a/translate-all.c b/translate-all.c
>> index ade2269..468648d 100644
>> --- a/translate-all.c
>> +++ b/translate-all.c
>> @@ -61,6 +61,7 @@
>>   #include "translate-all.h"
>>   #include "qemu/bitmap.h"
>>   #include "qemu/timer.h"
>> +#include "sysemu/cpus.h"
>>   
>>   //#define DEBUG_TB_INVALIDATE
>>   //#define DEBUG_FLUSH
>> @@ -966,14 +967,58 @@ static inline void tb_reset_jump(TranslationBlock *tb, int n)
>>       tb_set_jmp_target(tb, n, (uintptr_t)(tb->tc_ptr + tb->tb_next_offset[n]));
>>   }
>>   
>> +struct CPUDiscardTBParams {
>> +    CPUState *cpu;
>> +    TranslationBlock *tb;
>> +};
>> +
>> +static void cpu_discard_tb_from_jmp_cache(void *opaque)
>> +{
>> +    unsigned int h;
>> +    struct CPUDiscardTBParams *params = opaque;
>> +
>> +    h = tb_jmp_cache_hash_func(params->tb->pc);
>> +    if (params->cpu->tb_jmp_cache[h] == params->tb) {
>> +        params->cpu->tb_jmp_cache[h] = NULL;
>> +    }
> It is a bit more tricky, but I think you can avoid async_run_on_cpu by
> doing this:
>
> 1) introduce a QemuSeqLock in TBContext, e.g. invalidate_seqlock.
>
> 2) wrap this "if" with seqlock_write_lock/unlock
>
> 3) in cpu-exec.c do this:
>
>       /* we add the TB in the virtual pc hash table */
> +    idx = seqlock_read_begin(&tcg_ctx.tb_ctx.invalidate_seqlock);
>       cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
> +    if (seqlock_read_retry(&tcg_ctx.tb_ctx.invalidate_seqlock)) {
> +        /* Another CPU invalidated a tb in the meanwhile.  We do not
> +         * know if it's this one, but play it safe and avoid caching
> +         * it.
> +         */
> +        cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = NULL;
> +    }
>
>> +    /* suppress this TB from the two jump lists */
>> +    tb_jmp_remove(tb, 0);
>> +    tb_jmp_remove(tb, 1);
> If you do the above synchronously, this part doesn't need to be deferred
> either.
>
> Then, immediately after the two tb_jmp_remove calls you can also check
> whether "(tb->jmp_first & 3) == 2": if so, the expensive expensive
> async_run_safe_work_on_cpu can be skipped.
>
> Paolo
Ok seems tricky :) I'll take a look at this.

Fred

>> +#endif /* MTTCG */
>>   
>>       tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
>>       tb_unlock();
>>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 16:08         ` Peter Maydell
  2015-06-26 16:30           ` Frederic Konrad
@ 2015-06-26 16:54           ` Paolo Bonzini
  2015-07-08 15:35           ` Frederic Konrad
  2 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2015-06-26 16:54 UTC (permalink / raw)
  To: Peter Maydell
  Cc: mttcg, Alexander Spyridakis, Mark Burton, Alexander Graf,
	QEMU Developers, Guillaume Delbergue, Alistair Francis,
	Alex Bennée, Frederic Konrad



On 26/06/2015 18:08, Peter Maydell wrote:
>> > Yeah, ISTR that in some cases you have to wait for other CPUs to
>> > invalidate the TLB before proceeding.  Maybe it's only when you have a
>> > dmb instruction, but it's probably simpler for QEMU to always do it
>> > synchronously.
> Yeah, the ARM architectural requirement here is that the TLB
> operation is complete after a DSB instruction executes. (True for
> any TLB op, not just the all-CPUs ones). NB that we also call
> tlb_flush() from target-arm/ code for some things like "we just
> updated a system register"; some of those have "must take effect
> immediately" semantics.
> 
> In any case, for generic code we have to also consider the
> semantics of non-ARM guests...

I think it would be okay to make this an ARM-specific thing.  In most
other architectures that I know of, TLB shootdowns are done in software
through IPIs.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 16:30           ` Frederic Konrad
  2015-06-26 16:31             ` Paolo Bonzini
@ 2015-07-06 14:29             ` Mark Burton
  2015-07-07 16:12               ` Alex Bennée
  1 sibling, 1 reply; 82+ messages in thread
From: Mark Burton @ 2015-07-06 14:29 UTC (permalink / raw)
  To: Paolo Bonzini, Alexander Spyridakis, Alex Bennée
  Cc: mttcg, Peter Maydell, QEMU Developers, KONRAD Frédéric

Paolo, Alex, Alexander,

Fred and I talked after the call about ways of avoiding the ‘stop the world’ (or rather ‘sync the world’) approach - we already discussed this on this thread.
One thing that would be very helpful would be some test cases around this. We could then use Fred’s code to check some of the possible solutions out….

I’m not sure if there is wiggle room in Peter’s statement below. Can the TLB operation be completed on one core, but not ‘seen’ by other cores until they hit an exit…..?

Cheers

Mark.


> On 26 Jun 2015, at 18:30, Frederic Konrad <fred.konrad@greensocs.com> wrote:
> 
> On 26/06/2015 18:08, Peter Maydell wrote:
>> On 26 June 2015 at 17:01, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 26/06/2015 17:54, Frederic Konrad wrote:
>>>> So what happens is:
>>>> An ARM instruction wants to clear the TLB of all VCPUs, e.g. the IS
>>>> version of TLBIALL.
>>>> The VCPU which executes TLBIALL_IS can't flush the TLBs of the other VCPUs.
>>>> It will just ask all VCPU threads to exit and to do a tlb_flush, hence the
>>>> async_work.
>>>> 
>>>> Maybe the big issue might be memory barrier instructions here, which I
>>>> didn't check.
>>> Yeah, ISTR that in some cases you have to wait for other CPUs to
>>> invalidate the TLB before proceeding.  Maybe it's only when you have a
>>> dmb instruction, but it's probably simpler for QEMU to always do it
>>> synchronously.
>> Yeah, the ARM architectural requirement here is that the TLB
>> operation is complete after a DSB instruction executes. (True for
>> any TLB op, not just the all-CPUs ones). NB that we also call
>> tlb_flush() from target-arm/ code for some things like "we just
>> updated a system register"; some of those have "must take effect
>> immediately" semantics.
>> 
>> In any case, for generic code we have to also consider the
>> semantics of non-ARM guests...
>> 
>> thanks
>> -- PMM
> Yes, this is not the case as I implemented it.
> 
> The rest of the TB will be executed before the tlb_flush work really happens.
> The old version did this; it was slow and was a mess (if two VCPUs want to tlb_flush
> at the same time plus another tlb_flush_page, it becomes tricky).
> 
> I think it's not really terrible if the other VCPU executes some stuff before doing the
> tlb_flush. So the solution would just be to cut the TranslationBlock after instructions
> which require a tlb_flush?
> 
> Thanks,
> Fred
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 01/18] cpu: make cpu_thread_is_idle public.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 01/18] cpu: make cpu_thread_is_idle public fred.konrad
@ 2015-07-07  9:47   ` Alex Bennée
  2015-07-07 11:43     ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Alex Bennée @ 2015-07-07  9:47 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>

Why are we making this visible? Looking at the tree I can't see it being
used outside cpus.c. I see the function is modified later for async
work. Is this something we are planning to use later?

>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpus.c            |  2 +-
>  include/qom/cpu.h | 11 +++++++++++
>  2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/cpus.c b/cpus.c
> index 4f0e54d..2d62a35 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -74,7 +74,7 @@ bool cpu_is_stopped(CPUState *cpu)
>      return cpu->stopped || !runstate_is_running();
>  }
>  
> -static bool cpu_thread_is_idle(CPUState *cpu)
> +bool cpu_thread_is_idle(CPUState *cpu)
>  {
>      if (cpu->stop || cpu->queued_work_first) {
>          return false;
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 39f0f19..af3c9e4 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -514,6 +514,17 @@ void qemu_cpu_kick(CPUState *cpu);
>  bool cpu_is_stopped(CPUState *cpu);
>  
>  /**
> + * cpu_thread_is_idle:
> + * @cpu: The CPU to check.
> + *
> + * Checks whether the CPU thread is idle.
> + *
> + * Returns: %true if the thread is idle;
> + * %false otherwise.
> + */
> +bool cpu_thread_is_idle(CPUState *cpu);
> +
> +/**
>   * run_on_cpu:
>   * @cpu: The vCPU to run on.
>   * @func: The function to be executed.

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex fred.konrad
@ 2015-07-07 10:15   ` Alex Bennée
  2015-07-07 10:22     ` Paolo Bonzini
  2015-07-07 11:46     ` Frederic Konrad
  0 siblings, 2 replies; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 10:15 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> spinlock is only used in two cases:
>   * cpu-exec.c: to protect TranslationBlock
>   * mem_helper.c: for lock helper in target-i386 (which seems broken).
>
> It's a pthread_mutex_t in user mode, so it's better to use QemuMutex directly
> in this case.
> It also allows reusing the tb_lock mutex of TBContext in case of multithreaded
> TCG.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpu-exec.c               | 15 +++++++++++----
>  include/exec/exec-all.h  |  4 ++--
>  linux-user/main.c        |  6 +++---
>  target-i386/mem_helper.c | 16 +++++++++++++---
>  tcg/i386/tcg-target.c    |  8 ++++++++
>  5 files changed, 37 insertions(+), 12 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 2ffeb6e..d6336d9 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -362,7 +362,9 @@ int cpu_exec(CPUArchState *env)
>      SyncClocks sc;
>  
>      /* This must be volatile so it is not trashed by longjmp() */
> +#if defined(CONFIG_USER_ONLY)
>      volatile bool have_tb_lock = false;
> +#endif
>  
>      if (cpu->halted) {
>          if (!cpu_has_work(cpu)) {
> @@ -480,8 +482,10 @@ int cpu_exec(CPUArchState *env)
>                      cpu->exception_index = EXCP_INTERRUPT;
>                      cpu_loop_exit(cpu);
>                  }
> -                spin_lock(&tcg_ctx.tb_ctx.tb_lock);
> +#if defined(CONFIG_USER_ONLY)
> +                qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
>                  have_tb_lock = true;
> +#endif

Why are the locking rules different for CONFIG_USER versus system
emulation? Looking at the final tree:

>                  tb = tb_find_fast(env);

this eventually ends up doing a tb_lock on the find_slow path, which IIRC
is when we might end up doing the actual code generation.

>                  /* Note: we do it here to avoid a gcc bug on Mac OS X when
>                     doing it in tb_find_slow */
> @@ -503,9 +507,10 @@ int cpu_exec(CPUArchState *env)
>                      tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
>                                  next_tb & TB_EXIT_MASK, tb);
>                  }
> +#if defined(CONFIG_USER_ONLY)
>                  have_tb_lock = false;
> -                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
> -
> +                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +#endif
>                  /* cpu_interrupt might be called while translating the
>                     TB, but before it is linked into a potentially
>                     infinite loop and becomes env->current_tb. Avoid
> @@ -572,10 +577,12 @@ int cpu_exec(CPUArchState *env)
>  #ifdef TARGET_I386
>              x86_cpu = X86_CPU(cpu);
>  #endif
> +#if defined(CONFIG_USER_ONLY)
>              if (have_tb_lock) {
> -                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>                  have_tb_lock = false;
>              }
> +#endif
>          }
>      } /* for(;;) */
>  
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 2573e8c..44f3336 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -176,7 +176,7 @@ struct TranslationBlock {
>      struct TranslationBlock *jmp_first;
>  };
>  
> -#include "exec/spinlock.h"
> +#include "qemu/thread.h"
>  
>  typedef struct TBContext TBContext;
>  
> @@ -186,7 +186,7 @@ struct TBContext {
>      TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
>      int nb_tbs;
>      /* any access to the tbs or the page table must use this lock */
> -    spinlock_t tb_lock;
> +    QemuMutex tb_lock;
>  
>      /* statistics */
>      int tb_flush_count;
> diff --git a/linux-user/main.c b/linux-user/main.c
> index c855bcc..bce3a98 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -107,7 +107,7 @@ static int pending_cpus;
>  /* Make sure everything is in a consistent state for calling fork().  */
>  void fork_start(void)
>  {
> -    pthread_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
> +    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
>      pthread_mutex_lock(&exclusive_lock);
>      mmap_fork_start();
>  }
> @@ -129,11 +129,11 @@ void fork_end(int child)
>          pthread_mutex_init(&cpu_list_mutex, NULL);
>          pthread_cond_init(&exclusive_cond, NULL);
>          pthread_cond_init(&exclusive_resume, NULL);
> -        pthread_mutex_init(&tcg_ctx.tb_ctx.tb_lock, NULL);
> +        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>          gdbserver_fork((CPUArchState *)thread_cpu->env_ptr);
>      } else {
>          pthread_mutex_unlock(&exclusive_lock);
> -        pthread_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>      }
>  }
>  
> diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
> index 1aec8a5..7106cc3 100644
> --- a/target-i386/mem_helper.c
> +++ b/target-i386/mem_helper.c
> @@ -23,17 +23,27 @@
>  
>  /* broken thread support */
>  
> -static spinlock_t global_cpu_lock = SPIN_LOCK_UNLOCKED;
> +#if defined(CONFIG_USER_ONLY)
> +QemuMutex global_cpu_lock;
>  
>  void helper_lock(void)
>  {
> -    spin_lock(&global_cpu_lock);
> +    qemu_mutex_lock(&global_cpu_lock);
>  }
>  
>  void helper_unlock(void)
>  {
> -    spin_unlock(&global_cpu_lock);
> +    qemu_mutex_unlock(&global_cpu_lock);
>  }
> +#else
> +void helper_lock(void)
> +{
> +}
> +
> +void helper_unlock(void)
> +{
> +}
> +#endif
>  
>  void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
>  {
> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
> index ff4d9cf..0d7c99c 100644
> --- a/tcg/i386/tcg-target.c
> +++ b/tcg/i386/tcg-target.c
> @@ -24,6 +24,10 @@
>  
>  #include "tcg-be-ldst.h"
>  
> +#if defined(CONFIG_USER_ONLY)
> +extern QemuMutex global_cpu_lock;
> +#endif
> +
>  #ifndef NDEBUG
>  static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>  #if TCG_TARGET_REG_BITS == 64
> @@ -2342,6 +2346,10 @@ static void tcg_target_init(TCGContext *s)
>      tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
>  
>      tcg_add_target_add_op_defs(x86_op_defs);
> +
> +#if defined(CONFIG_USER_ONLY)
> +    qemu_mutex_init(global_cpu_lock);
> +#endif
>  }
>  
>  typedef struct {

I wonder if it would be better splitting the patches:

 - Convert tb spinlocks to use tb_lock
 - i386: convert lock helpers to QemuMutex

before the final

  - Remove spinlocks

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex.
  2015-07-07 10:15   ` Alex Bennée
@ 2015-07-07 10:22     ` Paolo Bonzini
  2015-07-07 11:48       ` Frederic Konrad
  2015-07-07 11:46     ` Frederic Konrad
  1 sibling, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-07-07 10:22 UTC (permalink / raw)
  To: Alex Bennée, fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, qemu-devel,
	alistair.francis, agraf, guillaume.delbergue



On 07/07/2015 12:15, Alex Bennée wrote:
> Why are the locking rules different for CONFIG_USER versus system
> emulation? Looking at the final tree:
> 
>> >                  tb = tb_find_fast(env);
> this eventually ends up doing a tb_lock on the find_slow path, which IIRC
> is when we might end up doing the actual code generation.
> 

Up to this point, system emulation is using the BQL for everything.  I
guess things change later.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 01/18] cpu: make cpu_thread_is_idle public.
  2015-07-07  9:47   ` Alex Bennée
@ 2015-07-07 11:43     ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 11:43 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis

On 07/07/2015 11:47, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
> Why are we making this visible? Looking at the tree I can't see it being
> used outside the cpus.c. I see the function is modified later for async
> work. Is this something we are planing to use later?

Thanks for spotting this.
It is probably something we used before I introduced async_safe_work.

Fred
>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>   cpus.c            |  2 +-
>>   include/qom/cpu.h | 11 +++++++++++
>>   2 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/cpus.c b/cpus.c
>> index 4f0e54d..2d62a35 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -74,7 +74,7 @@ bool cpu_is_stopped(CPUState *cpu)
>>       return cpu->stopped || !runstate_is_running();
>>   }
>>   
>> -static bool cpu_thread_is_idle(CPUState *cpu)
>> +bool cpu_thread_is_idle(CPUState *cpu)
>>   {
>>       if (cpu->stop || cpu->queued_work_first) {
>>           return false;
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 39f0f19..af3c9e4 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -514,6 +514,17 @@ void qemu_cpu_kick(CPUState *cpu);
>>   bool cpu_is_stopped(CPUState *cpu);
>>   
>>   /**
>> + * cpu_thread_is_idle:
>> + * @cpu: The CPU to check.
>> + *
>> + * Checks whether the CPU thread is idle.
>> + *
>> + * Returns: %true if the thread is idle;
>> + * %false otherwise.
>> + */
>> +bool cpu_thread_is_idle(CPUState *cpu);
>> +
>> +/**
>>    * run_on_cpu:
>>    * @cpu: The vCPU to run on.
>>    * @func: The function to be executed.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex.
  2015-07-07 10:15   ` Alex Bennée
  2015-07-07 10:22     ` Paolo Bonzini
@ 2015-07-07 11:46     ` Frederic Konrad
  1 sibling, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 11:46 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis

On 07/07/2015 12:15, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> spinlock is only used in two cases:
>>    * cpu-exec.c: to protect TranslationBlock
>>    * mem_helper.c: for lock helper in target-i386 (which seems broken).
>>
>> It's a pthread_mutex_t in user mode, so it's better to use QemuMutex directly
>> in this case.
>> It also allows reusing the tb_lock mutex of TBContext in case of multithreaded
>> TCG.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>   cpu-exec.c               | 15 +++++++++++----
>>   include/exec/exec-all.h  |  4 ++--
>>   linux-user/main.c        |  6 +++---
>>   target-i386/mem_helper.c | 16 +++++++++++++---
>>   tcg/i386/tcg-target.c    |  8 ++++++++
>>   5 files changed, 37 insertions(+), 12 deletions(-)
>>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index 2ffeb6e..d6336d9 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -362,7 +362,9 @@ int cpu_exec(CPUArchState *env)
>>       SyncClocks sc;
>>   
>>       /* This must be volatile so it is not trashed by longjmp() */
>> +#if defined(CONFIG_USER_ONLY)
>>       volatile bool have_tb_lock = false;
>> +#endif
>>   
>>       if (cpu->halted) {
>>           if (!cpu_has_work(cpu)) {
>> @@ -480,8 +482,10 @@ int cpu_exec(CPUArchState *env)
>>                       cpu->exception_index = EXCP_INTERRUPT;
>>                       cpu_loop_exit(cpu);
>>                   }
>> -                spin_lock(&tcg_ctx.tb_ctx.tb_lock);
>> +#if defined(CONFIG_USER_ONLY)
>> +                qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
>>                   have_tb_lock = true;
>> +#endif
> Why are the locking rules different for CONFIG_USER versus system
> emulation? Looking at the final tree:
>
>>                   tb = tb_find_fast(env);
> this eventually ends up doing a tb_lock on the find_slow path, which IIRC
> is when we might end up doing the actual code generation.

I didn't look at the user code. But yes, we should probably end up with the
same thing for both user-mode and system-mode code. That's what Peter was
suggesting before, but I haven't had time to look at this yet.

>
>>                   /* Note: we do it here to avoid a gcc bug on Mac OS X when
>>                      doing it in tb_find_slow */
>> @@ -503,9 +507,10 @@ int cpu_exec(CPUArchState *env)
>>                       tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
>>                                   next_tb & TB_EXIT_MASK, tb);
>>                   }
>> +#if defined(CONFIG_USER_ONLY)
>>                   have_tb_lock = false;
>> -                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
>> -
>> +                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>> +#endif
>>                   /* cpu_interrupt might be called while translating the
>>                      TB, but before it is linked into a potentially
>>                      infinite loop and becomes env->current_tb. Avoid
>> @@ -572,10 +577,12 @@ int cpu_exec(CPUArchState *env)
>>   #ifdef TARGET_I386
>>               x86_cpu = X86_CPU(cpu);
>>   #endif
>> +#if defined(CONFIG_USER_ONLY)
>>               if (have_tb_lock) {
>> -                spin_unlock(&tcg_ctx.tb_ctx.tb_lock);
>> +                qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>>                   have_tb_lock = false;
>>               }
>> +#endif
>>           }
>>       } /* for(;;) */
>>   
>> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
>> index 2573e8c..44f3336 100644
>> --- a/include/exec/exec-all.h
>> +++ b/include/exec/exec-all.h
>> @@ -176,7 +176,7 @@ struct TranslationBlock {
>>       struct TranslationBlock *jmp_first;
>>   };
>>   
>> -#include "exec/spinlock.h"
>> +#include "qemu/thread.h"
>>   
>>   typedef struct TBContext TBContext;
>>   
>> @@ -186,7 +186,7 @@ struct TBContext {
>>       TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
>>       int nb_tbs;
>>       /* any access to the tbs or the page table must use this lock */
>> -    spinlock_t tb_lock;
>> +    QemuMutex tb_lock;
>>   
>>       /* statistics */
>>       int tb_flush_count;
>> diff --git a/linux-user/main.c b/linux-user/main.c
>> index c855bcc..bce3a98 100644
>> --- a/linux-user/main.c
>> +++ b/linux-user/main.c
>> @@ -107,7 +107,7 @@ static int pending_cpus;
>>   /* Make sure everything is in a consistent state for calling fork().  */
>>   void fork_start(void)
>>   {
>> -    pthread_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
>> +    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
>>       pthread_mutex_lock(&exclusive_lock);
>>       mmap_fork_start();
>>   }
>> @@ -129,11 +129,11 @@ void fork_end(int child)
>>           pthread_mutex_init(&cpu_list_mutex, NULL);
>>           pthread_cond_init(&exclusive_cond, NULL);
>>           pthread_cond_init(&exclusive_resume, NULL);
>> -        pthread_mutex_init(&tcg_ctx.tb_ctx.tb_lock, NULL);
>> +        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>>           gdbserver_fork((CPUArchState *)thread_cpu->env_ptr);
>>       } else {
>>           pthread_mutex_unlock(&exclusive_lock);
>> -        pthread_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>> +        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>>       }
>>   }
>>   
>> diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
>> index 1aec8a5..7106cc3 100644
>> --- a/target-i386/mem_helper.c
>> +++ b/target-i386/mem_helper.c
>> @@ -23,17 +23,27 @@
>>   
>>   /* broken thread support */
>>   
>> -static spinlock_t global_cpu_lock = SPIN_LOCK_UNLOCKED;
>> +#if defined(CONFIG_USER_ONLY)
>> +QemuMutex global_cpu_lock;
>>   
>>   void helper_lock(void)
>>   {
>> -    spin_lock(&global_cpu_lock);
>> +    qemu_mutex_lock(&global_cpu_lock);
>>   }
>>   
>>   void helper_unlock(void)
>>   {
>> -    spin_unlock(&global_cpu_lock);
>> +    qemu_mutex_unlock(&global_cpu_lock);
>>   }
>> +#else
>> +void helper_lock(void)
>> +{
>> +}
>> +
>> +void helper_unlock(void)
>> +{
>> +}
>> +#endif
>>   
>>   void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
>>   {
>> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
>> index ff4d9cf..0d7c99c 100644
>> --- a/tcg/i386/tcg-target.c
>> +++ b/tcg/i386/tcg-target.c
>> @@ -24,6 +24,10 @@
>>   
>>   #include "tcg-be-ldst.h"
>>   
>> +#if defined(CONFIG_USER_ONLY)
>> +extern QemuMutex global_cpu_lock;
>> +#endif
>> +
>>   #ifndef NDEBUG
>>   static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>>   #if TCG_TARGET_REG_BITS == 64
>> @@ -2342,6 +2346,10 @@ static void tcg_target_init(TCGContext *s)
>>       tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
>>   
>>       tcg_add_target_add_op_defs(x86_op_defs);
>> +
>> +#if defined(CONFIG_USER_ONLY)
>> +    qemu_mutex_init(global_cpu_lock);
>> +#endif
>>   }
>>   
>>   typedef struct {
> I wonder if it would be better splitting the patches:
>
>   - Convert tb spinlocks to use tb_lock
>   - i386: convert lock helpers to QemuMutex
>
> before the final
>
>    - Remove spinlocks

Yes that makes sense I think.

Fred
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex.
  2015-07-07 10:22     ` Paolo Bonzini
@ 2015-07-07 11:48       ` Frederic Konrad
  2015-07-07 12:34         ` Paolo Bonzini
  0 siblings, 1 reply; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 11:48 UTC (permalink / raw)
  To: Paolo Bonzini, Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, qemu-devel,
	alistair.francis, agraf, guillaume.delbergue

On 07/07/2015 12:22, Paolo Bonzini wrote:
>
> On 07/07/2015 12:15, Alex Bennée wrote:
>> Why are the locking rules different for CONFIG_USER versus system
>> emulation? Looking at the final tree:
>>
>>>>                   tb = tb_find_fast(env);
>> this eventually ends up doing a tb_lock on the find_slow path, which IIRC
>> is when we might end up doing the actual code generation.
>>
> Up to this point, system emulation is using the BQL for everything.  I
> guess things change later.
>
> Paolo
Actually we use tb_lock to protect all the tb-related structures such as
TBContext etc. Is it better to use the global lock for this?

Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock fred.konrad
  2015-06-26 14:56   ` Paolo Bonzini
  2015-06-26 16:20   ` Paolo Bonzini
@ 2015-07-07 12:22   ` Alex Bennée
  2015-07-07 13:16     ` Frederic Konrad
  2 siblings, 1 reply; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 12:22 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> This protects TBContext with tb_lock to make tb_* thread safe.
>
> We can still have issues with tb_flush in case of multithreaded TCG:
>   another CPU can be executing code during a flush.
>
> This can be fixed later by making all other TCG threads exit before calling
> tb_flush().
>
> tb_find_slow is separated into tb_find_slow and tb_find_physical, as the
> whole of tb_find_slow doesn't require holding the tb lock.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>

So my comments from earlier about the different locking between
CONFIG_USER and system emulation still stand. Ultimately we need a good
reason (or an abstraction) before sprinkling #ifdefs in the code if only
for ease of reading.

>
> Changes:
> V1 -> V2:
>   * Drop a tb_lock arround tb_find_fast in cpu-exec.c.
> ---
>  cpu-exec.c             |  60 ++++++++++++++--------
>  target-arm/translate.c |   5 ++
>  tcg/tcg.h              |   7 +++
>  translate-all.c        | 137 ++++++++++++++++++++++++++++++++++++++-----------
>  4 files changed, 158 insertions(+), 51 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index d6336d9..5d9b518 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -130,6 +130,8 @@ static void init_delay_params(SyncClocks *sc, const CPUState *cpu)
>  void cpu_loop_exit(CPUState *cpu)
>  {
>      cpu->current_tb = NULL;
> +    /* Release those mutex before long jump so other thread can work. */
> +    tb_lock_reset();
>      siglongjmp(cpu->jmp_env, 1);
>  }
>  
> @@ -142,6 +144,8 @@ void cpu_resume_from_signal(CPUState *cpu, void *puc)
>      /* XXX: restore cpu registers saved in host registers */
>  
>      cpu->exception_index = -1;
> +    /* Release those mutex before long jump so other thread can work. */
> +    tb_lock_reset();
>      siglongjmp(cpu->jmp_env, 1);
>  }
>  
> @@ -253,12 +257,9 @@ static void cpu_exec_nocache(CPUArchState *env, int max_cycles,
>      tb_free(tb);
>  }
>  
> -static TranslationBlock *tb_find_slow(CPUArchState *env,
> -                                      target_ulong pc,
> -                                      target_ulong cs_base,
> -                                      uint64_t flags)
> +static TranslationBlock *tb_find_physical(CPUArchState *env, target_ulong pc,
> +                                          target_ulong cs_base, uint64_t flags)
>  {

As Paolo has already mentioned, functions that expect to be called with
locks held should carry comments saying so.

> -    CPUState *cpu = ENV_GET_CPU(env);
>      TranslationBlock *tb, **ptb1;
>      unsigned int h;
>      tb_page_addr_t phys_pc, phys_page1;
> @@ -273,8 +274,9 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
>      ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
>      for(;;) {
>          tb = *ptb1;
> -        if (!tb)
> -            goto not_found;
> +        if (!tb) {
> +            return tb;
> +        }
>          if (tb->pc == pc &&
>              tb->page_addr[0] == phys_page1 &&
>              tb->cs_base == cs_base &&
> @@ -282,28 +284,43 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
>              /* check next page if needed */
>              if (tb->page_addr[1] != -1) {
>                  tb_page_addr_t phys_page2;
> -
>                  virt_page2 = (pc & TARGET_PAGE_MASK) +
>                      TARGET_PAGE_SIZE;
>                  phys_page2 = get_page_addr_code(env, virt_page2);
> -                if (tb->page_addr[1] == phys_page2)
> -                    goto found;
> +                if (tb->page_addr[1] == phys_page2) {
> +                    return tb;
> +                }
>              } else {
> -                goto found;
> +                return tb;
>              }
>          }
>          ptb1 = &tb->phys_hash_next;
>      }
> - not_found:
> -   /* if no translated code available, then translate it now */
> -    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
> -
> - found:
> -    /* Move the last found TB to the head of the list */
> -    if (likely(*ptb1)) {
> -        *ptb1 = tb->phys_hash_next;
> -        tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
> -        tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
> +    return tb;
> +}
> +
> +static TranslationBlock *tb_find_slow(CPUArchState *env, target_ulong pc,
> +                                      target_ulong cs_base, uint64_t flags)
> +{
> +    /*
> +     * First try to get the tb; if we don't find it, we need to take the
> +     * lock and compile it.
> +     */
> +    CPUState *cpu = ENV_GET_CPU(env);
> +    TranslationBlock *tb;
> +
> +    tb = tb_find_physical(env, pc, cs_base, flags);
> +    if (!tb) {
> +        tb_lock();
> +        /*
> +         * Retry to get the TB in case another CPU just translated it, to
> +         * avoid having duplicated TBs in the pool.
> +         */
> +        tb = tb_find_physical(env, pc, cs_base, flags);
> +        if (!tb) {
> +            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
> +        }
> +        tb_unlock();
>      }
>      /* we add the TB in the virtual pc hash table */
>      cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
> @@ -326,6 +343,7 @@ static inline TranslationBlock *tb_find_fast(CPUArchState *env)
>                   tb->flags != flags)) {
>          tb = tb_find_slow(env, pc, cs_base, flags);
>      }
> +
>      return tb;
>  }
>  
> diff --git a/target-arm/translate.c b/target-arm/translate.c
> index 971b6db..47345aa 100644
> --- a/target-arm/translate.c
> +++ b/target-arm/translate.c
> @@ -11162,6 +11162,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
>  
>      dc->tb = tb;
>  
> +    tb_lock();
> +
>      dc->is_jmp = DISAS_NEXT;
>      dc->pc = pc_start;
>      dc->singlestep_enabled = cs->singlestep_enabled;
> @@ -11499,6 +11501,7 @@ done_generating:
>          tb->size = dc->pc - pc_start;
>          tb->icount = num_insns;
>      }
> +    tb_unlock();
>  }
>  
>  void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
> @@ -11567,6 +11570,7 @@ void arm_cpu_dump_state(CPUState *cs, FILE *f, fprintf_function cpu_fprintf,
>  
>  void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
>  {
> +    tb_lock();
>      if (is_a64(env)) {
>          env->pc = tcg_ctx.gen_opc_pc[pc_pos];
>          env->condexec_bits = 0;
> @@ -11574,4 +11578,5 @@ void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
>          env->regs[15] = tcg_ctx.gen_opc_pc[pc_pos];
>          env->condexec_bits = gen_opc_condexec_bits[pc_pos];
>      }
> +    tb_unlock();
>  }
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 41e4869..032fe10 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -592,17 +592,24 @@ void *tcg_malloc_internal(TCGContext *s, int size);
>  void tcg_pool_reset(TCGContext *s);
>  void tcg_pool_delete(TCGContext *s);
>  
> +void tb_lock(void);
> +void tb_unlock(void);
> +void tb_lock_reset(void);
> +
>  static inline void *tcg_malloc(int size)
>  {
>      TCGContext *s = &tcg_ctx;
>      uint8_t *ptr, *ptr_end;
> +    tb_lock();
>      size = (size + sizeof(long) - 1) & ~(sizeof(long) - 1);
>      ptr = s->pool_cur;
>      ptr_end = ptr + size;
>      if (unlikely(ptr_end > s->pool_end)) {
> +        tb_unlock();
>          return tcg_malloc_internal(&tcg_ctx, size);

If the purpose of the lock is to protect the global tcg_ctx then we
shouldn't be unlocking before calling the _internal function which also
messes with the context. 
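Something like this would keep the context protected end to end (a sketch
only, leaving tcg_malloc_internal() unchanged):

    static inline void *tcg_malloc(int size)
    {
        TCGContext *s = &tcg_ctx;
        uint8_t *ptr, *ptr_end;
        void *ret;

        tb_lock();
        size = (size + sizeof(long) - 1) & ~(sizeof(long) - 1);
        ptr = s->pool_cur;
        ptr_end = ptr + size;
        if (unlikely(ptr_end > s->pool_end)) {
            ret = tcg_malloc_internal(&tcg_ctx, size); /* still under tb_lock */
        } else {
            s->pool_cur = ptr_end;
            ret = ptr;
        }
        tb_unlock();
        return ret;
    }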

>      } else {
>          s->pool_cur = ptr_end;
> +        tb_unlock();
>          return ptr;
>      }
>  }
> diff --git a/translate-all.c b/translate-all.c
> index b6b0e1c..c25b79b 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -127,6 +127,34 @@ static void *l1_map[V_L1_SIZE];
>  /* code generation context */
>  TCGContext tcg_ctx;
>  
> +/* translation block context */
> +__thread volatile int have_tb_lock;
> +
> +void tb_lock(void)
> +{
> +    if (!have_tb_lock) {
> +        qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
> +    }
> +    have_tb_lock++;
> +}
> +
> +void tb_unlock(void)
> +{
> +    assert(have_tb_lock > 0);
> +    have_tb_lock--;
> +    if (!have_tb_lock) {
> +        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +    }
> +}
> +
> +void tb_lock_reset(void)
> +{
> +    if (have_tb_lock) {
> +        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +    }
> +    have_tb_lock = 0;
> +}
> +
>  static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>                           tb_page_addr_t phys_page2);
>  static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
> @@ -215,6 +243,7 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>  #ifdef CONFIG_PROFILER
>      ti = profile_getclock();
>  #endif
> +    tb_lock();
>      tcg_func_start(s);
>  
>      gen_intermediate_code_pc(env, tb);
> @@ -228,8 +257,10 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>  
>      /* find opc index corresponding to search_pc */
>      tc_ptr = (uintptr_t)tb->tc_ptr;
> -    if (searched_pc < tc_ptr)
> +    if (searched_pc < tc_ptr) {
> +        tb_unlock();
>          return -1;
> +    }
>  
>      s->tb_next_offset = tb->tb_next_offset;
>  #ifdef USE_DIRECT_JUMP
> @@ -241,8 +272,10 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>  #endif
>      j = tcg_gen_code_search_pc(s, (tcg_insn_unit *)tc_ptr,
>                                 searched_pc - tc_ptr);
> -    if (j < 0)
> +    if (j < 0) {
> +        tb_unlock();
>          return -1;
> +    }
>      /* now find start of instruction before */
>      while (s->gen_opc_instr_start[j] == 0) {
>          j--;
> @@ -255,6 +288,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>      s->restore_time += profile_getclock() - ti;
>      s->restore_count++;
>  #endif
> +
> +    tb_unlock();
>      return 0;
>  }
>  
> @@ -672,6 +707,7 @@ static inline void code_gen_alloc(size_t tb_size)
>              CODE_GEN_AVG_BLOCK_SIZE;
>      tcg_ctx.tb_ctx.tbs =
>              g_malloc(tcg_ctx.code_gen_max_blocks * sizeof(TranslationBlock));
> +    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>  }
>  
>  /* Must be called before using the QEMU cpus. 'tb_size' is the size
> @@ -696,16 +732,22 @@ bool tcg_enabled(void)
>      return tcg_ctx.code_gen_buffer != NULL;
>  }
>  
> -/* Allocate a new translation block. Flush the translation buffer if
> -   too many translation blocks or too much generated code. */
> +/*
> + * Allocate a new translation block. Flush the translation buffer if
> + * too many translation blocks or too much generated code.
> + * tb_alloc is not thread safe but tb_gen_code is protected by a mutex so this
> + * function is called only by one thread.

maybe: "..is not thread safe but tb_gen_code is protected by tb_lock so
only one thread calls it at a time."?

> + */
>  static TranslationBlock *tb_alloc(target_ulong pc)
>  {
> -    TranslationBlock *tb;
> +    TranslationBlock *tb = NULL;
>  
>      if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
>          (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
>           tcg_ctx.code_gen_buffer_max_size) {
> -        return NULL;
> +        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
> +        tb->pc = pc;
> +        tb->cflags = 0;
>      }
>      tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
>      tb->pc = pc;

That looks weird.

If the condition hits we take &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++],
then fall through and take &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++]
again, bumping nb_tbs twice?

Also that renders the setting of tb = NULL pointless, as it will always be
from the array?
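
Presumably the body should just have stayed as it is upstream, i.e.
something like (sketch):

    static TranslationBlock *tb_alloc(target_ulong pc)
    {
        TranslationBlock *tb;

        if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
            (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
             tcg_ctx.code_gen_buffer_max_size) {
            return NULL;
        }
        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
        tb->pc = pc;
        tb->cflags = 0;
        return tb;
    }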

> @@ -718,11 +760,16 @@ void tb_free(TranslationBlock *tb)
>      /* In practice this is mostly used for single use temporary TB
>         Ignore the hard cases and just back up if this TB happens to
>         be the last one generated.  */
> +
> +    tb_lock();
> +
>      if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
>              tb == &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
>          tcg_ctx.code_gen_ptr = tb->tc_ptr;
>          tcg_ctx.tb_ctx.nb_tbs--;
>      }
> +
> +    tb_unlock();
>  }
>  
>  static inline void invalidate_page_bitmap(PageDesc *p)
> @@ -773,6 +820,8 @@ void tb_flush(CPUArchState *env1)
>  {
>      CPUState *cpu = ENV_GET_CPU(env1);
>  
> +    tb_lock();
> +
>  #if defined(DEBUG_FLUSH)
>      printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
>             (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
> @@ -797,6 +846,8 @@ void tb_flush(CPUArchState *env1)
>      /* XXX: flush processor icache at this point if cache flush is
>         expensive */
>      tcg_ctx.tb_ctx.tb_flush_count++;
> +
> +    tb_unlock();
>  }
>  
>  #ifdef DEBUG_TB_CHECK
> @@ -806,6 +857,8 @@ static void tb_invalidate_check(target_ulong address)
>      TranslationBlock *tb;
>      int i;
>  
> +    tb_lock();
> +
>      address &= TARGET_PAGE_MASK;
>      for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
>          for (tb = tb_ctx.tb_phys_hash[i]; tb != NULL; tb = tb->phys_hash_next) {
> @@ -817,6 +870,8 @@ static void tb_invalidate_check(target_ulong address)
>              }
>          }
>      }
> +
> +    tb_unlock();
>  }
>  
>  /* verify that all the pages have correct rights for code */
> @@ -825,6 +880,8 @@ static void tb_page_check(void)
>      TranslationBlock *tb;
>      int i, flags1, flags2;
>  
> +    tb_lock();
> +
>      for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
>          for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
>                  tb = tb->phys_hash_next) {
> @@ -836,6 +893,8 @@ static void tb_page_check(void)
>              }
>          }
>      }
> +
> +    tb_unlock();
>  }
>  
>  #endif
> @@ -916,6 +975,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      tb_page_addr_t phys_pc;
>      TranslationBlock *tb1, *tb2;
>  
> +    tb_lock();
> +
>      /* remove the TB from the hash list */
>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
>      h = tb_phys_hash_func(phys_pc);
> @@ -963,6 +1024,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
>  
>      tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
> +    tb_unlock();
>  }
>  
>  static void build_page_bitmap(PageDesc *p)
> @@ -1004,6 +1066,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      target_ulong virt_page2;
>      int code_gen_size;
>  
> +    tb_lock();
> +
>      phys_pc = get_page_addr_code(env, pc);
>      if (use_icount) {
>          cflags |= CF_USE_ICOUNT;
> @@ -1032,6 +1096,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>          phys_page2 = get_page_addr_code(env, virt_page2);
>      }
>      tb_link_page(tb, phys_pc, phys_page2);
> +
> +    tb_unlock();
>      return tb;
>  }
>  
> @@ -1330,13 +1396,15 @@ static inline void tb_alloc_page(TranslationBlock *tb,
>  }
>  
>  /* add a new TB and link it to the physical page tables. phys_page2 is
> -   (-1) to indicate that only one page contains the TB. */
> + * (-1) to indicate that only one page contains the TB. */
>  static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>                           tb_page_addr_t phys_page2)
>  {
>      unsigned int h;
>      TranslationBlock **ptb;
>  
> +    tb_lock();
> +
>      /* Grab the mmap lock to stop another thread invalidating this TB
>         before we are done.  */
>      mmap_lock();
> @@ -1370,6 +1438,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>      tb_page_check();
>  #endif
>      mmap_unlock();
> +
> +    tb_unlock();
>  }
>  
>  /* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
> @@ -1378,31 +1448,34 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
>  {
>      int m_min, m_max, m;
>      uintptr_t v;
> -    TranslationBlock *tb;
> -
> -    if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
> -        return NULL;
> -    }
> -    if (tc_ptr < (uintptr_t)tcg_ctx.code_gen_buffer ||
> -        tc_ptr >= (uintptr_t)tcg_ctx.code_gen_ptr) {
> -        return NULL;
> -    }
> -    /* binary search (cf Knuth) */
> -    m_min = 0;
> -    m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
> -    while (m_min <= m_max) {
> -        m = (m_min + m_max) >> 1;
> -        tb = &tcg_ctx.tb_ctx.tbs[m];
> -        v = (uintptr_t)tb->tc_ptr;
> -        if (v == tc_ptr) {
> -            return tb;
> -        } else if (tc_ptr < v) {
> -            m_max = m - 1;
> -        } else {
> -            m_min = m + 1;
> +    TranslationBlock *tb = NULL;
> +
> +    tb_lock();
> +
> +    if ((tcg_ctx.tb_ctx.nb_tbs > 0)
> +    && (tc_ptr >= (uintptr_t)tcg_ctx.code_gen_buffer &&
> +        tc_ptr < (uintptr_t)tcg_ctx.code_gen_ptr)) {
> +        /* binary search (cf Knuth) */
> +        m_min = 0;
> +        m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
> +        while (m_min <= m_max) {
> +            m = (m_min + m_max) >> 1;
> +            tb = &tcg_ctx.tb_ctx.tbs[m];
> +            v = (uintptr_t)tb->tc_ptr;
> +            if (v == tc_ptr) {
> +                tb_unlock();
> +                return tb;
> +            } else if (tc_ptr < v) {
> +                m_max = m - 1;
> +            } else {
> +                m_min = m + 1;
> +            }
>          }
> +        tb = &tcg_ctx.tb_ctx.tbs[m_max];
>      }
> -    return &tcg_ctx.tb_ctx.tbs[m_max];
> +
> +    tb_unlock();
> +    return tb;
>  }
>  
>  #if !defined(CONFIG_USER_ONLY)
> @@ -1564,6 +1637,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>      int direct_jmp_count, direct_jmp2_count, cross_page;
>      TranslationBlock *tb;
>  
> +    tb_lock();
> +
>      target_code_size = 0;
>      max_target_code_size = 0;
>      cross_page = 0;
> @@ -1619,6 +1694,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>              tcg_ctx.tb_ctx.tb_phys_invalidate_count);
>      cpu_fprintf(f, "TLB flush count     %d\n", tlb_flush_count);
>      tcg_dump_info(f, cpu_fprintf);
> +
> +    tb_unlock();
>  }
>  
>  void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 06/18] tcg: remove tcg_halt_cond global variable.
  2015-06-26 15:41     ` Frederic Konrad
@ 2015-07-07 12:27       ` Alex Bennée
  2015-07-07 13:17         ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 12:27 UTC (permalink / raw)
  To: Frederic Konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, Paolo Bonzini, alistair.francis


Frederic Konrad <fred.konrad@greensocs.com> writes:

> On 26/06/2015 17:02, Paolo Bonzini wrote:
>>
>> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>>
>>> This removes tcg_halt_cond global variable.
>>> We need one QemuCond per virtual cpu for multithread TCG.
>>>
>>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
<snip>
>>> @@ -1068,7 +1065,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>>>                   qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
>>>               }
>>>           }
>>> -        qemu_tcg_wait_io_event();
>>> +        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
>> Does this work (for non-multithreaded TCG) if tcg_thread_fn is waiting
>> on the "wrong" condition variable?  For example if all CPUs are idle and
>> the second CPU wakes up, qemu_tcg_wait_io_event won't be kicked out of
>> the wait.
>>
>> I think you need to have a CPUThread struct like this:
>>
>>     struct CPUThread {
>>         QemuThread thread;
>>         QemuCond halt_cond;
>>     };
>>
>> and in CPUState have a CPUThread * field instead of the thread and
>> halt_cond fields.
>>
>> Then single-threaded TCG can point all CPUStates to the same instance of
>> the struct, while multi-threaded TCG can point each CPUState to a
>> different struct.
>>
>> Paolo
>
> Hmm, probably not, though we didn't pay attention to keeping the
> non-MTTCG case working (which is probably not good).
<snip>

You may want to consider pushing a branch up to a github mirror and
enabling travis-ci on the repo. That way you'll at least know how broken
the rest of the tree is.

I appreciate we are still at the RFC stage here but it will probably pay
off in the long run to try and avoid breaking the rest of the tree ;-)

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution
  2015-06-26 15:36     ` Frederic Konrad
  2015-06-26 15:42       ` Jan Kiszka
@ 2015-07-07 12:33       ` Alex Bennée
  2015-07-07 13:18         ` Frederic Konrad
  1 sibling, 1 reply; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 12:33 UTC (permalink / raw)
  To: Frederic Konrad
  Cc: mttcg, peter.maydell, Jan Kiszka, mark.burton, agraf, qemu-devel,
	guillaume.delbergue, a.spyridakis, pbonzini, alistair.francis


Frederic Konrad <fred.konrad@greensocs.com> writes:

> On 26/06/2015 16:56, Jan Kiszka wrote:
>> On 2015-06-26 16:47, fred.konrad@greensocs.com wrote:
>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> This finally allows TCG to benefit from the iothread introduction: Drop
>>> the global mutex while running pure TCG CPU code. Reacquire the lock
>>> when entering MMIO or PIO emulation, or when leaving the TCG loop.
<snip>
>>> diff --git a/translate-all.c b/translate-all.c
>>> index c25b79b..ade2269 100644
>>> --- a/translate-all.c
>>> +++ b/translate-all.c
>>> @@ -1222,6 +1222,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>>>   #endif
>>>   #ifdef TARGET_HAS_PRECISE_SMC
>>>       if (current_tb_modified) {
>>> +        qemu_mutex_unlock_iothread();
>>>           /* we generate a block containing just the instruction
>>>              modifying the memory. It will ensure that it cannot modify
>>>              itself */
>>> @@ -1326,6 +1327,7 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
>>>       p->first_tb = NULL;
>>>   #ifdef TARGET_HAS_PRECISE_SMC
>>>       if (current_tb_modified) {
>>> +        qemu_mutex_unlock_iothread();
>>>           /* we generate a block containing just the instruction
>>>              modifying the memory. It will ensure that it cannot modify
>>>              itself */
>>> diff --git a/vl.c b/vl.c
>>> index 69ad90c..2983d44 100644
>>> --- a/vl.c
>>> +++ b/vl.c
>>> @@ -1698,10 +1698,16 @@ void qemu_devices_reset(void)
>>>   {
>>>       QEMUResetEntry *re, *nre;
>>>   
>>> +    /*
>>> +     * Some device's reset needs to grab the global_mutex. So just release it
>>> +     * here.
>> That's a property newly introduced by the patch, or how does this
>> happen? In turn, are all reset handlers now fine to be called outside of
>> BQL? This looks suspicious, but it's been quite a while since I last
>> stared at this.
>>
>> Jan
> Hi Jan,
>
> Sorry for that, it's a dirty hack :).
> Some reset handlers probably load stuff into memory, hence a double lock.
> It will probably disappear with:
>
> http://thread.gmane.org/gmane.comp.emulators.qemu/345258

So I guess this patch will shrink a lot once we re-base on top of Paolo's
patches (which should be real soon now).

>
> Thanks,
> Fred
>
>>> +     */
>>> +    qemu_mutex_unlock_iothread();
>>>       /* reset all devices */
>>>       QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
>>>           re->func(re->opaque);
>>>       }
>>> +    qemu_mutex_lock_iothread();
>>>   }
>>>   
>>>   void qemu_system_reset(bool report)
>>>

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex.
  2015-07-07 11:48       ` Frederic Konrad
@ 2015-07-07 12:34         ` Paolo Bonzini
  2015-07-07 13:06           ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2015-07-07 12:34 UTC (permalink / raw)
  To: Frederic Konrad, Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, qemu-devel,
	alistair.francis, agraf, guillaume.delbergue



On 07/07/2015 13:48, Frederic Konrad wrote:
>>> this eventually ends up doing a tb_lock on the find_slow path, which IIRC
>>> is when we might end up doing the actual code generation.
>>
>> Up to this point, system emulation is using the BQL for everything.  I
>> guess things change later.
>
> Actually we use tb_lock to protect all the tb-related structures such as
> TBContext etc. Is it better to use the global lock for this?

No, on the contrary.  But using the BQL is the status as of patch 2, so
it's okay to keep the #ifdefs.  Thanks for confirming that it changes
later in the patch.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 08/18] cpu: remove exit_request global.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 08/18] cpu: remove exit_request global fred.konrad
  2015-06-26 15:03   ` Paolo Bonzini
@ 2015-07-07 13:04   ` Alex Bennée
  2015-07-07 13:25     ` Frederic Konrad
  1 sibling, 1 reply; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 13:04 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> This removes exit_request global and adds a variable in CPUState for this.
> Only the flag for the first cpu is used for the moment as we are still with one
> TCG thread.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpu-exec.c | 15 ---------------
>  cpus.c     | 17 ++++++++++++++---
>  2 files changed, 14 insertions(+), 18 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 5d9b518..0644383 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -364,8 +364,6 @@ static void cpu_handle_debug_exception(CPUArchState *env)
>  
>  /* main execution loop */
>  
> -volatile sig_atomic_t exit_request;
> -
>  int cpu_exec(CPUArchState *env)
>  {
>      CPUState *cpu = ENV_GET_CPU(env);
> @@ -394,20 +392,8 @@ int cpu_exec(CPUArchState *env)
>  
>      current_cpu = cpu;
>  
> -    /* As long as current_cpu is null, up to the assignment just above,
> -     * requests by other threads to exit the execution loop are expected to
> -     * be issued using the exit_request global. We must make sure that our
> -     * evaluation of the global value is performed past the current_cpu
> -     * value transition point, which requires a memory barrier as well as
> -     * an instruction scheduling constraint on modern architectures.  */
> -    smp_mb();
> -
>      rcu_read_lock();
>  
> -    if (unlikely(exit_request)) {
> -        cpu->exit_request = 1;
> -    }
> -
>      cc->cpu_exec_enter(cpu);
>  
>      /* Calculate difference between guest clock and host clock.
> @@ -496,7 +482,6 @@ int cpu_exec(CPUArchState *env)
>                      }
>                  }
>                  if (unlikely(cpu->exit_request)) {
> -                    cpu->exit_request = 0;
>                      cpu->exception_index = EXCP_INTERRUPT;
>                      cpu_loop_exit(cpu);
>                  }
> diff --git a/cpus.c b/cpus.c
> index 23c316c..2541c56 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -137,6 +137,8 @@ typedef struct TimersState {
>  } TimersState;
>  
>  static TimersState timers_state;
> +/* CPU associated to this thread. */
> +static __thread CPUState *tcg_thread_cpu;
>  
>  int64_t cpu_get_icount_raw(void)
>  {
> @@ -661,12 +663,18 @@ static void cpu_handle_guest_debug(CPUState *cpu)
>      cpu->stopped = true;
>  }
>  
> +/**
> + * cpu_signal
> + * Signal handler when using TCG.
> + */
>  static void cpu_signal(int sig)
>  {
>      if (current_cpu) {
>          cpu_exit(current_cpu);
>      }
> -    exit_request = 1;
> +
> +    /* FIXME: We might want to check if the cpu is running? */
> +    tcg_thread_cpu->exit_request = true;

I guess the potential problem is race conditions here? What happens if
the cpu is signalled by two different threads for two different reasons?
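
If that's a concern, one option (just a sketch, assuming the atomic_set
helper from qemu/atomic.h) is to make the store explicit, so a second kick
is simply an idempotent re-set of the flag:

    static void cpu_signal(int sig)
    {
        if (current_cpu) {
            cpu_exit(current_cpu);
        }
        /* both kickers store true; nothing is lost, just re-set */
        atomic_set(&tcg_thread_cpu->exit_request, true);
    }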

>  }
>  
>  #ifdef CONFIG_LINUX
> @@ -1031,6 +1039,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>  {
>      CPUState *cpu = arg;
>  
> +    tcg_thread_cpu = cpu;
>      qemu_tcg_init_cpu_signals();
>      qemu_thread_get_self(cpu->thread);
>  
> @@ -1393,7 +1402,8 @@ static void tcg_exec_all(void)
>      if (next_cpu == NULL) {
>          next_cpu = first_cpu;
>      }
> -    for (; next_cpu != NULL && !exit_request; next_cpu = CPU_NEXT(next_cpu)) {
> +    for (; next_cpu != NULL && !first_cpu->exit_request;
> +           next_cpu = CPU_NEXT(next_cpu)) {
>          CPUState *cpu = next_cpu;
>          CPUArchState *env = cpu->env_ptr;
>  
> @@ -1410,7 +1420,8 @@ static void tcg_exec_all(void)
>              break;
>          }
>      }
> -    exit_request = 0;
> +
> +    first_cpu->exit_request = 0;
>  }
>  
>  void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 02/18] replace spinlock by QemuMutex.
  2015-07-07 12:34         ` Paolo Bonzini
@ 2015-07-07 13:06           ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 13:06 UTC (permalink / raw)
  To: Paolo Bonzini, Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, qemu-devel,
	alistair.francis, agraf, guillaume.delbergue

On 07/07/2015 14:34, Paolo Bonzini wrote:
>
> On 07/07/2015 13:48, Frederic Konrad wrote:
>>>> this eventually ends up doing a tb_lock on the find_slow path, which IIRC
>>>> is when we might end up doing the actual code generation.
>>> Up to this point, system emulation is using the BQL for everything.  I
>>> guess things change later.
>> Actually we use tb_lock to protect all the tb-related structures such as
>> TBContext etc. Is it better to use the global lock for this?
> No, on the contrary.  But using the BQL is the status as of patch 2, so
> it's okay to keep the #ifdefs.  Thanks for confirming that it changes
> later in the patch.
>
> Paolo
In fact I changed nothing in patch 2 except abstracting out the #ifdef from
spinlock_t and using QemuMutex (pthread_mutex_t on Linux) instead of
spinlock_t. The only reason for that is to use tb_lock for both user and
system mode.

And yes, as it's the first patch, tb_lock is not used at this step except
in the user code.

Thanks,
Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 05/18] protect TBContext with tb_lock.
  2015-07-07 12:22   ` Alex Bennée
@ 2015-07-07 13:16     ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 13:16 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis

On 07/07/2015 14:22, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This protects TBContext with tb_lock to make tb_* thread safe.
>>
>> We can still have an issue with tb_flush in case of multithread TCG:
>>    Another CPU can be executing code during a flush.
>>
>> This can be fixed later by making all other TCG thread exiting before calling
>> tb_flush().
>>
>> tb_find_slow is separated into tb_find_slow and tb_find_physical as the whole
>> of tb_find_slow doesn't require locking the tb.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> So my comments from earlier about the different locking between
> CONFIG_USER and system emulation still stand. Ultimately we need a good
> reason (or an abstraction) before sprinkling #ifdefs in the code if only
> for ease of reading.

True.
>> Changes:
>> V1 -> V2:
>>    * Drop a tb_lock arround tb_find_fast in cpu-exec.c.
>> ---
>>   cpu-exec.c             |  60 ++++++++++++++--------
>>   target-arm/translate.c |   5 ++
>>   tcg/tcg.h              |   7 +++
>>   translate-all.c        | 137 ++++++++++++++++++++++++++++++++++++++-----------
>>   4 files changed, 158 insertions(+), 51 deletions(-)
>>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index d6336d9..5d9b518 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -130,6 +130,8 @@ static void init_delay_params(SyncClocks *sc, const CPUState *cpu)
>>   void cpu_loop_exit(CPUState *cpu)
>>   {
>>       cpu->current_tb = NULL;
>> +    /* Release those mutex before long jump so other thread can work. */
>> +    tb_lock_reset();
>>       siglongjmp(cpu->jmp_env, 1);
>>   }
>>   
>> @@ -142,6 +144,8 @@ void cpu_resume_from_signal(CPUState *cpu, void *puc)
>>       /* XXX: restore cpu registers saved in host registers */
>>   
>>       cpu->exception_index = -1;
>> +    /* Release those mutex before long jump so other thread can work. */
>> +    tb_lock_reset();
>>       siglongjmp(cpu->jmp_env, 1);
>>   }
>>   
>> @@ -253,12 +257,9 @@ static void cpu_exec_nocache(CPUArchState *env, int max_cycles,
>>       tb_free(tb);
>>   }
>>   
>> -static TranslationBlock *tb_find_slow(CPUArchState *env,
>> -                                      target_ulong pc,
>> -                                      target_ulong cs_base,
>> -                                      uint64_t flags)
>> +static TranslationBlock *tb_find_physical(CPUArchState *env, target_ulong pc,
>> +                                          target_ulong cs_base, uint64_t flags)
>>   {
> As Paolo has already mentioned comments on functions expecting to have
> locks held when called.

Ok, will do that.
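
Probably just a one-liner above each helper that assumes the lock,
something like (sketch):

    /* Called with tb_lock held.  */
    static TranslationBlock *tb_alloc(target_ulong pc);
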
>> -    CPUState *cpu = ENV_GET_CPU(env);
>>       TranslationBlock *tb, **ptb1;
>>       unsigned int h;
>>       tb_page_addr_t phys_pc, phys_page1;
>> @@ -273,8 +274,9 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
>>       ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
>>       for(;;) {
>>           tb = *ptb1;
>> -        if (!tb)
>> -            goto not_found;
>> +        if (!tb) {
>> +            return tb;
>> +        }
>>           if (tb->pc == pc &&
>>               tb->page_addr[0] == phys_page1 &&
>>               tb->cs_base == cs_base &&
>> @@ -282,28 +284,43 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
>>               /* check next page if needed */
>>               if (tb->page_addr[1] != -1) {
>>                   tb_page_addr_t phys_page2;
>> -
>>                   virt_page2 = (pc & TARGET_PAGE_MASK) +
>>                       TARGET_PAGE_SIZE;
>>                   phys_page2 = get_page_addr_code(env, virt_page2);
>> -                if (tb->page_addr[1] == phys_page2)
>> -                    goto found;
>> +                if (tb->page_addr[1] == phys_page2) {
>> +                    return tb;
>> +                }
>>               } else {
>> -                goto found;
>> +                return tb;
>>               }
>>           }
>>           ptb1 = &tb->phys_hash_next;
>>       }
>> - not_found:
>> -   /* if no translated code available, then translate it now */
>> -    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>> -
>> - found:
>> -    /* Move the last found TB to the head of the list */
>> -    if (likely(*ptb1)) {
>> -        *ptb1 = tb->phys_hash_next;
>> -        tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
>> -        tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
>> +    return tb;
>> +}
>> +
>> +static TranslationBlock *tb_find_slow(CPUArchState *env, target_ulong pc,
>> +                                      target_ulong cs_base, uint64_t flags)
>> +{
>> +    /*
>> +     * First try to get the tb if we don't find it we need to lock and compile
>> +     * it.
>> +     */
>> +    CPUState *cpu = ENV_GET_CPU(env);
>> +    TranslationBlock *tb;
>> +
>> +    tb = tb_find_physical(env, pc, cs_base, flags);
>> +    if (!tb) {
>> +        tb_lock();
>> +        /*
>> +         * Retry to get the TB in case a CPU just translate it to avoid having
>> +         * duplicated TB in the pool.
>> +         */
>> +        tb = tb_find_physical(env, pc, cs_base, flags);
>> +        if (!tb) {
>> +            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>> +        }
>> +        tb_unlock();
>>       }
>>       /* we add the TB in the virtual pc hash table */
>>       cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
>> @@ -326,6 +343,7 @@ static inline TranslationBlock *tb_find_fast(CPUArchState *env)
>>                    tb->flags != flags)) {
>>           tb = tb_find_slow(env, pc, cs_base, flags);
>>       }
>> +
>>       return tb;
>>   }
>>   
>> diff --git a/target-arm/translate.c b/target-arm/translate.c
>> index 971b6db..47345aa 100644
>> --- a/target-arm/translate.c
>> +++ b/target-arm/translate.c
>> @@ -11162,6 +11162,8 @@ static inline void gen_intermediate_code_internal(ARMCPU *cpu,
>>   
>>       dc->tb = tb;
>>   
>> +    tb_lock();
>> +
>>       dc->is_jmp = DISAS_NEXT;
>>       dc->pc = pc_start;
>>       dc->singlestep_enabled = cs->singlestep_enabled;
>> @@ -11499,6 +11501,7 @@ done_generating:
>>           tb->size = dc->pc - pc_start;
>>           tb->icount = num_insns;
>>       }
>> +    tb_unlock();
>>   }
>>   
>>   void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
>> @@ -11567,6 +11570,7 @@ void arm_cpu_dump_state(CPUState *cs, FILE *f, fprintf_function cpu_fprintf,
>>   
>>   void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
>>   {
>> +    tb_lock();
>>       if (is_a64(env)) {
>>           env->pc = tcg_ctx.gen_opc_pc[pc_pos];
>>           env->condexec_bits = 0;
>> @@ -11574,4 +11578,5 @@ void restore_state_to_opc(CPUARMState *env, TranslationBlock *tb, int pc_pos)
>>           env->regs[15] = tcg_ctx.gen_opc_pc[pc_pos];
>>           env->condexec_bits = gen_opc_condexec_bits[pc_pos];
>>       }
>> +    tb_unlock();
>>   }
>> diff --git a/tcg/tcg.h b/tcg/tcg.h
>> index 41e4869..032fe10 100644
>> --- a/tcg/tcg.h
>> +++ b/tcg/tcg.h
>> @@ -592,17 +592,24 @@ void *tcg_malloc_internal(TCGContext *s, int size);
>>   void tcg_pool_reset(TCGContext *s);
>>   void tcg_pool_delete(TCGContext *s);
>>   
>> +void tb_lock(void);
>> +void tb_unlock(void);
>> +void tb_lock_reset(void);
>> +
>>   static inline void *tcg_malloc(int size)
>>   {
>>       TCGContext *s = &tcg_ctx;
>>       uint8_t *ptr, *ptr_end;
>> +    tb_lock();
>>       size = (size + sizeof(long) - 1) & ~(sizeof(long) - 1);
>>       ptr = s->pool_cur;
>>       ptr_end = ptr + size;
>>       if (unlikely(ptr_end > s->pool_end)) {
>> +        tb_unlock();
>>           return tcg_malloc_internal(&tcg_ctx, size);
> If the purpose of the lock is to protect the global tcg_ctx, then we
> shouldn't be unlocking before calling the _internal function, which also
> messes with the context.
>

Good point! I missed that.

>>       } else {
>>           s->pool_cur = ptr_end;
>> +        tb_unlock();
>>           return ptr;
>>       }
>>   }
>> diff --git a/translate-all.c b/translate-all.c
>> index b6b0e1c..c25b79b 100644
>> --- a/translate-all.c
>> +++ b/translate-all.c
>> @@ -127,6 +127,34 @@ static void *l1_map[V_L1_SIZE];
>>   /* code generation context */
>>   TCGContext tcg_ctx;
>>   
>> +/* translation block context */
>> +__thread volatile int have_tb_lock;
>> +
>> +void tb_lock(void)
>> +{
>> +    if (!have_tb_lock) {
>> +        qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
>> +    }
>> +    have_tb_lock++;
>> +}
>> +
>> +void tb_unlock(void)
>> +{
>> +    assert(have_tb_lock > 0);
>> +    have_tb_lock--;
>> +    if (!have_tb_lock) {
>> +        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>> +    }
>> +}
>> +
>> +void tb_lock_reset(void)
>> +{
>> +    if (have_tb_lock) {
>> +        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
>> +    }
>> +    have_tb_lock = 0;
>> +}
>> +
>>   static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>>                            tb_page_addr_t phys_page2);
>>   static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
>> @@ -215,6 +243,7 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>>   #ifdef CONFIG_PROFILER
>>       ti = profile_getclock();
>>   #endif
>> +    tb_lock();
>>       tcg_func_start(s);
>>   
>>       gen_intermediate_code_pc(env, tb);
>> @@ -228,8 +257,10 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>>   
>>       /* find opc index corresponding to search_pc */
>>       tc_ptr = (uintptr_t)tb->tc_ptr;
>> -    if (searched_pc < tc_ptr)
>> +    if (searched_pc < tc_ptr) {
>> +        tb_unlock();
>>           return -1;
>> +    }
>>   
>>       s->tb_next_offset = tb->tb_next_offset;
>>   #ifdef USE_DIRECT_JUMP
>> @@ -241,8 +272,10 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>>   #endif
>>       j = tcg_gen_code_search_pc(s, (tcg_insn_unit *)tc_ptr,
>>                                  searched_pc - tc_ptr);
>> -    if (j < 0)
>> +    if (j < 0) {
>> +        tb_unlock();
>>           return -1;
>> +    }
>>       /* now find start of instruction before */
>>       while (s->gen_opc_instr_start[j] == 0) {
>>           j--;
>> @@ -255,6 +288,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>>       s->restore_time += profile_getclock() - ti;
>>       s->restore_count++;
>>   #endif
>> +
>> +    tb_unlock();
>>       return 0;
>>   }
>>   
>> @@ -672,6 +707,7 @@ static inline void code_gen_alloc(size_t tb_size)
>>               CODE_GEN_AVG_BLOCK_SIZE;
>>       tcg_ctx.tb_ctx.tbs =
>>               g_malloc(tcg_ctx.code_gen_max_blocks * sizeof(TranslationBlock));
>> +    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>>   }
>>   
>>   /* Must be called before using the QEMU cpus. 'tb_size' is the size
>> @@ -696,16 +732,22 @@ bool tcg_enabled(void)
>>       return tcg_ctx.code_gen_buffer != NULL;
>>   }
>>   
>> -/* Allocate a new translation block. Flush the translation buffer if
>> -   too many translation blocks or too much generated code. */
>> +/*
>> + * Allocate a new translation block. Flush the translation buffer if
>> + * too many translation blocks or too much generated code.
>> + * tb_alloc is not thread safe but tb_gen_code is protected by a mutex so this
>> + * function is called only by one thread.
> maybe: "..is not thread safe but tb_gen_code is protected by tb_lock so
> only one thread calls it at a time."?

Yes.
>> + */
>>   static TranslationBlock *tb_alloc(target_ulong pc)
>>   {
>> -    TranslationBlock *tb;
>> +    TranslationBlock *tb = NULL;
>>   
>>       if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
>>           (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
>>            tcg_ctx.code_gen_buffer_max_size) {
>> -        return NULL;
>> +        tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
>> +        tb->pc = pc;
>> +        tb->cflags = 0;
>>       }
>>       tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
>>       tb->pc = pc;
> That looks weird.
>
> If the condition hits we take &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++],
> then fall through and take &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++]
> again, bumping nb_tbs twice?
>
> Also that renders the setting of tb = NULL pointless, as it will always be
> from the array?

Oops yes sorry, this is definitely a mistake, those changes should have
disappeared.

Thanks,
Fred
>
>> @@ -718,11 +760,16 @@ void tb_free(TranslationBlock *tb)
>>       /* In practice this is mostly used for single use temporary TB
>>          Ignore the hard cases and just back up if this TB happens to
>>          be the last one generated.  */
>> +
>> +    tb_lock();
>> +
>>       if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
>>               tb == &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
>>           tcg_ctx.code_gen_ptr = tb->tc_ptr;
>>           tcg_ctx.tb_ctx.nb_tbs--;
>>       }
>> +
>> +    tb_unlock();
>>   }
>>   
>>   static inline void invalidate_page_bitmap(PageDesc *p)
>> @@ -773,6 +820,8 @@ void tb_flush(CPUArchState *env1)
>>   {
>>       CPUState *cpu = ENV_GET_CPU(env1);
>>   
>> +    tb_lock();
>> +
>>   #if defined(DEBUG_FLUSH)
>>       printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
>>              (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
>> @@ -797,6 +846,8 @@ void tb_flush(CPUArchState *env1)
>>       /* XXX: flush processor icache at this point if cache flush is
>>          expensive */
>>       tcg_ctx.tb_ctx.tb_flush_count++;
>> +
>> +    tb_unlock();
>>   }
>>   
>>   #ifdef DEBUG_TB_CHECK
>> @@ -806,6 +857,8 @@ static void tb_invalidate_check(target_ulong address)
>>       TranslationBlock *tb;
>>       int i;
>>   
>> +    tb_lock();
>> +
>>       address &= TARGET_PAGE_MASK;
>>       for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
>>           for (tb = tb_ctx.tb_phys_hash[i]; tb != NULL; tb = tb->phys_hash_next) {
>> @@ -817,6 +870,8 @@ static void tb_invalidate_check(target_ulong address)
>>               }
>>           }
>>       }
>> +
>> +    tb_unlock();
>>   }
>>   
>>   /* verify that all the pages have correct rights for code */
>> @@ -825,6 +880,8 @@ static void tb_page_check(void)
>>       TranslationBlock *tb;
>>       int i, flags1, flags2;
>>   
>> +    tb_lock();
>> +
>>       for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
>>           for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
>>                   tb = tb->phys_hash_next) {
>> @@ -836,6 +893,8 @@ static void tb_page_check(void)
>>               }
>>           }
>>       }
>> +
>> +    tb_unlock();
>>   }
>>   
>>   #endif
>> @@ -916,6 +975,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>>       tb_page_addr_t phys_pc;
>>       TranslationBlock *tb1, *tb2;
>>   
>> +    tb_lock();
>> +
>>       /* remove the TB from the hash list */
>>       phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
>>       h = tb_phys_hash_func(phys_pc);
>> @@ -963,6 +1024,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>>       tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
>>   
>>       tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
>> +    tb_unlock();
>>   }
>>   
>>   static void build_page_bitmap(PageDesc *p)
>> @@ -1004,6 +1066,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>>       target_ulong virt_page2;
>>       int code_gen_size;
>>   
>> +    tb_lock();
>> +
>>       phys_pc = get_page_addr_code(env, pc);
>>       if (use_icount) {
>>           cflags |= CF_USE_ICOUNT;
>> @@ -1032,6 +1096,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>>           phys_page2 = get_page_addr_code(env, virt_page2);
>>       }
>>       tb_link_page(tb, phys_pc, phys_page2);
>> +
>> +    tb_unlock();
>>       return tb;
>>   }
>>   
>> @@ -1330,13 +1396,15 @@ static inline void tb_alloc_page(TranslationBlock *tb,
>>   }
>>   
>>   /* add a new TB and link it to the physical page tables. phys_page2 is
>> -   (-1) to indicate that only one page contains the TB. */
>> + * (-1) to indicate that only one page contains the TB. */
>>   static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>>                            tb_page_addr_t phys_page2)
>>   {
>>       unsigned int h;
>>       TranslationBlock **ptb;
>>   
>> +    tb_lock();
>> +
>>       /* Grab the mmap lock to stop another thread invalidating this TB
>>          before we are done.  */
>>       mmap_lock();
>> @@ -1370,6 +1438,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>>       tb_page_check();
>>   #endif
>>       mmap_unlock();
>> +
>> +    tb_unlock();
>>   }
>>   
>>   /* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
>> @@ -1378,31 +1448,34 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
>>   {
>>       int m_min, m_max, m;
>>       uintptr_t v;
>> -    TranslationBlock *tb;
>> -
>> -    if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
>> -        return NULL;
>> -    }
>> -    if (tc_ptr < (uintptr_t)tcg_ctx.code_gen_buffer ||
>> -        tc_ptr >= (uintptr_t)tcg_ctx.code_gen_ptr) {
>> -        return NULL;
>> -    }
>> -    /* binary search (cf Knuth) */
>> -    m_min = 0;
>> -    m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
>> -    while (m_min <= m_max) {
>> -        m = (m_min + m_max) >> 1;
>> -        tb = &tcg_ctx.tb_ctx.tbs[m];
>> -        v = (uintptr_t)tb->tc_ptr;
>> -        if (v == tc_ptr) {
>> -            return tb;
>> -        } else if (tc_ptr < v) {
>> -            m_max = m - 1;
>> -        } else {
>> -            m_min = m + 1;
>> +    TranslationBlock *tb = NULL;
>> +
>> +    tb_lock();
>> +
>> +    if ((tcg_ctx.tb_ctx.nb_tbs > 0)
>> +    && (tc_ptr >= (uintptr_t)tcg_ctx.code_gen_buffer &&
>> +        tc_ptr < (uintptr_t)tcg_ctx.code_gen_ptr)) {
>> +        /* binary search (cf Knuth) */
>> +        m_min = 0;
>> +        m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
>> +        while (m_min <= m_max) {
>> +            m = (m_min + m_max) >> 1;
>> +            tb = &tcg_ctx.tb_ctx.tbs[m];
>> +            v = (uintptr_t)tb->tc_ptr;
>> +            if (v == tc_ptr) {
>> +                tb_unlock();
>> +                return tb;
>> +            } else if (tc_ptr < v) {
>> +                m_max = m - 1;
>> +            } else {
>> +                m_min = m + 1;
>> +            }
>>           }
>> +        tb = &tcg_ctx.tb_ctx.tbs[m_max];
>>       }
>> -    return &tcg_ctx.tb_ctx.tbs[m_max];
>> +
>> +    tb_unlock();
>> +    return tb;
>>   }
>>   
>>   #if !defined(CONFIG_USER_ONLY)
>> @@ -1564,6 +1637,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>>       int direct_jmp_count, direct_jmp2_count, cross_page;
>>       TranslationBlock *tb;
>>   
>> +    tb_lock();
>> +
>>       target_code_size = 0;
>>       max_target_code_size = 0;
>>       cross_page = 0;
>> @@ -1619,6 +1694,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>>               tcg_ctx.tb_ctx.tb_phys_invalidate_count);
>>       cpu_fprintf(f, "TLB flush count     %d\n", tlb_flush_count);
>>       tcg_dump_info(f, cpu_fprintf);
>> +
>> +    tb_unlock();
>>   }
>>   
>>   void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 06/18] tcg: remove tcg_halt_cond global variable.
  2015-07-07 12:27       ` Alex Bennée
@ 2015-07-07 13:17         ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 13:17 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, Paolo Bonzini, alistair.francis

On 07/07/2015 14:27, Alex Bennée wrote:
> Frederic Konrad <fred.konrad@greensocs.com> writes:
>
>> On 26/06/2015 17:02, Paolo Bonzini wrote:
>>> On 26/06/2015 16:47, fred.konrad@greensocs.com wrote:
>>>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>>>
>>>> This removes tcg_halt_cond global variable.
>>>> We need one QemuCond per virtual cpu for multithread TCG.
>>>>
>>>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> <snip>
>>>> @@ -1068,7 +1065,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>>>>                    qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
>>>>                }
>>>>            }
>>>> -        qemu_tcg_wait_io_event();
>>>> +        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
>>> Does this work (for non-multithreaded TCG) if tcg_thread_fn is waiting
>>> on the "wrong" condition variable?  For example if all CPUs are idle and
>>> the second CPU wakes up, qemu_tcg_wait_io_event won't be kicked out of
>>> the wait.
>>>
>>> I think you need to have a CPUThread struct like this:
>>>
>>>      struct CPUThread {
>>>          QemuThread thread;
>>>          QemuCond halt_cond;
>>>      };
>>>
>>> and in CPUState have a CPUThread * field instead of the thread and
>>> halt_cond fields.
>>>
>>> Then single-threaded TCG can point all CPUStates to the same instance of
>>> the struct, while multi-threaded TCG can point each CPUState to a
>>> different struct.
>>>
>>> Paolo
>> Hmm, probably not, though we didn't pay attention to keeping the
>> non-MTTCG case working (which is probably not good).
> <snip>
>
> You may want to consider pushing a branch up to a github mirror and
> enabling travis-ci on the repo. That way you'll at least know how broken
> the rest of the tree is.
>
> I appreciate we are still at the RFC stage here but it will probably pay
> off in the long run to try and avoid breaking the rest of the tree ;-)
>
Good point :)

Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 07/18] Drop global lock during TCG code execution
  2015-07-07 12:33       ` Alex Bennée
@ 2015-07-07 13:18         ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 13:18 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, Jan Kiszka, mark.burton, agraf, qemu-devel,
	guillaume.delbergue, a.spyridakis, pbonzini, alistair.francis

On 07/07/2015 14:33, Alex Bennée wrote:
> Frederic Konrad <fred.konrad@greensocs.com> writes:
>
>> On 26/06/2015 16:56, Jan Kiszka wrote:
>>> On 2015-06-26 16:47, fred.konrad@greensocs.com wrote:
>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>
>>>> This finally allows TCG to benefit from the iothread introduction: Drop
>>>> the global mutex while running pure TCG CPU code. Reacquire the lock
>>>> when entering MMIO or PIO emulation, or when leaving the TCG loop.
> <snip>
>>>> diff --git a/translate-all.c b/translate-all.c
>>>> index c25b79b..ade2269 100644
>>>> --- a/translate-all.c
>>>> +++ b/translate-all.c
>>>> @@ -1222,6 +1222,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>>>>    #endif
>>>>    #ifdef TARGET_HAS_PRECISE_SMC
>>>>        if (current_tb_modified) {
>>>> +        qemu_mutex_unlock_iothread();
>>>>            /* we generate a block containing just the instruction
>>>>               modifying the memory. It will ensure that it cannot modify
>>>>               itself */
>>>> @@ -1326,6 +1327,7 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
>>>>        p->first_tb = NULL;
>>>>    #ifdef TARGET_HAS_PRECISE_SMC
>>>>        if (current_tb_modified) {
>>>> +        qemu_mutex_unlock_iothread();
>>>>            /* we generate a block containing just the instruction
>>>>               modifying the memory. It will ensure that it cannot modify
>>>>               itself */
>>>> diff --git a/vl.c b/vl.c
>>>> index 69ad90c..2983d44 100644
>>>> --- a/vl.c
>>>> +++ b/vl.c
>>>> @@ -1698,10 +1698,16 @@ void qemu_devices_reset(void)
>>>>    {
>>>>        QEMUResetEntry *re, *nre;
>>>>    
>>>> +    /*
>>>> +     * Some device's reset needs to grab the global_mutex. So just release it
>>>> +     * here.
>>> That's a property newly introduced by the patch, or how does this
>>> happen? In turn, are all reset handlers now fine to be called outside of
>>> BQL? This looks suspicious, but it's been quite a while since I last
>>> stared at this.
>>>
>>> Jan
>> Hi Jan,
>>
>> Sorry for that, it's a dirty hack :).
>> Some reset handlers probably load stuff into memory, hence a double lock.
>> It will probably disappear with:
>>
>> http://thread.gmane.org/gmane.comp.emulators.qemu/345258
> So I guess this patch will shrink a lot once we re-base on top of Paolo's
> patches (which should be real soon now).

Yes exactly.
>
>> Thanks,
>> Fred
>>
>>>> +     */
>>>> +    qemu_mutex_unlock_iothread();
>>>>        /* reset all devices */
>>>>        QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
>>>>            re->func(re->opaque);
>>>>        }
>>>> +    qemu_mutex_lock_iothread();
>>>>    }
>>>>    
>>>>    void qemu_system_reset(bool report)
>>>>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 09/18] cpu: add a tcg_executing flag.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 09/18] cpu: add a tcg_executing flag fred.konrad
@ 2015-07-07 13:23   ` Alex Bennée
  2015-07-07 13:30     ` Frederic Konrad
  0 siblings, 1 reply; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 13:23 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> We need to know whether any other VCPU is executing code or not it's possible
> with this flag.

Reword: "This flag indicates if the vCPU is currently executing TCG code"?
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpu-exec.c        | 1 +
>  cpus.c            | 1 +
>  include/qom/cpu.h | 3 +++
>  qom/cpu.c         | 1 +
>  4 files changed, 6 insertions(+)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 0644383..de256d6 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -390,6 +390,7 @@ int cpu_exec(CPUArchState *env)
>          cpu->halted = 0;
>      }
>  
> +    cpu->tcg_executing = 1;
>      current_cpu = cpu;
>  
>      rcu_read_lock();
> diff --git a/cpus.c b/cpus.c
> index 2541c56..0291620 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1377,6 +1377,7 @@ static int tcg_cpu_exec(CPUArchState *env)
>      }
>      qemu_mutex_unlock_iothread();
>      ret = cpu_exec(env);
> +    cpu->tcg_executing = 0;


This is an odd pairing, having the set in cpu_exec but the clear in the
outer call to it. Any particular reason it is unbalanced?
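
I'd expect the pair to live together, e.g. both in the caller (sketch,
eliding the iothread locking and profiler bits):

    static int tcg_cpu_exec(CPUArchState *env)
    {
        CPUState *cpu = ENV_GET_CPU(env);
        int ret;

        cpu->tcg_executing = 1;
        ret = cpu_exec(env);
        cpu->tcg_executing = 0;
        return ret;
    }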

>      qemu_mutex_lock_iothread();
>  #ifdef CONFIG_PROFILER
>      tcg_time += profile_getclock() - ti;
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index af3c9e4..1464afa 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -222,6 +222,7 @@ struct kvm_run;
>   * @stopped: Indicates the CPU has been artificially stopped.
>   * @tcg_exit_req: Set to force TCG to stop executing linked TBs for this
>   *           CPU and return to its top level loop.
> + * @tcg_executing: This TCG thread is in cpu_exec().
>   * @singlestep_enabled: Flags for single-stepping.
>   * @icount_extra: Instructions until next timer event.
>   * @icount_decr: Number of cycles left, with interrupt flag in high bit.
> @@ -315,6 +316,8 @@ struct CPUState {
>         (absolute value) offset as small as possible.  This reduces code
>         size, especially for hosts without large memory offsets.  */
>      volatile sig_atomic_t tcg_exit_req;
> +
> +    volatile int tcg_executing;
>  };
>  
>  QTAILQ_HEAD(CPUTailQ, CPUState);
> diff --git a/qom/cpu.c b/qom/cpu.c
> index 108bfa2..ff41a4c 100644
> --- a/qom/cpu.c
> +++ b/qom/cpu.c
> @@ -249,6 +249,7 @@ static void cpu_common_reset(CPUState *cpu)
>      cpu->icount_decr.u32 = 0;
>      cpu->can_do_io = 0;
>      cpu->exception_index = -1;
> +    cpu->tcg_executing = 0;
>      memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
>  }

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 08/18] cpu: remove exit_request global.
  2015-07-07 13:04   ` Alex Bennée
@ 2015-07-07 13:25     ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 13:25 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis

On 07/07/2015 15:04, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This removes exit_request global and adds a variable in CPUState for this.
>> Only the flag for the first cpu is used for the moment as we are still with one
>> TCG thread.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>   cpu-exec.c | 15 ---------------
>>   cpus.c     | 17 ++++++++++++++---
>>   2 files changed, 14 insertions(+), 18 deletions(-)
>>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index 5d9b518..0644383 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -364,8 +364,6 @@ static void cpu_handle_debug_exception(CPUArchState *env)
>>   
>>   /* main execution loop */
>>   
>> -volatile sig_atomic_t exit_request;
>> -
>>   int cpu_exec(CPUArchState *env)
>>   {
>>       CPUState *cpu = ENV_GET_CPU(env);
>> @@ -394,20 +392,8 @@ int cpu_exec(CPUArchState *env)
>>   
>>       current_cpu = cpu;
>>   
>> -    /* As long as current_cpu is null, up to the assignment just above,
>> -     * requests by other threads to exit the execution loop are expected to
>> -     * be issued using the exit_request global. We must make sure that our
>> -     * evaluation of the global value is performed past the current_cpu
>> -     * value transition point, which requires a memory barrier as well as
>> -     * an instruction scheduling constraint on modern architectures.  */
>> -    smp_mb();
>> -
>>       rcu_read_lock();
>>   
>> -    if (unlikely(exit_request)) {
>> -        cpu->exit_request = 1;
>> -    }
>> -
>>       cc->cpu_exec_enter(cpu);
>>   
>>       /* Calculate difference between guest clock and host clock.
>> @@ -496,7 +482,6 @@ int cpu_exec(CPUArchState *env)
>>                       }
>>                   }
>>                   if (unlikely(cpu->exit_request)) {
>> -                    cpu->exit_request = 0;
>>                       cpu->exception_index = EXCP_INTERRUPT;
>>                       cpu_loop_exit(cpu);
>>                   }
>> diff --git a/cpus.c b/cpus.c
>> index 23c316c..2541c56 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -137,6 +137,8 @@ typedef struct TimersState {
>>   } TimersState;
>>   
>>   static TimersState timers_state;
>> +/* CPU associated to this thread. */
>> +static __thread CPUState *tcg_thread_cpu;
>>   
>>   int64_t cpu_get_icount_raw(void)
>>   {
>> @@ -661,12 +663,18 @@ static void cpu_handle_guest_debug(CPUState *cpu)
>>       cpu->stopped = true;
>>   }
>>   
>> +/**
>> + * cpu_signal
>> + * Signal handler when using TCG.
>> + */
>>   static void cpu_signal(int sig)
>>   {
>>       if (current_cpu) {
>>           cpu_exit(current_cpu);
>>       }
>> -    exit_request = 1;
>> +
>> +    /* FIXME: We might want to check if the cpu is running? */
>> +    tcg_thread_cpu->exit_request = true;
> I guess the potential problem is race conditions here? What happens if
> the cpu is signalled by two different threads for two different reasons?

Hmmm yes, I need to take a look at that and check all the reasons why it
should exit.

But maybe it's OK: the first time the cpu gets the signal it will set
exit_request, and the second time it will just do the same?

>>   }
>>   
>>   #ifdef CONFIG_LINUX
>> @@ -1031,6 +1039,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>>   {
>>       CPUState *cpu = arg;
>>   
>> +    tcg_thread_cpu = cpu;
>>       qemu_tcg_init_cpu_signals();
>>       qemu_thread_get_self(cpu->thread);
>>   
>> @@ -1393,7 +1402,8 @@ static void tcg_exec_all(void)
>>       if (next_cpu == NULL) {
>>           next_cpu = first_cpu;
>>       }
>> -    for (; next_cpu != NULL && !exit_request; next_cpu = CPU_NEXT(next_cpu)) {
>> +    for (; next_cpu != NULL && !first_cpu->exit_request;
>> +           next_cpu = CPU_NEXT(next_cpu)) {
>>           CPUState *cpu = next_cpu;
>>           CPUArchState *env = cpu->env_ptr;
>>   
>> @@ -1410,7 +1420,8 @@ static void tcg_exec_all(void)
>>               break;
>>           }
>>       }
>> -    exit_request = 0;
>> +
>> +    first_cpu->exit_request = 0;
>>   }
>>   
>>   void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 09/18] cpu: add a tcg_executing flag.
  2015-07-07 13:23   ` Alex Bennée
@ 2015-07-07 13:30     ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 13:30 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis

On 07/07/2015 15:23, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> We need to know whether any other VCPU is executing code or not it's possible
>> with this flag.
> Reword: "This flag indicates if the vCPU is currently executing TCG code"?

Ok
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>   cpu-exec.c        | 1 +
>>   cpus.c            | 1 +
>>   include/qom/cpu.h | 3 +++
>>   qom/cpu.c         | 1 +
>>   4 files changed, 6 insertions(+)
>>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index 0644383..de256d6 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -390,6 +390,7 @@ int cpu_exec(CPUArchState *env)
>>           cpu->halted = 0;
>>       }
>>   
>> +    cpu->tcg_executing = 1;
>>       current_cpu = cpu;
>>   
>>       rcu_read_lock();
>> diff --git a/cpus.c b/cpus.c
>> index 2541c56..0291620 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -1377,6 +1377,7 @@ static int tcg_cpu_exec(CPUArchState *env)
>>       }
>>       qemu_mutex_unlock_iothread();
>>       ret = cpu_exec(env);
>> +    cpu->tcg_executing = 0;
>
> This is an odd pairing, having the set in cpu_exec but the clear in the
> outer call to it. Any particular reason it is unbalanced?

True, no particular reason; I should move the clear into cpu_exec.
>
>>       qemu_mutex_lock_iothread();
>>   #ifdef CONFIG_PROFILER
>>       tcg_time += profile_getclock() - ti;
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index af3c9e4..1464afa 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -222,6 +222,7 @@ struct kvm_run;
>>    * @stopped: Indicates the CPU has been artificially stopped.
>>    * @tcg_exit_req: Set to force TCG to stop executing linked TBs for this
>>    *           CPU and return to its top level loop.
>> + * @tcg_executing: This TCG thread is in cpu_exec().
>>    * @singlestep_enabled: Flags for single-stepping.
>>    * @icount_extra: Instructions until next timer event.
>>    * @icount_decr: Number of cycles left, with interrupt flag in high bit.
>> @@ -315,6 +316,8 @@ struct CPUState {
>>          (absolute value) offset as small as possible.  This reduces code
>>          size, especially for hosts without large memory offsets.  */
>>       volatile sig_atomic_t tcg_exit_req;
>> +
>> +    volatile int tcg_executing;
>>   };
>>   
>>   QTAILQ_HEAD(CPUTailQ, CPUState);
>> diff --git a/qom/cpu.c b/qom/cpu.c
>> index 108bfa2..ff41a4c 100644
>> --- a/qom/cpu.c
>> +++ b/qom/cpu.c
>> @@ -249,6 +249,7 @@ static void cpu_common_reset(CPUState *cpu)
>>       cpu->icount_decr.u32 = 0;
>>       cpu->can_do_io = 0;
>>       cpu->exception_index = -1;
>> +    cpu->tcg_executing = 0;
>>       memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
>>   }

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 10/18] tcg: switch on multithread.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 10/18] tcg: switch on multithread fred.konrad
@ 2015-07-07 13:40   ` Alex Bennée
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 13:40 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> This switches on multithread.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>
> Changes V5 -> V6:
>   * make qemu_cpu_kick calling qemu_cpu_kick_thread in case of TCG.
> ---
>  cpus.c | 95 ++++++++++++++++++++++++------------------------------------------
>  1 file changed, 34 insertions(+), 61 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index 0291620..08267ed 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -65,7 +65,6 @@
>  
>  #endif /* CONFIG_LINUX */
>  
> -static CPUState *next_cpu;
>  int64_t max_delay;
>  int64_t max_advance;
>  
> @@ -820,8 +819,6 @@ static unsigned iothread_requesting_mutex;
>  
>  static QemuThread io_thread;
>  
> -static QemuThread *tcg_cpu_thread;
> -
>  /* cpu creation */
>  static QemuCond qemu_cpu_cond;
>  /* system init */
> @@ -928,10 +925,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
>  
>  static void qemu_tcg_wait_io_event(CPUState *cpu)
>  {
> -    while (all_cpu_threads_idle()) {
> -       /* Start accounting real time to the virtual clock if the CPUs
> -          are idle.  */
> -        qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
> +    while (cpu_thread_is_idle(cpu)) {
> +        /* Start accounting real time to the virtual clock if the CPUs
> +         * are idle.
> +         */
> +        if ((all_cpu_threads_idle()) && (cpu->cpu_index == 0)) {
> +            qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
> +        }
>          qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
>      }
>  
> @@ -939,9 +939,7 @@ static void qemu_tcg_wait_io_event(CPUState *cpu)
>          qemu_cond_wait(&qemu_io_proceeded_cond, &qemu_global_mutex);
>      }
>  
> -    CPU_FOREACH(cpu) {
> -        qemu_wait_io_event_common(cpu);
> -    }
> +    qemu_wait_io_event_common(cpu);
>  }
>  
>  static void qemu_kvm_wait_io_event(CPUState *cpu)
> @@ -1033,7 +1031,7 @@ static void *qemu_dummy_cpu_thread_fn(void *arg)
>  #endif
>  }
>  
> -static void tcg_exec_all(void);
> +static void tcg_exec_all(CPUState *cpu);
>  
>  static void *qemu_tcg_cpu_thread_fn(void *arg)
>  {

This function could really do with a little comment header marking it
out as the start of life for each TCG vCPU.
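
Something along these lines perhaps (wording is only a suggestion):

    /* qemu_tcg_cpu_thread_fn:
     * Entry point and main loop for a TCG vCPU thread.  With MTTCG each
     * vCPU gets one of these: it registers itself, signals creation
     * back via qemu_cpu_cond, then alternates between executing
     * translated code for its CPU and sleeping in
     * qemu_tcg_wait_io_event() while the vCPU is idle.
     */
    static void *qemu_tcg_cpu_thread_fn(void *arg);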

> @@ -1044,37 +1042,26 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>      qemu_thread_get_self(cpu->thread);
>  
>      qemu_mutex_lock_iothread();
> -    CPU_FOREACH(cpu) {
> -        cpu->thread_id = qemu_get_thread_id();
> -        cpu->created = true;
> -        cpu->can_do_io = 1;
> -    }
> -    qemu_cond_signal(&qemu_cpu_cond);
> -
> -    /* wait for initial kick-off after machine start */
> -    while (first_cpu->stopped) {
> -        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
> -
> -        /* process any pending work */
> -        CPU_FOREACH(cpu) {
> -            qemu_wait_io_event_common(cpu);
> -        }
> -    }
> +    cpu->thread_id = qemu_get_thread_id();
> +    cpu->created = true;
> +    cpu->can_do_io = 1;
>  
> -    /* process any pending work */
> -    exit_request = 1;
> +    qemu_cond_signal(&qemu_cpu_cond);
>  
>      while (1) {
> -        tcg_exec_all();
> +        if (!cpu->stopped) {
> +            tcg_exec_all(cpu);
>  
> -        if (use_icount) {
> -            int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
> +            if (use_icount) {
> +                int64_t deadline =
> +                    qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
>  
> -            if (deadline == 0) {
> -                qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
> +                if (deadline == 0) {
> +                    qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
> +                }
>              }
>          }
> -        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
> +        qemu_tcg_wait_io_event(cpu);
>      }
>  
>      return NULL;
> @@ -1122,7 +1109,7 @@ static void qemu_cpu_kick_thread(CPUState *cpu)
>  void qemu_cpu_kick(CPUState *cpu)
>  {
>      qemu_cond_broadcast(cpu->halt_cond);
> -    if (!tcg_enabled() && !cpu->thread_kicked) {
> +    if (!cpu->thread_kicked) {
>          qemu_cpu_kick_thread(cpu);
>          cpu->thread_kicked = true;
>      }
> @@ -1232,23 +1219,15 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
>  
>      cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>      qemu_cond_init(cpu->halt_cond);
> -
> -    /* share a single thread for all cpus with TCG */
> -    if (!tcg_cpu_thread) {
> -        cpu->thread = g_malloc0(sizeof(QemuThread));
> -        snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
> -                 cpu->cpu_index);
> -        qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn,
> -                           cpu, QEMU_THREAD_JOINABLE);
> +    cpu->thread = g_malloc0(sizeof(QemuThread));
> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG", cpu->cpu_index);
> +    qemu_thread_create(cpu->thread, thread_name, qemu_tcg_cpu_thread_fn, cpu,
> +                       QEMU_THREAD_JOINABLE);
>  #ifdef _WIN32
> -        cpu->hThread = qemu_thread_get_handle(cpu->thread);
> +    cpu->hThread = qemu_thread_get_handle(cpu->thread);
>  #endif
> -        while (!cpu->created) {
> -            qemu_cond_wait(&qemu_cpu_cond, &qemu_global_mutex);
> -        }
> -        tcg_cpu_thread = cpu->thread;
> -    } else {
> -        cpu->thread = tcg_cpu_thread;
> +    while (!cpu->created) {
> +        qemu_cond_wait(&qemu_cpu_cond, &qemu_global_mutex);
>      }
>  }
>  
> @@ -1393,21 +1372,15 @@ static int tcg_cpu_exec(CPUArchState *env)
>      return ret;
>  }
>  
> -static void tcg_exec_all(void)
> +static void tcg_exec_all(CPUState *cpu)

I'd drop the _all and rename the function to tcg_exec() to avoid confusion.
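
i.e. something like:

    static void tcg_exec(CPUState *cpu);   /* was: tcg_exec_all(void) */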

>  {
>      int r;
> +    CPUArchState *env = cpu->env_ptr;
>  
>      /* Account partial waits to QEMU_CLOCK_VIRTUAL.  */
>      qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
>  
> -    if (next_cpu == NULL) {
> -        next_cpu = first_cpu;
> -    }
> -    for (; next_cpu != NULL && !first_cpu->exit_request;
> -           next_cpu = CPU_NEXT(next_cpu)) {
> -        CPUState *cpu = next_cpu;
> -        CPUArchState *env = cpu->env_ptr;
> -
> +    while (!cpu->exit_request) {
>          qemu_clock_enable(QEMU_CLOCK_VIRTUAL,
>                            (cpu->singlestep_enabled & SSTEP_NOTIMER) == 0);
>  
> @@ -1422,7 +1395,7 @@ static void tcg_exec_all(void)
>          }
>      }
>  
> -    first_cpu->exit_request = 0;
> +    cpu->exit_request = 0;
>  }
>  
>  void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 11/18] cpus: make qemu_cpu_kick_thread public.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 11/18] cpus: make qemu_cpu_kick_thread public fred.konrad
@ 2015-07-07 15:11   ` Alex Bennée
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 15:11 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> This makes qemu_cpu_kick_thread public.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cpus.c                | 2 +-
>  include/sysemu/cpus.h | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/cpus.c b/cpus.c
> index 08267ed..5f13d73 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1067,7 +1067,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>      return NULL;
>  }
>  
> -static void qemu_cpu_kick_thread(CPUState *cpu)
> +void qemu_cpu_kick_thread(CPUState *cpu)
>  {
>  #ifndef _WIN32
>      int err;
> diff --git a/include/sysemu/cpus.h b/include/sysemu/cpus.h
> index 3f162a9..4f95b72 100644
> --- a/include/sysemu/cpus.h
> +++ b/include/sysemu/cpus.h
> @@ -6,6 +6,7 @@ void qemu_init_cpu_loop(void);
>  void resume_all_vcpus(void);
>  void pause_all_vcpus(void);
>  void cpu_stop_current(void);
> +void qemu_cpu_kick_thread(CPUState *cpu);

Again I couldn't see any use outside of cpus.c, which could be solved by
putting the declaration at the top of the file.
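
i.e. keep it file-local, sketch:

    /* near the top of cpus.c, instead of exporting the symbol: */
    static void qemu_cpu_kick_thread(CPUState *cpu);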

>  
>  void cpu_synchronize_all_states(void);
>  void cpu_synchronize_all_post_reset(void);

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 14/18] add a callback when tb_invalidate is called.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 14/18] add a callback when tb_invalidate is called fred.konrad
  2015-06-26 16:20   ` Paolo Bonzini
@ 2015-07-07 15:32   ` Alex Bennée
  1 sibling, 0 replies; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 15:32 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> Instead of doing the jump cache invalidation directly in tb_invalidate delay it
> after the exit so we don't have an other CPU trying to execute the code being
> invalidated.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  translate-all.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 59 insertions(+), 2 deletions(-)
>
> diff --git a/translate-all.c b/translate-all.c
> index ade2269..468648d 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -61,6 +61,7 @@
>  #include "translate-all.h"
>  #include "qemu/bitmap.h"
>  #include "qemu/timer.h"
> +#include "sysemu/cpus.h"
>  
>  //#define DEBUG_TB_INVALIDATE
>  //#define DEBUG_FLUSH
> @@ -966,14 +967,58 @@ static inline void tb_reset_jump(TranslationBlock *tb, int n)
>      tb_set_jmp_target(tb, n, (uintptr_t)(tb->tc_ptr + tb->tb_next_offset[n]));
>  }
>  
> +struct CPUDiscardTBParams {
> +    CPUState *cpu;
> +    TranslationBlock *tb;
> +};
> +
> +static void cpu_discard_tb_from_jmp_cache(void *opaque)
> +{
> +    unsigned int h;
> +    struct CPUDiscardTBParams *params = opaque;
> +
> +    h = tb_jmp_cache_hash_func(params->tb->pc);
> +    if (params->cpu->tb_jmp_cache[h] == params->tb) {
> +        params->cpu->tb_jmp_cache[h] = NULL;
> +    }
> +
> +    g_free(opaque);
> +}
> +
> +static void tb_invalidate_jmp_remove(void *opaque)
> +{
> +    TranslationBlock *tb = opaque;
> +    TranslationBlock *tb1, *tb2;
> +    unsigned int n1;
> +
> +    /* suppress this TB from the two jump lists */
> +    tb_jmp_remove(tb, 0);
> +    tb_jmp_remove(tb, 1);
> +
> +    /* suppress any remaining jumps to this TB */
> +    tb1 = tb->jmp_first;
> +    for (;;) {
> +        n1 = (uintptr_t)tb1 & 3;
> +        if (n1 == 2) {
> +            break;
> +        }
> +        tb1 = (TranslationBlock *)((uintptr_t)tb1 & ~3);
> +        tb2 = tb1->jmp_next[n1];
> +        tb_reset_jump(tb1, n1);
> +        tb1->jmp_next[n1] = NULL;
> +        tb1 = tb2;
> +    }
> +    tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
> +}
> +
>  /* invalidate one TB */
>  void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>  {
>      CPUState *cpu;
>      PageDesc *p;
> -    unsigned int h, n1;
> +    unsigned int h;
>      tb_page_addr_t phys_pc;
> -    TranslationBlock *tb1, *tb2;
> +    struct CPUDiscardTBParams *params;
>  
>      tb_lock();
>  
> @@ -996,6 +1041,9 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>  
>      tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
>  
> +#if 0 /*MTTCG*/
> +    TranslationBlock *tb1, *tb2;
> +    unsigned int n1;

We may as well bite the bullet and get some build logic in to
conditionally build MTTCG (with the aim that eventually all of it will be).
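
Something like this, assuming a (hypothetical) CONFIG_MTTCG symbol emitted
by configure; the point is just to turn the "#if 0 /*MTTCG*/" markers into
a real switch:

    #ifdef CONFIG_MTTCG                 /* hypothetical config symbol */
        /* deferred invalidation: each vCPU clears its own jump cache
         * via async_run_on_cpu(), the shared TB is unlinked in safe
         * work once every vCPU has exited */
    #else
        /* single-threaded TCG: invalidate everything in place */
    #endif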

>      /* remove the TB from the hash list */
>      h = tb_jmp_cache_hash_func(tb->pc);
>      CPU_FOREACH(cpu) {
> @@ -1022,6 +1070,15 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>          tb1 = tb2;
>      }
>      tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
> +#else
> +    CPU_FOREACH(cpu) {
> +        params = g_malloc(sizeof(struct CPUDiscardTBParams));
> +        params->cpu = cpu;
> +        params->tb = tb;
> +        async_run_on_cpu(cpu, cpu_discard_tb_from_jmp_cache, params);
> +    }
> +    async_run_safe_work_on_cpu(first_cpu, tb_invalidate_jmp_remove, tb);
> +#endif /* MTTCG */
>  
>      tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
>      tb_unlock();

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all fred.konrad
  2015-06-26 15:15   ` Paolo Bonzini
@ 2015-07-07 15:52   ` Alex Bennée
  1 sibling, 0 replies; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 15:52 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> Some architectures allow to flush the tlb of other VCPUs. This is not a problem
> when we have only one thread for all VCPUs but it definitely needs to be an
> asynchronous work when we are in true multithreaded work.
>
> TODO: Some test case, I fear some bad results in case a VCPUs execute a barrier
>       or something like that.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  cputlb.c                | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
>  include/exec/exec-all.h |  2 ++
>  2 files changed, 78 insertions(+)
>
> diff --git a/cputlb.c b/cputlb.c
> index 79fff1c..e5853fd 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -72,6 +72,45 @@ void tlb_flush(CPUState *cpu, int flush_global)
>      tlb_flush_count++;
>  }
>  
> +struct TLBFlushParams {
> +    CPUState *cpu;
> +    int flush_global;
> +};
> +
> +static void tlb_flush_async_work(void *opaque)
> +{
> +    struct TLBFlushParams *params = opaque;
> +
> +    tlb_flush(params->cpu, params->flush_global);
> +    g_free(params);
> +}
> +
> +void tlb_flush_all(int flush_global)
> +{
> +    CPUState *cpu;
> +    struct TLBFlushParams *params;
> +
> +#if 0 /* MTTCG */
> +    CPU_FOREACH(cpu) {
> +        tlb_flush(cpu, flush_global);
> +    }
> +#else

As before we might as well add the build machinery - for one thing I
read that as the first leg being MTTCG ;-)
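
At minimum flipping the condition so the guarded leg is named for what it
actually is would help, e.g. (same hypothetical CONFIG_MTTCG symbol as
before):

    #ifdef CONFIG_MTTCG
        /* per-vCPU flush via async_run_on_cpu(), as in the patch */
    #else
        CPU_FOREACH(cpu) {
            tlb_flush(cpu, flush_global);
        }
    #endif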

> +    CPU_FOREACH(cpu) {
> +        if (qemu_cpu_is_self(cpu)) {
> +            /* async_run_on_cpu handle this case but this just avoid a malloc
> +             * here.
> +             */
> +            tlb_flush(cpu, flush_global);
> +        } else {
> +            params = g_malloc(sizeof(struct TLBFlushParams));
> +            params->cpu = cpu;
> +            params->flush_global = flush_global;
> +            async_run_on_cpu(cpu, tlb_flush_async_work, params);
> +        }
> +    }
> +#endif /* MTTCG */
> +}
> +
>  static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
>  {
>      if (addr == (tlb_entry->addr_read &
> @@ -124,6 +163,43 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
>      tb_flush_jmp_cache(cpu, addr);
>  }
>  
> +struct TLBFlushPageParams {
> +    CPUState *cpu;
> +    target_ulong addr;
> +};
> +
> +static void tlb_flush_page_async_work(void *opaque)
> +{
> +    struct TLBFlushPageParams *params = opaque;
> +
> +    tlb_flush_page(params->cpu, params->addr);
> +    g_free(params);
> +}
> +
> +void tlb_flush_page_all(target_ulong addr)
> +{
> +    CPUState *cpu;
> +    struct TLBFlushPageParams *params;
> +
> +    CPU_FOREACH(cpu) {
> +#if 0 /* !MTTCG */
> +        tlb_flush_page(cpu, addr);
> +#else
> +        if (qemu_cpu_is_self(cpu)) {
> +            /* async_run_on_cpu handle this case but this just avoid a malloc
> +             * here.
> +             */
> +            tlb_flush_page(cpu, addr);
> +        } else {
> +            params = g_malloc(sizeof(struct TLBFlushPageParams));
> +            params->cpu = cpu;
> +            params->addr = addr;
> +            async_run_on_cpu(cpu, tlb_flush_page_async_work, params);
> +        }
> +#endif /* MTTCG */
> +    }
> +}
> +
>  /* update the TLBs so that writes to code in the virtual page 'addr'
>     can be detected */
>  void tlb_protect_code(ram_addr_t ram_addr)
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 44f3336..484c351 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -96,7 +96,9 @@ bool qemu_in_vcpu_thread(void);
>  void cpu_reload_memory_map(CPUState *cpu);
>  void tcg_cpu_address_space_init(CPUState *cpu, AddressSpace *as);
>  /* cputlb.c */
> +void tlb_flush_page_all(target_ulong addr);
>  void tlb_flush_page(CPUState *cpu, target_ulong addr);
> +void tlb_flush_all(int flush_global);
>  void tlb_flush(CPUState *cpu, int flush_global);
>  void tlb_set_page(CPUState *cpu, target_ulong vaddr,
>                    hwaddr paddr, int prot,

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-07-06 14:29             ` Mark Burton
@ 2015-07-07 16:12               ` Alex Bennée
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 16:12 UTC (permalink / raw)
  To: Mark Burton
  Cc: mttcg, Peter Maydell, Alexander Spyridakis, QEMU Developers,
	Paolo Bonzini, KONRAD Frédéric


Mark Burton <mark.burton@greensocs.com> writes:

> Paolo, Alex, Alexander,
>
> Talking to Fred after the call about ways of avoiding the ‘stop the world’ (or rather ‘sync the world’) - we already discussed this on this thread.
> One thing that would be very helpful would be some test cases around
> this. We could then use Fred’s code to check some of the possible
> solutions out….

Yeah we certainly could do with some. I'm currently investigating the
memory barriers but TLB flushing might be easier to write at first.

>
> I’m not sure if there is wiggle room in Peter’s statement below. Can
> the TLB operation be completed on one core, but not ‘seen’ by other
> cores until they hit an exit…..?

I suspect they can - assuming no other guest synchronisation primitive
was in play, who's to say the other cores weren't at their eventual PC
already? However I suspect the key thing is that the first core doesn't
restart until all the other cores have caught up with their flush
operations.
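
As a toy model of that completion requirement (plain pthreads, nothing
QEMU-specific): the requesting core queues the flushes and then blocks
until every other core has acknowledged them, which is the property the
guest's DSB would rely on.

    #include <pthread.h>
    #include <stdio.h>

    #define NCPUS 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t done = PTHREAD_COND_INITIALIZER;
    static int pending = NCPUS;  /* flushes not yet acknowledged */

    static void *vcpu_thread(void *arg)
    {
        (void)arg;
        /* ...the vCPU would flush its own TLB here... */
        pthread_mutex_lock(&lock);
        if (--pending == 0) {
            pthread_cond_signal(&done);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NCPUS];

        for (int i = 0; i < NCPUS; i++) {
            pthread_create(&th[i], NULL, vcpu_thread, NULL);
        }
        /* The "DSB": the requesting core blocks until every flush has
         * been acknowledged before it executes anything further. */
        pthread_mutex_lock(&lock);
        while (pending > 0) {
            pthread_cond_wait(&done, &lock);
        }
        pthread_mutex_unlock(&lock);
        printf("all TLB flushes acknowledged\n");

        for (int i = 0; i < NCPUS; i++) {
            pthread_join(th[i], NULL);
        }
        return 0;
    }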

>
> Cheers
>
> Mark.
>
>
>> On 26 Jun 2015, at 18:30, Frederic Konrad <fred.konrad@greensocs.com> wrote:
>> 
>> On 26/06/2015 18:08, Peter Maydell wrote:
>>> On 26 June 2015 at 17:01, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> On 26/06/2015 17:54, Frederic Konrad wrote:
>>>>> So what happen is:
>>>>> An arm instruction want to clear tlb of all VCPUs eg: IS version of
>>>>> TLBIALL.
>>>>> The VCPU which execute the TLBIALL_IS can't flush tlb of other VCPU.
>>>>> It will just ask all VCPU thread to exit and to do tlb_flush hence the
>>>>> async_work.
>>>>> 
>>>>> Maybe the big issue might be memory barrier instruction here which I didn't
>>>>> checked.
>>>> Yeah, ISTR that in some cases you have to wait for other CPUs to
>>>> invalidate the TLB before proceeding.  Maybe it's only when you have a
>>>> dmb instruction, but it's probably simpler for QEMU to always do it
>>>> synchronously.
>>> Yeah, the ARM architectural requirement here is that the TLB
>>> operation is complete after a DSB instruction executes. (True for
>>> any TLB op, not just the all-CPUs ones). NB that we also call
>>> tlb_flush() from target-arm/ code for some things like "we just
>>> updated a system register"; some of those have "must take effect
>>> immediately" semantics.
>>> 
>>> In any case, for generic code we have to also consider the
>>> semantics of non-ARM guests...
>>> 
>>> thanks
>>> -- PMM
>> Yes this is not the case as I implemented it.
>> 
>> The rest of the TB will be executed before the tlb_flush work really happen.
>> The old version did this, was slow and was a mess (if two VCPUs want to tlb_flush
>> at the same time and an other tlb_flush_page.. it becomes tricky..)
>> 
>> I think it's not really terrible if the other VCPU execute some stuff before doing the
>> tlb_flush.? So the solution would be only to cut the TranslationBlock after instruction
>> which require a tlb_flush?
>> 
>> Thanks,
>> Fred
>> 

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 16/18] arm: use tlb_flush*_all
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 16/18] arm: use tlb_flush*_all fred.konrad
@ 2015-07-07 16:14   ` Alex Bennée
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 16:14 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> This just use the new mechanism to ensure that each VCPU thread flush its own
> VCPU.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  target-arm/helper.c | 45 +++++++--------------------------------------
>  1 file changed, 7 insertions(+), 38 deletions(-)
>
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index ad3d5da..1995439 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -411,41 +411,25 @@ static void tlbimvaa_write(CPUARMState *env, const ARMCPRegInfo *ri,
>  static void tlbiall_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>                               uint64_t value)
>  {
> -    CPUState *other_cs;
> -
> -    CPU_FOREACH(other_cs) {
> -        tlb_flush(other_cs, 1);
> -    }
> +    tlb_flush_all(1);
>  }
>  
>  static void tlbiasid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>                               uint64_t value)
>  {
> -    CPUState *other_cs;
> -
> -    CPU_FOREACH(other_cs) {
> -        tlb_flush(other_cs, value == 0);
> -    }
> +    tlb_flush_all(value == 0);
>  }
>  
>  static void tlbimva_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>                               uint64_t value)
>  {
> -    CPUState *other_cs;
> -
> -    CPU_FOREACH(other_cs) {
> -        tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
> -    }
> +    tlb_flush_page_all(value & TARGET_PAGE_MASK);
>  }
>  
>  static void tlbimvaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>                               uint64_t value)
>  {
> -    CPUState *other_cs;
> -
> -    CPU_FOREACH(other_cs) {
> -        tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
> -    }
> +    tlb_flush_page_all(value & TARGET_PAGE_MASK);
>  }
>  
>  static const ARMCPRegInfo cp_reginfo[] = {
> @@ -2281,34 +2265,19 @@ static void tlbi_aa64_asid_write(CPUARMState *env, const ARMCPRegInfo *ri,
>  static void tlbi_aa64_va_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>                                    uint64_t value)
>  {
> -    CPUState *other_cs;
> -    uint64_t pageaddr = sextract64(value << 12, 0, 56);
> -
> -    CPU_FOREACH(other_cs) {
> -        tlb_flush_page(other_cs, pageaddr);
> -    }
> +    tlb_flush_page_all(sextract64(value << 12, 0, 56));
>  }

Personally I'd keep the:

uint64_t pageaddr = sextract64(value << 12, 0, 56);

The compiler will optimise it away but the reader will know what those
bits are.
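
i.e.:

    uint64_t pageaddr = sextract64(value << 12, 0, 56);

    tlb_flush_page_all(pageaddr);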

>  
>  static void tlbi_aa64_vaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>                                    uint64_t value)
>  {
> -    CPUState *other_cs;
> -    uint64_t pageaddr = sextract64(value << 12, 0, 56);
> -
> -    CPU_FOREACH(other_cs) {
> -        tlb_flush_page(other_cs, pageaddr);
> -    }
> +    tlb_flush_page_all(sextract64(value << 12, 0, 56));
>  }

ditto

>  
>  static void tlbi_aa64_asid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>                                    uint64_t value)
>  {
> -    CPUState *other_cs;
> -    int asid = extract64(value, 48, 16);
> -
> -    CPU_FOREACH(other_cs) {
> -        tlb_flush(other_cs, asid == 0);
> -    }
> +    tlb_flush_all(extract64(value, 48, 16) == 0);
>  }

ditto

>  
>  static CPAccessResult aa64_zva_access(CPUARMState *env, const ARMCPRegInfo *ri)

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 17/18] translate-all: introduces tb_flush_safe.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 17/18] translate-all: introduces tb_flush_safe fred.konrad
@ 2015-07-07 16:16   ` Alex Bennée
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 16:16 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> tb_flush is not thread safe we definitely need to exit VCPUs to do that.
> This introduces tb_flush_safe which just creates an async safe work which will
> do a tb_flush later.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  include/exec/exec-all.h |  1 +
>  translate-all.c         | 15 +++++++++++++++
>  2 files changed, 16 insertions(+)
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 484c351..b5e4fb3 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -219,6 +219,7 @@ static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
>  
>  void tb_free(TranslationBlock *tb);
>  void tb_flush(CPUArchState *env);
> +void tb_flush_safe(CPUArchState *env);
>  void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
>  
>  #if defined(USE_DIRECT_JUMP)
> diff --git a/translate-all.c b/translate-all.c
> index 468648d..8bd8fe8 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -815,6 +815,21 @@ static void page_flush_tb(void)
>      }
>  }
>  
> +static void tb_flush_work(void *opaque)
> +{
> +    CPUArchState *env = opaque;
> +    tb_flush(env);
> +}
> +
> +void tb_flush_safe(CPUArchState *env)
> +{
> +#if 0 /* !MTTCG */
> +    tb_flush(env);
> +#else
> +    async_run_safe_work_on_cpu(ENV_GET_CPU(env), tb_flush_work, env);
> +#endif /* MTTCG */
> +}

Same comments about build system fixups.

> +
>  /* flush all the translation blocks */
>  /* XXX: tb_flush is currently not thread safe */
>  void tb_flush(CPUArchState *env1)

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
  2015-06-26 14:47 ` [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb fred.konrad
  2015-06-26 16:21   ` Paolo Bonzini
@ 2015-07-07 16:17   ` Alex Bennée
  2015-07-07 16:23     ` Frederic Konrad
  1 sibling, 1 reply; 82+ messages in thread
From: Alex Bennée @ 2015-07-07 16:17 UTC (permalink / raw)
  To: fred.konrad
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis


fred.konrad@greensocs.com writes:

> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> This changes just the tb_flush called from tb_alloc.
>
> TODO:
>  * changes the other tb_flush.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> ---
>  translate-all.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/translate-all.c b/translate-all.c
> index 8bd8fe8..9adaffa 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -1147,7 +1147,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      tb = tb_alloc(pc);
>      if (!tb) {
>          /* flush must be done */
> -        tb_flush(env);
> +        tb_flush_safe(env);

Hold on this is async right? What stops us rolling on and then getting
flushed when the other vCPUs come to a halt?

It deserves a comment at least.
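
Concretely, the ordering I'm worried about (a sketch of the hunk above,
not new code):

    tb = tb_alloc(pc);
    if (!tb) {
        /* only *queues* the flush as safe work... */
        tb_flush_safe(env);
        /* ...so this retry can still return NULL, and the "cannot
         * fail at this point" comment below no longer holds until
         * that safe work has actually run */
        tb = tb_alloc(pc);
    }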

>          /* cannot fail at this point */
>          tb = tb_alloc(pc);
>          /* Don't forget to invalidate previous TB info.  */

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 18/18] translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
  2015-07-07 16:17   ` Alex Bennée
@ 2015-07-07 16:23     ` Frederic Konrad
  0 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-07 16:23 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, a.spyridakis, mark.burton, agraf,
	qemu-devel, guillaume.delbergue, pbonzini, alistair.francis

On 07/07/2015 18:17, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> This changes just the tb_flush called from tb_alloc.
>>
>> TODO:
>>   * changes the other tb_flush.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> ---
>>   translate-all.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/translate-all.c b/translate-all.c
>> index 8bd8fe8..9adaffa 100644
>> --- a/translate-all.c
>> +++ b/translate-all.c
>> @@ -1147,7 +1147,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>>       tb = tb_alloc(pc);
>>       if (!tb) {
>>           /* flush must be done */
>> -        tb_flush(env);
>> +        tb_flush_safe(env);
> Hold on this is async right? What stops us rolling on and then getting
> flushed when the other vCPUs come to a halt?
>
> It deserves a comment at least.
>

No, this needs some synchronization when the CPUs are halted.
There is a problem here spotted by Paolo, though.

In the case of tb_flush we do an async safe work because all VCPU threads
must be outside cpu_exec.

Are you suggesting just exiting everybody and waiting here until all
VCPUs have exited? This is possible here, but it is not possible for the
other case; that's why I preferred a "generic" mechanism.
Fred
>>           /* cannot fail at this point */
>>           tb = tb_alloc(pc);
>>           /* Don't forget to invalidate previous TB info.  */

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.
  2015-06-26 16:08         ` Peter Maydell
  2015-06-26 16:30           ` Frederic Konrad
  2015-06-26 16:54           ` Paolo Bonzini
@ 2015-07-08 15:35           ` Frederic Konrad
  2 siblings, 0 replies; 82+ messages in thread
From: Frederic Konrad @ 2015-07-08 15:35 UTC (permalink / raw)
  To: Peter Maydell, Paolo Bonzini
  Cc: mttcg, Alexander Graf, Alexander Spyridakis, Mark Burton,
	QEMU Developers, Alistair Francis, Guillaume Delbergue,
	Alex Bennée

On 26/06/2015 18:08, Peter Maydell wrote:
> On 26 June 2015 at 17:01, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 26/06/2015 17:54, Frederic Konrad wrote:
>>> So what happen is:
>>> An arm instruction want to clear tlb of all VCPUs eg: IS version of
>>> TLBIALL.
>>> The VCPU which execute the TLBIALL_IS can't flush tlb of other VCPU.
>>> It will just ask all VCPU thread to exit and to do tlb_flush hence the
>>> async_work.
>>>
>>> Maybe the big issue might be memory barrier instruction here which I didn't
>>> checked.
>> Yeah, ISTR that in some cases you have to wait for other CPUs to
>> invalidate the TLB before proceeding.  Maybe it's only when you have a
>> dmb instruction, but it's probably simpler for QEMU to always do it
>> synchronously.
> Yeah, the ARM architectural requirement here is that the TLB
> operation is complete after a DSB instruction executes. (True for
> any TLB op, not just the all-CPUs ones). NB that we also call
> tlb_flush() from target-arm/ code for some things like "we just
> updated a system register"; some of those have "must take effect
> immediately" semantics.
>
> In any case, for generic code we have to also consider the
> semantics of non-ARM guests...
>
> thanks
> -- PMM
Hi,

About that, we plan to:
   * make the tlb_flush work synchronous instead of asynchronous (in the
     case of a tlb_flush_all).
   * break the TranslationBlock after a DSB.

In this case, when we have a tlb_flush_all, all VCPU threads will exit
and wait for all VCPUs to be out of cpu_exec before doing the flush.
They then won't be able to re-enter cpu_exec while any flush remains
pending. So in the case of a DSB, if there is any pending tlb_flush the
VCPU won't be able to enter cpu_exec until it is done, which gives us
the right behaviour I think. A sketch of that entry gate is below.
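
Roughly this shape, just a sketch: pending_flushes and flush_done_cond
are made-up names for whatever bookkeeping we end up with, and the check
would sit at the top of cpu_exec() under the global lock:

    static unsigned pending_flushes;   /* assumed: flushes still queued */
    static QemuCond flush_done_cond;   /* assumed: broadcast at zero */

    /* a VCPU may not start executing TBs while any flush is pending */
    static void wait_for_pending_flushes(void)
    {
        qemu_mutex_lock_iothread();
        while (pending_flushes > 0) {
            qemu_cond_wait(&flush_done_cond, &qemu_global_mutex);
        }
        qemu_mutex_unlock_iothread();
    }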

The obscure part is: what should happen if CPU A flushes its own TLB
and CPU B does a DSB? I'm not sure this is really a problem if CPU A
hasn't finished its TLB operation, as the DSB might have happened
before the flush operation?

Does that make sense?

Thanks,
Fred

^ permalink raw reply	[flat|nested] 82+ messages in thread
