* [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Hi all,

Here is MTTCG code I've been working on out-of-tree for the last few months.

The patchset applies on top of pbonzini's mttcg branch, commit ca56de6f.
Fetch the branch from: https://github.com/bonzini/qemu/commits/mttcg

The highlights of the patchset are as follows:

- The first 5 patches are fixes for bugs that exist only in the mttcg
  branch.

- Patches 6-12 fix issues in the master branch.

- The remaining patches are really the meat of this patchset.
  The main features are:

  * Support of MTTCG for both user and system mode.

  * Design: a per-CPU TB jump list protected by a seqlock. If the TB is not
    found there, we check the global, RCU-protected 'hash table' (i.e. a
    fixed number of buckets); if it is not there either, we grab the lock,
    check the hash table again, and if the TB is still missing we generate
    the code and add the TB to the hash table. A sketch follows below.

    It makes sense that Paolo's recent work on the mttcg branch ended up
    being almost identical to this--it's simple and it scales well.
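
    A minimal sketch of the lookup order (illustrative only: tb_jmp_seqlock
    and tb_hash_lookup are made-up names, not the actual API):

      TranslationBlock *tb_find(CPUState *cpu, target_ulong pc,
                                target_ulong cs_base, int flags)
      {
          TranslationBlock *tb;
          unsigned start;

          /* 1. Per-CPU TB jump list, protected by a seqlock */
          do {
              start = seqlock_read_begin(&cpu->tb_jmp_seqlock);
              tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
          } while (seqlock_read_retry(&cpu->tb_jmp_seqlock, start));
          if (tb && tb->pc == pc) {
              return tb;
          }
          /* 2. Global hash table, protected by RCU */
          rcu_read_lock();
          tb = tb_hash_lookup(pc, cs_base, flags);
          rcu_read_unlock();
          if (!tb) {
              /* 3. Miss: take tb_lock, re-check, then generate */
              tb_lock();
              tb = tb_hash_lookup(pc, cs_base, flags);
              if (!tb) {
                  tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
              }
              tb_unlock();
          }
          return tb;
      }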

  * tb_lock must be held every time code is generated. The rationale is
    that most of the time QEMU is executing code, not generating it.

  * tb_flush: do it once all other CPUs have been put to sleep, by calling
    synchronize_rcu().
    We also instrument tb_lock to make sure that only one tb_flush request
    can happen at a given time.  For this, a mechanism to schedule work is
    added to supersede cpu_sched_safe_work, which cannot work in usermode.
    Here I've toyed with an alternative version that doesn't force the
    flushing CPU to exit, but to make this work we have to save/restore the
    RCU read lock while tb_lock is held in order to avoid deadlocks. This
    isn't too pretty, but it's good to know that the option is there. The
    quiescing idea is sketched below.
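
    A sketch of the quiescing idea (illustrative; names and details do not
    match the actual patches):

      static void tb_flush_safe(CPUState *cpu)
      {
          CPUState *other;

          CPU_FOREACH(other) {
              if (other != cpu) {
                  cpu_exit(other);   /* kick it out of translated code */
              }
          }
          synchronize_rcu();         /* wait until all CPUs are quiescent */
          tb_lock();
          tb_flush(cpu->env_ptr);    /* no CPU can reference a stale TB now */
          tb_unlock();
      }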

  * I focused on x86 since it is a complex ISA and we support many cores via -smp.
    I work on a 64-core machine so concurrency bugs show up relatively easily.

    Atomics are modeled using spinlocks, i.e. one host lock per guest cache
    line; a conceptual sketch follows below. Note that spinlocks are way
    better than mutexes for this--performance on 64 cores is 2x that of
    mutexes on highly concurrent workloads (synchrobench, see below).
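
    Conceptually, emulating one atomic instruction then looks like this
    (the series implements it in the aie.c / aie-helper.c patches below):

      AIEEntry *e = aie_entry_get_lock(paddr); /* lock keyed by paddr >> 6 */
      /* ... emulate the guest's load/modify/store sequence ... */
      qemu_spin_unlock(&e->lock);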

    Advantages:

    + Scalability. No unrelated atomics (e.g. atomics on the same page)
      can interfere with each other. Of course if the guest code
      has false sharing (i.e. atomics on the same cache line), then
      there's not much the host can do about that.
      This is an improved version of what I sent in May:
        https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg01641.html
      Performance numbers are below.

    + No requirements on the capabilities of the host machine, e.g.
      no need for a host cmpxchg instruction. That is, we'd have no problem
      running x86 code on a weaker host (say ARM/PPC) although of course we'd
      have to sprinkle quite a few memory barriers.  Note that the current
      MTTCG relies on cmpxchg(), which would be insufficient to run x86 code
      on ARM/PPC since that cmpxchg could very well race with a regular store
      (whereas in x86 it cannot).

    + Works unchanged for both system and user modes. As far as I can
      tell the TLB-based approach that Alvise is working on couldn't
      be used without the TLB--correct me if I'm wrong, it's been
      quite some time since I looked at that work.

    Disadvantages:
    - Overhead is added to every guest store. Depending on how frequent
      stores are, this can end up being a significant single-threaded
      overhead (I've measured from a few percent up to ~50%).

      Note that this overhead applies to strong memory models such as
      x86's, where regular stores can race with concurrent stores and
      atomic instructions and must therefore be instrumented. Weaker
      memory models such as ARM/PPC's wouldn't have this overhead.

  * Performance
    I've used four C/C++ benchmarks from synchrobench:
      https://github.com/gramoli/synchrobench
    I'm running them with these arguments: -u 0 -f 1 -d 10000 -t $n_threads
    Here are two comparisons:
    * usermode vs. native     http://imgur.com/RggzgyU
    * qemu-system vs qemu-KVM http://imgur.com/H9iH06B
    (full-system is run with -m 4096).

    Throughput is normalised for each of the four configurations over their
    throughput with 1 thread.

    To measure the single-thread performance overhead of instrumenting
    writes I used two apps from PARSEC, both with the 'large' input:

    [Note that for the multithreaded tests I did not use PARSEC; it doesn't
     scale at all on large systems]

    blackscholes, 1 thread, stores are ~8% of executed instructions:
    pbonzini/mttcg+Patches1-5:	62.922099012 seconds ( +-  0.05% )
    +entire patchset:		67.680987626 seconds ( +-  0.35% )
    That's about an 8% perf overhead.

    swaptions, 1 thread, stores are ~7% of executed instructions:
    pbonzini/mttcg+Patches1-5:	144.542495834 seconds ( +-  0.49% )
    +entire patchset:		157.673401200 seconds ( +-  0.25% )
    That's about a 9% perf overhead.

    All tests use taskset appropriately to pack threads into CPUs in the
    same NUMA node, if possible.
    All tests are run on a 64-core (4x16) AMD Opteron 6376 with turbo core
    disabled.

  * Known Issues
    - In system mode, when run with a high number of threads, segfaults on
      translated code happen every now and then.
      Is there anything useful I can do with the segfaulting address? For example:
      (gdb) bt
      #0  0x00007fbf8013d89f in ?? ()
      #1  0x0000000000000000 in ?? ()

      Also, are there any things that should be protected by tb_lock but
      aren't? The only potential issue I've thought of so far is direct
      jumps racing with tb_phys_invalidate, but I need to analyze this in
      more detail.

  * Future work
  - Run on a PowerPC host to look at how bad the barrier sprinkling has to
    be. I have access to a host so I should get to this in the next few
    days. However, ppc usermode doesn't work multithreaded--help would be
    appreciated; see this thread:
      http://lists.gnu.org/archive/html/qemu-ppc/2015-06/msg00164.html

  - Support more ISAs. I have done ARM, SPARC and PPC, but haven't
    tested them much so I'm keeping them out of this patchset.

Thanks,

		Emilio


* [Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cpu-exec.c b/cpu-exec.c
index f53475c..b8a11e1 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -330,6 +330,7 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
         if (!tb) {
             tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
         }
+        mmap_unlock();
     }
 
     /* we add the TB in the virtual pc hash table */
-- 
1.9.1


* [Qemu-devel] [RFC 02/38] hw/i386/kvmvapic: add missing include of tcg.h
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

So that the declaration of tb_lock can be found.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 hw/i386/kvmvapic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
index 1c3b5b6..a9a33fd 100644
--- a/hw/i386/kvmvapic.c
+++ b/hw/i386/kvmvapic.c
@@ -13,6 +13,7 @@
 #include "sysemu/kvm.h"
 #include "hw/i386/apic_internal.h"
 #include "hw/sysbus.h"
+#include "tcg/tcg.h"
 
 #define VAPIC_IO_PORT           0x7e
 
-- 
1.9.1


* [Qemu-devel] [RFC 03/38] cpu-exec: set current_cpu at cpu_exec()
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

So that current_cpu is also set in usermode.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c | 2 ++
 cpus.c     | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index b8a11e1..2b9a447 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -386,6 +386,8 @@ int cpu_exec(CPUState *cpu)
     uintptr_t next_tb;
     SyncClocks sc;
 
+    current_cpu = cpu;
+
 #ifndef CONFIG_USER_ONLY
     /* FIXME: user-mode emulation probably needs a similar mechanism as well,
      * for example for tb_flush.
diff --git a/cpus.c b/cpus.c
index 5484ce6..0fe6576 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1079,7 +1079,6 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     cpu->thread_id = qemu_get_thread_id();
     cpu->created = true;
     cpu->can_do_io = 1;
-    current_cpu = cpu;
 
     qemu_cond_signal(&qemu_cpu_cond);
 
-- 
1.9.1


* [Qemu-devel] [RFC 04/38] translate-all: remove volatile from have_tb_lock
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

This is a thread-local variable and therefore all changes
to it will be seen in order by the owning thread. There is
no need for it to be volatile.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/translate-all.c b/translate-all.c
index 901a35e..31239db 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -130,7 +130,7 @@ static void *l1_map[V_L1_SIZE];
 TCGContext tcg_ctx;
 
 /* translation block context */
-__thread volatile int have_tb_lock;
+__thread int have_tb_lock;
 
 void tb_lock(void)
 {
-- 
1.9.1


* [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

On some parallel workloads this gives up to a 15% speed improvement.
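
The API is unchanged; for reference, a minimal usage sketch:

  QemuSpin lock;

  qemu_spin_init(&lock);
  qemu_spin_lock(&lock);
  /* critical section */
  qemu_spin_unlock(&lock);
  qemu_spin_destroy(&lock);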

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/thread-posix.h | 47 ++++++++++++++++++++++++++++++++++++++++++
 include/qemu/thread.h       |  6 ------
 util/qemu-thread-posix.c    | 50 +++++----------------------------------------
 3 files changed, 52 insertions(+), 51 deletions(-)

diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index 8ce8f01..7d3a9f1 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -37,4 +37,51 @@ struct QemuThread {
     pthread_t thread;
 };
 
+void qemu_spin_error_exit(int err, const char *msg);
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_init(&spin->lock, 0);
+    if (err) {
+        qemu_spin_error_exit(err, __func__);
+    }
+}
+
+static inline void qemu_spin_destroy(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_destroy(&spin->lock);
+    if (err) {
+        qemu_spin_error_exit(err, __func__);
+    }
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_lock(&spin->lock);
+    if (err) {
+        qemu_spin_error_exit(err, __func__);
+    }
+}
+
+static inline int qemu_spin_trylock(QemuSpin *spin)
+{
+    return pthread_spin_trylock(&spin->lock);
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+    int err;
+
+    err = pthread_spin_unlock(&spin->lock);
+    if (err) {
+        qemu_spin_error_exit(err, __func__);
+    }
+}
+
 #endif
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index f5d1259..003daab 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -26,12 +26,6 @@ void qemu_mutex_lock(QemuMutex *mutex);
 int qemu_mutex_trylock(QemuMutex *mutex);
 void qemu_mutex_unlock(QemuMutex *mutex);
 
-void qemu_spin_init(QemuSpin *spin);
-void qemu_spin_destroy(QemuSpin *spin);
-void qemu_spin_lock(QemuSpin *spin);
-int qemu_spin_trylock(QemuSpin *spin);
-void qemu_spin_unlock(QemuSpin *spin);
-
 void qemu_cond_init(QemuCond *cond);
 void qemu_cond_destroy(QemuCond *cond);
 
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 224bacc..04dae0f 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -48,6 +48,11 @@ static void error_exit(int err, const char *msg)
     abort();
 }
 
+void qemu_spin_error_exit(int err, const char *msg)
+{
+    error_exit(err, msg);
+}
+
 void qemu_mutex_init(QemuMutex *mutex)
 {
     int err;
@@ -89,51 +94,6 @@ void qemu_mutex_unlock(QemuMutex *mutex)
         error_exit(err, __func__);
 }
 
-void qemu_spin_init(QemuSpin *spin)
-{
-    int err;
-
-    err = pthread_spin_init(&spin->lock, 0);
-    if (err) {
-        error_exit(err, __func__);
-    }
-}
-
-void qemu_spin_destroy(QemuSpin *spin)
-{
-    int err;
-
-    err = pthread_spin_destroy(&spin->lock);
-    if (err) {
-        error_exit(err, __func__);
-    }
-}
-
-void qemu_spin_lock(QemuSpin *spin)
-{
-    int err;
-
-    err = pthread_spin_lock(&spin->lock);
-    if (err) {
-        error_exit(err, __func__);
-    }
-}
-
-int qemu_spin_trylock(QemuSpin *spin)
-{
-    return pthread_spin_trylock(&spin->lock);
-}
-
-void qemu_spin_unlock(QemuSpin *spin)
-{
-    int err;
-
-    err = pthread_spin_unlock(&spin->lock);
-    if (err) {
-        error_exit(err, __func__);
-    }
-}
-
 void qemu_cond_init(QemuCond *cond)
 {
     int err;
-- 
1.9.1


* [Qemu-devel] [RFC 06/38] seqlock: add missing 'inline' to seqlock_read_retry
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/seqlock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 3ff118a..f1256f5 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -62,7 +62,7 @@ static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
     return ret;
 }
 
-static int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
+static inline int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
 {
     /* Read other fields before reading final sequence.  */
     smp_rmb();
-- 
1.9.1


* [Qemu-devel] [RFC 07/38] seqlock: read sequence number atomically
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

With this change we make sure that the compiler will not
optimise the read of the sequence number in any way.
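
For reference, the read-side pattern this feeds into (sketch, with sl
being some QemuSeqLock):

  unsigned start;

  do {
      start = seqlock_read_begin(&sl);
      /* read the seqlock-protected fields */
  } while (seqlock_read_retry(&sl, start));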

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/seqlock.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index f1256f5..70b01fd 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -55,18 +55,18 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
 static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
 {
     /* Always fail if a write is in progress.  */
-    unsigned ret = sl->sequence & ~1;
+    unsigned ret = atomic_read(&sl->sequence);
 
     /* Read sequence before reading other fields.  */
     smp_rmb();
-    return ret;
+    return ret & ~1;
 }
 
 static inline int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
 {
     /* Read other fields before reading final sequence.  */
     smp_rmb();
-    return unlikely(sl->sequence != start);
+    return unlikely(atomic_read(&sl->sequence) != start);
 }
 
 #endif
-- 
1.9.1


* [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

We were unlocking this lock in the child after fork, which is wrong
since only the thread that holds a mutex is allowed to unlock it.
Re-initialize the lock in the child instead.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 util/rcu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/util/rcu.c b/util/rcu.c
index 8ba304d..47c2bce 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -335,6 +335,11 @@ static void rcu_init_unlock(void)
     qemu_mutex_unlock(&rcu_registry_lock);
     qemu_mutex_unlock(&rcu_sync_lock);
 }
+
+static void rcu_init_child(void)
+{
+    qemu_mutex_init(&rcu_registry_lock);
+}
 #endif
 
 void rcu_after_fork(void)
@@ -346,7 +351,7 @@ void rcu_after_fork(void)
 static void __attribute__((__constructor__)) rcu_init(void)
 {
 #ifdef CONFIG_POSIX
-    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_unlock);
+    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_child);
 #endif
     rcu_init_complete();
 }
-- 
1.9.1


* [Qemu-devel] [RFC 09/38] rcu: fix comment with s/rcu_gp_lock/rcu_registry_lock/
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/rcu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 7df1e86..f6d1d56 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -71,7 +71,7 @@ struct rcu_reader_data {
     /* Data used by reader only */
     unsigned depth;
 
-    /* Data used for registry, protected by rcu_gp_lock */
+    /* Data used for registry, protected by rcu_registry_lock */
     QLIST_ENTRY(rcu_reader_data) node;
 };
 
-- 
1.9.1


* [Qemu-devel] [RFC 10/38] translate-all: remove obsolete comment about l1_map
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

l1_map is based on physical addresses in full-system mode, as pointed
out in an earlier comment. Said comment also mentions that virtual
addresses are only used in l1_map in user-only mode.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 31239db..b873d5c 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -122,8 +122,7 @@ uintptr_t qemu_real_host_page_mask;
 uintptr_t qemu_host_page_size;
 uintptr_t qemu_host_page_mask;
 
-/* This is a multi-level map on the virtual address space.
-   The bottom level has pointers to PageDesc.  */
+/* The bottom level has pointers to PageDesc */
 static void *l1_map[V_L1_SIZE];
 
 /* code generation context */
-- 
1.9.1


* [Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

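futex(FUTEX_WAIT) can return without the event having been signalled:
it fails with EWOULDBLOCK if the futex word no longer matches the
expected value, and with EINTR when interrupted by a signal.  Retry on
EINTR so that a signal does not cause a spurious wakeup of the waiter.
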
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 util/qemu-thread-posix.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 04dae0f..3760e27 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -303,7 +303,16 @@ static inline void futex_wake(QemuEvent *ev, int n)
 
 static inline void futex_wait(QemuEvent *ev, unsigned val)
 {
-    futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0);
+    while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
+        switch (errno) {
+        case EWOULDBLOCK:
+            return;
+        case EINTR:
+            break; /* get out of switch and retry */
+        default:
+            abort();
+        }
+    }
 }
 #else
 static inline void futex_wake(QemuEvent *ev, int n)
-- 
1.9.1


* [Qemu-devel] [RFC 12/38] linux-user: call rcu_(un)register_thread on pthread_(exit|create)
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/syscall.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index f62c698..732936f 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -4513,6 +4513,7 @@ static void *clone_func(void *arg)
     CPUState *cpu;
     TaskState *ts;
 
+    rcu_register_thread();
     env = info->env;
     cpu = ENV_GET_CPU(env);
     thread_cpu = cpu;
@@ -5614,6 +5615,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
             thread_cpu = NULL;
             object_unref(OBJECT(cpu));
             g_free(ts);
+            rcu_unregister_thread();
             pthread_exit(NULL);
         }
 #ifdef TARGET_GPROF
-- 
1.9.1


* [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Having the physical address in the TLB entry will allow us
to portably obtain the physical address of a memory access,
which will prove useful when implementing a scalable emulation
of atomic instructions.
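
For the record, the entry-size arithmetic behind the CPU_TLB_ENTRY_BITS
change below: with the extra field, a 64-bit target needs
4 * sizeof(target_ulong) + sizeof(uintptr_t) = 40 bytes per entry,
which pads up to the next power of two, 64 bytes (entry bits = 6);
a 32-bit target needs at most 16 + 8 = 24 bytes, which pads to 32
(entry bits = 5).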

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cputlb.c                | 1 +
 include/exec/cpu-defs.h | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index d1ad8e8..1b3673e 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -409,6 +409,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
     } else {
         te->addr_write = -1;
     }
+    te->addr_phys = paddr;
 }
 
 /* Add a new TLB entry, but without specifying the memory
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 5093be2..ca9c85c 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -60,10 +60,10 @@ typedef uint64_t target_ulong;
 /* use a fully associative victim tlb of 8 entries */
 #define CPU_VTLB_SIZE 8
 
-#if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
-#define CPU_TLB_ENTRY_BITS 4
-#else
+#if TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 5
+#else
+#define CPU_TLB_ENTRY_BITS 6
 #endif
 
 /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
@@ -110,6 +110,7 @@ typedef struct CPUTLBEntry {
             target_ulong addr_read;
             target_ulong addr_write;
             target_ulong addr_code;
+            target_ulong addr_phys;
             /* Addend to virtual address to get host address.  IO accesses
                use the corresponding iotlb value.  */
             uintptr_t addend;
-- 
1.9.1


* [Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

This will be used by the atomic instruction emulation code.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 softmmu_template.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h          |  5 +++++
 2 files changed, 53 insertions(+)

diff --git a/softmmu_template.h b/softmmu_template.h
index b66eaf8..6496a8a 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -480,6 +480,54 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 #endif
 }
 
+#if DATA_SIZE == 1
+
+/* get a load's physical address */
+hwaddr helper_ret_get_ld_phys(CPUArchState *env, target_ulong addr,
+                              int mmu_idx, uintptr_t retaddr)
+{
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    CPUTLBEntry *te = &env->tlb_table[mmu_idx][index];
+    target_ulong taddr;
+    target_ulong phys_addr;
+
+    retaddr -= GETPC_ADJ;
+    taddr = te->addr_read & (TARGET_PAGE_MASK | TLB_INVALID_MASK);
+    if (taddr != (addr & TARGET_PAGE_MASK)) {
+        if (!VICTIM_TLB_HIT(addr_read)) {
+            CPUState *cs = ENV_GET_CPU(env);
+
+            tlb_fill(cs, addr, MMU_DATA_LOAD, mmu_idx, retaddr);
+        }
+    }
+    phys_addr = te->addr_phys;
+    return phys_addr | (addr & ~TARGET_PAGE_MASK);
+}
+
+/* get a store's physical address */
+hwaddr helper_ret_get_st_phys(CPUArchState *env, target_ulong addr,
+                              int mmu_idx, uintptr_t retaddr)
+{
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    CPUTLBEntry *te = &env->tlb_table[mmu_idx][index];
+    target_ulong taddr;
+    target_ulong phys_addr;
+
+    retaddr -= GETPC_ADJ;
+    taddr = te->addr_write & (TARGET_PAGE_MASK | TLB_INVALID_MASK);
+    if (taddr != (addr & TARGET_PAGE_MASK)) {
+        if (!VICTIM_TLB_HIT(addr_write)) {
+            CPUState *cs = ENV_GET_CPU(env);
+
+            tlb_fill(cs, addr, MMU_DATA_STORE, mmu_idx, retaddr);
+        }
+    }
+    phys_addr = te->addr_phys;
+    return phys_addr | (addr & ~TARGET_PAGE_MASK);
+}
+
+#endif /* DATA_SIZE == 1 */
+
 #if DATA_SIZE > 1
 void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 66b36f2..8d30d61 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -992,6 +992,11 @@ void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
 void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                        TCGMemOpIdx oi, uintptr_t retaddr);
 
+hwaddr helper_ret_get_ld_phys(CPUArchState *env, target_ulong addr,
+                              int mmu_idx, uintptr_t retaddr);
+hwaddr helper_ret_get_st_phys(CPUArchState *env, target_ulong addr,
+                              int mmu_idx, uintptr_t retaddr);
+
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
 # define helper_ret_ldsw_mmu  helper_be_ldsw_mmu
-- 
1.9.1


* [Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

This will be used by atomic instruction emulation code.
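
A usage sketch (create_entry here is a caller-supplied constructor,
named for illustration; see the AIE patches later in this series for
the real user):

  QemuRadixTree tree;
  unsigned long index = 0x1000;
  void *entry;

  /* 64-bit key space, 8 bits of radix (256 slots) per level */
  qemu_radix_tree_init(&tree, 64, 8);
  entry = qemu_radix_tree_find_alloc(&tree, index, create_entry, g_free);
  /* lookup only, no allocation: */
  entry = qemu_radix_tree_find(&tree, index);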

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/radix-tree.h | 29 ++++++++++++++++++
 util/Makefile.objs        |  2 +-
 util/radix-tree.c         | 75 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/radix-tree.h
 create mode 100644 util/radix-tree.c

diff --git a/include/qemu/radix-tree.h b/include/qemu/radix-tree.h
new file mode 100644
index 0000000..a4e1f97
--- /dev/null
+++ b/include/qemu/radix-tree.h
@@ -0,0 +1,29 @@
+#ifndef RADIX_TREE_H
+#define RADIX_TREE_H
+
+#include <stddef.h>
+
+typedef struct QemuRadixNode QemuRadixNode;
+typedef struct QemuRadixTree QemuRadixTree;
+
+struct QemuRadixNode {
+    void *slots[0];
+};
+
+struct QemuRadixTree {
+    QemuRadixNode *root;
+    int radix;
+    int max_height;
+};
+
+void qemu_radix_tree_init(QemuRadixTree *tree, int bits, int radix);
+void *qemu_radix_tree_find_alloc(QemuRadixTree *tree, unsigned long index,
+                                 void *(*create)(unsigned long),
+                                 void (*delete)(void *));
+
+static inline void *qemu_radix_tree_find(QemuRadixTree *t, unsigned long index)
+{
+    return qemu_radix_tree_find_alloc(t, index, NULL, NULL);
+}
+
+#endif /* RADIX_TREE_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 114d657..6b18d3d 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o radix-tree.o
 util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o event_notifier-win32.o
 util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o event_notifier-posix.o qemu-openpty.o
 util-obj-y += envlist.o path.o module.o
diff --git a/util/radix-tree.c b/util/radix-tree.c
new file mode 100644
index 0000000..69eff29
--- /dev/null
+++ b/util/radix-tree.c
@@ -0,0 +1,75 @@
+/*
+ * radix-tree.c
+ * Non-blocking radix tree.
+ *
+ * Features:
+ * - Concurrent lookups and inserts.
+ * - No support for deletions.
+ *
+ * Conventions:
+ * - Height is counted starting from 0 at the bottom.
+ * - The index is used from left to right, i.e. MSBs are used first. This way
+ *   nearby addresses land in nearby slots, minimising cache/TLB misses.
+ */
+#include <glib.h>
+
+#include "qemu/radix-tree.h"
+#include "qemu/atomic.h"
+#include "qemu/bitops.h"
+#include "qemu/osdep.h"
+
+typedef struct QemuRadixNode QemuRadixNode;
+
+void *qemu_radix_tree_find_alloc(QemuRadixTree *tree, unsigned long index,
+                                 void *(*create)(unsigned long),
+                                 void (*delete)(void *))
+{
+    QemuRadixNode *parent;
+    QemuRadixNode *node = tree->root;
+    void **slot;
+    int n_slots = BIT(tree->radix);
+    int level = tree->max_height - 1;
+    int shift = (level - 1) * tree->radix;
+
+    do {
+        parent = node;
+        slot = parent->slots + ((index >> shift) & (n_slots - 1));
+        node = atomic_read(slot);
+        smp_read_barrier_depends();
+        if (node == NULL) {
+            void *old;
+            void *new;
+
+            if (!create) {
+                return NULL;
+            }
+
+            if (level == 1) {
+                node = create(index);
+            } else {
+                node = g_malloc0(sizeof(*node) + sizeof(void *) * n_slots);
+            }
+            new = node;
+            /* atomic_cmpxchg is type-safe so we cannot use 'node' here */
+            old = atomic_cmpxchg(slot, NULL, new);
+            if (old) {
+                if (level == 1) {
+                    delete(node);
+                } else {
+                    g_free(node);
+                }
+                node = old;
+            }
+        }
+        shift -= tree->radix;
+        level--;
+    } while (level > 0);
+    return node;
+}
+
+void qemu_radix_tree_init(QemuRadixTree *tree, int bits, int radix)
+{
+    tree->radix = radix;
+    tree->max_height = 1 + DIV_ROUND_UP(bits, radix);
+    tree->root = g_malloc0(sizeof(*tree->root) + sizeof(void *) * BIT(radix));
+}
-- 
1.9.1


* [Qemu-devel] [RFC 16/38] aie: add module for Atomic Instruction Emulation
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 Makefile.target    |  1 +
 aie.c              | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/qemu/aie.h | 49 ++++++++++++++++++++++++++++++++++++++++++++++
 translate-all.c    |  2 ++
 4 files changed, 109 insertions(+)
 create mode 100644 aie.c
 create mode 100644 include/qemu/aie.h

diff --git a/Makefile.target b/Makefile.target
index 3e7aafd..840e257 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -85,6 +85,7 @@ all: $(PROGS) stap
 #########################################################
 # cpu emulator library
 obj-y = exec.o translate-all.o cpu-exec.o
+obj-y += aie.o
 obj-y += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
 obj-$(CONFIG_TCG_INTERPRETER) += tci.o
 obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
diff --git a/aie.c b/aie.c
new file mode 100644
index 0000000..588c02b
--- /dev/null
+++ b/aie.c
@@ -0,0 +1,57 @@
+/*
+ * Atomic instruction emulation (AIE).
+ * This applies to LL/SC and higher-order atomic instructions.
+ * More info:
+ *   http://en.wikipedia.org/wiki/Load-link/store-conditional
+ */
+#include "qemu-common.h"
+#include "qemu/radix-tree.h"
+#include "qemu/thread.h"
+#include "qemu/aie.h"
+
+#if defined(CONFIG_USER_ONLY)
+# define AIE_FULL_ADDR_BITS  TARGET_VIRT_ADDR_SPACE_BITS
+#else
+#if HOST_LONG_BITS < TARGET_PHYS_ADDR_SPACE_BITS
+/* in this case QEMU restricts the maximum RAM size to fit in the host */
+# define AIE_FULL_ADDR_BITS  HOST_LONG_BITS
+#else
+# define AIE_FULL_ADDR_BITS  TARGET_PHYS_ADDR_SPACE_BITS
+#endif
+#endif /* CONFIG_USER_ONLY */
+
+#define AIE_ADDR_BITS  (AIE_FULL_ADDR_BITS - AIE_DISCARD_BITS)
+#define AIE_RADIX     8
+
+QemuRadixTree aie_rtree;
+unsigned long *aie_bm;
+
+static void *aie_entry_init(unsigned long index)
+{
+    AIEEntry *entry;
+
+    entry = qemu_memalign(64, sizeof(*entry));
+    qemu_spin_init(&entry->lock);
+    entry->bm_set = false;
+    return entry;
+}
+
+AIEEntry *aie_entry_get_lock(hwaddr paddr)
+{
+    aie_addr_t idx = to_aie(paddr);
+    AIEEntry *e;
+
+    e = qemu_radix_tree_find_alloc(&aie_rtree, idx, aie_entry_init, qemu_vfree);
+    qemu_spin_lock(&e->lock);
+    if (!e->bm_set) {
+        set_bit_atomic(idx & (AIE_BM_NR_ITEMS - 1), aie_bm);
+        e->bm_set = true;
+    }
+    return e;
+}
+
+void aie_init(void)
+{
+    qemu_radix_tree_init(&aie_rtree, AIE_ADDR_BITS, AIE_RADIX);
+    aie_bm = bitmap_new(AIE_BM_NR_ITEMS);
+}
diff --git a/include/qemu/aie.h b/include/qemu/aie.h
new file mode 100644
index 0000000..667f36c
--- /dev/null
+++ b/include/qemu/aie.h
@@ -0,0 +1,49 @@
+/*
+ * Atomic instruction emulation (AIE)
+ */
+#ifndef AIE_H
+#define AIE_H
+
+#include "qemu/radix-tree.h"
+#include "qemu/thread.h"
+#include "qemu/bitops.h"
+
+#include "exec/hwaddr.h"
+
+typedef hwaddr aie_addr_t;
+
+typedef struct AIEEntry AIEEntry;
+
+struct AIEEntry {
+    union {
+        struct {
+            QemuSpin lock;
+            bool bm_set;
+        };
+        uint8_t pad[64];
+    };
+} __attribute((aligned(64)));
+
+#define AIE_DISCARD_BITS 6
+
+#define AIE_BM_BITS     21
+#define AIE_BM_NR_ITEMS BIT(AIE_BM_BITS)
+
+extern QemuRadixTree aie_rtree;
+extern unsigned long *aie_bm;
+
+static inline aie_addr_t to_aie(hwaddr paddr)
+{
+    return paddr >> AIE_DISCARD_BITS;
+}
+
+void aie_init(void);
+
+AIEEntry *aie_entry_get_lock(hwaddr addr);
+
+static inline bool aie_entry_exists(hwaddr addr)
+{
+    return test_bit(to_aie(addr) & (AIE_BM_NR_ITEMS - 1), aie_bm);
+}
+
+#endif /* AIE_H */
diff --git a/translate-all.c b/translate-all.c
index b873d5c..f07547e 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -62,6 +62,7 @@
 #include "translate-all.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
+#include "qemu/aie.h"
 
 //#define DEBUG_TB_INVALIDATE
 //#define DEBUG_FLUSH
@@ -730,6 +731,7 @@ void tcg_exec_init(unsigned long tb_size)
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
     tcg_register_jit(tcg_ctx.code_gen_buffer, tcg_ctx.code_gen_buffer_size);
     page_init();
+    aie_init();
 #if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_USE_GUEST_BASE)
     /* There's no guest base to take into account, so go ahead and
        initialize the prologue now.  */
-- 
1.9.1


* [Qemu-devel] [RFC 17/38] aie: add target helpers
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 aie-helper.c              | 112 ++++++++++++++++++++++++++++++++++++++++++++++
 include/exec/cpu-defs.h   |   5 +++
 include/qemu/aie-helper.h |   6 +++
 3 files changed, 123 insertions(+)
 create mode 100644 aie-helper.c
 create mode 100644 include/qemu/aie-helper.h

diff --git a/aie-helper.c b/aie-helper.c
new file mode 100644
index 0000000..7521150
--- /dev/null
+++ b/aie-helper.c
@@ -0,0 +1,112 @@
+/*
+ * To be included directly from the target's helper.c
+ */
+#include "qemu/aie.h"
+
+#ifdef CONFIG_USER_ONLY
+static inline
+hwaddr h_get_ld_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+    return vaddr;
+}
+
+static inline
+hwaddr h_get_st_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+    return vaddr;
+}
+#else
+static inline
+hwaddr h_get_ld_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+    return helper_ret_get_ld_phys(env, vaddr, cpu_mmu_index(env), retaddr);
+}
+
+static inline
+hwaddr h_get_st_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+    return helper_ret_get_st_phys(env, vaddr, cpu_mmu_index(env), retaddr);
+}
+#endif /* CONFIG_USER_ONLY */
+
+static inline void h_aie_lock(CPUArchState *env, hwaddr paddr)
+{
+    AIEEntry *entry = aie_entry_get_lock(paddr);
+
+    env->aie_entry = entry;
+    env->aie_locked = true;
+}
+
+static inline void h_aie_unlock(CPUArchState *env)
+{
+    assert(env->aie_entry && env->aie_locked);
+    qemu_spin_unlock(&env->aie_entry->lock);
+    env->aie_locked = false;
+}
+
+static inline void h_aie_unlock__done(CPUArchState *env)
+{
+    h_aie_unlock(env);
+    env->aie_entry = NULL;
+}
+
+static inline
+void aie_ld_lock_ret(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+    hwaddr paddr;
+
+    assert(!env->aie_locked);
+    paddr = h_get_ld_phys(env, vaddr, retaddr);
+    h_aie_lock(env, paddr);
+}
+
+void HELPER(aie_ld_lock)(CPUArchState *env, target_ulong vaddr)
+{
+    aie_ld_lock_ret(env, vaddr, GETRA());
+}
+
+static inline
+void aie_st_lock_ret(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+    hwaddr paddr;
+
+    assert(!env->aie_locked);
+    paddr = h_get_st_phys(env, vaddr, retaddr);
+    h_aie_lock(env, paddr);
+}
+
+void HELPER(aie_unlock__done)(CPUArchState *env)
+{
+    h_aie_unlock__done(env);
+}
+
+void HELPER(aie_ld_pre)(CPUArchState *env, target_ulong vaddr)
+{
+    if (likely(!env->aie_lock_enabled) || env->aie_locked) {
+        return;
+    }
+    aie_ld_lock_ret(env, vaddr, GETRA());
+}
+
+void HELPER(aie_st_pre)(CPUArchState *env, target_ulong vaddr)
+{
+    if (unlikely(env->aie_lock_enabled)) {
+        if (env->aie_locked) {
+            return;
+        }
+        aie_st_lock_ret(env, vaddr, GETRA());
+    } else {
+        hwaddr paddr = h_get_st_phys(env, vaddr, GETRA());
+
+        if (unlikely(aie_entry_exists(paddr))) {
+            h_aie_lock(env, paddr);
+        }
+    }
+}
+
+void HELPER(aie_st_post)(CPUArchState *env, target_ulong vaddr)
+{
+    if (unlikely(!env->aie_lock_enabled && env->aie_locked)) {
+        h_aie_unlock__done(env);
+    }
+}
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index ca9c85c..e6e4568 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -28,6 +28,7 @@
 #include "qemu/osdep.h"
 #include "qemu/queue.h"
 #include "tcg-target.h"
+#include "qemu/aie.h"
 #ifndef CONFIG_USER_ONLY
 #include "exec/hwaddr.h"
 #endif
@@ -152,5 +153,9 @@ typedef struct CPUIOTLBEntry {
 #define CPU_COMMON                                                      \
     /* soft mmu support */                                              \
     CPU_COMMON_TLB                                                      \
+    AIEEntry *aie_entry;                                                \
+    bool aie_locked;                                                    \
+    bool aie_lock_enabled;                                              \
+    bool aie_llsc_st_tracking;                                          \
 
 #endif
diff --git a/include/qemu/aie-helper.h b/include/qemu/aie-helper.h
new file mode 100644
index 0000000..86a786a
--- /dev/null
+++ b/include/qemu/aie-helper.h
@@ -0,0 +1,6 @@
+DEF_HELPER_2(aie_ld_pre, void, env, tl)
+DEF_HELPER_2(aie_st_pre, void, env, tl)
+DEF_HELPER_2(aie_st_post, void, env, tl)
+
+DEF_HELPER_2(aie_ld_lock, void, env, tl)
+DEF_HELPER_1(aie_unlock__done, void, env)
-- 
1.9.1


* [Qemu-devel] [RFC 18/38] tcg: add fences
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg-op.c  |  5 +++++
 tcg/tcg-op.h  | 18 ++++++++++++++++++
 tcg/tcg-opc.h |  5 +++++
 3 files changed, 28 insertions(+)

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 45098c3..6d5b1df 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -57,6 +57,11 @@ static void tcg_emit_op(TCGContext *ctx, TCGOpcode opc, int args)
     };
 }
 
+void tcg_gen_op0(TCGContext *ctx, TCGOpcode opc)
+{
+    tcg_emit_op(ctx, opc, -1);
+}
+
 void tcg_gen_op1(TCGContext *ctx, TCGOpcode opc, TCGArg a1)
 {
     int pi = ctx->gen_next_parm_idx;
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index d1d763f..52482c0 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -28,6 +28,7 @@
 
 /* Basic output routines.  Not for general consumption.  */
 
+void tcg_gen_op0(TCGContext *, TCGOpcode);
 void tcg_gen_op1(TCGContext *, TCGOpcode, TCGArg);
 void tcg_gen_op2(TCGContext *, TCGOpcode, TCGArg, TCGArg);
 void tcg_gen_op3(TCGContext *, TCGOpcode, TCGArg, TCGArg, TCGArg);
@@ -698,6 +699,23 @@ static inline void tcg_gen_trunc_i64_i32(TCGv_i32 ret, TCGv_i64 arg)
     tcg_gen_trunc_shr_i64_i32(ret, arg, 0);
 }
 
+/* fences */
+
+static inline void tcg_gen_fence_load(void)
+{
+    tcg_gen_op0(&tcg_ctx, INDEX_op_fence_load);
+}
+
+static inline void tcg_gen_fence_store(void)
+{
+    tcg_gen_op0(&tcg_ctx, INDEX_op_fence_store);
+}
+
+static inline void tcg_gen_fence_full(void)
+{
+    tcg_gen_op0(&tcg_ctx, INDEX_op_fence_full);
+}
+
 /* QEMU specific operations.  */
 
 #ifndef TARGET_LONG_BITS
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 13ccb60..85de953 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -167,6 +167,11 @@ DEF(muls2_i64, 2, 2, 0, IMPL64 | IMPL(TCG_TARGET_HAS_muls2_i64))
 DEF(muluh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_muluh_i64))
 DEF(mulsh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_mulsh_i64))
 
+/* fences */
+DEF(fence_load, 0, 0, 0, 0)
+DEF(fence_store, 0, 0, 0, 0)
+DEF(fence_full, 0, 0, 0, 0)
+
 /* QEMU specific */
 #if TARGET_LONG_BITS > TCG_TARGET_REG_BITS
 DEF(debug_insn_start, 0, 0, 2, TCG_OPF_NOT_PRESENT)
-- 
1.9.1


* [Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb()
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg-op.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 52482c0..3ec9f13 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -716,6 +716,16 @@ static inline void tcg_gen_fence_full(void)
     tcg_gen_op0(&tcg_ctx, INDEX_op_fence_full);
 }
 
+#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
+static inline void tcg_gen_smp_rmb(void)
+{ }
+#else
+static inline void tcg_gen_smp_rmb(void)
+{
+    tcg_gen_fence_load();
+}
+#endif
+
 /* QEMU specific operations.  */
 
 #ifndef TARGET_LONG_BITS
-- 
1.9.1


* [Qemu-devel] [RFC 20/38] tcg/i386: implement fences
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/i386/tcg-target.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 887f22f..6600c45 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1123,6 +1123,13 @@ static void tcg_out_jmp(TCGContext *s, tcg_insn_unit *dest)
     tcg_out_branch(s, 0, dest);
 }
 
+static inline void tcg_out_fence(TCGContext *s, uint8_t op)
+{
+        tcg_out8(s, 0x0f);
+        tcg_out8(s, 0xae);
+        tcg_out8(s, op);
+}
+
 #if defined(CONFIG_SOFTMMU)
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  *                                     int mmu_idx, uintptr_t ra)
@@ -2088,6 +2095,16 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         }
         break;
 
+    case INDEX_op_fence_load:
+        tcg_out_fence(s, 0xe8);
+        break;
+    case INDEX_op_fence_full:
+        tcg_out_fence(s, 0xf0);
+        break;
+    case INDEX_op_fence_store:
+        tcg_out_fence(s, 0xf8);
+        break;
+
     case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
     case INDEX_op_mov_i64:
     case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
@@ -2226,6 +2243,9 @@ static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
     { INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
 #endif
+    { INDEX_op_fence_load, { } },
+    { INDEX_op_fence_store, { } },
+    { INDEX_op_fence_full, { } },
     { -1 },
 };
 
-- 
1.9.1


* [Qemu-devel] [RFC 21/38] target-i386: emulate atomic instructions + barriers using AIE
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad
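
The LOCK prefix no longer takes a global mutex. gen_helper_lock_enable()
arms per-instruction locking at decode time, the first locked load takes
a host lock derived from the guest address (gen_helper_aie_ld_pre), and
gen_helper_lock_disable() releases it once the instruction completes;
raise_interrupt2() drops the lock on the exception path. The
address-to-lock mapping works roughly as follows (a sketch only:
constants and names here are assumptions, the real code is in the AIE
patches earlier in this series):

    #include <stdint.h>
    #include <pthread.h>

    #define AIE_LINE_SHIFT 6            /* 64-byte guest cache lines */
    #define AIE_NR_LOCKS   (1 << 12)    /* power of two */

    static pthread_spinlock_t aie_locks[AIE_NR_LOCKS];

    /* hash the guest address's cache line to one host spinlock */
    static pthread_spinlock_t *aie_lock_for(uint64_t guest_addr)
    {
        uint64_t line = guest_addr >> AIE_LINE_SHIFT;

        return &aie_locks[line & (AIE_NR_LOCKS - 1)];
    }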

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 aie-helper.c              |   3 +-
 linux-user/main.c         |   4 +-
 target-i386/cpu.h         |   3 -
 target-i386/excp_helper.c |   7 ++
 target-i386/helper.h      |   6 +-
 target-i386/mem_helper.c  |  39 +++------
 target-i386/translate.c   | 219 ++++++++++++++++++++++++++++------------------
 7 files changed, 164 insertions(+), 117 deletions(-)

diff --git a/aie-helper.c b/aie-helper.c
index 7521150..a3faf04 100644
--- a/aie-helper.c
+++ b/aie-helper.c
@@ -82,7 +82,8 @@ void HELPER(aie_unlock__done)(CPUArchState *env)
 
 void HELPER(aie_ld_pre)(CPUArchState *env, target_ulong vaddr)
 {
-    if (likely(!env->aie_lock_enabled) || env->aie_locked) {
+    assert(env->aie_lock_enabled);
+    if (env->aie_locked) {
         return;
     }
     aie_ld_lock_ret(env, vaddr, GETRA());
diff --git a/linux-user/main.c b/linux-user/main.c
index fd06ce9..98ebe19 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -279,9 +279,9 @@ void cpu_loop(CPUX86State *env)
     target_siginfo_t info;
 
     for(;;) {
-        cpu_exec_start(cs);
+        cs->running = true;
         trapnr = cpu_x86_exec(cs);
-        cpu_exec_end(cs);
+        cs->running = false;
         switch(trapnr) {
         case 0x80:
             /* linux syscall from int $0x80 */
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 3655ff3..ead2832 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -1318,9 +1318,6 @@ static inline MemTxAttrs cpu_get_mem_attrs(CPUX86State *env)
 void cpu_set_mxcsr(CPUX86State *env, uint32_t val);
 void cpu_set_fpuc(CPUX86State *env, uint16_t val);
 
-/* mem_helper.c */
-void helper_lock_init(void);
-
 /* svm_helper.c */
 void cpu_svm_check_intercept_param(CPUX86State *env1, uint32_t type,
                                    uint64_t param);
diff --git a/target-i386/excp_helper.c b/target-i386/excp_helper.c
index 99fca84..141cab4 100644
--- a/target-i386/excp_helper.c
+++ b/target-i386/excp_helper.c
@@ -96,6 +96,13 @@ static void QEMU_NORETURN raise_interrupt2(CPUX86State *env, int intno,
 {
     CPUState *cs = CPU(x86_env_get_cpu(env));
 
+    if (unlikely(env->aie_locked)) {
+        helper_aie_unlock__done(env);
+    }
+    if (unlikely(env->aie_lock_enabled)) {
+        env->aie_lock_enabled = false;
+    }
+
     if (!is_int) {
         cpu_svm_check_intercept_param(env, SVM_EXIT_EXCP_BASE + intno,
                                       error_code);
diff --git a/target-i386/helper.h b/target-i386/helper.h
index 74308f4..7d92140 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -1,8 +1,8 @@
 DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 
-DEF_HELPER_0(lock, void)
-DEF_HELPER_0(unlock, void)
+DEF_HELPER_1(lock_enable, void, env)
+DEF_HELPER_1(lock_disable, void, env)
 DEF_HELPER_3(write_eflags, void, env, tl, i32)
 DEF_HELPER_1(read_eflags, tl, env)
 DEF_HELPER_2(divb_AL, void, env, tl)
@@ -217,3 +217,5 @@ DEF_HELPER_3(rcrl, tl, env, tl, tl)
 DEF_HELPER_3(rclq, tl, env, tl, tl)
 DEF_HELPER_3(rcrq, tl, env, tl, tl)
 #endif
+
+#include "qemu/aie-helper.h"
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index 8bf0da2..60abc8a 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -21,38 +21,21 @@
 #include "exec/helper-proto.h"
 #include "exec/cpu_ldst.h"
 
-/* broken thread support */
+#include "aie-helper.c"
 
-#if defined(CONFIG_USER_ONLY)
-QemuMutex global_cpu_lock;
-
-void helper_lock(void)
-{
-    qemu_mutex_lock(&global_cpu_lock);
-}
-
-void helper_unlock(void)
-{
-    qemu_mutex_unlock(&global_cpu_lock);
-}
-
-void helper_lock_init(void)
-{
-    qemu_mutex_init(&global_cpu_lock);
-}
-#else
-void helper_lock(void)
+void helper_lock_enable(CPUX86State *env)
 {
+    env->aie_lock_enabled = true;
 }
 
-void helper_unlock(void)
-{
-}
-
-void helper_lock_init(void)
+void helper_lock_disable(CPUX86State *env)
 {
+    assert(env->aie_lock_enabled);
+    if (env->aie_locked) {
+        h_aie_unlock__done(env);
+    }
+    env->aie_lock_enabled = false;
 }
-#endif
 
 void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
 {
@@ -60,6 +43,7 @@ void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
     int eflags;
 
     eflags = cpu_cc_compute_all(env, CC_OP);
+    aie_ld_lock_ret(env, a0, GETRA());
     d = cpu_ldq_data(env, a0);
     if (d == (((uint64_t)env->regs[R_EDX] << 32) | (uint32_t)env->regs[R_EAX])) {
         cpu_stq_data(env, a0, ((uint64_t)env->regs[R_ECX] << 32) | (uint32_t)env->regs[R_EBX]);
@@ -71,6 +55,7 @@ void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
         env->regs[R_EAX] = (uint32_t)d;
         eflags &= ~CC_Z;
     }
+    helper_aie_unlock__done(env);
     CC_SRC = eflags;
 }
 
@@ -84,6 +69,7 @@ void helper_cmpxchg16b(CPUX86State *env, target_ulong a0)
         raise_exception(env, EXCP0D_GPF);
     }
     eflags = cpu_cc_compute_all(env, CC_OP);
+    aie_ld_lock_ret(env, a0, GETRA());
     d0 = cpu_ldq_data(env, a0);
     d1 = cpu_ldq_data(env, a0 + 8);
     if (d0 == env->regs[R_EAX] && d1 == env->regs[R_EDX]) {
@@ -98,6 +84,7 @@ void helper_cmpxchg16b(CPUX86State *env, target_ulong a0)
         env->regs[R_EAX] = d0;
         eflags &= ~CC_Z;
     }
+    helper_aie_unlock__done(env);
     CC_SRC = eflags;
 }
 #endif
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 443bf60..4d6030f 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -300,6 +300,48 @@ static inline bool byte_reg_is_xH(int reg)
     return true;
 }
 
+static inline void gen_i386_ld_i32(DisasContext *s, TCGv_i32 val, TCGv addr,
+                                   TCGArg idx, TCGMemOp op)
+{
+    if (s->prefix & PREFIX_LOCK) {
+        gen_helper_aie_ld_pre(cpu_env, addr);
+    }
+    tcg_gen_qemu_ld_i32(val, addr, idx, op);
+}
+
+static inline void gen_i386_ld_i64(DisasContext *s, TCGv_i64 val, TCGv addr,
+                                   TCGArg idx, TCGMemOp op)
+{
+    if (s->prefix & PREFIX_LOCK) {
+        gen_helper_aie_ld_pre(cpu_env, addr);
+    }
+    tcg_gen_qemu_ld_i64(val, addr, idx, op);
+}
+
+static inline
+void gen_i386_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp op)
+{
+    gen_helper_aie_st_pre(cpu_env, addr);
+    tcg_gen_qemu_st_i32(val, addr, idx, op);
+    gen_helper_aie_st_post(cpu_env, addr);
+}
+
+static inline
+void gen_i386_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp op)
+{
+    gen_helper_aie_st_pre(cpu_env, addr);
+    tcg_gen_qemu_st_i64(val, addr, idx, op);
+    gen_helper_aie_st_post(cpu_env, addr);
+}
+
+#if TARGET_LONG_BITS == 32
+#define gen_i386_ld_tl  gen_i386_ld_i32
+#define gen_i386_st_tl  gen_i386_st_i32
+#else
+#define gen_i386_ld_tl  gen_i386_ld_i64
+#define gen_i386_st_tl  gen_i386_st_i64
+#endif
+
 /* Select the size of a push/pop operation.  */
 static inline TCGMemOp mo_pushpop(DisasContext *s, TCGMemOp ot)
 {
@@ -479,11 +521,23 @@ static inline void gen_op_addq_A0_reg_sN(int shift, int reg)
 
 static inline void gen_op_ld_v(DisasContext *s, int idx, TCGv t0, TCGv a0)
 {
+    gen_i386_ld_tl(s, t0, a0, s->mem_index, idx | MO_LE);
+}
+
+static inline
+void gen_op_ld_v_nolock(DisasContext *s, int idx, TCGv t0, TCGv a0)
+{
     tcg_gen_qemu_ld_tl(t0, a0, s->mem_index, idx | MO_LE);
 }
 
 static inline void gen_op_st_v(DisasContext *s, int idx, TCGv t0, TCGv a0)
 {
+    gen_i386_st_tl(t0, a0, s->mem_index, idx | MO_LE);
+}
+
+static inline
+void gen_op_st_v_nolock(DisasContext *s, int idx, TCGv t0, TCGv a0)
+{
     tcg_gen_qemu_st_tl(t0, a0, s->mem_index, idx | MO_LE);
 }
 
@@ -2587,23 +2641,23 @@ static void gen_jmp(DisasContext *s, target_ulong eip)
 
 static inline void gen_ldq_env_A0(DisasContext *s, int offset)
 {
-    tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
+    gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
     tcg_gen_st_i64(cpu_tmp1_i64, cpu_env, offset);
 }
 
 static inline void gen_stq_env_A0(DisasContext *s, int offset)
 {
     tcg_gen_ld_i64(cpu_tmp1_i64, cpu_env, offset);
-    tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
+    gen_i386_st_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
 }
 
 static inline void gen_ldo_env_A0(DisasContext *s, int offset)
 {
     int mem_index = s->mem_index;
-    tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0, mem_index, MO_LEQ);
+    gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0, mem_index, MO_LEQ);
     tcg_gen_st_i64(cpu_tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(0)));
     tcg_gen_addi_tl(cpu_tmp0, cpu_A0, 8);
-    tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_tmp0, mem_index, MO_LEQ);
+    gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_tmp0, mem_index, MO_LEQ);
     tcg_gen_st_i64(cpu_tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(1)));
 }
 
@@ -2611,10 +2665,10 @@ static inline void gen_sto_env_A0(DisasContext *s, int offset)
 {
     int mem_index = s->mem_index;
     tcg_gen_ld_i64(cpu_tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(0)));
-    tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0, mem_index, MO_LEQ);
+    gen_i386_st_i64(cpu_tmp1_i64, cpu_A0, mem_index, MO_LEQ);
     tcg_gen_addi_tl(cpu_tmp0, cpu_A0, 8);
     tcg_gen_ld_i64(cpu_tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(1)));
-    tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_tmp0, mem_index, MO_LEQ);
+    gen_i386_st_i64(cpu_tmp1_i64, cpu_tmp0, mem_index, MO_LEQ);
 }
 
 static inline void gen_op_movo(int d_offset, int s_offset)
@@ -3643,14 +3697,14 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                         break;
                     case 0x21: case 0x31: /* pmovsxbd, pmovzxbd */
                     case 0x24: case 0x34: /* pmovsxwq, pmovzxwq */
-                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                         tcg_gen_st_i32(cpu_tmp2_i32, cpu_env, op2_offset +
                                         offsetof(XMMReg, XMM_L(0)));
                         break;
                     case 0x22: case 0x32: /* pmovsxbq, pmovzxbq */
-                        tcg_gen_qemu_ld_tl(cpu_tmp0, cpu_A0,
-                                           s->mem_index, MO_LEUW);
+                        gen_i386_ld_tl(s, cpu_tmp0, cpu_A0, s->mem_index,
+                                       MO_LEUW);
                         tcg_gen_st16_tl(cpu_tmp0, cpu_env, op2_offset +
                                         offsetof(XMMReg, XMM_W(0)));
                         break;
@@ -3738,12 +3792,12 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
 
                 gen_lea_modrm(env, s, modrm);
                 if ((b & 1) == 0) {
-                    tcg_gen_qemu_ld_tl(cpu_T[0], cpu_A0,
-                                       s->mem_index, ot | MO_BE);
+                    gen_i386_ld_tl(s, cpu_T[0], cpu_A0, s->mem_index,
+                                   ot | MO_BE);
                     gen_op_mov_reg_v(ot, reg, cpu_T[0]);
                 } else {
-                    tcg_gen_qemu_st_tl(cpu_regs[reg], cpu_A0,
-                                       s->mem_index, ot | MO_BE);
+                    gen_i386_st_tl(cpu_regs[reg], cpu_A0,
+                                   s->mem_index, ot | MO_BE);
                 }
                 break;
 
@@ -4079,8 +4133,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                     if (mod == 3) {
                         gen_op_mov_reg_v(ot, rm, cpu_T[0]);
                     } else {
-                        tcg_gen_qemu_st_tl(cpu_T[0], cpu_A0,
-                                           s->mem_index, MO_UB);
+                        gen_i386_st_tl(cpu_T[0], cpu_A0, s->mem_index, MO_UB);
                     }
                     break;
                 case 0x15: /* pextrw */
@@ -4089,8 +4142,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                     if (mod == 3) {
                         gen_op_mov_reg_v(ot, rm, cpu_T[0]);
                     } else {
-                        tcg_gen_qemu_st_tl(cpu_T[0], cpu_A0,
-                                           s->mem_index, MO_LEUW);
+                        gen_i386_st_tl(cpu_T[0], cpu_A0, s->mem_index, MO_LEUW);
                     }
                     break;
                 case 0x16:
@@ -4101,8 +4153,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                         if (mod == 3) {
                             tcg_gen_extu_i32_tl(cpu_regs[rm], cpu_tmp2_i32);
                         } else {
-                            tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
-                                                s->mem_index, MO_LEUL);
+                            gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
+                                            s->mem_index, MO_LEUL);
                         }
                     } else { /* pextrq */
 #ifdef TARGET_X86_64
@@ -4112,8 +4164,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                         if (mod == 3) {
                             tcg_gen_mov_i64(cpu_regs[rm], cpu_tmp1_i64);
                         } else {
-                            tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0,
-                                                s->mem_index, MO_LEQ);
+                            gen_i386_st_i64(cpu_tmp1_i64, cpu_A0,
+                                            s->mem_index, MO_LEQ);
                         }
 #else
                         goto illegal_op;
@@ -4126,16 +4178,15 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                     if (mod == 3) {
                         gen_op_mov_reg_v(ot, rm, cpu_T[0]);
                     } else {
-                        tcg_gen_qemu_st_tl(cpu_T[0], cpu_A0,
-                                           s->mem_index, MO_LEUL);
+                        gen_i386_st_tl(cpu_T[0], cpu_A0, s->mem_index, MO_LEUL);
                     }
                     break;
                 case 0x20: /* pinsrb */
                     if (mod == 3) {
                         gen_op_mov_v_reg(MO_32, cpu_T[0], rm);
                     } else {
-                        tcg_gen_qemu_ld_tl(cpu_T[0], cpu_A0,
-                                           s->mem_index, MO_UB);
+                        gen_i386_ld_tl(s, cpu_T[0], cpu_A0, s->mem_index,
+                                       MO_UB);
                     }
                     tcg_gen_st8_tl(cpu_T[0], cpu_env, offsetof(CPUX86State,
                                             xmm_regs[reg].XMM_B(val & 15)));
@@ -4146,8 +4197,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                                         offsetof(CPUX86State,xmm_regs[rm]
                                                 .XMM_L((val >> 6) & 3)));
                     } else {
-                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                     }
                     tcg_gen_st_i32(cpu_tmp2_i32, cpu_env,
                                     offsetof(CPUX86State,xmm_regs[reg]
@@ -4174,8 +4225,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                         if (mod == 3) {
                             tcg_gen_trunc_tl_i32(cpu_tmp2_i32, cpu_regs[rm]);
                         } else {
-                            tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                                s->mem_index, MO_LEUL);
+                            gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                            s->mem_index, MO_LEUL);
                         }
                         tcg_gen_st_i32(cpu_tmp2_i32, cpu_env,
                                         offsetof(CPUX86State,
@@ -4185,8 +4236,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                         if (mod == 3) {
                             gen_op_mov_v_reg(ot, cpu_tmp1_i64, rm);
                         } else {
-                            tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0,
-                                                s->mem_index, MO_LEQ);
+                            gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0,
+                                            s->mem_index, MO_LEQ);
                         }
                         tcg_gen_st_i64(cpu_tmp1_i64, cpu_env,
                                         offsetof(CPUX86State,
@@ -4567,7 +4618,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
 
     /* lock generation */
     if (prefixes & PREFIX_LOCK)
-        gen_helper_lock();
+        gen_helper_lock_enable(cpu_env);
 
     /* now check op code */
  reswitch:
@@ -5567,13 +5618,15 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
         } else {
             gen_lea_modrm(env, s, modrm);
             gen_op_mov_v_reg(ot, cpu_T[0], reg);
-            /* for xchg, lock is implicit */
-            if (!(prefixes & PREFIX_LOCK))
-                gen_helper_lock();
-            gen_op_ld_v(s, ot, cpu_T[1], cpu_A0);
-            gen_op_st_v(s, ot, cpu_T[0], cpu_A0);
+            /*
+             * For xchg, LOCK is implicit: take the line lock unconditionally;
+             * if the prefix was absent, release it here instead of at insn end.
+             */
+            gen_helper_aie_ld_lock(cpu_env, cpu_A0);
+            gen_op_ld_v_nolock(s, ot, cpu_T[1], cpu_A0);
+            gen_op_st_v_nolock(s, ot, cpu_T[0], cpu_A0);
             if (!(prefixes & PREFIX_LOCK))
-                gen_helper_unlock();
+                gen_helper_aie_unlock__done(cpu_env);
             gen_op_mov_reg_v(ot, reg, cpu_T[1]);
         }
         break;
@@ -5724,24 +5777,24 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
 
                     switch(op >> 4) {
                     case 0:
-                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                         gen_helper_flds_FT0(cpu_env, cpu_tmp2_i32);
                         break;
                     case 1:
-                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                         gen_helper_fildl_FT0(cpu_env, cpu_tmp2_i32);
                         break;
                     case 2:
-                        tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0,
-                                            s->mem_index, MO_LEQ);
+                        gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0,
+                                        s->mem_index, MO_LEQ);
                         gen_helper_fldl_FT0(cpu_env, cpu_tmp1_i64);
                         break;
                     case 3:
                     default:
-                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LESW);
+                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LESW);
                         gen_helper_fildl_FT0(cpu_env, cpu_tmp2_i32);
                         break;
                     }
@@ -5763,24 +5816,24 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 case 0:
                     switch(op >> 4) {
                     case 0:
-                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                         gen_helper_flds_ST0(cpu_env, cpu_tmp2_i32);
                         break;
                     case 1:
-                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                         gen_helper_fildl_ST0(cpu_env, cpu_tmp2_i32);
                         break;
                     case 2:
-                        tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0,
-                                            s->mem_index, MO_LEQ);
+                        gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0,
+                                        s->mem_index, MO_LEQ);
                         gen_helper_fldl_ST0(cpu_env, cpu_tmp1_i64);
                         break;
                     case 3:
                     default:
-                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LESW);
+                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LESW);
                         gen_helper_fildl_ST0(cpu_env, cpu_tmp2_i32);
                         break;
                     }
@@ -5790,19 +5843,19 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                     switch(op >> 4) {
                     case 1:
                         gen_helper_fisttl_ST0(cpu_tmp2_i32, cpu_env);
-                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                         break;
                     case 2:
                         gen_helper_fisttll_ST0(cpu_tmp1_i64, cpu_env);
-                        tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0,
-                                            s->mem_index, MO_LEQ);
+                        gen_i386_st_i64(cpu_tmp1_i64, cpu_A0,
+                                        s->mem_index, MO_LEQ);
                         break;
                     case 3:
                     default:
                         gen_helper_fistt_ST0(cpu_tmp2_i32, cpu_env);
-                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUW);
+                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUW);
                         break;
                     }
                     gen_helper_fpop(cpu_env);
@@ -5811,24 +5864,24 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                     switch(op >> 4) {
                     case 0:
                         gen_helper_fsts_ST0(cpu_tmp2_i32, cpu_env);
-                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                         break;
                     case 1:
                         gen_helper_fistl_ST0(cpu_tmp2_i32, cpu_env);
-                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUL);
+                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUL);
                         break;
                     case 2:
                         gen_helper_fstl_ST0(cpu_tmp1_i64, cpu_env);
-                        tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0,
-                                            s->mem_index, MO_LEQ);
+                        gen_i386_st_i64(cpu_tmp1_i64, cpu_A0,
+                                        s->mem_index, MO_LEQ);
                         break;
                     case 3:
                     default:
                         gen_helper_fist_ST0(cpu_tmp2_i32, cpu_env);
-                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
-                                            s->mem_index, MO_LEUW);
+                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
+                                        s->mem_index, MO_LEUW);
                         break;
                     }
                     if ((op & 7) == 3)
@@ -5842,8 +5895,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 gen_helper_fldenv(cpu_env, cpu_A0, tcg_const_i32(dflag - 1));
                 break;
             case 0x0d: /* fldcw mem */
-                tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                    s->mem_index, MO_LEUW);
+                gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0, s->mem_index, MO_LEUW);
                 gen_helper_fldcw(cpu_env, cpu_tmp2_i32);
                 break;
             case 0x0e: /* fnstenv mem */
@@ -5853,8 +5905,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 break;
             case 0x0f: /* fnstcw mem */
                 gen_helper_fnstcw(cpu_tmp2_i32, cpu_env);
-                tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
-                                    s->mem_index, MO_LEUW);
+                gen_i386_st_i32(cpu_tmp2_i32, cpu_A0, s->mem_index, MO_LEUW);
                 break;
             case 0x1d: /* fldt mem */
                 gen_update_cc_op(s);
@@ -5879,8 +5930,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 break;
             case 0x2f: /* fnstsw mem */
                 gen_helper_fnstsw(cpu_tmp2_i32, cpu_env);
-                tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
-                                    s->mem_index, MO_LEUW);
+                gen_i386_st_i32(cpu_tmp2_i32, cpu_A0, s->mem_index, MO_LEUW);
                 break;
             case 0x3c: /* fbld */
                 gen_update_cc_op(s);
@@ -5894,12 +5944,12 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 gen_helper_fpop(cpu_env);
                 break;
             case 0x3d: /* fildll */
-                tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
+                gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
                 gen_helper_fildll_ST0(cpu_env, cpu_tmp1_i64);
                 break;
             case 0x3f: /* fistpll */
                 gen_helper_fistll_ST0(cpu_tmp1_i64, cpu_env);
-                tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
+                gen_i386_st_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
                 gen_helper_fpop(cpu_env);
                 break;
             default:
@@ -7754,8 +7804,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 goto illegal_op;
             gen_lea_modrm(env, s, modrm);
             if (op == 2) {
-                tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
-                                    s->mem_index, MO_LEUL);
+                gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0, s->mem_index, MO_LEUL);
                 gen_helper_ldmxcsr(cpu_env, cpu_tmp2_i32);
             } else {
                 tcg_gen_ld32u_tl(cpu_T[0], cpu_env, offsetof(CPUX86State, mxcsr));
@@ -7763,9 +7812,14 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             }
             break;
        case 5: /* lfence */
        case 6: /* mfence */
            if ((modrm & 0xc7) != 0xc0 || !(s->cpuid_features & CPUID_SSE2))
                goto illegal_op;
+            if (op == 5) {
+                tcg_gen_fence_load();
+            } else {
+                tcg_gen_fence_full();
+            }
            break;
         case 7: /* sfence / clflush */
             if ((modrm & 0xc7) == 0xc0) {
@@ -7773,6 +7827,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 /* XXX: also check for cpuid_ext2_features & CPUID_EXT2_EMMX */
                 if (!(s->cpuid_features & CPUID_SSE))
                     goto illegal_op;
+                tcg_gen_fence_store();
             } else {
                 /* clflush */
                 if (!(s->cpuid_features & CPUID_CLFLUSH))
@@ -7841,11 +7896,11 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
     }
     /* lock generation */
     if (s->prefix & PREFIX_LOCK)
-        gen_helper_unlock();
+        gen_helper_lock_disable(cpu_env);
     return s->pc;
  illegal_op:
     if (s->prefix & PREFIX_LOCK)
-        gen_helper_unlock();
+        gen_helper_lock_disable(cpu_env);
     /* XXX: ensure that no lock was generated */
     gen_exception(s, EXCP06_ILLOP, pc_start - s->cs_base);
     return s->pc;
@@ -7899,8 +7954,6 @@ void optimize_flags_init(void)
                                          offsetof(CPUX86State, regs[i]),
                                          reg_names[i]);
     }
-
-    helper_lock_init();
 }
 
 /* generate intermediate code in gen_opc_buf and gen_opparam_buf for
-- 
1.9.1

* [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (20 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 21/38] target-i386: emulate atomic instructions + barriers using AIE Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-08-24  1:09   ` Paolo Bonzini
  2015-08-24  0:23 ` [Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling Emilio G. Cota
                   ` (17 subsequent siblings)
  39 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad
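
interrupt_request is updated by the CPU's own thread, by other CPU
threads and by the iothread, so the read-modify-write of its bits has
to be atomic; otherwise concurrent |= and &= can silently drop each
other's updates. In C11 terms the change amounts to the following
(a sketch; QEMU's atomic_or()/atomic_and() wrap the equivalent
compiler builtins):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t interrupt_request;

    static void set_bits(uint32_t mask)
    {
        atomic_fetch_or(&interrupt_request, mask);   /* ~ atomic_or()  */
    }

    static void clear_bits(uint32_t mask)
    {
        atomic_fetch_and(&interrupt_request, ~mask); /* ~ atomic_and() */
    }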

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c                         |  9 ++++++---
 exec.c                             |  2 +-
 hw/openrisc/cputimer.c             |  2 +-
 qom/cpu.c                          |  4 ++--
 target-arm/helper-a64.c            |  2 +-
 target-arm/helper.c                |  2 +-
 target-i386/helper.c               |  2 +-
 target-i386/seg_helper.c           | 14 +++++++-------
 target-i386/svm_helper.c           |  4 ++--
 target-openrisc/interrupt_helper.c |  2 +-
 target-openrisc/sys_helper.c       |  2 +-
 target-ppc/excp_helper.c           |  8 ++++----
 target-ppc/helper_regs.h           |  2 +-
 target-s390x/helper.c              |  2 +-
 target-unicore32/softmmu.c         |  2 +-
 translate-all.c                    |  4 ++--
 16 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 2b9a447..fd57b9c 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -463,12 +463,14 @@ int cpu_exec(CPUState *cpu)
                         interrupt_request &= ~CPU_INTERRUPT_SSTEP_MASK;
                     }
                     if (interrupt_request & CPU_INTERRUPT_DEBUG) {
-                        cpu->interrupt_request &= ~CPU_INTERRUPT_DEBUG;
+                        atomic_and(&cpu->interrupt_request,
+                                   ~CPU_INTERRUPT_DEBUG);
                         cpu->exception_index = EXCP_DEBUG;
                         cpu_loop_exit(cpu);
                     }
                     if (interrupt_request & CPU_INTERRUPT_HALT) {
-                        cpu->interrupt_request &= ~CPU_INTERRUPT_HALT;
+                        atomic_and(&cpu->interrupt_request,
+                                   ~CPU_INTERRUPT_HALT);
                         cpu->halted = 1;
                         cpu->exception_index = EXCP_HLT;
                         cpu_loop_exit(cpu);
@@ -495,7 +497,8 @@ int cpu_exec(CPUState *cpu)
                     /* Don't use the cached interrupt_request value,
                        do_interrupt may have updated the EXITTB flag. */
                     if (cpu->interrupt_request & CPU_INTERRUPT_EXITTB) {
-                        cpu->interrupt_request &= ~CPU_INTERRUPT_EXITTB;
+                        atomic_and(&cpu->interrupt_request,
+                                   ~CPU_INTERRUPT_EXITTB);
                         /* ensure that no TB jump will be modified as
                            the program flow was changed */
                         next_tb = 0;
diff --git a/exec.c b/exec.c
index ff79c3d..edf2236 100644
--- a/exec.c
+++ b/exec.c
@@ -445,7 +445,7 @@ static int cpu_common_post_load(void *opaque, int version_id)
 
     /* 0x01 was CPU_INTERRUPT_EXIT. This line can be removed when the
        version_id is increased. */
-    cpu->interrupt_request &= ~0x01;
+    atomic_and(&cpu->interrupt_request, ~0x01);
     tlb_flush(cpu, 1);
 
     return 0;
diff --git a/hw/openrisc/cputimer.c b/hw/openrisc/cputimer.c
index 9c54945..2451698 100644
--- a/hw/openrisc/cputimer.c
+++ b/hw/openrisc/cputimer.c
@@ -85,7 +85,7 @@ static void openrisc_timer_cb(void *opaque)
         CPUState *cs = CPU(cpu);
 
         cpu->env.ttmr |= TTMR_IP;
-        cs->interrupt_request |= CPU_INTERRUPT_TIMER;
+        atomic_or(&cs->interrupt_request, CPU_INTERRUPT_TIMER);
     }
 
     switch (cpu->env.ttmr & TTMR_M) {
diff --git a/qom/cpu.c b/qom/cpu.c
index 3841f0d..ac19710 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -108,7 +108,7 @@ static void cpu_common_get_memory_mapping(CPUState *cpu,
 
 void cpu_reset_interrupt(CPUState *cpu, int mask)
 {
-    cpu->interrupt_request &= ~mask;
+    atomic_and(&cpu->interrupt_request, ~mask);
 }
 
 void cpu_exit(CPUState *cpu)
@@ -242,7 +242,7 @@ static void cpu_common_reset(CPUState *cpu)
         log_cpu_state(cpu, cc->reset_dump_flags);
     }
 
-    cpu->interrupt_request = 0;
+    atomic_set(&cpu->interrupt_request, 0);
     cpu->current_tb = NULL;
     cpu->halted = 0;
     cpu->mem_io_pc = 0;
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 08c95a3..05c4ab6 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -541,6 +541,6 @@ void aarch64_cpu_do_interrupt(CPUState *cs)
     aarch64_restore_sp(env, new_el);
 
     env->pc = addr;
-    cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+    atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
 }
 #endif
diff --git a/target-arm/helper.c b/target-arm/helper.c
index aea5a4b..fc8f5b3 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -4980,7 +4980,7 @@ void arm_cpu_do_interrupt(CPUState *cs)
     }
     env->regs[14] = env->regs[15] + offset;
     env->regs[15] = addr;
-    cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+    atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
 }
 
 
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 5480a96..6339832 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1232,7 +1232,7 @@ void do_cpu_init(X86CPU *cpu)
     *save = *env;
 
     cpu_reset(cs);
-    cs->interrupt_request = sipi;
+    atomic_set(&cs->interrupt_request, sipi);
     memcpy(&env->start_init_save, &save->start_init_save,
            offsetof(CPUX86State, end_init_save) -
            offsetof(CPUX86State, start_init_save));
diff --git a/target-i386/seg_helper.c b/target-i386/seg_helper.c
index 8a4271e..2753cc6 100644
--- a/target-i386/seg_helper.c
+++ b/target-i386/seg_helper.c
@@ -1292,7 +1292,7 @@ bool x86_cpu_exec_interrupt(CPUState *cs, int interrupt_request)
 
 #if !defined(CONFIG_USER_ONLY)
     if (interrupt_request & CPU_INTERRUPT_POLL) {
-        cs->interrupt_request &= ~CPU_INTERRUPT_POLL;
+        atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_POLL);
         apic_poll_irq(cpu->apic_state);
     }
 #endif
@@ -1302,17 +1302,17 @@ bool x86_cpu_exec_interrupt(CPUState *cs, int interrupt_request)
         if ((interrupt_request & CPU_INTERRUPT_SMI) &&
             !(env->hflags & HF_SMM_MASK)) {
             cpu_svm_check_intercept_param(env, SVM_EXIT_SMI, 0);
-            cs->interrupt_request &= ~CPU_INTERRUPT_SMI;
+            atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_SMI);
             do_smm_enter(cpu);
             ret = true;
         } else if ((interrupt_request & CPU_INTERRUPT_NMI) &&
                    !(env->hflags2 & HF2_NMI_MASK)) {
-            cs->interrupt_request &= ~CPU_INTERRUPT_NMI;
+            atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_NMI);
             env->hflags2 |= HF2_NMI_MASK;
             do_interrupt_x86_hardirq(env, EXCP02_NMI, 1);
             ret = true;
         } else if (interrupt_request & CPU_INTERRUPT_MCE) {
-            cs->interrupt_request &= ~CPU_INTERRUPT_MCE;
+            atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_MCE);
             do_interrupt_x86_hardirq(env, EXCP12_MCHK, 0);
             ret = true;
         } else if ((interrupt_request & CPU_INTERRUPT_HARD) &&
@@ -1323,8 +1323,8 @@ bool x86_cpu_exec_interrupt(CPUState *cs, int interrupt_request)
                       !(env->hflags & HF_INHIBIT_IRQ_MASK))))) {
             int intno;
             cpu_svm_check_intercept_param(env, SVM_EXIT_INTR, 0);
-            cs->interrupt_request &= ~(CPU_INTERRUPT_HARD |
-                                       CPU_INTERRUPT_VIRQ);
+            atomic_and(&cs->interrupt_request, ~(CPU_INTERRUPT_HARD |
+                                                 CPU_INTERRUPT_VIRQ));
             intno = cpu_get_pic_interrupt(env);
             qemu_log_mask(CPU_LOG_TB_IN_ASM,
                           "Servicing hardware INT=0x%02x\n", intno);
@@ -1344,7 +1344,7 @@ bool x86_cpu_exec_interrupt(CPUState *cs, int interrupt_request)
             qemu_log_mask(CPU_LOG_TB_IN_ASM,
                           "Servicing virtual hardware INT=0x%02x\n", intno);
             do_interrupt_x86_hardirq(env, intno, 1);
-            cs->interrupt_request &= ~CPU_INTERRUPT_VIRQ;
+            atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_VIRQ);
             ret = true;
 #endif
         }
diff --git a/target-i386/svm_helper.c b/target-i386/svm_helper.c
index f1fabf5..cedc3ef 100644
--- a/target-i386/svm_helper.c
+++ b/target-i386/svm_helper.c
@@ -296,7 +296,7 @@ void helper_vmrun(CPUX86State *env, int aflag, int next_eip_addend)
     if (int_ctl & V_IRQ_MASK) {
         CPUState *cs = CPU(x86_env_get_cpu(env));
 
-        cs->interrupt_request |= CPU_INTERRUPT_VIRQ;
+        atomic_or(&cs->interrupt_request, CPU_INTERRUPT_VIRQ);
     }
 
     /* maybe we need to inject an event */
@@ -665,7 +665,7 @@ void helper_vmexit(CPUX86State *env, uint32_t exit_code, uint64_t exit_info_1)
     env->hflags &= ~HF_SVMI_MASK;
     env->intercept = 0;
     env->intercept_exceptions = 0;
-    cs->interrupt_request &= ~CPU_INTERRUPT_VIRQ;
+    atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_VIRQ);
     env->tsc_offset = 0;
 
     env->gdt.base  = x86_ldq_phys(cs, env->vm_hsave + offsetof(struct vmcb,
diff --git a/target-openrisc/interrupt_helper.c b/target-openrisc/interrupt_helper.c
index 55a780c..300a980 100644
--- a/target-openrisc/interrupt_helper.c
+++ b/target-openrisc/interrupt_helper.c
@@ -54,5 +54,5 @@ void HELPER(rfe)(CPUOpenRISCState *env)
         tlb_flush(cs, 1);
     }
 #endif
-    cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+    atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
 }
diff --git a/target-openrisc/sys_helper.c b/target-openrisc/sys_helper.c
index 53ca6bc..4be72af 100644
--- a/target-openrisc/sys_helper.c
+++ b/target-openrisc/sys_helper.c
@@ -148,7 +148,7 @@ void HELPER(mtspr)(CPUOpenRISCState *env,
                 env->ttmr = (rb & ~TTMR_IP) | ip;
             } else {    /* Clear IP bit.  */
                 env->ttmr = rb & ~TTMR_IP;
-                cs->interrupt_request &= ~CPU_INTERRUPT_TIMER;
+                atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_TIMER);
             }
 
             cpu_openrisc_timer_update(cpu);
diff --git a/target-ppc/excp_helper.c b/target-ppc/excp_helper.c
index b803475..b6e17e9 100644
--- a/target-ppc/excp_helper.c
+++ b/target-ppc/excp_helper.c
@@ -139,7 +139,7 @@ static inline void powerpc_excp(PowerPCCPU *cpu, int excp_model, int excp)
                         "Entering checkstop state\n");
             }
             cs->halted = 1;
-            cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+            atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
         }
         if (0) {
             /* XXX: find a suitable condition to enable the hypervisor mode */
@@ -828,7 +828,7 @@ bool ppc_cpu_exec_interrupt(CPUState *cs, int interrupt_request)
     if (interrupt_request & CPU_INTERRUPT_HARD) {
         ppc_hw_interrupt(env);
         if (env->pending_interrupts == 0) {
-            cs->interrupt_request &= ~CPU_INTERRUPT_HARD;
+            atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_HARD);
         }
         return true;
     }
@@ -872,7 +872,7 @@ void helper_store_msr(CPUPPCState *env, target_ulong val)
     val = hreg_store_msr(env, val, 0);
     if (val != 0) {
         cs = CPU(ppc_env_get_cpu(env));
-        cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+        atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
         helper_raise_exception(env, val);
     }
 }
@@ -906,7 +906,7 @@ static inline void do_rfi(CPUPPCState *env, target_ulong nip, target_ulong msr,
     /* No need to raise an exception here,
      * as rfi is always the last insn of a TB
      */
-    cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+    atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
 }
 
 void helper_rfi(CPUPPCState *env)
diff --git a/target-ppc/helper_regs.h b/target-ppc/helper_regs.h
index 271fddf..3164714 100644
--- a/target-ppc/helper_regs.h
+++ b/target-ppc/helper_regs.h
@@ -85,7 +85,7 @@ static inline int hreg_store_msr(CPUPPCState *env, target_ulong value,
         /* Flush all tlb when changing translation mode */
         tlb_flush(cs, 1);
         excp = POWERPC_EXCP_NONE;
-        cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+        atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
     }
     if (unlikely((env->flags & POWERPC_FLAG_TGPR) &&
                  ((value ^ env->msr) & (1 << MSR_TGPR)))) {
diff --git a/target-s390x/helper.c b/target-s390x/helper.c
index d887006..f11bd08 100644
--- a/target-s390x/helper.c
+++ b/target-s390x/helper.c
@@ -563,7 +563,7 @@ void s390_cpu_do_interrupt(CPUState *cs)
     cs->exception_index = -1;
 
     if (!env->pending_int) {
-        cs->interrupt_request &= ~CPU_INTERRUPT_HARD;
+        atomic_and(&cs->interrupt_request, ~CPU_INTERRUPT_HARD);
     }
 }
 
diff --git a/target-unicore32/softmmu.c b/target-unicore32/softmmu.c
index 9a3786d..9da5287 100644
--- a/target-unicore32/softmmu.c
+++ b/target-unicore32/softmmu.c
@@ -116,7 +116,7 @@ void uc32_cpu_do_interrupt(CPUState *cs)
     /* The PC already points to the proper instruction.  */
     env->regs[30] = env->regs[31];
     env->regs[31] = addr;
-    cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+    atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
 }
 
 static int get_phys_addr_ucv2(CPUUniCore32State *env, uint32_t address,
diff --git a/translate-all.c b/translate-all.c
index f07547e..12eaed7 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1596,7 +1596,7 @@ static void tcg_handle_interrupt(CPUState *cpu, int mask)
     int old_mask;
 
     old_mask = cpu->interrupt_request;
-    cpu->interrupt_request |= mask;
+    atomic_or(&cpu->interrupt_request, mask);
 
     /*
      * If called from iothread context, wake the target cpu in
@@ -1790,7 +1790,7 @@ void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)
 
 void cpu_interrupt(CPUState *cpu, int mask)
 {
-    cpu->interrupt_request |= mask;
+    atomic_or(&cpu->interrupt_request, mask);
     cpu->tcg_exit_req = 1;
 }
 
-- 
1.9.1

* [Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (21 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-09-09 10:13   ` Paolo Bonzini
  2015-08-24  0:23 ` [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop Emilio G. Cota
                   ` (16 subsequent siblings)
  39 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad
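
cc->cpu_exec_interrupt can call cpu_loop_exit(), which siglongjmps back
to the top of cpu_exec() without unwinding the stack. Rather than
teaching every such path to unlock, record ownership in a per-CPU flag
and release the mutex from the landing site. The shape of the pattern,
as a standalone sketch with illustrative names:

    #include <pthread.h>
    #include <setjmp.h>
    #include <stdbool.h>

    static pthread_mutex_t iothread_mutex = PTHREAD_MUTEX_INITIALIZER;
    static __thread sigjmp_buf loop_buf;
    static __thread bool loop_lock_held;    /* ~ cpu_loop_exit_locked */

    static void loop_exit(void)             /* ~ cpu_loop_exit() */
    {
        siglongjmp(loop_buf, 1);
    }

    void run_loop(void)
    {
        if (sigsetjmp(loop_buf, 0)) {
            /* landed here via loop_exit(): unlock if still owned */
            if (loop_lock_held) {
                loop_lock_held = false;
                pthread_mutex_unlock(&iothread_mutex);
            }
        }
        for (;;) {
            pthread_mutex_lock(&iothread_mutex);
            loop_lock_held = true;
            /* interrupt handling; may call loop_exit() at any point */
            loop_lock_held = false;
            pthread_mutex_unlock(&iothread_mutex);
        }
    }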

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c        | 34 ++++++++++++++++++++++++++++------
 include/qom/cpu.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index fd57b9c..a1700ac 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -371,6 +371,29 @@ static void cpu_handle_debug_exception(CPUState *cpu)
     cc->debug_excp_handler(cpu);
 }
 
+#ifdef CONFIG_SOFTMMU
+static inline void cpu_exit_loop_lock(CPUState *cpu)
+{
+    qemu_mutex_lock_iothread();
+    cpu->cpu_loop_exit_locked = true;
+}
+
+static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
+{
+    if (cpu->cpu_loop_exit_locked) {
+        cpu->cpu_loop_exit_locked = false;
+        qemu_mutex_unlock_iothread();
+    }
+}
+
+#else
+static inline void cpu_exit_loop_lock(CPUState *cpu)
+{ }
+
+static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
+{ }
+#endif
+
 /* main execution loop */
 
 int cpu_exec(CPUState *cpu)
@@ -452,12 +475,8 @@ int cpu_exec(CPUState *cpu)
             for(;;) {
                 interrupt_request = cpu->interrupt_request;
                 if (unlikely(interrupt_request)) {
-                    /* FIXME: this needs to take the iothread lock.
-                     * For this we need to find all places in
-                     * cc->cpu_exec_interrupt that can call cpu_loop_exit,
-                     * and call qemu_unlock_iothread_mutex() there.  Else,
-                     * add a flag telling cpu_loop_exit() to unlock it.
-                     */
+                    cpu_exit_loop_lock(cpu);
+
                     if (unlikely(cpu->singlestep_enabled & SSTEP_NOIRQ)) {
                         /* Mask out external interrupts for this step. */
                         interrupt_request &= ~CPU_INTERRUPT_SSTEP_MASK;
@@ -503,6 +522,8 @@ int cpu_exec(CPUState *cpu)
                            the program flow was changed */
                         next_tb = 0;
                     }
+
+                    cpu_exit_loop_lock_reset(cpu);
                 }
                 if (unlikely(cpu->exit_request)) {
                     cpu->exception_index = EXCP_INTERRUPT;
@@ -609,6 +630,7 @@ int cpu_exec(CPUState *cpu)
             env = &x86_cpu->env;
 #endif
             tb_lock_reset();
+            cpu_exit_loop_lock_reset(cpu);
         }
     } /* for(;;) */
 
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 1d97b63..dbe0438 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -270,6 +270,7 @@ struct CPUState {
     bool created;
     bool stop;
     bool stopped;
+    bool cpu_loop_exit_locked;
     volatile sig_atomic_t exit_request;
     uint32_t interrupt_request;
     int singlestep_enabled;
-- 
1.9.1

* [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (22 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-08-24  2:01   ` Paolo Bonzini
  2015-08-24  0:23 ` [Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req Emilio G. Cota
                   ` (15 subsequent siblings)
  39 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Otherwise an exception raised while mmap_lock is held siglongjmps out
of the CPU loop without releasing it, and we end up in a deadlock.
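
mmap_lock()/mmap_unlock() emulate a recursive lock by counting nested
acquisitions around a plain mutex, which is why the reset only has to
zero the count and unlock once. For reference, the scheme being unwound
(mirroring linux-user/mmap.c as of this series):

    static pthread_mutex_t mmap_mutex = PTHREAD_MUTEX_INITIALIZER;
    static int mmap_lock_count;

    void mmap_lock(void)
    {
        if (mmap_lock_count++ == 0) {
            pthread_mutex_lock(&mmap_mutex);
        }
    }

    void mmap_unlock(void)
    {
        if (--mmap_lock_count == 0) {
            pthread_mutex_unlock(&mmap_mutex);
        }
    }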

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 bsd-user/mmap.c         | 12 ++++++++++++
 cpu-exec.c              |  1 +
 include/exec/exec-all.h |  2 ++
 linux-user/mmap.c       |  8 ++++++++
 4 files changed, 23 insertions(+)

diff --git a/bsd-user/mmap.c b/bsd-user/mmap.c
index 092bf7f..b37a8f5 100644
--- a/bsd-user/mmap.c
+++ b/bsd-user/mmap.c
@@ -48,6 +48,14 @@ void mmap_unlock(void)
     }
 }
 
+void mmap_lock_reset(void)
+{
+    if (mmap_lock_count) {
+        mmap_lock_count = 0;
+        pthread_mutex_unlock(&mmap_mutex);
+    }
+}
+
 /* Grab lock to make sure things are in a consistent state after fork().  */
 void mmap_fork_start(void)
 {
@@ -72,6 +80,10 @@ void mmap_lock(void)
 void mmap_unlock(void)
 {
 }
+
+void mmap_lock_reset(void)
+{
+}
 #endif
 
 /* NOTE: all the constants are the HOST ones, but addresses are target. */
diff --git a/cpu-exec.c b/cpu-exec.c
index a1700ac..f758928 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -630,6 +630,7 @@ int cpu_exec(CPUState *cpu)
             env = &x86_cpu->env;
 #endif
             tb_lock_reset();
+            mmap_lock_reset();
             cpu_exit_loop_lock_reset(cpu);
         }
     } /* for(;;) */
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index b1934bf..3b8399a 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -334,6 +334,7 @@ void tlb_fill(CPUState *cpu, target_ulong addr, int is_write, int mmu_idx,
 #if defined(CONFIG_USER_ONLY)
 void mmap_lock(void);
 void mmap_unlock(void);
+void mmap_lock_reset(void);
 
 static inline tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
 {
@@ -342,6 +343,7 @@ static inline tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong
 #else
 static inline void mmap_lock(void) {}
 static inline void mmap_unlock(void) {}
+static inline void mmap_lock_reset(void) {}
 
 /* cputlb.c */
 tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr);
diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index 78e1b2d..8ee80f5 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -51,6 +51,14 @@ void mmap_unlock(void)
     }
 }
 
+void mmap_lock_reset(void)
+{
+    if (mmap_lock_count) {
+        mmap_lock_count = 0;
+        pthread_mutex_unlock(&mmap_mutex);
+    }
+}
+
 /* Grab lock to make sure things are in a consistent state after fork().  */
 void mmap_fork_start(void)
 {
-- 
1.9.1

* [Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (23 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-08-24  2:01   ` Paolo Bonzini
  2015-08-24  0:23 ` [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock Emilio G. Cota
                   ` (14 subsequent siblings)
  39 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad
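
This is the classic message-passing pairing: the update of
interrupt_request must become visible before the store to tcg_exit_req,
so that a reader observing tcg_exit_req != 0 also observes the new
interrupt_request bits. In portable C11 form (a sketch; QEMU's
smp_wmb()/smp_rmb() are the lighter store-store/load-load analogues of
the fences used below):

    #include <stdatomic.h>

    static _Atomic unsigned irq;    /* ~ cpu->interrupt_request */
    static _Atomic int exit_req;    /* ~ cpu->tcg_exit_req      */

    static void producer(unsigned mask)
    {
        atomic_fetch_or_explicit(&irq, mask, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);     /* ~ smp_wmb() */
        atomic_store_explicit(&exit_req, 1, memory_order_relaxed);
    }

    static unsigned consumer(void)
    {
        if (atomic_load_explicit(&exit_req, memory_order_relaxed)) {
            atomic_thread_fence(memory_order_acquire); /* ~ smp_rmb() */
            return atomic_load_explicit(&irq, memory_order_relaxed);
        }
        return 0;
    }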

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/gen-icount.h | 1 +
 translate-all.c           | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/exec/gen-icount.h b/include/exec/gen-icount.h
index 05d89d3..f429821 100644
--- a/include/exec/gen-icount.h
+++ b/include/exec/gen-icount.h
@@ -16,6 +16,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
 
     exitreq_label = gen_new_label();
     flag = tcg_temp_new_i32();
+    tcg_gen_smp_rmb();
     tcg_gen_ld_i32(flag, cpu_env,
                    offsetof(CPUState, tcg_exit_req) - ENV_OFFSET);
     tcg_gen_brcondi_i32(TCG_COND_NE, flag, 0, exitreq_label);
diff --git a/translate-all.c b/translate-all.c
index 12eaed7..76a0be8 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1614,6 +1614,7 @@ static void tcg_handle_interrupt(CPUState *cpu, int mask)
             cpu_abort(cpu, "Raised interrupt while not in I/O function");
         }
     } else {
+        smp_wmb();
         cpu->tcg_exit_req = 1;
     }
 }
@@ -1791,6 +1792,7 @@ void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)
 void cpu_interrupt(CPUState *cpu, int mask)
 {
     atomic_or(&cpu->interrupt_request, mask);
+    smp_wmb();
     cpu->tcg_exit_req = 1;
 }
 
-- 
1.9.1

* [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (24 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-08-24  1:14   ` Paolo Bonzini
  2015-09-04  8:50   ` Paolo Bonzini
  2015-08-24  0:23 ` [Qemu-devel] [RFC 27/38] cpu-exec: convert tb_invalidated_flag into a per-TB flag Emilio G. Cota
                   ` (13 subsequent siblings)
  39 siblings, 2 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

This paves the way for a lockless tb_find_fast.
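
Writers bump the sequence to an odd value before touching the cache and
back to even afterwards; readers retry whenever the sequence was odd or
changed under them, so the fast path takes no lock at all. The
discipline in miniature, with C11 atomics (illustration only; the real
API is in include/qemu/seqlock.h):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint seq;    /* even: stable, odd: write in progress */
        int data;           /* stand-in for a tb_jmp_cache entry */
    } SeqExample;

    int seq_read(SeqExample *s)
    {
        unsigned v;
        int d;

        do {
            while ((v = atomic_load(&s->seq)) & 1) {
                /* writer active: spin */
            }
            d = s->data;    /* strictly, this too should be atomic */
        } while (atomic_load(&s->seq) != v);
        return d;
    }

    void seq_write(SeqExample *s, int d)    /* caller holds write lock */
    {
        atomic_fetch_add(&s->seq, 1);       /* now odd */
        s->data = d;
        atomic_fetch_add(&s->seq, 1);       /* even again */
    }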

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c        |  8 +++++++-
 exec.c            |  2 ++
 include/qom/cpu.h | 15 +++++++++++++++
 qom/cpu.c         |  2 +-
 translate-all.c   | 32 +++++++++++++++++++++++++++++++-
 5 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index f758928..5ad578d 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -334,7 +334,9 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
     }
 
     /* we add the TB in the virtual pc hash table */
+    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
     cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
+    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
     return tb;
 }
 
@@ -343,13 +345,17 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu)
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
     target_ulong cs_base, pc;
+    unsigned int version;
     int flags;
 
     /* we record a subset of the CPU state. It will
        always be the same before a given translated block
        is executed. */
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
+    do {
+        version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
+        tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
+    } while (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version));
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
         tb = tb_find_slow(cpu, pc, cs_base, flags);
diff --git a/exec.c b/exec.c
index edf2236..ae6f416 100644
--- a/exec.c
+++ b/exec.c
@@ -577,6 +577,8 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
     int cpu_index;
     Error *local_err = NULL;
 
+    qemu_mutex_init(&cpu->tb_jmp_cache_lock);
+    seqlock_init(&cpu->tb_jmp_cache_sequence, &cpu->tb_jmp_cache_lock);
 #ifndef CONFIG_USER_ONLY
     cpu->as = &address_space_memory;
     cpu->thread_id = qemu_get_thread_id();
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index dbe0438..f383c24 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -27,6 +27,7 @@
 #include "exec/hwaddr.h"
 #include "exec/memattrs.h"
 #include "qemu/queue.h"
+#include "qemu/seqlock.h"
 #include "qemu/thread.h"
 #include "qemu/typedefs.h"
 
@@ -287,6 +288,13 @@ struct CPUState {
 
     void *env_ptr; /* CPUArchState */
     struct TranslationBlock *current_tb;
+    /*
+     * The seqlock here is needed because not all updates are to a single
+     * entry; sometimes we want to atomically clear all entries that belong to
+     * a given page, e.g. when flushing said page.
+     */
+    QemuMutex tb_jmp_cache_lock;
+    QemuSeqLock tb_jmp_cache_sequence;
     struct TranslationBlock *tb_jmp_cache[TB_JMP_CACHE_SIZE];
 
     struct GDBRegisterState *gdb_regs;
@@ -342,6 +350,13 @@ extern struct CPUTailQ cpus;
 
 extern __thread CPUState *current_cpu;
 
+static inline void cpu_tb_jmp_cache_clear(CPUState *cpu)
+{
+    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
+    memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
+    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
+}
+
 /**
  * cpu_paging_enabled:
  * @cpu: The CPU whose state is to be inspected.
diff --git a/qom/cpu.c b/qom/cpu.c
index ac19710..5e72e7a 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -251,7 +251,7 @@ static void cpu_common_reset(CPUState *cpu)
     cpu->icount_decr.u32 = 0;
     cpu->can_do_io = 1;
     cpu->exception_index = -1;
-    memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
+    cpu_tb_jmp_cache_clear(cpu);
 }
 
 static bool cpu_common_has_work(CPUState *cs)
diff --git a/translate-all.c b/translate-all.c
index 76a0be8..668b43a 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -863,7 +863,7 @@ void tb_flush(CPUState *cpu)
     tcg_ctx.tb_ctx.nb_tbs = 0;
 
     CPU_FOREACH(cpu) {
-        memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+        cpu_tb_jmp_cache_clear(cpu);
     }
 
     memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
@@ -988,6 +988,27 @@ static inline void tb_jmp_remove(TranslationBlock *tb, int n)
     }
 }
 
+static inline void tb_jmp_cache_entry_clear(CPUState *cpu, TranslationBlock *tb)
+{
+    unsigned int version;
+    unsigned int h;
+    bool hit = false;
+
+    h = tb_jmp_cache_hash_func(tb->pc);
+    do {
+        version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
+        hit = cpu->tb_jmp_cache[h] == tb;
+    } while (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version));
+
+    if (hit) {
+        seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
+        if (likely(cpu->tb_jmp_cache[h] == tb)) {
+            cpu->tb_jmp_cache[h] = NULL;
+        }
+        seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
+    }
+}
+
 /* invalidate one TB
  *
  * Called with tb_lock held.
@@ -1024,6 +1045,13 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
         invalidate_page_bitmap(p);
     }
 
+    tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
+
+    /* remove the TB from each CPU's tb_jmp_cache */
+    CPU_FOREACH(cpu) {
+        tb_jmp_cache_entry_clear(cpu, tb);
+    }
+
     /* suppress this TB from the two jump lists */
     tb_jmp_remove(tb, 0);
     tb_jmp_remove(tb, 1);
@@ -1707,12 +1735,14 @@ void tb_flush_jmp_cache(CPUState *cpu, target_ulong addr)
     /* Discard jump cache entries for any tb which might potentially
        overlap the flushed page.  */
     i = tb_jmp_cache_hash_page(addr - TARGET_PAGE_SIZE);
+    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
     memset(&cpu->tb_jmp_cache[i], 0,
            TB_JMP_PAGE_SIZE * sizeof(TranslationBlock *));
 
     i = tb_jmp_cache_hash_page(addr);
     memset(&cpu->tb_jmp_cache[i], 0,
            TB_JMP_PAGE_SIZE * sizeof(TranslationBlock *));
+    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
 }
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
-- 
1.9.1

* [Qemu-devel] [RFC 27/38] cpu-exec: convert tb_invalidated_flag into a per-TB flag
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (25 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-08-24  0:23 ` [Qemu-devel] [RFC 28/38] cpu-exec: use RCU to perform lockless TB lookups Emilio G. Cota
                   ` (12 subsequent siblings)
  39 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

This will allow us to safely look up TBs without taking any locks.
Note, however, that tb_lock protects the valid field, so when chaining
is an option we still have to acquire the lock.
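
Concretely, the chaining path ends up as in the cpu-exec.c hunk below
(shown here for clarity, since it is the one lockless-lookup consumer
that still needs tb_lock):

    TranslationBlock *next;

    tb_lock_recursive();
    next = (TranslationBlock *)(next_tb & ~TB_EXIT_MASK);
    /* tb_lock holders are the only writers of ->valid, so a TB that
     * is valid here cannot be invalidated under our feet */
    if (tb->valid && next->valid) {
        tb_add_jump(next, next_tb & TB_EXIT_MASK, tb);
    }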

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 23 +++++++---------------
 include/exec/exec-all.h |  3 +--
 translate-all.c         | 51 +++++++++++++++++--------------------------------
 3 files changed, 25 insertions(+), 52 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 5ad578d..826ec25 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -239,9 +239,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
     tb_lock();
     tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base, orig_tb->flags,
                      max_cycles | CF_NOCACHE);
-    tb->orig_tb = (atomic_mb_read(&tcg_ctx.tb_ctx.tb_invalidated_flag)
-                   ? NULL
-                   : orig_tb);
+    tb->orig_tb = orig_tb->valid ? orig_tb : NULL;
     cpu->current_tb = tb;
     tb_unlock();
 
@@ -268,8 +266,6 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
     tb_page_addr_t phys_pc, phys_page1;
     target_ulong virt_page2;
 
-    atomic_mb_set(&tcg_ctx.tb_ctx.tb_invalidated_flag, 0);
-
     /* find translated block using physical mappings */
     phys_pc = get_page_addr_code(env, pc);
     phys_page1 = phys_pc & TARGET_PAGE_MASK;
@@ -536,15 +532,6 @@ int cpu_exec(CPUState *cpu)
                     cpu_loop_exit(cpu);
                 }
                 tb = tb_find_fast(cpu);
-                /* Note: we do it here to avoid a gcc bug on Mac OS X when
-                   doing it in tb_find_slow */
-                if (atomic_mb_read(&tcg_ctx.tb_ctx.tb_invalidated_flag)) {
-                    /* as some TB could have been invalidated because
-                       of memory exceptions while generating the code, we
-                       must recompute the hash index here */
-                    next_tb = 0;
-                    atomic_mb_set(&tcg_ctx.tb_ctx.tb_invalidated_flag, 0);
-                }
                 if (qemu_loglevel_mask(CPU_LOG_EXEC)) {
                     qemu_log("Trace %p [" TARGET_FMT_lx "] %s\n",
                              tb->tc_ptr, tb->pc, lookup_symbol(tb->pc));
@@ -553,9 +540,13 @@ int cpu_exec(CPUState *cpu)
                    spans two pages, we cannot safely do a direct
                    jump. */
                 if (next_tb != 0 && tb->page_addr[1] == -1) {
+                    TranslationBlock *next;
+
                     tb_lock_recursive();
-                    tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
-                                next_tb & TB_EXIT_MASK, tb);
+                    next = (TranslationBlock *)(next_tb & ~TB_EXIT_MASK);
+                    if (tb->valid && next->valid) {
+                        tb_add_jump(next, next_tb & TB_EXIT_MASK, tb);
+                    }
                 }
                 /* The lock may not be taken if we went through the
                  * fast lookup path and did not have to do any patching.
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 3b8399a..7e4aea7 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -178,6 +178,7 @@ struct TranslationBlock {
        jmp_first */
     struct TranslationBlock *jmp_next[2];
     struct TranslationBlock *jmp_first;
+    bool valid; /* protected by tb_lock */
 };
 
 #include "qemu/thread.h"
@@ -195,8 +196,6 @@ struct TBContext {
     /* statistics */
     int tb_flush_count;
     int tb_phys_invalidate_count;
-
-    int tb_invalidated_flag;
 };
 
 void tb_free(TranslationBlock *tb);
diff --git a/translate-all.c b/translate-all.c
index 668b43a..94adcd0 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -791,6 +791,17 @@ static inline void invalidate_page_bitmap(PageDesc *p)
 #endif
 }
 
+static void tb_invalidate_all(void)
+{
+    int i;
+
+    for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
+        TranslationBlock *tb = &tcg_ctx.tb_ctx.tbs[i];
+
+        tb->valid = false;
+    }
+}
+
 /* Set to NULL all the 'first_tb' fields in all PageDescs. */
 static void page_flush_tb_1(int level, void **lp)
 {
@@ -866,6 +877,7 @@ void tb_flush(CPUState *cpu)
         cpu_tb_jmp_cache_clear(cpu);
     }
 
+    tb_invalidate_all();
     memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
     page_flush_tb();
 
@@ -1021,11 +1033,6 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     tb_page_addr_t phys_pc;
     TranslationBlock *tb1, *tb2;
 
-    /* Set the invalidated_flag first, to block patching a
-     * jump to tb.  FIXME: invalidated_flag should be per TB.
-     */
-    atomic_mb_set(&tcg_ctx.tb_ctx.tb_invalidated_flag, 1);
-
     /* Now remove the TB from the hash list, so that tb_find_slow
      * cannot find it anymore.
      */
@@ -1045,8 +1052,6 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
         invalidate_page_bitmap(p);
     }
 
-    tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
-
     /* remove the TB from each CPU's tb_jmp_cache */
     CPU_FOREACH(cpu) {
         tb_jmp_cache_entry_clear(cpu, tb);
@@ -1070,33 +1075,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     }
     tb->jmp_first = (TranslationBlock *)((uintptr_t)tb | 2); /* fail safe */
 
-#if 0
-    /* TODO: I think this barrier is not necessary.  On the
-     * cpu_exec side, it is okay if the read from tb_jmp_cache
-     * comes after the read from tb_phys_hash.  This is because
-     * the read would be bleeding into the tb_lock critical
-     * section, hence there cannot be any concurrent tb_invalidate.
-     * And if you don't need a barrier there, you shouldn't need
-     * one here, either.
-     */
-     smp_wmb();
-#endif
-
-    /* Finally, remove the TB from the per-CPU cache that is
-     * accessed without tb_lock.  The tb can still be executed
-     * once after returning, if the cache was accessed before
-     * this point, but that's it.
-     *
-     * The cache cannot be filled with this tb anymore, because
-     * the lists are accessed with tb_lock held.
-     */
-    h = tb_jmp_cache_hash_func(tb->pc);
-    CPU_FOREACH(cpu) {
-        if (cpu->tb_jmp_cache[h] == tb) {
-            cpu->tb_jmp_cache[h] = NULL;
-        }
-    }
-
+    tb->valid = false;
     tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
 }
 
@@ -1157,12 +1136,16 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
         tb_flush_safe(cpu);
 #endif
         cpu_loop_exit(cpu);
+        tb_flush(cpu);
+        /* cannot fail at this point */
+        tb = tb_alloc(pc);
     }
 
     tb->tc_ptr = tcg_ctx.code_gen_ptr;
     tb->cs_base = cs_base;
     tb->flags = flags;
     tb->cflags = cflags;
+    tb->valid = true;
     cpu_gen_code(env, tb, &code_gen_size);
     tcg_ctx.code_gen_ptr = (void *)(((uintptr_t)tcg_ctx.code_gen_ptr +
             code_gen_size + CODE_GEN_ALIGN - 1) & ~(CODE_GEN_ALIGN - 1));
-- 
1.9.1

* [Qemu-devel] [RFC 28/38] cpu-exec: use RCU to perform lockless TB lookups
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (26 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 27/38] cpu-exec: convert tb_invalidated_flag into a per-TB flag Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-08-24  0:23 ` [Qemu-devel] [RFC 29/38] tcg: export have_tb_lock Emilio G. Cota
                   ` (11 subsequent siblings)
  39 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Only grab tb_lock when new code has to be generated.

Note that due to the RCU usage we lose the ability to move
recently-found TBs to the beginning of the slot's list.
We could in theory try to do something smart about this,
but given that each CPU has a private tb_jmp_cache, it
might be OK to just leave it alone.
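
The lookup itself then boils down to the following sketch of the
tb_find_physical hunk below (the page_addr checks are omitted here,
and it assumes the caller runs inside an RCU read-side critical
section, as cpu_exec does in this tree):

    TBPhysHashSlot *slot = &tcg_ctx.tb_ctx.tb_phys_hash[h];
    TranslationBlock *tb;

    QLIST_FOREACH_RCU(tb, &slot->list, slot_node) {
        if (tb->pc == pc && tb->cs_base == cs_base && tb->flags == flags) {
            return tb;  /* hit: no tb_lock taken */
        }
    }
    return NULL;        /* miss: caller takes tb_lock and translates */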

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 21 +++++++++-----------
 include/exec/exec-all.h | 12 +++++++++---
 translate-all.c         | 52 ++++++++++++++++++++++++-------------------------
 3 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 826ec25..ff08da8 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -24,6 +24,7 @@
 #include "qemu/atomic.h"
 #include "qemu/timer.h"
 #include "exec/tb-hash.h"
+#include "qemu/rcu_queue.h"
 #include "qemu/rcu.h"
 
 #if !defined(CONFIG_USER_ONLY)
@@ -261,7 +262,8 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
                                           uint64_t flags)
 {
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
-    TranslationBlock *tb, **ptb1;
+    TBPhysHashSlot *slot;
+    TranslationBlock *tb;
     unsigned int h;
     tb_page_addr_t phys_pc, phys_page1;
     target_ulong virt_page2;
@@ -270,12 +272,9 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
     phys_pc = get_page_addr_code(env, pc);
     phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_phys_hash_func(phys_pc);
-    ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    for(;;) {
-        tb = atomic_rcu_read(ptb1);
-        if (!tb) {
-            return NULL;
-        }
+    slot = &tcg_ctx.tb_ctx.tb_phys_hash[h];
+
+    QLIST_FOREACH_RCU(tb, &slot->list, slot_node) {
         if (tb->pc == pc &&
             tb->page_addr[0] == phys_page1 &&
             tb->cs_base == cs_base &&
@@ -288,16 +287,14 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
                     TARGET_PAGE_SIZE;
                 phys_page2 = get_page_addr_code(env, virt_page2);
                 if (tb->page_addr[1] == phys_page2) {
-                    break;
+                    return tb;
                 }
             } else {
-                break;
+                return tb;
             }
         }
-        ptb1 = &tb->phys_hash_next;
     }
-
-    return tb;
+    return NULL;
 }
 
 static TranslationBlock *tb_find_slow(CPUState *cpu,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 7e4aea7..050e820 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -155,8 +155,8 @@ struct TranslationBlock {
 #define CF_USE_ICOUNT  0x20000
 
     void *tc_ptr;    /* pointer to the translated code */
-    /* next matching tb for physical address. */
-    struct TranslationBlock *phys_hash_next;
+    /* list node in slot of physically-indexed hash of translation blocks */
+    QLIST_ENTRY(TranslationBlock) slot_node;
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
     /* first and second physical page containing code. The lower bit
@@ -183,12 +183,18 @@ struct TranslationBlock {
 
 #include "qemu/thread.h"
 
+typedef struct TBPhysHashSlot TBPhysHashSlot;
+
+struct TBPhysHashSlot {
+    QLIST_HEAD(, TranslationBlock) list;
+};
+
 typedef struct TBContext TBContext;
 
 struct TBContext {
 
     TranslationBlock *tbs;
-    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+    TBPhysHashSlot tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/translate-all.c b/translate-all.c
index 94adcd0..df65c83 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -60,6 +60,7 @@
 #include "exec/cputlb.h"
 #include "exec/tb-hash.h"
 #include "translate-all.h"
+#include "qemu/rcu_queue.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
 #include "qemu/aie.h"
@@ -721,6 +722,17 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
+static void tb_ctx_init(void)
+{
+    int i;
+
+    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
+        TBPhysHashSlot *slot = &tcg_ctx.tb_ctx.tb_phys_hash[i];
+
+        QLIST_INIT(&slot->list);
+    }
+}
+
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
    (in bytes) allocated to the translation buffer. Zero means default
    size. */
@@ -731,6 +743,7 @@ void tcg_exec_init(unsigned long tb_size)
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
     tcg_register_jit(tcg_ctx.code_gen_buffer, tcg_ctx.code_gen_buffer_size);
     page_init();
+    tb_ctx_init();
     aie_init();
 #if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_USE_GUEST_BASE)
     /* There's no guest base to take into account, so go ahead and
@@ -878,7 +891,7 @@ void tb_flush(CPUState *cpu)
     }
 
     tb_invalidate_all();
-    memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
+    tb_ctx_init();
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
@@ -898,7 +911,9 @@ static void tb_invalidate_check(target_ulong address)
 
     address &= TARGET_PAGE_MASK;
     for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tb_ctx.tb_phys_hash[i]; tb != NULL; tb = tb->phys_hash_next) {
+        TBPhysHashSlot *slot = &tcg_ctx.tb_ctx.tb_phys_hash[i];
+
+        QLIST_FOREACH_RCU(tb, &slot->list, slot_node) {
             if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
                   address >= tb->pc + tb->size)) {
                 printf("ERROR invalidate: address=" TARGET_FMT_lx
@@ -919,8 +934,9 @@ static void tb_page_check(void)
     int i, flags1, flags2;
 
     for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-                tb = tb->phys_hash_next) {
+        TBPhysHashSlot *slot = &tcg_ctx.tb_ctx.tb_phys_hash[i];
+
+        QLIST_FOREACH_RCU(tb, &slot->list, slot_node) {
             flags1 = page_get_flags(tb->pc);
             flags2 = page_get_flags(tb->pc + tb->size - 1);
             if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
@@ -933,20 +949,6 @@ static void tb_page_check(void)
 
 #endif
 
-static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
-{
-    TranslationBlock *tb1;
-
-    for (;;) {
-        tb1 = *ptb;
-        if (tb1 == tb) {
-            *ptb = tb1->phys_hash_next;
-            break;
-        }
-        ptb = &tb1->phys_hash_next;
-    }
-}
-
 static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
@@ -1029,16 +1031,13 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 {
     CPUState *cpu;
     PageDesc *p;
-    unsigned int h, n1;
-    tb_page_addr_t phys_pc;
+    unsigned int n1;
     TranslationBlock *tb1, *tb2;
 
     /* Now remove the TB from the hash list, so that tb_find_slow
      * cannot find it anymore.
      */
-    phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_phys_hash_func(phys_pc);
-    tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
+    QLIST_REMOVE_RCU(tb, slot_node);
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1485,13 +1484,12 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
     unsigned int h;
-    TranslationBlock **ptb;
+    TBPhysHashSlot *slot;
 
     /* add in the physical hash table */
     h = tb_phys_hash_func(phys_pc);
-    ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tb->phys_hash_next = *ptb;
-    atomic_rcu_set(ptb, tb);
+    slot = &tcg_ctx.tb_ctx.tb_phys_hash[h];
+    QLIST_INSERT_HEAD_RCU(&slot->list, tb, slot_node);
 
     /* add in the page list */
     tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
-- 
1.9.1

* [Qemu-devel] [RFC 29/38] tcg: export have_tb_lock
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (27 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 28/38] cpu-exec: use RCU to perform lockless TB lookups Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-08-24  0:23 ` [Qemu-devel] [RFC 30/38] translate-all: add tb_lock assertions Emilio G. Cota
                   ` (10 subsequent siblings)
  39 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad
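
This makes it possible to assert on tb_lock ownership, e.g. with
assert(have_tb_lock), from any file that includes tcg.h; the
following patch uses this to add such assertions to translate-all.c.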

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 8d30d61..9a873ac 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -599,6 +599,7 @@ void tb_lock(void);
 void tb_unlock(void);
 bool tb_lock_recursive(void);
 void tb_lock_reset(void);
+extern __thread int have_tb_lock;
 
 /* Called with tb_lock held.  */
 static inline void *tcg_malloc(int size)
-- 
1.9.1

* [Qemu-devel] [RFC 30/38] translate-all: add tb_lock assertions
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (28 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 29/38] tcg: export have_tb_lock Emilio G. Cota
@ 2015-08-24  0:23 ` Emilio G. Cota
  2015-08-24  0:24 ` [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode Emilio G. Cota
                   ` (9 subsequent siblings)
  39 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/translate-all.c b/translate-all.c
index df65c83..e7b4a31 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -252,6 +252,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     int64_t ti;
 #endif
 
+    assert(have_tb_lock);
+
 #ifdef CONFIG_PROFILER
     ti = profile_getclock();
 #endif
@@ -442,6 +444,10 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
     void **lp;
     int i;
 
+#ifdef CONFIG_SOFTMMU
+    assert(have_tb_lock);
+#endif
+
     /* Level 1.  Always allocated.  */
     lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
 
@@ -767,6 +773,8 @@ static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
 
+    assert(have_tb_lock);
+
     if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
         (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
          tcg_ctx.code_gen_buffer_max_size) {
@@ -781,6 +789,8 @@ static TranslationBlock *tb_alloc(target_ulong pc)
 /* Called with tb_lock held.  */
 void tb_free(TranslationBlock *tb)
 {
+    assert(have_tb_lock);
+
     /* In practice this is mostly used for single use temporary TB
        Ignore the hard cases and just back up if this TB happens to
        be the last one generated.  */
@@ -933,6 +943,8 @@ static void tb_page_check(void)
     TranslationBlock *tb;
     int i, flags1, flags2;
 
+    assert(have_tb_lock);
+
     for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
         TBPhysHashSlot *slot = &tcg_ctx.tb_ctx.tb_phys_hash[i];
 
@@ -1034,6 +1046,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     unsigned int n1;
     TranslationBlock *tb1, *tb2;
 
+    assert(have_tb_lock);
+
     /* Now remove the TB from the hash list, so that tb_find_slow
      * cannot find it anymore.
      */
@@ -1120,6 +1134,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     target_ulong virt_page2;
     int code_gen_size;
 
+    assert(have_tb_lock);
+
     phys_pc = get_page_addr_code(env, pc);
     if (use_icount) {
         cflags |= CF_USE_ICOUNT;
@@ -1428,6 +1444,10 @@ static inline void tb_alloc_page(TranslationBlock *tb,
     bool page_already_protected;
 #endif
 
+#ifdef CONFIG_SOFTMMU
+    assert(have_tb_lock);
+#endif
+
     tb->page_addr[n] = page_addr;
     p = page_find_alloc(page_addr >> TARGET_PAGE_BITS, 1);
     tb->page_next[n] = p->first_tb;
@@ -1486,6 +1506,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     unsigned int h;
     TBPhysHashSlot *slot;
 
+#ifdef CONFIG_SOFTMMU
+    assert(have_tb_lock);
+#endif
+
     /* add in the physical hash table */
     h = tb_phys_hash_func(phys_pc);
     slot = &tcg_ctx.tb_ctx.tb_phys_hash[h];
@@ -1527,6 +1551,8 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
     uintptr_t v;
     TranslationBlock *tb;
 
+    assert(have_tb_lock);
+
     if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
         return NULL;
     }
@@ -1579,6 +1605,8 @@ void tb_check_watchpoint(CPUState *cpu)
 {
     TranslationBlock *tb;
 
+    assert(have_tb_lock);
+
     tb = tb_find_pc(cpu->mem_io_pc);
     if (tb) {
         /* We can use retranslation to find the PC.  */
-- 
1.9.1

* [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (29 preceding siblings ...)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 30/38] translate-all: add tb_lock assertions Emilio G. Cota
@ 2015-08-24  0:24 ` Emilio G. Cota
  2015-08-24  1:07   ` Paolo Bonzini
  2015-08-24  0:24 ` [Qemu-devel] [RFC 32/38] cpu list: convert to RCU QLIST Emilio G. Cota
                   ` (8 subsequent siblings)
  39 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:24 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Note that user-only uses mmap_lock for this.
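
The change follows the usual locked/unlocked split: the public entry
points keep their names and take tb_lock, while the bodies move to
_locked workers that paths already holding the lock can call directly.
Schematically (names as in the hunks below):

    void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
    {
        tb_lock();
        tb_invalidate_phys_page_fast_locked(start, len);
        tb_unlock();
    }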

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index e7b4a31..8f8c402 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1203,8 +1203,9 @@ void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
  * Called with mmap_lock held for user-mode emulation
  * If called from generated code, iothread mutex must not be held.
  */
-void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
-                                   int is_cpu_write_access)
+static void
+tb_invalidate_phys_page_range_locked(tb_page_addr_t start, tb_page_addr_t end,
+                                     int is_cpu_write_access)
 {
     TranslationBlock *tb, *tb_next, *saved_tb;
     CPUState *cpu = current_cpu;
@@ -1236,7 +1237,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
     /* we remove all the TBs in the range [start, end[ */
     /* XXX: see if in some cases it could be faster to invalidate all
        the code */
-    tb_lock();
     tb = p->first_tb;
     while (tb != NULL) {
         n = (uintptr_t)tb & 3;
@@ -1310,14 +1310,19 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
         cpu_resume_from_signal(cpu, NULL);
     }
 #endif
+}
+
+void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
+                                   int is_cpu_write_access)
+{
+    tb_lock();
+    tb_invalidate_phys_page_range_locked(start, end, is_cpu_write_access);
     tb_unlock();
 }
 
 #ifdef CONFIG_SOFTMMU
-/* len must be <= 8 and start must be a multiple of len.
- * Called via softmmu_template.h, with iothread mutex not held.
- */
-void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
+
+static void tb_invalidate_phys_page_fast_locked(tb_page_addr_t start, int len)
 {
     PageDesc *p;
 
@@ -1352,9 +1357,19 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
         }
     } else {
     do_invalidate:
-        tb_invalidate_phys_page_range(start, start + len, 1);
+        tb_invalidate_phys_page_range_locked(start, start + len, 1);
     }
 }
+
+/* len must be <= 8 and start must be a multiple of len.
+ * Called via softmmu_template.h, with iothread mutex not held.
+ */
+void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
+{
+    tb_lock();
+    tb_invalidate_phys_page_fast_locked(start, len);
+    tb_unlock();
+}
 #else
 /* Called with mmap_lock held.  */
 static void tb_invalidate_phys_page(tb_page_addr_t addr,
-- 
1.9.1

* [Qemu-devel] [RFC 32/38] cpu list: convert to RCU QLIST
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (30 preceding siblings ...)
  2015-08-24  0:24 ` [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode Emilio G. Cota
@ 2015-08-24  0:24 ` Emilio G. Cota
  2015-08-24  0:24 ` [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep Emilio G. Cota
                   ` (7 subsequent siblings)
  39 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:24 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

This avoids reading a corrupted list of CPUs in usermode, where guest
threads can concurrently add and remove CPUs.

Note: this breaks hw/ppc/spapr due to the removal of CPU_FOREACH_REVERSE.
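
Readers are now expected to iterate over the list under RCU. A minimal
sketch (do_something is a placeholder; it also assumes the caller is
not already inside an RCU read-side critical section):

    CPUState *cpu;

    rcu_read_lock();
    CPU_FOREACH(cpu) {      /* now expands to QLIST_FOREACH_RCU */
        do_something(cpu);  /* cpu cannot be freed while the RCU
                               read lock is held */
    }
    rcu_read_unlock();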

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 exec.c               | 16 ++++++++++++++--
 include/qom/cpu.h    | 15 +++++++--------
 linux-user/main.c    |  2 +-
 linux-user/syscall.c |  2 +-
 4 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/exec.c b/exec.c
index ae6f416..58cd096 100644
--- a/exec.c
+++ b/exec.c
@@ -87,7 +87,7 @@ static MemoryRegion io_mem_unassigned;
 
 #endif
 
-struct CPUTailQ cpus = QTAILQ_HEAD_INITIALIZER(cpus);
+struct CPUTailQ cpus = QLIST_HEAD_INITIALIZER(cpus);
 /* current CPU in the current thread. It is only valid inside
    cpu_exec() */
 __thread CPUState *current_cpu;
@@ -596,7 +596,19 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
 #endif
         return;
     }
-    QTAILQ_INSERT_TAIL(&cpus, cpu, node);
+    /* poor man's QLIST_INSERT_TAIL_RCU */
+    if (QLIST_EMPTY_RCU(&cpus)) {
+        QLIST_INSERT_HEAD_RCU(&cpus, cpu, node);
+    } else {
+        CPUState *some_cpu;
+
+        CPU_FOREACH(some_cpu) {
+            if (QLIST_NEXT_RCU(some_cpu, node) == NULL) {
+                QLIST_INSERT_AFTER_RCU(some_cpu, cpu, node);
+                break;
+            }
+        }
+    }
 #if defined(CONFIG_USER_ONLY)
     cpu_list_unlock();
 #endif
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index f383c24..ab484be 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -30,6 +30,7 @@
 #include "qemu/seqlock.h"
 #include "qemu/thread.h"
 #include "qemu/typedefs.h"
+#include "qemu/rcu_queue.h"
 
 typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
                                      void *opaque);
@@ -300,7 +301,7 @@ struct CPUState {
     struct GDBRegisterState *gdb_regs;
     int gdb_num_regs;
     int gdb_num_g_regs;
-    QTAILQ_ENTRY(CPUState) node;
+    QLIST_ENTRY(CPUState) node;
 
     /* ice debug support */
     QTAILQ_HEAD(breakpoints_head, CPUBreakpoint) breakpoints;
@@ -338,15 +339,13 @@ struct CPUState {
     volatile sig_atomic_t tcg_exit_req;
 };
 
-QTAILQ_HEAD(CPUTailQ, CPUState);
+QLIST_HEAD(CPUTailQ, CPUState);
 extern struct CPUTailQ cpus;
-#define CPU_NEXT(cpu) QTAILQ_NEXT(cpu, node)
-#define CPU_FOREACH(cpu) QTAILQ_FOREACH(cpu, &cpus, node)
+#define CPU_NEXT(cpu) QLIST_NEXT_RCU(cpu, node)
+#define CPU_FOREACH(cpu) QLIST_FOREACH_RCU(cpu, &cpus, node)
 #define CPU_FOREACH_SAFE(cpu, next_cpu) \
-    QTAILQ_FOREACH_SAFE(cpu, &cpus, node, next_cpu)
-#define CPU_FOREACH_REVERSE(cpu) \
-    QTAILQ_FOREACH_REVERSE(cpu, &cpus, CPUTailQ, node)
-#define first_cpu QTAILQ_FIRST(&cpus)
+    QLIST_FOREACH_SAFE_RCU(cpu, &cpus, node, next_cpu)
+#define first_cpu QLIST_FIRST_RCU(&cpus)
 
 extern __thread CPUState *current_cpu;
 
diff --git a/linux-user/main.c b/linux-user/main.c
index 98ebe19..3e10bd8 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -121,7 +121,7 @@ void fork_end(int child)
            Discard information about the parent threads.  */
         CPU_FOREACH_SAFE(cpu, next_cpu) {
             if (cpu != thread_cpu) {
-                QTAILQ_REMOVE(&cpus, thread_cpu, node);
+                QLIST_REMOVE_RCU(thread_cpu, node);
             }
         }
         pending_cpus = 0;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 732936f..e166313 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -5604,7 +5604,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 
             cpu_list_lock();
             /* Remove the CPU from the list.  */
-            QTAILQ_REMOVE(&cpus, cpu, node);
+            QLIST_REMOVE(cpu, node);
             cpu_list_unlock();
             ts = cpu->opaque;
             if (ts->child_tidptr) {
-- 
1.9.1

* [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (31 preceding siblings ...)
  2015-08-24  0:24 ` [Qemu-devel] [RFC 32/38] cpu list: convert to RCU QLIST Emilio G. Cota
@ 2015-08-24  0:24 ` Emilio G. Cota
  2015-08-24  1:24   ` Paolo Bonzini
  2015-08-24  0:24 ` [Qemu-devel] [RFC 34/38] translate-all: use tcg_sched_work for tb_flush Emilio G. Cota
                   ` (6 subsequent siblings)
  39 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:24 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

This is similar in intent to the async_safe_work mechanism. The main
differences are:

- Work is run on a single CPU thread *after* all others are put to sleep

- Sleeping threads are woken up by the worker thread upon completing its job

- A flag has been added to tcg_ctx so that only one thread can schedule
  work at a time. The flag is checked every time tb_lock is acquired.

- Handles the possibility of CPU threads being created after the existing
  CPUs are put to sleep. This is easily triggered with many threads on
  a many-core host in usermode.

- Works for both softmmu and usermode
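
Usage is, schematically (this is how later patches in the series use it
for tb_flush and the global TLB flushes; work_fn and its caller are
placeholders):

    static void work_fn(void *arg)
    {
        /* runs on one CPU thread while all other CPU threads sleep */
    }

    static void schedule_it(CPUState *cpu)  /* hypothetical caller */
    {
        tb_lock();
        cpu_tcg_sched_work(cpu, work_fn, NULL);
        /* not reached: cpu_tcg_sched_work exits the CPU loop; work_fn
         * then runs from cpu_exec once all other CPUs are asleep */
    }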

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 89 +++++++++++++++++++++++++++++++++++++++++++++++++
 exec.c                  |  4 +++
 include/exec/exec-all.h |  5 +++
 include/qom/cpu.h       | 20 +++++++++++
 tcg/tcg.h               |  1 +
 translate-all.c         | 23 ++++++++++++-
 6 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index ff08da8..378ce52 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -393,6 +393,57 @@ static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
 { }
 #endif
 
+static inline void cpu_sleep_other(CPUState *cpu, CPUState *curr)
+{
+    assert(cpu->tcg_sleep_owner == NULL);
+    qemu_mutex_lock(cpu->tcg_work_lock);
+    cpu->tcg_sleep_requests++;
+    cpu->tcg_sleep_owner = curr;
+    qemu_mutex_unlock(cpu->tcg_work_lock);
+#ifdef CONFIG_SOFTMMU
+    cpu_exit(cpu);
+#else
+    /* cannot call cpu_exit(); cpu->exit_request is not for usermode */
+    smp_wmb();
+    cpu->tcg_exit_req = 1;
+#endif
+}
+
+/* call with no locks held */
+static inline void cpu_sleep_others(CPUState *curr)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        if (cpu == curr) {
+            continue;
+        }
+        cpu_sleep_other(cpu, curr);
+    }
+    /* wait until all other threads are out of the execution loop */
+    synchronize_rcu();
+}
+
+static inline void cpu_wake_others(CPUState *curr)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        if (cpu == curr) {
+            continue;
+        }
+        if (cpu->tcg_sleep_owner != curr) {
+            assert(!cpu->inited);
+            continue;
+        }
+        qemu_mutex_lock(cpu->tcg_work_lock);
+        cpu->tcg_sleep_requests--;
+        cpu->tcg_sleep_owner = NULL;
+        qemu_cond_signal(cpu->tcg_work_cond);
+        qemu_mutex_unlock(cpu->tcg_work_lock);
+    }
+}
+
 /* main execution loop */
 
 int cpu_exec(CPUState *cpu)
@@ -410,6 +461,44 @@ int cpu_exec(CPUState *cpu)
 
     current_cpu = cpu;
 
+    /*
+     * Prevent threads that were created during a TCG work critical section
+     * (and that therefore didn't have cpu->tcg_sleep_owner set) from executing.
+     * We simply do not let them run: they are sent out of the CPU loop
+     * until the work_pending flag goes down.
+     */
+    if (unlikely(!cpu->inited)) {
+        tb_lock();
+        tb_unlock();
+        cpu->inited = true;
+    }
+
+    if (cpu->tcg_work_func) {
+        cpu_sleep_others(cpu);
+        /*
+         * At this point all existing threads are sleeping.
+         * With the check above we make sure that threads that might be
+         * concurrently added at this point won't execute until the end of the
+         * work window, so we can safely call the work function.
+         */
+        cpu->tcg_work_func(cpu->tcg_work_arg);
+        cpu->tcg_work_func = NULL;
+        cpu->tcg_work_arg = NULL;
+
+        /* mark the end of the TCG work critical section */
+        tb_lock_nocheck();
+        tcg_ctx.tb_ctx.work_pending = false;
+        tb_unlock();
+        cpu_wake_others(cpu);
+    }
+
+    qemu_mutex_lock(cpu->tcg_work_lock);
+    assert(cpu->tcg_sleep_requests >= 0);
+    while (unlikely(cpu->tcg_sleep_requests)) {
+        qemu_cond_wait(cpu->tcg_work_cond, cpu->tcg_work_lock);
+    }
+    qemu_mutex_unlock(cpu->tcg_work_lock);
+
 #ifndef CONFIG_USER_ONLY
     /* FIXME: user-mode emulation probably needs a similar mechanism as well,
      * for example for tb_flush.
diff --git a/exec.c b/exec.c
index 58cd096..45a9761 100644
--- a/exec.c
+++ b/exec.c
@@ -579,6 +579,10 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
 
     qemu_mutex_init(&cpu->tb_jmp_cache_lock);
     seqlock_init(&cpu->tb_jmp_cache_sequence, &cpu->tb_jmp_cache_lock);
+    cpu->tcg_work_cond = g_malloc(sizeof(*cpu->tcg_work_cond));
+    qemu_cond_init(cpu->tcg_work_cond);
+    cpu->tcg_work_lock = g_malloc(sizeof(*cpu->tcg_work_lock));
+    qemu_mutex_init(cpu->tcg_work_lock);
 #ifndef CONFIG_USER_ONLY
     cpu->as = &address_space_memory;
     cpu->thread_id = qemu_get_thread_id();
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 050e820..be8315c 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -198,6 +198,11 @@ struct TBContext {
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
+    /*
+     * This ensures that only one thread can perform safe work at a time.
+     * Protected by tb_lock; check the flag right after acquiring the lock.
+     */
+    bool work_pending;
 
     /* statistics */
     int tb_flush_count;
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index ab484be..aba7edb 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -273,6 +273,15 @@ struct CPUState {
     bool stop;
     bool stopped;
     bool cpu_loop_exit_locked;
+    bool inited;
+    /* tcg_work_* protected by tcg_work_lock */
+    QemuCond *tcg_work_cond;
+    QemuMutex *tcg_work_lock;
+    void (*tcg_work_func)(void *arg);
+    void *tcg_work_arg;
+    CPUState *tcg_sleep_owner;
+    int tcg_sleep_requests;
+
     volatile sig_atomic_t exit_request;
     uint32_t interrupt_request;
     int singlestep_enabled;
@@ -582,6 +591,17 @@ void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
 bool async_safe_work_pending(void);
 
 /**
+ * cpu_tcg_sched_work:
+ * @cpu: CPU thread to schedule the work on
+ * @func: function to be called when all other CPU threads are asleep
+ * @arg: argument to be passed to @func
+ *
+ * Schedule work to be done while all other CPU threads are put to sleep.
+ * Call with tb_lock held.
+ */
+void cpu_tcg_sched_work(CPUState *cpu, void (*func)(void *arg), void *arg);
+
+/**
  * qemu_get_cpu:
  * @index: The CPUState@cpu_index value of the CPU to obtain.
  *
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 9a873ac..1229f7e 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -596,6 +596,7 @@ void tcg_pool_reset(TCGContext *s);
 void tcg_pool_delete(TCGContext *s);
 
 void tb_lock(void);
+void tb_lock_nocheck(void);
 void tb_unlock(void);
 bool tb_lock_recursive(void);
 void tb_lock_reset(void);
diff --git a/translate-all.c b/translate-all.c
index 8f8c402..f3f7fb2 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -133,13 +133,24 @@ TCGContext tcg_ctx;
 /* translation block context */
 __thread int have_tb_lock;
 
-void tb_lock(void)
+/* acquire tb_lock without checking for pending work */
+void tb_lock_nocheck(void)
 {
     assert(!have_tb_lock);
     qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
     have_tb_lock++;
 }
 
+void tb_lock(void)
+{
+    tb_lock_nocheck();
+    if (unlikely(tcg_ctx.tb_ctx.work_pending)) {
+        assert(current_cpu);
+        current_cpu->exception_index = EXCP_INTERRUPT;
+        cpu_loop_exit(current_cpu);
+    }
+}
+
 void tb_unlock(void)
 {
     assert(have_tb_lock);
@@ -961,6 +972,16 @@ static void tb_page_check(void)
 
 #endif
 
+void cpu_tcg_sched_work(CPUState *cpu, void (*func)(void *arg), void *arg)
+{
+    assert(have_tb_lock);
+    tcg_ctx.tb_ctx.work_pending = true;
+    cpu->tcg_work_func = func;
+    cpu->tcg_work_arg = arg;
+    cpu->exception_index = EXCP_INTERRUPT;
+    cpu_loop_exit(cpu);
+}
+
 static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
-- 
1.9.1

* [Qemu-devel] [RFC 34/38] translate-all: use tcg_sched_work for tb_flush
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (32 preceding siblings ...)
  2015-08-24  0:24 ` [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep Emilio G. Cota
@ 2015-08-24  0:24 ` Emilio G. Cota
  2015-08-24  0:24 ` [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all Emilio G. Cota
                   ` (5 subsequent siblings)
  39 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:24 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Note that cpu_tcg_sched_work does not return: the requesting CPU exits
its execution loop, runs the flush once all other CPUs are asleep, and
then simply retries the translation, at which point tb_alloc cannot
fail.

While at it, add an assertion in tb_flush to check that tb_lock is
held.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 40 +++++++++++-----------------------------
 1 file changed, 11 insertions(+), 29 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index f3f7fb2..378517d 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -869,30 +869,24 @@ static void page_flush_tb(void)
     }
 }
 
-#ifdef CONFIG_USER_ONLY
-void tb_flush_safe(CPUState *cpu)
+static void tb_flush_work(void *arg)
 {
-    tb_flush(cpu);
-}
-#else
-static void tb_flush_work(void *opaque)
-{
-    CPUState *cpu = opaque;
-    tb_flush(cpu);
-}
+    CPUState *cpu = arg;
 
-void tb_flush_safe(CPUState *cpu)
-{
+    /*
+     * Really no need to acquire tb_lock since all other threads are
+     * asleep; let's just acquire it to pass the assertion in
+     * tb_flush and for the wmb when unlocking.
+     */
+    tb_lock_nocheck();
     tb_flush(cpu);
-    async_run_safe_work_on_cpu(cpu, tb_flush_work, cpu);
+    tb_unlock();
 }
-#endif
 
 /* flush all the translation blocks */
-/* XXX: tb_flush is currently not thread safe */
 void tb_flush(CPUState *cpu)
 {
-    tb_lock();
+    assert(have_tb_lock);
 
 #if defined(DEBUG_FLUSH)
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
@@ -919,8 +913,6 @@ void tb_flush(CPUState *cpu)
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
     tcg_ctx.tb_ctx.tb_flush_count++;
-
-    tb_unlock();
 }
 
 #ifdef DEBUG_TB_CHECK
@@ -1164,17 +1156,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb = tb_alloc(pc);
     if (!tb) {
         /* flush must be done */
-#ifdef CONFIG_USER_ONLY
-        /* FIXME: kick all other CPUs out also for user-mode emulation.  */
-        tb_flush(cpu);
-        mmap_unlock();
-#else
-        tb_flush_safe(cpu);
-#endif
-        cpu_loop_exit(cpu);
-        tb_flush(cpu);
-        /* cannot fail at this point */
-        tb = tb_alloc(pc);
+        cpu_tcg_sched_work(cpu, tb_flush_work, cpu);
     }
 
     tb->tc_ptr = tcg_ctx.code_gen_ptr;
-- 
1.9.1

* [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (33 preceding siblings ...)
  2015-08-24  0:24 ` [Qemu-devel] [RFC 34/38] translate-all: use tcg_sched_work for tb_flush Emilio G. Cota
@ 2015-08-24  0:24 ` Emilio G. Cota
  2015-08-24  1:29   ` Paolo Bonzini
  2015-09-01 16:10   ` Alex Bennée
  2015-08-24  0:24 ` [Qemu-devel] [RFC 36/38] cputlb: use tcg_sched_work for tlb_flush_page_all Emilio G. Cota
                   ` (4 subsequent siblings)
  39 siblings, 2 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:24 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cputlb.c | 41 +++++++++++------------------------------
 1 file changed, 11 insertions(+), 30 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 1b3673e..d81a4eb 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -73,43 +73,24 @@ void tlb_flush(CPUState *cpu, int flush_global)
     tlb_flush_count++;
 }
 
-struct TLBFlushParams {
-    CPUState *cpu;
-    int flush_global;
-};
-
-static void tlb_flush_async_work(void *opaque)
+static void __tlb_flush_all(void *arg)
 {
-    struct TLBFlushParams *params = opaque;
+    CPUState *cpu;
+    int flush_global = *(int *)arg;
 
-    tlb_flush(params->cpu, params->flush_global);
-    g_free(params);
+    CPU_FOREACH(cpu) {
+        tlb_flush(cpu, flush_global);
+    }
+    g_free(arg);
 }
 
 void tlb_flush_all(int flush_global)
 {
-    CPUState *cpu;
-    struct TLBFlushParams *params;
+    int *arg = g_malloc(sizeof(*arg));
 
-#if 0 /* MTTCG */
-    CPU_FOREACH(cpu) {
-        tlb_flush(cpu, flush_global);
-    }
-#else
-    CPU_FOREACH(cpu) {
-        if (qemu_cpu_is_self(cpu)) {
-            /* async_run_on_cpu handle this case but this just avoid a malloc
-             * here.
-             */
-            tlb_flush(cpu, flush_global);
-        } else {
-            params = g_malloc(sizeof(struct TLBFlushParams));
-            params->cpu = cpu;
-            params->flush_global = flush_global;
-            async_run_on_cpu(cpu, tlb_flush_async_work, params);
-        }
-    }
-#endif /* MTTCG */
+    *arg = flush_global;
+    tb_lock();
+    cpu_tcg_sched_work(current_cpu, __tlb_flush_all, arg);
 }
 
 static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
-- 
1.9.1

* [Qemu-devel] [RFC 36/38] cputlb: use tcg_sched_work for tlb_flush_page_all
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (34 preceding siblings ...)
  2015-08-24  0:24 ` [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all Emilio G. Cota
@ 2015-08-24  0:24 ` Emilio G. Cota
  2015-08-24  0:24 ` [Qemu-devel] [RFC 37/38] cpus: remove async_run_safe_work_on_cpu Emilio G. Cota
                   ` (3 subsequent siblings)
  39 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:24 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cputlb.c | 39 +++++++++++----------------------------
 1 file changed, 11 insertions(+), 28 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index d81a4eb..717a856 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -145,41 +145,24 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
     tb_flush_jmp_cache(cpu, addr);
 }
 
-struct TLBFlushPageParams {
-    CPUState *cpu;
-    target_ulong addr;
-};
-
-static void tlb_flush_page_async_work(void *opaque)
+static void __tlb_flush_page_all(void *arg)
 {
-    struct TLBFlushPageParams *params = opaque;
+    target_ulong addr = *(target_ulong *)arg;
+    CPUState *cpu;
 
-    tlb_flush_page(params->cpu, params->addr);
-    g_free(params);
+    CPU_FOREACH(cpu) {
+        tlb_flush_page(cpu, addr);
+    }
+    g_free(arg);
 }
 
 void tlb_flush_page_all(target_ulong addr)
 {
-    CPUState *cpu;
-    struct TLBFlushPageParams *params;
+    target_ulong *arg = g_malloc(sizeof(*arg));
 
-    CPU_FOREACH(cpu) {
-#if 0 /* !MTTCG */
-        tlb_flush_page(cpu, addr);
-#else
-        if (qemu_cpu_is_self(cpu)) {
-            /* async_run_on_cpu handle this case but this just avoid a malloc
-             * here.
-             */
-            tlb_flush_page(cpu, addr);
-        } else {
-            params = g_malloc(sizeof(struct TLBFlushPageParams));
-            params->cpu = cpu;
-            params->addr = addr;
-            async_run_on_cpu(cpu, tlb_flush_page_async_work, params);
-        }
-#endif /* MTTCG */
-    }
+    *arg = addr;
+    tb_lock();
+    cpu_tcg_sched_work(current_cpu, __tlb_flush_page_all, arg);
 }
 
 /* update the TLBs so that writes to code in the virtual page 'addr'
-- 
1.9.1

* [Qemu-devel] [RFC 37/38] cpus: remove async_run_safe_work_on_cpu
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (35 preceding siblings ...)
  2015-08-24  0:24 ` [Qemu-devel] [RFC 36/38] cputlb: use tcg_sched_work for tlb_flush_page_all Emilio G. Cota
@ 2015-08-24  0:24 ` Emilio G. Cota
  2015-08-24  0:24 ` [Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE" Emilio G. Cota
                   ` (2 subsequent siblings)
  39 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:24 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

It has no callers left.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c        | 10 ---------
 cpus.c            | 64 +------------------------------------------------------
 include/qom/cpu.h | 24 +--------------------
 3 files changed, 2 insertions(+), 96 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 378ce52..6d7bcc0 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -499,16 +499,6 @@ int cpu_exec(CPUState *cpu)
     }
     qemu_mutex_unlock(cpu->tcg_work_lock);
 
-#ifndef CONFIG_USER_ONLY
-    /* FIXME: user-mode emulation probably needs a similar mechanism as well,
-     * for example for tb_flush.
-     */
-    if (async_safe_work_pending()) {
-        cpu->exit_request = 1;
-        return 0;
-    }
-#endif
-
     if (cpu->halted) {
         if (!cpu_has_work(cpu)) {
             return EXCP_HALTED;
diff --git a/cpus.c b/cpus.c
index 0fe6576..e1033e1 100644
--- a/cpus.c
+++ b/cpus.c
@@ -68,8 +68,6 @@
 int64_t max_delay;
 int64_t max_advance;
 
-int safe_work_pending; /* Number of safe work pending for all VCPUs. */
-
 bool cpu_is_stopped(CPUState *cpu)
 {
     return cpu->stopped || !runstate_is_running();
@@ -77,7 +75,7 @@ bool cpu_is_stopped(CPUState *cpu)
 
 static bool cpu_thread_is_idle(CPUState *cpu)
 {
-    if (cpu->stop || cpu->queued_work_first || cpu->queued_safe_work_first) {
+    if (cpu->stop || cpu->queued_work_first) {
         return false;
     }
     if (cpu_is_stopped(cpu)) {
@@ -860,63 +858,6 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     qemu_cpu_kick(cpu);
 }
 
-void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
-                                void *data)
-{
-    struct qemu_work_item *wi;
-
-    wi = g_malloc0(sizeof(struct qemu_work_item));
-    wi->func = func;
-    wi->data = data;
-    wi->free = true;
-
-    atomic_inc(&safe_work_pending);
-    qemu_mutex_lock(&cpu->work_mutex);
-    if (cpu->queued_safe_work_first == NULL) {
-        cpu->queued_safe_work_first = wi;
-    } else {
-        cpu->queued_safe_work_last->next = wi;
-    }
-    cpu->queued_safe_work_last = wi;
-    wi->next = NULL;
-    wi->done = false;
-    qemu_mutex_unlock(&cpu->work_mutex);
-
-    CPU_FOREACH(cpu) {
-        qemu_cpu_kick(cpu);
-    }
-}
-
-static void flush_queued_safe_work(CPUState *cpu)
-{
-    struct qemu_work_item *wi;
-
-    if (cpu->queued_safe_work_first == NULL) {
-        return;
-    }
-
-    qemu_mutex_lock(&cpu->work_mutex);
-    while ((wi = cpu->queued_safe_work_first)) {
-        cpu->queued_safe_work_first = wi->next;
-        qemu_mutex_unlock(&cpu->work_mutex);
-        wi->func(wi->data);
-        qemu_mutex_lock(&cpu->work_mutex);
-        wi->done = true;
-        if (wi->free) {
-            g_free(wi);
-        }
-        atomic_dec(&safe_work_pending);
-    }
-    cpu->queued_safe_work_last = NULL;
-    qemu_mutex_unlock(&cpu->work_mutex);
-    qemu_cond_broadcast(&qemu_work_cond);
-}
-
-bool async_safe_work_pending(void)
-{
-    return safe_work_pending != 0;
-}
-
 static void flush_queued_work(CPUState *cpu)
 {
     struct qemu_work_item *wi;
@@ -953,9 +894,6 @@ static void qemu_wait_io_event_common(CPUState *cpu)
         cpu->stopped = true;
         qemu_cond_signal(&qemu_pause_cond);
     }
-    qemu_mutex_unlock_iothread();
-    flush_queued_safe_work(cpu);
-    qemu_mutex_lock_iothread();
     flush_queued_work(cpu);
 }
 
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index aba7edb..79045b4 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -245,9 +245,8 @@ struct kvm_run;
  * @mem_io_pc: Host Program Counter at which the memory was accessed.
  * @mem_io_vaddr: Target virtual address at which the memory was accessed.
  * @kvm_fd: vCPU file descriptor for KVM.
- * @work_mutex: Lock to prevent multiple access to queued_* qemu_work_item.
+ * @work_mutex: Lock to prevent multiple access to queued_work_*.
  * @queued_work_first: First asynchronous work pending.
- * @queued_safe_work_first: First item of safe work pending.
  *
  * State of one CPU core or thread.
  */
@@ -290,7 +289,6 @@ struct CPUState {
 
     QemuMutex work_mutex;
     struct qemu_work_item *queued_work_first, *queued_work_last;
-    struct qemu_work_item *queued_safe_work_first, *queued_safe_work_last;
 
     AddressSpace *as;
     struct AddressSpaceDispatch *memory_dispatch;
@@ -571,26 +569,6 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
- * async_run_safe_work_on_cpu:
- * @cpu: The vCPU to run on.
- * @func: The function to be executed.
- * @data: Data to pass to the function.
- *
- * Schedules the function @func for execution on the vCPU @cpu asynchronously
- * when all the VCPUs are outside their loop.
- */
-void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
-                                void *data);
-
-/**
- * async_safe_work_pending:
- *
- * Check whether any safe work is pending on any VCPUs.
- * Returns: @true if a safe work is pending, @false otherwise.
- */
-bool async_safe_work_pending(void);
-
-/**
  * cpu_tcg_sched_work:
  * @cpu: CPU thread to schedule the work on
  * @func: function to be called when all other CPU threads are asleep
-- 
1.9.1

* [Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE"
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (36 preceding siblings ...)
  2015-08-24  0:24 ` [Qemu-devel] [RFC 37/38] cpus: remove async_run_safe_work_on_cpu Emilio G. Cota
@ 2015-08-24  0:24 ` Emilio G. Cota
  2015-08-24  1:29   ` Paolo Bonzini
  2015-08-24  2:01 ` [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Paolo Bonzini
  2015-08-24 16:08 ` Artyom Tarasenko
  39 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:24 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

This reverts commit 81f3053b77f7d3a4d9100c425cd8cec99ee7a3d4.

The interrupt raised by the change in the commit above
kills performance when running many idling VCPUs. For example,
on my 64-core host, when running a workload where cores often
idle (e.g. blackscholes), performance drops significantly
because threads spend most of their time just exiting the CPU
loop, thereby causing heavy contention on the BQL.

Fix it by reverting to the old behaviour, in which no
interrupt is raised. This shouldn't be an issue given that
we now have one thread per VCPU.

Signed-off-by: Emilio G. Cota <cota@braap.org>

Conflicts:
	target-i386/misc_helper.c
---
 target-i386/helper.h      |  1 -
 target-i386/misc_helper.c | 22 ++--------------------
 target-i386/translate.c   |  5 +----
 3 files changed, 3 insertions(+), 25 deletions(-)

diff --git a/target-i386/helper.h b/target-i386/helper.h
index 7d92140..495d9f8 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -56,7 +56,6 @@ DEF_HELPER_2(sysret, void, env, int)
 DEF_HELPER_2(hlt, void, env, int)
 DEF_HELPER_2(monitor, void, env, tl)
 DEF_HELPER_2(mwait, void, env, int)
-DEF_HELPER_2(pause, void, env, int)
 DEF_HELPER_1(debug, void, env)
 DEF_HELPER_1(reset_rf, void, env)
 DEF_HELPER_3(raise_interrupt, void, env, int, int)
diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
index 52c5d65..0389df2 100644
--- a/target-i386/misc_helper.c
+++ b/target-i386/misc_helper.c
@@ -556,15 +556,6 @@ void helper_rdmsr(CPUX86State *env)
 }
 #endif
 
-static void do_pause(X86CPU *cpu)
-{
-    CPUState *cs = CPU(cpu);
-
-    /* Just let another CPU run.  */
-    cs->exception_index = EXCP_INTERRUPT;
-    cpu_loop_exit(cs);
-}
-
 static void do_hlt(X86CPU *cpu)
 {
     CPUState *cs = CPU(cpu);
@@ -610,22 +601,13 @@ void helper_mwait(CPUX86State *env, int next_eip_addend)
     cs = CPU(cpu);
     /* XXX: not complete but not completely erroneous */
     if (cs->cpu_index != 0 || CPU_NEXT(cs) != NULL) {
-        do_pause(cpu);
+        /* more than one CPU: do not sleep because another CPU may
+           wake this one */
     } else {
         do_hlt(cpu);
     }
 }
 
-void helper_pause(CPUX86State *env, int next_eip_addend)
-{
-    X86CPU *cpu = x86_env_get_cpu(env);
-
-    cpu_svm_check_intercept_param(env, SVM_EXIT_PAUSE, 0);
-    env->eip += next_eip_addend;
-
-    do_pause(cpu);
-}
-
 void helper_debug(CPUX86State *env)
 {
     CPUState *cs = CPU(x86_env_get_cpu(env));
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 4d6030f..3b68660 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -6934,10 +6934,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             goto do_xchg_reg_eax;
         }
         if (prefixes & PREFIX_REPZ) {
-            gen_update_cc_op(s);
-            gen_jmp_im(pc_start - s->cs_base);
-            gen_helper_pause(cpu_env, tcg_const_i32(s->pc - pc_start));
-            s->is_jmp = DISAS_TB_JUMP;
+            gen_svm_check_intercept(s, pc_start, SVM_EXIT_PAUSE);
         }
         break;
     case 0x9b: /* fwait */
-- 
1.9.1


* Re: [Qemu-devel] [RFC 03/38] cpu-exec: set current_cpu at cpu_exec()
  2015-08-24  0:23 ` [Qemu-devel] [RFC 03/38] cpu-exec: set current_cpu at cpu_exec() Emilio G. Cota
@ 2015-08-24  1:03   ` Paolo Bonzini
  2015-08-25  0:41     ` [Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:03 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> So that it applies to usermode as well.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c | 2 ++
>  cpus.c     | 1 -
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/cpu-exec.c b/cpu-exec.c
> index b8a11e1..2b9a447 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -386,6 +386,8 @@ int cpu_exec(CPUState *cpu)
>      uintptr_t next_tb;
>      SyncClocks sc;
>  
> +    current_cpu = cpu;
> +
>  #ifndef CONFIG_USER_ONLY
>      /* FIXME: user-mode emulation probably needs a similar mechanism as well,
>       * for example for tb_flush.
> diff --git a/cpus.c b/cpus.c
> index 5484ce6..0fe6576 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1079,7 +1079,6 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>      cpu->thread_id = qemu_get_thread_id();
>      cpu->created = true;
>      cpu->can_do_io = 1;
> -    current_cpu = cpu;
>  
>      qemu_cond_signal(&qemu_cpu_cond);

Please set it somewhere in linux-user/ and bsd-user/ instead, I would
like to keep the TCG code more similar to KVM/Xen/qtest.  Probably the whole

    qemu_thread_get_self(cpu->thread);

    cpu->thread_id = qemu_get_thread_id();
    cpu->created = true;
    cpu->can_do_io = 1;
    current_cpu = cpu;

should be moved into a new function (rcu_register_thread too?).

Paolo


* Re: [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions
  2015-08-24  0:23 ` [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions Emilio G. Cota
@ 2015-08-24  1:04   ` Paolo Bonzini
  2015-08-25  2:30     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:04 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> On some parallel workloads this gives up to a 15% speed improvement.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/thread-posix.h | 47 ++++++++++++++++++++++++++++++++++++++++++
>  include/qemu/thread.h       |  6 ------
>  util/qemu-thread-posix.c    | 50 +++++----------------------------------------
>  3 files changed, 52 insertions(+), 51 deletions(-)
> 
> diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
> index 8ce8f01..7d3a9f1 100644
> --- a/include/qemu/thread-posix.h
> +++ b/include/qemu/thread-posix.h
> @@ -37,4 +37,51 @@ struct QemuThread {
>      pthread_t thread;
>  };
>  
> +void qemu_spin_error_exit(int err, const char *msg);
> +
> +static inline void qemu_spin_init(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_init(&spin->lock, 0);
> +    if (err) {
> +        qemu_spin_error_exit(err, __func__);
> +    }
> +}
> +
> +static inline void qemu_spin_destroy(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_destroy(&spin->lock);
> +    if (err) {
> +        qemu_spin_error_exit(err, __func__);
> +    }
> +}
> +
> +static inline void qemu_spin_lock(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_lock(&spin->lock);
> +    if (err) {
> +        qemu_spin_error_exit(err, __func__);
> +    }
> +}
> +
> +static inline int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    return pthread_spin_trylock(&spin->lock);
> +}
> +
> +static inline void qemu_spin_unlock(QemuSpin *spin)
> +{
> +    int err;
> +
> +    err = pthread_spin_unlock(&spin->lock);
> +    if (err) {
> +        qemu_spin_error_exit(err, __func__);
> +    }
> +}
> +
>  #endif
> diff --git a/include/qemu/thread.h b/include/qemu/thread.h
> index f5d1259..003daab 100644
> --- a/include/qemu/thread.h
> +++ b/include/qemu/thread.h
> @@ -26,12 +26,6 @@ void qemu_mutex_lock(QemuMutex *mutex);
>  int qemu_mutex_trylock(QemuMutex *mutex);
>  void qemu_mutex_unlock(QemuMutex *mutex);
>  
> -void qemu_spin_init(QemuSpin *spin);
> -void qemu_spin_destroy(QemuSpin *spin);
> -void qemu_spin_lock(QemuSpin *spin);
> -int qemu_spin_trylock(QemuSpin *spin);
> -void qemu_spin_unlock(QemuSpin *spin);
> -
>  void qemu_cond_init(QemuCond *cond);
>  void qemu_cond_destroy(QemuCond *cond);
>  
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index 224bacc..04dae0f 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -48,6 +48,11 @@ static void error_exit(int err, const char *msg)
>      abort();
>  }
>  
> +void qemu_spin_error_exit(int err, const char *msg)
> +{
> +    error_exit(err, msg);
> +}
> +
>  void qemu_mutex_init(QemuMutex *mutex)
>  {
>      int err;
> @@ -89,51 +94,6 @@ void qemu_mutex_unlock(QemuMutex *mutex)
>          error_exit(err, __func__);
>  }
>  
> -void qemu_spin_init(QemuSpin *spin)
> -{
> -    int err;
> -
> -    err = pthread_spin_init(&spin->lock, 0);
> -    if (err) {
> -        error_exit(err, __func__);
> -    }
> -}
> -
> -void qemu_spin_destroy(QemuSpin *spin)
> -{
> -    int err;
> -
> -    err = pthread_spin_destroy(&spin->lock);
> -    if (err) {
> -        error_exit(err, __func__);
> -    }
> -}
> -
> -void qemu_spin_lock(QemuSpin *spin)
> -{
> -    int err;
> -
> -    err = pthread_spin_lock(&spin->lock);
> -    if (err) {
> -        error_exit(err, __func__);
> -    }
> -}
> -
> -int qemu_spin_trylock(QemuSpin *spin)
> -{
> -    return pthread_spin_trylock(&spin->lock);
> -}
> -
> -void qemu_spin_unlock(QemuSpin *spin)
> -{
> -    int err;
> -
> -    err = pthread_spin_unlock(&spin->lock);
> -    if (err) {
> -        error_exit(err, __func__);
> -    }
> -}
> -
>  void qemu_cond_init(QemuCond *cond)
>  {
>      int err;
> 

Applied, but in the end the spinlock will probably be a simple
test-and-test-and-set lock, or an MCS lock.  There is no need to use
pthreads for this.
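
For reference, a test-and-test-and-set lock can be as small as the
following sketch, written with qemu/atomic.h-style helpers; the QemuSpin
layout shown here is made up for illustration:

    typedef struct QemuSpin {
        int value;    /* 0 = unlocked, 1 = locked */
    } QemuSpin;

    static inline void qemu_spin_lock(QemuSpin *spin)
    {
        do {
            /* "test": spin on plain loads while the lock is taken,
             * so waiters do not bounce the cache line around */
            while (atomic_read(&spin->value)) {
                /* a cpu_relax()/pause hint would go here */
            }
            /* "test-and-set": try to actually grab the lock */
        } while (atomic_xchg(&spin->value, 1));
    }

    static inline void qemu_spin_unlock(QemuSpin *spin)
    {
        atomic_mb_set(&spin->value, 0);    /* release */
    }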


Paolo


* Re: [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode
  2015-08-24  0:24 ` [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode Emilio G. Cota
@ 2015-08-24  1:07   ` Paolo Bonzini
  2015-08-25 21:54     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:07 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:24, Emilio G. Cota wrote:
> Note that user-only uses mmap_lock for this.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Why is this needed?  The RCU-like page_find should work just fine.

Paolo

> ---
>  translate-all.c | 31 +++++++++++++++++++++++--------
>  1 file changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/translate-all.c b/translate-all.c
> index e7b4a31..8f8c402 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -1203,8 +1203,9 @@ void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
>   * Called with mmap_lock held for user-mode emulation
>   * If called from generated code, iothread mutex must not be held.
>   */
> -void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
> -                                   int is_cpu_write_access)
> +static void
> +tb_invalidate_phys_page_range_locked(tb_page_addr_t start, tb_page_addr_t end,
> +                                     int is_cpu_write_access)
>  {
>      TranslationBlock *tb, *tb_next, *saved_tb;
>      CPUState *cpu = current_cpu;
> @@ -1236,7 +1237,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>      /* we remove all the TBs in the range [start, end[ */
>      /* XXX: see if in some cases it could be faster to invalidate all
>         the code */
> -    tb_lock();
>      tb = p->first_tb;
>      while (tb != NULL) {
>          n = (uintptr_t)tb & 3;
> @@ -1310,14 +1310,19 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>          cpu_resume_from_signal(cpu, NULL);
>      }
>  #endif
> +}
> +
> +void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
> +                                   int is_cpu_write_access)
> +{
> +    tb_lock();
> +    tb_invalidate_phys_page_range_locked(start, end, is_cpu_write_access);
>      tb_unlock();
>  }
>  
>  #ifdef CONFIG_SOFTMMU
> -/* len must be <= 8 and start must be a multiple of len.
> - * Called via softmmu_template.h, with iothread mutex not held.
> - */
> -void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
> +
> +static void tb_invalidate_phys_page_fast_locked(tb_page_addr_t start, int len)
>  {
>      PageDesc *p;
>  
> @@ -1352,9 +1357,19 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
>          }
>      } else {
>      do_invalidate:
> -        tb_invalidate_phys_page_range(start, start + len, 1);
> +        tb_invalidate_phys_page_range_locked(start, start + len, 1);
>      }
>  }
> +
> +/* len must be <= 8 and start must be a multiple of len.
> + * Called via softmmu_template.h, with iothread mutex not held.
> + */
> +void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
> +{
> +    tb_lock();
> +    tb_invalidate_phys_page_fast_locked(start, len);
> +    tb_unlock();
> +}
>  #else
>  /* Called with mmap_lock held.  */
>  static void tb_invalidate_phys_page(tb_page_addr_t addr,
> 


* Re: [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically
  2015-08-24  0:23 ` [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically Emilio G. Cota
@ 2015-08-24  1:09   ` Paolo Bonzini
  2015-08-25 20:36     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:09 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c                         |  9 ++++++---
>  exec.c                             |  2 +-
>  hw/openrisc/cputimer.c             |  2 +-
>  qom/cpu.c                          |  4 ++--
>  target-arm/helper-a64.c            |  2 +-
>  target-arm/helper.c                |  2 +-
>  target-i386/helper.c               |  2 +-
>  target-i386/seg_helper.c           | 14 +++++++-------
>  target-i386/svm_helper.c           |  4 ++--
>  target-openrisc/interrupt_helper.c |  2 +-
>  target-openrisc/sys_helper.c       |  2 +-
>  target-ppc/excp_helper.c           |  8 ++++----
>  target-ppc/helper_regs.h           |  2 +-
>  target-s390x/helper.c              |  2 +-
>  target-unicore32/softmmu.c         |  2 +-
>  translate-all.c                    |  4 ++--
>  16 files changed, 33 insertions(+), 30 deletions(-)

Is this needed if you have patch 23 anyway?

Paolo


* Re: [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock
  2015-08-24  0:23 ` [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock Emilio G. Cota
@ 2015-08-24  1:14   ` Paolo Bonzini
  2015-08-25 21:46     ` Emilio G. Cota
  2015-09-04  8:50   ` Paolo Bonzini
  1 sibling, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:14 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> This paves the way for a lockless tb_find_fast.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c        |  8 +++++++-
>  exec.c            |  2 ++
>  include/qom/cpu.h | 15 +++++++++++++++
>  qom/cpu.c         |  2 +-
>  translate-all.c   | 32 +++++++++++++++++++++++++++++++-
>  5 files changed, 56 insertions(+), 3 deletions(-)
> 
> diff --git a/cpu-exec.c b/cpu-exec.c
> index f758928..5ad578d 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -334,7 +334,9 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>      }
>  
>      /* we add the TB in the virtual pc hash table */
> +    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
>      cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
> +    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
>      return tb;
>  }
>  
> @@ -343,13 +345,17 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu)
>      CPUArchState *env = (CPUArchState *)cpu->env_ptr;
>      TranslationBlock *tb;
>      target_ulong cs_base, pc;
> +    unsigned int version;
>      int flags;
>  
>      /* we record a subset of the CPU state. It will
>         always be the same before a given translated block
>         is executed. */
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> -    tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
> +    do {
> +        version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
> +        tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
> +    } while (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version));
>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>                   tb->flags != flags)) {
>          tb = tb_find_slow(cpu, pc, cs_base, flags);
> diff --git a/exec.c b/exec.c
> index edf2236..ae6f416 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -577,6 +577,8 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
>      int cpu_index;
>      Error *local_err = NULL;
>  
> +    qemu_mutex_init(&cpu->tb_jmp_cache_lock);
> +    seqlock_init(&cpu->tb_jmp_cache_sequence, &cpu->tb_jmp_cache_lock);
>  #ifndef CONFIG_USER_ONLY
>      cpu->as = &address_space_memory;
>      cpu->thread_id = qemu_get_thread_id();
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index dbe0438..f383c24 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -27,6 +27,7 @@
>  #include "exec/hwaddr.h"
>  #include "exec/memattrs.h"
>  #include "qemu/queue.h"
> +#include "qemu/seqlock.h"
>  #include "qemu/thread.h"
>  #include "qemu/typedefs.h"
>  
> @@ -287,6 +288,13 @@ struct CPUState {
>  
>      void *env_ptr; /* CPUArchState */
>      struct TranslationBlock *current_tb;
> +    /*
> +     * The seqlock here is needed because not all updates are to a single
> +     * entry; sometimes we want to atomically clear all entries that belong to
> +     * a given page, e.g. when flushing said page.
> +     */
> +    QemuMutex tb_jmp_cache_lock;
> +    QemuSeqLock tb_jmp_cache_sequence;
>      struct TranslationBlock *tb_jmp_cache[TB_JMP_CACHE_SIZE];
>  
>      struct GDBRegisterState *gdb_regs;
> @@ -342,6 +350,13 @@ extern struct CPUTailQ cpus;
>  
>  extern __thread CPUState *current_cpu;
>  
> +static inline void cpu_tb_jmp_cache_clear(CPUState *cpu)
> +{
> +    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
> +    memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
> +    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
> +}
> +
>  /**
>   * cpu_paging_enabled:
>   * @cpu: The CPU whose state is to be inspected.
> diff --git a/qom/cpu.c b/qom/cpu.c
> index ac19710..5e72e7a 100644
> --- a/qom/cpu.c
> +++ b/qom/cpu.c
> @@ -251,7 +251,7 @@ static void cpu_common_reset(CPUState *cpu)
>      cpu->icount_decr.u32 = 0;
>      cpu->can_do_io = 1;
>      cpu->exception_index = -1;
> -    memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
> +    cpu_tb_jmp_cache_clear(cpu);
>  }
>  
>  static bool cpu_common_has_work(CPUState *cs)
> diff --git a/translate-all.c b/translate-all.c
> index 76a0be8..668b43a 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -863,7 +863,7 @@ void tb_flush(CPUState *cpu)
>      tcg_ctx.tb_ctx.nb_tbs = 0;
>  
>      CPU_FOREACH(cpu) {
> -        memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
> +        cpu_tb_jmp_cache_clear(cpu);
>      }
>  
>      memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
> @@ -988,6 +988,27 @@ static inline void tb_jmp_remove(TranslationBlock *tb, int n)
>      }
>  }
>  
> +static inline void tb_jmp_cache_entry_clear(CPUState *cpu, TranslationBlock *tb)
> +{
> +    unsigned int version;
> +    unsigned int h;
> +    bool hit = false;
> +
> +    h = tb_jmp_cache_hash_func(tb->pc);
> +    do {
> +        version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
> +        hit = cpu->tb_jmp_cache[h] == tb;
> +    } while (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version));
> +
> +    if (hit) {
> +        seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
> +        if (likely(cpu->tb_jmp_cache[h] == tb)) {
> +            cpu->tb_jmp_cache[h] = NULL;
> +        }
> +        seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
> +    }
> +}
> +
>  /* invalidate one TB
>   *
>   * Called with tb_lock held.
> @@ -1024,6 +1045,13 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>          invalidate_page_bitmap(p);
>      }
>  
> +    tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
> +
> +    /* remove the TB from the hash list */
> +    CPU_FOREACH(cpu) {
> +        tb_jmp_cache_entry_clear(cpu, tb);
> +    }
> +
>      /* suppress this TB from the two jump lists */
>      tb_jmp_remove(tb, 0);
>      tb_jmp_remove(tb, 1);
> @@ -1707,12 +1735,14 @@ void tb_flush_jmp_cache(CPUState *cpu, target_ulong addr)
>      /* Discard jump cache entries for any tb which might potentially
>         overlap the flushed page.  */
>      i = tb_jmp_cache_hash_page(addr - TARGET_PAGE_SIZE);
> +    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
>      memset(&cpu->tb_jmp_cache[i], 0,
>             TB_JMP_PAGE_SIZE * sizeof(TranslationBlock *));
>  
>      i = tb_jmp_cache_hash_page(addr);
>      memset(&cpu->tb_jmp_cache[i], 0,
>             TB_JMP_PAGE_SIZE * sizeof(TranslationBlock *));
> +    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
>  }
>  
>  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
> 

I'm not sure how the last three patches compare with the existing "tcg:
move tb_find_fast outside the tb_lock critical section"?

Paolo


* Re: [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep
  2015-08-24  0:24 ` [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep Emilio G. Cota
@ 2015-08-24  1:24   ` Paolo Bonzini
  2015-08-25 22:18     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:24 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:24, Emilio G. Cota wrote:
> This is similar in intent to the async_safe_work mechanism. The main
> differences are:
> 
> - Work is run on a single CPU thread *after* all others are put to sleep
> 
> - Sleeping threads are woken up by the worker thread upon completing its job
> 
> - A flag as been added to tcg_ctx so that only one thread can schedule
>   work at a time. The flag is checked every time tb_lock is acquired.
> 
> - Handles the possibility of CPU threads being created after the existing
>   CPUs are put to sleep. This is easily triggered with many threads on
>   a many-core host in usermode.
> 
> - Works for both softmmu and usermode
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>

I think this is a duplicate of the existing run_on_cpu code.  If needed
in user-mode emulation, it should be extracted out of cpus.c.

Also I think it is dangerous (prone to deadlocks) to wait for other CPUs
with synchronize_cpu and condvar.  I would much rather prefer to _halt_
the CPUs if there is pending work, and keep it halted like this:

 static inline bool cpu_has_work(CPUState *cpu)
 {
     CPUClass *cc = CPU_GET_CLASS(cpu);

+    if (tcg_ctx.tb_ctx.tcg_has_work) {
+        return false;
+    }
     g_assert(cc->has_work);
     return cc->has_work(cpu);
 }

You can then run flush_queued_work from linux-user/main.c (and
bsd-user/main.c) when cpu_exec returns EXCP_HALTED.
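
In user-mode terms that could look like the following sketch; it assumes
flush_queued_work has been extracted out of cpus.c so that linux-user
can call it:

    /* Sketch: service pending work when the CPU comes out halted. */
    for (;;) {
        int trapnr = cpu_exec(cpu);

        if (trapnr == EXCP_HALTED) {
            flush_queued_work(cpu);
            continue;
        }
        /* ... handle the other exit reasons as cpu_loop does today ... */
    }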

Paolo


* Re: [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all
  2015-08-24  0:24 ` [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all Emilio G. Cota
@ 2015-08-24  1:29   ` Paolo Bonzini
  2015-08-25 22:31     ` Emilio G. Cota
  2015-09-01 16:10   ` Alex Bennée
  1 sibling, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:29 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:24, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cputlb.c | 41 +++++++++++------------------------------
>  1 file changed, 11 insertions(+), 30 deletions(-)

As suggested by me and Peter, synchronization on TLB flushes should be
arch-specific.  CPUs can halt on a dmb if they have pending TLB flush
requests on other CPUs, and the CPU can be woken up from the run_on_cpu
callback with something like:

    if (--caller_cpu->pending_tlb_flush_request == 0) {
        caller_cpu->interrupt_request |= CPU_INTERRUPT_TLB_DONE;
        qemu_cpu_kick(caller_cpu);
    }


...


static bool arm_cpu_has_work(CPUState *cs)
{
    ARMCPU *cpu = ARM_CPU(cs);

    return !cpu->pending_tlb_flush_request && !cpu->powered_off
        && cs->interrupt_request &
        (CPU_INTERRUPT_FIQ | CPU_INTERRUPT_HARD
         | CPU_INTERRUPT_VFIQ | CPU_INTERRUPT_VIRQ
         | CPU_INTERRUPT_EXITTB | CPU_INTERRUPT_TLB_DONE);
}

Paolo

> diff --git a/cputlb.c b/cputlb.c
> index 1b3673e..d81a4eb 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -73,43 +73,24 @@ void tlb_flush(CPUState *cpu, int flush_global)
>      tlb_flush_count++;
>  }
>  
> -struct TLBFlushParams {
> -    CPUState *cpu;
> -    int flush_global;
> -};
> -
> -static void tlb_flush_async_work(void *opaque)
> +static void __tlb_flush_all(void *arg)
>  {
> -    struct TLBFlushParams *params = opaque;
> +    CPUState *cpu;
> +    int flush_global = *(int *)arg;
>  
> -    tlb_flush(params->cpu, params->flush_global);
> -    g_free(params);
> +    CPU_FOREACH(cpu) {
> +        tlb_flush(cpu, flush_global);
> +    }
> +    g_free(arg);
>  }
>  
>  void tlb_flush_all(int flush_global)
>  {
> -    CPUState *cpu;
> -    struct TLBFlushParams *params;
> +    int *arg = g_malloc(sizeof(*arg));
>  
> -#if 0 /* MTTCG */
> -    CPU_FOREACH(cpu) {
> -        tlb_flush(cpu, flush_global);
> -    }
> -#else
> -    CPU_FOREACH(cpu) {
> -        if (qemu_cpu_is_self(cpu)) {
> -            /* async_run_on_cpu handle this case but this just avoid a malloc
> -             * here.
> -             */
> -            tlb_flush(cpu, flush_global);
> -        } else {
> -            params = g_malloc(sizeof(struct TLBFlushParams));
> -            params->cpu = cpu;
> -            params->flush_global = flush_global;
> -            async_run_on_cpu(cpu, tlb_flush_async_work, params);
> -        }
> -    }
> -#endif /* MTTCG */
> +    *arg = flush_global;
> +    tb_lock();
> +    cpu_tcg_sched_work(current_cpu, __tlb_flush_all, arg);
>  }
>  
>  static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
> 


* Re: [Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE"
  2015-08-24  0:24 ` [Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE" Emilio G. Cota
@ 2015-08-24  1:29   ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:29 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:24, Emilio G. Cota wrote:
> This reverts commit 81f3053b77f7d3a4d9100c425cd8cec99ee7a3d4.
> 
> The interrupt raised by the change in the commit above
> kills performance when running many idling VCPUs. For example,
> on my 64-core host, when running a workload where cores often
> idle (e.g. blackscholes), performance drops significantly
> because threads spend most of their time just exiting the CPU
> loop, thereby causing heavy contention on the BQL.
> 
> Fix it by reverting to the old behaviour, in which no
> interrupt is raised. This shouldn't be an issue given that
> we now have one thread per VCPU.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Agreed, this is not necessary anymore!

Paolo


* Re: [Qemu-devel] [RFC 20/38] tcg/i386: implement fences
  2015-08-24  0:23 ` [Qemu-devel] [RFC 20/38] tcg/i386: implement fences Emilio G. Cota
@ 2015-08-24  1:32   ` Paolo Bonzini
  2015-08-25  3:02     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  1:32 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> +    case INDEX_op_fence_load:
> +        tcg_out_fence(s, 0xe8);
> +        break;
> +    case INDEX_op_fence_full:
> +        tcg_out_fence(s, 0xf0);
> +        break;
> +    case INDEX_op_fence_store:
> +        tcg_out_fence(s, 0xf8);
> +        break;
> +

lfence and sfence are not needed in generated code; on x86 all loads
have acquire semantics and all stores have release semantics.

Also, on targets that do not have MFENCE you want to generate something
like "lock addl $0, (%esp)".

Paolo


* Re: [Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req
  2015-08-24  0:23 ` [Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req Emilio G. Cota
@ 2015-08-24  2:01   ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  2:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> @@ -16,6 +16,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
>  
>      exitreq_label = gen_new_label();
>      flag = tcg_temp_new_i32();
> +    tcg_gen_smp_rmb();
>      tcg_gen_ld_i32(flag, cpu_env,
>                     offsetof(CPUState, tcg_exit_req) - ENV_OFFSET);
>      tcg_gen_brcondi_i32(TCG_COND_NE, flag, 0, exitreq_label);

This can also be done in cpu-exec.c.  I have addressed Richard
Henderson's comments on the "signal-free TCG exit" patches and will
resend them asap.

Paolo


* Re: [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (37 preceding siblings ...)
  2015-08-24  0:24 ` [Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE" Emilio G. Cota
@ 2015-08-24  2:01 ` Paolo Bonzini
  2015-08-25 22:36   ` Emilio G. Cota
  2015-08-24 16:08 ` Artyom Tarasenko
  39 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  2:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> Hi all,
> 
> Here is MTTCG code I've been working on out-of-tree for the last few months.
> 
> The patchset applies on top of pbonzini's mttcg branch, commit ca56de6f.
> Fetch the branch from: https://github.com/bonzini/qemu/commits/mttcg
> 
> The highlights of the patchset are as follows:
> 
> - The first 5 patches are direct fixes to bugs only in the mttcg
>   branch.

I'm taking this as a review of the first ten patches in the mttcg
thread, so I will send it in a pull request.

Also squashed patches 1, 2, 4 and 5 in the relevant mttcg branch patches.

> - Patches 6-12 fix issues in the master branch.

Applied all 7, thanks.

I hope to send the pending work before I leave for vacation.  Otherwise
I'll just push something to the mttcg branch.

> - The remaining patches are really the meat of this patchset.
>   The main features are:
> 
>   * Support of MTTCG for both user and system mode.

Nice. :)

>   * Design: per-CPU TB jump list protected by a seqlock,
>     if the TB is not found there then check on the global, RCU-protected 'hash table'
>     (i.e. fixed number of buckets), if not there then grab lock, check again,
>     and if it's not there then add generate the code and add the TB to the hash table.
> 
>     It makes sense that Paolo's recent work on the mttcg branch ended up
>     being almost identical to this--it's simple and it scales well.

To be honest it's really Fred's; I just extracted his patch and modified
it to avoid using "safe work" when invalidating TBs.

I think I didn't need a seqlock, but it's possible that my branch has a
race.


>   * tb_lock must be held every time code is generated. The rationale is
>     that most of the time QEMU is executing code, not generating it.

Makes sense, this is really the same as Fred's work.

>   * tb_flush: do it once all other CPUs have been put to sleep by calling
>     rcu_synchronize().
>     We also instrument tb_lock to make sure that only one tb_flush request can
>     happen at a given time.

What do you think about just protecting code_gen_buffer with RCU?

>     For this a mechanism to schedule work is added to
>     supersede cpu_sched_safe_work, which cannot work in usermode.  Here I've
>     toyed with an alternative version that doesn't force the flushing CPU to
>     exit, but in order to make this work we have save/restore the RCU read
>     lock while tb_lock is held in order to avoid deadlocks. This isn't too
>     pretty but it's good to know that the option is there.
> 
>   * I focused on x86 since it is a complex ISA and we support many cores via -smp.
>     I work on a 64-core machine so concurrency bugs show up relatively easily.
> 
>     Atomics are modeled using spinlocks, i.e. one host lock per guest cache line.
>     Note that spinlocks are way better than mutexes for this--perf on 64-cores
>     is 2X with spinlocks on highly concurrent workloads (synchrobench, see below).
> 
>     + Works unchanged for both system and user modes. As far as I can
>       tell the TLB-based approach that Alvise is working on couldn't
>       be used without the TLB--correct me if I'm wrong, it's been
>       quite some time since I looked at that work.

Yes, that's correct.

Paolo


* Re: [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop
  2015-08-24  0:23 ` [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop Emilio G. Cota
@ 2015-08-24  2:01   ` Paolo Bonzini
  2015-08-25 21:16     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  2:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> Otherwise after an exception we end up in a deadlock.

Can you explain in more detail the path that exits cpu_exec with the lock taken?

Also, let's remove the recursive locking by introducing "mmap_lock()
already taken" variants of target_mprotect and target_mmap.

Paolo

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  bsd-user/mmap.c         | 12 ++++++++++++
>  cpu-exec.c              |  1 +
>  include/exec/exec-all.h |  2 ++
>  linux-user/mmap.c       |  8 ++++++++
>  4 files changed, 23 insertions(+)
> 
> diff --git a/bsd-user/mmap.c b/bsd-user/mmap.c
> index 092bf7f..b37a8f5 100644
> --- a/bsd-user/mmap.c
> +++ b/bsd-user/mmap.c
> @@ -48,6 +48,14 @@ void mmap_unlock(void)
>      }
>  }
>  
> +void mmap_lock_reset(void)
> +{
> +    while (mmap_lock_count) {
> +        mmap_lock_count--;
> +        pthread_mutex_unlock(&mmap_mutex);
> +    }
> +}
> +
>  /* Grab lock to make sure things are in a consistent state after fork().  */
>  void mmap_fork_start(void)
>  {
> @@ -72,6 +80,10 @@ void mmap_lock(void)
>  void mmap_unlock(void)
>  {
>  }
> +
> +void mmap_lock_reset(void)
> +{
> +}
>  #endif
>  
>  /* NOTE: all the constants are the HOST ones, but addresses are target. */
> diff --git a/cpu-exec.c b/cpu-exec.c
> index a1700ac..f758928 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -630,6 +630,7 @@ int cpu_exec(CPUState *cpu)
>              env = &x86_cpu->env;
>  #endif
>              tb_lock_reset();
> +            mmap_lock_reset();
>              cpu_exit_loop_lock_reset(cpu);
>          }
>      } /* for(;;) */
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index b1934bf..3b8399a 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -334,6 +334,7 @@ void tlb_fill(CPUState *cpu, target_ulong addr, int is_write, int mmu_idx,
>  #if defined(CONFIG_USER_ONLY)
>  void mmap_lock(void);
>  void mmap_unlock(void);
> +void mmap_lock_reset(void);
>  
>  static inline tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>  {
> @@ -342,6 +343,7 @@ static inline tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong
>  #else
>  static inline void mmap_lock(void) {}
>  static inline void mmap_unlock(void) {}
> +static inline void mmap_lock_reset(void) {}
>  
>  /* cputlb.c */
>  tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr);
> diff --git a/linux-user/mmap.c b/linux-user/mmap.c
> index 78e1b2d..8ee80f5 100644
> --- a/linux-user/mmap.c
> +++ b/linux-user/mmap.c
> @@ -51,6 +51,14 @@ void mmap_unlock(void)
>      }
>  }
>  
> +void mmap_lock_reset(void)
> +{
> +    if (mmap_lock_count) {
> +        mmap_lock_count = 0;
> +        pthread_mutex_unlock(&mmap_mutex);
> +    }
> +}
> +
>  /* Grab lock to make sure things are in a consistent state after fork().  */
>  void mmap_fork_start(void)
>  {
> 


* Re: [Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses
  2015-08-24  0:23 ` [Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses Emilio G. Cota
@ 2015-08-24  2:02   ` Paolo Bonzini
  2015-08-25  2:47     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-24  2:02 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, alex.bennee, mark.burton, a.rigo, Frederic Konrad



On 23/08/2015 17:23, Emilio G. Cota wrote:
> This will be used by the atomic instruction emulation code.

Is this a fast path?  If not, we can use the existing addend field and
convert the host address to a ram_addr_t easily.
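
For example, something like the following sketch; it assumes the
qemu_ram_addr_from_host() helper in exec.c, whose exact signature
should be double-checked:

    /* Sketch: recover the ram address from the TLB addend instead of
     * carrying a separate addr_phys field in every TLB entry. */
    CPUTLBEntry *te = &env->tlb_table[mmu_idx][index];
    uintptr_t haddr = addr + te->addend;
    ram_addr_t ram_addr;

    if (qemu_ram_addr_from_host((void *)haddr, &ram_addr) == NULL) {
        /* not backed by RAM (e.g. MMIO); needs special handling */
    }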

Paolo

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  softmmu_template.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  tcg/tcg.h          |  5 +++++
>  2 files changed, 53 insertions(+)
> 
> diff --git a/softmmu_template.h b/softmmu_template.h
> index b66eaf8..6496a8a 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -480,6 +480,54 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>  #endif
>  }
>  
> +#if DATA_SIZE == 1
> +
> +/* get a load's physical address */
> +hwaddr helper_ret_get_ld_phys(CPUArchState *env, target_ulong addr,
> +                              int mmu_idx, uintptr_t retaddr)
> +{
> +    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +    CPUTLBEntry *te = &env->tlb_table[mmu_idx][index];
> +    target_ulong taddr;
> +    target_ulong phys_addr;
> +
> +    retaddr -= GETPC_ADJ;
> +    taddr = te->addr_read & (TARGET_PAGE_MASK | TLB_INVALID_MASK);
> +    if (taddr != (addr & TARGET_PAGE_MASK)) {
> +        if (!VICTIM_TLB_HIT(addr_read)) {
> +            CPUState *cs = ENV_GET_CPU(env);
> +
> +            tlb_fill(cs, addr, MMU_DATA_LOAD, mmu_idx, retaddr);
> +        }
> +    }
> +    phys_addr = te->addr_phys;
> +    return phys_addr | (addr & ~TARGET_PAGE_MASK);
> +}
> +
> +/* get a store's physical address */
> +hwaddr helper_ret_get_st_phys(CPUArchState *env, target_ulong addr,
> +                              int mmu_idx, uintptr_t retaddr)
> +{
> +    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +    CPUTLBEntry *te = &env->tlb_table[mmu_idx][index];
> +    target_ulong taddr;
> +    target_ulong phys_addr;
> +
> +    retaddr -= GETPC_ADJ;
> +    taddr = te->addr_write & (TARGET_PAGE_MASK | TLB_INVALID_MASK);
> +    if (taddr != (addr & TARGET_PAGE_MASK)) {
> +        if (!VICTIM_TLB_HIT(addr_write)) {
> +            CPUState *cs = ENV_GET_CPU(env);
> +
> +            tlb_fill(cs, addr, MMU_DATA_STORE, mmu_idx, retaddr);
> +        }
> +    }
> +    phys_addr = te->addr_phys;
> +    return phys_addr | (addr & ~TARGET_PAGE_MASK);
> +}
> +
> +#endif /* DATA_SIZE == 1 */
> +
>  #if DATA_SIZE > 1
>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 66b36f2..8d30d61 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -992,6 +992,11 @@ void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
>  void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
>                         TCGMemOpIdx oi, uintptr_t retaddr);
>  
> +hwaddr helper_ret_get_ld_phys(CPUArchState *env, target_ulong addr,
> +                              int mmu_idx, uintptr_t retaddr);
> +hwaddr helper_ret_get_st_phys(CPUArchState *env, target_ulong addr,
> +                              int mmu_idx, uintptr_t retaddr);
> +
>  /* Temporary aliases until backends are converted.  */
>  #ifdef TARGET_WORDS_BIGENDIAN
>  # define helper_ret_ldsw_mmu  helper_be_ldsw_mmu
> 


* Re: [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
  2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
                   ` (38 preceding siblings ...)
  2015-08-24  2:01 ` [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Paolo Bonzini
@ 2015-08-24 16:08 ` Artyom Tarasenko
  2015-08-24 20:16   ` Emilio G. Cota
  39 siblings, 1 reply; 110+ messages in thread
From: Artyom Tarasenko @ 2015-08-24 16:08 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue,
	Paolo Bonzini, Alex Bennée, Frederic Konrad

On Mon, Aug 24, 2015 at 2:23 AM, Emilio G. Cota <cota@braap.org> wrote:
> Hi all,
>
> Here is MTTCG code I've been working on out-of-tree for the last few months.
>
> The patchset applies on top of pbonzini's mttcg branch, commit ca56de6f.
> Fetch the branch from: https://github.com/bonzini/qemu/commits/mttcg
>
> The highlights of the patchset are as follows:
>
> - The first 5 patches are direct fixes to bugs only in the mttcg
>   branch.
>
> - Patches 6-12 fix issues in the master branch.
>
> - The remaining patches are really the meat of this patchset.
>   The main features are:
>
>   * Support of MTTCG for both user and system mode.
>
>   * Design: per-CPU TB jump list protected by a seqlock,
>     if the TB is not found there then check on the global, RCU-protected 'hash table'
>     (i.e. fixed number of buckets), if not there then grab lock, check again,
>     and if it's not there then add generate the code and add the TB to the hash table.
>
>     It makes sense that Paolo's recent work on the mttcg branch ended up
>     being almost identical to this--it's simple and it scales well.
>
>   * tb_lock must be held every time code is generated. The rationale is
>     that most of the time QEMU is executing code, not generating it.

While this is indeed true for an ideal case, there are currently
situations where it's not: running a g++ process under
qemu-system-sparc64, a comparable amount of time is spent on executing
and on generating the code [1].
Does this lock imply that translation performance won't gain anything
when emulating a single-core machine on a multi-core one?

Artyom

1. https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg02194.html


-- 
Regards,
Artyom Tarasenko

SPARC and PPC PReP under qemu blog: http://tyom.blogspot.com/search/label/qemu


* Re: [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
  2015-08-24 16:08 ` Artyom Tarasenko
@ 2015-08-24 20:16   ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24 20:16 UTC (permalink / raw)
  To: Artyom Tarasenko
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue,
	Paolo Bonzini, Alex Bennée, Frederic Konrad

On Mon, Aug 24, 2015 at 18:08:37 +0200, Artyom Tarasenko wrote:
> On Mon, Aug 24, 2015 at 2:23 AM, Emilio G. Cota <cota@braap.org> wrote:
> >   * tb_lock must be held every time code is generated. The rationale is
> >     that most of the time QEMU is executing code, not generating it.
> 
> While this is indeed true for an ideal case, there are currently
> situations where it's not: running a g++ process under
> qemu-system-sparc64, a comparable amount of time is spent on executing
> and on generating the code [1].
> Does this lock imply that translation performance won't gain anything
> when emulating a single-core machine on a multi-core one?
>
> 1. https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg02194.html

AFAICT we can't say that's the desired TCG behavior, right? It seems we might
be translating more often than we should for sparc64:
  https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg02531.html

I'll run multi-programmed workloads (i.e. several instances running
at the same time, for instance doing a 'make -j' kernel build) on x86
to see how large the translation share can get--in general I'd expect
any multi-programmed workload in full-system mode to require more
translation than a multi-threaded one, since in the latter the code is
the same for all threads.

But really I'd only expect self-modifying code to be slow/non-scalable--and
I don't think we should worry about it too much.

If you can think of other workloads that might trigger more translations
than usual, please let me know.

> Does this lock imply that translation performance won't gain anything
> when emulating a single-core machine on a multi-core one?

The goal so far has been to emulate each VCPU on its own thread; as the
perf results in this thread show, this provides huge performance gains
when emulating multi-core guests on large enough hosts.

Code generation is done by the VCPU threads as they need it, and for
that they need to hold a lock to prevent corrupting TCG data structures
--for instance there's a single hash of TBs, and a single code_gen_buffer.

So to answer your question: speeding up a single-core guest on a multi-core
host is not something we're trying to do. If you think about it,
*if* the premise that QEMU is mostly executing (and not translating) code
holds true (and I'd say it holds for most workloads), then one host thread
per VCPU is the right design.

Thanks,

		Emilio


* [Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication
  2015-08-24  1:03   ` Paolo Bonzini
@ 2015-08-25  0:41     ` Emilio G. Cota
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 2/4] linux-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
                         ` (2 more replies)
  0 siblings, 3 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25  0:41 UTC (permalink / raw)
  To: pbonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c | 32 +++++++++++++-------------------
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/cpus.c b/cpus.c
index 81dda93..fd9e903 100644
--- a/cpus.c
+++ b/cpus.c
@@ -922,18 +922,23 @@ static void qemu_kvm_wait_io_event(CPUState *cpu)
     qemu_wait_io_event_common(cpu);
 }
 
+/* call with BQL held */
+static void qemu_cpu_thread_init_common(CPUState *cpu)
+{
+    rcu_register_thread();
+    qemu_thread_get_self(cpu->thread);
+    cpu->thread_id = qemu_get_thread_id();
+    cpu->can_do_io = 1;
+    current_cpu = cpu;
+}
+
 static void *qemu_kvm_cpu_thread_fn(void *arg)
 {
     CPUState *cpu = arg;
     int r;
 
-    rcu_register_thread();
-
     qemu_mutex_lock_iothread();
-    qemu_thread_get_self(cpu->thread);
-    cpu->thread_id = qemu_get_thread_id();
-    cpu->can_do_io = 1;
-    current_cpu = cpu;
+    qemu_cpu_thread_init_common(cpu);
 
     r = kvm_init_vcpu(cpu);
     if (r < 0) {
@@ -970,13 +975,8 @@ static void *qemu_dummy_cpu_thread_fn(void *arg)
     sigset_t waitset;
     int r;
 
-    rcu_register_thread();
-
     qemu_mutex_lock_iothread();
-    qemu_thread_get_self(cpu->thread);
-    cpu->thread_id = qemu_get_thread_id();
-    cpu->can_do_io = 1;
-    current_cpu = cpu;
+    qemu_cpu_thread_init_common(cpu);
 
     sigemptyset(&waitset);
     sigaddset(&waitset, SIG_IPI);
@@ -1009,15 +1009,9 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 {
     CPUState *cpu = arg;
 
-    rcu_register_thread();
-
     qemu_mutex_lock_iothread();
-    qemu_thread_get_self(cpu->thread);
-
-    cpu->thread_id = qemu_get_thread_id();
+    qemu_cpu_thread_init_common(cpu);
     cpu->created = true;
-    cpu->can_do_io = 1;
-    current_cpu = cpu;
 
     qemu_cond_signal(&qemu_cpu_cond);
 
-- 
1.9.1


* [Qemu-devel] [PATCH 2/4] linux-user: add helper to set current_cpu before cpu_loop()
  2015-08-25  0:41     ` [Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication Emilio G. Cota
@ 2015-08-25  0:41       ` Emilio G. Cota
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion Emilio G. Cota
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
  2 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25  0:41 UTC (permalink / raw)
  To: pbonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

There are as many versions of cpu_loop as there are supported
architectures, so introduce a helper that is common to all of them.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c    | 2 +-
 linux-user/qemu.h    | 6 ++++++
 linux-user/syscall.c | 2 +-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index 3e10bd8..24c53ad 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -4405,7 +4405,7 @@ int main(int argc, char **argv, char **envp)
         }
         gdb_handlesig(cpu, 0);
     }
-    cpu_loop(env);
+    do_cpu_loop(env);
     /* never exits */
     return 0;
 }
diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index e8606b2..8af5e01 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -201,6 +201,12 @@ void init_qemu_uname_release(void);
 void fork_start(void);
 void fork_end(int child);
 
+static inline void do_cpu_loop(CPUArchState *env)
+{
+    current_cpu = ENV_GET_CPU(env);
+    cpu_loop(env);
+}
+
 /* Creates the initial guest address space in the host memory space using
  * the given host start address hint and size.  The guest_start parameter
  * specifies the start address of the guest space.  guest_base will be the
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index c7062ab..701c8fa 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -4533,7 +4533,7 @@ static void *clone_func(void *arg)
     /* Wait until the parent has finshed initializing the tls state.  */
     pthread_mutex_lock(&clone_lock);
     pthread_mutex_unlock(&clone_lock);
-    cpu_loop(env);
+    do_cpu_loop(env);
     /* never exits */
     return NULL;
 }
-- 
1.9.1


* [Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion
  2015-08-25  0:41     ` [Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication Emilio G. Cota
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 2/4] linux-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
@ 2015-08-25  0:41       ` Emilio G. Cota
  2015-08-26  0:22         ` Paolo Bonzini
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
  2 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25  0:41 UTC (permalink / raw)
  To: pbonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

Note that the right place to call rcu_register_thread() is
do_cpu_loop() and not just in clone_func(), since the
original 'main' thread needs to call rcu_register_thread()
as well.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/qemu.h    | 1 +
 linux-user/syscall.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 8af5e01..08e6609 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -203,6 +203,7 @@ void fork_end(int child);
 
 static inline void do_cpu_loop(CPUArchState *env)
 {
+    rcu_register_thread();
     current_cpu = ENV_GET_CPU(env);
     cpu_loop(env);
 }
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 701c8fa..84909b4 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -5614,6 +5614,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
             thread_cpu = NULL;
             object_unref(OBJECT(cpu));
             g_free(ts);
+            rcu_unregister_thread();
             pthread_exit(NULL);
         }
 #ifdef TARGET_GPROF
-- 
1.9.1


* [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop()
  2015-08-25  0:41     ` [Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication Emilio G. Cota
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 2/4] linux-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion Emilio G. Cota
@ 2015-08-25  0:41       ` Emilio G. Cota
  2015-08-25 18:07         ` Emilio G. Cota
  2 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25  0:41 UTC (permalink / raw)
  To: pbonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

Note: I cannot compile bsd-user here (Linux); please compile-test.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 bsd-user/main.c | 2 +-
 bsd-user/qemu.h | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/bsd-user/main.c b/bsd-user/main.c
index ee68daa..0bea358 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -1133,7 +1133,7 @@ int main(int argc, char **argv)
         gdbserver_start (gdbstub_port);
         gdb_handlesig(cpu, 0);
     }
-    cpu_loop(env);
+    do_cpu_loop(env);
     /* never exits */
     return 0;
 }
diff --git a/bsd-user/qemu.h b/bsd-user/qemu.h
index 5902614..751efd5 100644
--- a/bsd-user/qemu.h
+++ b/bsd-user/qemu.h
@@ -163,6 +163,12 @@ int get_osversion(void);
 void fork_start(void);
 void fork_end(int child);
 
+static inline void do_cpu_loop(CPUArchState *env)
+{
+    current_cpu = ENV_GET_CPU(env);
+    cpu_loop(env);
+}
+
 #include "qemu/log.h"
 
 /* strace.c */
-- 
1.9.1


* Re: [Qemu-devel] [RFC 12/38] linux-user: call rcu_(un)register_thread on pthread_(exit|create)
  2015-08-24  0:23 ` [Qemu-devel] [RFC 12/38] linux-user: call rcu_(un)register_thread on pthread_(exit|create) Emilio G. Cota
@ 2015-08-25  0:45   ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25  0:45 UTC (permalink / raw)
  To: pbonzini
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 20:23:41 -0400, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  linux-user/syscall.c | 2 ++
>  1 file changed, 2 insertions(+)

Just noticed that this patch is incomplete, since the 'main' thread
doesn't get to call rcu_register_thread()--only its children call it.

This is fixed in patch 3/4 I sent as a reply to your review of patch 3/38,
so you might want to discard this patch from your queue.

Thanks,

		Emilio


* Re: [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions
  2015-08-24  1:04   ` Paolo Bonzini
@ 2015-08-25  2:30     ` Emilio G. Cota
  2015-08-25 19:30       ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25  2:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 18:04:46 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > On some parallel workloads this gives up to a 15% speed improvement.
> > 
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  include/qemu/thread-posix.h | 47 ++++++++++++++++++++++++++++++++++++++++++
> >  include/qemu/thread.h       |  6 ------
> >  util/qemu-thread-posix.c    | 50 +++++----------------------------------------
> >  3 files changed, 52 insertions(+), 51 deletions(-)
(snip)
> Applied, but in the end the spinlock will probably simply use a simple
> test-and-test-and-set lock, or an MCS lock.  There is no need to use
> pthreads for this.

Agreed.

In fact in my tests I sometimes use concurrencykit [http://concurrencykit.org/] to
test lock alternatives (would love to be able to just add ck as a submodule
of qemu, but they do not support as many architectures as qemu does).

Note that fair locks (such as MCS) for user-space programs are not
necessarily a good idea when preemption is considered--and for usermode
we'd be forced (if we allowed MCS's to nest) to use per-lock stack variables
given that the number of threads is unbounded, which is pretty ugly.

If contention is a problem, a simple, fast spinlock combined with an exponential
backoff is already pretty good. Fairness is not a requirement (the cache
substrate of a NUMA machine isn't necessarily fair, is it?); scalability is.
If the algorithm in the guest requires fairness, the guest must use a fair lock
(e.g. MCS), and that works as intended when run natively or under qemu.

I just tested a fetch-and-swap+exp.backoff spinlock with usermode on a
program that spawns N threads and each thread performs 2**M atomic increments
on the same variable. That is, a degenerate worst-case kind of contention.
N varies from 1 to 64, and M=15 on all runs, 5 runs per experiment:

  http://imgur.com/XpYctyT
  With backoff, the per-access latency grows roughly linearly with the number of
  cores, i.e. this is scalable. The other two are clearly superlinear.

The fastest spinlock as per ck's documentation (for uncontended cases) is
the fetch-and-swap lock. I just re-ran the usermode experiments from yesterday
with fas and fas+exp.backoff:

  http://imgur.com/OK2WZg8
  There really isn't much difference among the three candidates.
  
In light of these results I see very little against going for a solution
with exponential backoff.
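
For reference, the fas+exp.backoff candidate is essentially this
(a minimal sketch; the 1024-iteration cap and the x86 'pause' hint are
arbitrary choices of mine, not tuned values):

typedef struct Spin {
    int locked;
} Spin;

static inline void spin_lock(Spin *s)
{
    unsigned backoff = 1;

    /* fetch-and-swap: atomically exchange 1 in; reading back 0 means
     * the lock was free and is now ours */
    while (__atomic_exchange_n(&s->locked, 1, __ATOMIC_ACQUIRE)) {
        unsigned i;

        for (i = 0; i < backoff; i++) {
            __asm__ __volatile__("pause"); /* x86 spin-wait hint */
        }
        if (backoff < 1024) {
            backoff <<= 1; /* exponential backoff, capped */
        }
    }
}

static inline void spin_unlock(Spin *s)
{
    __atomic_store_n(&s->locked, 0, __ATOMIC_RELEASE);
}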

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses
  2015-08-24  2:02   ` Paolo Bonzini
@ 2015-08-25  2:47     ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25  2:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 19:02:30 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > This will be used by the atomic instruction emulation code.
> 
> Is this a fast path?  If not, we can use the existing addend field and
> convert the host address to a ram_addr_t easily.

On x86 this is a fast path because the helper that goes before every
store (aie_st_pre, p.17/38) checks whether any atomic operations on
the store's cache line have been done before. The check is done using
a bitmap (p.16/38); the input to the bitmap is the physical address with
the last 6 bits shifted out (I'm assuming 2**6=64-byte cache lines).
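
In other words, something like this sketch--the aie_* names and the
bitmap layout here are made up for illustration; the real code is in
patches 16-17/38:

#include <stdbool.h>
#include <stdint.h>

#define AIE_LINE_SHIFT 6 /* assuming 2**6 == 64-byte cache lines */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* one bit per guest cache line, set whenever an atomic op touches it */
static inline bool aie_line_has_atomics(const unsigned long *bitmap,
                                        uint64_t paddr)
{
    uint64_t line = paddr >> AIE_LINE_SHIFT;

    return bitmap[line / BITS_PER_LONG] & (1UL << (line % BITS_PER_LONG));
}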

That said: what would the conversion via the addend field look like?
If it's just an addition it might be cheap enough--and we wouldn't
bloat CPUTLBEntry.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 20/38] tcg/i386: implement fences
  2015-08-24  1:32   ` Paolo Bonzini
@ 2015-08-25  3:02     ` Emilio G. Cota
  2015-08-25 22:55       ` Paolo Bonzini
  0 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25  3:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 18:32:51 -0700, Paolo Bonzini wrote:
> 
> 
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > +    case INDEX_op_fence_load:
> > +        tcg_out_fence(s, 0xe8);
> > +        break;
> > +    case INDEX_op_fence_full:
> > +        tcg_out_fence(s, 0xf0);
> > +        break;
> > +    case INDEX_op_fence_store:
> > +        tcg_out_fence(s, 0xf8);
> > +        break;
> > +
> 
> lfence and sfence are not needed in generated code; all loads are
> acquires and all stores are release on x86.

lfence and sfence here serve two purposes:

1) Template for other architectures
2) x86 code does sometimes have lfence/sfence (e.g. movntq+sfence),
   so I guessed they should remain in the translated code.
   If on x86 we always ignore the Write-Combining from the
   guest, maybe we could claim the l/sfence pair here is really unnecessary.
   I'm no x86 expert so for a first stab I decided to put them there.
   
   I didn't intend to translate say *all* PPC/ARM load barriers
   into lfences when generating x86, which is I think your point.

> Also, on targets that do not have MFENCE you want to generate something
> like "lock addl $0, (%esp)".

Good point, will fix.
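
For the record, that fallback is just five bytes; a sketch, assuming
only the backend's tcg_out8() helper:

static void tcg_out_full_fence(TCGContext *s)
{
    /* lock addl $0, (%esp) == f0 83 04 24 00 */
    tcg_out8(s, 0xf0); /* lock prefix */
    tcg_out8(s, 0x83); /* ADD r/m32, imm8 */
    tcg_out8(s, 0x04); /* ModRM: reg=ADD, rm=SIB */
    tcg_out8(s, 0x24); /* SIB: base = %esp */
    tcg_out8(s, 0x00); /* immediate 0 */
}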

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop()
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
@ 2015-08-25 18:07         ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 18:07 UTC (permalink / raw)
  To: pbonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Mon, Aug 24, 2015 at 20:41:10 -0400, Emilio G. Cota wrote:
> Note: I cannot compile bsd-user here (Linux); please compile-test.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
(snip)
> diff --git a/bsd-user/qemu.h b/bsd-user/qemu.h
> index 5902614..751efd5 100644
> --- a/bsd-user/qemu.h
> +++ b/bsd-user/qemu.h
> @@ -163,6 +163,12 @@ int get_osversion(void);
>  void fork_start(void);
>  void fork_end(int child);
>  
> +static inline void do_cpu_loop(CPUArchState *env)
> +{

Here we should also call rcu_register_thread().

> +    current_cpu = ENV_GET_CPU(env);
> +    cpu_loop(env);
> +}
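
I.e., mirroring the linux-user version from patch 3/4:

static inline void do_cpu_loop(CPUArchState *env)
{
    rcu_register_thread();
    current_cpu = ENV_GET_CPU(env);
    cpu_loop(env);
}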

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions
  2015-08-25  2:30     ` Emilio G. Cota
@ 2015-08-25 19:30       ` Emilio G. Cota
  2015-08-25 22:53         ` Paolo Bonzini
  0 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 19:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Mon, Aug 24, 2015 at 22:30:03 -0400, Emilio G. Cota wrote:
> On Sun, Aug 23, 2015 at 18:04:46 -0700, Paolo Bonzini wrote:
> > On 23/08/2015 17:23, Emilio G. Cota wrote:
> (snip)
> > Applied, but in the end the spinlock will probably simply use a simple
> > test-and-test-and-set lock, or an MCS lock.  There is no need to use
> > pthreads for this.
(snip)
> Note that fair locks (such as MCS) for user-space programs are not
> necessarily a good idea when preemption is considered--and for usermode
> we'd be forced (if we allowed MCS's to nest) to use per-lock stack variables
> given that the number of threads is unbounded, which is pretty ugly.
> 
> If contention is a problem, a simple, fast spinlock combined with an exponential
> backoff is already pretty good. Fairness is not a requirement (the cache
> substrate of a NUMA machine isn't necessarily fair, is it?); scalability is.
> If the algorithm in the guest requires fairness, the guest must use a fair lock
> (e.g. MCS), and that works as intended when run natively or under qemu.
> 
> I just tested a fetch-and-swap+exp.backoff spinlock with usermode on a
> program that spawns N threads and each thread performs 2**M atomic increments
> on the same variable. That is, a degenerate worst-case kind of contention.
> N varies from 1 to 64, and M=15 on all runs, 5 runs per experiment:
> 
>   http://imgur.com/XpYctyT
>   With backoff, the per-access latency grows roughly linearly with the number of
>   cores, i.e. this is scalable. The other two are clearly superlinear.

Just tried MCS, CLH and ticket spinlocks (with and without backoff).

They take essentially forever for this (admittedly worst-case) test; this
is not surprising once we realise that we're essentially trying to do
in software what the cache-coherence protocol does for us in hardware
when using greedy spinlocks. I'm not even showing numbers because
around N=9 each experiment starts taking way too long for me to wait.

> In light of these results I see very little against going for a solution
> with exponential backoff.

The only reason I can think of against this option is that we'd be
altering the dynamic behaviour of the emulated code. For example, code
that is clearly not scalable (such as the example above) would be made
to "scale" by the backoff (as a result the CPU that gets a lock
is very likely to grab it again right after unlocking). This makes me
uneasy; it makes intuitive sense to me that unscalable code should
not scale under QEMU, instead of us playing tricks that would confuse
users.

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically
  2015-08-24  1:09   ` Paolo Bonzini
@ 2015-08-25 20:36     ` Emilio G. Cota
  2015-08-25 22:52       ` Paolo Bonzini
  0 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 20:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 18:09:48 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  cpu-exec.c                         |  9 ++++++---
> >  exec.c                             |  2 +-
> >  hw/openrisc/cputimer.c             |  2 +-
> >  qom/cpu.c                          |  4 ++--
> >  target-arm/helper-a64.c            |  2 +-
> >  target-arm/helper.c                |  2 +-
> >  target-i386/helper.c               |  2 +-
> >  target-i386/seg_helper.c           | 14 +++++++-------
> >  target-i386/svm_helper.c           |  4 ++--
> >  target-openrisc/interrupt_helper.c |  2 +-
> >  target-openrisc/sys_helper.c       |  2 +-
> >  target-ppc/excp_helper.c           |  8 ++++----
> >  target-ppc/helper_regs.h           |  2 +-
> >  target-s390x/helper.c              |  2 +-
> >  target-unicore32/softmmu.c         |  2 +-
> >  translate-all.c                    |  4 ++--
> >  16 files changed, 33 insertions(+), 30 deletions(-)
> 
> Is this needed if you have patch 23 anyway?

Sorry, this should have been in the commit log.

This patch is needed as is. One real risk this is protecting
against is the call of cpu_interrupt(cpu_foo) when the calling
thread is not cpu_foo's thread--this write to interrupt_request
might race with other writes, e.g. another call to cpu_interrupt
from another thread, or the clearing of interrupt_request by
cpu_foo.
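
Concretely, the patch turns the plain read-modify-writes into atomic
ones along these lines (sketch; 'mask' stands for whichever interrupt
bits are being set or cleared, assuming the atomic_or/atomic_and
helpers from qemu/atomic.h):

    /* cpu_interrupt() from a foreign thread: */
    atomic_or(&cpu->interrupt_request, mask);   /* was: |= mask  */

    /* and when clearing: */
    atomic_and(&cpu->interrupt_request, ~mask); /* was: &= ~mask */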

Patch 23 fixes another issue--bootup hangs without it. The amount
of code wrapped by the iothread lock can be reduced, though.
Will fix.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop
  2015-08-24  2:01   ` Paolo Bonzini
@ 2015-08-25 21:16     ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 21:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 19:01:39 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > Otherwise after an exception we end up in a deadlock.
> 
> Can you explain better the path that exits cpu_exec with the lock taken?

In fact I cannot :-) So please ignore this patch.

I wrote this while rebasing my code on top of your mttcg branch,
and at that point it was needed. However, patch 01/38 (which I wrote after
writing this patch) was the right fix, and I forgot to check whether
this patch was still necessary--turns out it wasn't.

> Also, let's remove the recursive locking by introducing "mmap_lock()
> already taken" variants of target_mprotect and target_mmap.

Will do.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock
  2015-08-24  1:14   ` Paolo Bonzini
@ 2015-08-25 21:46     ` Emilio G. Cota
  2015-08-25 22:49       ` Paolo Bonzini
  0 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 21:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 18:14:58 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > This paves the way for a lockless tb_find_fast.
> > 
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
(snip)
> > @@ -1707,12 +1735,14 @@ void tb_flush_jmp_cache(CPUState *cpu, target_ulong addr)
> >      /* Discard jump cache entries for any tb which might potentially
> >         overlap the flushed page.  */
> >      i = tb_jmp_cache_hash_page(addr - TARGET_PAGE_SIZE);
> > +    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
> >      memset(&cpu->tb_jmp_cache[i], 0,
> >             TB_JMP_PAGE_SIZE * sizeof(TranslationBlock *));
> >  
> >      i = tb_jmp_cache_hash_page(addr);
> >      memset(&cpu->tb_jmp_cache[i], 0,
> >             TB_JMP_PAGE_SIZE * sizeof(TranslationBlock *));
> > +    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
> >  }
> >  
> >  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
> > 
> 
> I'm not sure how the last three patches compare with the existing "tcg:
> move tb_find_fast outside the tb_lock critical section"?

The seqlock for tb_jmp_cache is necessary the moment that the
array can be wiped out with a memset(), as shown above. That
function (tb_flush_jmp_cache) is called by tlb_flush_page,
which has many callers.

One could argue that we could enforce calling tlb_flush_page to be
a) always done by the owner thread or b) done while all other CPUs
are paused.

I argue that worrying about that is not worth it; let's protect
the array with a seqlock, which on TSO is essentially free, and
worry about more important things.

Wrt the next two patches:

Patch 27 is an improvement in that each TB has its own valid flag,
which makes sense because this should only affect TBs that are
trying to chain to/from it, not all TBs.

Patch 28 uses the RCU QLIST which to me seems cleaner and less
error-prone than open-coding an RCU LIST.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode
  2015-08-24  1:07   ` Paolo Bonzini
@ 2015-08-25 21:54     ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 21:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 18:07:04 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:24, Emilio G. Cota wrote:
> > Note that user-only uses mmap_lock for this.
> > 
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> 
> Why is this needed?  The RCU-like page_find should work just fine.

Ouch, you're right, forgot about that. Patch 30, which adds the
tb_lock assertions, should change as well to only check for
have_tb_lock in page_find when alloc==1.
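
Something along these lines--a sketch, where level_get() is a
hypothetical stand-in for the existing l1_map walk:

static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
{
    if (alloc) {
        /* only writers need to hold tb_lock; lock-free readers are fine */
        assert(have_tb_lock);
    }
    /* the RCU-like walk of l1_map itself is unchanged */
    return level_get(index, alloc); /* hypothetical helper, for brevity */
}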

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep
  2015-08-24  1:24   ` Paolo Bonzini
@ 2015-08-25 22:18     ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 22:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 18:24:55 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:24, Emilio G. Cota wrote:
> > This is similar in intent to the async_safe_work mechanism. The main
> > differences are:
> > 
> > - Work is run on a single CPU thread *after* all others are put to sleep
> > 
> > - Sleeping threads are woken up by the worker thread upon completing its job
> > 
> > - A flag as been added to tcg_ctx so that only one thread can schedule
> >   work at a time. The flag is checked every time tb_lock is acquired.
> > 
> > - Handles the possibility of CPU threads being created after the existing
> >   CPUs are put to sleep. This is easily triggered with many threads on
> >   a many-core host in usermode.
> > 
> > - Works for both softmmu and usermode
> > 
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> 
> I think this is a duplicate of the existing run_on_cpu code.  If needed
> in user-mode emulation, it should be extracted out of cpus.c.

They're similar, yes.

> Also I think it is dangerous (prone to deadlocks) to wait for other CPUs
> with synchronize_cpu and condvar.

The key to avoid deadlocks is not to hold any locks that might be
acquired within an RCU read critical section when calling
synchronize_rcu(). The condvars are for the sleeping threads so
that they can be woken up; sleepers don't call synchronize_rcu().

>  I would much rather prefer to _halt_
> the CPUs if there is pending work, and keep it halted like this:
> 
>  static inline bool cpu_has_work(CPUState *cpu)
>  {
>      CPUClass *cc = CPU_GET_CLASS(cpu);
> 
> +    if (tcg_ctx.tb_ctx.tcg_has_work) {
> +        return false;
> +    }
>      g_assert(cc->has_work);
>      return cc->has_work(cpu);
>  }
> 
> You can then run flush_queued_work from linux-user/main.c (and
> bsd-user/main.c) when cpu_exec returns EXCP_HALTED.

OK. Will try something like this.

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all
  2015-08-24  1:29   ` Paolo Bonzini
@ 2015-08-25 22:31     ` Emilio G. Cota
  2015-08-26  0:25       ` Paolo Bonzini
  0 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 22:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 18:29:33 -0700, Paolo Bonzini wrote:
> 
> 
> On 23/08/2015 17:24, Emilio G. Cota wrote:
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  cputlb.c | 41 +++++++++++------------------------------
> >  1 file changed, 11 insertions(+), 30 deletions(-)
> 
> As suggested by me and Peter, synchronization on TLB flushes should be
> arch-specific.  CPUs can halt on a dmb if they have pending TLB flush
> requests on other CPUs,

I'm not sure I understand. With the patches I sent, a CPU that wants
to flush other TLBs does not continue execution until all of those TLBs
are flushed. So whatever dsb/dmb comes next would have nothing to
wait for. What am I missing?

> and the CPU can be woken up from the run_on_cpu
> callback with something like:
> 
>     if (--caller_cpu->pending_tlb_flush_request) {
>         caller_cpu->interrupt_request |= CPU_INTERRUPT_TLB_DONE;
>         qemu_cpu_kick(caller_cpu);
>     }
> 
> 
> ...
> 
> 
> static bool arm_cpu_has_work(CPUState *cs)
> {
>     ARMCPU *cpu = ARM_CPU(cs);
> 
>     return !cpu->pending_tlb_flush_request && !cpu->powered_off
>         && cs->interrupt_request &
>         (CPU_INTERRUPT_FIQ | CPU_INTERRUPT_HARD
>          | CPU_INTERRUPT_VFIQ | CPU_INTERRUPT_VIRQ
>          | CPU_INTERRUPT_EXITTB | CPU_INTERRUPT_TLB_DONE);
> }

Another option, which I tried but my TCG skills fail me, is to
protect each TLB with a seqlock.

The advantage of this is that TLB flushes would always complete
immediately, so there's no need to halt execution.

The disadvantage is the performance hit, but at least on TSO this
seems to me worth a shot.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
  2015-08-24  2:01 ` [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Paolo Bonzini
@ 2015-08-25 22:36   ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-25 22:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad

On Sun, Aug 23, 2015 at 19:01:28 -0700, Paolo Bonzini wrote:
> >   * tb_flush: do it once all other CPUs have been put to sleep by calling
> >     rcu_synchronize().
> >     We also instrument tb_lock to make sure that only one tb_flush request can
> >     happen at a given time.
> 
> What do you think about just protecting code_gen_buffer with RCU?

I'm not sure what you mean. Isn't that essentially what the
mechanism I sent (cpu_tcg_sched_work) is doing? I mean, the assumption
is: if any thread is on an RCU read critical section, then code_gen_buffer
cannot be modified. That's why tb_flush is only called after
rcu_synchronize() returns, to make sure that no existing threads
are executing code from the buffer.
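
I.e., the shape is roughly this sketch--do_tb_flush_safe is a made-up
name, and the actual QEMU primitive is synchronize_rcu() from util/rcu.c:

static void do_tb_flush_safe(CPUState *cpu)
{
    /* wait until no thread can still be executing code from the buffer */
    synchronize_rcu();
    tb_flush(cpu);
}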

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock
  2015-08-25 21:46     ` Emilio G. Cota
@ 2015-08-25 22:49       ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-25 22:49 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark burton, a rigo, qemu-devel, guillaume delbergue,
	alex bennee, Frederic Konrad

> The seqlock for tb_jmp_cache is necessary the moment that the
> array can be wiped out with a memset(), as shown above. That
> function (tb_flush_jmp_cache) is called by tlb_flush_page,
> which has many callers.
> 
> One could argue that we could enforce calling tlb_flush_page to be
> a) always done by the owner thread or b) done while all other CPUs
> are paused.
> 
> I argue that worrying about that is not worth it; let's protect
> the array with a seqlock, which on TSO is essentially free, and
> worry about more important things.

Got it, this makes sense.

Paolo

> Wrt the next two patches:
> 
> Patch 27 is an improvement in that each TB has its own valid flag,
> which makes sense because this should only affect TBs that are
> trying to chain to/from it, not all TBs.
> 
> Patch 28 uses the RCU QLIST which to me seems cleaner and less
> error-prone than open-coding an RCU LIST.
> 
> Thanks,
> 
> 		Emilio
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically
  2015-08-25 20:36     ` Emilio G. Cota
@ 2015-08-25 22:52       ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-25 22:52 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark burton, a rigo, qemu-devel, guillaume delbergue,
	alex bennee, Frederic Konrad


> This patch is needed as is. One real risk this is protecting
> against is the call of cpu_interrupt(cpu_foo) when the calling
> thread is not cpu_foo's thread--this write to interrupt_request
> might race with other writes, e.g. another call to cpu_interrupt
> from another thread, or the clearing of interrupt_request by
> cpu_foo.

But it should be protected by the iothread lock.  That requires
a lot of auditing. :(

I prefer to go with patch 23 first and then optimize things on
top (not that I don't like the optimization, since it also affects
KVM!). :)

Paolo

> Patch 23 fixes another issue--bootup hangs without it. The amount
> of code wrapped by the iothread lock can be reduced, though.
> Will fix.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions
  2015-08-25 19:30       ` Emilio G. Cota
@ 2015-08-25 22:53         ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-25 22:53 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark burton, a rigo, qemu-devel, guillaume delbergue,
	alex bennee, Frederic Konrad


> > I just tested a fetch-and-swap+exp.backoff spinlock with usermode on a
> > program that spawns N threads and each thread performs 2**M atomic
> > increments
> > on the same variable. That is, a degenerate worst-case kind of contention.
> > N varies from 1 to 64, and M=15 on all runs, 5 runs per experiment:
> > 
> >   http://imgur.com/XpYctyT
> >   With backoff, the per-access latency grows roughly linearly with the
> >   number of
> >   cores, i.e. this is scalable. The other two are clearly superlinear.
> 
> Just tried MCS, CLH and ticket spinlocks (with and without backoff).
> They take essentially forever for this (admittedly worst-case) test;

[snip interesting stuff]

Yeah, fair spinlocks in userspace wasn't a smart suggestion. :)

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 20/38] tcg/i386: implement fences
  2015-08-25  3:02     ` Emilio G. Cota
@ 2015-08-25 22:55       ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-25 22:55 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark burton, a rigo, qemu-devel, guillaume delbergue,
	alex bennee, Frederic Konrad


> lfence and sfence here serve two purposes:
> 
> 1) Template for other architectures

Ok, this makes sense.

> 2) x86 code does sometimes have lfence/sfence (e.g. movntq+sfence),
>    so I guessed they should remain in the translated code.
>    If on x86 we always ignore the Write-Combining from the
>    guest, maybe we could claim the l/sfence pair here is really unnecessary.

Yeah, I think it's fair enough to ignore WC and nontemporal stores.

>    I didn't intend to translate say *all* PPC/ARM load barriers
>    into lfences when generating x86, which is I think your point.

Yeah, it's just that the only gen_op_smp_rmb() you had in the RFC
also did not need an lfence.  But it seems like we're on the same
page.

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion
  2015-08-25  0:41       ` [Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion Emilio G. Cota
@ 2015-08-26  0:22         ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-26  0:22 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	alex.bennee, Frederic Konrad



On 25/08/2015 02:41, Emilio G. Cota wrote:
> Note that the right place to call rcu_register_thread() is
> do_cpu_loop() and not just in clone_func(), since the
> original 'main' thread needs to call rcu_register_thread()
> as well.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>

It doesn't, actually; see rcu_init in util/rcu.c.  I'm still queueing the
original patch.

Paolo

> ---
>  linux-user/qemu.h    | 1 +
>  linux-user/syscall.c | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/linux-user/qemu.h b/linux-user/qemu.h
> index 8af5e01..08e6609 100644
> --- a/linux-user/qemu.h
> +++ b/linux-user/qemu.h
> @@ -203,6 +203,7 @@ void fork_end(int child);
>  
>  static inline void do_cpu_loop(CPUArchState *env)
>  {
> +    rcu_register_thread();
>      current_cpu = ENV_GET_CPU(env);
>      cpu_loop(env);
>  }
> diff --git a/linux-user/syscall.c b/linux-user/syscall.c
> index 701c8fa..84909b4 100644
> --- a/linux-user/syscall.c
> +++ b/linux-user/syscall.c
> @@ -5614,6 +5614,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
>              thread_cpu = NULL;
>              object_unref(OBJECT(cpu));
>              g_free(ts);
> +            rcu_unregister_thread();
>              pthread_exit(NULL);
>          }
>  #ifdef TARGET_GPROF
> -- 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all
  2015-08-25 22:31     ` Emilio G. Cota
@ 2015-08-26  0:25       ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-08-26  0:25 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark burton, a rigo, qemu-devel, guillaume delbergue,
	alex bennee, Frederic Konrad



> On Sun, Aug 23, 2015 at 18:29:33 -0700, Paolo Bonzini wrote:
> > 
> > 
> > On 23/08/2015 17:24, Emilio G. Cota wrote:
> > > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > > ---
> > >  cputlb.c | 41 +++++++++++------------------------------
> > >  1 file changed, 11 insertions(+), 30 deletions(-)
> > 
> > As suggested by me and Peter, synchronization on TLB flushes should be
> > arch-specific.  CPUs can halt on a dmb if they have pending TLB flush
> > requests on other CPUs,
> 
> I'm not sure I understand. With the patches I sent, a CPU that wants
> to flush other TLBs does not continue execution until all of those TLBs
> are flushed. So dsb/dmb whatever comes next would have nothing to
> wait for. What am I missing?

Probably nothing.  Still, I didn't have enough time to study your
cpu_tcg_sched_work patches well, and I'm terribly worried about deadlocks
here. :)  Ensuring that the CPU loop keeps running, and can always be
woken up via halt_cond, is the simplest way to avoid deadlocks.

> Another option, which I tried but my TCG skills fail me, is to
> protect each TLB with a seqlock.
> 
> The advantage of this is that TLB flushes would always complete
> immediately, so there's no need to halt execution.
> 
> The disadvantage is the performance hit, but at least on TSO this
> seems to me worth a shot.

The other disadvantage is that you'd have to modify all TCG backends. :(

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all
  2015-08-24  0:24 ` [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all Emilio G. Cota
  2015-08-24  1:29   ` Paolo Bonzini
@ 2015-09-01 16:10   ` Alex Bennée
  2015-09-01 19:38     ` Emilio G. Cota
  1 sibling, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-01 16:10 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cputlb.c | 41 +++++++++++------------------------------
>  1 file changed, 11 insertions(+), 30 deletions(-)

I bisected my Jessie boot failure to this commit. Before it, it boots up
fine; here it just hangs before the kernel starts init.

17:05 alex@zen/x86_64  [qemu.git/bisect:???]
>./arm-softmmu/qemu-system-arm -machine virt -cpu cortex-a15 \
  -machine type=virt -display none -serial telnet:127.0.0.1:4444 \
  -monitor stdio -smp 4 -m 4096 \
  -kernel ../images/aarch32-current-linux-kernel-only.img \
  --append "console=ttyAMA0 root=/dev/vda1" \
  -drive file=../images/jessie-arm32.qcow2,id=myblock,index=0,if=none \
  -device virtio-blk-device,drive=myblock \
  -netdev user,id=unet,hostfwd=tcp::2222-:22 \
  -device virtio-net-device,netdev=unet \
  -D /tmp/qemu.log -d unimp -name debug-threads=on

See people.linaro.org/~alex.bennee/images


>
> diff --git a/cputlb.c b/cputlb.c
> index 1b3673e..d81a4eb 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -73,43 +73,24 @@ void tlb_flush(CPUState *cpu, int flush_global)
>      tlb_flush_count++;
>  }
>  
> -struct TLBFlushParams {
> -    CPUState *cpu;
> -    int flush_global;
> -};
> -
> -static void tlb_flush_async_work(void *opaque)
> +static void __tlb_flush_all(void *arg)
>  {
> -    struct TLBFlushParams *params = opaque;
> +    CPUState *cpu;
> +    int flush_global = *(int *)arg;
>  
> -    tlb_flush(params->cpu, params->flush_global);
> -    g_free(params);
> +    CPU_FOREACH(cpu) {
> +        tlb_flush(cpu, flush_global);
> +    }
> +    g_free(arg);
>  }
>  
>  void tlb_flush_all(int flush_global)
>  {
> -    CPUState *cpu;
> -    struct TLBFlushParams *params;
> +    int *arg = g_malloc(sizeof(*arg));
>  
> -#if 0 /* MTTCG */
> -    CPU_FOREACH(cpu) {
> -        tlb_flush(cpu, flush_global);
> -    }
> -#else
> -    CPU_FOREACH(cpu) {
> -        if (qemu_cpu_is_self(cpu)) {
> -            /* async_run_on_cpu handle this case but this just avoid a malloc
> -             * here.
> -             */
> -            tlb_flush(cpu, flush_global);
> -        } else {
> -            params = g_malloc(sizeof(struct TLBFlushParams));
> -            params->cpu = cpu;
> -            params->flush_global = flush_global;
> -            async_run_on_cpu(cpu, tlb_flush_async_work, params);
> -        }
> -    }
> -#endif /* MTTCG */
> +    *arg = flush_global;
> +    tb_lock();
> +    cpu_tcg_sched_work(current_cpu, __tlb_flush_all, arg);
>  }
>  
>  static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all
  2015-09-01 16:10   ` Alex Bennée
@ 2015-09-01 19:38     ` Emilio G. Cota
  2015-09-01 20:18       ` Peter Maydell
  0 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-09-01 19:38 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad

On Tue, Sep 01, 2015 at 17:10:30 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  cputlb.c | 41 +++++++++++------------------------------
> >  1 file changed, 11 insertions(+), 30 deletions(-)
> 
> I bisected my Jessie boot failure to this commit. Before it, it boots up
> fine; here it just hangs before the kernel starts init.
> 
> 17:05 alex@zen/x86_64  [qemu.git/bisect:???]
> >./arm-softmmu/qemu-system-arm -machine virt -cpu cortex-a15 \
>   -machine type=virt -display none -serial telnet:127.0.0.1:4444 \
>   -monitor stdio -smp 4 -m 4096 \
>   -kernel ../images/aarch32-current-linux-kernel-only.img \
>   --append "console=ttyAMA0 root=/dev/vda1" \
>   -drive file=../images/jessie-arm32.qcow2,id=myblock,index=0,if=none \
>   -device virtio-blk-device,drive=myblock \
>   -netdev user,id=unet,hostfwd=tcp::2222-:22 \
>   -device virtio-net-device,netdev=unet \
>   -D /tmp/qemu.log -d unimp -name debug-threads=on
> 
> See people.linaro.org/~alex.bennee/images

Thanks for testing!

I can replicate it; what's happening is that tlb_flush_all calls
cpu_loop_exit(), then re-enters the cpu loop, performs the
job while other CPUs are asleep (i.e. __tlb_flush_all in this case),
but then when it continues execution it loads the same instruction
(say a TLBIALLIS) again. So we end up with the same CPU calling
tlb_flush_all in an infinite loop.

A possible way to fix this is to finish the TB right after the
helper and then add a flag in cpu_tcg_sched_work to not call
cpu_loop_exit, raising an exit interrupt instead.
(Note that cpu_loop_exit is still necessary when doing work
out-of-band wrt execution, e.g. we *want* to come back
to the same PC when doing a tb_flush.)

I've tried doing this but I can't see an obvious place to insert
the call to tcg_gen_exit_tb()--I see the calls to the TLB helpers
are embedded in structs that I presume are called by some generic
helper code. A little bit of help here would be appreciated, I'm
not very familiar with target-arm.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all
  2015-09-01 19:38     ` Emilio G. Cota
@ 2015-09-01 20:18       ` Peter Maydell
  0 siblings, 0 replies; 110+ messages in thread
From: Peter Maydell @ 2015-09-01 20:18 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, Mark Burton, Alvise Rigo, QEMU Developers,
	Guillaume Delbergue, Paolo Bonzini, Alex Bennée,
	Frederic Konrad

On 1 September 2015 at 20:38, Emilio G. Cota <cota@braap.org> wrote:
> I can replicate it; what's happening is that tlb_flush_all calls
> cpu_loop_exit(), then re-enters the cpu loop, performs the
> job while other CPUs are asleep (i.e. __tlb_flush_all in this case),
> but then when it continues execution it loads the same instruction
> (say a TLBIALLIS) again. So we end up with the same CPU calling
> tlb_flush_all in an infinite loop.
>
> A possible way to fix this is to finish the TB right after the
> helper and then add a flag in cpu_tcg_sched_work to not call
> cpu_loop_exit, raising an exit interrupt instead.

Sounds like a good idea.

> (Note that cpu_loop_exit is still necessary when doing work
> out-of-band wrt execution, e.g. we *want* to come back
> to the same PC when doing a tb_flush.)

Really? I haven't looked at any of this code, but that sounds
a bit odd...

> I've tried doing this but I can't see an obvious place to insert
> the call to tcg_gen_exit_tb()--I see the calls to the TLB helpers
> are embedded in structs that I presume are called by some generic
> helper code. A little bit of help here would be appreciated, I'm
> not very familiar with target-arm.

The code (for 32-bit) is in disas_coproc_insn(). Any coprocessor
which isn't a CP_SPECIAL case (ie NOP or WFI) will always be the last
thing in its TB anyway, unless this is suppressed with the
ARM_CP_SUPPRESS_TB_END flag in the reginfo struct.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock
  2015-08-24  0:23 ` [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock Emilio G. Cota
  2015-08-24  1:14   ` Paolo Bonzini
@ 2015-09-04  8:50   ` Paolo Bonzini
  2015-09-04 10:04     ` Paolo Bonzini
  1 sibling, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-09-04  8:50 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: alex.bennee, Frederic Konrad, mark.burton, a.rigo, guillaume.delbergue



On 24/08/2015 02:23, Emilio G. Cota wrote:
> This paves the way for a lockless tb_find_fast.

Having now reviewed the patch, I think we can do better.

The idea is:

- only the CPU thread can set cpu->tb_jmp_cache[]

- other threads can, under seqlock protection, _clear_ cpu->tb_jmp_cache[]

- the seqlock can be protected by tb_lock.  Then you need not retry the
read, you can just fall back to the slow path, which will take the
tb_lock and thus serialize with the clearer.
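
Concretely, tb_find_fast would become something like this (a sketch
reusing the names from the patch below):

    version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
    tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
    if (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version) ||
        unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                 tb->flags != flags)) {
        /* no retry loop: fall through to the slow path, which takes
         * tb_lock and thus serializes with any concurrent clearer */
        tb = tb_find_slow(cpu, pc, cs_base, flags);
    }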

Paolo

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c        |  8 +++++++-
>  exec.c            |  2 ++
>  include/qom/cpu.h | 15 +++++++++++++++
>  qom/cpu.c         |  2 +-
>  translate-all.c   | 32 +++++++++++++++++++++++++++++++-
>  5 files changed, 56 insertions(+), 3 deletions(-)
> 
> diff --git a/cpu-exec.c b/cpu-exec.c
> index f758928..5ad578d 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -334,7 +334,9 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>      }
>  
>      /* we add the TB in the virtual pc hash table */
> +    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
>      cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
> +    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
>      return tb;
>  }
>  
> @@ -343,13 +345,17 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu)
>      CPUArchState *env = (CPUArchState *)cpu->env_ptr;
>      TranslationBlock *tb;
>      target_ulong cs_base, pc;
> +    unsigned int version;
>      int flags;
>  
>      /* we record a subset of the CPU state. It will
>         always be the same before a given translated block
>         is executed. */
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> -    tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
> +    do {
> +        version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
> +        tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
> +    } while (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version));
>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>                   tb->flags != flags)) {
>          tb = tb_find_slow(cpu, pc, cs_base, flags);
> diff --git a/exec.c b/exec.c
> index edf2236..ae6f416 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -577,6 +577,8 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
>      int cpu_index;
>      Error *local_err = NULL;
>  
> +    qemu_mutex_init(&cpu->tb_jmp_cache_lock);
> +    seqlock_init(&cpu->tb_jmp_cache_sequence, &cpu->tb_jmp_cache_lock);
>  #ifndef CONFIG_USER_ONLY
>      cpu->as = &address_space_memory;
>      cpu->thread_id = qemu_get_thread_id();
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index dbe0438..f383c24 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -27,6 +27,7 @@
>  #include "exec/hwaddr.h"
>  #include "exec/memattrs.h"
>  #include "qemu/queue.h"
> +#include "qemu/seqlock.h"
>  #include "qemu/thread.h"
>  #include "qemu/typedefs.h"
>  
> @@ -287,6 +288,13 @@ struct CPUState {
>  
>      void *env_ptr; /* CPUArchState */
>      struct TranslationBlock *current_tb;
> +    /*
> +     * The seqlock here is needed because not all updates are to a single
> +     * entry; sometimes we want to atomically clear all entries that belong to
> +     * a given page, e.g. when flushing said page.
> +     */
> +    QemuMutex tb_jmp_cache_lock;
> +    QemuSeqLock tb_jmp_cache_sequence;
>      struct TranslationBlock *tb_jmp_cache[TB_JMP_CACHE_SIZE];
>  
>      struct GDBRegisterState *gdb_regs;
> @@ -342,6 +350,13 @@ extern struct CPUTailQ cpus;
>  
>  extern __thread CPUState *current_cpu;
>  
> +static inline void cpu_tb_jmp_cache_clear(CPUState *cpu)
> +{
> +    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
> +    memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
> +    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
> +}
> +
>  /**
>   * cpu_paging_enabled:
>   * @cpu: The CPU whose state is to be inspected.
> diff --git a/qom/cpu.c b/qom/cpu.c
> index ac19710..5e72e7a 100644
> --- a/qom/cpu.c
> +++ b/qom/cpu.c
> @@ -251,7 +251,7 @@ static void cpu_common_reset(CPUState *cpu)
>      cpu->icount_decr.u32 = 0;
>      cpu->can_do_io = 1;
>      cpu->exception_index = -1;
> -    memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
> +    cpu_tb_jmp_cache_clear(cpu);
>  }
>  
>  static bool cpu_common_has_work(CPUState *cs)
> diff --git a/translate-all.c b/translate-all.c
> index 76a0be8..668b43a 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -863,7 +863,7 @@ void tb_flush(CPUState *cpu)
>      tcg_ctx.tb_ctx.nb_tbs = 0;
>  
>      CPU_FOREACH(cpu) {
> -        memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
> +        cpu_tb_jmp_cache_clear(cpu);
>      }
>  
>      memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
> @@ -988,6 +988,27 @@ static inline void tb_jmp_remove(TranslationBlock *tb, int n)
>      }
>  }
>  
> +static inline void tb_jmp_cache_entry_clear(CPUState *cpu, TranslationBlock *tb)
> +{
> +    unsigned int version;
> +    unsigned int h;
> +    bool hit = false;
> +
> +    h = tb_jmp_cache_hash_func(tb->pc);
> +    do {
> +        version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
> +        hit = cpu->tb_jmp_cache[h] == tb;
> +    } while (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version));
> +
> +    if (hit) {
> +        seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
> +        if (likely(cpu->tb_jmp_cache[h] == tb)) {
> +            cpu->tb_jmp_cache[h] = NULL;
> +        }
> +        seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
> +    }
> +}
> +
>  /* invalidate one TB
>   *
>   * Called with tb_lock held.
> @@ -1024,6 +1045,13 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>          invalidate_page_bitmap(p);
>      }
>  
> +    tcg_ctx.tb_ctx.tb_invalidated_flag = 1;
> +
> +    /* remove the TB from the hash list */
> +    CPU_FOREACH(cpu) {
> +        tb_jmp_cache_entry_clear(cpu, tb);
> +    }
> +
>      /* suppress this TB from the two jump lists */
>      tb_jmp_remove(tb, 0);
>      tb_jmp_remove(tb, 1);
> @@ -1707,12 +1735,14 @@ void tb_flush_jmp_cache(CPUState *cpu, target_ulong addr)
>      /* Discard jump cache entries for any tb which might potentially
>         overlap the flushed page.  */
>      i = tb_jmp_cache_hash_page(addr - TARGET_PAGE_SIZE);
> +    seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
>      memset(&cpu->tb_jmp_cache[i], 0,
>             TB_JMP_PAGE_SIZE * sizeof(TranslationBlock *));
>  
>      i = tb_jmp_cache_hash_page(addr);
>      memset(&cpu->tb_jmp_cache[i], 0,
>             TB_JMP_PAGE_SIZE * sizeof(TranslationBlock *));
> +    seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
>  }
>  
>  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock
  2015-09-04  8:50   ` Paolo Bonzini
@ 2015-09-04 10:04     ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-09-04 10:04 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: guillaume.delbergue, mark.burton, alex.bennee, a.rigo, Frederic Konrad



On 04/09/2015 10:50, Paolo Bonzini wrote:
> Having now reviewed the patch, I think we can do better.
> 
> The idea is:
> 
> - only the CPU thread can set cpu->tb_jmp_cache[]
> 
> - other threads can, under seqlock protection, _clear_ cpu->tb_jmp_cache[]
> 
> - the seqlock can be protected by tb_lock.  Then you need not retry the
> read, you can just fall back to the slow path, which will take the
> tb_lock and thus serialize with the clearer.

... and then we're back to the idea of making tb_invalidated_flag per-TB. :)

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow
  2015-08-24  0:23 ` [Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow Emilio G. Cota
@ 2015-09-07 15:33   ` Alex Bennée
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-07 15:33 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index f53475c..b8a11e1 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -330,6 +330,7 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>          if (!tb) {
>              tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>          }
> +        mmap_unlock();
>      }
>  
>      /* we add the TB in the virtual pc hash table */

Fix the commit comment s/lock/unlock/ and you can have:

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 02/38] hw/i386/kvmvapic: add missing include of tcg.h
  2015-08-24  0:23 ` [Qemu-devel] [RFC 02/38] hw/i386/kvmvapic: add missing include of tcg.h Emilio G. Cota
@ 2015-09-07 15:49   ` Alex Bennée
  2015-09-07 16:11     ` Paolo Bonzini
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-07 15:49 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> So that the declaration of tb_lock can be found.

OK, this confused me somewhat, as exposing TCG details to the hardware
emulation seems broken. However, I notice the code you're adding to isn't
in Fred's latest series, although the tb_gen_code ugliness is in master
so it will need addressing.

I'm not sure why x86 is the special snowflake in this case. It looks
like a performance hack for emulation which I guess the other guests
don't do.

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  hw/i386/kvmvapic.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
> index 1c3b5b6..a9a33fd 100644
> --- a/hw/i386/kvmvapic.c
> +++ b/hw/i386/kvmvapic.c
> @@ -13,6 +13,7 @@
>  #include "sysemu/kvm.h"
>  #include "hw/i386/apic_internal.h"
>  #include "hw/sysbus.h"
> +#include "tcg/tcg.h"
>  
>  #define VAPIC_IO_PORT           0x7e

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 04/38] translate-all: remove volatile from have_tb_lock
  2015-08-24  0:23 ` [Qemu-devel] [RFC 04/38] translate-all: remove volatile from have_tb_lock Emilio G. Cota
@ 2015-09-07 15:50   ` Alex Bennée
  2015-09-07 16:12     ` Paolo Bonzini
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-07 15:50 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> This is a thread-local variable and therefore all changes
> to it will be seen in order by the owning thread. There is
> no need for it to be volatile.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  translate-all.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/translate-all.c b/translate-all.c
> index 901a35e..31239db 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -130,7 +130,7 @@ static void *l1_map[V_L1_SIZE];
>  TCGContext tcg_ctx;
>  
>  /* translation block context */
> -__thread volatile int have_tb_lock;
> +__thread int have_tb_lock;

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

but really should be folded into the original patches.

>  
>  void tb_lock(void)
>  {

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 06/38] seqlock: add missing 'inline' to seqlock_read_retry
  2015-08-24  0:23 ` [Qemu-devel] [RFC 06/38] seqlock: add missing 'inline' to seqlock_read_retry Emilio G. Cota
@ 2015-09-07 15:50   ` Alex Bennée
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-07 15:50 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/seqlock.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index 3ff118a..f1256f5 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -62,7 +62,7 @@ static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
>      return ret;
>  }
>  
> -static int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
> +static inline int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
>  {
>      /* Read other fields before reading final sequence.  */
>      smp_rmb();

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

You could send this to qemu-trivial now if you want.

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 07/38] seqlock: read sequence number atomically
  2015-08-24  0:23 ` [Qemu-devel] [RFC 07/38] seqlock: read sequence number atomically Emilio G. Cota
@ 2015-09-07 15:53   ` Alex Bennée
  2015-09-07 16:13     ` Paolo Bonzini
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-07 15:53 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> With this change we make sure that the compiler will not
> optimise the read of the sequence number in any way.

What was it doing? Using atomic_read to work around a compiler bug seems
a bit heavy handed if true atomicity isn't needed.

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/seqlock.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index f1256f5..70b01fd 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -55,18 +55,18 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
>  static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
>  {
>      /* Always fail if a write is in progress.  */
> -    unsigned ret = sl->sequence & ~1;
> +    unsigned ret = atomic_read(&sl->sequence);
>  
>      /* Read sequence before reading other fields.  */
>      smp_rmb();
> -    return ret;
> +    return ret & ~1;
>  }
>  
>  static inline int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
>  {
>      /* Read other fields before reading final sequence.  */
>      smp_rmb();
> -    return unlikely(sl->sequence != start);
> +    return unlikely(atomic_read(&sl->sequence) != start);
>  }
>  
>  #endif

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 02/38] hw/i386/kvmvapic: add missing include of tcg.h
  2015-09-07 15:49   ` Alex Bennée
@ 2015-09-07 16:11     ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-09-07 16:11 UTC (permalink / raw)
  To: Alex Bennée, Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	Frederic Konrad



On 07/09/2015 17:49, Alex Bennée wrote:
> I'm not sure why x86 is the special snowflake in this case. It looks
> like a performance hack for emulation which I guess the other guests
> don't do.

Yes, upon MMIO accesses it modifies the instruction that caused the MMIO
itself, and then re-executes it.  Nice, huh?

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 04/38] translate-all: remove volatile from have_tb_lock
  2015-09-07 15:50   ` Alex Bennée
@ 2015-09-07 16:12     ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-09-07 16:12 UTC (permalink / raw)
  To: Alex Bennée, Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	Frederic Konrad



On 07/09/2015 17:50, Alex Bennée wrote:
>> >  /* translation block context */
>> > -__thread volatile int have_tb_lock;
>> > +__thread int have_tb_lock;
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> 
> but really should be folded into the original patches.
> 

Yup, v2 will come soon.

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Qemu-devel] [RFC 07/38] seqlock: read sequence number atomically
  2015-09-07 15:53   ` Alex Bennée
@ 2015-09-07 16:13     ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-09-07 16:13 UTC (permalink / raw)
  To: Alex Bennée, Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	Frederic Konrad



On 07/09/2015 17:53, Alex Bennée wrote:
>> > With this change we make sure that the compiler will not
>> > optimise the read of the sequence number in any way.
> What was it doing? Using atomic_read to work around a compiler bug seems
> a bit heavy handed if true atomicity isn't needed.

This is really the equivalent of a C11 relaxed atomic access
(memory_order_relaxed), so it isn't heavy handed.  We really should move
towards using atomic_read/atomic_set around smp_rmb/smp_wmb/smp_mb.
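
In C11 terms--assuming sl->sequence were declared _Atomic and
<stdatomic.h> included--atomic_read here corresponds to a relaxed load:

    unsigned ret = atomic_load_explicit(&sl->sequence, memory_order_relaxed);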

Paolo

>> >
>> > Signed-off-by: Emilio G. Cota <cota@braap.org>
>> > ---
>> >  include/qemu/seqlock.h | 6 +++---
>> >  1 file changed, 3 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
>> > index f1256f5..70b01fd 100644
>> > --- a/include/qemu/seqlock.h
>> > +++ b/include/qemu/seqlock.h
>> > @@ -55,18 +55,18 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
>> >  static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
>> >  {
>> >      /* Always fail if a write is in progress.  */
>> > -    unsigned ret = sl->sequence & ~1;
>> > +    unsigned ret = atomic_read(&sl->sequence);
>> >  
>> >      /* Read sequence before reading other fields.  */
>> >      smp_rmb();
>> > -    return ret;
>> > +    return ret & ~1;
>> >  }
>> >  
>> >  static inline int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
>> >  {
>> >      /* Read other fields before reading final sequence.  */
>> >      smp_rmb();
>> > -    return unlikely(sl->sequence != start);
>> > +    return unlikely(atomic_read(&sl->sequence) != start);


* Re: [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork
  2015-08-24  0:23 ` [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork Emilio G. Cota
@ 2015-09-08 17:34   ` Alex Bennée
  2015-09-08 19:03     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-08 17:34 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> We were unlocking this lock after fork, which is wrong since
> only the thread that holds a mutex is allowed to unlock it.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  util/rcu.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/util/rcu.c b/util/rcu.c
> index 8ba304d..47c2bce 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -335,6 +335,11 @@ static void rcu_init_unlock(void)
>      qemu_mutex_unlock(&rcu_registry_lock);
>      qemu_mutex_unlock(&rcu_sync_lock);
>  }
> +
> +static void rcu_init_child(void)
> +{
> +    qemu_mutex_init(&rcu_registry_lock);
> +}
>  #endif
>  
>  void rcu_after_fork(void)
> @@ -346,7 +351,7 @@ void rcu_after_fork(void)
>  static void __attribute__((__constructor__)) rcu_init(void)
>  {
>  #ifdef CONFIG_POSIX
> -    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_unlock);
> +    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_child);
>  #endif

Hmm, previously we unlocked both rcu_sync_lock and rcu_registry_lock; is
it somehow different in its locking rules? If I'm reading the
pthread_atfork man page right, couldn't we just do:

    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_lock);

>      rcu_init_complete();
>  }

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork
  2015-09-08 17:34   ` Alex Bennée
@ 2015-09-08 19:03     ` Emilio G. Cota
  2015-09-09  9:35       ` Alex Bennée
  0 siblings, 1 reply; 110+ messages in thread
From: Emilio G. Cota @ 2015-09-08 19:03 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad

On Tue, Sep 08, 2015 at 18:34:38 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
(snip)
> > +static void rcu_init_child(void)
> > +{
> > +    qemu_mutex_init(&rcu_registry_lock);
> > +}
> >  #endif
> >  
> >  void rcu_after_fork(void)
> > @@ -346,7 +351,7 @@ void rcu_after_fork(void)
> >  static void __attribute__((__constructor__)) rcu_init(void)
> >  {
> >  #ifdef CONFIG_POSIX
> > -    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_unlock);
> > +    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_child);
> >  #endif
> 
> Hmm, previously we unlocked both rcu_sync_lock and rcu_registry_lock; is
> it somehow different in its locking rules? If I'm reading the
> pthread_atfork man page right, couldn't we just do:
> 
>     pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_lock);

That'd cause the child to deadlock.

Before forking, the parent locks those mutexes; then it forks. The child sees
those mutexes as 'locked', and whatever memory the parent writes to _after_
fork is not seen by the child. So trying to 'lock' those mutexes from the child
is never going to succeed, because they're seen as already locked.

The original idea ("unlock" from the child) is OK, but doesn't work when
using PTHREAD_MUTEX_ERRORCHECK. Yes, this was removed (24fa90499f) but
it's too useful to not cherry-pick it when developing MTTCG.

What I think remains to be fixed is the possibility of corruption when
any call_rcu calls are pending while forking. The child should not mess
with these; only the parent should. See call_rcu_after_fork_{parent,child}
from liburcu for details.
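
For clarity, here are the resulting atfork handlers condensed into one
sketch (rcu_sync_lock elided):

    static void rcu_init_lock(void)    /* prepare: runs in the parent */
    {
        qemu_mutex_lock(&rcu_registry_lock);
    }

    static void rcu_init_unlock(void)  /* parent: owns the lock, so unlock */
    {
        qemu_mutex_unlock(&rcu_registry_lock);
    }

    static void rcu_init_child(void)   /* child: never acquired it, re-init */
    {
        qemu_mutex_init(&rcu_registry_lock);
    }

    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_child);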

		Emilio


* Re: [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork
  2015-09-08 19:03     ` Emilio G. Cota
@ 2015-09-09  9:35       ` Alex Bennée
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-09  9:35 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> On Tue, Sep 08, 2015 at 18:34:38 +0100, Alex Bennée wrote:
>> Emilio G. Cota <cota@braap.org> writes:
> (snip)
>> > +static void rcu_init_child(void)
>> > +{
>> > +    qemu_mutex_init(&rcu_registry_lock);
>> > +}
>> >  #endif
>> >  
>> >  void rcu_after_fork(void)
>> > @@ -346,7 +351,7 @@ void rcu_after_fork(void)
>> >  static void __attribute__((__constructor__)) rcu_init(void)
>> >  {
>> >  #ifdef CONFIG_POSIX
>> > -    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_unlock);
>> > +    pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_child);
>> >  #endif
>> 
>> Hmm, previously we unlocked both rcu_sync_lock and rcu_registry_lock; is
>> it somehow different in its locking rules? If I'm reading the
>> pthread_atfork man page right, couldn't we just do:
>> 
>>     pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_lock);
>
> That'd cause the child to deadlock.
>
> Before forking, the parent locks those mutexes; then it forks. The child sees
> those mutexes as 'locked', and whatever memory the parent writes to _after_
> fork is not seen by the child. So trying to 'lock' those mutexes from the child
> is never going to succeed, because they're seen as already locked.

Doh, apologies, I misread the code. What caught my eye is that the old code
locks/unlocks both rcu_registry_lock and rcu_sync_lock, so shouldn't the
child function reinit both?

static void rcu_init_child(void)
{
    qemu_mutex_init(&rcu_sync_lock);
    qemu_mutex_init(&rcu_registry_lock);
}


>
> The original idea ("unlock" from the child) is OK, but doesn't work when
> using PTHREAD_MUTEX_ERRORCHECK. Yes, this was removed (24fa90499f) but
> it's too useful to not cherry-pick it when developing MTTCG.
>
> What I think remains to be fixed is the possibility of corruption when
> any call_rcu calls are pending while forking. The child should not mess
> with these; only the parent should. See call_rcu_after_fork_{parent,child}
> from liburcu for details.
>
> 		Emilio

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling
  2015-08-24  0:23 ` [Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling Emilio G. Cota
@ 2015-09-09 10:13   ` Paolo Bonzini
  0 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-09-09 10:13 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: alex.bennee, Frederic Konrad, mark.burton, a.rigo, guillaume.delbergue



On 24/08/2015 02:23, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c        | 34 ++++++++++++++++++++++++++++------
>  include/qom/cpu.h |  1 +
>  2 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/cpu-exec.c b/cpu-exec.c
> index fd57b9c..a1700ac 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -371,6 +371,29 @@ static void cpu_handle_debug_exception(CPUState *cpu)
>      cc->debug_excp_handler(cpu);
>  }
>  
> +#ifdef CONFIG_SOFTMMU
> +static inline void cpu_exit_loop_lock(CPUState *cpu)
> +{
> +    qemu_mutex_lock_iothread();
> +    cpu->cpu_loop_exit_locked = true;
> +}
> +
> +static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
> +{
> +    if (cpu->cpu_loop_exit_locked) {
> +        cpu->cpu_loop_exit_locked = false;
> +        qemu_mutex_unlock_iothread();
> +    }

This can use qemu_mutex_iothread_locked, avoiding the introduction of a
new CPUState member.
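
A minimal sketch of that, reusing the existing
qemu_mutex_iothread_locked() predicate:

    static inline void cpu_exit_loop_lock(CPUState *cpu)
    {
        qemu_mutex_lock_iothread();
    }

    static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
    {
        if (qemu_mutex_iothread_locked()) {
            qemu_mutex_unlock_iothread();
        }
    }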

Paolo

> +}
> +
> +#else
> +static inline void cpu_exit_loop_lock(CPUState *cpu)
> +{ }
> +
> +static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
> +{ }
> +#endif
> +
>  /* main execution loop */
>  
>  int cpu_exec(CPUState *cpu)
> @@ -452,12 +475,8 @@ int cpu_exec(CPUState *cpu)
>              for(;;) {
>                  interrupt_request = cpu->interrupt_request;
>                  if (unlikely(interrupt_request)) {
> -                    /* FIXME: this needs to take the iothread lock.
> -                     * For this we need to find all places in
> -                     * cc->cpu_exec_interrupt that can call cpu_loop_exit,
> -                     * and call qemu_unlock_iothread_mutex() there.  Else,
> -                     * add a flag telling cpu_loop_exit() to unlock it.
> -                     */
> +                    cpu_exit_loop_lock(cpu);
> +
>                      if (unlikely(cpu->singlestep_enabled & SSTEP_NOIRQ)) {
>                          /* Mask out external interrupts for this step. */
>                          interrupt_request &= ~CPU_INTERRUPT_SSTEP_MASK;
> @@ -503,6 +522,8 @@ int cpu_exec(CPUState *cpu)
>                             the program flow was changed */
>                          next_tb = 0;
>                      }
> +
> +                    cpu_exit_loop_lock_reset(cpu);
>                  }
>                  if (unlikely(cpu->exit_request)) {
>                      cpu->exception_index = EXCP_INTERRUPT;
> @@ -609,6 +630,7 @@ int cpu_exec(CPUState *cpu)
>              env = &x86_cpu->env;
>  #endif
>              tb_lock_reset();
> +            cpu_exit_loop_lock_reset(cpu);
>          }
>      } /* for(;;) */
>  
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 1d97b63..dbe0438 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -270,6 +270,7 @@ struct CPUState {
>      bool created;
>      bool stop;
>      bool stopped;
> +    bool cpu_loop_exit_locked;
>      volatile sig_atomic_t exit_request;
>      uint32_t interrupt_request;
>      int singlestep_enabled;
> 


* Re: [Qemu-devel] [RFC 09/38] rcu: fix comment with s/rcu_gp_lock/rcu_registry_lock/
  2015-08-24  0:23 ` [Qemu-devel] [RFC 09/38] rcu: fix comment with s/rcu_gp_lock/rcu_registry_lock/ Emilio G. Cota
@ 2015-09-10 11:18   ` Alex Bennée
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-10 11:18 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/rcu.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
> index 7df1e86..f6d1d56 100644
> --- a/include/qemu/rcu.h
> +++ b/include/qemu/rcu.h
> @@ -71,7 +71,7 @@ struct rcu_reader_data {
>      /* Data used by reader only */
>      unsigned depth;
>  
> -    /* Data used for registry, protected by rcu_gp_lock */
> +    /* Data used for registry, protected by rcu_registry_lock */
>      QLIST_ENTRY(rcu_reader_data) node;
>  };

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 10/38] translate-all: remove obsolete comment about l1_map
  2015-08-24  0:23 ` [Qemu-devel] [RFC 10/38] translate-all: remove obsolete comment about l1_map Emilio G. Cota
@ 2015-09-10 11:59   ` Alex Bennée
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-10 11:59 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> l1_map is based on physical addresses in full-system mode, as pointed
> out in an earlier comment. Said comment also mentions that virtual
> addresses are only used in l1_map in user-only mode.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  translate-all.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/translate-all.c b/translate-all.c
> index 31239db..b873d5c 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -122,8 +122,7 @@ uintptr_t qemu_real_host_page_mask;
>  uintptr_t qemu_host_page_size;
>  uintptr_t qemu_host_page_mask;
>  
> -/* This is a multi-level map on the virtual address space.
> -   The bottom level has pointers to PageDesc.  */
> +/* The bottom level has pointers to PageDesc */
>  static void *l1_map[V_L1_SIZE];
>  
>  /* code generation context */

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups
  2015-08-24  0:23 ` [Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups Emilio G. Cota
@ 2015-09-10 13:22   ` Alex Bennée
  2015-09-10 17:46     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-10 13:22 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  util/qemu-thread-posix.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index 04dae0f..3760e27 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -303,7 +303,16 @@ static inline void futex_wake(QemuEvent *ev, int n)
>  
>  static inline void futex_wait(QemuEvent *ev, unsigned val)
>  {
> -    futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0);
> +    while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
> +        switch (errno) {
> +        case EWOULDBLOCK:
> +            return;
> +        case EINTR:
> +            break; /* get out of switch and retry */
> +        default:
> +            abort();

I'd be tempted to error_exit with the errno in this case so additional
information is reported before we bail out; a sketch follows the list
below. The man page seems to indicate other errnos are possible for
FUTEX_WAIT, although they may be unlikely:

       EACCES No read access to futex memory.
       EFAULT Error retrieving timeout information from user space.

       I guess things would have gone very wrong for these

       EINVAL Invalid argument.

       Hard to get wrong

       ENFILE The system limit on the total number of open files has
       been reached.

       Might happen under system load?

       ENOSYS Invalid operation specified in op.

       Hardcoded op so no

       ETIMEDOUT
              Timeout during the FUTEX_WAIT operation.

       No timeout specified so we shouldn't hit it
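
A sketch of that variant, assuming the error_exit() helper already
defined in util/qemu-thread-posix.c:

    static inline void futex_wait(QemuEvent *ev, unsigned val)
    {
        while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
            switch (errno) {
            case EWOULDBLOCK:
                return;
            case EINTR:
                break;          /* spurious wakeup: retry */
            default:
                /* report the errno before we bail out */
                error_exit(errno, __func__);
            }
        }
    }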


> +        }
> +    }
>  }
>  #else
>  static inline void futex_wake(QemuEvent *ev, int n)

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry
  2015-08-24  0:23 ` [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry Emilio G. Cota
@ 2015-09-10 13:49   ` Alex Bennée
  2015-09-10 17:50     ` Emilio G. Cota
  2015-09-21  5:01   ` Paolo Bonzini
  1 sibling, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-10 13:49 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Having the physical address in the TLB entry will allow us
> to portably obtain the physical address of a memory access,
> which will prove useful when implementing a scalable emulation
> of atomic instructions.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cputlb.c                | 1 +
>  include/exec/cpu-defs.h | 7 ++++---
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/cputlb.c b/cputlb.c
> index d1ad8e8..1b3673e 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -409,6 +409,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>      } else {
>          te->addr_write = -1;
>      }
> +    te->addr_phys = paddr;
>  }
>  
>  /* Add a new TLB entry, but without specifying the memory
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 5093be2..ca9c85c 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -60,10 +60,10 @@ typedef uint64_t target_ulong;
>  /* use a fully associative victim tlb of 8 entries */
>  #define CPU_VTLB_SIZE 8
>  
> -#if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
> -#define CPU_TLB_ENTRY_BITS 4
> -#else
> +#if TARGET_LONG_BITS == 32
>  #define CPU_TLB_ENTRY_BITS 5
> +#else
> +#define CPU_TLB_ENTRY_BITS 6
>  #endif
>  
>  /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
> @@ -110,6 +110,7 @@ typedef struct CPUTLBEntry {
>              target_ulong addr_read;
>              target_ulong addr_write;
>              target_ulong addr_code;
> +            target_ulong addr_phys;
>              /* Addend to virtual address to get host address.  IO accesses
>                 use the corresponding iotlb value.  */
>              uintptr_t addend;

So this ends up expanding the TLB entry size, either pushing the
overall TLB up in size or reducing the number of entries per TLB, so I
think we would need some numbers on the performance impact this has.

As far as I can see you never use this value in this patch series so
maybe this is worth deferring for now?
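
For reference, a back-of-the-envelope sizing, assuming the entry stays
padded to 1 << CPU_TLB_ENTRY_BITS bytes and an 8-byte addend:

    64-bit targets: 4 x 8-byte target_ulongs + 8-byte addend = 40 bytes,
                    padded to 64 bytes (CPU_TLB_ENTRY_BITS == 6), i.e.
                    each entry doubles from 32 to 64 bytes.
    32-bit targets: 4 x 4-byte target_ulongs + 8-byte addend = 24 bytes,
                    padded to 32 bytes (CPU_TLB_ENTRY_BITS == 5),
                    unchanged on 64-bit hosts.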

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module
  2015-08-24  0:23 ` [Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module Emilio G. Cota
@ 2015-09-10 14:25   ` Alex Bennée
  2015-09-10 18:00     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-10 14:25 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> This will be used by atomic instruction emulation code.

If we are adding utility functions into the code base like this (which I
can see being useful), we should at least add some documentation with
example calling conventions to docs/, along the lines of the sketch below.
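
Something like this hypothetical snippet, say (Entry, entry_create and
entry_free are illustrative names):

    QemuRadixTree tree;

    /* 64-bit keys, radix 6: each node has 2^6 = 64 slots */
    qemu_radix_tree_init(&tree, 64, 6);

    /* lookup-or-insert: create()/delete() let two threads race to
     * insert the same key; the loser's freshly created entry is
     * deleted and the winner's is returned */
    Entry *e = qemu_radix_tree_find_alloc(&tree, key,
                                          entry_create, entry_free);

    /* lookup only: returns NULL if the key is absent */
    Entry *e2 = qemu_radix_tree_find(&tree, key);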

Have you any performance numbers comparing the efficiency of the radix
approach to a "dumb" implementation?

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/radix-tree.h | 29 ++++++++++++++++++
>  util/Makefile.objs        |  2 +-
>  util/radix-tree.c         | 75 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 105 insertions(+), 1 deletion(-)
>  create mode 100644 include/qemu/radix-tree.h
>  create mode 100644 util/radix-tree.c
>
> diff --git a/include/qemu/radix-tree.h b/include/qemu/radix-tree.h
> new file mode 100644
> index 0000000..a4e1f97
> --- /dev/null
> +++ b/include/qemu/radix-tree.h
> @@ -0,0 +1,29 @@
> +#ifndef RADIX_TREE_H
> +#define RADIX_TREE_H
> +
> +#include <stddef.h>
> +
> +typedef struct QemuRadixNode QemuRadixNode;
> +typedef struct QemuRadixTree QemuRadixTree;
> +
> +struct QemuRadixNode {
> +    void *slots[0];
> +};
> +
> +struct QemuRadixTree {
> +    QemuRadixNode *root;
> +    int radix;
> +    int max_height;
> +};
> +
> +void qemu_radix_tree_init(QemuRadixTree *tree, int bits, int radix);
> +void *qemu_radix_tree_find_alloc(QemuRadixTree *tree, unsigned long index,
> +                                 void *(*create)(unsigned long),
> +                                 void (*delete)(void *));
> +
> +static inline void *qemu_radix_tree_find(QemuRadixTree *t, unsigned long index)
> +{
> +    return qemu_radix_tree_find_alloc(t, index, NULL, NULL);
> +}
> +
> +#endif /* RADIX_TREE_H */
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 114d657..6b18d3d 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -1,4 +1,4 @@
> -util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
> +util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o radix-tree.o
>  util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o event_notifier-win32.o
>  util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o event_notifier-posix.o qemu-openpty.o
>  util-obj-y += envlist.o path.o module.o
> diff --git a/util/radix-tree.c b/util/radix-tree.c
> new file mode 100644
> index 0000000..69eff29
> --- /dev/null
> +++ b/util/radix-tree.c
> @@ -0,0 +1,75 @@
> +/*
> + * radix-tree.c
> + * Non-blocking radix tree.
> + *
> + * Features:
> + * - Concurrent lookups and inserts.
> + * - No support for deletions.
> + *
> + * Conventions:
> + * - Height is counted starting from 0 at the bottom.
> + * - The index is used from left to right, i.e. MSBs are used first. This way
> + *   nearby addresses land in nearby slots, minimising cache/TLB misses.
> + */
> +#include <glib.h>
> +
> +#include "qemu/radix-tree.h"
> +#include "qemu/atomic.h"
> +#include "qemu/bitops.h"
> +#include "qemu/osdep.h"
> +
> +typedef struct QemuRadixNode QemuRadixNode;
> +
> +void *qemu_radix_tree_find_alloc(QemuRadixTree *tree, unsigned long index,
> +                                 void *(*create)(unsigned long),
> +                                 void (*delete)(void *))
> +{
> +    QemuRadixNode *parent;
> +    QemuRadixNode *node = tree->root;
> +    void **slot;
> +    int n_slots = BIT(tree->radix);
> +    int level = tree->max_height - 1;
> +    int shift = (level - 1) * tree->radix;
> +
> +    do {
> +        parent = node;
> +        slot = parent->slots + ((index >> shift) & (n_slots - 1));
> +        node = atomic_read(slot);
> +        smp_read_barrier_depends();
> +        if (node == NULL) {
> +            void *old;
> +            void *new;
> +
> +            if (!create) {
> +                return NULL;
> +            }
> +
> +            if (level == 1) {
> +                node = create(index);
> +            } else {
> +                node = g_malloc0(sizeof(*node) + sizeof(void *) * n_slots);
> +            }
> +            new = node;
> +            /* atomic_cmpxchg is type-safe so we cannot use 'node' here */
> +            old = atomic_cmpxchg(slot, NULL, new);
> +            if (old) {
> +                if (level == 1) {
> +                    delete(node);
> +                } else {
> +                    g_free(node);
> +                }
> +                node = old;
> +            }
> +        }
> +        shift -= tree->radix;
> +        level--;
> +    } while (level > 0);
> +    return node;
> +}
> +
> +void qemu_radix_tree_init(QemuRadixTree *tree, int bits, int radix)
> +{
> +    tree->radix = radix;
> +    tree->max_height = 1 + DIV_ROUND_UP(bits, radix);
> +    tree->root = g_malloc0(sizeof(*tree->root) + sizeof(void *) * BIT(radix));
> +}

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 18/38] tcg: add fences
  2015-08-24  0:23 ` [Qemu-devel] [RFC 18/38] tcg: add fences Emilio G. Cota
@ 2015-09-10 15:28   ` Alex Bennée
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-10 15:28 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tcg/tcg-op.c  |  5 +++++
>  tcg/tcg-op.h  | 18 ++++++++++++++++++
>  tcg/tcg-opc.h |  5 +++++
>  3 files changed, 28 insertions(+)
>
> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> index 45098c3..6d5b1df 100644
> --- a/tcg/tcg-op.c
> +++ b/tcg/tcg-op.c
> @@ -57,6 +57,11 @@ static void tcg_emit_op(TCGContext *ctx, TCGOpcode opc, int args)
>      };
>  }
>  
> +void tcg_gen_op0(TCGContext *ctx, TCGOpcode opc)
> +{
> +    tcg_emit_op(ctx, opc, -1);

Is that -1 always safe? I'm finding the guts of the TCG opbuf a little
hard to follow but you see code like this:

        TCGOp * const op = &s->gen_op_buf[oi];
        TCGArg * const args = &s->gen_opparam_buf[op->args];

and wonder how badly that could go wrong.

> +}
> +
>  void tcg_gen_op1(TCGContext *ctx, TCGOpcode opc, TCGArg a1)
>  {
>      int pi = ctx->gen_next_parm_idx;
> diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
> index d1d763f..52482c0 100644
> --- a/tcg/tcg-op.h
> +++ b/tcg/tcg-op.h
> @@ -28,6 +28,7 @@
>  
>  /* Basic output routines.  Not for general consumption.  */
>  
> +void tcg_gen_op0(TCGContext *, TCGOpcode);
>  void tcg_gen_op1(TCGContext *, TCGOpcode, TCGArg);
>  void tcg_gen_op2(TCGContext *, TCGOpcode, TCGArg, TCGArg);
>  void tcg_gen_op3(TCGContext *, TCGOpcode, TCGArg, TCGArg, TCGArg);
> @@ -698,6 +699,23 @@ static inline void tcg_gen_trunc_i64_i32(TCGv_i32 ret, TCGv_i64 arg)
>      tcg_gen_trunc_shr_i64_i32(ret, arg, 0);
>  }
>  
> +/* fences */
> +
> +static inline void tcg_gen_fence_load(void)
> +{
> +    tcg_gen_op0(&tcg_ctx, INDEX_op_fence_load);
> +}
> +
> +static inline void tcg_gen_fence_store(void)
> +{
> +    tcg_gen_op0(&tcg_ctx, INDEX_op_fence_store);
> +}
> +
> +static inline void tcg_gen_fence_full(void)
> +{
> +    tcg_gen_op0(&tcg_ctx, INDEX_op_fence_full);
> +}
> +
>  /* QEMU specific operations.  */
>  
>  #ifndef TARGET_LONG_BITS
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 13ccb60..85de953 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -167,6 +167,11 @@ DEF(muls2_i64, 2, 2, 0, IMPL64 | IMPL(TCG_TARGET_HAS_muls2_i64))
>  DEF(muluh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_muluh_i64))
>  DEF(mulsh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_mulsh_i64))
>  
> +/* fences */
> +DEF(fence_load, 0, 0, 0, 0)
> +DEF(fence_store, 0, 0, 0, 0)
> +DEF(fence_full, 0, 0, 0, 0)
> +
>  /* QEMU specific */
>  #if TARGET_LONG_BITS > TCG_TARGET_REG_BITS
>  DEF(debug_insn_start, 0, 0, 2, TCG_OPF_NOT_PRESENT)

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb()
  2015-08-24  0:23 ` [Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb() Emilio G. Cota
@ 2015-09-10 16:01   ` Alex Bennée
  2015-09-10 18:05     ` Emilio G. Cota
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Bennée @ 2015-09-10 16:01 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tcg/tcg-op.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
> index 52482c0..3ec9f13 100644
> --- a/tcg/tcg-op.h
> +++ b/tcg/tcg-op.h
> @@ -716,6 +716,16 @@ static inline void tcg_gen_fence_full(void)
>      tcg_gen_op0(&tcg_ctx, INDEX_op_fence_full);
>  }
>  
> +#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
> +static inline void tcg_gen_smp_rmb(void)
> +{ }
> +#else
> +static inline void tcg_gen_smp_rmb(void)
> +{
> +    tcg_gen_fence_load();
> +}
> +#endif

Wrapping up tcg_gen_fence_load like this seems a little pointless. Could
the magic dealing with the backend not be done with something like
TCG_TARGET_HAS_fence_load? On the x86/x86_64 backends this could then
NOP away.
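
That is, always emit the op and let a strongly-ordered backend discard
it; a hypothetical sketch for the x86 backend's tcg_out_op() switch:

    case INDEX_op_fence_load:
        /* x86 does not reorder loads with loads, so emit nothing */
        break;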

> +
>  /* QEMU specific operations.  */
>  
>  #ifndef TARGET_LONG_BITS

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups
  2015-09-10 13:22   ` Alex Bennée
@ 2015-09-10 17:46     ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-09-10 17:46 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad

On Thu, Sep 10, 2015 at 14:22:49 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> 
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  util/qemu-thread-posix.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> > index 04dae0f..3760e27 100644
> > --- a/util/qemu-thread-posix.c
> > +++ b/util/qemu-thread-posix.c
> > @@ -303,7 +303,16 @@ static inline void futex_wake(QemuEvent *ev, int n)
> >  
> >  static inline void futex_wait(QemuEvent *ev, unsigned val)
> >  {
> > -    futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0);
> > +    while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
> > +        switch (errno) {
> > +        case EWOULDBLOCK:
> > +            return;
> > +        case EINTR:
> > +            break; /* get out of switch and retry */
> > +        default:
> > +            abort();
> 
> I'd be tempted to error_exit with the errno in this case so additional
> information is reported before we bail out.

Yes that's a good suggestion.

> The man page seems to indicate other errnos are possible for FUTEX_WAIT,
> although they may be unlikely:
> 
>        EACCES No read access to futex memory.
>        EFAULT Error retrieving timeout information from user space.
> 
>        I guess things would have gone very wrong for these
> 
>        EINVAL Invalid argument.
> 
>        Hard to get wrong
> 
>        ENFILE The system limit on the total number of open files has
>        been reached.
> 
>        Might happen under system load?
> 
>        ENOSYS Invalid operation specified in op.
> 
>        Hardcoded op so no
> 
>        ETIMEDOUT
>               Timeout during the FUTEX_WAIT operation.
> 
>        No timeout specified so we shouldn't hit it

Of these I'd say all would be bugs in our code except for ENFILE, so
it might be worth adding.

Thanks,

		Emilio


* Re: [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry
  2015-09-10 13:49   ` Alex Bennée
@ 2015-09-10 17:50     ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-09-10 17:50 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad

On Thu, Sep 10, 2015 at 14:49:07 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> 
> > Having the physical address in the TLB entry will allow us
> > to portably obtain the physical address of a memory access,
> > which will prove useful when implementing a scalable emulation
> > of atomic instructions.
> >
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  cputlb.c                | 1 +
> >  include/exec/cpu-defs.h | 7 ++++---
> >  2 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/cputlb.c b/cputlb.c
> > index d1ad8e8..1b3673e 100644
> > --- a/cputlb.c
> > +++ b/cputlb.c
> > @@ -409,6 +409,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
> >      } else {
> >          te->addr_write = -1;
> >      }
> > +    te->addr_phys = paddr;
> >  }
> >  
> >  /* Add a new TLB entry, but without specifying the memory
> > diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> > index 5093be2..ca9c85c 100644
> > --- a/include/exec/cpu-defs.h
> > +++ b/include/exec/cpu-defs.h
> > @@ -60,10 +60,10 @@ typedef uint64_t target_ulong;
> >  /* use a fully associative victim tlb of 8 entries */
> >  #define CPU_VTLB_SIZE 8
> >  
> > -#if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
> > -#define CPU_TLB_ENTRY_BITS 4
> > -#else
> > +#if TARGET_LONG_BITS == 32
> >  #define CPU_TLB_ENTRY_BITS 5
> > +#else
> > +#define CPU_TLB_ENTRY_BITS 6
> >  #endif
> >  
> >  /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
> > @@ -110,6 +110,7 @@ typedef struct CPUTLBEntry {
> >              target_ulong addr_read;
> >              target_ulong addr_write;
> >              target_ulong addr_code;
> > +            target_ulong addr_phys;
> >              /* Addend to virtual address to get host address.  IO accesses
> >                 use the corresponding iotlb value.  */
> >              uintptr_t addend;
> 
> So this ends up expanding the TLB entry size and either pushing the
> overall TLB up in size or reducing the number of entries per-TLB so I
> think we would need some numbers on the impact on performance this has.

My tests show little to no perf impact, but note that I work on fairly big
machines that have large caches. I could run some benchmarks, although
this will take a while--will be on vacation for a couple of weeks.

> As far as I can see you never use this value in this patch series so
> maybe this is worth deferring for now?

This is used for atomic instruction emulation in full-system mode; see
aie-helper.c which has calls to helpers added to softmmu_template.h.

		Emilio


* Re: [Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module
  2015-09-10 14:25   ` Alex Bennée
@ 2015-09-10 18:00     ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-09-10 18:00 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad

On Thu, Sep 10, 2015 at 15:25:50 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > This will be used by atomic instruction emulation code.
> 
> If we are adding utility functions into the code base like this (which I
> can see being useful) we should at least add some documentation with
> example calling conventions to docs/

Ack.

> Have you any performance numbers comparing the efficiency of the radix
> approach to a "dumb" implementation?

I tried a few lazy solutions (meaning having to write less code, which
isn't that dumb), such as having a g_hash_table protected by a mutex. But
this means potential contention on that mutex, so that's not a better option.

Whatever the solution might be, this is a slow path (only invoked
on atomic ops or regular stores that affect lines previously accessed
with atomic ops) so scalability is a more important goal than absolute speed.

The big perf impact comes on x86 when instrumenting stores, which is
orthogonal to this; invoking a helper on each store is a huge perf
killer for some benchmarks. It's OK for others, but I still think
we should be able to do better.

Thanks,

		Emilio


* Re: [Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb()
  2015-09-10 16:01   ` Alex Bennée
@ 2015-09-10 18:05     ` Emilio G. Cota
  0 siblings, 0 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-09-10 18:05 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad

On Thu, Sep 10, 2015 at 17:01:14 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  tcg/tcg-op.h | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
> > index 52482c0..3ec9f13 100644
> > --- a/tcg/tcg-op.h
> > +++ b/tcg/tcg-op.h
> > @@ -716,6 +716,16 @@ static inline void tcg_gen_fence_full(void)
> >      tcg_gen_op0(&tcg_ctx, INDEX_op_fence_full);
> >  }
> >  
> > +#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
> > +static inline void tcg_gen_smp_rmb(void)
> > +{ }
> > +#else
> > +static inline void tcg_gen_smp_rmb(void)
> > +{
> > +    tcg_gen_fence_load();
> > +}
> > +#endif
> 
> This seems a little pointless wrapping up tcg_gen_fence_load. Could the
> magic dealing with the backend not be done with something like
> TCG_TARGET_HAS_fence_load. On the x86/x86_64 backends this could then
> NOP away.

This patch made sense at the time as a companion to this other patch:

cpu: add barriers around cpu->tcg_exit_req
(snip)
+++ b/include/exec/gen-icount.h
@@ -16,6 +16,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
 
     exitreq_label = gen_new_label();
     flag = tcg_temp_new_i32();
+    tcg_gen_smp_rmb();
     tcg_gen_ld_i32(flag, cpu_env,
                    offsetof(CPUState, tcg_exit_req) - ENV_OFFSET);
     tcg_gen_brcondi_i32(TCG_COND_NE, flag, 0, exitreq_label);

Paolo had the better idea of calling smp_rmb() once we've
exited TCG, which renders tcg_gen_smp_rmb() unnecessary.
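
Roughly like this, as a sketch (the exact placement in cpu_exec() is
illustrative):

    next_tb = cpu_tb_exec(cpu, tb->tc_ptr);  /* run the generated code */
    /* pairs with the smp_wmb() done by whoever sets cpu->tcg_exit_req,
     * with no fence generated in every TB prologue */
    smp_rmb();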

		Emilio


* Re: [Qemu-devel] [RFC 17/38] aie: add target helpers
  2015-08-24  0:23 ` [Qemu-devel] [RFC 17/38] aie: add target helpers Emilio G. Cota
@ 2015-09-17 15:14   ` Alex Bennée
  2015-09-21  5:18   ` Paolo Bonzini
  1 sibling, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-17 15:14 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  aie-helper.c              | 112 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/exec/cpu-defs.h   |   5 +++
>  include/qemu/aie-helper.h |   6 +++
>  3 files changed, 123 insertions(+)
>  create mode 100644 aie-helper.c
>  create mode 100644 include/qemu/aie-helper.h
>
> diff --git a/aie-helper.c b/aie-helper.c
> new file mode 100644
> index 0000000..7521150
> --- /dev/null
> +++ b/aie-helper.c
> @@ -0,0 +1,112 @@
> +/*
> + * To be included directly from the target's helper.c
> + */
> +#include "qemu/aie.h"
> +
> +#ifdef CONFIG_USER_ONLY
> +static inline
> +hwaddr h_get_ld_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
> +{
> +    return vaddr;
> +}
> +
> +static inline
> +hwaddr h_get_st_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
> +{
> +    return vaddr;
> +}
> +#else
> +static inline
> +hwaddr h_get_ld_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
> +{
> +    return helper_ret_get_ld_phys(env, vaddr, cpu_mmu_index(env), retaddr);
> +}
> +
> +static inline
> +hwaddr h_get_st_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
> +{
> +    return helper_ret_get_st_phys(env, vaddr, cpu_mmu_index(env), retaddr);
> +}
> +#endif /* CONFIG_USER_ONLY */
> +
> +static inline void h_aie_lock(CPUArchState *env, hwaddr paddr)
> +{
> +    AIEEntry *entry = aie_entry_get_lock(paddr);
> +
> +    env->aie_entry = entry;
> +    env->aie_locked = true;
> +}
> +
> +static inline void h_aie_unlock(CPUArchState *env)
> +{
> +    assert(env->aie_entry && env->aie_locked);
> +    qemu_spin_unlock(&env->aie_entry->lock);
> +    env->aie_locked = false;
> +}

This is a little unbalanced: aie_entry_get_lock() returns the entry
with the lock held, but we release it in h_aie_unlock(). Should we
either take the lock in the helper, or have a helper to release the
lock that can be kept in the aie utils?
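
For instance, as a sketch of the second option (aie_entry_unlock is a
hypothetical name):

    /* kept next to aie_entry_get_lock() in the aie utils so that
     * acquisition and release live in one place */
    static inline void aie_entry_unlock(AIEEntry *entry)
    {
        qemu_spin_unlock(&entry->lock);
    }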

> +
> +static inline void h_aie_unlock__done(CPUArchState *env)
> +{
> +    h_aie_unlock(env);
> +    env->aie_entry = NULL;
> +}
> +
> +static inline
> +void aie_ld_lock_ret(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
> +{
> +    hwaddr paddr;
> +
> +    assert(!env->aie_locked);
> +    paddr = h_get_ld_phys(env, vaddr, retaddr);
> +    h_aie_lock(env, paddr);
> +}
> +
> +void HELPER(aie_ld_lock)(CPUArchState *env, target_ulong vaddr)
> +{
> +    aie_ld_lock_ret(env, vaddr, GETRA());
> +}
> +
> +static inline
> +void aie_st_lock_ret(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
> +{
> +    hwaddr paddr;
> +
> +    assert(!env->aie_locked);
> +    paddr = h_get_st_phys(env, vaddr, retaddr);
> +    h_aie_lock(env, paddr);
> +}
> +
> +void HELPER(aie_unlock__done)(CPUArchState *env)
> +{
> +    h_aie_unlock__done(env);
> +}
> +
> +void HELPER(aie_ld_pre)(CPUArchState *env, target_ulong vaddr)
> +{
> +    if (likely(!env->aie_lock_enabled) || env->aie_locked) {
> +        return;
> +    }
> +    aie_ld_lock_ret(env, vaddr, GETRA());
> +}
> +
> +void HELPER(aie_st_pre)(CPUArchState *env, target_ulong vaddr)
> +{
> +    if (unlikely(env->aie_lock_enabled)) {
> +        if (env->aie_locked) {
> +            return;
> +        }
> +        aie_st_lock_ret(env, vaddr, GETRA());
> +    } else {
> +        hwaddr paddr = h_get_st_phys(env, vaddr, GETRA());
> +
> +        if (unlikely(aie_entry_exists(paddr))) {
> +            h_aie_lock(env, paddr);
> +        }
> +    }
> +}
> +
> +void HELPER(aie_st_post)(CPUArchState *env, target_ulong vaddr)
> +{
> +    if (unlikely(!env->aie_lock_enabled && env->aie_locked)) {
> +        h_aie_unlock__done(env);
> +    }
> +}
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index ca9c85c..e6e4568 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -28,6 +28,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/queue.h"
>  #include "tcg-target.h"
> +#include "qemu/aie.h"
>  #ifndef CONFIG_USER_ONLY
>  #include "exec/hwaddr.h"
>  #endif
> @@ -152,5 +153,9 @@ typedef struct CPUIOTLBEntry {
>  #define CPU_COMMON                                                      \
>      /* soft mmu support */                                              \
>      CPU_COMMON_TLB                                                      \
> +    AIEEntry *aie_entry;                                                \
> +    bool aie_locked;                                                    \
> +    bool aie_lock_enabled;                                              \
> +    bool aie_llsc_st_tracking;                                          \
>  
>  #endif
> diff --git a/include/qemu/aie-helper.h b/include/qemu/aie-helper.h
> new file mode 100644
> index 0000000..86a786a
> --- /dev/null
> +++ b/include/qemu/aie-helper.h
> @@ -0,0 +1,6 @@
> +DEF_HELPER_2(aie_ld_pre, void, env, tl)
> +DEF_HELPER_2(aie_st_pre, void, env, tl)
> +DEF_HELPER_2(aie_st_post, void, env, tl)
> +
> +DEF_HELPER_2(aie_ld_lock, void, env, tl)
> +DEF_HELPER_1(aie_unlock__done, void, env)

-- 
Alex Bennée


* Re: [Qemu-devel] [RFC 21/38] target-i386: emulate atomic instructions + barriers using AIE
  2015-08-24  0:23 ` [Qemu-devel] [RFC 21/38] target-i386: emulate atomic instructions + barriers using AIE Emilio G. Cota
@ 2015-09-17 15:30   ` Alex Bennée
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-17 15:30 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: mttcg, mark.burton, a.rigo, qemu-devel, guillaume.delbergue,
	pbonzini, Frederic Konrad


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  aie-helper.c              |   3 +-
>  linux-user/main.c         |   4 +-
>  target-i386/cpu.h         |   3 -
>  target-i386/excp_helper.c |   7 ++
>  target-i386/helper.h      |   6 +-
>  target-i386/mem_helper.c  |  39 +++------
>  target-i386/translate.c   | 217 ++++++++++++++++++++++++++++------------------
>  7 files changed, 162 insertions(+), 117 deletions(-)
>
> diff --git a/aie-helper.c b/aie-helper.c
> index 7521150..a3faf04 100644
> --- a/aie-helper.c
> +++ b/aie-helper.c
> @@ -82,7 +82,8 @@ void HELPER(aie_unlock__done)(CPUArchState *env)
>  
>  void HELPER(aie_ld_pre)(CPUArchState *env, target_ulong vaddr)
>  {
> -    if (likely(!env->aie_lock_enabled) || env->aie_locked) {
> +    assert(env->aie_lock_enabled);
> +    if (env->aie_locked) {
>          return;
>      }
>      aie_ld_lock_ret(env, vaddr, GETRA());
> diff --git a/linux-user/main.c b/linux-user/main.c
> index fd06ce9..98ebe19 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -279,9 +279,9 @@ void cpu_loop(CPUX86State *env)
>      target_siginfo_t info;
>  
>      for(;;) {
> -        cpu_exec_start(cs);
> +        cs->running = true;
>          trapnr = cpu_x86_exec(cs);
> -        cpu_exec_end(cs);
> +        cs->running = false;

Hmm, I could do with more in the commit message about why it is OK to
replace the cpu_exec_start/end() wrappers with a simple flag setting. 

Incidentally, the duplication in linux-user/main.c is terrifying, but
that's not your fault ;-)

>          switch(trapnr) {
>          case 0x80:
>              /* linux syscall from int $0x80 */
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 3655ff3..ead2832 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -1318,9 +1318,6 @@ static inline MemTxAttrs cpu_get_mem_attrs(CPUX86State *env)
>  void cpu_set_mxcsr(CPUX86State *env, uint32_t val);
>  void cpu_set_fpuc(CPUX86State *env, uint16_t val);
>  
> -/* mem_helper.c */
> -void helper_lock_init(void);
> -
>  /* svm_helper.c */
>  void cpu_svm_check_intercept_param(CPUX86State *env1, uint32_t type,
>                                     uint64_t param);
> diff --git a/target-i386/excp_helper.c b/target-i386/excp_helper.c
> index 99fca84..141cab4 100644
> --- a/target-i386/excp_helper.c
> +++ b/target-i386/excp_helper.c
> @@ -96,6 +96,13 @@ static void QEMU_NORETURN raise_interrupt2(CPUX86State *env, int intno,
>  {
>      CPUState *cs = CPU(x86_env_get_cpu(env));
>  
> +    if (unlikely(env->aie_locked)) {
> +        helper_aie_unlock__done(env);
> +    }
> +    if (unlikely(env->aie_lock_enabled)) {
> +        env->aie_lock_enabled = false;
> +    }
> +
>      if (!is_int) {
>          cpu_svm_check_intercept_param(env, SVM_EXIT_EXCP_BASE + intno,
>                                        error_code);
> diff --git a/target-i386/helper.h b/target-i386/helper.h
> index 74308f4..7d92140 100644
> --- a/target-i386/helper.h
> +++ b/target-i386/helper.h
> @@ -1,8 +1,8 @@
>  DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
>  DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
>  
> -DEF_HELPER_0(lock, void)
> -DEF_HELPER_0(unlock, void)
> +DEF_HELPER_1(lock_enable, void, env)
> +DEF_HELPER_1(lock_disable, void, env)
>  DEF_HELPER_3(write_eflags, void, env, tl, i32)
>  DEF_HELPER_1(read_eflags, tl, env)
>  DEF_HELPER_2(divb_AL, void, env, tl)
> @@ -217,3 +217,5 @@ DEF_HELPER_3(rcrl, tl, env, tl, tl)
>  DEF_HELPER_3(rclq, tl, env, tl, tl)
>  DEF_HELPER_3(rcrq, tl, env, tl, tl)
>  #endif
> +
> +#include "qemu/aie-helper.h"
> diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
> index 8bf0da2..60abc8a 100644
> --- a/target-i386/mem_helper.c
> +++ b/target-i386/mem_helper.c
> @@ -21,38 +21,21 @@
>  #include "exec/helper-proto.h"
>  #include "exec/cpu_ldst.h"
>  
> -/* broken thread support */
> +#include "aie-helper.c"
>  
> -#if defined(CONFIG_USER_ONLY)
> -QemuMutex global_cpu_lock;
> -
> -void helper_lock(void)
> -{
> -    qemu_mutex_lock(&global_cpu_lock);
> -}
> -
> -void helper_unlock(void)
> -{
> -    qemu_mutex_unlock(&global_cpu_lock);
> -}
> -
> -void helper_lock_init(void)
> -{
> -    qemu_mutex_init(&global_cpu_lock);
> -}
> -#else
> -void helper_lock(void)
> +void helper_lock_enable(CPUX86State *env)
>  {
> +    env->aie_lock_enabled = true;
>  }
>  
> -void helper_unlock(void)
> -{
> -}
> -
> -void helper_lock_init(void)
> +void helper_lock_disable(CPUX86State *env)
>  {
> +    assert(env->aie_lock_enabled);
> +    if (env->aie_locked) {
> +        h_aie_unlock__done(env);
> +    }
> +    env->aie_lock_enabled = false;
>  }
> -#endif

I'm skipping over the rest of this as the locking of x86 is outside of
my area of expertise. However, I'm currently erring in favour of Alvise's
TCG primitive approach over the heavy use of helpers. It will be
interesting to see a head-to-head in performance on some RISC backends.

>  
>  void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
>  {
> @@ -60,6 +43,7 @@ void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
>      int eflags;
>  
>      eflags = cpu_cc_compute_all(env, CC_OP);
> +    aie_ld_lock_ret(env, a0, GETRA());
>      d = cpu_ldq_data(env, a0);
>      if (d == (((uint64_t)env->regs[R_EDX] << 32) | (uint32_t)env->regs[R_EAX])) {
>          cpu_stq_data(env, a0, ((uint64_t)env->regs[R_ECX] << 32) | (uint32_t)env->regs[R_EBX]);
> @@ -71,6 +55,7 @@ void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
>          env->regs[R_EAX] = (uint32_t)d;
>          eflags &= ~CC_Z;
>      }
> +    helper_aie_unlock__done(env);
>      CC_SRC = eflags;
>  }
>  
> @@ -84,6 +69,7 @@ void helper_cmpxchg16b(CPUX86State *env, target_ulong a0)
>          raise_exception(env, EXCP0D_GPF);
>      }
>      eflags = cpu_cc_compute_all(env, CC_OP);
> +    aie_ld_lock_ret(env, a0, GETRA());
>      d0 = cpu_ldq_data(env, a0);
>      d1 = cpu_ldq_data(env, a0 + 8);
>      if (d0 == env->regs[R_EAX] && d1 == env->regs[R_EDX]) {
> @@ -98,6 +84,7 @@ void helper_cmpxchg16b(CPUX86State *env, target_ulong a0)
>          env->regs[R_EAX] = d0;
>          eflags &= ~CC_Z;
>      }
> +    helper_aie_unlock__done(env);
>      CC_SRC = eflags;
>  }
>  #endif
> diff --git a/target-i386/translate.c b/target-i386/translate.c
> index 443bf60..4d6030f 100644
> --- a/target-i386/translate.c
> +++ b/target-i386/translate.c
> @@ -300,6 +300,48 @@ static inline bool byte_reg_is_xH(int reg)
>      return true;
>  }
>  
> +static inline void gen_i386_ld_i32(DisasContext *s, TCGv_i32 val, TCGv addr,
> +                                   TCGArg idx, TCGMemOp op)
> +{
> +    if (s->prefix & PREFIX_LOCK) {
> +        gen_helper_aie_ld_pre(cpu_env, addr);
> +    }
> +    tcg_gen_qemu_ld_i32(val, addr, idx, op);
> +}
> +
> +static inline void gen_i386_ld_i64(DisasContext *s, TCGv_i64 val, TCGv addr,
> +                                   TCGArg idx, TCGMemOp op)
> +{
> +    if (s->prefix & PREFIX_LOCK) {
> +        gen_helper_aie_ld_pre(cpu_env, addr);
> +    }
> +    tcg_gen_qemu_ld_i64(val, addr, idx, op);
> +}
> +
> +static inline
> +void gen_i386_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp op)
> +{
> +    gen_helper_aie_st_pre(cpu_env, addr);
> +    tcg_gen_qemu_st_i32(val, addr, idx, op);
> +    gen_helper_aie_st_post(cpu_env, addr);
> +}
> +
> +static inline
> +void gen_i386_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp op)
> +{
> +    gen_helper_aie_st_pre(cpu_env, addr);
> +    tcg_gen_qemu_st_i64(val, addr, idx, op);
> +    gen_helper_aie_st_post(cpu_env, addr);
> +}
> +
> +#if TARGET_LONG_BITS == 32
> +#define gen_i386_ld_tl  gen_i386_ld_i32
> +#define gen_i386_st_tl  gen_i386_st_i32
> +#else
> +#define gen_i386_ld_tl  gen_i386_ld_i64
> +#define gen_i386_st_tl  gen_i386_st_i64
> +#endif
> +
>  /* Select the size of a push/pop operation.  */
>  static inline TCGMemOp mo_pushpop(DisasContext *s, TCGMemOp ot)
>  {
> @@ -479,11 +521,23 @@ static inline void gen_op_addq_A0_reg_sN(int shift, int reg)
>  
>  static inline void gen_op_ld_v(DisasContext *s, int idx, TCGv t0, TCGv a0)
>  {
> +    gen_i386_ld_tl(s, t0, a0, s->mem_index, idx | MO_LE);
> +}
> +
> +static inline
> +void gen_op_ld_v_nolock(DisasContext *s, int idx, TCGv t0, TCGv a0)
> +{
>      tcg_gen_qemu_ld_tl(t0, a0, s->mem_index, idx | MO_LE);
>  }
>  
>  static inline void gen_op_st_v(DisasContext *s, int idx, TCGv t0, TCGv a0)
>  {
> +    gen_i386_st_tl(t0, a0, s->mem_index, idx | MO_LE);
> +}
> +
> +static inline
> +void gen_op_st_v_nolock(DisasContext *s, int idx, TCGv t0, TCGv a0)
> +{
>      tcg_gen_qemu_st_tl(t0, a0, s->mem_index, idx | MO_LE);
>  }
>  
> @@ -2587,23 +2641,23 @@ static void gen_jmp(DisasContext *s, target_ulong eip)
>  
>  static inline void gen_ldq_env_A0(DisasContext *s, int offset)
>  {
> -    tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
> +    gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
>      tcg_gen_st_i64(cpu_tmp1_i64, cpu_env, offset);
>  }
>  
>  static inline void gen_stq_env_A0(DisasContext *s, int offset)
>  {
>      tcg_gen_ld_i64(cpu_tmp1_i64, cpu_env, offset);
> -    tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
> +    gen_i386_st_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
>  }
>  
>  static inline void gen_ldo_env_A0(DisasContext *s, int offset)
>  {
>      int mem_index = s->mem_index;
> -    tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0, mem_index, MO_LEQ);
> +    gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0, mem_index, MO_LEQ);
>      tcg_gen_st_i64(cpu_tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(0)));
>      tcg_gen_addi_tl(cpu_tmp0, cpu_A0, 8);
> -    tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_tmp0, mem_index, MO_LEQ);
> +    gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_tmp0, mem_index, MO_LEQ);
>      tcg_gen_st_i64(cpu_tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(1)));
>  }
>  
> @@ -2611,10 +2665,10 @@ static inline void gen_sto_env_A0(DisasContext *s, int offset)
>  {
>      int mem_index = s->mem_index;
>      tcg_gen_ld_i64(cpu_tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(0)));
> -    tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0, mem_index, MO_LEQ);
> +    gen_i386_st_i64(cpu_tmp1_i64, cpu_A0, mem_index, MO_LEQ);
>      tcg_gen_addi_tl(cpu_tmp0, cpu_A0, 8);
>      tcg_gen_ld_i64(cpu_tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(1)));
> -    tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_tmp0, mem_index, MO_LEQ);
> +    gen_i386_st_i64(cpu_tmp1_i64, cpu_tmp0, mem_index, MO_LEQ);
>  }
>  
>  static inline void gen_op_movo(int d_offset, int s_offset)
> @@ -3643,14 +3697,14 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                          break;
>                      case 0x21: case 0x31: /* pmovsxbd, pmovzxbd */
>                      case 0x24: case 0x34: /* pmovsxwq, pmovzxwq */
> -                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                          tcg_gen_st_i32(cpu_tmp2_i32, cpu_env, op2_offset +
>                                          offsetof(XMMReg, XMM_L(0)));
>                          break;
>                      case 0x22: case 0x32: /* pmovsxbq, pmovzxbq */
> -                        tcg_gen_qemu_ld_tl(cpu_tmp0, cpu_A0,
> -                                           s->mem_index, MO_LEUW);
> +                        gen_i386_ld_tl(s, cpu_tmp0, cpu_A0, s->mem_index,
> +                                       MO_LEUW);
>                          tcg_gen_st16_tl(cpu_tmp0, cpu_env, op2_offset +
>                                          offsetof(XMMReg, XMM_W(0)));
>                          break;
> @@ -3738,12 +3792,12 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>  
>                  gen_lea_modrm(env, s, modrm);
>                  if ((b & 1) == 0) {
> -                    tcg_gen_qemu_ld_tl(cpu_T[0], cpu_A0,
> -                                       s->mem_index, ot | MO_BE);
> +                    gen_i386_ld_tl(s, cpu_T[0], cpu_A0, s->mem_index,
> +                                   ot | MO_BE);
>                      gen_op_mov_reg_v(ot, reg, cpu_T[0]);
>                  } else {
> -                    tcg_gen_qemu_st_tl(cpu_regs[reg], cpu_A0,
> -                                       s->mem_index, ot | MO_BE);
> +                    gen_i386_st_tl(cpu_regs[reg], cpu_A0,
> +                                   s->mem_index, ot | MO_BE);
>                  }
>                  break;
>  
> @@ -4079,8 +4133,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                      if (mod == 3) {
>                          gen_op_mov_reg_v(ot, rm, cpu_T[0]);
>                      } else {
> -                        tcg_gen_qemu_st_tl(cpu_T[0], cpu_A0,
> -                                           s->mem_index, MO_UB);
> +                        gen_i386_st_tl(cpu_T[0], cpu_A0, s->mem_index, MO_UB);
>                      }
>                      break;
>                  case 0x15: /* pextrw */
> @@ -4089,8 +4142,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                      if (mod == 3) {
>                          gen_op_mov_reg_v(ot, rm, cpu_T[0]);
>                      } else {
> -                        tcg_gen_qemu_st_tl(cpu_T[0], cpu_A0,
> -                                           s->mem_index, MO_LEUW);
> +                        gen_i386_st_tl(cpu_T[0], cpu_A0, s->mem_index, MO_LEUW);
>                      }
>                      break;
>                  case 0x16:
> @@ -4101,8 +4153,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                          if (mod == 3) {
>                              tcg_gen_extu_i32_tl(cpu_regs[rm], cpu_tmp2_i32);
>                          } else {
> -                            tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
> -                                                s->mem_index, MO_LEUL);
> +                            gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
> +                                            s->mem_index, MO_LEUL);
>                          }
>                      } else { /* pextrq */
>  #ifdef TARGET_X86_64
> @@ -4112,8 +4164,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                          if (mod == 3) {
>                              tcg_gen_mov_i64(cpu_regs[rm], cpu_tmp1_i64);
>                          } else {
> -                            tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0,
> -                                                s->mem_index, MO_LEQ);
> +                            gen_i386_st_i64(cpu_tmp1_i64, cpu_A0,
> +                                            s->mem_index, MO_LEQ);
>                          }
>  #else
>                          goto illegal_op;
> @@ -4126,16 +4178,15 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                      if (mod == 3) {
>                          gen_op_mov_reg_v(ot, rm, cpu_T[0]);
>                      } else {
> -                        tcg_gen_qemu_st_tl(cpu_T[0], cpu_A0,
> -                                           s->mem_index, MO_LEUL);
> +                        gen_i386_st_tl(cpu_T[0], cpu_A0, s->mem_index, MO_LEUL);
>                      }
>                      break;
>                  case 0x20: /* pinsrb */
>                      if (mod == 3) {
>                          gen_op_mov_v_reg(MO_32, cpu_T[0], rm);
>                      } else {
> -                        tcg_gen_qemu_ld_tl(cpu_T[0], cpu_A0,
> -                                           s->mem_index, MO_UB);
> +                        gen_i386_ld_tl(s, cpu_T[0], cpu_A0, s->mem_index,
> +                                       MO_UB);
>                      }
>                      tcg_gen_st8_tl(cpu_T[0], cpu_env, offsetof(CPUX86State,
>                                              xmm_regs[reg].XMM_B(val & 15)));
> @@ -4146,8 +4197,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                                          offsetof(CPUX86State,xmm_regs[rm]
>                                                  .XMM_L((val >> 6) & 3)));
>                      } else {
> -                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                      }
>                      tcg_gen_st_i32(cpu_tmp2_i32, cpu_env,
>                                      offsetof(CPUX86State,xmm_regs[reg]
> @@ -4174,8 +4225,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                          if (mod == 3) {
>                              tcg_gen_trunc_tl_i32(cpu_tmp2_i32, cpu_regs[rm]);
>                          } else {
> -                            tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                                s->mem_index, MO_LEUL);
> +                            gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                            s->mem_index, MO_LEUL);
>                          }
>                          tcg_gen_st_i32(cpu_tmp2_i32, cpu_env,
>                                          offsetof(CPUX86State,
> @@ -4185,8 +4236,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
>                          if (mod == 3) {
>                              gen_op_mov_v_reg(ot, cpu_tmp1_i64, rm);
>                          } else {
> -                            tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0,
> -                                                s->mem_index, MO_LEQ);
> +                            gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0,
> +                                            s->mem_index, MO_LEQ);
>                          }
>                          tcg_gen_st_i64(cpu_tmp1_i64, cpu_env,
>                                          offsetof(CPUX86State,
> @@ -4567,7 +4618,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>  
>      /* lock generation */
>      if (prefixes & PREFIX_LOCK)
> -        gen_helper_lock();
> +        gen_helper_lock_enable(cpu_env);
>  
>      /* now check op code */
>   reswitch:
> @@ -5567,13 +5618,15 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>          } else {
>              gen_lea_modrm(env, s, modrm);
>              gen_op_mov_v_reg(ot, cpu_T[0], reg);
> -            /* for xchg, lock is implicit */
> -            if (!(prefixes & PREFIX_LOCK))
> -                gen_helper_lock();
> -            gen_op_ld_v(s, ot, cpu_T[1], cpu_A0);
> -            gen_op_st_v(s, ot, cpu_T[0], cpu_A0);
> +            /*
> +             * For xchg, lock is implicit, so take it unconditionally; unlock
> +             * here if the LOCK prefix was absent, else unlock at insn end.
> +             */
> +            gen_helper_aie_ld_lock(cpu_env, cpu_A0);
> +            gen_op_ld_v_nolock(s, ot, cpu_T[1], cpu_A0);
> +            gen_op_st_v_nolock(s, ot, cpu_T[0], cpu_A0);
>              if (!(prefixes & PREFIX_LOCK))
> -                gen_helper_unlock();
> +                gen_helper_aie_unlock__done(cpu_env);
>              gen_op_mov_reg_v(ot, reg, cpu_T[1]);
>          }
>          break;
> @@ -5724,24 +5777,24 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>  
>                      switch(op >> 4) {
>                      case 0:
> -                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                          gen_helper_flds_FT0(cpu_env, cpu_tmp2_i32);
>                          break;
>                      case 1:
> -                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                          gen_helper_fildl_FT0(cpu_env, cpu_tmp2_i32);
>                          break;
>                      case 2:
> -                        tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0,
> -                                            s->mem_index, MO_LEQ);
> +                        gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0,
> +                                        s->mem_index, MO_LEQ);
>                          gen_helper_fldl_FT0(cpu_env, cpu_tmp1_i64);
>                          break;
>                      case 3:
>                      default:
> -                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LESW);
> +                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LESW);
>                          gen_helper_fildl_FT0(cpu_env, cpu_tmp2_i32);
>                          break;
>                      }
> @@ -5763,24 +5816,24 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  case 0:
>                      switch(op >> 4) {
>                      case 0:
> -                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                          gen_helper_flds_ST0(cpu_env, cpu_tmp2_i32);
>                          break;
>                      case 1:
> -                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                          gen_helper_fildl_ST0(cpu_env, cpu_tmp2_i32);
>                          break;
>                      case 2:
> -                        tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0,
> -                                            s->mem_index, MO_LEQ);
> +                        gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0,
> +                                        s->mem_index, MO_LEQ);
>                          gen_helper_fldl_ST0(cpu_env, cpu_tmp1_i64);
>                          break;
>                      case 3:
>                      default:
> -                        tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LESW);
> +                        gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LESW);
>                          gen_helper_fildl_ST0(cpu_env, cpu_tmp2_i32);
>                          break;
>                      }
> @@ -5790,19 +5843,19 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                      switch(op >> 4) {
>                      case 1:
>                          gen_helper_fisttl_ST0(cpu_tmp2_i32, cpu_env);
> -                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                          break;
>                      case 2:
>                          gen_helper_fisttll_ST0(cpu_tmp1_i64, cpu_env);
> -                        tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0,
> -                                            s->mem_index, MO_LEQ);
> +                        gen_i386_st_i64(cpu_tmp1_i64, cpu_A0,
> +                                        s->mem_index, MO_LEQ);
>                          break;
>                      case 3:
>                      default:
>                          gen_helper_fistt_ST0(cpu_tmp2_i32, cpu_env);
> -                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUW);
> +                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUW);
>                          break;
>                      }
>                      gen_helper_fpop(cpu_env);
> @@ -5811,24 +5864,24 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                      switch(op >> 4) {
>                      case 0:
>                          gen_helper_fsts_ST0(cpu_tmp2_i32, cpu_env);
> -                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                          break;
>                      case 1:
>                          gen_helper_fistl_ST0(cpu_tmp2_i32, cpu_env);
> -                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUL);
> +                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUL);
>                          break;
>                      case 2:
>                          gen_helper_fstl_ST0(cpu_tmp1_i64, cpu_env);
> -                        tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0,
> -                                            s->mem_index, MO_LEQ);
> +                        gen_i386_st_i64(cpu_tmp1_i64, cpu_A0,
> +                                        s->mem_index, MO_LEQ);
>                          break;
>                      case 3:
>                      default:
>                          gen_helper_fist_ST0(cpu_tmp2_i32, cpu_env);
> -                        tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
> -                                            s->mem_index, MO_LEUW);
> +                        gen_i386_st_i32(cpu_tmp2_i32, cpu_A0,
> +                                        s->mem_index, MO_LEUW);
>                          break;
>                      }
>                      if ((op & 7) == 3)
> @@ -5842,8 +5895,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  gen_helper_fldenv(cpu_env, cpu_A0, tcg_const_i32(dflag - 1));
>                  break;
>              case 0x0d: /* fldcw mem */
> -                tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                    s->mem_index, MO_LEUW);
> +                gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0, s->mem_index, MO_LEUW);
>                  gen_helper_fldcw(cpu_env, cpu_tmp2_i32);
>                  break;
>              case 0x0e: /* fnstenv mem */
> @@ -5853,8 +5905,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  break;
>              case 0x0f: /* fnstcw mem */
>                  gen_helper_fnstcw(cpu_tmp2_i32, cpu_env);
> -                tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
> -                                    s->mem_index, MO_LEUW);
> +                gen_i386_st_i32(cpu_tmp2_i32, cpu_A0, s->mem_index, MO_LEUW);
>                  break;
>              case 0x1d: /* fldt mem */
>                  gen_update_cc_op(s);
> @@ -5879,8 +5930,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  break;
>              case 0x2f: /* fnstsw mem */
>                  gen_helper_fnstsw(cpu_tmp2_i32, cpu_env);
> -                tcg_gen_qemu_st_i32(cpu_tmp2_i32, cpu_A0,
> -                                    s->mem_index, MO_LEUW);
> +                gen_i386_st_i32(cpu_tmp2_i32, cpu_A0, s->mem_index, MO_LEUW);
>                  break;
>              case 0x3c: /* fbld */
>                  gen_update_cc_op(s);
> @@ -5894,12 +5944,12 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  gen_helper_fpop(cpu_env);
>                  break;
>              case 0x3d: /* fildll */
> -                tcg_gen_qemu_ld_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
> +                gen_i386_ld_i64(s, cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
>                  gen_helper_fildll_ST0(cpu_env, cpu_tmp1_i64);
>                  break;
>              case 0x3f: /* fistpll */
>                  gen_helper_fistll_ST0(cpu_tmp1_i64, cpu_env);
> -                tcg_gen_qemu_st_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
> +                gen_i386_st_i64(cpu_tmp1_i64, cpu_A0, s->mem_index, MO_LEQ);
>                  gen_helper_fpop(cpu_env);
>                  break;
>              default:
> @@ -7754,8 +7804,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  goto illegal_op;
>              gen_lea_modrm(env, s, modrm);
>              if (op == 2) {
> -                tcg_gen_qemu_ld_i32(cpu_tmp2_i32, cpu_A0,
> -                                    s->mem_index, MO_LEUL);
> +                gen_i386_ld_i32(s, cpu_tmp2_i32, cpu_A0, s->mem_index, MO_LEUL);
>                  gen_helper_ldmxcsr(cpu_env, cpu_tmp2_i32);
>              } else {
>                  tcg_gen_ld32u_tl(cpu_T[0], cpu_env, offsetof(CPUX86State, mxcsr));
> @@ -7763,9 +7812,14 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>              }
>              break;
>          case 5: /* lfence */
> +            if ((modrm & 0xc7) != 0xc0 || !(s->cpuid_features & CPUID_SSE2))
> +                goto illegal_op;
> +            tcg_gen_fence_load();
> +            break;
>          case 6: /* mfence */
>              if ((modrm & 0xc7) != 0xc0 || !(s->cpuid_features & CPUID_SSE2))
>                  goto illegal_op;
> +            tcg_gen_fence_full();
>              break;
>          case 7: /* sfence / clflush */
>              if ((modrm & 0xc7) == 0xc0) {
> @@ -7773,6 +7825,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  /* XXX: also check for cpuid_ext2_features & CPUID_EXT2_EMMX */
>                  if (!(s->cpuid_features & CPUID_SSE))
>                      goto illegal_op;
> +                tcg_gen_fence_store();
>              } else {
>                  /* clflush */
>                  if (!(s->cpuid_features & CPUID_CLFLUSH))
> @@ -7841,11 +7894,11 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>      }
>      /* lock generation */
>      if (s->prefix & PREFIX_LOCK)
> -        gen_helper_unlock();
> +        gen_helper_lock_disable(cpu_env);
>      return s->pc;
>   illegal_op:
>      if (s->prefix & PREFIX_LOCK)
> -        gen_helper_unlock();
> +        gen_helper_lock_disable(cpu_env);
>      /* XXX: ensure that no lock was generated */
>      gen_exception(s, EXCP06_ILLOP, pc_start - s->cs_base);
>      return s->pc;
> @@ -7899,8 +7952,6 @@ void optimize_flags_init(void)
>                                           offsetof(CPUX86State, regs[i]),
>                                           reg_names[i]);
>      }
> -
> -    helper_lock_init();
>  }
>  
>  /* generate intermediate code in gen_opc_buf and gen_opparam_buf for

-- 
Alex Bennée

* Re: [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry
  2015-08-24  0:23 ` [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry Emilio G. Cota
  2015-09-10 13:49   ` Alex Bennée
@ 2015-09-21  5:01   ` Paolo Bonzini
  1 sibling, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-09-21  5:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: alex.bennee, Frederic Konrad, mark.burton, a.rigo, guillaume.delbergue



On 24/08/2015 02:23, Emilio G. Cota wrote:
> Having the physical address in the TLB entry will allow us
> to portably obtain the physical address of a memory access,
> which will prove useful when implementing a scalable emulation
> of atomic instructions.

It occurred to me that addr_read and addr_code differ only in the
bottom TARGET_PAGE_BITS bits, and those bits are always zero in
addr_phys.

So we could stash (addr_read ^ addr_code) in the low bits of addr_phys
and recover addr_code as

   addr_read ^ (addr_phys & (TARGET_PAGE_SIZE - 1))

discarding addr_phys's own bottom bits, which are zero anyway.  This
would make it possible to include addr_phys without growing the size
of the TLB entry.
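
In code the packing might look roughly like this (a sketch with
illustrative names; uint64_t stands in for target_ulong/hwaddr, and
the real CPUTLBEntry flag handling differs):

    /* Fold the addr_read/addr_code delta into the free low bits of
     * the page-aligned physical address. */
    static inline uint64_t pack_phys_code(uint64_t addr_phys,
                                          uint64_t addr_read,
                                          uint64_t addr_code)
    {
        return (addr_phys & TARGET_PAGE_MASK)
               | ((addr_read ^ addr_code) & (TARGET_PAGE_SIZE - 1));
    }

    /* Recover addr_code by XOR-ing the stored delta back into addr_read. */
    static inline uint64_t unpack_code(uint64_t packed, uint64_t addr_read)
    {
        return addr_read ^ (packed & (TARGET_PAGE_SIZE - 1));
    }

    /* Recover the physical address by masking off the delta bits. */
    static inline uint64_t unpack_phys(uint64_t packed)
    {
        return packed & TARGET_PAGE_MASK;
    }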

Paolo

* Re: [Qemu-devel] [RFC 17/38] aie: add target helpers
  2015-08-24  0:23 ` [Qemu-devel] [RFC 17/38] aie: add target helpers Emilio G. Cota
  2015-09-17 15:14   ` Alex Bennée
@ 2015-09-21  5:18   ` Paolo Bonzini
  2015-09-21 20:59     ` Alex Bennée
  1 sibling, 1 reply; 110+ messages in thread
From: Paolo Bonzini @ 2015-09-21  5:18 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel, mttcg
  Cc: alex.bennee, Frederic Konrad, mark.burton, a.rigo, guillaume.delbergue



On 24/08/2015 02:23, Emilio G. Cota wrote:
> +void HELPER(aie_st_pre)(CPUArchState *env, target_ulong vaddr)
> +{
> +    if (unlikely(env->aie_lock_enabled)) {
> +        if (env->aie_locked) {
> +            return;
> +        }

Now that I've reviewed your code more carefully, the approach you're
using looks more promising than I thought.  There are advantages over
Alvise's code, namely:

- cache-line vs. page granularity

- avoiding the global TLB flush (due to the virtually indexed TLBs)

- easy support for user-mode emulation

and some of the disadvantages look more easily fixable than I thought,
too (e.g. TLB entry bloat).

The main advantage of Alvise's code, on the other hand, is the minimal
overhead when there are no active LL/SC combinations and the better
integration with TCG.

A random idea: would it be possible to move some of the helper code to
generated TCG code?  For example, maintaining a count of outstanding
load-locked operations and forcing the slow path for stores if it is
non-zero?
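
Roughly something like this (a sketch only: "aie_ll_count" is a
made-up per-CPU field, and the slow path reuses the helper from this
patch):

    static void gen_store_check_ll(TCGv addr)
    {
        TCGv_i32 count = tcg_temp_new_i32();
        TCGLabel *no_ll = gen_new_label();

        /* Inline fast path: skip the helper when no LL is outstanding. */
        tcg_gen_ld_i32(count, cpu_env,
                       offsetof(CPUX86State, aie_ll_count));
        tcg_gen_brcondi_i32(TCG_COND_EQ, count, 0, no_ll);
        gen_helper_aie_st_pre(cpu_env, addr);   /* slow path */
        gen_set_label(no_ll);
        tcg_temp_free_i32(count);
    }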

Paolo

> +        aie_st_lock_ret(env, vaddr, GETRA());
> +    } else {
> +        hwaddr paddr = h_get_st_phys(env, vaddr, GETRA());
> +
> +        if (unlikely(aie_entry_exists(paddr))) {
> +            h_aie_lock(env, paddr);
> +        }
> +    }
> +}

* Re: [Qemu-devel] [RFC 17/38] aie: add target helpers
  2015-09-21  5:18   ` Paolo Bonzini
@ 2015-09-21 20:59     ` Alex Bennée
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Bennée @ 2015-09-21 20:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, mark.burton, qemu-devel, a.rigo, Emilio G. Cota,
	guillaume.delbergue, Frederic Konrad


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 24/08/2015 02:23, Emilio G. Cota wrote:
>> +void HELPER(aie_st_pre)(CPUArchState *env, target_ulong vaddr)
>> +{
>> +    if (unlikely(env->aie_lock_enabled)) {
>> +        if (env->aie_locked) {
>> +            return;
>> +        }
>
> Now that I've reviewed your code more carefully, the approach you're
> using looks more promising than I thought.  There are advantages over
> Alvise's code, namely:
>
> - cache-line vs. page granularity

There is nothing that stops Alvise's slow-path helpers from using a
finer granularity when deciding whether the exclusive flag needs
tripping. However, I do like the lookup approach in these patches.
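
For instance, the helper could mask the physical address down to a
cache line rather than a page before consulting the exclusive state;
a sketch (the line size is made up):

    #define EXCL_LINE_BITS 6    /* assume 64-byte cache lines */

    /* Index the exclusive state by cache line instead of by page. */
    static inline hwaddr excl_line_of(hwaddr paddr)
    {
        return paddr >> EXCL_LINE_BITS;
    }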

> - avoiding the global TLB flush (due to the virtually indexed TLBs)
>
> - easy support for user-mode emulation
>
> and some of the disadvantages look more easily fixable than I thought,
> too (e.g. TLB entry bloat).
>
> The main advantage of Alvise's code, on the other hand, is the minimal
> overhead when there are no active LL/SC combinations and the better
> integration with TCG.

I too prefer expressing LL/SC semantics via TCG ops.

> A random idea: would it be possible to move some of the helper code to
> generated TCG code?  For example, maintaining a count of outstanding
> load-locked operations and forcing the slow path for stores if it is
> non-zero?
>
> Paolo
>
>> +        aie_st_lock_ret(env, vaddr, GETRA());
>> +    } else {
>> +        hwaddr paddr = h_get_st_phys(env, vaddr, GETRA());
>> +
>> +        if (unlikely(aie_entry_exists(paddr))) {
>> +            h_aie_lock(env, paddr);
>> +        }
>> +    }
>> +}

-- 
Alex Bennée

Thread overview: 110+ messages
2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow Emilio G. Cota
2015-09-07 15:33   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 02/38] hw/i386/kvmvapic: add missing include of tcg.h Emilio G. Cota
2015-09-07 15:49   ` Alex Bennée
2015-09-07 16:11     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 03/38] cpu-exec: set current_cpu at cpu_exec() Emilio G. Cota
2015-08-24  1:03   ` Paolo Bonzini
2015-08-25  0:41     ` [Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication Emilio G. Cota
2015-08-25  0:41       ` [Qemu-devel] [PATCH 2/4] linux-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
2015-08-25  0:41       ` [Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion Emilio G. Cota
2015-08-26  0:22         ` Paolo Bonzini
2015-08-25  0:41       ` [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
2015-08-25 18:07         ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 04/38] translate-all: remove volatile from have_tb_lock Emilio G. Cota
2015-09-07 15:50   ` Alex Bennée
2015-09-07 16:12     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions Emilio G. Cota
2015-08-24  1:04   ` Paolo Bonzini
2015-08-25  2:30     ` Emilio G. Cota
2015-08-25 19:30       ` Emilio G. Cota
2015-08-25 22:53         ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 06/38] seqlock: add missing 'inline' to seqlock_read_retry Emilio G. Cota
2015-09-07 15:50   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 07/38] seqlock: read sequence number atomically Emilio G. Cota
2015-09-07 15:53   ` Alex Bennée
2015-09-07 16:13     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork Emilio G. Cota
2015-09-08 17:34   ` Alex Bennée
2015-09-08 19:03     ` Emilio G. Cota
2015-09-09  9:35       ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 09/38] rcu: fix comment with s/rcu_gp_lock/rcu_registry_lock/ Emilio G. Cota
2015-09-10 11:18   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 10/38] translate-all: remove obsolete comment about l1_map Emilio G. Cota
2015-09-10 11:59   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups Emilio G. Cota
2015-09-10 13:22   ` Alex Bennée
2015-09-10 17:46     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 12/38] linux-user: call rcu_(un)register_thread on pthread_(exit|create) Emilio G. Cota
2015-08-25  0:45   ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry Emilio G. Cota
2015-09-10 13:49   ` Alex Bennée
2015-09-10 17:50     ` Emilio G. Cota
2015-09-21  5:01   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses Emilio G. Cota
2015-08-24  2:02   ` Paolo Bonzini
2015-08-25  2:47     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module Emilio G. Cota
2015-09-10 14:25   ` Alex Bennée
2015-09-10 18:00     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 16/38] aie: add module for Atomic Instruction Emulation Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 17/38] aie: add target helpers Emilio G. Cota
2015-09-17 15:14   ` Alex Bennée
2015-09-21  5:18   ` Paolo Bonzini
2015-09-21 20:59     ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 18/38] tcg: add fences Emilio G. Cota
2015-09-10 15:28   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb() Emilio G. Cota
2015-09-10 16:01   ` Alex Bennée
2015-09-10 18:05     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 20/38] tcg/i386: implement fences Emilio G. Cota
2015-08-24  1:32   ` Paolo Bonzini
2015-08-25  3:02     ` Emilio G. Cota
2015-08-25 22:55       ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 21/38] target-i386: emulate atomic instructions + barriers using AIE Emilio G. Cota
2015-09-17 15:30   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically Emilio G. Cota
2015-08-24  1:09   ` Paolo Bonzini
2015-08-25 20:36     ` Emilio G. Cota
2015-08-25 22:52       ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling Emilio G. Cota
2015-09-09 10:13   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop Emilio G. Cota
2015-08-24  2:01   ` Paolo Bonzini
2015-08-25 21:16     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req Emilio G. Cota
2015-08-24  2:01   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock Emilio G. Cota
2015-08-24  1:14   ` Paolo Bonzini
2015-08-25 21:46     ` Emilio G. Cota
2015-08-25 22:49       ` Paolo Bonzini
2015-09-04  8:50   ` Paolo Bonzini
2015-09-04 10:04     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 27/38] cpu-exec: convert tb_invalidated_flag into a per-TB flag Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 28/38] cpu-exec: use RCU to perform lockless TB lookups Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 29/38] tcg: export have_tb_lock Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 30/38] translate-all: add tb_lock assertions Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode Emilio G. Cota
2015-08-24  1:07   ` Paolo Bonzini
2015-08-25 21:54     ` Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 32/38] cpu list: convert to RCU QLIST Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep Emilio G. Cota
2015-08-24  1:24   ` Paolo Bonzini
2015-08-25 22:18     ` Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 34/38] translate-all: use tcg_sched_work for tb_flush Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all Emilio G. Cota
2015-08-24  1:29   ` Paolo Bonzini
2015-08-25 22:31     ` Emilio G. Cota
2015-08-26  0:25       ` Paolo Bonzini
2015-09-01 16:10   ` Alex Bennée
2015-09-01 19:38     ` Emilio G. Cota
2015-09-01 20:18       ` Peter Maydell
2015-08-24  0:24 ` [Qemu-devel] [RFC 36/38] cputlb: use tcg_sched_work for tlb_flush_page_all Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 37/38] cpus: remove async_run_safe_work_on_cpu Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE" Emilio G. Cota
2015-08-24  1:29   ` Paolo Bonzini
2015-08-24  2:01 ` [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Paolo Bonzini
2015-08-25 22:36   ` Emilio G. Cota
2015-08-24 16:08 ` Artyom Tarasenko
2015-08-24 20:16   ` Emilio G. Cota
