[Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics

From: Emilio G. Cota @ 2016-06-27 19:01 UTC
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

This RFC provides softmmu/cpu primitives for cmpxchg operations.
These primitives are not TCG ops; they are meant to be called from target
helper functions. They apply to both system and user modes.

To increase performance for architectures that have "locked" versions
of regular ops (e.g. x86), helpers are also provided for most
of these atomic ops (e.g. atomic_add_and_fetch). Performance numbers
that compare atomic_* vs. cmpxchg-only emulation of atomics for x86
are available in patch 20's commit log.
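
To make the two approaches concrete, here is a minimal sketch in C11
atomics of an add-and-fetch done with a cmpxchg loop versus a single
atomic read-modify-write. The function names are hypothetical and are
not the helpers added by this series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Cmpxchg-only emulation: retry until no other thread changed *p in between. */
uint32_t add_and_fetch_cmpxchg(_Atomic uint32_t *p, uint32_t val)
{
    uint32_t old = atomic_load(p);

    /* On failure, 'old' is refreshed with the current contents of *p. */
    while (!atomic_compare_exchange_weak(p, &old, old + val)) {
        /* retry with the refreshed 'old' */
    }
    return old + val;
}

/* Direct version: one atomic read-modify-write, no retry loop. */
uint32_t add_and_fetch_direct(_Atomic uint32_t *p, uint32_t val)
{
    return atomic_fetch_add(p, val) + val;
}
```

Both return the same result; the difference is that the loop can retry
under contention, while the direct version maps to a single locked
instruction on hosts that have one.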

cmpxchg helpers are also used for emulating LL/SC (in arm/aarch64),
starting from patch 21.

# Correctness
Using cmpxchg to emulate LL/SC is not strictly correct, because it can
suffer from the ABA problem: a store-conditional must fail if the
location was written at all since the load-link, whereas cmpxchg
succeeds whenever the current value matches the expected one. However,
I argue that a fully correct implementation is not necessary for most
real applications out there. In other words, while it would be trivial
to write a parallel program that would be emulated incorrectly due to
ABA, most parallel code is written assuming only cmpxchg is available;
e.g. the kernel and gcc provide cmpxchg and other (simpler) "atomic"
ops, but nothing that would prevent ABA. [*]

[*] http://yarchive.net/comp/linux/cmpxchg_ll_sc_portability.html
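
As an illustration of the ABA case, the sketch below replays the
problematic interleaving on a single thread using C11 atomics. The
helper sc_via_cmpxchg() is hypothetical; it only stands in for the idea
of emulating a store-conditional with cmpxchg and is not code from this
series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: emulate a store-conditional with cmpxchg.
 * 'loaded' is the value the guest observed at load-link time.
 */
bool sc_via_cmpxchg(_Atomic uint32_t *addr, uint32_t loaded, uint32_t newval)
{
    return atomic_compare_exchange_strong(addr, &loaded, newval);
}

/* Replays A -> B -> A between the LL and the SC; returns the SC result. */
bool aba_demo(void)
{
    _Atomic uint32_t mem = 1;             /* A */
    uint32_t loaded = atomic_load(&mem);  /* "load-link" observes A */

    atomic_store(&mem, 2);                /* another CPU writes B ... */
    atomic_store(&mem, 1);                /* ... and then writes A back */

    /*
     * Real LL/SC would fail here, since mem was written after the LL.
     * The cmpxchg only compares values, so it succeeds.
     */
    return sc_via_cmpxchg(&mem, loaded, 3);
}
```

aba_demo() returns true, i.e. the emulated store-conditional goes
through even though the location was written twice in between.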

I propose to have the cmpxchg-based implementation as the default,
and then add an option to enable a slower-but-correct LL/SC
implementation. I have such an implementation working in both user and
system modes, but it requires instrumenting *all* stores with helpers,
which has significant overhead when compared to cmpxchg. See
this bar chart, with measurements taken on an Intel Haswell:
http://imgur.com/KKb7S4t Overhead for those SPEC workloads is
on average ~20% ("store tracking"). Note that most of this comes from
calling the helpers ("store helpers only"). This implementation, however,
scales well, since no single lock is taken for emulating atomic accesses
(each cache line, when necessary, gets its own lock). If there's interest,
I can share this implementation.

# Performance and Scalability
The cmpxchg-based implementation in this RFC scales. It doesn't
require cross-CPU communication apart from the inevitable cache
line contention in the guest workload. In other words, no external
subsystems (e.g. TLBs, CPU loop) are touched, and no heavily contended
locks are added.
The only locks in this RFC are added to emulate 16-byte atomics;
for now just a small table of locks is used. Using locks for
emulating these atomics is OK, since there are no "regular" (i.e.
non-atomic) loads/stores that could race with 16-byte atomics.
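
A minimal sketch of the lock-table idea, assuming a power-of-two table
indexed by address bits and simple spinlocks built on C11 atomics. The
table size, the hash, the u128 type and the function names are all
assumptions for illustration; the actual patches may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_LOCKS 64                     /* hypothetical size, power of two */

static atomic_int lock_table[NR_LOCKS]; /* 0 = unlocked, 1 = locked */

typedef struct { uint64_t lo, hi; } u128;

/* Map a host address to a lock; addresses 16 bytes apart use different locks. */
static atomic_int *addr_to_lock(const void *addr)
{
    return &lock_table[((uintptr_t)addr >> 4) & (NR_LOCKS - 1)];
}

/*
 * 16-byte compare-and-swap under the lock for 'addr'.
 * Returns true and stores *newv if *addr equals *expected;
 * otherwise copies the current contents into *expected.
 */
bool cmpxchg16_locked(u128 *addr, u128 *expected, const u128 *newv)
{
    atomic_int *lock = addr_to_lock(addr);
    bool ok;

    while (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
        /* spin until the lock is free */
    }
    ok = memcmp(addr, expected, sizeof(*addr)) == 0;
    if (ok) {
        *addr = *newv;
    } else {
        *expected = *addr;
    }
    atomic_store_explicit(lock, 0, memory_order_release);
    return ok;
}
```

This is only safe under the condition stated above: every access to the
16-byte location must go through the same lock.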

See the commit log in patches 22 (ARM) and 26 (aarch64) for a performance
comparison vs. the existing linux-user emulation on a 64-core system.
Tests are done using a newly added benchmark that stresses the cache
hierarchy with a configurable level of contention (patch 19).

# Testing
All implementations (x86_64, ARM, aarch64) pass the ck_pr validation
tests in concurrencykit[*] (ck/regressions/ck_pr_validate) in user-mode.
I have not tested the "paired" flavours of LL/SC in aarch64, so help
on this would be appreciated. I haven't even compile-tested on
32-bit hosts.
[*] http://concurrencykit.org/

# Why this is an RFC, and not a PATCH set
I'd like to have your input on the following:

- Is it OK *not* to write on a failed cmpxchg on user-only x86?
  In system mode we have the write TLB access, so we'd get
  a write fault (all atomic/cmpxchg ops fault as a write fault,
  which AFAIK is the right thing to do), so that should suffice.

- What to do when atomic ops are used on something other than RAM?
  Should we have a "slow path" that is not atomic for these cases, or
  is it OK to assume such code is bogus? For now, I just wrote XXX.

- In target-i386 code, I might have added unnecessary temps; I'm not
  sure which temps need to retain a meaningful value after translating
  instructions, so what I did is to make sure they retain the same
  value they'd have if the lock prefix wasn't there.

- Is it worth adding more checks to ensure that the LOCK prefix is
  where it should be? In some places I'm assuming "if (LOCK_PREFIX)"
  then we must have a memory operand (not a register operand).
  Assuming that instructions are always well-formed might be overly
  optimistic.

- How/where to fail when cmpxchg16b is performed on an address that
  is not 16-byte-aligned? x86 requires cmpxchg16b to be aligned, but
  doesn't enforce alignment on other atomics.

- Unaligned atomic ops: AFAIK only x86 supports them. I'm just punting
  on this since I'm passing the possibly-unaligned address to the host's
  compiler. AFAIK gcc deals with this by enlarging the size of the
  atomic to be used, but if for instance we're doing a cmpxchg8b
  on an unaligned address, very few hosts have a cmpxchg16b to deal
  with this.

- What to do with hosts that cannot guarantee that regular accesses
  are atomic? For instance, when emulating a 64-bit system on a 32-bit
  one. This will break a lot of parallel code, unless we serialize all
  loads/stores. I assume we won't support MTTCG on these hosts,
  and be done with this, right? Otherwise we'd have to instrument all
  loads and stores.

- ARM's cmpxchg syscall in linux-user could be improved; I've ignored
  it so far.
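
On the cmpxchg16b alignment question in the list above, the check itself
is simple; the open question is where to perform it and how to raise the
fault. A minimal sketch, with hypothetical function names (on real x86
hardware, a misaligned cmpxchg16b raises #GP):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of size; size must be a power of two. */
bool is_aligned(uint64_t addr, unsigned size)
{
    return (addr & (size - 1)) == 0;
}

/* Hypothetical check a cmpxchg16b helper could make before touching memory. */
bool cmpxchg16b_addr_ok(uint64_t guest_addr)
{
    return is_aligned(guest_addr, 16);
}
```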

Thanks,

		Emilio
