* [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics
@ 2016-06-27 19:01 Emilio G. Cota
From: Emilio G. Cota @ 2016-06-27 19:01 UTC
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov, Alvise Rigo, Peter Maydell

This RFC provides softmmu/cpu primitives for cmpxchg operations.
These primitives are not TCG ops; they are meant to be called from target
helper functions. They apply to both system and user modes.

To increase performance for architectures that have "locked" versions
of regular ops (e.g. x86), helpers are also provided for most
of these atomic ops (e.g. atomic_add_and_fetch). Performance numbers
that compare atomic_* vs. cmpxchg-only emulation of atomics for x86
are available in patch 20's commit log.
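
As a rough illustration of the two strategies being compared (not the
code from this series), the sketch below implements add-and-fetch once
with a cmpxchg retry loop and once with the host's native fetch-add,
using C11 atomics. The function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: add-and-fetch built only from compare-and-swap. */
static uint32_t add_fetch_cmpxchg(_Atomic uint32_t *p, uint32_t val)
{
    uint32_t old = atomic_load(p);

    /* On failure, 'old' is refreshed with the current memory contents,
     * so 'old + val' is recomputed on every retry. */
    while (!atomic_compare_exchange_weak(p, &old, old + val)) {
        /* retry */
    }
    return old + val;
}

/* Hypothetical helper: the same operation via a single native atomic op
 * (on x86 hosts this typically compiles to a LOCK'ed instruction). */
static uint32_t add_fetch_native(_Atomic uint32_t *p, uint32_t val)
{
    return atomic_fetch_add(p, val) + val;
}
```

Both return the same result; the difference is that the cmpxchg loop
may have to retry under contention, whereas the native op cannot fail.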

cmpxchg helpers are also used for emulating LL/SC (in arm/aarch64),
starting from patch 21.

# Correctness
Using cmpxchg to emulate LL/SC is not correct, because it can
suffer from the ABA problem. However, I argue that a fully correct
implementation is not necessary for most real applications out
there. In other words, while it would be trivial to write a
parallel program that would be emulated incorrectly due to ABA,
most parallel code is written assuming that only cmpxchg is available:
e.g. the kernel and gcc provide cmpxchg and other (simpler) "atomic"
ops, but nothing that would prevent ABA. [*]

[*] http://yarchive.net/comp/linux/cmpxchg_ll_sc_portability.html
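
A minimal single-threaded sketch of the ABA case, assuming the
store-exclusive is emulated as "cmpxchg against the value seen by the
load-exclusive". The struct and function names are hypothetical and the
two plain stores stand in for writes made by another vCPU.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state for the emulated exclusive access. */
struct excl_state {
    uint32_t val;   /* value observed by the load-exclusive */
};

/* Emulated load-exclusive: remember the value that was loaded. */
static uint32_t emu_ldrex(struct excl_state *s, _Atomic uint32_t *addr)
{
    s->val = atomic_load(addr);
    return s->val;
}

/* Emulated store-exclusive: succeeds iff memory still holds the
 * remembered value. Returns true on success. */
static bool emu_strex(struct excl_state *s, _Atomic uint32_t *addr,
                      uint32_t newval)
{
    uint32_t expected = s->val;

    return atomic_compare_exchange_strong(addr, &expected, newval);
}

/* Returns true if the store-exclusive succeeded even though the
 * location was written twice in between (A -> B -> A). */
static bool aba_demo(void)
{
    _Atomic uint32_t mem = 1;
    struct excl_state s;

    emu_ldrex(&s, &mem);            /* observes A = 1 */
    atomic_store(&mem, 2);          /* "other vCPU" writes B */
    atomic_store(&mem, 1);          /* ... and then writes A back */
    return emu_strex(&s, &mem, 3);  /* real LL/SC would fail here */
}
```

Here aba_demo() returns true: the emulated store-exclusive cannot tell
that the location was modified, because only the value is compared.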

I propose to have a cmpxchg-based implementation as the default,
and then to add an option to enable a worse-performing-yet-correct
LL/SC implementation. I have a working implementation that runs in
both user and system modes, but it requires instrumenting *all* stores
with helpers, which has significant overhead when compared to cmpxchg.
See this bar chart, with measurements taken on an Intel Haswell:
http://imgur.com/KKb7S4t Overhead for those SPEC workloads is
on average ~20% ("store tracking"). Note that most of this comes from
calling the helpers ("store helpers only"). This implementation, however,
scales well, since no single lock is taken for emulating atomic accesses
(each cache line, when necessary, gets its own lock). If there's interest,
I can share this implementation.

# Performance and Scalability
The cmpxchg-based implementation in this RFC scales. It doesn't
require cross-CPU communication apart from the inevitable cache
line contention in the guest workload. In other words, no external
subsystems (e.g. TLBs, CPU loop) are touched, nor are any
heavily-contended locks added.
The only locks in this RFC are added to emulate 16-byte atomics;
for now just a small table of locks is used. Using locks for
emulating these atomics is OK, since there are no "regular" (i.e.
non-atomic) loads/stores that could race with 16-byte atomics.
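
To make the "small table of locks" idea concrete, here is a minimal
sketch, not the code in this series: the operand address is hashed into
a fixed-size table of spinlocks, and the 16-byte compare-and-swap is
done under the selected lock. The table size, the hash and all names
are assumptions chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_LOCKS 64   /* assumed table size; must be a power of two */

/* Static storage: zero-initialized, i.e. all locks start out free. */
static atomic_bool lock_table[NR_LOCKS];

struct u128 {
    uint64_t lo, hi;
};

/* Pick a lock from the operand address. Dropping the low 4 bits makes
 * every byte of a 16-byte-aligned operand map to the same lock. */
static atomic_bool *lock_for(const void *addr)
{
    uintptr_t idx = ((uintptr_t)addr >> 4) & (NR_LOCKS - 1);

    return &lock_table[idx];
}

static void lock_acquire(atomic_bool *l)
{
    while (atomic_exchange_explicit(l, true, memory_order_acquire)) {
        /* spin until the previous holder releases the lock */
    }
}

static void lock_release(atomic_bool *l)
{
    atomic_store_explicit(l, false, memory_order_release);
}

/* Stores newv and returns true if *p equals expected; otherwise leaves
 * *p untouched and returns false. */
static bool cmpxchg16_locked(struct u128 *p, struct u128 expected,
                             struct u128 newv)
{
    atomic_bool *l = lock_for(p);
    bool ok;

    lock_acquire(l);
    ok = p->lo == expected.lo && p->hi == expected.hi;
    if (ok) {
        *p = newv;
    }
    lock_release(l);
    return ok;
}
```

This is only safe under the assumption stated above: every access to
the 16-byte location goes through the same lock, so no plain load or
store can observe a half-updated value.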

See the commit log in patches 22 (ARM) and 26 (aarch64) for a performance
comparison vs. the existing linux-user emulation on a 64-core system.
Tests are done using a newly added benchmark that stresses the cache
hierarchy with a configurable level of contention (patch 19).

# Testing
All implementations (x86_64, ARM, aarch64) pass the ck_pr validation
tests in concurrencykit[*] (ck/regressions/ck_pr_validate) in user-mode.
I have not tested the "paired" flavours of LL/SC in aarch64, so help
on this would be appreciated. I haven't even compile-tested on
32-bit hosts.
[*] http://concurrencykit.org/

# Why this is an RFC, and not a PATCH set
I'd like to have your input on the following:

- Is it OK *not* to write on a failed cmpxchg on user-only x86?
  On system-mode we have the write TLB access, so we'd get
  a write fault (all atomic/cmpxchg ops fault as a write fault,
  which AFAIK is the right thing to do), so that should suffice.

- What to do when atomic ops are used on something other than RAM?
  Should we have a "slow path" that is not atomic for these cases, or
  is it OK to assume such code is bogus? For now, I just wrote XXX.

- In target-i386 code, I might have added unnecessary temps; I'm not
  sure which temps need to retain a meaningful value after translating
  instructions, so what I did is to make sure they retain the same
  value they'd have if the lock prefix wasn't there.

- Is it worth adding more checks to ensure that the LOCK prefix is
  where it should be? In some places I'm assuming "if (LOCK_PREFIX)"
  then we must have a memory operand (not a register operand).
  Assuming that instructions are always well-formed might be overly
  optimistic.

- How/where to fail when cmpxchg16 is performed on an address that
  is not 16-byte aligned? x86 requires cmpxchg16b to be aligned, but
  doesn't enforce alignment on other atomics.

- Unaligned atomic ops: AFAIK only x86 supports them. I'm just punting
  on this since I'm passing the possibly-unaligned address to the host's
  compiler. AFAIK gcc deals with this by enlarging the size of the
  atomic to be used, but if for instance we're doing a cmpxchg8b
  on an unaligned address, very few hosts have a cmpxchg16b to deal
  with this.

- What to do with architectures that cannot guarantee atomic regular
  accesses? For instance, when emulating a 64-bit system on a 32-bit
  one. This will break a lot of parallel code, unless we serialize all
  loads/stores. I assume we simply won't support MTTCG on these
  architectures, right? Otherwise we'd have to instrument all
  loads and stores.

- ARM's cmpxchg syscall in linux-user could be improved; I've ignored
  it so far.
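
For the cmpxchg16 alignment question above, the check itself is a
one-liner; the open point is where in the emulation the fault should be
raised. A minimal sketch with a hypothetical function name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: true if a guest address is suitably aligned for
 * a 16-byte compare-and-swap, i.e. its low 4 bits are zero. */
static bool cmpxchg16_addr_ok(uint64_t guest_addr)
{
    return (guest_addr & 0xf) == 0;
}
```

A helper could call this before touching memory and raise the guest
exception when it returns false; on real x86 hardware a misaligned
cmpxchg16b raises a general-protection fault.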

Thanks,

		Emilio

Thread overview: 44+ messages
2016-06-27 19:01 [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 01/30] softmmu: add cmpxchg helpers Emilio G. Cota
2016-06-27 20:11   ` Richard Henderson
2016-06-27 21:19     ` Emilio G. Cota
2016-06-27 21:43       ` Richard Henderson
2016-06-27 21:48         ` Peter Maydell
2016-06-27 21:53           ` Richard Henderson
2016-06-27 19:01 ` [Qemu-devel] [RFC 02/30] tcg: add tcg_cmpxchg_lock Emilio G. Cota
2016-06-27 20:07   ` Richard Henderson
2016-06-27 20:41     ` Emilio G. Cota
2016-06-27 21:02       ` Richard Henderson
2016-06-27 19:01 ` [Qemu-devel] [RFC 03/30] cpu_ldst: add cpu_cmpxchg helpers Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 04/30] target-i386: add cmpxchg helpers Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 05/30] target-i386: emulate LOCK'ed cmpxchg using " Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 06/30] target-i386: emulate LOCK'ed cmpxchg8b/16b " Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 07/30] atomics: add atomic_xor Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 08/30] atomics: add atomic_op_fetch variants Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 09/30] softmmu: add atomic helpers Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 10/30] cpu_ldst: add cpu_atomic helpers Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 11/30] target-i386: add atomic helpers Emilio G. Cota
2016-06-27 20:27   ` Richard Henderson
2016-06-27 21:39     ` Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 12/30] target-i386: emulate LOCK'ed OP instructions using " Emilio G. Cota
2016-06-27 19:01 ` [Qemu-devel] [RFC 13/30] target-i386: emulate LOCK'ed INC using atomic helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 14/30] target-i386: emulate LOCK'ed NOT " Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 15/30] target-i386: emulate LOCK'ed NEG using cmpxchg helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 16/30] target-i386: emulate LOCK'ed XADD using atomic helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 17/30] target-i386: emulate LOCK'ed BTX ops using atomic helpers Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 18/30] target-i386: emulate XCHG using atomic helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 19/30] tests: add atomic_add-bench Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 20/30] target-i386: remove helper_lock() Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 21/30] target-arm: add cmpxchg helpers Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 22/30] target-arm: emulate LL/SC using " Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 23/30] target-arm: add atomic_xchg helper Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 24/30] target-arm: emulate SWP with " Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 25/30] helper: add DEF_HELPER_6 Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 26/30] target-arm: add cmpxchg helpers for aarch64 Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 27/30] target-arm: emulate aarch64's LL/SC using cmpxchg helpers Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 28/30] linux-user: remove handling of ARM's EXCP_STREX Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 29/30] linux-user: remove handling of aarch64's EXCP_STREX Emilio G. Cota
2016-06-27 19:02 ` [Qemu-devel] [RFC 30/30] target-arm: remove EXCP_STREX + cpu_exclusive_{test, info} Emilio G. Cota
2016-06-28  8:45 ` [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics Lluís Vilanova
2016-06-28 15:48   ` Richard Henderson
2016-06-28 19:52   ` Emilio G. Cota
