* [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations
@ 2017-02-02 14:34 Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 01/21] tcg: add support for 128bit vector type Kirill Batuzov
                   ` (22 more replies)
  0 siblings, 23 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

The goal of this patch series is to set up an infrastructure to emulate
guest vector operations using host vector operations. Preliminary
experiments show that simply translating loads and stores increases the
performance of the x264 video codec by 10%. The performance of a
GCC-vectorized for loop increased 2x.

To be able to emulate guest vector operations using host vector operations,
several things need to be done.

1. Corresponding vector types should be added to TCG. This series adds
TCG_v128 and TCG_v64. I've made TCG_v64 a distinct type from TCG_i64
because it usually needs to be allocated to different registers and
supports different operations.

2. Load/store operations for these new types need to be implemented.

3. For a seamless transition from the current model to the new one we need
to handle cases where memory occupied by a global variable can be accessed
via a pointer to the CPUArchState structure. A very simple conservative
alias analysis has been added to do this. It tracks memory loads and
stores that overlap with fields of CPUArchState and provides this
information to the register allocator. The allocator then spills and
reloads affected globals when needed.

4. Allow overlapping globals. For scalar registers this is a rare case, and
overlapping registers can be handled as a single one (ah, al, ax, eax,
rax). On ARM every Q-register consists of two D-registers, each consisting
of two S-registers. Handling four S-registers as one because they are
parts of the same Q-register is way too inefficient.

5. Add a new memory addressing mode to the MMU code for large accesses and
create the needed helpers. Only 128-bit vectors are handled for now.

6. Create TCG opcodes for vector operations. Only addition is handled in
this series. Each operation has a wrapper that checks whether the backend
supports the corresponding operation. If it does, the vector opcode is
generated; if not, the operation is emulated with scalar operations (see
the sketch after this list). The emulation code is generated inline for
performance reasons: there is a huge performance difference between inline
generation and calling a helper. As a positive side effect this will
eventually allow merging similar emulation code for vector instructions
from different frontends into a target-independent implementation.

7. Use the new operations in the frontend (ARM is used in this series).

8. Support the new operations in the backend (x86_64 is used in this
series).
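
For illustration, here is a minimal sketch of the wrapper pattern from
point 6. The identifiers INDEX_op_add_i32x4, TCG_TARGET_HAS_add_i32x4 and
scratch_base are illustrative; the series itself generates this shape with
the GEN_VECT_WRAPPER* macros and tcg_v128_to_ptr() from patch 08.

static inline void tcg_gen_add_i32x4(TCGv_v128 res, TCGv_v128 a, TCGv_v128 b)
{
#ifdef TCG_TARGET_HAS_add_i32x4
    /* The backend implements the opcode: emit it directly.  */
    tcg_gen_op3_v128(INDEX_op_add_i32x4, res, a, b);
#else
    /* Inline scalar fallback: resolve every operand to a (base, offset)
       memory location, then add the four 32-bit lanes one by one.
       scratch_base is assumed to point at a per-TB scratch area.  */
    TCGv_ptr ba, bb, br;
    intptr_t oa, ob, orr;
    TCGv_i32 x = tcg_temp_new_i32();
    TCGv_i32 y = tcg_temp_new_i32();
    int i;

    tcg_v128_to_ptr(a, scratch_base, 0, &ba, &oa, 1);
    tcg_v128_to_ptr(b, scratch_base, 1, &bb, &ob, 1);
    tcg_v128_to_ptr(res, scratch_base, 2, &br, &orr, 0);

    for (i = 0; i < 4; i++) {
        tcg_gen_ld_i32(x, ba, oa + i * 4);
        tcg_gen_ld_i32(y, bb, ob + i * 4);
        tcg_gen_add_i32(x, x, y);
        tcg_gen_st_i32(x, br, orr + i * 4);
    }
    /* Copying the result back into RES when it was redirected to the
       scratch area is omitted here.  */
    tcg_temp_free_i32(x);
    tcg_temp_free_i32(y);
#endif
}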

For experiments I have used an ARM guest on an x86_64 host. I wanted a
pair of different architectures that both have vector extensions, and the
ARM/x86_64 pair fits well.

v1 -> v2:
 - represent v128 type with smaller types when it is not supported by the host
 - detect AVX support and use AVX instructions when available
 - tcg/README updated
 - generate two v64 adds instead of one v128 when applicable
 - rebased to newer master
 - overlap detection for temps added (it needs to be explicitly called from
   <arch>_translate_init)
 - the stack is used to temporarily store 128-bit variables to memory
   (instead of the TCGContext field)

v2 -> v2.1:
 - automatic build failure fixed

Outstanding issues:
 - qemu_ld_v128 and qemu_st_v128 do not generate fallback code if the host
   does not support 128-bit registers. The reason is that I do not know how
   to handle differing host/guest endianness (do we swap only bytes within
   elements, or whole vectors?). Different targets seem to have different
   ideas on how this should be done.

Kirill Batuzov (21):
  tcg: add support for 128bit vector type
  tcg: add support for 64bit vector type
  tcg: support representing vector type with smaller vector or scalar
    types
  tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
  tcg: add simple alias analysis
  tcg: use results of alias analysis in liveness analysis
  tcg: allow globals to overlap
  tcg: add vector addition operations
  target/arm: support access to vector guest registers as globals
  target/arm: use vector opcode to handle vadd.<size> instruction
  tcg/i386: add support for vector opcodes
  tcg/i386: support 64-bit vector operations
  tcg/i386: support remaining vector addition operations
  tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
  target/aarch64: do not check for non-existent TCGMemOp
  tcg: introduce new TCGMemOp - MO_128
  tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
  softmmu: create helpers for vector loads
  tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
  target/arm: load two consecutive 64-bits vector regs as a 128-bit
    vector reg
  tcg/README: update README to include information about vector opcodes

 cputlb.c                     |   4 +
 softmmu_template_vector.h    | 266 +++++++++++++++++++++++++++++++
 target/arm/translate-a64.c   |   1 -
 target/arm/translate.c       |  76 ++++++++-
 tcg/README                   |  47 +++++-
 tcg/aarch64/tcg-target.inc.c |   4 +-
 tcg/arm/tcg-target.inc.c     |   4 +-
 tcg/i386/tcg-target.h        |  45 +++++-
 tcg/i386/tcg-target.inc.c    | 260 +++++++++++++++++++++++++++++--
 tcg/mips/tcg-target.inc.c    |   4 +-
 tcg/optimize.c               | 165 +++++++++++++++++++-
 tcg/ppc/tcg-target.inc.c     |   4 +-
 tcg/s390/tcg-target.inc.c    |   4 +-
 tcg/sparc/tcg-target.inc.c   |  12 +-
 tcg/tcg-op.c                 |  92 ++++++++++-
 tcg/tcg-op.h                 | 267 +++++++++++++++++++++++++++++++
 tcg/tcg-opc.h                |  34 ++++
 tcg/tcg.c                    | 363 +++++++++++++++++++++++++++++++++++++------
 tcg/tcg.h                    | 163 ++++++++++++++++++-
 19 files changed, 1722 insertions(+), 93 deletions(-)
 create mode 100644 softmmu_template_vector.h

-- 
2.1.4

* [Qemu-devel] [PATCH v2.1 01/21] tcg: add support for 128bit vector type
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 02/21] tcg: add support for 64bit " Kirill Batuzov
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Introduce TCG_TYPE_V128 and the corresponding TCGv_v128 type for TCG temps. Add helper
functions that work with temps of this new type.

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
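
Not part of the patch: a usage sketch of the new helpers from frontend
code. The vreg field of CPUArchState is hypothetical; cpu_env is the
usual TCGv_ptr pointing at the CPU state.

    TCGv_v128 t = tcg_temp_new_v128();        /* ordinary temporary */
    TCGv_v128 l = tcg_temp_local_new_v128();  /* survives branches */
    TCGv_v128 g = tcg_global_mem_new_v128(cpu_env,
                      offsetof(CPUArchState, vreg), "vreg");

    /* ... generate ops reading and writing t, l and g ... */

    tcg_temp_free_v128(t);
    tcg_temp_free_v128(l);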
 tcg/tcg-op.h | 24 ++++++++++++++++++++++++
 tcg/tcg.c    | 13 +++++++++++++
 tcg/tcg.h    | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index c68e300..5abf8b2 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -248,6 +248,23 @@ static inline void tcg_gen_op6ii_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4), a5, a6);
 }
 
+static inline void tcg_gen_op1_v128(TCGOpcode opc, TCGv_v128 a1)
+{
+    tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_V128(a1));
+}
+
+static inline void tcg_gen_op2_v128(TCGOpcode opc, TCGv_v128 a1,
+                                    TCGv_v128 a2)
+{
+    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_V128(a2));
+}
+
+static inline void tcg_gen_op3_v128(TCGOpcode opc, TCGv_v128 a1,
+                                    TCGv_v128 a2, TCGv_v128 a3)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_V128(a2),
+                GET_TCGV_V128(a3));
+}
 
 /* Generic ops.  */
 
@@ -454,6 +471,13 @@ static inline void tcg_gen_not_i32(TCGv_i32 ret, TCGv_i32 arg)
     }
 }
 
+/* Vector ops */
+
+static inline void tcg_gen_discard_v128(TCGv_v128 arg)
+{
+    tcg_gen_op1_v128(INDEX_op_discard, arg);
+}
+
 /* 64 bit ops */
 
 void tcg_gen_addi_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index cb898f1..2a5e83b 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -641,6 +641,14 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
     return MAKE_TCGV_I64(idx);
 }
 
+TCGv_v128 tcg_temp_new_internal_v128(int temp_local)
+{
+    int idx;
+
+    idx = tcg_temp_new_internal(TCG_TYPE_V128, temp_local);
+    return MAKE_TCGV_V128(idx);
+}
+
 static void tcg_temp_free_internal(int idx)
 {
     TCGContext *s = &tcg_ctx;
@@ -673,6 +681,11 @@ void tcg_temp_free_i64(TCGv_i64 arg)
     tcg_temp_free_internal(GET_TCGV_I64(arg));
 }
 
+void tcg_temp_free_v128(TCGv_v128 arg)
+{
+    tcg_temp_free_internal(GET_TCGV_V128(arg));
+}
+
 TCGv_i32 tcg_const_i32(int32_t val)
 {
     TCGv_i32 t0;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 631c6f6..56484e7 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -246,6 +246,7 @@ typedef struct TCGPool {
 typedef enum TCGType {
     TCG_TYPE_I32,
     TCG_TYPE_I64,
+    TCG_TYPE_V128,
     TCG_TYPE_COUNT, /* number of different types */
 
     /* An alias for the size of the host register.  */
@@ -421,6 +422,7 @@ typedef tcg_target_ulong TCGArg;
 typedef struct TCGv_i32_d *TCGv_i32;
 typedef struct TCGv_i64_d *TCGv_i64;
 typedef struct TCGv_ptr_d *TCGv_ptr;
+typedef struct TCGv_v128_d *TCGv_v128;
 typedef TCGv_ptr TCGv_env;
 #if TARGET_LONG_BITS == 32
 #define TCGv TCGv_i32
@@ -445,6 +447,11 @@ static inline TCGv_ptr QEMU_ARTIFICIAL MAKE_TCGV_PTR(intptr_t i)
     return (TCGv_ptr)i;
 }
 
+static inline TCGv_v128 QEMU_ARTIFICIAL MAKE_TCGV_V128(intptr_t i)
+{
+    return (TCGv_v128)i;
+}
+
 static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_I32(TCGv_i32 t)
 {
     return (intptr_t)t;
@@ -460,6 +467,11 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
     return (intptr_t)t;
 }
 
+static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V128(TCGv_v128 t)
+{
+    return (intptr_t)t;
+}
+
 #if TCG_TARGET_REG_BITS == 32
 #define TCGV_LOW(t) MAKE_TCGV_I32(GET_TCGV_I64(t))
 #define TCGV_HIGH(t) MAKE_TCGV_I32(GET_TCGV_I64(t) + 1)
@@ -467,15 +479,18 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
 
 #define TCGV_EQUAL_I32(a, b) (GET_TCGV_I32(a) == GET_TCGV_I32(b))
 #define TCGV_EQUAL_I64(a, b) (GET_TCGV_I64(a) == GET_TCGV_I64(b))
+#define TCGV_EQUAL_V128(a, b) (GET_TCGV_V128(a) == GET_TCGV_V128(b))
 #define TCGV_EQUAL_PTR(a, b) (GET_TCGV_PTR(a) == GET_TCGV_PTR(b))
 
 /* Dummy definition to avoid compiler warnings.  */
 #define TCGV_UNUSED_I32(x) x = MAKE_TCGV_I32(-1)
 #define TCGV_UNUSED_I64(x) x = MAKE_TCGV_I64(-1)
+#define TCGV_UNUSED_V128(x) x = MAKE_TCGV_V128(-1)
 #define TCGV_UNUSED_PTR(x) x = MAKE_TCGV_PTR(-1)
 
 #define TCGV_IS_UNUSED_I32(x) (GET_TCGV_I32(x) == -1)
 #define TCGV_IS_UNUSED_I64(x) (GET_TCGV_I64(x) == -1)
+#define TCGV_IS_UNUSED_V128(x) (GET_TCGV_V128(x) == -1)
 #define TCGV_IS_UNUSED_PTR(x) (GET_TCGV_PTR(x) == -1)
 
 /* call flags */
@@ -798,9 +813,11 @@ TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
 
 TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
 TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
+TCGv_v128 tcg_temp_new_internal_v128(int temp_local);
 
 void tcg_temp_free_i32(TCGv_i32 arg);
 void tcg_temp_free_i64(TCGv_i64 arg);
+void tcg_temp_free_v128(TCGv_v128 arg);
 
 static inline TCGv_i32 tcg_global_mem_new_i32(TCGv_ptr reg, intptr_t offset,
                                               const char *name)
@@ -836,6 +853,23 @@ static inline TCGv_i64 tcg_temp_local_new_i64(void)
     return tcg_temp_new_internal_i64(1);
 }
 
+static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t offset,
+                                                const char *name)
+{
+    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset, name);
+    return MAKE_TCGV_V128(idx);
+}
+
+static inline TCGv_v128 tcg_temp_new_v128(void)
+{
+    return tcg_temp_new_internal_v128(0);
+}
+
+static inline TCGv_v128 tcg_temp_local_new_v128(void)
+{
+    return tcg_temp_new_internal_v128(1);
+}
+
 #if defined(CONFIG_DEBUG_TCG)
 /* If you call tcg_clear_temp_count() at the start of a section of
  * code which is not supposed to leak any TCG temporaries, then
-- 
2.1.4

* [Qemu-devel] [PATCH v2.1 02/21] tcg: add support for 64bit vector type
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 01/21] tcg: add support for 128bit vector type Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 03/21] tcg: support representing vector type with smaller vector or scalar types Kirill Batuzov
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Introduce TCG_TYPE_V64 and the corresponding TCGv_v64 type for TCG temps. Add helper
functions that work with temps of this new type.

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg-op.h | 23 +++++++++++++++++++++++
 tcg/tcg.c    | 13 +++++++++++++
 tcg/tcg.h    | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 5abf8b2..517745e 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -266,6 +266,24 @@ static inline void tcg_gen_op3_v128(TCGOpcode opc, TCGv_v128 a1,
                 GET_TCGV_V128(a3));
 }
 
+static inline void tcg_gen_op1_v64(TCGOpcode opc, TCGv_v64 a1)
+{
+    tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_V64(a1));
+}
+
+static inline void tcg_gen_op2_v64(TCGOpcode opc, TCGv_v64 a1,
+                                    TCGv_v64 a2)
+{
+    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_V64(a1), GET_TCGV_V64(a2));
+}
+
+static inline void tcg_gen_op3_v64(TCGOpcode opc, TCGv_v64 a1,
+                                    TCGv_v64 a2, TCGv_v64 a3)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V64(a1), GET_TCGV_V64(a2),
+                GET_TCGV_V64(a3));
+}
+
 /* Generic ops.  */
 
 static inline void gen_set_label(TCGLabel *l)
@@ -478,6 +496,11 @@ static inline void tcg_gen_discard_v128(TCGv_v128 arg)
     tcg_gen_op1_v128(INDEX_op_discard, arg);
 }
 
+static inline void tcg_gen_discard_v64(TCGv_v64 arg)
+{
+    tcg_gen_op1_v64(INDEX_op_discard, arg);
+}
+
 /* 64 bit ops */
 
 void tcg_gen_addi_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 2a5e83b..5e69103 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -641,6 +641,14 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
     return MAKE_TCGV_I64(idx);
 }
 
+TCGv_v64 tcg_temp_new_internal_v64(int temp_local)
+{
+    int idx;
+
+    idx = tcg_temp_new_internal(TCG_TYPE_V64, temp_local);
+    return MAKE_TCGV_V64(idx);
+}
+
 TCGv_v128 tcg_temp_new_internal_v128(int temp_local)
 {
     int idx;
@@ -681,6 +689,11 @@ void tcg_temp_free_i64(TCGv_i64 arg)
     tcg_temp_free_internal(GET_TCGV_I64(arg));
 }
 
+void tcg_temp_free_v64(TCGv_v64 arg)
+{
+    tcg_temp_free_internal(GET_TCGV_V64(arg));
+}
+
 void tcg_temp_free_v128(TCGv_v128 arg)
 {
     tcg_temp_free_internal(GET_TCGV_V128(arg));
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 56484e7..fa455ae 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -246,6 +246,7 @@ typedef struct TCGPool {
 typedef enum TCGType {
     TCG_TYPE_I32,
     TCG_TYPE_I64,
+    TCG_TYPE_V64,
     TCG_TYPE_V128,
     TCG_TYPE_COUNT, /* number of different types */
 
@@ -422,6 +423,7 @@ typedef tcg_target_ulong TCGArg;
 typedef struct TCGv_i32_d *TCGv_i32;
 typedef struct TCGv_i64_d *TCGv_i64;
 typedef struct TCGv_ptr_d *TCGv_ptr;
+typedef struct TCGv_v64_d *TCGv_v64;
 typedef struct TCGv_v128_d *TCGv_v128;
 typedef TCGv_ptr TCGv_env;
 #if TARGET_LONG_BITS == 32
@@ -447,6 +449,11 @@ static inline TCGv_ptr QEMU_ARTIFICIAL MAKE_TCGV_PTR(intptr_t i)
     return (TCGv_ptr)i;
 }
 
+static inline TCGv_v64 QEMU_ARTIFICIAL MAKE_TCGV_V64(intptr_t i)
+{
+    return (TCGv_v64)i;
+}
+
 static inline TCGv_v128 QEMU_ARTIFICIAL MAKE_TCGV_V128(intptr_t i)
 {
     return (TCGv_v128)i;
@@ -467,6 +474,11 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
     return (intptr_t)t;
 }
 
+static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V64(TCGv_v64 t)
+{
+    return (intptr_t)t;
+}
+
 static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V128(TCGv_v128 t)
 {
     return (intptr_t)t;
@@ -479,17 +491,20 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V128(TCGv_v128 t)
 
 #define TCGV_EQUAL_I32(a, b) (GET_TCGV_I32(a) == GET_TCGV_I32(b))
 #define TCGV_EQUAL_I64(a, b) (GET_TCGV_I64(a) == GET_TCGV_I64(b))
+#define TCGV_EQUAL_V64(a, b) (GET_TCGV_V64(a) == GET_TCGV_V64(b))
 #define TCGV_EQUAL_V128(a, b) (GET_TCGV_V128(a) == GET_TCGV_V128(b))
 #define TCGV_EQUAL_PTR(a, b) (GET_TCGV_PTR(a) == GET_TCGV_PTR(b))
 
 /* Dummy definition to avoid compiler warnings.  */
 #define TCGV_UNUSED_I32(x) x = MAKE_TCGV_I32(-1)
 #define TCGV_UNUSED_I64(x) x = MAKE_TCGV_I64(-1)
+#define TCGV_UNUSED_V64(x) x = MAKE_TCGV_V64(-1)
 #define TCGV_UNUSED_V128(x) x = MAKE_TCGV_V128(-1)
 #define TCGV_UNUSED_PTR(x) x = MAKE_TCGV_PTR(-1)
 
 #define TCGV_IS_UNUSED_I32(x) (GET_TCGV_I32(x) == -1)
 #define TCGV_IS_UNUSED_I64(x) (GET_TCGV_I64(x) == -1)
+#define TCGV_IS_UNUSED_V64(x) (GET_TCGV_V64(x) == -1)
 #define TCGV_IS_UNUSED_V128(x) (GET_TCGV_V128(x) == -1)
 #define TCGV_IS_UNUSED_PTR(x) (GET_TCGV_PTR(x) == -1)
 
@@ -813,10 +828,12 @@ TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
 
 TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
 TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
+TCGv_v64 tcg_temp_new_internal_v64(int temp_local);
 TCGv_v128 tcg_temp_new_internal_v128(int temp_local);
 
 void tcg_temp_free_i32(TCGv_i32 arg);
 void tcg_temp_free_i64(TCGv_i64 arg);
+void tcg_temp_free_v64(TCGv_v64 arg);
 void tcg_temp_free_v128(TCGv_v128 arg);
 
 static inline TCGv_i32 tcg_global_mem_new_i32(TCGv_ptr reg, intptr_t offset,
@@ -853,6 +870,23 @@ static inline TCGv_i64 tcg_temp_local_new_i64(void)
     return tcg_temp_new_internal_i64(1);
 }
 
+static inline TCGv_v64 tcg_global_mem_new_v64(TCGv_ptr reg, intptr_t offset,
+                                              const char *name)
+{
+    int idx = tcg_global_mem_new_internal(TCG_TYPE_V64, reg, offset, name);
+    return MAKE_TCGV_V64(idx);
+}
+
+static inline TCGv_v64 tcg_temp_new_v64(void)
+{
+    return tcg_temp_new_internal_v64(0);
+}
+
+static inline TCGv_v64 tcg_temp_local_new_v64(void)
+{
+    return tcg_temp_new_internal_v64(1);
+}
+
 static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t offset,
                                                 const char *name)
 {
-- 
2.1.4

* [Qemu-devel] [PATCH v2.1 03/21] tcg: support representing vector type with smaller vector or scalar types
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 01/21] tcg: add support for 128bit vector type Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 02/21] tcg: add support for 64bit " Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 04/21] tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes Kirill Batuzov
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---

This is not as bad as I thought it would be.
Only two cases: type == base_type and type != base_type.
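
For example, a TCG_TYPE_V128 global on a 64-bit host without
TCG_TARGET_HAS_REG128 gets type TCG_TYPE_I64 from tcg_choose_type(), so
count = 16 / 8 = 2 and the global is materialized as two consecutive I64
temps, name_0 at offset +0 and name_1 at offset +8 (the order is reversed
on a big-endian host).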

---
 tcg/tcg.c | 136 +++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 91 insertions(+), 45 deletions(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index 5e69103..18d97ec 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -523,12 +523,54 @@ TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name)
     return MAKE_TCGV_I64(idx);
 }
 
-int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
+static TCGType tcg_choose_type(TCGType type)
+{
+    switch (type) {
+    case TCG_TYPE_I64:
+        if (TCG_TARGET_REG_BITS == 64) {
+            return TCG_TYPE_I64;
+        }
+        /* Fallthrough */
+    case TCG_TYPE_I32:
+        return TCG_TYPE_I32;
+    case TCG_TYPE_V128:
+#ifdef TCG_TARGET_HAS_REG128
+        return TCG_TYPE_V128;
+#endif
+        /* Fallthrough */
+    case TCG_TYPE_V64:
+#ifdef TCG_TARGET_HAS_REGV64
+        return TCG_TYPE_V64;
+#else
+        return tcg_choose_type(TCG_TYPE_I64);
+#endif
+    default:
+        g_assert_not_reached();
+    }
+}
+
+static intptr_t tcg_type_size(TCGType type)
+{
+    switch (type) {
+    case TCG_TYPE_I32:
+        return 4;
+    case TCG_TYPE_I64:
+    case TCG_TYPE_V64:
+        return 8;
+    case TCG_TYPE_V128:
+        return 16;
+    default:
+        g_assert_not_reached();
+    }
+}
+
+int tcg_global_mem_new_internal(TCGType base_type, TCGv_ptr base,
                                 intptr_t offset, const char *name)
 {
     TCGContext *s = &tcg_ctx;
     TCGTemp *base_ts = &s->temps[GET_TCGV_PTR(base)];
     TCGTemp *ts = tcg_global_alloc(s);
+    TCGType type = tcg_choose_type(base_type);
     int indirect_reg = 0, bigendian = 0;
 #ifdef HOST_WORDS_BIGENDIAN
     bigendian = 1;
@@ -543,47 +585,51 @@ int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
         indirect_reg = 1;
     }
 
-    if (TCG_TARGET_REG_BITS == 32 && type == TCG_TYPE_I64) {
-        TCGTemp *ts2 = tcg_global_alloc(s);
-        char buf[64];
-
-        ts->base_type = TCG_TYPE_I64;
-        ts->type = TCG_TYPE_I32;
+    if (type == base_type) {
+        ts->base_type = type;
+        ts->type = type;
         ts->indirect_reg = indirect_reg;
         ts->mem_allocated = 1;
         ts->mem_base = base_ts;
-        ts->mem_offset = offset + bigendian * 4;
-        pstrcpy(buf, sizeof(buf), name);
-        pstrcat(buf, sizeof(buf), "_0");
-        ts->name = strdup(buf);
-
-        tcg_debug_assert(ts2 == ts + 1);
-        ts2->base_type = TCG_TYPE_I64;
-        ts2->type = TCG_TYPE_I32;
-        ts2->indirect_reg = indirect_reg;
-        ts2->mem_allocated = 1;
-        ts2->mem_base = base_ts;
-        ts2->mem_offset = offset + (1 - bigendian) * 4;
-        pstrcpy(buf, sizeof(buf), name);
-        pstrcat(buf, sizeof(buf), "_1");
-        ts2->name = strdup(buf);
+        ts->mem_offset = offset;
+        ts->name = name;
     } else {
-        ts->base_type = type;
+        int i, count = tcg_type_size(base_type) / tcg_type_size(type);
+        TCGTemp *ts2, *ts1 = ts;
+        int cur_offset =
+                bigendian ? tcg_type_size(base_type) - tcg_type_size(type) : 0;
+
+        ts->base_type = base_type;
         ts->type = type;
         ts->indirect_reg = indirect_reg;
         ts->mem_allocated = 1;
         ts->mem_base = base_ts;
-        ts->mem_offset = offset;
-        ts->name = name;
+        ts->mem_offset = offset + cur_offset;
+        ts->name = g_strdup_printf("%s_0", name);
+
+        for (i = 1; i < count; i++) {
+            ts2 = tcg_global_alloc(s);
+            tcg_debug_assert(ts2 == ts1 + 1);
+            cur_offset += (bigendian ? -1 : 1) * tcg_type_size(type);
+            ts2->base_type = base_type;
+            ts2->type = type;
+            ts2->indirect_reg = indirect_reg;
+            ts2->mem_allocated = 1;
+            ts2->mem_base = base_ts;
+            ts2->mem_offset = offset + cur_offset;
+            ts2->name = g_strdup_printf("%s_%d", name, i);
+            ts1 = ts2;
+        }
     }
     return temp_idx(s, ts);
 }
 
-static int tcg_temp_new_internal(TCGType type, int temp_local)
+static int tcg_temp_new_internal(TCGType base_type, int temp_local)
 {
     TCGContext *s = &tcg_ctx;
     TCGTemp *ts;
     int idx, k;
+    TCGType type = tcg_choose_type(base_type);
 
     k = type + (temp_local ? TCG_TYPE_COUNT : 0);
     idx = find_first_bit(s->free_temps[k].l, TCG_MAX_TEMPS);
@@ -593,28 +639,28 @@ static int tcg_temp_new_internal(TCGType type, int temp_local)
 
         ts = &s->temps[idx];
         ts->temp_allocated = 1;
-        tcg_debug_assert(ts->base_type == type);
+        tcg_debug_assert(ts->base_type == base_type);
         tcg_debug_assert(ts->temp_local == temp_local);
     } else {
         ts = tcg_temp_alloc(s);
-        if (TCG_TARGET_REG_BITS == 32 && type == TCG_TYPE_I64) {
-            TCGTemp *ts2 = tcg_temp_alloc(s);
-
-            ts->base_type = type;
-            ts->type = TCG_TYPE_I32;
-            ts->temp_allocated = 1;
-            ts->temp_local = temp_local;
-
-            tcg_debug_assert(ts2 == ts + 1);
-            ts2->base_type = TCG_TYPE_I64;
-            ts2->type = TCG_TYPE_I32;
-            ts2->temp_allocated = 1;
-            ts2->temp_local = temp_local;
-        } else {
-            ts->base_type = type;
-            ts->type = type;
-            ts->temp_allocated = 1;
-            ts->temp_local = temp_local;
+        ts->base_type = base_type;
+        ts->type = type;
+        ts->temp_allocated = 1;
+        ts->temp_local = temp_local;
+
+        if (type != base_type) {
+            int i, count = tcg_type_size(base_type) / tcg_type_size(type);
+            TCGTemp *ts2, *ts1 = ts;
+
+            for (i = 1; i < count; i++) {
+                ts2 = tcg_temp_alloc(s);
+                tcg_debug_assert(ts2 == ts1 + 1);
+                ts2->base_type = base_type;
+                ts2->type = type;
+                ts2->temp_allocated = 1;
+                ts2->temp_local = temp_local;
+                ts1 = ts2;
+            }
         }
         idx = temp_idx(s, ts);
     }
-- 
2.1.4

* [Qemu-devel] [PATCH v2.1 04/21] tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (2 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 03/21] tcg: support representing vector type with smaller vector or scalar types Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 05/21] tcg: add simple alias analysis Kirill Batuzov
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg-op.h  | 38 ++++++++++++++++++++++++++++++++++++++
 tcg/tcg-opc.h | 18 ++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 517745e..250493b 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -501,6 +501,44 @@ static inline void tcg_gen_discard_v64(TCGv_v64 arg)
     tcg_gen_op1_v64(INDEX_op_discard, arg);
 }
 
+static inline void tcg_gen_ldst_op_v128(TCGOpcode opc, TCGv_v128 val,
+                                       TCGv_ptr base, TCGArg offset)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V128(val), GET_TCGV_PTR(base),
+                offset);
+}
+
+static inline void tcg_gen_st_v128(TCGv_v128 arg1, TCGv_ptr arg2,
+                                   tcg_target_long offset)
+{
+    tcg_gen_ldst_op_v128(INDEX_op_st_v128, arg1, arg2, offset);
+}
+
+static inline void tcg_gen_ld_v128(TCGv_v128 ret, TCGv_ptr arg2,
+                                   tcg_target_long offset)
+{
+    tcg_gen_ldst_op_v128(INDEX_op_ld_v128, ret, arg2, offset);
+}
+
+static inline void tcg_gen_ldst_op_v64(TCGOpcode opc, TCGv_v64 val,
+                                       TCGv_ptr base, TCGArg offset)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V64(val), GET_TCGV_PTR(base),
+                offset);
+}
+
+static inline void tcg_gen_st_v64(TCGv_v64 arg1, TCGv_ptr arg2,
+                                  tcg_target_long offset)
+{
+    tcg_gen_ldst_op_v64(INDEX_op_st_v64, arg1, arg2, offset);
+}
+
+static inline void tcg_gen_ld_v64(TCGv_v64 ret, TCGv_ptr arg2,
+                                  tcg_target_long offset)
+{
+    tcg_gen_ldst_op_v64(INDEX_op_ld_v64, ret, arg2, offset);
+}
+
 /* 64 bit ops */
 
 void tcg_gen_addi_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2);
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index f06f894..2365c97 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -42,6 +42,18 @@ DEF(br, 0, 0, 1, TCG_OPF_BB_END)
 # define IMPL64  TCG_OPF_64BIT
 #endif
 
+#ifdef TCG_TARGET_HAS_REG128
+# define IMPL128 0
+#else
+# define IMPL128 TCG_OPF_NOT_PRESENT
+#endif
+
+#ifdef TCG_TARGET_HAS_REGV64
+# define IMPLV64 0
+#else
+# define IMPLV64 TCG_OPF_NOT_PRESENT
+#endif
+
 DEF(mb, 0, 0, 1, 0)
 
 DEF(mov_i32, 1, 1, 0, TCG_OPF_NOT_PRESENT)
@@ -188,6 +200,12 @@ DEF(mulsh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_mulsh_i64))
 #define TLADDR_ARGS  (TARGET_LONG_BITS <= TCG_TARGET_REG_BITS ? 1 : 2)
 #define DATA64_ARGS  (TCG_TARGET_REG_BITS == 64 ? 1 : 2)
 
+/* load/store */
+DEF(st_v128, 0, 2, 1, IMPL128)
+DEF(ld_v128, 1, 1, 1, IMPL128)
+DEF(st_v64, 0, 2, 1, IMPLV64)
+DEF(ld_v64, 1, 1, 1, IMPLV64)
+
 /* QEMU specific */
 DEF(insn_start, 0, 0, TLADDR_ARGS * TARGET_INSN_START_WORDS,
     TCG_OPF_NOT_PRESENT)
-- 
2.1.4

* [Qemu-devel] [PATCH v2.1 05/21] tcg: add simple alias analysis
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (3 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 04/21] tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 06/21] tcg: use results of alias analysis in liveness analysis Kirill Batuzov
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Add a simple alias analysis to TCG which finds memory loads and stores
that overlap with CPUArchState. This information can be used later in
liveness analysis to ensure correctness of register allocation. In
particular, if a load or store overlaps with the memory location of some
global variable, that variable should be spilled and reloaded at the
appropriate times.

Previously no such analysis was performed, and for correctness it was
required that no load/store operation overlap with the memory locations
of global variables.

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---

I believe the checkpatch warning here is a false positive.
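
As an illustration (not part of the patch), for an op sequence like

    movi_i64  t1, $0x100
    add_i64   t0, env, t1
    st_i32    x, t0, $0x10

the analysis propagates "env + 0x100" into t0 and records the store as
TCG_ALIAS_WRITE with fixed_offset 0x110 and size 4. Liveness analysis
(next patch) can then sync or kill exactly those globals whose memory
locations overlap [0x110, 0x114).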

---
 tcg/optimize.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h      |  17 +++++++
 2 files changed, 163 insertions(+)

diff --git a/tcg/optimize.c b/tcg/optimize.c
index adfc56c..2347ce3 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -34,6 +34,7 @@
 
 struct tcg_temp_info {
     bool is_const;
+    bool is_base;
     uint16_t prev_copy;
     uint16_t next_copy;
     tcg_target_ulong val;
@@ -61,6 +62,7 @@ static void reset_temp(TCGArg temp)
     temps[temp].next_copy = temp;
     temps[temp].prev_copy = temp;
     temps[temp].is_const = false;
+    temps[temp].is_base = false;
     temps[temp].mask = -1;
 }
 
@@ -1429,3 +1431,147 @@ void tcg_optimize(TCGContext *s)
         }
     }
 }
+
+/* Simple alias analysis. It finds out which load/store operations overlap
+   with CPUArchState. The result is stored in TCGContext and can be used
+   during liveness analysis and register allocation. */
+void tcg_alias_analysis(TCGContext *s)
+{
+    int oi, oi_next;
+
+    reset_all_temps(s->nb_temps);
+    temps[GET_TCGV_PTR(s->tcg_env)].is_base = true;
+    temps[GET_TCGV_PTR(s->tcg_env)].val = 0;
+
+    for (oi = s->gen_op_buf[0].next; oi != 0; oi = oi_next) {
+        int nb_oargs, i;
+        int size;
+        TCGAliasType tp;
+
+        TCGOp * const op = &s->gen_op_buf[oi];
+        TCGArg * const args = &s->gen_opparam_buf[op->args];
+        TCGOpcode opc = op->opc;
+        const TCGOpDef *def = &tcg_op_defs[opc];
+
+        oi_next = op->next;
+
+        if (opc == INDEX_op_call) {
+            nb_oargs = op->callo;
+        } else {
+            nb_oargs = def->nb_oargs;
+        }
+
+        s->alias_info[oi] = (TCGAliasInfo){
+                TCG_NOT_ALIAS,
+                false,
+                0,
+                0
+            };
+
+        switch (opc) {
+        CASE_OP_32_64(movi):
+            temps[args[0]].is_const = 1;
+            temps[args[0]].is_base = false;
+            temps[args[0]].val = args[1];
+            break;
+        CASE_OP_32_64(mov):
+            temps[args[0]].is_const = temps[args[1]].is_const;
+            temps[args[0]].is_base = temps[args[1]].is_base;
+            temps[args[0]].val = temps[args[1]].val;
+            break;
+        CASE_OP_32_64(add):
+        CASE_OP_32_64(sub):
+            if (temps[args[1]].is_base && temps[args[2]].is_const) {
+                temps[args[0]].is_base = true;
+                temps[args[0]].is_const = false;
+                temps[args[0]].val =
+                    do_constant_folding(opc, temps[args[1]].val,
+                                        temps[args[2]].val);
+            } else {
+                reset_temp(args[0]);
+            }
+            break;
+        CASE_OP_32_64(ld8s):
+        CASE_OP_32_64(ld8u):
+            size = 1;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        CASE_OP_32_64(ld16s):
+        CASE_OP_32_64(ld16u):
+            size = 2;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        case INDEX_op_ld_i32:
+        case INDEX_op_ld32s_i64:
+        case INDEX_op_ld32u_i64:
+            size = 4;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        case INDEX_op_ld_i64:
+            size = 8;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        case INDEX_op_ld_v128:
+            size = 16;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        CASE_OP_32_64(st8):
+            size = 1;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        CASE_OP_32_64(st16):
+            size = 2;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        case INDEX_op_st_i32:
+        case INDEX_op_st32_i64:
+            size = 4;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        case INDEX_op_st_i64:
+            size = 8;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        case INDEX_op_st_v128:
+            size = 16;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        do_ldst:
+            if (temps[args[1]].is_base) {
+                TCGArg val;
+#if TCG_TARGET_REG_BITS == 32
+                val = do_constant_folding(INDEX_op_add_i32,
+                                          temps[args[1]].val,
+                                          args[2]);
+#else
+                val = do_constant_folding(INDEX_op_add_i64,
+                                          temps[args[1]].val,
+                                          args[2]);
+#endif
+                if ((tcg_target_long)val < sizeof(CPUArchState) &&
+                    (tcg_target_long)val + size > 0) {
+                    s->alias_info[oi].alias_type = tp;
+                    s->alias_info[oi].fixed_offset = true;
+                    s->alias_info[oi].offset = val;
+                    s->alias_info[oi].size = size;
+                } else {
+                    s->alias_info[oi].alias_type = TCG_NOT_ALIAS;
+                }
+            } else {
+                s->alias_info[oi].alias_type = tp;
+                s->alias_info[oi].fixed_offset = false;
+            }
+            goto do_reset_output;
+        default:
+            if (def->flags & TCG_OPF_BB_END) {
+                reset_all_temps(s->nb_temps);
+                temps[GET_TCGV_PTR(s->tcg_env)].is_base = true;
+                temps[GET_TCGV_PTR(s->tcg_env)].val = 0;
+            } else {
+        do_reset_output:
+                for (i = 0; i < nb_oargs; i++) {
+                    reset_temp(args[i]);
+                }
+            }
+            break;
+        }
+    }
+}
diff --git a/tcg/tcg.h b/tcg/tcg.h
index fa455ae..0e1fbe9 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -678,6 +678,20 @@ QEMU_BUILD_BUG_ON(OPPARAM_BUF_SIZE > (1 << 14));
 /* Make sure that we don't overflow 64 bits without noticing.  */
 QEMU_BUILD_BUG_ON(sizeof(TCGOp) > 8);
 
+typedef enum TCGAliasType {
+    TCG_NOT_ALIAS = 0,
+    TCG_ALIAS_READ = 1,
+    TCG_ALIAS_WRITE = 2,
+    TCG_ALIAS_RW = TCG_ALIAS_READ | TCG_ALIAS_WRITE
+} TCGAliasType;
+
+typedef struct TCGAliasInfo {
+    TCGAliasType alias_type;
+    bool fixed_offset;
+    tcg_target_long offset;
+    tcg_target_long size;
+} TCGAliasInfo;
+
 struct TCGContext {
     uint8_t *pool_cur, *pool_end;
     TCGPool *pool_first, *pool_current, *pool_first_large;
@@ -762,6 +776,8 @@ struct TCGContext {
     TCGOp gen_op_buf[OPC_BUF_SIZE];
     TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
 
+    TCGAliasInfo alias_info[OPC_BUF_SIZE];
+
     uint16_t gen_insn_end_off[TCG_MAX_INSNS];
     target_ulong gen_insn_data[TCG_MAX_INSNS][TARGET_INSN_START_WORDS];
 };
@@ -1009,6 +1025,7 @@ TCGOp *tcg_op_insert_before(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
 TCGOp *tcg_op_insert_after(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
 
 void tcg_optimize(TCGContext *s);
+void tcg_alias_analysis(TCGContext *s);
 
 /* only used for debugging purposes */
 void tcg_dump_ops(TCGContext *s);
-- 
2.1.4

* [Qemu-devel] [PATCH v2.1 06/21] tcg: use results of alias analysis in liveness analysis
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (4 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 05/21] tcg: add simple alias analysis Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 07/21] tcg: allow globals to overlap Kirill Batuzov
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index 18d97ec..27e5944 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -564,6 +564,11 @@ static intptr_t tcg_type_size(TCGType type)
     }
 }
 
+static intptr_t tcg_temp_size(const TCGTemp *tmp)
+{
+    return tcg_type_size(tmp->type);
+}
+
 int tcg_global_mem_new_internal(TCGType base_type, TCGv_ptr base,
                                 intptr_t offset, const char *name)
 {
@@ -1472,6 +1477,43 @@ static inline void tcg_la_bb_end(TCGContext *s, uint8_t *temp_state)
     }
 }
 
+/* Check if memory write completely overwrites temp's memory location.
+   If this is the case then the temp can be considered dead. */
+static int tcg_temp_overwrite(TCGContext *s, const TCGTemp *tmp,
+                               const TCGAliasInfo *ai)
+{
+    if (!(ai->alias_type & TCG_ALIAS_WRITE) || !ai->fixed_offset) {
+        return 0;
+    }
+    if (tmp->mem_base != &s->temps[GET_TCGV_PTR(s->tcg_env)]) {
+        return 0;
+    }
+    if (ai->offset > tmp->mem_offset
+        || ai->offset + ai->size < tmp->mem_offset + tcg_temp_size(tmp)) {
+            return 0;
+    }
+    return 1;
+}
+
+/* Check if memory read or write overlaps with temp's memory location.
+   If this is the case then the temp must be synced to memory. */
+static int tcg_temp_overlap(TCGContext *s, const TCGTemp *tmp,
+                            const TCGAliasInfo *ai)
+{
+    if (!ai->fixed_offset || tmp->fixed_reg) {
+        return 0;
+    }
+    if (tmp->mem_base != &s->temps[GET_TCGV_PTR(s->tcg_env)]) {
+        return 1;
+    }
+    if (ai->offset >= tmp->mem_offset + tcg_temp_size(tmp)
+        || ai->offset + ai->size <= tmp->mem_offset) {
+            return 0;
+    } else {
+        return 1;
+    }
+}
+
 /* Liveness analysis : update the opc_arg_life array to tell if a
    given input arguments is dead. Instructions updating dead
    temporaries are removed. */
@@ -1674,6 +1716,23 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                     temp_state[arg] = TS_DEAD;
                 }
 
+                /* record if the operation uses some globals' memory location */
+                if (s->alias_info[oi].alias_type != TCG_NOT_ALIAS) {
+                    for (i = 0; i < s->nb_globals; i++) {
+                        if (tcg_temp_overwrite(s, &s->temps[i],
+                                               &s->alias_info[oi])) {
+                            temp_state[i] = TS_DEAD;
+                        } else if (tcg_temp_overlap(s, &s->temps[i],
+                                                    &s->alias_info[oi])) {
+                            if (s->alias_info[oi].alias_type & TCG_ALIAS_READ) {
+                                temp_state[i] = TS_MEM | TS_DEAD;
+                            } else if (!(temp_state[i] & TS_DEAD)) {
+                                temp_state[i] |= TS_MEM;
+                            }
+                        }
+                    }
+                }
+
                 /* if end of basic block, update */
                 if (def->flags & TCG_OPF_BB_END) {
                     tcg_la_bb_end(s, temp_state);
@@ -2622,6 +2681,8 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
     s->la_time -= profile_getclock();
 #endif
 
+    tcg_alias_analysis(s);
+
     {
         uint8_t *temp_state = tcg_malloc(s->nb_temps + s->nb_indirects);
 
-- 
2.1.4

* [Qemu-devel] [PATCH v2.1 07/21] tcg: allow globals to overlap
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (5 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 06/21] tcg: use results of alias analysis in liveness analysis Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 08/21] tcg: add vector addition operations Kirill Batuzov
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Sometimes the target architecture may allow some parts of a register to be
accessed as a different register. If both of these registers are
implemented as globals in QEMU, their contents will overlap and a change
to one global will also change the value of the other. To handle such
situations properly, some fixes are needed in the register allocator
and liveness analysis.
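
As an illustrative example (an ARM-like layout): if q0 is declared as a
128-bit global and d0/d1 as 64-bit globals covering the same CPUArchState
memory, tcg_detect_overlapping_temps() records d0 and d1 in q0's sub_temps
and q0 in each of d0's and d1's overlap_temps. Liveness analysis then
treats a write to q0 as also killing d0 and d1, while a write to d0 forces
q0 to be synced to memory first.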

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/optimize.c |  19 ++++++++-
 tcg/tcg.c      | 128 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h      |  20 +++++++++
 3 files changed, 166 insertions(+), 1 deletion(-)

diff --git a/tcg/optimize.c b/tcg/optimize.c
index 2347ce3..7a69ff0 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -55,7 +55,7 @@ static inline bool temp_is_copy(TCGArg arg)
 }
 
 /* Reset TEMP's state, possibly removing the temp for the list of copies.  */
-static void reset_temp(TCGArg temp)
+static void reset_this_temp(TCGArg temp)
 {
     temps[temps[temp].next_copy].prev_copy = temps[temp].prev_copy;
     temps[temps[temp].prev_copy].next_copy = temps[temp].next_copy;
@@ -66,6 +66,23 @@ static void reset_temp(TCGArg temp)
     temps[temp].mask = -1;
 }
 
+static void reset_temp(TCGArg temp)
+{
+    int i;
+    TCGTemp *ts = &tcg_ctx.temps[temp];
+    reset_this_temp(temp);
+    if (ts->sub_temps) {
+        for (i = 0; ts->sub_temps[i] != (TCGArg)-1; i++) {
+            reset_this_temp(ts->sub_temps[i]);
+        }
+    }
+    if (ts->overlap_temps) {
+        for (i = 0; ts->overlap_temps[i] != (TCGArg)-1; i++) {
+            reset_this_temp(ts->overlap_temps[i]);
+        }
+    }
+}
+
 /* Reset all temporaries, given that there are NB_TEMPS of them.  */
 static void reset_all_temps(int nb_temps)
 {
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 27e5944..a8df040 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -623,9 +623,13 @@ int tcg_global_mem_new_internal(TCGType base_type, TCGv_ptr base,
             ts2->mem_base = base_ts;
             ts2->mem_offset = offset + cur_offset;
             ts2->name = g_strdup_printf("%s_%d", name, i);
+            ts2->sub_temps = NULL;
+            ts2->overlap_temps = NULL;
             ts1 = ts2;
         }
     }
+    ts->sub_temps = NULL;
+    ts->overlap_temps = NULL;
     return temp_idx(s, ts);
 }
 
@@ -1514,6 +1518,35 @@ static int tcg_temp_overlap(TCGContext *s, const TCGTemp *tmp,
     }
 }
 
+static void tcg_temp_arr_apply(const TCGArg *arr, uint8_t *temp_state,
+                               uint8_t temp_val)
+{
+    TCGArg i;
+    if (!arr) {
+        return;
+    }
+    for (i = 0; arr[i] != (TCGArg)-1; i++) {
+        temp_state[arr[i]] = temp_val;
+    }
+}
+
+static void tcg_sub_temps_dead(TCGContext *s, TCGArg tmp, uint8_t *temp_state)
+{
+    tcg_temp_arr_apply(s->temps[tmp].sub_temps, temp_state, TS_DEAD);
+}
+
+static void tcg_sub_temps_sync(TCGContext *s, TCGArg tmp, uint8_t *temp_state)
+{
+    tcg_temp_arr_apply(s->temps[tmp].sub_temps, temp_state, TS_MEM | TS_DEAD);
+}
+
+static void tcg_overlap_temps_sync(TCGContext *s, TCGArg tmp,
+                                   uint8_t *temp_state)
+{
+    tcg_temp_arr_apply(s->temps[tmp].overlap_temps, temp_state,
+                       TS_MEM | TS_DEAD);
+}
+
 /* Liveness analysis : update the opc_arg_life array to tell if a
    given input arguments is dead. Instructions updating dead
    temporaries are removed. */
@@ -1568,6 +1601,11 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                         if (temp_state[arg] & TS_MEM) {
                             arg_life |= SYNC_ARG << i;
                         }
+                        /* sub_temps are also dead */
+                        tcg_sub_temps_dead(&tcg_ctx, arg, temp_state);
+                        /* overlap_temps need to go to memory */
+                        tcg_overlap_temps_sync(&tcg_ctx, arg, temp_state);
+
                         temp_state[arg] = TS_DEAD;
                     }
 
@@ -1595,6 +1633,11 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                     for (i = nb_oargs; i < nb_iargs + nb_oargs; i++) {
                         arg = args[i];
                         if (arg != TCG_CALL_DUMMY_ARG) {
+                            /* both sub_temps and overlap_temps need to go
+                               to memory */
+                            tcg_sub_temps_sync(&tcg_ctx, arg, temp_state);
+                            tcg_overlap_temps_sync(&tcg_ctx, arg, temp_state);
+
                             temp_state[arg] &= ~TS_DEAD;
                         }
                     }
@@ -1713,6 +1756,11 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                     if (temp_state[arg] & TS_MEM) {
                         arg_life |= SYNC_ARG << i;
                     }
+                    /* sub_temps are also dead */
+                    tcg_sub_temps_dead(&tcg_ctx, arg, temp_state);
+                    /* overlap_temps need to go to memory */
+                    tcg_overlap_temps_sync(&tcg_ctx, arg, temp_state);
+
                     temp_state[arg] = TS_DEAD;
                 }
 
@@ -1753,6 +1801,9 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                 /* input arguments are live for preceding opcodes */
                 for (i = nb_oargs; i < nb_oargs + nb_iargs; i++) {
                     temp_state[args[i]] &= ~TS_DEAD;
+                    /* both sub_temps and overlap_temps need to go to memory */
+                    tcg_sub_temps_sync(&tcg_ctx, args[i], temp_state);
+                    tcg_overlap_temps_sync(&tcg_ctx, args[i], temp_state);
                 }
             }
             break;
@@ -3139,3 +3190,80 @@ void tcg_register_jit(void *buf, size_t buf_size)
 {
 }
 #endif /* ELF_HOST_MACHINE */
+
+static int tcg_temp_is_sub_temp(const TCGTemp *t1, const TCGTemp *t2)
+{
+    if (t2->mem_offset < t1->mem_offset) {
+        return 0;
+    }
+    if (t2->mem_offset + tcg_temp_size(t2) >
+        t1->mem_offset + tcg_temp_size(t1)) {
+        return 0;
+    }
+    return 1;
+}
+
+static int tcg_temp_is_overlap_temp(const TCGTemp *t1, const TCGTemp *t2)
+{
+    if (t2->mem_offset >= t1->mem_offset + tcg_temp_size(t1)) {
+        return 0;
+    }
+    if (t2->mem_offset + tcg_temp_size(t2) <= t1->mem_offset) {
+        return 0;
+    }
+    return 1;
+}
+
+void tcg_detect_overlapping_temps(TCGContext *s)
+{
+    int i, j;
+    int overlap_count, subtemps_count;
+    TCGArg *sub_temps, *overlap_temps;
+    TCGTemp *ts;
+    for (i = 0; i < s->nb_globals; i++) {
+        ts = &s->temps[i];
+        if (ts->fixed_reg ||
+            ts->mem_base != &s->temps[GET_TCGV_PTR(s->tcg_env)]) {
+            continue;
+        }
+        overlap_count = 0;
+        subtemps_count = 0;
+        overlap_temps = NULL;
+        sub_temps = NULL;
+        for (j = 0; j < s->nb_globals; j++) {
+            if (i != j && !s->temps[j].fixed_reg &&
+                s->temps[j].mem_base == &s->temps[GET_TCGV_PTR(s->tcg_env)]) {
+                if (tcg_temp_is_sub_temp(ts, &s->temps[j])) {
+                    subtemps_count++;
+                } else if (tcg_temp_is_overlap_temp(ts, &s->temps[j])) {
+                    overlap_count++;
+                }
+            }
+        }
+        if (subtemps_count == 0 && overlap_count == 0) {
+            continue;
+        }
+        if (subtemps_count > 0) {
+            sub_temps = g_malloc0((subtemps_count + 1) * sizeof(TCGArg));
+            sub_temps[subtemps_count] = (TCGArg)-1;
+            tcg_temp_set_sub_temps(i, sub_temps);
+        }
+        if (overlap_count > 0) {
+            overlap_temps = g_malloc0((overlap_count + 1) * sizeof(TCGArg));
+            overlap_temps[overlap_count] = (TCGArg)-1;
+            tcg_temp_set_overlap_temps(i, overlap_temps);
+        }
+        overlap_count = 0;
+        subtemps_count = 0;
+        for (j = 0; j < s->nb_globals; j++) {
+            if (i != j && !s->temps[j].fixed_reg &&
+                s->temps[j].mem_base == &s->temps[GET_TCGV_PTR(s->tcg_env)]) {
+                if (tcg_temp_is_sub_temp(ts, &s->temps[j])) {
+                    sub_temps[subtemps_count++] = (TCGArg)j;
+                } else if (tcg_temp_is_overlap_temp(ts, &s->temps[j])) {
+                    overlap_temps[overlap_count++] = (TCGArg)j;
+                }
+            }
+        }
+    }
+}
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 0e1fbe9..01299cc 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -634,6 +634,14 @@ typedef struct TCGTemp {
     struct TCGTemp *mem_base;
     intptr_t mem_offset;
     const char *name;
+
+    /* -1 terminated array of temps that are parts of this temp.
+       All bits of them are part of this temp. */
+    const TCGArg *sub_temps;
+    /* -1 terminated array of temps that overlap with this temp.
+       Some bits of them are part of this temp, but some are not. sub_temps
+       are not included here. */
+    const TCGArg *overlap_temps;
 } TCGTemp;
 
 typedef struct TCGContext TCGContext;
@@ -837,6 +845,16 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb);
 
 void tcg_set_frame(TCGContext *s, TCGReg reg, intptr_t start, intptr_t size);
 
+static inline void tcg_temp_set_sub_temps(TCGArg temp, const TCGArg *arr)
+{
+    tcg_ctx.temps[temp].sub_temps = arr;
+}
+
+static inline void tcg_temp_set_overlap_temps(TCGArg temp, const TCGArg *arr)
+{
+    tcg_ctx.temps[temp].overlap_temps = arr;
+}
+
 int tcg_global_mem_new_internal(TCGType, TCGv_ptr, intptr_t, const char *);
 
 TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name);
@@ -1382,4 +1400,6 @@ void helper_atomic_sto_be_mmu(CPUArchState *env, target_ulong addr, Int128 val,
 
 #endif /* CONFIG_ATOMIC128 */
 
+void tcg_detect_overlapping_temps(TCGContext *s);
+
 #endif /* TCG_H */
-- 
2.1.4

* [Qemu-devel] [PATCH v2.1 08/21] tcg: add vector addition operations
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (6 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 07/21] tcg: allow globals to overlap Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 09/21] target/arm: support access to vector guest registers as globals Kirill Batuzov
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---

Support for representing a v128 addition as two v64 additions has been
added. As a result the GEN_VECT_WRAPPER_HALVES macro was added. It is
larger and more complicated than the original GEN_VECT_WRAPPER (which is
still used for v64 additions because they have no half operations (v32
additions)).

GEN_VECT_WRAPPER_HALVES seems to grow fast (in size and complexity) with
each supported representation. Calling tcg_gen_add_<smaller_size> may not
be desirable because the last-resort fallback code is better generated for
the whole vector, as that requires fewer additional operations.

Some additional performance optimization can be done by creating
hand-written tcg_gen_internal_<operation> helpers for some cases (for
example, add_i8x16). Such a helper would still operate on memory locations
but would use 64-bit scalar additions with some bit masking, as Richard
suggested in the v1 discussion (see the sketch below). This series is
focused on infrastructure (not on optimizing particular instructions), so
I have not included this optimization yet.
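
For reference, a sketch of that trick in plain C (illustrative, not part
of the patch): eight byte lanes are added with 64-bit scalar operations by
keeping carries from crossing byte boundaries, so an i8x16 addition costs
two such operations plus the loads and stores.

#include <stdint.h>

static uint64_t add_i8x8_swar(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL;  /* top bit of each byte */
    /* Add the low 7 bits of every byte, then patch the top bits back in
       with a carry-free XOR.  */
    uint64_t low = (a & ~H) + (b & ~H);
    return low ^ ((a ^ b) & H);
}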

---
 tcg/tcg-op.c  |  64 ++++++++++++++++++++++
 tcg/tcg-op.h  | 167 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg-opc.h |  12 +++++
 tcg/tcg.c     |  12 +++++
 tcg/tcg.h     |  43 +++++++++++++++
 5 files changed, 298 insertions(+)

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 95a39b7..8a19eee 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -3038,3 +3038,67 @@ static void tcg_gen_mov2_i64(TCGv_i64 r, TCGv_i64 a, TCGv_i64 b)
 GEN_ATOMIC_HELPER(xchg, mov2, 0)
 
 #undef GEN_ATOMIC_HELPER
+
+/* Find a memory location for a 128-bit TCG variable. */
+void tcg_v128_to_ptr(TCGv_v128 tmp, TCGv_ptr base, int slot,
+                     TCGv_ptr *real_base, intptr_t *real_offset, int is_read)
+{
+    int idx = GET_TCGV_V128(tmp);
+    assert(idx >= 0 && idx < tcg_ctx.nb_temps);
+    if (idx < tcg_ctx.nb_globals) {
+        /* Globals use their locations within CPUArchState. */
+        int env = GET_TCGV_PTR(tcg_ctx.tcg_env);
+        TCGTemp *ts_env = &tcg_ctx.temps[env];
+        TCGTemp *ts_arg = &tcg_ctx.temps[idx];
+
+        /* Sanity checks: global's memory locations must be addressed
+           relative to ENV. */
+        assert(ts_env->val_type == TEMP_VAL_REG &&
+               ts_env == ts_arg->mem_base &&
+               ts_arg->mem_allocated);
+
+        *real_base = tcg_ctx.tcg_env;
+        *real_offset = ts_arg->mem_offset;
+    } else {
+        /* Temporaries use swap space in TCGContext. Since we already have
+           a 128-bit temporary we'll assume that the target supports 128-bit
+           loads and stores. */
+        *real_base = base;
+        *real_offset = slot * 16;
+        if (is_read) {
+            tcg_gen_st_v128(tmp, base, slot * 16);
+        }
+    }
+}
+
+/* Find a memory location for a 64-bit vector TCG variable. */
+void tcg_v64_to_ptr(TCGv_v64 tmp, TCGv_ptr base, int slot,
+                    TCGv_ptr *real_base, intptr_t *real_offset, int is_read)
+{
+    int idx = GET_TCGV_V64(tmp);
+    assert(idx >= 0 && idx < tcg_ctx.nb_temps);
+    if (idx < tcg_ctx.nb_globals) {
+        /* Globals use their locations within CPUArchState. */
+        int env = GET_TCGV_PTR(tcg_ctx.tcg_env);
+        TCGTemp *ts_env = &tcg_ctx.temps[env];
+        TCGTemp *ts_arg = &tcg_ctx.temps[idx];
+
+        /* Sanity checks: global's memory locations must be addressed
+           relative to ENV. */
+        assert(ts_env->val_type == TEMP_VAL_REG &&
+               ts_env == ts_arg->mem_base &&
+               ts_arg->mem_allocated);
+
+        *real_base = tcg_ctx.tcg_env;
+        *real_offset = ts_arg->mem_offset;
+    } else {
+        /* Temporaries use swap space in TCGContext. Since we already have
+           a 64-bit vector temporary we'll assume that the target supports
+           64-bit vector loads and stores. */
+        *real_base = base;
+        *real_offset = slot * 16;
+        if (is_read) {
+            tcg_gen_st_v64(tmp, base, slot * 16);
+        }
+    }
+}
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 250493b..3727be7 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -1195,6 +1195,10 @@ void tcg_gen_atomic_xor_fetch_i64(TCGv_i64, TCGv, TCGv_i64, TCGArg, TCGMemOp);
     tcg_gen_add_i32(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), TCGV_PTR_TO_NAT(B))
 # define tcg_gen_addi_ptr(R, A, B) \
     tcg_gen_addi_i32(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), (B))
+# define tcg_gen_mov_ptr(R, B) \
+    tcg_gen_mov_i32(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(B))
+# define tcg_gen_movi_ptr(R, B) \
+    tcg_gen_movi_i32(TCGV_PTR_TO_NAT(R), (B))
 # define tcg_gen_ext_i32_ptr(R, A) \
     tcg_gen_mov_i32(TCGV_PTR_TO_NAT(R), (A))
 #else
@@ -1206,6 +1210,169 @@ void tcg_gen_atomic_xor_fetch_i64(TCGv_i64, TCGv, TCGv_i64, TCGArg, TCGMemOp);
     tcg_gen_add_i64(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), TCGV_PTR_TO_NAT(B))
 # define tcg_gen_addi_ptr(R, A, B) \
     tcg_gen_addi_i64(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), (B))
+# define tcg_gen_mov_ptr(R, B) \
+    tcg_gen_mov_i64(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(B))
+# define tcg_gen_movi_ptr(R, B) \
+    tcg_gen_movi_i64(TCGV_PTR_TO_NAT(R), (B))
 # define tcg_gen_ext_i32_ptr(R, A) \
     tcg_gen_ext_i32_i64(TCGV_PTR_TO_NAT(R), (A))
 #endif /* UINTPTR_MAX == UINT32_MAX */
+
+/***************************************/
+/* 64-bit and 128-bit vector arithmetic. */
+
+/* Find a memory location for a 128-bit TCG variable. */
+void tcg_v128_to_ptr(TCGv_v128 tmp, TCGv_ptr base, int slot,
+                     TCGv_ptr *real_base, intptr_t *real_offset, int is_read);
+/* Find a memory location for a 64-bit vector TCG variable. */
+void tcg_v64_to_ptr(TCGv_v64 tmp, TCGv_ptr base, int slot,
+                    TCGv_ptr *real_base, intptr_t *real_offset, int is_read);
+
+#define VTYPE(width) glue(TCG_TYPE_V, width)
+#define TEMP_TYPE(arg, temp_type) \
+            tcg_ctx.temps[glue(GET_TCGV_, temp_type)(arg)].type
+
+#define GEN_VECT_WRAPPER_HALVES(op, width, half_op, half_width, func)        \
+    static inline void glue(tcg_gen_, op)(glue(TCGv_v, width) res,           \
+                                            glue(TCGv_v, width) arg1,        \
+                                            glue(TCGv_v, width) arg2)        \
+    {                                                                        \
+        if (glue(TCG_TARGET_HAS_, op)) {                                     \
+            glue(tcg_gen_op3_v, width)(glue(INDEX_op_, op), res, arg1,       \
+                                       arg2);                                \
+        } else if (TEMP_TYPE(res, glue(V, width)) == VTYPE(half_width) &&    \
+                   glue(TCG_TARGET_HAS_, half_op)) {                         \
+            glue(TCGv_v, half_width) res_lo, res_hi, arg1_lo, arg1_hi,       \
+                                     arg2_lo, arg2_hi;                       \
+            res_lo = glue(tcg_temp_low_half_v, width)(res);                  \
+            res_hi = glue(tcg_temp_high_half_v, width)(res);                 \
+            arg1_lo = glue(tcg_temp_low_half_v, width)(arg1);                \
+            arg1_hi = glue(tcg_temp_high_half_v, width)(arg1);               \
+            arg2_lo = glue(tcg_temp_low_half_v, width)(arg2);                \
+            arg2_hi = glue(tcg_temp_high_half_v, width)(arg2);               \
+            glue(tcg_gen_op3_v, half_width)(glue(INDEX_op_, half_op),        \
+                                            res_lo, arg1_lo, arg2_lo);       \
+            glue(tcg_gen_op3_v, half_width)(glue(INDEX_op_, half_op),        \
+                                            res_hi, arg1_hi, arg2_hi);       \
+        } else {                                                             \
+            TCGv_ptr base =                                                  \
+                        MAKE_TCGV_PTR(tcg_ctx.frame_temp - tcg_ctx.temps);   \
+            TCGv_ptr t1 = tcg_temp_new_ptr();                                \
+            TCGv_ptr t2 = tcg_temp_new_ptr();                                \
+            TCGv_ptr t3 = tcg_temp_new_ptr();                                \
+            TCGv_ptr arg1p, arg2p, resp;                                     \
+            intptr_t arg1of, arg2of, resof;                                  \
+                                                                             \
+            glue(glue(tcg_v, width), _to_ptr)(arg1, base, 1,                 \
+                                            &arg1p, &arg1of, 1);             \
+            glue(glue(tcg_v, width), _to_ptr)(arg2, base, 2,                 \
+                                            &arg2p, &arg2of, 1);             \
+            glue(glue(tcg_v, width), _to_ptr)(res, base, 0, &resp, &resof,   \
+                                              0);                            \
+                                                                             \
+            tcg_gen_addi_ptr(t1, resp, resof);                               \
+            tcg_gen_addi_ptr(t2, arg1p, arg1of);                             \
+            tcg_gen_addi_ptr(t3, arg2p, arg2of);                             \
+            func(t1, t2, t3);                                                \
+                                                                             \
+            if ((intptr_t)res >= tcg_ctx.nb_globals) {                       \
+                glue(tcg_gen_ld_v, width)(res, base, 0);                     \
+            }                                                                \
+                                                                             \
+            tcg_temp_free_ptr(t1);                                           \
+            tcg_temp_free_ptr(t2);                                           \
+            tcg_temp_free_ptr(t3);                                           \
+        }                                                                    \
+    }
+
+#define GEN_VECT_WRAPPER(op, width, func)                                    \
+    static inline void glue(tcg_gen_, op)(glue(TCGv_v, width) res,           \
+                                            glue(TCGv_v, width) arg1,        \
+                                            glue(TCGv_v, width) arg2)        \
+    {                                                                        \
+        if (glue(TCG_TARGET_HAS_, op)) {                                     \
+            glue(tcg_gen_op3_v, width)(glue(INDEX_op_, op), res, arg1,       \
+                                       arg2);                                \
+        } else {                                                             \
+            TCGv_ptr base =                                                  \
+                        MAKE_TCGV_PTR(tcg_ctx.frame_temp - tcg_ctx.temps);   \
+            TCGv_ptr t1 = tcg_temp_new_ptr();                                \
+            TCGv_ptr t2 = tcg_temp_new_ptr();                                \
+            TCGv_ptr t3 = tcg_temp_new_ptr();                                \
+            TCGv_ptr arg1p, arg2p, resp;                                     \
+            intptr_t arg1of, arg2of, resof;                                  \
+                                                                             \
+            glue(glue(tcg_v, width), _to_ptr)(arg1, base, 1,                 \
+                                            &arg1p, &arg1of, 1);             \
+            glue(glue(tcg_v, width), _to_ptr)(arg2, base, 2,                 \
+                                            &arg2p, &arg2of, 1);             \
+            glue(glue(tcg_v, width), _to_ptr)(res, base, 0, &resp, &resof,   \
+                                              0);                            \
+                                                                             \
+            tcg_gen_addi_ptr(t1, resp, resof);                               \
+            tcg_gen_addi_ptr(t2, arg1p, arg1of);                             \
+            tcg_gen_addi_ptr(t3, arg2p, arg2of);                             \
+            func(t1, t2, t3);                                                \
+                                                                             \
+            if ((intptr_t)res >= tcg_ctx.nb_globals) {                       \
+                glue(tcg_gen_ld_v, width)(res, base, 0);                     \
+            }                                                                \
+                                                                             \
+            tcg_temp_free_ptr(t1);                                           \
+            tcg_temp_free_ptr(t2);                                           \
+            tcg_temp_free_ptr(t3);                                           \
+        }                                                                    \
+    }
+#define TCG_INTERNAL_OP(name, N, size, ld, st, op, type)                     \
+    static inline void glue(tcg_internal_, name)(TCGv_ptr resp,              \
+                                                 TCGv_ptr arg1p,             \
+                                                 TCGv_ptr arg2p)             \
+    {                                                                        \
+        int i;                                                               \
+        glue(TCGv_, type) tmp1, tmp2;                                        \
+                                                                             \
+        tmp1 = glue(tcg_temp_new_, type)();                                  \
+        tmp2 = glue(tcg_temp_new_, type)();                                  \
+                                                                             \
+        for (i = 0; i < N; i++) {                                            \
+            glue(tcg_gen_, ld)(tmp1, arg1p, i * size);                       \
+            glue(tcg_gen_, ld)(tmp2, arg2p, i * size);                       \
+            glue(tcg_gen_, op)(tmp1, tmp1, tmp2);                            \
+            glue(tcg_gen_, st)(tmp1, resp, i * size);                        \
+        }                                                                    \
+                                                                             \
+        glue(tcg_temp_free_, type)(tmp1);                                    \
+        glue(tcg_temp_free_, type)(tmp2);                                    \
+    }
+
+#define TCG_INTERNAL_OP_8(name, N, op) \
+    TCG_INTERNAL_OP(name, N, 1, ld8u_i32, st8_i32, op, i32)
+#define TCG_INTERNAL_OP_16(name, N, op) \
+    TCG_INTERNAL_OP(name, N, 2, ld16u_i32, st16_i32, op, i32)
+#define TCG_INTERNAL_OP_32(name, N, op) \
+    TCG_INTERNAL_OP(name, N, 4, ld_i32, st_i32, op, i32)
+#define TCG_INTERNAL_OP_64(name, N, op) \
+    TCG_INTERNAL_OP(name, N, 8, ld_i64, st_i64, op, i64)
+
+TCG_INTERNAL_OP_8(add_i8x16, 16, add_i32)
+TCG_INTERNAL_OP_16(add_i16x8, 8, add_i32)
+TCG_INTERNAL_OP_32(add_i32x4, 4, add_i32)
+TCG_INTERNAL_OP_64(add_i64x2, 2, add_i64)
+
+TCG_INTERNAL_OP_8(add_i8x8, 8, add_i32)
+TCG_INTERNAL_OP_16(add_i16x4, 4, add_i32)
+TCG_INTERNAL_OP_32(add_i32x2, 2, add_i32)
+TCG_INTERNAL_OP_64(add_i64x1, 1, add_i64)
+
+GEN_VECT_WRAPPER_HALVES(add_i8x16, 128, add_i8x8, 64, tcg_internal_add_i8x16)
+GEN_VECT_WRAPPER_HALVES(add_i16x8, 128, add_i16x4, 64, tcg_internal_add_i16x8)
+GEN_VECT_WRAPPER_HALVES(add_i32x4, 128, add_i32x2, 64, tcg_internal_add_i32x4)
+GEN_VECT_WRAPPER_HALVES(add_i64x2, 128, add_i64x1, 64, tcg_internal_add_i64x2)
+
+GEN_VECT_WRAPPER(add_i8x8, 64, tcg_internal_add_i8x8)
+GEN_VECT_WRAPPER(add_i16x4, 64, tcg_internal_add_i16x4)
+GEN_VECT_WRAPPER(add_i32x2, 64, tcg_internal_add_i32x2)
+GEN_VECT_WRAPPER(add_i64x1, 64, tcg_internal_add_i64x1)
+
+#undef VTYPE
+#undef TEMP_TYPE
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 2365c97..4c8f195 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -206,6 +206,18 @@ DEF(ld_v128, 1, 1, 1, IMPL128)
 DEF(st_v64, 0, 2, 1, IMPLV64)
 DEF(ld_v64, 1, 1, 1, IMPLV64)
 
+/* 128-bit vector arith */
+DEF(add_i8x16, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i8x16))
+DEF(add_i16x8, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i16x8))
+DEF(add_i32x4, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i32x4))
+DEF(add_i64x2, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i64x2))
+
+/* 64-bit vector arith */
+DEF(add_i8x8, 1, 2, 0, IMPLV64 | IMPL(TCG_TARGET_HAS_add_i8x8))
+DEF(add_i16x4, 1, 2, 0, IMPLV64 | IMPL(TCG_TARGET_HAS_add_i16x4))
+DEF(add_i32x2, 1, 2, 0, IMPLV64 | IMPL(TCG_TARGET_HAS_add_i32x2))
+DEF(add_i64x1, 1, 2, 0, IMPLV64 | IMPL(TCG_TARGET_HAS_add_i64x1))
+
 /* QEMU specific */
 DEF(insn_start, 0, 0, TLADDR_ARGS * TARGET_INSN_START_WORDS,
     TCG_OPF_NOT_PRESENT)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index a8df040..a23f739 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -712,6 +712,18 @@ TCGv_v128 tcg_temp_new_internal_v128(int temp_local)
     return MAKE_TCGV_V128(idx);
 }
 
+int tcg_temp_half_internal(int arg, TCGType type, int is_high)
+{
+    const TCGTemp *ts = &tcg_ctx.temps[arg];
+    tcg_debug_assert(ts->type != ts->base_type);
+    tcg_debug_assert(tcg_type_size(type) > tcg_type_size(ts->type));
+    tcg_debug_assert(tcg_type_size(type) <= tcg_type_size(ts->base_type));
+    if (is_high) {
+        arg += tcg_type_size(type) / tcg_type_size(ts->type) / 2;
+    }
+    return arg;
+}
+
 static void tcg_temp_free_internal(int idx)
 {
     TCGContext *s = &tcg_ctx;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 01299cc..fd43f15 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -156,6 +156,34 @@ typedef uint64_t TCGRegSet;
 #define TCG_TARGET_HAS_rem_i64          0
 #endif
 
+/* 64-bit vector */
+#ifndef TCG_TARGET_HAS_add_i8x8
+#define TCG_TARGET_HAS_add_i8x8         0
+#endif
+#ifndef TCG_TARGET_HAS_add_i16x4
+#define TCG_TARGET_HAS_add_i16x4        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i32x2
+#define TCG_TARGET_HAS_add_i32x2        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i64x1
+#define TCG_TARGET_HAS_add_i64x1        0
+#endif
+
+/* 128-bit vector */
+#ifndef TCG_TARGET_HAS_add_i8x16
+#define TCG_TARGET_HAS_add_i8x16        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i16x8
+#define TCG_TARGET_HAS_add_i16x8        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i32x4
+#define TCG_TARGET_HAS_add_i32x4        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i64x2
+#define TCG_TARGET_HAS_add_i64x2        0
+#endif
+
 /* For 32-bit targets, some sort of unsigned widening multiply is required.  */
 #if TCG_TARGET_REG_BITS == 32 \
     && !(defined(TCG_TARGET_HAS_mulu2_i32) \
@@ -761,6 +789,7 @@ struct TCGContext {
     void *code_gen_buffer;
     size_t code_gen_buffer_size;
     void *code_gen_ptr;
+    uint8_t v128_swap[16 * 3];
 
     /* Threshold to flush the translated code buffer.  */
     void *code_gen_highwater;
@@ -938,6 +967,20 @@ static inline TCGv_v128 tcg_temp_local_new_v128(void)
     return tcg_temp_new_internal_v128(1);
 }
 
+int tcg_temp_half_internal(int arg, TCGType type, int is_high);
+
+static inline TCGv_v64 tcg_temp_low_half_v128(TCGv_v128 arg)
+{
+    int idx = tcg_temp_half_internal(GET_TCGV_V128(arg), TCG_TYPE_V128, 0);
+    return MAKE_TCGV_V64(idx);
+}
+
+static inline TCGv_v64 tcg_temp_high_half_v128(TCGv_v128 arg)
+{
+    int idx = tcg_temp_half_internal(GET_TCGV_V128(arg), TCG_TYPE_V128, 1);
+    return MAKE_TCGV_V64(idx);
+}
+
 #if defined(CONFIG_DEBUG_TCG)
 /* If you call tcg_clear_temp_count() at the start of a section of
  * code which is not supposed to leak any TCG temporaries, then
-- 
2.1.4


* [Qemu-devel] [PATCH v2.1 09/21] target/arm: support access to vector guest registers as globals
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (7 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 08/21] tcg: add vector addition operations Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 10/21] target/arm: use vector opcode to handle vadd.<size> instruction Kirill Batuzov
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

To support vector guest registers as globals we need to do two things:

1) create corresponding globals,
2) mark which globals can overlap (see the illustration in the notes below).

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---

For the vector registers I used the same coding style as for the scalar
registers. Should I change the brace placement for them all?
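
To make the overlap marking concrete: for the first Q register the detection
pass effectively attaches arrays like these (the temp indices are invented for
the illustration):

    /* hypothetical temp indices: q0 = 18, d0 = 34, d1 = 35 */
    static const TCGArg q0_parts[] = { 34, 35, (TCGArg)-1 };
    static const TCGArg d0_over[]  = { 18, (TCGArg)-1 };
    static const TCGArg d1_over[]  = { 18, (TCGArg)-1 };

    tcg_temp_set_sub_temps(18, q0_parts);    /* d0 and d1 are parts of q0 */
    tcg_temp_set_overlap_temps(34, d0_over); /* q0 overlaps d0 but is wider */
    tcg_temp_set_overlap_temps(35, d1_over); /* likewise for d1 */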

---
 target/arm/translate.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/target/arm/translate.c b/target/arm/translate.c
index 493c627..d7578e2 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -65,6 +65,8 @@ static TCGv_i32 cpu_R[16];
 TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
 TCGv_i64 cpu_exclusive_addr;
 TCGv_i64 cpu_exclusive_val;
+static TCGv_v128 cpu_Q[16];
+static TCGv_v64 cpu_D[32];
 
 /* FIXME:  These should be removed.  */
 static TCGv_i32 cpu_F0s, cpu_F1s;
@@ -72,10 +74,20 @@ static TCGv_i64 cpu_F0d, cpu_F1d;
 
 #include "exec/gen-icount.h"
 
-static const char *regnames[] =
+static const char *regnames_r[] =
     { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
       "r8", "r9", "r10", "r11", "r12", "r13", "r14", "pc" };
 
+static const char *regnames_q[] =
+    { "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",
+      "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15" };
+
+static const char *regnames_d[] =
+    { "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+      "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15",
+      "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23",
+      "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31" };
+
 /* initialize TCG globals.  */
 void arm_translate_init(void)
 {
@@ -87,8 +99,22 @@ void arm_translate_init(void)
     for (i = 0; i < 16; i++) {
         cpu_R[i] = tcg_global_mem_new_i32(cpu_env,
                                           offsetof(CPUARMState, regs[i]),
-                                          regnames[i]);
+                                          regnames_r[i]);
+    }
+    for (i = 0; i < 16; i++) {
+        cpu_Q[i] = tcg_global_mem_new_v128(cpu_env,
+                                           offsetof(CPUARMState,
+                                                    vfp.regs[2 * i]),
+                                           regnames_q[i]);
     }
+    for (i = 0; i < 32; i++) {
+        cpu_D[i] = tcg_global_mem_new_v64(cpu_env,
+                                          offsetof(CPUARMState, vfp.regs[i]),
+                                          regnames_d[i]);
+    }
+
+    tcg_detect_overlapping_temps(&tcg_ctx);
+
     cpu_CF = tcg_global_mem_new_i32(cpu_env, offsetof(CPUARMState, CF), "CF");
     cpu_NF = tcg_global_mem_new_i32(cpu_env, offsetof(CPUARMState, NF), "NF");
     cpu_VF = tcg_global_mem_new_i32(cpu_env, offsetof(CPUARMState, VF), "VF");
-- 
2.1.4


* [Qemu-devel] [PATCH v2.1 10/21] target/arm: use vector opcode to handle vadd.<size> instruction
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (8 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 09/21] target/arm: support access to vector guest registers as globals Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-09 13:19   ` Philippe Mathieu-Daudé
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 11/21] tcg/i386: add support for vector opcodes Kirill Batuzov
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 target/arm/translate.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/target/arm/translate.c b/target/arm/translate.c
index d7578e2..90e14df 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -5628,6 +5628,37 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
             return 1;
         }
 
+        /* Use vector ops to handle what we can */
+        switch (op) {
+        case NEON_3R_VADD_VSUB:
+            if (!u) {
+                void (* const gen_add_v128[])(TCGv_v128, TCGv_v128,
+                                             TCGv_v128) = {
+                    tcg_gen_add_i8x16,
+                    tcg_gen_add_i16x8,
+                    tcg_gen_add_i32x4,
+                    tcg_gen_add_i64x2
+                };
+                void (* const gen_add_v64[])(TCGv_v64, TCGv_v64,
+                                             TCGv_v64) = {
+                    tcg_gen_add_i8x8,
+                    tcg_gen_add_i16x4,
+                    tcg_gen_add_i32x2,
+                    tcg_gen_add_i64x1
+                };
+                if (q) {
+                    gen_add_v128[size](cpu_Q[rd >> 1], cpu_Q[rn >> 1],
+                                       cpu_Q[rm >> 1]);
+                } else {
+                    gen_add_v64[size](cpu_D[rd], cpu_D[rn], cpu_D[rm]);
+                }
+                return 0;
+            }
+            break;
+        default:
+            break;
+        }
+
         for (pass = 0; pass < (q ? 4 : 2); pass++) {
 
         if (pairwise) {
-- 
2.1.4


* [Qemu-devel] [PATCH v2.1 11/21] tcg/i386: add support for vector opcodes
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (9 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 10/21] target/arm: use vector opcode to handle vadd.<size> instruction Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 12/21] tcg/i386: support 64-bit vector operations Kirill Batuzov
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

To be able to generate vector operations in a TCG backend we need to do
several things.

1. We need to tell the register allocator about the target's vector registers.
   In the case of x86 we'll use xmm0..xmm7. xmm7 is designated as a scratch
   register; the others can be used by the register allocator.

2. We need a new constraint to indicate where to use vector registers. In
   this commit the 'V' constraint is introduced (decoded in the sketch after
   this list).

3. We need to be able to generate the bare minimum: loads, stores and
   reg-to-reg moves. MOVDQU is used for loads and stores. MOVDQA is used for
   reg-to-reg moves.

4. Finally, we need to support any other opcodes we want. INDEX_op_add_i32x4
   is the only one for now; the PADDD instruction handles it perfectly.
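
For clarity: with TCG_TARGET_NB_REGS raised to 32, the xmm registers get
allocator indices 16..31, so the register set behind the new 'V' constraint
is just the byte covering indices 16..23:

    case 'V':
        ct->ct |= TCG_CT_REG;
        /* bits 16..23 of the regset == TCG_REG_XMM0 .. TCG_REG_XMM7 */
        tcg_regset_set32(ct->u.regs, 0, 0xff0000);
        break;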

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.h     |  34 +++++++++++++-
 tcg/i386/tcg-target.inc.c | 111 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 137 insertions(+), 8 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 21d96ec..b0704e8 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -29,8 +29,16 @@
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
 
 #ifdef __x86_64__
-# define TCG_TARGET_REG_BITS  64
-# define TCG_TARGET_NB_REGS   16
+# if defined(TARGET_WORDS_BIGENDIAN) == defined(HOST_WORDS_BIGENDIAN)
+#  define TCG_TARGET_HAS_REG128 1
+# endif
+# ifdef TCG_TARGET_HAS_REG128
+#  define TCG_TARGET_REG_BITS  64
+#  define TCG_TARGET_NB_REGS   32
+# else
+#  define TCG_TARGET_REG_BITS  64
+#  define TCG_TARGET_NB_REGS   16
+# endif
 #else
 # define TCG_TARGET_REG_BITS  32
 # define TCG_TARGET_NB_REGS    8
@@ -56,6 +64,24 @@ typedef enum {
     TCG_REG_R13,
     TCG_REG_R14,
     TCG_REG_R15,
+
+    TCG_REG_XMM0,
+    TCG_REG_XMM1,
+    TCG_REG_XMM2,
+    TCG_REG_XMM3,
+    TCG_REG_XMM4,
+    TCG_REG_XMM5,
+    TCG_REG_XMM6,
+    TCG_REG_XMM7,
+    TCG_REG_XMM8,
+    TCG_REG_XMM9,
+    TCG_REG_XMM10,
+    TCG_REG_XMM11,
+    TCG_REG_XMM12,
+    TCG_REG_XMM13,
+    TCG_REG_XMM14,
+    TCG_REG_XMM15,
+
     TCG_REG_RAX = TCG_REG_EAX,
     TCG_REG_RCX = TCG_REG_ECX,
     TCG_REG_RDX = TCG_REG_EDX,
@@ -144,6 +170,10 @@ extern bool have_popcnt;
 #define TCG_TARGET_HAS_mulsh_i64        0
 #endif
 
+#ifdef TCG_TARGET_HAS_REG128
+#define TCG_TARGET_HAS_add_i32x4        1
+#endif
+
 #define TCG_TARGET_deposit_i32_valid(ofs, len) \
     (((ofs) == 0 && (len) == 8) || ((ofs) == 8 && (len) == 8) || \
      ((ofs) == 0 && (len) == 16))
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 5918008..3e718f3 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -32,6 +32,11 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 #else
     "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
 #endif
+#ifdef TCG_TARGET_HAS_REG128
+    "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
+    "%xmm8", "%xmm9", "%xmm10", "%xmm11", "%xmm12", "%xmm13", "%xmm14",
+    "%xmm15",
+#endif
 };
 #endif
 
@@ -61,6 +66,24 @@ static const int tcg_target_reg_alloc_order[] = {
     TCG_REG_EDX,
     TCG_REG_EAX,
 #endif
+#ifdef TCG_TARGET_HAS_REG128
+    TCG_REG_XMM0,
+    TCG_REG_XMM1,
+    TCG_REG_XMM2,
+    TCG_REG_XMM3,
+    TCG_REG_XMM4,
+    TCG_REG_XMM5,
+    TCG_REG_XMM6,
+/*  TCG_REG_XMM7, <- scratch register */
+    TCG_REG_XMM8,
+    TCG_REG_XMM9,
+    TCG_REG_XMM10,
+    TCG_REG_XMM11,
+    TCG_REG_XMM12,
+    TCG_REG_XMM13,
+    TCG_REG_XMM14,
+    TCG_REG_XMM15,
+#endif
 };
 
 static const int tcg_target_call_iarg_regs[] = {
@@ -247,6 +270,10 @@ static const char *target_parse_constraint(TCGArgConstraint *ct,
     case 'I':
         ct->ct |= (type == TCG_TYPE_I32 ? TCG_CT_CONST : TCG_CT_CONST_I32);
         break;
+    case 'V':
+        ct->ct |= TCG_CT_REG;
+        tcg_regset_set32(ct->u.regs, 0, 0xff0000);
+        break;
 
     default:
         return NULL;
@@ -302,6 +329,9 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define P_SIMDF3        0x10000         /* 0xf3 opcode prefix */
 #define P_SIMDF2        0x20000         /* 0xf2 opcode prefix */
 
+#define P_SSE_660F      (P_DATA16 | P_EXT)
+#define P_SSE_F30F      (P_SIMDF3 | P_EXT)
+
 #define OPC_ARITH_EvIz	(0x81)
 #define OPC_ARITH_EvIb	(0x83)
 #define OPC_ARITH_GvEv	(0x03)		/* ... plus (ARITH_FOO << 3) */
@@ -357,6 +387,11 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_GRP3_Ev	(0xf7)
 #define OPC_GRP5	(0xff)
 
+#define OPC_MOVDQU_M2R  (0x6f | P_SSE_F30F)  /* load 128-bit value */
+#define OPC_MOVDQU_R2M  (0x7f | P_SSE_F30F)  /* store 128-bit value */
+#define OPC_MOVDQA_R2R  (0x6f | P_SSE_660F)  /* reg-to-reg 128-bit mov */
+#define OPC_PADDD       (0xfe | P_SSE_660F)
+
 /* Group 1 opcode extensions for 0x80-0x83.
    These are also used as modifiers for OPC_ARITH.  */
 #define ARITH_ADD 0
@@ -434,6 +469,9 @@ static void tcg_out_opc(TCGContext *s, int opc, int r, int rm, int x)
         tcg_debug_assert((opc & P_REXW) == 0);
         tcg_out8(s, 0x66);
     }
+    if (opc & P_SIMDF3) {
+        tcg_out8(s, 0xf3);
+    }
     if (opc & P_ADDR32) {
         tcg_out8(s, 0x67);
     }
@@ -650,9 +688,26 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
 static inline void tcg_out_mov(TCGContext *s, TCGType type,
                                TCGReg ret, TCGReg arg)
 {
+    int opc;
     if (arg != ret) {
-        int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-        tcg_out_modrm(s, opc, ret, arg);
+        switch (type) {
+        case TCG_TYPE_V128:
+            ret -= TCG_REG_XMM0;
+            arg -= TCG_REG_XMM0;
+            if (have_avx) {
+                tcg_out_vex_modrm(s, OPC_MOVDQA_R2R, ret, 15, arg);
+            } else {
+                tcg_out_modrm(s, OPC_MOVDQA_R2R, ret, arg);
+            }
+            break;
+        case TCG_TYPE_I32:
+        case TCG_TYPE_I64:
+            opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+            tcg_out_modrm(s, opc, ret, arg);
+            break;
+        default:
+            g_assert_not_reached();
+        }
     }
 }
 
@@ -727,15 +782,39 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
 static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
                               TCGReg arg1, intptr_t arg2)
 {
-    int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-    tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
+    int opc;
+    switch (type) {
+    case TCG_TYPE_V128:
+        ret -= TCG_REG_XMM0;
+        tcg_out_modrm_offset(s, OPC_MOVDQU_M2R, ret, arg1, arg2);
+        break;
+    case TCG_TYPE_I32:
+    case TCG_TYPE_I64:
+        opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+        tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
+        break;
+    default:
+        g_assert_not_reached();
+    }
 }
 
 static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
                               TCGReg arg1, intptr_t arg2)
 {
-    int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-    tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
+    int opc;
+    switch (type) {
+    case TCG_TYPE_V128:
+        arg -= TCG_REG_XMM0;
+        tcg_out_modrm_offset(s, OPC_MOVDQU_R2M, arg, arg1, arg2);
+        break;
+    case TCG_TYPE_I32:
+    case TCG_TYPE_I64:
+        opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+        tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
+        break;
+    default:
+        g_assert_not_reached();
+    }
 }
 
 static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
@@ -1929,6 +2008,9 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_ld_i32:
         tcg_out_ld(s, TCG_TYPE_I32, a0, a1, a2);
         break;
+    case INDEX_op_ld_v128:
+        tcg_out_ld(s, TCG_TYPE_V128, args[0], args[1], args[2]);
+        break;
 
     OP_32_64(st8):
         if (const_args[0]) {
@@ -1957,6 +2039,9 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
             tcg_out_st(s, TCG_TYPE_I32, a0, a1, a2);
         }
         break;
+    case INDEX_op_st_v128:
+        tcg_out_st(s, TCG_TYPE_V128, args[0], args[1], args[2]);
+        break;
 
     OP_32_64(add):
         /* For 3-operand addition, use LEA.  */
@@ -2263,6 +2348,11 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_mb:
         tcg_out_mb(s, a0);
         break;
+
+    case INDEX_op_add_i32x4:
+        tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
+        break;
+
     case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
     case INDEX_op_mov_i64:
     case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
@@ -2297,6 +2387,8 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
         = { .args_ct_str = { "r", "r", "L", "L" } };
     static const TCGTargetOpDef L_L_L_L
         = { .args_ct_str = { "L", "L", "L", "L" } };
+    static const TCGTargetOpDef V_r = { .args_ct_str  = { "V", "r" } };
+    static const TCGTargetOpDef V_0_V = { .args_ct_str  = { "V", "0", "V" } };
 
     switch (op) {
     case INDEX_op_ld8u_i32:
@@ -2313,6 +2405,10 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
     case INDEX_op_ld_i64:
         return &r_r;
 
+    case INDEX_op_ld_v128:
+    case INDEX_op_st_v128:
+        return &V_r;
+
     case INDEX_op_st8_i32:
     case INDEX_op_st8_i64:
         return &qi_r;
@@ -2495,6 +2591,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
             return &s2;
         }
 
+    case INDEX_op_add_i32x4:
+        return &V_0_V;
+
     default:
         break;
     }
-- 
2.1.4


* [Qemu-devel] [PATCH v2.1 12/21] tcg/i386: support 64-bit vector operations
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (10 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 11/21] tcg/i386: add support for vector opcodes Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 13/21] tcg/i386: support remaining vector addition operations Kirill Batuzov
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.h     |  1 +
 tcg/i386/tcg-target.inc.c | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index b0704e8..755ebaa 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -31,6 +31,7 @@
 #ifdef __x86_64__
 # if defined(TARGET_WORDS_BIGENDIAN) == defined(HOST_WORDS_BIGENDIAN)
 #  define TCG_TARGET_HAS_REG128 1
+#  define TCG_TARGET_HAS_REGV64 1
 # endif
 # ifdef TCG_TARGET_HAS_REG128
 #  define TCG_TARGET_REG_BITS  64
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 3e718f3..208bb81 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -390,6 +390,9 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_MOVDQU_M2R  (0x6f | P_SSE_F30F)  /* store 128-bit value */
 #define OPC_MOVDQU_R2M  (0x7f | P_SSE_F30F)  /* load 128-bit value */
 #define OPC_MOVDQA_R2R  (0x6f | P_SSE_660F)  /* reg-to-reg 128-bit mov */
+#define OPC_MOVQ_M2R    (0x7e | P_SSE_F30F)
+#define OPC_MOVQ_R2M    (0xd6 | P_SSE_660F)
+#define OPC_MOVQ_R2R    (0x7e | P_SSE_F30F)
 #define OPC_PADDD       (0xfe | P_SSE_660F)
 
 /* Group 1 opcode extensions for 0x80-0x83.
@@ -700,6 +703,15 @@ static inline void tcg_out_mov(TCGContext *s, TCGType type,
                 tcg_out_modrm(s, OPC_MOVDQA_R2R, ret, arg);
             }
             break;
+        case TCG_TYPE_V64:
+            ret -= TCG_REG_XMM0;
+            arg -= TCG_REG_XMM0;
+            if (have_avx) {
+                tcg_out_vex_modrm(s, OPC_MOVQ_R2R, ret, 15, arg);
+            } else {
+                tcg_out_modrm(s, OPC_MOVQ_R2R, ret, arg);
+            }
+            break;
         case TCG_TYPE_I32:
         case TCG_TYPE_I64:
             opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
@@ -788,6 +800,10 @@ static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
         ret -= TCG_REG_XMM0;
         tcg_out_modrm_offset(s, OPC_MOVDQU_M2R, ret, arg1, arg2);
         break;
+    case TCG_TYPE_V64:
+        ret -= TCG_REG_XMM0;
+        tcg_out_modrm_offset(s, OPC_MOVQ_M2R, ret, arg1, arg2);
+        break;
     case TCG_TYPE_I32:
     case TCG_TYPE_I64:
         opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
@@ -807,6 +823,10 @@ static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
         arg -= TCG_REG_XMM0;
         tcg_out_modrm_offset(s, OPC_MOVDQU_R2M, arg, arg1, arg2);
         break;
+    case TCG_TYPE_V64:
+        arg -= TCG_REG_XMM0;
+        tcg_out_modrm_offset(s, OPC_MOVQ_R2M, arg, arg1, arg2);
+        break;
     case TCG_TYPE_I32:
     case TCG_TYPE_I64:
         opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
@@ -2407,6 +2427,8 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
 
     case INDEX_op_ld_v128:
     case INDEX_op_st_v128:
+    case INDEX_op_ld_v64:
+    case INDEX_op_st_v64:
         return &V_r;
 
     case INDEX_op_st8_i32:
-- 
2.1.4


* [Qemu-devel] [PATCH v2.1 13/21] tcg/i386: support remaining vector addition operations
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (11 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 12/21] tcg/i386: support 64-bit vector operations Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
       [not found]   ` <2089cbe3-0e9b-fae2-0e35-224f2765dc28@amsat.org>
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 14/21] tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend Kirill Batuzov
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---

I believe the checkpatch warning here to be a false positive.

---
 tcg/i386/tcg-target.h     | 10 +++++++++
 tcg/i386/tcg-target.inc.c | 54 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 755ebaa..bd6cfe1 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -172,7 +172,17 @@ extern bool have_popcnt;
 #endif
 
 #ifdef TCG_TARGET_HAS_REG128
+#define TCG_TARGET_HAS_add_i8x16        1
+#define TCG_TARGET_HAS_add_i16x8        1
 #define TCG_TARGET_HAS_add_i32x4        1
+#define TCG_TARGET_HAS_add_i64x2        1
+#endif
+
+#ifdef TCG_TARGET_HAS_REGV64
+#define TCG_TARGET_HAS_add_i8x8         1
+#define TCG_TARGET_HAS_add_i16x4        1
+#define TCG_TARGET_HAS_add_i32x2        1
+#define TCG_TARGET_HAS_add_i64x1        1
 #endif
 
 #define TCG_TARGET_deposit_i32_valid(ofs, len) \
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 208bb81..d8f0d81 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -168,6 +168,11 @@ static bool have_lzcnt;
 #else
 # define have_lzcnt 0
 #endif
+#if defined(CONFIG_CPUID_H) && defined(bit_AVX) && defined(bit_OSXSAVE)
+static bool have_avx;
+#else
+# define have_avx 0
+#endif
 
 static tcg_insn_unit *tb_ret_addr;
 
@@ -393,7 +398,10 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_MOVQ_M2R    (0x7e | P_SSE_F30F)
 #define OPC_MOVQ_R2M    (0xd6 | P_SSE_660F)
 #define OPC_MOVQ_R2R    (0x7e | P_SSE_F30F)
+#define OPC_PADDB       (0xfc | P_SSE_660F)
+#define OPC_PADDW       (0xfd | P_SSE_660F)
 #define OPC_PADDD       (0xfe | P_SSE_660F)
+#define OPC_PADDQ       (0xd4 | P_SSE_660F)
 
 /* Group 1 opcode extensions for 0x80-0x83.
    These are also used as modifiers for OPC_ARITH.  */
@@ -1963,6 +1971,19 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     TCGArg a0, a1, a2;
     int c, const_a2, vexop, rexw = 0;
 
+    static const int vect_binop[] = {
+        [INDEX_op_add_i8x16] = OPC_PADDB,
+        [INDEX_op_add_i16x8] = OPC_PADDW,
+        [INDEX_op_add_i32x4] = OPC_PADDD,
+        [INDEX_op_add_i64x2] = OPC_PADDQ,
+
+        [INDEX_op_add_i8x8]  = OPC_PADDB,
+        [INDEX_op_add_i16x4] = OPC_PADDW,
+        [INDEX_op_add_i32x2] = OPC_PADDD,
+        [INDEX_op_add_i64x1] = OPC_PADDQ,
+    };
+
+
 #if TCG_TARGET_REG_BITS == 64
 # define OP_32_64(x) \
         case glue(glue(INDEX_op_, x), _i64): \
@@ -1972,6 +1993,17 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
 # define OP_32_64(x) \
         case glue(glue(INDEX_op_, x), _i32)
 #endif
+#define OP_V128_ALL(x) \
+        case glue(glue(INDEX_op_, x), _i8x16): \
+        case glue(glue(INDEX_op_, x), _i16x8): \
+        case glue(glue(INDEX_op_, x), _i32x4): \
+        case glue(glue(INDEX_op_, x), _i64x2)
+
+#define OP_V64_ALL(x) \
+        case glue(glue(INDEX_op_, x), _i8x8):  \
+        case glue(glue(INDEX_op_, x), _i16x4): \
+        case glue(glue(INDEX_op_, x), _i32x2): \
+        case glue(glue(INDEX_op_, x), _i64x1)
 
     /* Hoist the loads of the most common arguments.  */
     a0 = args[0];
@@ -2369,8 +2401,13 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         tcg_out_mb(s, a0);
         break;
 
-    case INDEX_op_add_i32x4:
-        tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
+    OP_V128_ALL(add):
+    OP_V64_ALL(add):
+        if (have_avx) {
+            tcg_out_vex_modrm(s, vect_binop[opc], args[0], args[1], args[2]);
+        } else {
+            tcg_out_modrm(s, vect_binop[opc], args[0], args[2]);
+        }
         break;
 
     case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
@@ -2383,6 +2420,8 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     }
 
 #undef OP_32_64
+#undef OP_V128_ALL
+#undef OP_V64_ALL
 }
 
 static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
@@ -2613,7 +2652,14 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
             return &s2;
         }
 
+    case INDEX_op_add_i8x16:
+    case INDEX_op_add_i16x8:
     case INDEX_op_add_i32x4:
+    case INDEX_op_add_i64x2:
+    case INDEX_op_add_i8x8:
+    case INDEX_op_add_i16x4:
+    case INDEX_op_add_i32x2:
+    case INDEX_op_add_i64x1:
         return &V_0_V;
 
     default:
@@ -2728,6 +2774,10 @@ static void tcg_target_init(TCGContext *s)
 #ifdef bit_POPCNT
         have_popcnt = (c & bit_POPCNT) != 0;
 #endif
+#if defined(bit_AVX) && defined(bit_OSXSAVE)
+        have_avx = (c & (bit_AVX | bit_OSXSAVE)) == (bit_AVX | bit_OSXSAVE);
+#endif
+
     }
 
     if (max >= 7) {
-- 
2.1.4


* [Qemu-devel] [PATCH v2.1 14/21] tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (12 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 13/21] tcg/i386: support remaining vector addition operations Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-05-05 13:59   ` Alex Bennée
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 15/21] target/aarch64: do not check for non-existent TCGMemOp Kirill Batuzov
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/aarch64/tcg-target.inc.c |  4 ++--
 tcg/arm/tcg-target.inc.c     |  4 ++--
 tcg/i386/tcg-target.inc.c    |  4 ++--
 tcg/mips/tcg-target.inc.c    |  4 ++--
 tcg/ppc/tcg-target.inc.c     |  4 ++--
 tcg/s390/tcg-target.inc.c    |  4 ++--
 tcg/sparc/tcg-target.inc.c   | 12 ++++++------
 tcg/tcg-op.c                 |  4 ++--
 tcg/tcg.h                    |  1 +
 9 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 6d227a5..2b0b548 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -1032,7 +1032,7 @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  *                                     TCGMemOpIdx oi, uintptr_t ra)
  */
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
     [MO_LEUL] = helper_le_ldul_mmu,
@@ -1046,7 +1046,7 @@ static void * const qemu_ld_helpers[16] = {
  *                                     uintxx_t val, TCGMemOpIdx oi,
  *                                     uintptr_t ra)
  */
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/arm/tcg-target.inc.c b/tcg/arm/tcg-target.inc.c
index e75a6d4..f603f02 100644
--- a/tcg/arm/tcg-target.inc.c
+++ b/tcg/arm/tcg-target.inc.c
@@ -1058,7 +1058,7 @@ static inline void tcg_out_mb(TCGContext *s, TCGArg a0)
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  *                                     int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_SB]   = helper_ret_ldsb_mmu,
 
@@ -1078,7 +1078,7 @@ static void * const qemu_ld_helpers[16] = {
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
  *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index d8f0d81..263c15e 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1334,7 +1334,7 @@ static void tcg_out_nopn(TCGContext *s, int n)
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  *                                     int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
     [MO_LEUL] = helper_le_ldul_mmu,
@@ -1347,7 +1347,7 @@ static void * const qemu_ld_helpers[16] = {
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
  *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/mips/tcg-target.inc.c b/tcg/mips/tcg-target.inc.c
index 01ac7b2..4f2d5d1 100644
--- a/tcg/mips/tcg-target.inc.c
+++ b/tcg/mips/tcg-target.inc.c
@@ -1108,7 +1108,7 @@ static void tcg_out_call(TCGContext *s, tcg_insn_unit *arg)
 }
 
 #if defined(CONFIG_SOFTMMU)
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_SB]   = helper_ret_ldsb_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
@@ -1125,7 +1125,7 @@ static void * const qemu_ld_helpers[16] = {
 #endif
 };
 
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
index 64f67d2..680050b 100644
--- a/tcg/ppc/tcg-target.inc.c
+++ b/tcg/ppc/tcg-target.inc.c
@@ -1419,7 +1419,7 @@ static const uint32_t qemu_exts_opc[4] = {
 /* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
  *                                 int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
     [MO_LEUL] = helper_le_ldul_mmu,
@@ -1432,7 +1432,7 @@ static void * const qemu_ld_helpers[16] = {
 /* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
  *                                 uintxx_t val, int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/s390/tcg-target.inc.c b/tcg/s390/tcg-target.inc.c
index a679280..ec3491a 100644
--- a/tcg/s390/tcg-target.inc.c
+++ b/tcg/s390/tcg-target.inc.c
@@ -309,7 +309,7 @@ static const uint8_t tcg_cond_to_ltr_cond[] = {
 };
 
 #ifdef CONFIG_SOFTMMU
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_SB]   = helper_ret_ldsb_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
@@ -324,7 +324,7 @@ static void * const qemu_ld_helpers[16] = {
     [MO_BEQ]  = helper_be_ldq_mmu,
 };
 
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/sparc/tcg-target.inc.c b/tcg/sparc/tcg-target.inc.c
index d1f4c0d..1b115d2 100644
--- a/tcg/sparc/tcg-target.inc.c
+++ b/tcg/sparc/tcg-target.inc.c
@@ -840,12 +840,12 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
 }
 
 #ifdef CONFIG_SOFTMMU
-static tcg_insn_unit *qemu_ld_trampoline[16];
-static tcg_insn_unit *qemu_st_trampoline[16];
+static tcg_insn_unit *qemu_ld_trampoline[MO_ALL];
+static tcg_insn_unit *qemu_st_trampoline[MO_ALL];
 
 static void build_trampolines(TCGContext *s)
 {
-    static void * const qemu_ld_helpers[16] = {
+    static void * const qemu_ld_helpers[MO_ALL] = {
         [MO_UB]   = helper_ret_ldub_mmu,
         [MO_SB]   = helper_ret_ldsb_mmu,
         [MO_LEUW] = helper_le_lduw_mmu,
@@ -857,7 +857,7 @@ static void build_trampolines(TCGContext *s)
         [MO_BEUL] = helper_be_ldul_mmu,
         [MO_BEQ]  = helper_be_ldq_mmu,
     };
-    static void * const qemu_st_helpers[16] = {
+    static void * const qemu_st_helpers[MO_ALL] = {
         [MO_UB]   = helper_ret_stb_mmu,
         [MO_LEUW] = helper_le_stw_mmu,
         [MO_LEUL] = helper_le_stl_mmu,
@@ -870,7 +870,7 @@ static void build_trampolines(TCGContext *s)
     int i;
     TCGReg ra;
 
-    for (i = 0; i < 16; ++i) {
+    for (i = 0; i < MO_ALL; ++i) {
         if (qemu_ld_helpers[i] == NULL) {
             continue;
         }
@@ -898,7 +898,7 @@ static void build_trampolines(TCGContext *s)
         tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O7, ra);
     }
 
-    for (i = 0; i < 16; ++i) {
+    for (i = 0; i < MO_ALL; ++i) {
         if (qemu_st_helpers[i] == NULL) {
             continue;
         }
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 8a19eee..0dfe611 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2767,7 +2767,7 @@ typedef void (*gen_atomic_op_i64)(TCGv_i64, TCGv_env, TCGv, TCGv_i64);
 # define WITH_ATOMIC64(X)
 #endif
 
-static void * const table_cmpxchg[16] = {
+static void * const table_cmpxchg[] = {
     [MO_8] = gen_helper_atomic_cmpxchgb,
     [MO_16 | MO_LE] = gen_helper_atomic_cmpxchgw_le,
     [MO_16 | MO_BE] = gen_helper_atomic_cmpxchgw_be,
@@ -2985,7 +2985,7 @@ static void do_atomic_op_i64(TCGv_i64 ret, TCGv addr, TCGv_i64 val,
 }
 
 #define GEN_ATOMIC_HELPER(NAME, OP, NEW)                                \
-static void * const table_##NAME[16] = {                                \
+static void * const table_##NAME[] = {                                  \
     [MO_8] = gen_helper_atomic_##NAME##b,                               \
     [MO_16 | MO_LE] = gen_helper_atomic_##NAME##w_le,                   \
     [MO_16 | MO_BE] = gen_helper_atomic_##NAME##w_be,                   \
diff --git a/tcg/tcg.h b/tcg/tcg.h
index fd43f15..5e0c6da 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -386,6 +386,7 @@ typedef enum TCGMemOp {
     MO_TEQ   = MO_TE | MO_Q,
 
     MO_SSIZE = MO_SIZE | MO_SIGN,
+    MO_ALL   = MO_SIZE | MO_SIGN | MO_BSWAP | MO_AMASK,
 } TCGMemOp;
 
 /**
-- 
2.1.4


* [Qemu-devel] [PATCH v2.1 15/21] target/aarch64: do not check for non-existent TCGMemOp
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (13 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 14/21] tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 16/21] tcg: introduce new TCGMemOp - MO_128 Kirill Batuzov
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

MO_64|MO_SIGN is not a valid TCGMemOp. This code compiles only because, by
coincidence, this value equals the MO_SSIZE mask defined in the same enum.
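
Concretely, with the values currently in tcg/tcg.h:

    /* MO_64 == 3, MO_SIZE == 3, MO_SIGN == 4, therefore: */
    MO_64 | MO_SIGN   /* == 7 == MO_SSIZE, the size+sign mask */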

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---

A bugfix only indirectly related to this series; other changes in the series
exposed the problem.

---
 target/arm/translate-a64.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index d0352e2..8a1f70e 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -990,7 +990,6 @@ static void read_vec_element(DisasContext *s, TCGv_i64 tcg_dest, int srcidx,
         tcg_gen_ld32s_i64(tcg_dest, cpu_env, vect_off);
         break;
     case MO_64:
-    case MO_64|MO_SIGN:
         tcg_gen_ld_i64(tcg_dest, cpu_env, vect_off);
         break;
     default:
-- 
2.1.4


* [Qemu-devel] [PATCH v2.1 16/21] tcg: introduce new TCGMemOp - MO_128
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (14 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 15/21] target/aarch64: do not check for non-existent TCGMemOp Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 17/21] tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes Kirill Batuzov
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 5e0c6da..63a83f9 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -306,11 +306,12 @@ typedef enum TCGMemOp {
     MO_16    = 1,
     MO_32    = 2,
     MO_64    = 3,
-    MO_SIZE  = 3,   /* Mask for the above.  */
+    MO_128   = 4,
+    MO_SIZE  = 7,   /* Mask for the above.  */
 
-    MO_SIGN  = 4,   /* Sign-extended, otherwise zero-extended.  */
+    MO_SIGN  = 8,   /* Sign-extended, otherwise zero-extended.  */
 
-    MO_BSWAP = 8,   /* Host reverse endian.  */
+    MO_BSWAP = 16,  /* Host reverse endian.  */
 #ifdef HOST_WORDS_BIGENDIAN
     MO_LE    = MO_BSWAP,
     MO_BE    = 0,
@@ -342,7 +343,7 @@ typedef enum TCGMemOp {
      * - an alignment to a specified size, which may be more or less than
      *   the access size (MO_ALIGN_x where 'x' is a size in bytes);
      */
-    MO_ASHIFT = 4,
+    MO_ASHIFT = 5,
     MO_AMASK = 7 << MO_ASHIFT,
 #ifdef ALIGNED_ONLY
     MO_ALIGN = 0,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [Qemu-devel] [PATCH v2.1 17/21] tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (15 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 16/21] tcg: introduce new TCGMemOp - MO_128 Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 18/21] softmmu: create helpers for vector loads Kirill Batuzov
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.inc.c |  5 +++++
 tcg/tcg-op.c              | 24 ++++++++++++++++++++++++
 tcg/tcg-op.h              | 15 +++++++++++++++
 tcg/tcg-opc.h             |  4 ++++
 4 files changed, 48 insertions(+)

diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 263c15e..1e6edc0 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -2448,6 +2448,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
         = { .args_ct_str = { "L", "L", "L", "L" } };
     static const TCGTargetOpDef V_r = { .args_ct_str  = { "V", "r" } };
     static const TCGTargetOpDef V_0_V = { .args_ct_str  = { "V", "0", "V" } };
+    static const TCGTargetOpDef V_L = { .args_ct_str  = { "V", "L" } };
 
     switch (op) {
     case INDEX_op_ld8u_i32:
@@ -2662,6 +2663,10 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
     case INDEX_op_add_i64x1:
         return &V_0_V;
 
+    case INDEX_op_qemu_ld_v128:
+    case INDEX_op_qemu_st_v128:
+        return &V_L;
+
     default:
         break;
     }
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 0dfe611..db74017 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -3102,3 +3102,27 @@ void tcg_v64_to_ptr(TCGv_v64 tmp, TCGv_ptr base, int slot,
         }
     }
 }
+
+void tcg_gen_qemu_ld_v128(TCGv_v128 val, TCGv addr, TCGArg idx,
+                          TCGMemOp memop)
+{
+#ifdef TCG_TARGET_HAS_REG128
+    tcg_debug_assert((memop & MO_BSWAP) == MO_TE);
+    TCGMemOpIdx oi = make_memop_idx(memop, idx);
+    tcg_gen_op3si_v128(INDEX_op_qemu_ld_v128, val, addr, oi);
+#else
+    g_assert_not_reached();
+#endif
+}
+
+void tcg_gen_qemu_st_v128(TCGv_v128 val, TCGv addr, TCGArg idx,
+                          TCGMemOp memop)
+{
+#ifdef TCG_TARGET_HAS_REG128
+    tcg_debug_assert((memop & MO_BSWAP) == MO_TE);
+    TCGMemOpIdx oi = make_memop_idx(memop, idx);
+    tcg_gen_op3si_v128(INDEX_op_qemu_st_v128, val, addr, oi);
+#else
+    g_assert_not_reached();
+#endif
+}
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 3727be7..dc1d032 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -266,6 +266,19 @@ static inline void tcg_gen_op3_v128(TCGOpcode opc, TCGv_v128 a1,
                 GET_TCGV_V128(a3));
 }
 
+static inline void tcg_gen_op3si_v128(TCGOpcode opc, TCGv_v128 a1,
+                                      TCGv a2, TCGArg a3)
+{
+#if TARGET_LONG_BITS == 64 && TCG_TARGET_REG_BITS == 32
+    tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_I32(TCGV_LOW(a2)),
+                GET_TCGV_I32(TCGV_HIGH(a2)), a3);
+#elif TARGET_LONG_BITS == 32
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_I32(a2), a3);
+#else
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_I64(a2), a3);
+#endif
+}
+
 static inline void tcg_gen_op1_v64(TCGOpcode opc, TCGv_v64 a1)
 {
     tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_V64(a1));
@@ -909,6 +922,8 @@ void tcg_gen_qemu_ld_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_st_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_ld_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_st_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
+void tcg_gen_qemu_ld_v128(TCGv_v128, TCGv, TCGArg, TCGMemOp);
+void tcg_gen_qemu_st_v128(TCGv_v128, TCGv, TCGArg, TCGMemOp);
 
 static inline void tcg_gen_qemu_ld8u(TCGv ret, TCGv addr, int mem_index)
 {
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 4c8f195..6c2e697 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -232,6 +232,10 @@ DEF(qemu_ld_i64, DATA64_ARGS, TLADDR_ARGS, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
 DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
+DEF(qemu_ld_v128, 1, 1, 1,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | IMPL128)
+DEF(qemu_st_v128, 0, 2, 1,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | IMPL128)
 
 #undef TLADDR_ARGS
 #undef DATA64_ARGS
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [Qemu-devel] [PATCH v2.1 18/21] softmmu: create helpers for vector loads
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (16 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 17/21] tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 19/21] tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops Kirill Batuzov
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 cputlb.c                  |   4 +
 softmmu_template_vector.h | 266 ++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h                 |   5 +
 3 files changed, 275 insertions(+)
 create mode 100644 softmmu_template_vector.h

diff --git a/cputlb.c b/cputlb.c
index 6c39927..41c9a01 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -660,6 +660,10 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
 #define DATA_SIZE 8
 #include "softmmu_template.h"
 
+#define SHIFT 4
+#include "softmmu_template_vector.h"
+#undef MMUSUFFIX
+
 /* First set of helpers allows passing in of OI and RETADDR.  This makes
    them callable from other helpers.  */
 
diff --git a/softmmu_template_vector.h b/softmmu_template_vector.h
new file mode 100644
index 0000000..b286d65
--- /dev/null
+++ b/softmmu_template_vector.h
@@ -0,0 +1,266 @@
+/*
+ *  Software MMU support
+ *
+ * Generate helpers used by TCG for qemu_ld/st vector ops and code
+ * load functions.
+ *
+ * Included from target op helpers and exec.c.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "qemu/timer.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+
+#define DATA_SIZE (1 << SHIFT)
+
+#if DATA_SIZE == 16
+#define SUFFIX v128
+#else
+#error unsupported data size
+#endif
+
+
+#ifdef SOFTMMU_CODE_ACCESS
+#define READ_ACCESS_TYPE MMU_INST_FETCH
+#define ADDR_READ addr_code
+#else
+#define READ_ACCESS_TYPE MMU_DATA_LOAD
+#define ADDR_READ addr_read
+#endif
+
+#define helper_te_ld_name  glue(glue(helper_te_ld, SUFFIX), MMUSUFFIX)
+#define helper_te_st_name  glue(glue(helper_te_st, SUFFIX), MMUSUFFIX)
+
+#ifndef SOFTMMU_CODE_ACCESS
+static inline void glue(io_read, SUFFIX)(CPUArchState *env,
+                                         CPUIOTLBEntry *iotlbentry,
+                                         target_ulong addr,
+                                         uintptr_t retaddr,
+                                         uint8_t *res)
+{
+    CPUState *cpu = ENV_GET_CPU(env);
+    hwaddr physaddr = iotlbentry->addr;
+    MemoryRegion *mr = iotlb_to_region(cpu, physaddr, iotlbentry->attrs);
+    int i;
+
+    assert(0); /* Needs testing */
+
+    physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
+    cpu->mem_io_pc = retaddr;
+    if (mr != &io_mem_rom && mr != &io_mem_notdirty && !cpu->can_do_io) {
+        cpu_io_recompile(cpu, retaddr);
+    }
+
+    cpu->mem_io_vaddr = addr;
+    for (i = 0; i < (1 << SHIFT); i += 8) {
+        memory_region_dispatch_read(mr, physaddr + i, (uint64_t *)(res + i),
+                                    8, iotlbentry->attrs);
+    }
+}
+#endif
+
+void helper_te_ld_name(CPUArchState *env, target_ulong addr,
+                       TCGMemOpIdx oi, uintptr_t retaddr, uint8_t *res)
+{
+    unsigned mmu_idx = get_mmuidx(oi);
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
+    uintptr_t haddr;
+    int i;
+
+    /* Adjust the given return address.  */
+    retaddr -= GETPC_ADJ;
+
+    /* If the TLB entry is for a different page, reload and try again.  */
+    if ((addr & TARGET_PAGE_MASK)
+         != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
+        if ((addr & (DATA_SIZE - 1)) != 0
+            && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
+                                 mmu_idx, retaddr);
+        }
+        if (!VICTIM_TLB_HIT(ADDR_READ, addr)) {
+            tlb_fill(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
+                     mmu_idx, retaddr);
+        }
+        tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
+    }
+
+    /* Handle an IO access.  */
+    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
+        CPUIOTLBEntry *iotlbentry;
+        if ((addr & (DATA_SIZE - 1)) != 0) {
+            goto do_unaligned_access;
+        }
+        iotlbentry = &env->iotlb[mmu_idx][index];
+
+        /* ??? Note that the io helpers always read data in the target
+           byte ordering.  We should push the LE/BE request down into io.  */
+        glue(io_read, SUFFIX)(env, iotlbentry, addr, retaddr, res);
+        return;
+    }
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                    >= TARGET_PAGE_SIZE)) {
+        target_ulong addr1, addr2;
+        uint8_t res1[DATA_SIZE * 2];
+        unsigned shift;
+    do_unaligned_access:
+        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
+                                 mmu_idx, retaddr);
+        }
+        addr1 = addr & ~(DATA_SIZE - 1);
+        addr2 = addr1 + DATA_SIZE;
+        /* Note the adjustment at the beginning of the function.
+           Undo that for the recursion.  */
+        helper_te_ld_name(env, addr1, oi, retaddr + GETPC_ADJ, res1);
+        helper_te_ld_name(env, addr2, oi, retaddr + GETPC_ADJ,
+                          res1 + DATA_SIZE);
+        shift = addr & (DATA_SIZE - 1);
+
+        for (i = 0; i < DATA_SIZE; i++) {
+            res[i] = res1[i + shift];
+        }
+        return;
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
+    if ((addr & (DATA_SIZE - 1)) != 0
+        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
+                             mmu_idx, retaddr);
+    }
+
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+    for (i = 0; i < DATA_SIZE; i++) {
+        res[i] = ((uint8_t *)haddr)[i];
+    }
+}
+
+#ifndef SOFTMMU_CODE_ACCESS
+
+static inline void glue(io_write, SUFFIX)(CPUArchState *env,
+                                          CPUIOTLBEntry *iotlbentry,
+                                          uint8_t *val,
+                                          target_ulong addr,
+                                          uintptr_t retaddr)
+{
+    CPUState *cpu = ENV_GET_CPU(env);
+    hwaddr physaddr = iotlbentry->addr;
+    MemoryRegion *mr = iotlb_to_region(cpu, physaddr, iotlbentry->attrs);
+    int i;
+
+    assert(0); /* Needs testing */
+
+    physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
+    if (mr != &io_mem_rom && mr != &io_mem_notdirty && !cpu->can_do_io) {
+        cpu_io_recompile(cpu, retaddr);
+    }
+
+    cpu->mem_io_vaddr = addr;
+    cpu->mem_io_pc = retaddr;
+    for (i = 0; i < (1 << SHIFT); i += 8) {
+        memory_region_dispatch_write(mr, physaddr + i, *(uint64_t *)(val + i),
+                                     8, iotlbentry->attrs);
+    }
+}
+
+void helper_te_st_name(CPUArchState *env, target_ulong addr, uint8_t *val,
+                       TCGMemOpIdx oi, uintptr_t retaddr)
+{
+    unsigned mmu_idx = get_mmuidx(oi);
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    uintptr_t haddr;
+    int i;
+
+    /* Adjust the given return address.  */
+    retaddr -= GETPC_ADJ;
+
+    /* If the TLB entry is for a different page, reload and try again.  */
+    if ((addr & TARGET_PAGE_MASK)
+        != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
+        if ((addr & (DATA_SIZE - 1)) != 0
+            && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                                 mmu_idx, retaddr);
+        }
+        if (!VICTIM_TLB_HIT(addr_write, addr)) {
+            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
+        }
+        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    }
+
+    /* Handle an IO access.  */
+    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
+        CPUIOTLBEntry *iotlbentry;
+        if ((addr & (DATA_SIZE - 1)) != 0) {
+            goto do_unaligned_access;
+        }
+        iotlbentry = &env->iotlb[mmu_idx][index];
+
+        /* ??? Note that the io helpers always write data in the target
+           byte ordering.  We should push the LE/BE request down into io.  */
+        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+        return;
+    }
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                     >= TARGET_PAGE_SIZE)) {
+        int i;
+    do_unaligned_access:
+        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                                 mmu_idx, retaddr);
+        }
+        /* XXX: not efficient, but simple */
+        /* Note: relies on the fact that tlb_fill() does not remove the
+         * previous page from the TLB cache.  */
+        for (i = DATA_SIZE - 1; i >= 0; i--) {
+            /* Note the adjustment at the beginning of the function.
+               Undo that for the recursion.  */
+            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val[i],
+                                            oi, retaddr + GETPC_ADJ);
+        }
+        return;
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
+    if ((addr & (DATA_SIZE - 1)) != 0
+        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                             mmu_idx, retaddr);
+    }
+
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+    for (i = 0; i < DATA_SIZE; i++) {
+        ((uint8_t *)haddr)[i] = val[i];
+    }
+}
+
+#endif /* !defined(SOFTMMU_CODE_ACCESS) */
+
+#undef READ_ACCESS_TYPE
+#undef SHIFT
+#undef SUFFIX
+#undef DATA_SIZE
+#undef ADDR_READ
+#undef helper_te_ld_name
+#undef helper_te_st_name
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 63a83f9..8dee5c2 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -1330,6 +1330,11 @@ uint32_t helper_be_ldl_cmmu(CPUArchState *env, target_ulong addr,
 uint64_t helper_be_ldq_cmmu(CPUArchState *env, target_ulong addr,
                             TCGMemOpIdx oi, uintptr_t retaddr);
 
+void helper_te_ldv128_mmu(CPUArchState *env, target_ulong addr,
+                          TCGMemOpIdx oi, uintptr_t retaddr, uint8_t *res);
+void helper_te_stv128_mmu(CPUArchState *env, target_ulong addr, uint8_t *val,
+                          TCGMemOpIdx oi, uintptr_t retaddr);
+
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
 # define helper_ret_ldsw_mmu  helper_be_ldsw_mmu
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [Qemu-devel] [PATCH v2.1 19/21] tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (17 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 18/21] softmmu: create helpers for vector loads Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 20/21] target/arm: load two consecutive 64-bit vector regs as a 128-bit vector reg Kirill Batuzov
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.inc.c | 68 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 1e6edc0..4647e97 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1342,6 +1342,7 @@ static void * const qemu_ld_helpers[] = {
     [MO_BEUW] = helper_be_lduw_mmu,
     [MO_BEUL] = helper_be_ldul_mmu,
     [MO_BEQ]  = helper_be_ldq_mmu,
+    [MO_128]  = helper_te_ldv128_mmu,
 };
 
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
@@ -1355,6 +1356,7 @@ static void * const qemu_st_helpers[] = {
     [MO_BEUW] = helper_be_stw_mmu,
     [MO_BEUL] = helper_be_stl_mmu,
     [MO_BEQ]  = helper_be_stq_mmu,
+    [MO_128]  = helper_te_stv128_mmu,
 };
 
 /* Perform the TLB load and compare.
@@ -1521,12 +1523,30 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
         ofs += 4;
 
         tcg_out_sti(s, TCG_TYPE_PTR, (uintptr_t)l->raddr, TCG_REG_ESP, ofs);
+
+        if ((opc & MO_SSIZE) == MO_128) {
+            ofs += 4;
+            tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_EAX, TCG_REG_ESP);
+            tcg_out_addi(s, TCG_REG_EAX, TCG_STATIC_CALL_ARGS_SIZE - 16);
+            tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_EAX, TCG_REG_ESP, ofs);
+        }
     } else {
         tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
         /* The second argument is already loaded with addrlo.  */
         tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], oi);
         tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
                      (uintptr_t)l->raddr);
+        if ((opc & MO_SSIZE) == MO_128) {
+            tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_EAX, TCG_REG_ESP);
+            tcg_out_addi(s, TCG_REG_EAX, TCG_STATIC_CALL_ARGS_SIZE - 16);
+            if (ARRAY_SIZE(tcg_target_call_iarg_regs) > 4) {
+                tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[4],
+                            TCG_REG_EAX);
+            } else {
+                tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_EAX,
+                            TCG_REG_ESP, TCG_TARGET_CALL_STACK_OFFSET);
+            }
+        }
     }
 
     tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
@@ -1562,6 +1582,11 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
             tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EDX);
         }
         break;
+    case MO_128:
+        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_EAX, TCG_REG_ESP);
+        tcg_out_addi(s, TCG_REG_EAX, TCG_STATIC_CALL_ARGS_SIZE - 16);
+        tcg_out_ld(s, TCG_TYPE_V128, l->datalo_reg, TCG_REG_EAX, 0);
+        break;
     default:
         tcg_abort();
     }
@@ -1601,12 +1626,20 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
             ofs += 4;
         }
 
-        tcg_out_st(s, TCG_TYPE_I32, l->datalo_reg, TCG_REG_ESP, ofs);
-        ofs += 4;
-
-        if (s_bits == MO_64) {
-            tcg_out_st(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_ESP, ofs);
+        if (s_bits == MO_128) {
+            tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_EAX, TCG_REG_ESP);
+            tcg_out_addi(s, TCG_REG_EAX, TCG_STATIC_CALL_ARGS_SIZE - 16);
+            tcg_out_st(s, TCG_TYPE_V128, l->datalo_reg, TCG_REG_EAX, 0);
+            tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_EAX, TCG_REG_ESP, ofs);
             ofs += 4;
+        } else {
+            tcg_out_st(s, TCG_TYPE_I32, l->datalo_reg, TCG_REG_ESP, ofs);
+            ofs += 4;
+
+            if (s_bits == MO_64) {
+                tcg_out_st(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_ESP, ofs);
+                ofs += 4;
+            }
         }
 
         tcg_out_sti(s, TCG_TYPE_I32, oi, TCG_REG_ESP, ofs);
@@ -1618,8 +1651,16 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
     } else {
         tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
         /* The second argument is already loaded with addrlo.  */
-        tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
-                    tcg_target_call_iarg_regs[2], l->datalo_reg);
+        if (s_bits == MO_128) {
+            tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_RAX, TCG_REG_ESP);
+            tcg_out_addi(s, TCG_REG_RAX, TCG_STATIC_CALL_ARGS_SIZE - 16);
+            tcg_out_st(s, TCG_TYPE_V128, l->datalo_reg, TCG_REG_RAX, 0);
+            tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[2],
+                        TCG_REG_RAX);
+        } else {
+            tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
+                        tcg_target_call_iarg_regs[2], l->datalo_reg);
+        }
         tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[3], oi);
 
         if (ARRAY_SIZE(tcg_target_call_iarg_regs) > 4) {
@@ -1751,6 +1792,10 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
             }
         }
         break;
+    case MO_128:
+        tcg_out_modrm_sib_offset(s, OPC_MOVDQU_M2R + seg, datalo,
+                                 base, index, 0, ofs);
+        break;
     default:
         tcg_abort();
     }
@@ -1894,6 +1939,9 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
             tcg_out_modrm_offset(s, movop + seg, datahi, base, ofs+4);
         }
         break;
+    case MO_128:
+        tcg_out_modrm_offset(s, OPC_MOVDQU_R2M + seg, datalo, base, ofs);
+        break;
     default:
         tcg_abort();
     }
@@ -2264,12 +2312,18 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_qemu_ld_i64:
         tcg_out_qemu_ld(s, args, 1);
         break;
+    case INDEX_op_qemu_ld_v128:
+        tcg_out_qemu_ld(s, args, 0);
+        break;
     case INDEX_op_qemu_st_i32:
         tcg_out_qemu_st(s, args, 0);
         break;
     case INDEX_op_qemu_st_i64:
         tcg_out_qemu_st(s, args, 1);
         break;
+    case INDEX_op_qemu_st_v128:
+        tcg_out_qemu_st(s, args, 0);
+        break;
 
     OP_32_64(mulu2):
         tcg_out_modrm(s, OPC_GRP3_Ev + rexw, EXT3_MUL, args[3]);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [Qemu-devel] [PATCH v2.1 20/21] target/arm: load two consecutive 64-bit vector regs as a 128-bit vector reg
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (18 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 19/21] tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 21/21] tcg/README: update README to include information about vector opcodes Kirill Batuzov
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

The ARM instruction set does not have loads to a 128-bit vector register
(q-reg). Instead it can read several consecutive 64-bit vector registers
(d-regs), which is what GCC uses to load 128-bit values from memory.

For vector operations to work we need to detect such loads and transform them
into 128-bit loads to 128-bit temporaries.
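
As an illustration (the mnemonic is only an example; the new code keys on
rd % 2 == 0 && nregs == 2 rather than on a particular instruction):

    /* A guest NEON load that fills one q-register via two d-registers: */
    /*     vld1.64 {d8, d9}, [r0]                                       */
    /* Previously: two 64-bit loads into cpu_D[8] and cpu_D[9].         */
    /* Now, when rd is even and nregs == 2, a single 128-bit load:      */
    /*     tcg_gen_qemu_ld_v128(cpu_Q[rd / 2], aa32addr, ...);          */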

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 target/arm/translate.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/target/arm/translate.c b/target/arm/translate.c
index 90e14df..5bd0b1c 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -4710,6 +4710,21 @@ static int disas_neon_ls_insn(DisasContext *s, uint32_t insn)
                 tcg_gen_addi_i32(addr, addr, 1 << size);
             }
             if (size == 3) {
+#ifdef TCG_TARGET_HAS_REG128
+                if (rd % 2 == 0 && nregs == 2) {
+                    TCGv aa32addr = gen_aa32_addr(s, addr, MO_TE | MO_128);
+                    /* 128-bit load */
+                    if (load) {
+                        tcg_gen_qemu_ld_v128(cpu_Q[rd / 2], aa32addr,
+                                             get_mem_index(s), MO_TE | MO_128);
+                    } else {
+                        tcg_gen_qemu_st_v128(cpu_Q[rd / 2], aa32addr,
+                                             get_mem_index(s), MO_TE | MO_128);
+                    }
+                    tcg_temp_free(aa32addr);
+                    break;
+                }
+#endif
                 tmp64 = tcg_temp_new_i64();
                 if (load) {
                     gen_aa32_ld64(s, tmp64, addr, get_mem_index(s));
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [Qemu-devel] [PATCH v2.1 21/21] tcg/README: update README to include information about vector opcodes
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (19 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 20/21] target/arm: load two consecutive 64-bit vector regs as a 128-bit vector reg Kirill Batuzov
@ 2017-02-02 14:34 ` Kirill Batuzov
  2017-02-02 15:25 ` [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations no-reply
  2017-02-21 12:19 ` Kirill Batuzov
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-02 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée,
	Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/README | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/tcg/README b/tcg/README
index a9858c2..209dbc4 100644
--- a/tcg/README
+++ b/tcg/README
@@ -53,9 +53,18 @@ an "undefined result".
 
 TCG instructions operate on variables which are temporaries, local
 temporaries or globals. TCG instructions and variables are strongly
-typed. Two types are supported: 32 bit integers and 64 bit
-integers. Pointers are defined as an alias to 32 bit or 64 bit
-integers depending on the TCG target word size.
+typed. Several types are supported:
+
+* 32 bit integers,
+
+* 64 bit integers,
+
+* 64 bit vectors,
+
+* 128 bit vectors.
+
+Pointers are defined as an alias to 32 bit or 64 bit integers
+depending on the TCG target word size.
 
 Each instruction has a fixed number of output variable operands, input
 variable operands and always constant operands.
@@ -208,6 +217,22 @@ t0=t1%t2 (signed). Undefined behavior if division by zero or overflow.
 
 t0=t1%t2 (unsigned). Undefined behavior if division by zero.
 
+* add_i8x16 t0, t1, t2
+add_i16x8 t0, t1, t2
+add_i32x4 t0, t1, t2
+add_i64x2 t0, t1, t2
+
+t0=t1+t2 where t0, t1 and t2 are 128 bit vectors of 8, 16, 32 or 64 bit
+integers.
+
+* add_i8x8 t0, t1, t2
+add_i16x4 t0, t1, t2
+add_i32x2 t0, t1, t2
+add_i64x1 t0, t1, t2
+
+t0=t1+t2 where t0, t1 and t2 are 64 bit vectors of 8, 16, 32 or 64 bit
+integers.
+
 ********* Logical
 
 * and_i32/i64 t0, t1, t2
@@ -477,8 +502,8 @@ current TB was linked to this TB. Otherwise execute the next
 instructions. Only indices 0 and 1 are valid and tcg_gen_goto_tb may be issued
 at most once with each slot index per TB.
 
-* qemu_ld_i32/i64 t0, t1, flags, memidx
-* qemu_st_i32/i64 t0, t1, flags, memidx
+* qemu_ld_i32/i64/v128 t0, t1, flags, memidx
+* qemu_st_i32/i64/v128 t0, t1, flags, memidx
 
 Load data at the guest address t1 into t0, or store data in t0 at guest
 address t1.  The _i32/_i64 size applies to the size of the input/output
@@ -488,6 +513,9 @@ and the width of the memory operation is controlled by flags.
 Both t0 and t1 may be split into little-endian ordered pairs of registers
 if dealing with 64-bit quantities on a 32-bit host.
 
+The _v128 size can only be used to access exactly 128 bits. Host and target
+are required to be of the same endianness for it to work.
+
 The memidx selects the qemu tlb index to use (e.g. user or kernel access).
 The flags are the TCGMemOp bits, selecting the sign, width, and endianness
 of the memory access.
@@ -538,6 +566,15 @@ Floating point operations are not supported in this version. A
 previous incarnation of the code generator had full support of them,
 but it is better to concentrate on integer operations first.
 
+To support vector operations, the backend must define:
+- TCG_TARGET_HAS_REGV64 for the 64 bit vector type and/or
+- TCG_TARGET_HAS_REG128 for the 128 bit vector type.
+For supported types, load and store operations must be implemented. An
+arbitrary set of other vector operations may be supported. Vector operations
+not explicitly declared as supported (by defining
+TCG_TARGET_HAS_<operation> to 1) will never appear in the intermediate
+representation; emulation code will be emitted for them instead.
+
 4.2) Constraints
 
 GCC like constraints are used to define the constraints of every
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (20 preceding siblings ...)
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 21/21] tcg/README: update README to include information about vector opcodes Kirill Batuzov
@ 2017-02-02 15:25 ` no-reply
  2017-02-21 12:19 ` Kirill Batuzov
  22 siblings, 0 replies; 28+ messages in thread
From: no-reply @ 2017-02-02 15:25 UTC (permalink / raw)
  To: batuzovk
  Cc: famz, qemu-devel, peter.maydell, crosthwaite.peter, pbonzini,
	alex.bennee, rth

Hi,

Your series seems to have some coding style problems. See output below for
more information:

Type: series
Subject: [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations
Message-id: 1486046099-17726-1-git-send-email-batuzovk@ispras.ru

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/1486046099-17726-1-git-send-email-batuzovk@ispras.ru -> patchew/1486046099-17726-1-git-send-email-batuzovk@ispras.ru
 * [new tag]         patchew/1486046738-26059-1-git-send-email-abologna@redhat.com -> patchew/1486046738-26059-1-git-send-email-abologna@redhat.com
Switched to a new branch 'test'
64bbc76 tcg/README: update README to include information about vector opcodes
06bc776 target/arm: load two consecutive 64-bit vector regs as a 128-bit vector reg
164b1f6 tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
c227f30 softmmu: create helpers for vector loads
084c6df tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
a9ef8cf tcg: introduce new TCGMemOp - MO_128
723589b target/aarch64: do not check for non-existent TCGMemOp
1b57606 tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
78bb60f tcg/i386: support remaining vector addition operations
a789efe tcg/i386: support 64-bit vector operations
7c67ff1 tcg/i386: add support for vector opcodes
183aaf5 target/arm: use vector opcode to handle vadd.<size> instruction
565699d target/arm: support access to vector guest registers as globals
777b055 tcg: add vector addition operations
2d56597 tcg: allow globals to overlap
188d844 tcg: use results of alias analysis in liveness analysis
8a0b599 tcg: add simple alias analysis
c8e50bc tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
6211ed3 tcg: support representing vector type with smaller vector or scalar types
98f37fb tcg: add support for 64bit vector type
8928fcf tcg: add support for 128bit vector type

=== OUTPUT BEGIN ===
Checking PATCH 1/21: tcg: add support for 128bit vector type...
Checking PATCH 2/21: tcg: add support for 64bit vector type...
Checking PATCH 3/21: tcg: support representing vector type with smaller vector or scalar types...
Checking PATCH 4/21: tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes...
Checking PATCH 5/21: tcg: add simple alias analysis...
ERROR: spaces required around that ':' (ctx:VxE)
#81: FILE: tcg/optimize.c:1472:
+        CASE_OP_32_64(movi):
                            ^

ERROR: spaces required around that ':' (ctx:VxE)
#85: FILE: tcg/optimize.c:1476:
+        CASE_OP_32_64(mov):
                           ^

ERROR: spaces required around that ':' (ctx:VxE)
#90: FILE: tcg/optimize.c:1481:
+        CASE_OP_32_64(add):
                           ^

ERROR: spaces required around that ':' (ctx:VxE)
#91: FILE: tcg/optimize.c:1482:
+        CASE_OP_32_64(sub):
                           ^

ERROR: spaces required around that ':' (ctx:VxE)
#101: FILE: tcg/optimize.c:1492:
+        CASE_OP_32_64(ld8s):
                            ^

ERROR: spaces required around that ':' (ctx:VxE)
#102: FILE: tcg/optimize.c:1493:
+        CASE_OP_32_64(ld8u):
                            ^

ERROR: spaces required around that ':' (ctx:VxE)
#106: FILE: tcg/optimize.c:1497:
+        CASE_OP_32_64(ld16s):
                             ^

ERROR: spaces required around that ':' (ctx:VxE)
#107: FILE: tcg/optimize.c:1498:
+        CASE_OP_32_64(ld16u):
                             ^

ERROR: spaces required around that ':' (ctx:VxE)
#125: FILE: tcg/optimize.c:1516:
+        CASE_OP_32_64(st8):
                           ^

ERROR: spaces required around that ':' (ctx:VxE)
#129: FILE: tcg/optimize.c:1520:
+        CASE_OP_32_64(st16):
                            ^

total: 10 errors, 0 warnings, 196 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 6/21: tcg: use results of alias analysis in liveness analysis...
Checking PATCH 7/21: tcg: allow globals to overlap...
Checking PATCH 8/21: tcg: add vector addition operations...
Checking PATCH 9/21: target/arm: support access to vector guest registers as globals...
ERROR: that open brace { should be on the previous line
#38: FILE: target/arm/translate.c:82:
+static const char *regnames_q[] =
+    { "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",

ERROR: that open brace { should be on the previous line
#42: FILE: target/arm/translate.c:86:
+static const char *regnames_d[] =
+    { "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",

total: 2 errors, 0 warnings, 52 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 10/21: target/arm: use vector opcode to handle vadd.<size> instruction...
Checking PATCH 11/21: tcg/i386: add support for vector opcodes...
Checking PATCH 12/21: tcg/i386: support 64-bit vector operations...
Checking PATCH 13/21: tcg/i386: support remaining vector addition operations...
ERROR: spaces required around that ':' (ctx:VxE)
#102: FILE: tcg/i386/tcg-target.inc.c:2404:
+    OP_V128_ALL(add):
                     ^

ERROR: spaces required around that ':' (ctx:VxE)
#103: FILE: tcg/i386/tcg-target.inc.c:2405:
+    OP_V64_ALL(add):
                    ^

total: 2 errors, 0 warnings, 121 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 14/21: tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend...
Checking PATCH 15/21: target/aarch64: do not check for non-existent TCGMemOp...
Checking PATCH 16/21: tcg: introduce new TCGMemOp - MO_128...
Checking PATCH 17/21: tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes...
Checking PATCH 18/21: softmmu: create helpers for vector loads...
Checking PATCH 19/21: tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops...
Checking PATCH 20/21: target/arm: load two consecutive 64-bit vector regs as a 128-bit vector reg...
Checking PATCH 21/21: tcg/README: update README to include information about vector opcodes...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v2.1 10/21] target/arm: use vector opcode to handle vadd.<size> instruction
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 10/21] target/arm: use vector opcode to handle vadd.<size> instruction Kirill Batuzov
@ 2017-02-09 13:19   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 28+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-02-09 13:19 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel
  Cc: Peter Maydell, Peter Crosthwaite, Paolo Bonzini,
	Alex Bennée, Richard Henderson



On 02/02/2017 11:34 AM, Kirill Batuzov wrote:
> Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
> ---
>  target/arm/translate.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/target/arm/translate.c b/target/arm/translate.c
> index d7578e2..90e14df 100644
> --- a/target/arm/translate.c
> +++ b/target/arm/translate.c
> @@ -5628,6 +5628,37 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>              return 1;
>          }
>
> +        /* Use vector ops to handle what we can */
> +        switch (op) {
> +        case NEON_3R_VADD_VSUB:
> +            if (!u) {
> +                void (* const gen_add_v128[])(TCGv_v128, TCGv_v128,
> +                                             TCGv_v128) = {
> +                    tcg_gen_add_i8x16,
> +                    tcg_gen_add_i16x8,
> +                    tcg_gen_add_i32x4,
> +                    tcg_gen_add_i64x2
> +                };

I'd prefer to have gen_add_v128 'static const'.
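
A minimal sketch (untested), built from the initializer quoted above:

    static void (* const gen_add_v128[])(TCGv_v128, TCGv_v128,
                                         TCGv_v128) = {
        tcg_gen_add_i8x16,
        tcg_gen_add_i16x8,
        tcg_gen_add_i32x4,
        tcg_gen_add_i64x2
    };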

> +                void (* const gen_add_v64[])(TCGv_v64, TCGv_v64,
> +                                             TCGv_v64) = {
> +                    tcg_gen_add_i8x8,
> +                    tcg_gen_add_i16x4,
> +                    tcg_gen_add_i32x2,
> +                    tcg_gen_add_i64x1
> +                };

Same for gen_add_v64.

> +                if (q) {
> +                    gen_add_v128[size](cpu_Q[rd >> 1], cpu_Q[rn >> 1],
> +                                       cpu_Q[rm >> 1]);
> +                } else {
> +                    gen_add_v64[size](cpu_D[rd], cpu_D[rn], cpu_D[rm]);
> +                }
> +                return 0;
> +            }
> +            break;
> +        default:
> +            break;
> +        }
> +
>          for (pass = 0; pass < (q ? 4 : 2); pass++) {
>
>          if (pairwise) {
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations
  2017-02-02 14:34 [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (21 preceding siblings ...)
  2017-02-02 15:25 ` [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations no-reply
@ 2017-02-21 12:19 ` Kirill Batuzov
  22 siblings, 0 replies; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-21 12:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Alex Bennée

On Thu, 2 Feb 2017, Kirill Batuzov wrote:

> The goal of these patch series is to set up an infrastructure to emulate
> guest vector operations using host vector operations. Preliminary
> experiments show that simply translating loads and stores increases
> performance of x264 video codec by 10%. The performance of a gcc vectorized
> for loop increased 2x.
> 
> To be able to emulate guest vector operations using host vector operations,
> several things need to be done.
> 
> 1. Corresponding vector types should be added to TCG. These series add
> TCG_v128 and TCG_v64. I've made TCG_v64 a different type than TCG_i64
> because it usually needs to be allocated to different registers and
> supports different operations.
> 
> 2. Load/store operations for these new types need to be implemented.
> 
> 3. For seamless transition from current model to a new one we need to
> handle cases where memory occupied by global variable can be accessed via
> pointer to the CPUArchState structure. A very simple conservative alias
> analysis has been added to do it. This analysis tracks memory loads and
> stores that overlap with fields of CPUArchState and provides this
> information to the register allocator. The allocator then spills and
> reloads affected globals when needed.
> 
> 4. Allow overlapping globals. For scalar registers this is a rare case, and
> overlapping registers can ba handled as a single one (ah, al, ax, eax,
> rax). In ARM every Q-register consists of two D-register each consisting of
> two S-registers. Handling 4 S-registers as one because they are parts of
> the same Q-register is way too inefficient.
> 
> 5. Add new memory addressing mode to MMU code for large accesses and create
> needed helpers. Only 128-bit vectors have been handled for now.
> 
> 6. Create TCG opcodes for vector operations. Only addition has beed handled
> in these series. Each operation has a wrapper that checks if the backend
> supports the corresponding operation or not. In one case the vector opcode
> is generated, in the other the operation is emulated with scalar
> operations. The emulation code is generated inline for performance reasons
> (there is a huge performance difference between inline generation
> and calling a helper). As a positive side effect this will eventually allow
>  to merge similar emulation code for vector instructions from different
> frontends to target-independent implementation.
> 
> 7. Use new operations in the frontend (ARM was used in these series).
> 
> 8. Support new operations in the backend (x86_64 was used in these series).
> 
> For experiments I have used ARM guest on x86_64 host. I wanted some pair of
> different architectures with vector extensions both. ARM and x86_64 pair
> fits well.
> 
> v1 -> v2:
>  - represent v128 type with smaller types when it is not supported by the host
>  - detect AVX support and use AVX instructions when available
>  - tcg/README updated
>  - generate two v64 adds instead of one v128 when applicable
>  - rebased to newer master
>  - overlap detection for temps added (it needs to be explicitly called from
>    <arch>_translate_init)
>  - the stack is used to temporary store 128 bit variables to memory
>    (instead of the TCGContext field)
> 
> v2 -> v2.1
>  - automatic build failure fixed
> 
> Outstanding issues:
>  - qemu_ld_v128 and qemu_st_v128 do not generate fallback code if the host
>    does not support 128 bit registers. The reason is that I do not know how to
>    handle the host/guest different endianness (whether do we swap only bytes
>    in elements or whole vectors?). Different targets seem to have different
>    ideas on how this should be done.
>

Ping?

-- 
Kirill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v2.1 13/21] tcg/i386: support remaining vector addition operations
       [not found]     ` <32a902a1-e8c7-c2f7-ac66-148e02ee0b2d@amsat.org>
@ 2017-02-21 13:29       ` Kirill Batuzov
  2017-02-21 16:21         ` Alex Bennée
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill Batuzov @ 2017-02-21 13:29 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: qemu-devel, Peter Maydell, Peter Crosthwaite, Paolo Bonzini,
	Alex Bennée, Richard Henderson

On Tue, 21 Feb 2017, Philippe Mathieu-Daudé wrote:

> Hi Kirill,
> 
> could you check my previous comment?
>

Hi Philippe,

thank you for your comments. I've seen them and I'll apply the changes
you suggested in the next version of the series. I was just hoping to get
a bit more feedback before I proceed to v3.

-- 
Kirill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v2.1 13/21] tcg/i386: support remaining vector addition operations
  2017-02-21 13:29       ` Kirill Batuzov
@ 2017-02-21 16:21         ` Alex Bennée
  0 siblings, 0 replies; 28+ messages in thread
From: Alex Bennée @ 2017-02-21 16:21 UTC (permalink / raw)
  To: Kirill Batuzov
  Cc: Philippe Mathieu-Daudé,
	qemu-devel, Peter Maydell, Peter Crosthwaite, Paolo Bonzini,
	Richard Henderson


Kirill Batuzov <batuzovk@ispras.ru> writes:

> On Tue, 21 Feb 2017, Philippe Mathieu-Daudé wrote:
>
>> Hi Kirill,
>>
>> could you check my previous comment?
>>
>
> Hi Philippe,
>
> thank you for your comments. I've seen them and I'll apply changes you
> suggested in the next version of the series. I was just hoping to get
> a bit more feedback before I proceed to v3.

It is on my list to look at - however I'm in a bit of a crunch getting
the MTTCG stuff prepared before code freeze as well as preparing for a
company conference. Once that's out of the way I'll have a bit more
review time!

--
Alex Bennée

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v2.1 14/21] tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
  2017-02-02 14:34 ` [Qemu-devel] [PATCH v2.1 14/21] tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend Kirill Batuzov
@ 2017-05-05 13:59   ` Alex Bennée
  0 siblings, 0 replies; 28+ messages in thread
From: Alex Bennée @ 2017-05-05 13:59 UTC (permalink / raw)
  To: Kirill Batuzov
  Cc: qemu-devel, Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski


Kirill Batuzov <batuzovk@ispras.ru> writes:

> Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
> ---
>  tcg/aarch64/tcg-target.inc.c |  4 ++--
>  tcg/arm/tcg-target.inc.c     |  4 ++--
>  tcg/i386/tcg-target.inc.c    |  4 ++--
>  tcg/mips/tcg-target.inc.c    |  4 ++--
>  tcg/ppc/tcg-target.inc.c     |  4 ++--
>  tcg/s390/tcg-target.inc.c    |  4 ++--
>  tcg/sparc/tcg-target.inc.c   | 12 ++++++------
>  tcg/tcg-op.c                 |  4 ++--
>  tcg/tcg.h                    |  1 +
>  9 files changed, 21 insertions(+), 20 deletions(-)
>
> diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
> index 6d227a5..2b0b548 100644
> --- a/tcg/aarch64/tcg-target.inc.c
> +++ b/tcg/aarch64/tcg-target.inc.c
> @@ -1032,7 +1032,7 @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
>  /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
>   *                                     TCGMemOpIdx oi, uintptr_t ra)
>   */
> -static void * const qemu_ld_helpers[16] = {
> +static void * const qemu_ld_helpers[] = {
>      [MO_UB]   = helper_ret_ldub_mmu,
>      [MO_LEUW] = helper_le_lduw_mmu,
>      [MO_LEUL] = helper_le_ldul_mmu,
> @@ -1046,7 +1046,7 @@ static void * const qemu_ld_helpers[16] = {
>   *                                     uintxx_t val, TCGMemOpIdx oi,
>   *                                     uintptr_t ra)
>   */
> -static void * const qemu_st_helpers[16] = {
> +static void * const qemu_st_helpers[] = {
>      [MO_UB]   = helper_ret_stb_mmu,
>      [MO_LEUW] = helper_le_stw_mmu,
>      [MO_LEUL] = helper_le_stl_mmu,
> diff --git a/tcg/arm/tcg-target.inc.c b/tcg/arm/tcg-target.inc.c
> index e75a6d4..f603f02 100644
> --- a/tcg/arm/tcg-target.inc.c
> +++ b/tcg/arm/tcg-target.inc.c
> @@ -1058,7 +1058,7 @@ static inline void tcg_out_mb(TCGContext *s, TCGArg a0)
>  /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
>   *                                     int mmu_idx, uintptr_t ra)
>   */
> -static void * const qemu_ld_helpers[16] = {
> +static void * const qemu_ld_helpers[] = {
>      [MO_UB]   = helper_ret_ldub_mmu,
>      [MO_SB]   = helper_ret_ldsb_mmu,
>
> @@ -1078,7 +1078,7 @@ static void * const qemu_ld_helpers[16] = {
>  /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
>   *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
>   */
> -static void * const qemu_st_helpers[16] = {
> +static void * const qemu_st_helpers[] = {
>      [MO_UB]   = helper_ret_stb_mmu,
>      [MO_LEUW] = helper_le_stw_mmu,
>      [MO_LEUL] = helper_le_stl_mmu,
> diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
> index d8f0d81..263c15e 100644
> --- a/tcg/i386/tcg-target.inc.c
> +++ b/tcg/i386/tcg-target.inc.c
> @@ -1334,7 +1334,7 @@ static void tcg_out_nopn(TCGContext *s, int n)
>  /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
>   *                                     int mmu_idx, uintptr_t ra)
>   */
> -static void * const qemu_ld_helpers[16] = {
> +static void * const qemu_ld_helpers[] = {
>      [MO_UB]   = helper_ret_ldub_mmu,
>      [MO_LEUW] = helper_le_lduw_mmu,
>      [MO_LEUL] = helper_le_ldul_mmu,
> @@ -1347,7 +1347,7 @@ static void * const qemu_ld_helpers[16] = {
>  /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
>   *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
>   */
> -static void * const qemu_st_helpers[16] = {
> +static void * const qemu_st_helpers[] = {
>      [MO_UB]   = helper_ret_stb_mmu,
>      [MO_LEUW] = helper_le_stw_mmu,
>      [MO_LEUL] = helper_le_stl_mmu,
> diff --git a/tcg/mips/tcg-target.inc.c b/tcg/mips/tcg-target.inc.c
> index 01ac7b2..4f2d5d1 100644
> --- a/tcg/mips/tcg-target.inc.c
> +++ b/tcg/mips/tcg-target.inc.c
> @@ -1108,7 +1108,7 @@ static void tcg_out_call(TCGContext *s, tcg_insn_unit *arg)
>  }
>
>  #if defined(CONFIG_SOFTMMU)
> -static void * const qemu_ld_helpers[16] = {
> +static void * const qemu_ld_helpers[] = {
>      [MO_UB]   = helper_ret_ldub_mmu,
>      [MO_SB]   = helper_ret_ldsb_mmu,
>      [MO_LEUW] = helper_le_lduw_mmu,
> @@ -1125,7 +1125,7 @@ static void * const qemu_ld_helpers[16] = {
>  #endif
>  };
>
> -static void * const qemu_st_helpers[16] = {
> +static void * const qemu_st_helpers[] = {
>      [MO_UB]   = helper_ret_stb_mmu,
>      [MO_LEUW] = helper_le_stw_mmu,
>      [MO_LEUL] = helper_le_stl_mmu,
> diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
> index 64f67d2..680050b 100644
> --- a/tcg/ppc/tcg-target.inc.c
> +++ b/tcg/ppc/tcg-target.inc.c
> @@ -1419,7 +1419,7 @@ static const uint32_t qemu_exts_opc[4] = {
>  /* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
>   *                                 int mmu_idx, uintptr_t ra)
>   */
> -static void * const qemu_ld_helpers[16] = {
> +static void * const qemu_ld_helpers[] = {
>      [MO_UB]   = helper_ret_ldub_mmu,
>      [MO_LEUW] = helper_le_lduw_mmu,
>      [MO_LEUL] = helper_le_ldul_mmu,
> @@ -1432,7 +1432,7 @@ static void * const qemu_ld_helpers[16] = {
>  /* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
>   *                                 uintxx_t val, int mmu_idx, uintptr_t ra)
>   */
> -static void * const qemu_st_helpers[16] = {
> +static void * const qemu_st_helpers[] = {
>      [MO_UB]   = helper_ret_stb_mmu,
>      [MO_LEUW] = helper_le_stw_mmu,
>      [MO_LEUL] = helper_le_stl_mmu,
> diff --git a/tcg/s390/tcg-target.inc.c b/tcg/s390/tcg-target.inc.c
> index a679280..ec3491a 100644
> --- a/tcg/s390/tcg-target.inc.c
> +++ b/tcg/s390/tcg-target.inc.c
> @@ -309,7 +309,7 @@ static const uint8_t tcg_cond_to_ltr_cond[] = {
>  };
>
>  #ifdef CONFIG_SOFTMMU
> -static void * const qemu_ld_helpers[16] = {
> +static void * const qemu_ld_helpers[] = {
>      [MO_UB]   = helper_ret_ldub_mmu,
>      [MO_SB]   = helper_ret_ldsb_mmu,
>      [MO_LEUW] = helper_le_lduw_mmu,
> @@ -324,7 +324,7 @@ static void * const qemu_ld_helpers[16] = {
>      [MO_BEQ]  = helper_be_ldq_mmu,
>  };
>
> -static void * const qemu_st_helpers[16] = {
> +static void * const qemu_st_helpers[] = {
>      [MO_UB]   = helper_ret_stb_mmu,
>      [MO_LEUW] = helper_le_stw_mmu,
>      [MO_LEUL] = helper_le_stl_mmu,
> diff --git a/tcg/sparc/tcg-target.inc.c b/tcg/sparc/tcg-target.inc.c
> index d1f4c0d..1b115d2 100644
> --- a/tcg/sparc/tcg-target.inc.c
> +++ b/tcg/sparc/tcg-target.inc.c
> @@ -840,12 +840,12 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
>  }
>
>  #ifdef CONFIG_SOFTMMU
> -static tcg_insn_unit *qemu_ld_trampoline[16];
> -static tcg_insn_unit *qemu_st_trampoline[16];
> +static tcg_insn_unit *qemu_ld_trampoline[MO_ALL];
> +static tcg_insn_unit *qemu_st_trampoline[MO_ALL];

Minor merge conflict here since
709a340d679d95a0c6cbb9b5f654498f04345b50.

>
>  static void build_trampolines(TCGContext *s)
>  {
> -    static void * const qemu_ld_helpers[16] = {
> +    static void * const qemu_ld_helpers[MO_ALL] = {

Why bother bounding the array with MO_ALL here, when for the other
MO_-indexed arrays in the other backends you changed the bound to []?
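
With C99 designated initializers the unsized form still has a
well-defined length -- the largest initialized index plus one -- so the
two spellings differ only in trailing padding. A standalone sketch to
illustrate, not QEMU code:

  #include <stdio.h>

  enum { IDX_LOW = 1, IDX_HIGH = 9 };

  /* Unsized: length is the largest designated index plus one (10). */
  static const int unsized[] = {
      [IDX_LOW]  = 11,
      [IDX_HIGH] = 99,
  };

  /* Explicitly bounded: length stays 16 whichever slots are set. */
  static const int bounded[16] = {
      [IDX_LOW]  = 11,
      [IDX_HIGH] = 99,
  };

  int main(void)
  {
      printf("unsized: %zu entries, bounded: %zu entries\n",
             sizeof(unsized) / sizeof(unsized[0]),
             sizeof(bounded) / sizeof(bounded[0]));
      return 0;
  }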

>          [MO_UB]   = helper_ret_ldub_mmu,
>          [MO_SB]   = helper_ret_ldsb_mmu,
>          [MO_LEUW] = helper_le_lduw_mmu,
> @@ -857,7 +857,7 @@ static void build_trampolines(TCGContext *s)
>          [MO_BEUL] = helper_be_ldul_mmu,
>          [MO_BEQ]  = helper_be_ldq_mmu,
>      };
> -    static void * const qemu_st_helpers[16] = {
> +    static void * const qemu_st_helpers[MO_ALL] = {
>          [MO_UB]   = helper_ret_stb_mmu,
>          [MO_LEUW] = helper_le_stw_mmu,
>          [MO_LEUL] = helper_le_stl_mmu,
> @@ -870,7 +870,7 @@ static void build_trampolines(TCGContext *s)
>      int i;
>      TCGReg ra;
>
> -    for (i = 0; i < 16; ++i) {
> +    for (i = 0; i < MO_ALL; ++i) {

You could just use ARRAY_SIZE(qemu_ld_helpers) here instead.
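
Something like this untested sketch -- the loop bound then tracks the
declaration automatically, so it cannot drift if MO_ALL ever changes.
ARRAY_SIZE is the usual sizeof-based macro from qemu/osdep.h, redefined
here only to keep the example standalone, and the helper table is a
dummy stand-in for qemu_ld_helpers:

  #include <stdio.h>

  #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

  static void dummy_ld(void) { }

  /* Sparse table, like qemu_ld_helpers: gaps stay NULL. */
  static void (* const ld_helpers[])(void) = {
      [1] = dummy_ld,
      [3] = dummy_ld,
  };

  int main(void)
  {
      size_t i;

      for (i = 0; i < ARRAY_SIZE(ld_helpers); ++i) {
          if (ld_helpers[i] == NULL) {
              continue;      /* skip unpopulated slots */
          }
          printf("building trampoline for slot %zu\n", i);
      }
      return 0;
  }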

>          if (qemu_ld_helpers[i] == NULL) {
>              continue;
>          }
> @@ -898,7 +898,7 @@ static void build_trampolines(TCGContext *s)
>          tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O7, ra);
>      }
>
> -    for (i = 0; i < 16; ++i) {
> +    for (i = 0; i < MO_ALL; ++i) {

And ARRAY_SIZE(qemu_st_helpers) here again.

>          if (qemu_st_helpers[i] == NULL) {
>              continue;
>          }
> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> index 8a19eee..0dfe611 100644
> --- a/tcg/tcg-op.c
> +++ b/tcg/tcg-op.c
> @@ -2767,7 +2767,7 @@ typedef void (*gen_atomic_op_i64)(TCGv_i64, TCGv_env, TCGv, TCGv_i64);
>  # define WITH_ATOMIC64(X)
>  #endif
>
> -static void * const table_cmpxchg[16] = {
> +static void * const table_cmpxchg[] = {
>      [MO_8] = gen_helper_atomic_cmpxchgb,
>      [MO_16 | MO_LE] = gen_helper_atomic_cmpxchgw_le,
>      [MO_16 | MO_BE] = gen_helper_atomic_cmpxchgw_be,
> @@ -2985,7 +2985,7 @@ static void do_atomic_op_i64(TCGv_i64 ret, TCGv addr, TCGv_i64 val,
>  }
>
>  #define GEN_ATOMIC_HELPER(NAME, OP, NEW)                                \
> -static void * const table_##NAME[16] = {                                \
> +static void * const table_##NAME[] = {                                  \
>      [MO_8] = gen_helper_atomic_##NAME##b,                               \
>      [MO_16 | MO_LE] = gen_helper_atomic_##NAME##w_le,                   \
>      [MO_16 | MO_BE] = gen_helper_atomic_##NAME##w_be,                   \
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index fd43f15..5e0c6da 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -386,6 +386,7 @@ typedef enum TCGMemOp {
>      MO_TEQ   = MO_TE | MO_Q,
>
>      MO_SSIZE = MO_SIZE | MO_SIGN,
> +    MO_ALL   = MO_SIZE | MO_SIGN | MO_BSWAP | MO_AMASK,
>  } TCGMemOp;
>
>  /**
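
One last note on the new MO_ALL bound itself. As a sanity check, here
is a standalone sketch with the mask values I believe tcg.h uses
(MO_SIZE = 3, MO_SIGN = 4, MO_BSWAP = 8, MO_AMASK = 7 << 4 -- worth
double-checking against your tree):

  #include <stdio.h>

  /* Assumed TCGMemOp mask values -- verify against tcg/tcg.h. */
  enum {
      MO_SIZE  = 3,       /* two size bits */
      MO_SIGN  = 4,       /* sign-extension bit */
      MO_BSWAP = 8,       /* byte-swap bit */
      MO_AMASK = 7 << 4,  /* alignment bits */
      MO_ALL   = MO_SIZE | MO_SIGN | MO_BSWAP | MO_AMASK,
  };

  int main(void)
  {
      /* An MO_ALL-bounded array has valid indices 0 .. MO_ALL - 1,
       * which is enough as long as the helper lookups mask out
       * everything above the bswap/ssize bits. */
      printf("MO_ALL = %#x -> %d-entry tables\n", MO_ALL, MO_ALL);
      return 0;
  }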


--
Alex Bennée


