* [RFC PATCH 00/37] target/i386: new decoder + AVX implementation
@ 2022-09-11 23:03 Paolo Bonzini
  2022-09-11 23:03 ` [PATCH 01/37] target/i386: Define XMMReg and access macros, align ZMM registers Paolo Bonzini
                   ` (36 more replies)
  0 siblings, 37 replies; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

This series fleshes out more of the prototype table-driven decoder
and uses it to implement SSE and AVX.

As expected, there are a lot more lines here than in Paul's version
(roughly twice as many), but in my opinion that is a price worth paying
for future maintainability and easier code review.  Nevertheless,
I cannot even explain how much his work helped; the authorship in
the log below doesn't tell the whole story because a bunch of patches
were extracted from his post and somewhat redone, not to mention the
wonderful tests.  In other words, he did all the hard and important work
that allowed me to proceed piecewise (and fast).

Interestingly, this also found a bunch of errors in the Intel manual,
which I have noted in the source and have forwarded to the contacts I
have there.  But overall the manual proved to be very easy to convert
to source code.  About 10% of the instructions needed some kind of
special casing, most of them due to differences between register and
memory operands (0F 12, for example, is MOVLPS with a memory operand
but MOVHLPS with a register operand).

Inspired by Richard's series at
https://lore.kernel.org/qemu-devel/20220822223722.1697758-1-richard.henderson@linaro.org,
I also took the opportunity to use gvec here and there, especially for
moves.  However, helpers are not converted to gvec style.  Another
possible cleanup to be done on top entails checking which helpers really
need the cpu_env argument.  Most of the integer ones probably don't,
for example, because the separate generator functions of the new decoder
provide a little more flexibility there.
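
To illustrate what the gvec conversion buys (a sketch, not code from
the series; "dst" and "src" are placeholder register numbers), a
128-bit register-to-register move becomes a single op instead of two
load/store pairs:

    /* Move xmm_regs[src] to xmm_regs[dst].  With oprsz = maxsz = 16
     * this is a plain 128-bit move; passing maxsz = 32 would also
     * clear the high half of the YMM register, as VEX-encoded moves
     * require.  */
    tcg_gen_gvec_mov(MO_64,
                     offsetof(CPUX86State, xmm_regs[dst]),
                     offsetof(CPUX86State, xmm_regs[src]),
                     16 /* oprsz */, 16 /* maxsz */);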

This is not quite ready for commit, mostly because I haven't yet tested
it on big-endian architectures and on system emulation, but I thought I'd
throw it out early.  SSE4a translation is not tested yet either, and I will
probably convert 3DNow to the new decoder too, because it seems trivial.

Compared to the very first post, the decoder core is of course more
fleshed out.  New features include: support for MMX instructions without
having to write them down separately; 4-operand instructions; CPUID
feature testing; and simplified handling of opcodes with mandatory
66/F3/F2 prefixes.  On the other hand, I left out the implementation of
the one-byte opcodes, focusing on MMX and SSE instead.  Some "specials"
are not needed (e.g. NoSeg, which is used for LEA), so they are not
included.

Patches 1-4 are cleanups to translate.c that come in handy with the new
decoder.

Patches 5-11 add the generic framework for x86 decoding.

Patches 12-13 move all existing VEX instructions to the new decoder.
While at it, they also add the other integer instructions in the 0F38
and 0F3A opcode ranges, because the corresponding patches for SSE and
AVX instructions are big enough.  As of patch 13, however, these non-VEX
instructions are still translated by the old decoder.

Patches 14-19 extend the helpers to support AVX 3- and 4-operand
instructions.  Patch 15 is by far the nastiest but it's not really
possible to split it further; a consolation is that the translate.c
part of it goes away with the SSE reimplementation.

Unlike in Paul's AVX series, I chose to implement the "merging" behavior
of AVX 3-operand scalar operations (VADDSx, VSQRTSx, etc.) in the
helpers rather than in TCG ops, so that change is also introduced at
this step; for more information see patches 16 and 17.
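
(For reference: "merging" means that a 3-operand scalar operation such
as VADDSS writes only the low element of the destination and copies the
remaining elements from the first source operand.  A minimal sketch of
a helper with this behavior; the name is illustrative, not the series'
actual code:

    void helper_vaddss_sketch(CPUX86State *env, ZMMReg *d,
                              ZMMReg *v, ZMMReg *s)
    {
        d->ZMM_S(0) = float32_add(v->ZMM_S(0), s->ZMM_S(0),
                                  &env->sse_status);
        /* Merge: elements 1-3 come from the first source,
         * not from the old destination.  */
        d->ZMM_L(1) = v->ZMM_L(1);
        d->ZMM_L(2) = v->ZMM_L(2);
        d->ZMM_L(3) = v->ZMM_L(3);
    })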

Patches 20-31 implement SSE and AVX instruction translation in the new
decoder.  The patches generally operate on groups of 8 to 24 opcodes,
though there are a couple bigger ones for the 0F38 and 0F3A opcodes.

Patches 32-35 finally enable AVX, and patches 36-37 remove now-unused
translator and helper code.

Paolo

Paolo Bonzini (32):
  target/i386: make ldo/sto operations consistent with ldq
  target/i386: REPZ and REPNZ are mutually exclusive
  target/i386: introduce insn_get_addr
  target/i386: add core of new i386 decoder
  target/i386: add ALU load/writeback core
  target/i386: add CPUID[EAX=7,ECX=0].ECX to DisasContext
  target/i386: add CPUID feature checks to new decoder
  target/i386: validate VEX prefixes via the instructions' exception classes
  target/i386: validate SSE prefixes directly in the decoding table
  target/i386: add scalar 0F 38 and 0F 3A instruction to new decoder
  target/i386: remove scalar VEX instructions from old decoder
  target/i386: extend helpers to support VEX.V 3- and 4- operand encodings
  target/i386: support operand merging in binary scalar helpers
  target/i386: provide 3-operand versions of unary scalar helpers
  target/i386: implement additional AVX comparison operators
  target/i386: Introduce 256-bit vector helpers
  target/i386: reimplement 0x0f 0x60-0x6f, add AVX
  target/i386: reimplement 0x0f 0xd8-0xdf, 0xe8-0xef, 0xf8-0xff, add AVX
  target/i386: reimplement 0x0f 0x50-0x5f, add AVX
  target/i386: reimplement 0x0f 0x78-0x7f, add AVX
  target/i386: reimplement 0x0f 0x70-0x77, add AVX
  target/i386: reimplement 0x0f 0xd0-0xd7, 0xe0-0xe7, 0xf0-0xf7, add AVX
  target/i386: reimplement 0x0f 0x3a, add AVX
  target/i386: reimplement 0x0f 0x38, add AVX
  target/i386: reimplement 0x0f 0xc2, 0xc4-0xc6, add AVX
  target/i386: reimplement 0x0f 0x10-0x17, add AVX
  target/i386: reimplement 0x0f 0x28-0x2f, add AVX
  target/i386: implement XSAVE and XRSTOR of AVX registers
  target/i386: implement VLDMXCSR/VSTMXCSR
  tests/tcg: extend SSE tests to AVX
  target/i386: move 3DNow completely out of gen_sse
  target/i386: remove old SSE decoder

Paul Brook (3):
  target/i386: add AVX_EN hflag
  target/i386: Prepare ops_sse_header.h for 256 bit AVX
  target/i386: Enable AVX cpuid bits when using TCG

Richard Henderson (2):
  target/i386: Define XMMReg and access macros, align ZMM registers
  target/i386: Use tcg gvec ops for pmovmskb

 target/i386/cpu.c                |   10 +-
 target/i386/cpu.h                |   59 +-
 target/i386/helper.c             |   12 +
 target/i386/helper.h             |    9 +
 target/i386/ops_sse.h            |  655 ++++++---
 target/i386/ops_sse_header.h     |  339 +++--
 target/i386/tcg/decode-new.c.inc | 1694 ++++++++++++++++++++++
 target/i386/tcg/decode-new.h     |  242 ++++
 target/i386/tcg/emit.c.inc       | 2323 ++++++++++++++++++++++++++++++
 target/i386/tcg/fpu_helper.c     |  134 +-
 target/i386/tcg/translate.c      | 2071 ++------------------------
 tests/tcg/i386/Makefile.target   |    2 +-
 tests/tcg/i386/test-avx.c        |  201 +--
 tests/tcg/i386/test-avx.py       |    3 +-
 14 files changed, 5374 insertions(+), 2380 deletions(-)
 create mode 100644 target/i386/tcg/decode-new.c.inc
 create mode 100644 target/i386/tcg/decode-new.h
 create mode 100644 target/i386/tcg/emit.c.inc

-- 
2.37.2




* [PATCH 01/37] target/i386: Define XMMReg and access macros, align ZMM registers
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-11 23:03 ` [PATCH 02/37] target/i386: make ldo/sto operations consistent with ldq Paolo Bonzini
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

From: Richard Henderson <richard.henderson@linaro.org>

This will be used for emission and endian adjustments of gvec operations.
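
As an illustration (a sketch, not part of this patch, with "reg" being
a register number): the endian-adjusted access macros let translation
code compute host-endianness-correct offsets of sub-registers, e.g. the
low 128 bits of an XMM register:

    /* Correct on big- and little-endian hosts thanks to ZMM_X().  */
    int ofs = offsetof(CPUX86State, xmm_regs[reg].ZMM_X(0));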

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220822223722.1697758-2-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/cpu.h | 56 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 82004b65b9..8311b69c88 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1233,18 +1233,34 @@ typedef struct SegmentCache {
     uint32_t flags;
 } SegmentCache;
 
-#define MMREG_UNION(n, bits)        \
-    union n {                       \
-        uint8_t  _b_##n[(bits)/8];  \
-        uint16_t _w_##n[(bits)/16]; \
-        uint32_t _l_##n[(bits)/32]; \
-        uint64_t _q_##n[(bits)/64]; \
-        float32  _s_##n[(bits)/32]; \
-        float64  _d_##n[(bits)/64]; \
-    }
+typedef union MMXReg {
+    uint8_t  _b_MMXReg[64 / 8];
+    uint16_t _w_MMXReg[64 / 16];
+    uint32_t _l_MMXReg[64 / 32];
+    uint64_t _q_MMXReg[64 / 64];
+    float32  _s_MMXReg[64 / 32];
+    float64  _d_MMXReg[64 / 64];
+} MMXReg;
 
-typedef MMREG_UNION(ZMMReg, 512) ZMMReg;
-typedef MMREG_UNION(MMXReg, 64)  MMXReg;
+typedef union XMMReg {
+    uint64_t _q_XMMReg[128 / 64];
+} XMMReg;
+
+typedef union YMMReg {
+    uint64_t _q_YMMReg[256 / 64];
+    XMMReg   _x_YMMReg[256 / 128];
+} YMMReg;
+
+typedef union ZMMReg {
+    uint8_t  _b_ZMMReg[512 / 8];
+    uint16_t _w_ZMMReg[512 / 16];
+    uint32_t _l_ZMMReg[512 / 32];
+    uint64_t _q_ZMMReg[512 / 64];
+    float32  _s_ZMMReg[512 / 32];
+    float64  _d_ZMMReg[512 / 64];
+    XMMReg   _x_ZMMReg[512 / 128];
+    YMMReg   _y_ZMMReg[512 / 256];
+} ZMMReg;
 
 typedef struct BNDReg {
     uint64_t lb;
@@ -1267,6 +1283,13 @@ typedef struct BNDCSReg {
 #define ZMM_S(n) _s_ZMMReg[15 - (n)]
 #define ZMM_Q(n) _q_ZMMReg[7 - (n)]
 #define ZMM_D(n) _d_ZMMReg[7 - (n)]
+#define ZMM_X(n) _x_ZMMReg[3 - (n)]
+#define ZMM_Y(n) _y_ZMMReg[1 - (n)]
+
+#define XMM_Q(n) _q_XMMReg[1 - (n)]
+
+#define YMM_Q(n) _q_YMMReg[3 - (n)]
+#define YMM_X(n) _x_YMMReg[1 - (n)]
 
 #define MMX_B(n) _b_MMXReg[7 - (n)]
 #define MMX_W(n) _w_MMXReg[3 - (n)]
@@ -1279,6 +1302,13 @@ typedef struct BNDCSReg {
 #define ZMM_S(n) _s_ZMMReg[n]
 #define ZMM_Q(n) _q_ZMMReg[n]
 #define ZMM_D(n) _d_ZMMReg[n]
+#define ZMM_X(n) _x_ZMMReg[n]
+#define ZMM_Y(n) _y_ZMMReg[n]
+
+#define XMM_Q(n) _q_XMMReg[n]
+
+#define YMM_Q(n) _q_YMMReg[n]
+#define YMM_X(n) _x_YMMReg[n]
 
 #define MMX_B(n) _b_MMXReg[n]
 #define MMX_W(n) _w_MMXReg[n]
@@ -1556,8 +1586,8 @@ typedef struct CPUArchState {
     float_status mmx_status; /* for 3DNow! float ops */
     float_status sse_status;
     uint32_t mxcsr;
-    ZMMReg xmm_regs[CPU_NB_REGS == 8 ? 8 : 32];
-    ZMMReg xmm_t0;
+    ZMMReg xmm_regs[CPU_NB_REGS == 8 ? 8 : 32] QEMU_ALIGNED(16);
+    ZMMReg xmm_t0 QEMU_ALIGNED(16);
     MMXReg mmx_t0;
 
     uint64_t opmask_regs[NB_OPMASK_REGS];
-- 
2.37.2





* [PATCH 02/37] target/i386: make ldo/sto operations consistent with ldq
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
  2022-09-11 23:03 ` [PATCH 01/37] target/i386: Define XMMReg and access macros, align ZMM registers Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12  8:33   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 03/37] target/i386: REPZ and REPNZ are mutually exclusive Paolo Bonzini
                   ` (34 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

ldq takes a pointer to the first byte to load the 64-bit word in;
ldo takes a pointer to the first byte of the ZMMReg.  Make them
consistent, which will be useful in the new SSE decoder's
load/writeback routines.
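
Concretely, where callers previously passed the offset of the whole
ZMMReg, they now pass the offset of the low 128 bits (a before/after
sketch; ZMM_X is the access macro from the previous patch):

    /* before */
    gen_ldo_env_A0(s, offsetof(CPUX86State, xmm_regs[reg]));
    /* after: point at the first byte of the 128-bit value */
    gen_ldo_env_A0(s, offsetof(CPUX86State, xmm_regs[reg].ZMM_X(0)));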

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/translate.c | 44 +++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 001af76663..9a85010dcd 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2761,28 +2761,29 @@ static inline void gen_ldo_env_A0(DisasContext *s, int offset)
 {
     int mem_index = s->mem_index;
     tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, mem_index, MO_LEUQ);
-    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(0)));
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(0)));
     tcg_gen_addi_tl(s->tmp0, s->A0, 8);
     tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
-    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(1)));
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(1)));
 }
 
 static inline void gen_sto_env_A0(DisasContext *s, int offset)
 {
     int mem_index = s->mem_index;
-    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(0)));
+    offset -= offsetof(ZMMReg, ZMM_Q(0));
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(0)));
     tcg_gen_qemu_st_i64(s->tmp1_i64, s->A0, mem_index, MO_LEUQ);
     tcg_gen_addi_tl(s->tmp0, s->A0, 8);
-    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(1)));
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(1)));
     tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
 }
 
 static inline void gen_op_movo(DisasContext *s, int d_offset, int s_offset)
 {
-    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, ZMM_Q(0)));
-    tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, ZMM_Q(0)));
-    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, ZMM_Q(1)));
-    tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, ZMM_Q(1)));
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(XMMReg, XMM_Q(0)));
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(XMMReg, XMM_Q(0)));
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(XMMReg, XMM_Q(1)));
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(XMMReg, XMM_Q(1)));
 }
 
 static inline void gen_op_movq(DisasContext *s, int d_offset, int s_offset)
@@ -2804,6 +2805,7 @@ static inline void gen_op_movq_env_0(DisasContext *s, int d_offset)
 }
 
 #define ZMM_OFFSET(reg) offsetof(CPUX86State, xmm_regs[reg])
+#define XMM_OFFSET(reg) offsetof(CPUX86State, xmm_regs[reg].ZMM_X(0))
 
 typedef void (*SSEFunc_i_ep)(TCGv_i32 val, TCGv_ptr env, TCGv_ptr reg);
 typedef void (*SSEFunc_l_ep)(TCGv_i64 val, TCGv_ptr env, TCGv_ptr reg);
@@ -3317,13 +3319,13 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
             if (mod == 3)
                 goto illegal_op;
             gen_lea_modrm(env, s, modrm);
-            gen_sto_env_A0(s, ZMM_OFFSET(reg));
+            gen_sto_env_A0(s, XMM_OFFSET(reg));
             break;
         case 0x3f0: /* lddqu */
             if (mod == 3)
                 goto illegal_op;
             gen_lea_modrm(env, s, modrm);
-            gen_ldo_env_A0(s, ZMM_OFFSET(reg));
+            gen_ldo_env_A0(s, XMM_OFFSET(reg));
             break;
         case 0x22b: /* movntss */
         case 0x32b: /* movntsd */
@@ -3392,10 +3394,10 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
         case 0x26f: /* movdqu xmm, ea */
             if (mod != 3) {
                 gen_lea_modrm(env, s, modrm);
-                gen_ldo_env_A0(s, ZMM_OFFSET(reg));
+                gen_ldo_env_A0(s, XMM_OFFSET(reg));
             } else {
                 rm = (modrm & 7) | REX_B(s);
-                gen_op_movo(s, ZMM_OFFSET(reg), ZMM_OFFSET(rm));
+                gen_op_movo(s, XMM_OFFSET(reg), XMM_OFFSET(rm));
             }
             break;
         case 0x210: /* movss xmm, ea */
@@ -3451,7 +3453,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
         case 0x212: /* movsldup */
             if (mod != 3) {
                 gen_lea_modrm(env, s, modrm);
-                gen_ldo_env_A0(s, ZMM_OFFSET(reg));
+                gen_ldo_env_A0(s, XMM_OFFSET(reg));
             } else {
                 rm = (modrm & 7) | REX_B(s);
                 gen_op_movl(s, offsetof(CPUX86State, xmm_regs[reg].ZMM_L(0)),
@@ -3493,7 +3495,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
         case 0x216: /* movshdup */
             if (mod != 3) {
                 gen_lea_modrm(env, s, modrm);
-                gen_ldo_env_A0(s, ZMM_OFFSET(reg));
+                gen_ldo_env_A0(s, XMM_OFFSET(reg));
             } else {
                 rm = (modrm & 7) | REX_B(s);
                 gen_op_movl(s, offsetof(CPUX86State, xmm_regs[reg].ZMM_L(1)),
@@ -3587,10 +3589,10 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
         case 0x27f: /* movdqu ea, xmm */
             if (mod != 3) {
                 gen_lea_modrm(env, s, modrm);
-                gen_sto_env_A0(s, ZMM_OFFSET(reg));
+                gen_sto_env_A0(s, XMM_OFFSET(reg));
             } else {
                 rm = (modrm & 7) | REX_B(s);
-                gen_op_movo(s, ZMM_OFFSET(rm), ZMM_OFFSET(reg));
+                gen_op_movo(s, XMM_OFFSET(rm), XMM_OFFSET(reg));
             }
             break;
         case 0x211: /* movss ea, xmm */
@@ -3742,7 +3744,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
             gen_helper_enter_mmx(cpu_env);
             if (mod != 3) {
                 gen_lea_modrm(env, s, modrm);
-                op2_offset = offsetof(CPUX86State,xmm_t0);
+                op2_offset = offsetof(CPUX86State,xmm_t0.ZMM_X(0));
                 gen_ldo_env_A0(s, op2_offset);
             } else {
                 rm = (modrm & 7) | REX_B(s);
@@ -3906,9 +3908,9 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
             }
 
             if (b1) {
-                op1_offset = ZMM_OFFSET(reg);
+                op1_offset = XMM_OFFSET(reg);
                 if (mod == 3) {
-                    op2_offset = ZMM_OFFSET(rm | REX_B(s));
+                    op2_offset = XMM_OFFSET(rm | REX_B(s));
                 } else {
                     op2_offset = offsetof(CPUX86State,xmm_t0);
                     gen_lea_modrm(env, s, modrm);
@@ -4516,7 +4518,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
             if (mod == 3) {
                 op2_offset = ZMM_OFFSET(rm | REX_B(s));
             } else {
-                op2_offset = offsetof(CPUX86State, xmm_t0);
+                op2_offset = offsetof(CPUX86State, xmm_t0.ZMM_X(0));
                 gen_lea_modrm(env, s, modrm);
                 gen_ldo_env_A0(s, op2_offset);
             }
@@ -4625,7 +4627,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                     break;
                 default:
                     /* 128 bit access */
-                    gen_ldo_env_A0(s, op2_offset);
+                    gen_ldo_env_A0(s, offsetof(CPUX86State, xmm_t0.ZMM_X(0)));
                     break;
                 }
             } else {
-- 
2.37.2





* [PATCH 03/37] target/i386: REPZ and REPNZ are mutually exclusive
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
  2022-09-11 23:03 ` [PATCH 01/37] target/i386: Define XMMReg and access macros, align ZMM registers Paolo Bonzini
  2022-09-11 23:03 ` [PATCH 02/37] target/i386: make ldo/sto operations consistent with ldq Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12  8:37   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 04/37] target/i386: introduce insn_get_addr Paolo Bonzini
                   ` (33 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

If both prefixes are present, the later one wins; make that visible in
s->prefix too.  For instructions where F2/F3 act as mandatory prefixes,
this means that e.g. the sequence f3 f2 0f 58 decodes as ADDSD, not
ADDSS, so PREFIX_REPZ must be cleared when F2 is seen.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/translate.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 9a85010dcd..f8fd93dae0 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4737,9 +4737,11 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
     switch (b) {
     case 0xf3:
         prefixes |= PREFIX_REPZ;
+        prefixes &= ~PREFIX_REPNZ;
         goto next_byte;
     case 0xf2:
         prefixes |= PREFIX_REPNZ;
+        prefixes &= ~PREFIX_REPZ;
         goto next_byte;
     case 0xf0:
         prefixes |= PREFIX_LOCK;
-- 
2.37.2





* [PATCH 04/37] target/i386: introduce insn_get_addr
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (2 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 03/37] target/i386: REPZ and REPNZ are mutually exclusive Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12  8:39   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 05/37] target/i386: add core of new i386 decoder Paolo Bonzini
                   ` (32 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

The "O" operand type in the Intel SDM needs to load an 8- to 64-bit
unsigned value, while insn_get is limited to 32 bits.  Extract the code
out of disas_insn and into a separate function.
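
For context, the new decoder introduced later in the series calls this
helper when it decodes the "O" operand type; the shape of that code
(quoted from patch 5) is roughly:

    decode->mem = (AddressParts) {
        .def_seg = R_DS,
        .base = -1,
        .index = -1,
        .disp = insn_get_addr(env, s, s->aflag)
    };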

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/translate.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index f8fd93dae0..f1aa830fcc 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2308,6 +2308,31 @@ static void gen_ldst_modrm(CPUX86State *env, DisasContext *s, int modrm,
     }
 }
 
+static inline target_ulong insn_get_addr(CPUX86State *env, DisasContext *s, MemOp ot)
+{
+    target_ulong ret;
+
+    switch (ot) {
+    case MO_8:
+        ret = x86_ldub_code(env, s);
+        break;
+    case MO_16:
+        ret = x86_lduw_code(env, s);
+        break;
+    case MO_32:
+        ret = x86_ldl_code(env, s);
+        break;
+#ifdef TARGET_X86_64
+    case MO_64:
+        ret = x86_ldq_code(env, s);
+        break;
+#endif
+    default:
+        tcg_abort();
+    }
+    return ret;
+}
+
 static inline uint32_t insn_get(CPUX86State *env, DisasContext *s, MemOp ot)
 {
     uint32_t ret;
@@ -5867,16 +5892,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             target_ulong offset_addr;
 
             ot = mo_b_d(b, dflag);
-            switch (s->aflag) {
-#ifdef TARGET_X86_64
-            case MO_64:
-                offset_addr = x86_ldq_code(env, s);
-                break;
-#endif
-            default:
-                offset_addr = insn_get(env, s, s->aflag);
-                break;
-            }
+            offset_addr = insn_get_addr(env, s, s->aflag);
             tcg_gen_movi_tl(s->A0, offset_addr);
             gen_add_A0_ds_seg(s);
             if ((b & 2) == 0) {
-- 
2.37.2





* [PATCH 05/37] target/i386: add core of new i386 decoder
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (3 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 04/37] target/i386: introduce insn_get_addr Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12  9:27   ` Richard Henderson
  2022-09-12 10:54   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 06/37] target/i386: add ALU load/writeback core Paolo Bonzini
                   ` (31 subsequent siblings)
  36 siblings, 2 replies; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

The new decoder is based on three principles:

- use mostly table-driven decoding, using tables derived as much as possible
  from the Intel manual.  Centralizing the decoding of the operands makes
  it more homogeneous; for example, all immediates are signed.  All modrm
  handling is in one function, and can be shared between SSE and ALU
  instructions (including XMM<->GPR instructions).  The SSE/AVX decoder
  will also not have duplicated code between the 0F, 0F38 and 0F3A tables.

- keep the code as "non-branchy" as possible.  Generally, the code for
  the new decoder is more verbose, but the control flow is simpler.
  Conditionals are not nested and have small bodies.  All instruction
  groups are resolved even before operands are decoded, and code
  generation is separated as much as possible within small functions
  that only handle one instruction each.

- keep address generation and (for ALU operands) memory loads and writeback
  as much in common code as possible.  All ALU operations for example
  are implemented as T0=f(T0,T1).  For non-ALU instructions,
  read-modify-write memory operations are rare, but registers do not
  have TCGv equivalents: therefore, the common logic sets up pointer
  temporaries with the operands, while load and writeback are handled
  by gvec or by helpers.

These principles make future code review and extensibility simpler, at
the cost of having a relatively large amount of code in the form of this
patch.  Even EVEX should not be _too_ hard to implement (it's just a crazy
large amount of possibilities).
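
To make the table-driven style concrete: with the macros defined in this
patch, a hypothetical row for ADD Gv, Ev (not part of this patch; the
one-byte opcodes are left out, and gen_ADD does not exist yet) would be
written as

    [0x03] = X86_OP_ENTRY2(ADD, G,v, E,v),

mirroring the "ADD Gv, Ev" line of the SDM opcode map.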

This patch introduces the main decoder flow, and integrates the old
decoder with the new one.  The old decoder takes care of parsing
prefixes and then optionally drops to the new one.  The changes to the
old decoder are minimal and allow it to be replaced incrementally with
the new one.

There is a debugging mechanism through a "LIMIT" environment variable.
In user-mode emulation, the variable is the number of instructions
decoded by the new decoder before permanently switching to the old one.
In system emulation, the variable is the highest opcode that is decoded
by the new decoder (this is less friendly, but it's the best that can
be done without requiring deterministic execution).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc | 752 +++++++++++++++++++++++++++++++
 target/i386/tcg/decode-new.h     | 181 ++++++++
 target/i386/tcg/emit.c.inc       |  31 ++
 target/i386/tcg/translate.c      |  64 ++-
 4 files changed, 1021 insertions(+), 7 deletions(-)
 create mode 100644 target/i386/tcg/decode-new.c.inc
 create mode 100644 target/i386/tcg/decode-new.h
 create mode 100644 target/i386/tcg/emit.c.inc

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
new file mode 100644
index 0000000000..de8ef51a2d
--- /dev/null
+++ b/target/i386/tcg/decode-new.c.inc
@@ -0,0 +1,752 @@
+/*
+ * New-style decoder for i386 instructions
+ *
+ *  Copyright (c) 2022 Red Hat, Inc.
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * The decoder is mostly based on tables copied from the Intel SDM.  As
+ * a result, most operand load and writeback is done entirely in common
+ * table-driven code using the same operand type (X86_TYPE_*) and
+ * size (X86_SIZE_*) codes used in the manual.
+ *
+ * The main difference is that the V, U and W types are extended to
+ * cover MMX as well; if an instruction is like
+ *
+ *      por   Pq, Qq
+ *  66  por   Vx, Hx, Wx
+ *
+ * only the second row is included and the instruction is marked as a
+ * valid MMX instruction.  The MMX flag directs the decoder to rewrite
+ * the V/U/H/W types to P/N/P/Q if there is no prefix, as well as changing
+ * "x" to "q" if there is no prefix.
+ *
+ * In addition, the ss/ps/sd/pd types are sometimes mushed together as "x"
+ * if the difference is expressed via prefixes.  Individual instructions
+ * are separated by prefix in the generator functions.
+ *
+ * There are a couple cases in which instructions (e.g. MOVD) write the
+ * whole XMM or MM register but are established incorrectly in the manual
+ * as "d" or "q".  These have to be fixed for the decoder to work correctly.
+ */
+
+#define X86_OP_NONE { 0 },
+
+#define X86_OP_GROUP3(op, op0_, s0_, op1_, s1_, op2_, s2_, ...) { \
+    .decode = glue(decode_, op),                                  \
+    .op0 = glue(X86_TYPE_, op0_),                                 \
+    .s0 = glue(X86_SIZE_, s0_),                                   \
+    .op1 = glue(X86_TYPE_, op1_),                                 \
+    .s1 = glue(X86_SIZE_, s1_),                                   \
+    .op2 = glue(X86_TYPE_, op2_),                                 \
+    .s2 = glue(X86_SIZE_, s2_),                                   \
+    .is_decode = true,                                            \
+    ## __VA_ARGS__                                                \
+}
+
+#define X86_OP_GROUP0(op, ...)                                    \
+    X86_OP_GROUP3(op, None, None, None, None, None, None, ## __VA_ARGS__)
+
+#define X86_OP_ENTRY3(op, op0_, s0_, op1_, s1_, op2_, s2_, ...) { \
+    .gen = glue(gen_, op),                                        \
+    .op0 = glue(X86_TYPE_, op0_),                                 \
+    .s0 = glue(X86_SIZE_, s0_),                                   \
+    .op1 = glue(X86_TYPE_, op1_),                                 \
+    .s1 = glue(X86_SIZE_, s1_),                                   \
+    .op2 = glue(X86_TYPE_, op2_),                                 \
+    .s2 = glue(X86_SIZE_, s2_),                                   \
+    ## __VA_ARGS__                                                \
+}
+
+#define X86_OP_ENTRY4(op, op0_, s0_, op1_, s1_, op2_, s2_, ...)   \
+    X86_OP_ENTRY3(op, op0_, s0_, op1_, s1_, op2_, s2_,            \
+        .op3 = X86_TYPE_I, .s3 = X86_SIZE_b,                      \
+        ## __VA_ARGS__)
+
+#define X86_OP_ENTRY2(op, op0, s0, op1, s1, ...)                  \
+    X86_OP_ENTRY3(op, op0, s0, 2op, s0, op1, s1, ## __VA_ARGS__)
+#define X86_OP_ENTRY0(op, ...)                                    \
+    X86_OP_ENTRY3(op, None, None, None, None, None, None, ## __VA_ARGS__)
+
+#define i64 .special = X86_SPECIAL_i64,
+#define o64 .special = X86_SPECIAL_o64,
+#define xchg .special = X86_SPECIAL_Locked,
+#define mmx .special = X86_SPECIAL_MMX,
+#define zext0 .special = X86_SPECIAL_ZExtOp0,
+#define zext2 .special = X86_SPECIAL_ZExtOp2,
+
+static uint8_t get_modrm(DisasContext *s, CPUX86State *env)
+{
+    if (!s->has_modrm) {
+        s->modrm = x86_ldub_code(env, s);
+        s->has_modrm = true;
+    }
+    return s->modrm;
+}
+
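+/*
+ * The opcode tables below are mostly empty at this point of the series;
+ * later patches fill in the SSE/AVX rows.
+ */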
+static const X86OpEntry opcodes_0F38_00toEF[240] = {
+};
+
+/* five rows for no prefix, 66, F3, F2, 66+F2  */
+static X86OpEntry opcodes_0F38_F0toFF[16][5] = {
+};
+
+static void decode_0F38(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    *b = x86_ldub_code(env, s);
+    if (*b < 0xf0) {
+        *entry = opcodes_0F38_00toEF[*b];
+    } else {
+        int row = 0;
+        if (s->prefix & PREFIX_REPZ) {
+            /* The REPZ (F3) prefix has priority over 66 */
+            row = 2;
+        } else {
+            row += s->prefix & PREFIX_REPNZ ? 3 : 0;
+            row += s->prefix & PREFIX_DATA ? 1 : 0;
+        }
+        *entry = opcodes_0F38_F0toFF[*b & 15][row];
+    }
+}
+
+static const X86OpEntry opcodes_0F3A[256] = {
+};
+
+static void decode_0F3A(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    *b = x86_ldub_code(env, s);
+    *entry = opcodes_0F3A[*b];
+}
+
+static const X86OpEntry opcodes_0F[256] = {
+    [0x38] = X86_OP_GROUP0(0F38),
+    [0x3a] = X86_OP_GROUP0(0F3A),
+};
+
+static void do_decode_0F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    *entry = opcodes_0F[*b];
+}
+
+static void decode_0F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    *b = x86_ldub_code(env, s);
+    do_decode_0F(s, env, entry, b);
+}
+
+static const X86OpEntry opcodes_root[256] = {
+    [0x0F] = X86_OP_GROUP0(0F),
+};
+
+#undef mmx
+
+/*
+ * Decode the fixed part of the opcode and place the last
+ * byte of the opcode in *b.
+ */
+static void decode_root(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    *entry = opcodes_root[*b];
+}
+
+
+static int decode_modrm(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                        X86DecodedOp *op, X86OpType type)
+{
+    int modrm = get_modrm(s, env);
+    if ((modrm >> 6) == 3) {
+        if (s->prefix & PREFIX_LOCK) {
+            decode->e.gen = gen_illegal;
+            return 0xff;
+        }
+        op->n = (modrm & 7);
+        if (type != X86_TYPE_Q && type != X86_TYPE_N) {
+            op->n |= REX_B(s);
+        }
+    } else {
+        op->has_ea = true;
+        op->n = -1;
+        decode->mem = gen_lea_modrm_0(env, s, get_modrm(s, env));
+    }
+    return modrm;
+}
+
+static bool decode_op_size(DisasContext *s, X86OpEntry *e, X86OpSize size, MemOp *ot)
+{
+    switch (size) {
+    case X86_SIZE_b:  /* byte */
+        *ot = MO_8;
+        return true;
+
+    case X86_SIZE_d:  /* 32-bit */
+    case X86_SIZE_ss: /* SSE/AVX scalar single precision */
+        *ot = MO_32;
+        return true;
+
+    case X86_SIZE_p:  /* Far pointer, return offset size */
+    case X86_SIZE_s:  /* Descriptor, return offset size */
+    case X86_SIZE_v:  /* 16/32/64-bit, based on operand size */
+        *ot = s->dflag;
+        return true;
+
+    case X86_SIZE_pi: /* MMX */
+    case X86_SIZE_q:  /* 64-bit */
+    case X86_SIZE_sd: /* SSE/AVX scalar double precision */
+        *ot = MO_64;
+        return true;
+
+    case X86_SIZE_w:  /* 16-bit */
+        *ot = MO_16;
+        return true;
+
+    case X86_SIZE_y:  /* 32/64-bit, based on operand size */
+        *ot = s->dflag == MO_16 ? MO_32 : s->dflag;
+        return true;
+
+    case X86_SIZE_z:  /* 16-bit for 16-bit operand size, else 32-bit */
+        *ot = s->dflag == MO_16 ? MO_16 : MO_32;
+        return true;
+
+    case X86_SIZE_dq: /* SSE/AVX 128-bit */
+        if (e->special == X86_SPECIAL_MMX &&
+            !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
+            *ot = MO_64;
+            return true;
+        }
+        if (s->vex_l && e->s0 != X86_SIZE_qq) {
+            return false;
+        }
+        *ot = MO_128;
+        return true;
+
+    case X86_SIZE_qq: /* AVX 256-bit */
+        if (!s->vex_l) {
+            return false;
+        }
+        *ot = MO_256;
+        return true;
+
+    case X86_SIZE_x:  /* 128/256-bit, based on operand size */
+        if (e->special == X86_SPECIAL_MMX &&
+            !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
+            *ot = MO_64;
+            return true;
+        }
+        /* fall through */
+    case X86_SIZE_ps: /* SSE/AVX packed single precision */
+    case X86_SIZE_pd: /* SSE/AVX packed double precision */
+        *ot = s->vex_l ? MO_256 : MO_128;
+        return true;
+
+    case X86_SIZE_d64:  /* Default to 64-bit in 64-bit mode */
+        *ot = CODE64(s) && s->dflag == MO_32 ? MO_64 : s->dflag;
+        return true;
+
+    case X86_SIZE_f64:  /* Ignore size override prefix in 64-bit mode */
+        *ot = CODE64(s) ? MO_64 : s->dflag;
+        return true;
+
+    default:
+        *ot = -1;
+        return true;
+    }
+}
+
+static bool decode_op(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                      X86DecodedOp *op, X86OpType type, int b)
+{
+    int modrm;
+
+    switch (type) {
+    case X86_TYPE_A:  /* Implicit */
+    case X86_TYPE_F:  /* EFLAGS/RFLAGS */
+        break;
+
+    case X86_TYPE_B:  /* VEX.vvvv selects a GPR */
+        op->unit = X86_OP_INT;
+        op->n = s->vex_v;
+        break;
+
+    case X86_TYPE_C:  /* REG in the modrm byte selects a control register */
+        op->unit = X86_OP_CR;
+        goto get_reg;
+
+    case X86_TYPE_D:  /* REG in the modrm byte selects a debug register */
+        op->unit = X86_OP_DR;
+        goto get_reg;
+
+    case X86_TYPE_G:  /* REG in the modrm byte selects a GPR */
+        op->unit = X86_OP_INT;
+        goto get_reg;
+
+    case X86_TYPE_S:  /* reg selects a segment register */
+        op->unit = X86_OP_SEG;
+        goto get_reg;
+
+
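+    /*
+     * Note: the P/Q/N case labels below are intentionally nested inside
+     * the "if" statements of the V/W/U cases, so that the MMX types jump
+     * straight to the branch that selects X86_OP_MMX and then share the
+     * rest of the logic.
+     */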
+    case X86_TYPE_V:  /* reg in the modrm byte selects an XMM/YMM register */
+        if (decode->e.special == X86_SPECIAL_MMX &&
+            !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
+    case X86_TYPE_P:  /* reg in the modrm byte selects an MMX register */
+            op->unit = X86_OP_MMX;
+        } else {
+            op->unit = X86_OP_SSE;
+        }
+    get_reg:
+        op->n = ((get_modrm(s, env) >> 3) & 7) | REX_R(s);
+        break;
+
+    case X86_TYPE_E:  /* ALU modrm operand */
+        op->unit = X86_OP_INT;
+        goto get_modrm;
+
+    case X86_TYPE_W:  /* XMM/YMM modrm operand */
+        if (decode->e.special == X86_SPECIAL_MMX &&
+            !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
+    case X86_TYPE_Q:  /* MMX modrm operand */
+            op->unit = X86_OP_MMX;
+        } else {
+            op->unit = X86_OP_SSE;
+        }
+        goto get_modrm;
+
+    case X86_TYPE_U:  /* R/M in the modrm byte selects an XMM/YMM register */
+        if (decode->e.special == X86_SPECIAL_MMX &&
+            !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
+    case X86_TYPE_N:  /* R/M in the modrm byte selects an MMX register */
+            op->unit = X86_OP_MMX;
+        } else {
+            op->unit = X86_OP_SSE;
+        }
+        goto get_modrm_reg;
+
+    case X86_TYPE_R:  /* R/M in the modrm byte selects a register */
+        op->unit = X86_OP_INT;
+    get_modrm_reg:
+        modrm = get_modrm(s, env);
+        if ((modrm >> 6) != 3) {
+            return false;
+        }
+        goto get_modrm;
+
+    case X86_TYPE_M:  /* modrm byte selects a memory operand */
+        modrm = get_modrm(s, env);
+        if ((modrm >> 6) == 3) {
+            return false;
+        }
+    get_modrm:
+        decode_modrm(s, env, decode, op, type);
+        break;
+
+    case X86_TYPE_O:  /* Absolute address encoded in the instruction */
+        op->unit = X86_OP_INT;
+        op->has_ea = true;
+        op->n = -1;
+        decode->mem = (AddressParts) {
+            .def_seg = R_DS,
+            .base = -1,
+            .index = -1,
+            .disp = insn_get_addr(env, s, s->aflag)
+        };
+        break;
+
+    case X86_TYPE_H:  /* For AVX, VEX.vvvv selects an XMM/YMM register */
+        if ((s->prefix & PREFIX_VEX)) {
+            op->unit = X86_OP_SSE;
+            op->n = s->vex_v;
+            break;
+        }
+        if (op == &decode->op[0]) {
+            /* shifts place the destination in VEX.vvvv, use modrm */
+            return decode_op(s, env, decode, op, decode->e.op1, b);
+        } else {
+            return decode_op(s, env, decode, op, decode->e.op0, b);
+        }
+
+    case X86_TYPE_I:  /* Immediate */
+        op->unit = X86_OP_IMM;
+        decode->immediate = insn_get_signed(env, s, op->ot);
+        break;
+
+    case X86_TYPE_J:  /* Relative offset for a jump */
+        op->unit = X86_OP_IMM;
+        decode->immediate = insn_get_signed(env, s, op->ot);
+        decode->immediate += s->pc - s->cs_base;
+        if (s->dflag == MO_16) {
+            decode->immediate &= 0xffff;
+        } else if (!CODE64(s)) {
+            decode->immediate &= 0xffffffffu;
+        }
+        break;
+
+    case X86_TYPE_L:  /* The upper 4 bits of the immediate select a 128-bit register */
+        op->n = insn_get(env, s, op->ot) >> 4;
+        break;
+
+    case X86_TYPE_X:  /* string source */
+        op->n = -1;
+        decode->mem = (AddressParts) {
+            .def_seg = R_DS,
+            .base = R_ESI,
+            .index = -1,
+        };
+        break;
+
+    case X86_TYPE_Y:  /* string destination */
+        op->n = -1;
+        decode->mem = (AddressParts) {
+            .def_seg = R_ES,
+            .base = R_EDI,
+            .index = -1,
+        };
+        break;
+
+    case X86_TYPE_2op:
+        *op = decode->op[0];
+        break;
+
+    case X86_TYPE_LoBits:
+        op->n = (b & 7) | REX_B(s);
+        op->unit = X86_OP_INT;
+        break;
+
+    case X86_TYPE_0 ... X86_TYPE_7:
+        op->n = type - X86_TYPE_0;
+        op->unit = X86_OP_INT;
+        break;
+
+    case X86_TYPE_ES ... X86_TYPE_GS:
+        op->n = type - X86_TYPE_ES;
+        op->unit = X86_OP_SEG;
+        break;
+
+    default:
+        abort();
+    }
+
+    return true;
+}
+
+static bool decode_insn(DisasContext *s, CPUX86State *env, X86DecodeFunc decode_func,
+                        X86DecodedInsn *decode)
+{
+    X86OpEntry *e = &decode->e;
+
+    decode_func(s, env, e, &decode->b);
+    while (e->is_decode) {
+        e->is_decode = false;
+        e->decode(s, env, e, &decode->b);
+    }
+
+    /* First compute size of operands in order to initialize s->rip_offset.  */
+    if (e->op0 != X86_TYPE_None) {
+        if (!decode_op_size(s, e, e->s0, &decode->op[0].ot)) {
+            return false;
+        }
+        if (e->op0 == X86_TYPE_I) {
+            s->rip_offset += 1 << decode->op[0].ot;
+        }
+    }
+    if (e->op1 != X86_TYPE_None) {
+        if (!decode_op_size(s, e, e->s1, &decode->op[1].ot)) {
+            return false;
+        }
+        if (e->op1 == X86_TYPE_I) {
+            s->rip_offset += 1 << decode->op[1].ot;
+        }
+    }
+    if (e->op2 != X86_TYPE_None) {
+        if (!decode_op_size(s, e, e->s2, &decode->op[2].ot)) {
+            return false;
+        }
+        if (e->op2 == X86_TYPE_I) {
+            s->rip_offset += 1 << decode->op[2].ot;
+        }
+    }
+    if (e->op3 != X86_TYPE_None) {
+        assert(e->op3 == X86_TYPE_I && e->s3 == X86_SIZE_b);
+        s->rip_offset += 1;
+    }
+
+    if (e->op0 != X86_TYPE_None &&
+        !decode_op(s, env, decode, &decode->op[0], e->op0, decode->b)) {
+        return false;
+    }
+
+    if (e->op1 != X86_TYPE_None &&
+        !decode_op(s, env, decode, &decode->op[1], e->op1, decode->b)) {
+        return false;
+    }
+
+    if (e->op2 != X86_TYPE_None &&
+        !decode_op(s, env, decode, &decode->op[2], e->op2, decode->b)) {
+        return false;
+    }
+
+    if (e->op3 != X86_TYPE_None) {
+        decode->immediate = insn_get_signed(env, s, MO_8);
+    }
+
+    return true;
+}
+
+/* convert one instruction. s->base.is_jmp is set if the translation must
+   be stopped. Return the next pc value */
+static target_ulong disas_insn_new(DisasContext *s, CPUState *cpu, int b)
+{
+    CPUX86State *env = cpu->env_ptr;
+    bool first = true;
+    X86DecodedInsn decode;
+    X86DecodeFunc decode_func = decode_root;
+
+#ifdef CONFIG_USER_ONLY
+    if (limit) { --limit; }
+#endif
+    s->has_modrm = false;
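+    /*
+     * The block below is compiled out for now: prefixes are still parsed
+     * by disas_insn(), which optionally hands over to this function.  The
+     * code becomes live once the new decoder is entered directly.
+     */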
+#if 0
+    s->pc_start = s->pc = s->base.pc_next;
+    s->override = -1;
+#ifdef TARGET_X86_64
+    s->rex_w = false;
+    s->rex_r = 0;
+    s->rex_x = 0;
+    s->rex_b = 0;
+#endif
+    s->prefix = 0;
+    s->rip_offset = 0; /* for relative ip address */
+    s->vex_l = 0;
+    s->vex_v = 0;
+    if (sigsetjmp(s->jmpbuf, 0) != 0) {
+        gen_exception_gpf(s);
+        return s->pc;
+    }
+#endif
+
+ next_byte:
+    if (first) {
+        first = false;
+    } else {
+        b = x86_ldub_code(env, s);
+    }
+    /* Collect prefixes.  */
+    switch (b) {
+    case 0xf3:
+        s->prefix |= PREFIX_REPZ;
+        s->prefix &= ~PREFIX_REPNZ;
+        goto next_byte;
+    case 0xf2:
+        s->prefix |= PREFIX_REPNZ;
+        s->prefix &= ~PREFIX_REPZ;
+        goto next_byte;
+    case 0xf0:
+        s->prefix |= PREFIX_LOCK;
+        goto next_byte;
+    case 0x2e:
+        s->override = R_CS;
+        goto next_byte;
+    case 0x36:
+        s->override = R_SS;
+        goto next_byte;
+    case 0x3e:
+        s->override = R_DS;
+        goto next_byte;
+    case 0x26:
+        s->override = R_ES;
+        goto next_byte;
+    case 0x64:
+        s->override = R_FS;
+        goto next_byte;
+    case 0x65:
+        s->override = R_GS;
+        goto next_byte;
+    case 0x66:
+        s->prefix |= PREFIX_DATA;
+        goto next_byte;
+    case 0x67:
+        s->prefix |= PREFIX_ADR;
+        goto next_byte;
+#ifdef TARGET_X86_64
+    case 0x40 ... 0x4f:
+        if (CODE64(s)) {
+            /* REX prefix */
+            s->prefix |= PREFIX_REX;
+            s->rex_w = (b >> 3) & 1;
+            s->rex_r = (b & 0x4) << 1;
+            s->rex_x = (b & 0x2) << 2;
+            s->rex_b = (b & 0x1) << 3;
+            goto next_byte;
+        }
+        break;
+#endif
+    case 0xc5: /* 2-byte VEX */
+    case 0xc4: /* 3-byte VEX */
+        /* VEX prefixes cannot be used except in 32-bit mode.
+           Otherwise the instruction is LES or LDS.  */
+        if (CODE32(s) && !VM86(s)) {
+            static const int pp_prefix[4] = {
+                0, PREFIX_DATA, PREFIX_REPZ, PREFIX_REPNZ
+            };
+            int vex3, vex2 = x86_ldub_code(env, s);
+
+            if (!CODE64(s) && (vex2 & 0xc0) != 0xc0) {
+                /* 4.1.4.6: In 32-bit mode, bits [7:6] must be 11b,
+                   otherwise the instruction is LES or LDS.  */
+                s->pc--; /* rewind the advance_pc() x86_ldub_code() did */
+                break;
+            }
+
+            /* 4.1.1-4.1.3: No preceding lock, 66, f2, f3, or rex prefixes. */
+            if (s->prefix & (PREFIX_REPZ | PREFIX_REPNZ
+                             | PREFIX_LOCK | PREFIX_DATA | PREFIX_REX)) {
+                goto illegal_op;
+            }
+#ifdef TARGET_X86_64
+            s->rex_r = (~vex2 >> 4) & 8;
+#endif
+            if (b == 0xc5) {
+                /* 2-byte VEX prefix: RVVVVlpp, implied 0f leading opcode byte */
+                vex3 = vex2;
+                decode_func = decode_0F;
+            } else {
+                /* 3-byte VEX prefix: RXBmmmmm wVVVVlpp */
+                vex3 = x86_ldub_code(env, s);
+#ifdef TARGET_X86_64
+                s->rex_x = (~vex2 >> 3) & 8;
+                s->rex_b = (~vex2 >> 2) & 8;
+                s->rex_w = (vex3 >> 7) & 1;
+#endif
+                switch (vex2 & 0x1f) {
+                case 0x01: /* Implied 0f leading opcode bytes.  */
+                    decode_func = decode_0F;
+                    break;
+                case 0x02: /* Implied 0f 38 leading opcode bytes.  */
+                    decode_func = decode_0F38;
+                    break;
+                case 0x03: /* Implied 0f 3a leading opcode bytes.  */
+                    decode_func = decode_0F3A;
+                    break;
+                default:   /* Reserved for future use.  */
+                    goto unknown_op;
+                }
+            }
+            s->vex_v = (~vex3 >> 3) & 0xf;
+            s->vex_l = (vex3 >> 2) & 1;
+            s->prefix |= pp_prefix[vex3 & 3] | PREFIX_VEX;
+        }
+        break;
+    default:
+        if (b >= 0x100) {
+            b -= 0x100;
+            decode_func = do_decode_0F;
+        }
+        break;
+    }
+
+    /* Post-process prefixes.  */
+    if (CODE64(s)) {
+        /* In 64-bit mode, the default data size is 32-bit.  Select 64-bit
+           data with rex_w, and 16-bit data with 0x66; rex_w takes precedence
+           over 0x66 if both are present.  */
+        s->dflag = (REX_W(s) ? MO_64 : s->prefix & PREFIX_DATA ? MO_16 : MO_32);
+        /* In 64-bit mode, 0x67 selects 32-bit addressing.  */
+        s->aflag = (s->prefix & PREFIX_ADR ? MO_32 : MO_64);
+    } else {
+        /* In 16/32-bit mode, 0x66 selects the opposite data size.  */
+        if (CODE32(s) ^ ((s->prefix & PREFIX_DATA) != 0)) {
+            s->dflag = MO_32;
+        } else {
+            s->dflag = MO_16;
+        }
+        /* In 16/32-bit mode, 0x67 selects the opposite addressing.  */
+        if (CODE32(s) ^ ((s->prefix & PREFIX_ADR) != 0)) {
+            s->aflag = MO_32;
+        }  else {
+            s->aflag = MO_16;
+        }
+    }
+
+    memset(&decode, 0, sizeof(decode));
+    decode.b = b;
+    if (!decode_insn(s, env, decode_func, &decode)) {
+        goto illegal_op;
+    }
+    if (!decode.e.gen) {
+        goto unknown_op;
+    }
+
+    switch (decode.e.special) {
+    case X86_SPECIAL_None:
+        break;
+
+    case X86_SPECIAL_Locked:
+        if (decode.op[0].has_ea) {
+            s->prefix |= PREFIX_LOCK;
+        }
+        break;
+
+    case X86_SPECIAL_ProtMode:
+        if (!PE(s) || VM86(s)) {
+            goto illegal_op;
+        }
+        break;
+
+    case X86_SPECIAL_i64:
+        if (CODE64(s)) {
+            goto illegal_op;
+        }
+        break;
+    case X86_SPECIAL_o64:
+        if (!CODE64(s)) {
+            goto illegal_op;
+        }
+        break;
+
+    case X86_SPECIAL_ZExtOp0:
+        assert(decode.op[0].unit == X86_OP_INT);
+        if (!decode.op[0].has_ea) {
+            decode.op[0].ot = MO_32;
+        }
+        break;
+
+    case X86_SPECIAL_ZExtOp2:
+        assert(decode.op[2].unit == X86_OP_INT);
+        if (!decode.op[2].has_ea) {
+            decode.op[2].ot = MO_32;
+        }
+        break;
+
+    case X86_SPECIAL_MMX:
+        if (!(s->prefix & (PREFIX_REPZ | PREFIX_REPNZ | PREFIX_DATA))) {
+            gen_helper_enter_mmx(cpu_env);
+        }
+        break;
+    }
+
+    if (decode.op[0].has_ea || decode.op[1].has_ea || decode.op[2].has_ea) {
+        gen_load_ea(s, &decode.mem);
+    }
+    decode.e.gen(s, env, &decode);
+    return s->pc;
+ illegal_op:
+    gen_illegal_opcode(s);
+    return s->pc;
+ unknown_op:
+    gen_unknown_opcode(env, s);
+    return s->pc;
+}
diff --git a/target/i386/tcg/decode-new.h b/target/i386/tcg/decode-new.h
new file mode 100644
index 0000000000..fb44560aae
--- /dev/null
+++ b/target/i386/tcg/decode-new.h
@@ -0,0 +1,181 @@
+/*
+ * Decode table flags, mostly based on Intel SDM.
+ *
+ *  Copyright (c) 2022 Red Hat, Inc.
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+typedef enum X86OpType {
+    X86_TYPE_None,
+
+    X86_TYPE_A, /* Implicit */
+    X86_TYPE_B, /* VEX.vvvv selects a GPR */
+    X86_TYPE_C, /* REG in the modrm byte selects a control register */
+    X86_TYPE_D, /* REG in the modrm byte selects a debug register */
+    X86_TYPE_E, /* ALU modrm operand */
+    X86_TYPE_F, /* EFLAGS/RFLAGS */
+    X86_TYPE_G, /* REG in the modrm byte selects a GPR */
+    X86_TYPE_H, /* For AVX, VEX.vvvv selects an XMM/YMM register */
+    X86_TYPE_I, /* Immediate */
+    X86_TYPE_J, /* Relative offset for a jump */
+    X86_TYPE_L, /* The upper 4 bits of the immediate select a 128-bit register */
+    X86_TYPE_M, /* modrm byte selects a memory operand */
+    X86_TYPE_N, /* R/M in the modrm byte selects an MMX register */
+    X86_TYPE_O, /* Absolute address encoded in the instruction */
+    X86_TYPE_P, /* reg in the modrm byte selects an MMX register */
+    X86_TYPE_Q, /* MMX modrm operand */
+    X86_TYPE_R, /* R/M in the modrm byte selects a register */
+    X86_TYPE_S, /* reg selects a segment register */
+    X86_TYPE_U, /* R/M in the modrm byte selects an XMM/YMM register */
+    X86_TYPE_V, /* reg in the modrm byte selects an XMM/YMM register */
+    X86_TYPE_W, /* XMM/YMM modrm operand */
+    X86_TYPE_X, /* string source */
+    X86_TYPE_Y, /* string destination */
+
+    /* Custom */
+    X86_TYPE_2op, /* 2-operand RMW instruction */
+    X86_TYPE_LoBits, /* encoded in bits 0-2 of the operand + REX.B */
+    X86_TYPE_0, /* Hard-coded GPRs (RAX..RDI) */
+    X86_TYPE_1,
+    X86_TYPE_2,
+    X86_TYPE_3,
+    X86_TYPE_4,
+    X86_TYPE_5,
+    X86_TYPE_6,
+    X86_TYPE_7,
+    X86_TYPE_ES, /* Hard-coded segment registers */
+    X86_TYPE_CS,
+    X86_TYPE_SS,
+    X86_TYPE_DS,
+    X86_TYPE_FS,
+    X86_TYPE_GS,
+} X86OpType;
+
+typedef enum X86OpSize {
+    X86_SIZE_None,
+
+    X86_SIZE_a,  /* BOUND operand */
+    X86_SIZE_b,  /* byte */
+    X86_SIZE_d,  /* 32-bit */
+    X86_SIZE_dq, /* SSE/AVX 128-bit */
+    X86_SIZE_p,  /* Far pointer */
+    X86_SIZE_pd, /* SSE/AVX packed double precision */
+    X86_SIZE_pi, /* MMX */
+    X86_SIZE_ps, /* SSE/AVX packed single precision */
+    X86_SIZE_q,  /* 64-bit */
+    X86_SIZE_qq, /* AVX 256-bit */
+    X86_SIZE_s,  /* Descriptor */
+    X86_SIZE_sd, /* SSE/AVX scalar double precision */
+    X86_SIZE_ss, /* SSE/AVX scalar single precision */
+    X86_SIZE_si, /* 32-bit GPR */
+    X86_SIZE_v,  /* 16/32/64-bit, based on operand size */
+    X86_SIZE_w,  /* 16-bit */
+    X86_SIZE_x,  /* 128/256-bit, based on operand size */
+    X86_SIZE_y,  /* 32/64-bit, based on operand size */
+    X86_SIZE_z,  /* 16-bit for 16-bit operand size, else 32-bit */
+
+    /* Custom */
+    X86_SIZE_d64,
+    X86_SIZE_f64,
+} X86OpSize;
+
+/* Execution flags */
+
+typedef enum X86OpUnit {
+    X86_OP_SKIP,    /* not valid or managed by emission function */
+    X86_OP_SEG,     /* segment selector */
+    X86_OP_CR,      /* control register */
+    X86_OP_DR,      /* debug register */
+    X86_OP_INT,     /* loaded into/stored from s->T0/T1 */
+    X86_OP_IMM,     /* immediate */
+    X86_OP_SSE,     /* address in either s->ptrX or s->A0 depending on has_ea */
+    X86_OP_MMX,     /* address in either s->ptrX or s->A0 depending on has_ea */
+} X86OpUnit;
+
+typedef enum X86InsnSpecial {
+    X86_SPECIAL_None,
+
+    /* Always locked if it has a memory operand (XCHG) */
+    X86_SPECIAL_Locked,
+
+    /* Fault outside protected mode */
+    X86_SPECIAL_ProtMode,
+
+    /*
+     * Register operand 0/2 is zero extended to 32 bits.  Rd/Mb or Rd/Mw
+     * in the manual.
+     */
+    X86_SPECIAL_ZExtOp0,
+    X86_SPECIAL_ZExtOp2,
+
+    /*
+     * MMX instruction exists with no prefix; if there is no prefix, V/H/W/U operands
+     * become P/P/Q/N, and size "x" becomes "q".
+     */
+    X86_SPECIAL_MMX,
+
+    /* Illegal or exclusive to 64-bit mode */
+    X86_SPECIAL_i64,
+    X86_SPECIAL_o64,
+} X86InsnSpecial;
+
+typedef struct X86OpEntry  X86OpEntry;
+typedef struct X86DecodedInsn X86DecodedInsn;
+
+/* Decode function for multibyte opcodes.  */
+typedef void (*X86DecodeFunc)(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b);
+
+/* Code generation function.  */
+typedef void (*X86GenFunc)(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode);
+
+struct X86OpEntry {
+    /* Based on the is_decode flag.  */
+    union {
+        X86GenFunc    gen;
+        X86DecodeFunc decode;
+    };
+    X86OpType    op0 : 8;
+    X86OpSize    s0 : 8;
+    X86OpType    op1 : 8;
+    X86OpSize    s1 : 8;
+    X86OpType    op2 : 8;
+    X86OpSize    s2 : 8;
+    /* Must be I and b respectively if present.  */
+    X86OpType    op3 : 8;
+    X86OpSize    s3 : 8;
+
+    X86InsnSpecial special : 8;
+    bool         is_decode : 1;
+};
+
+typedef struct X86DecodedOp {
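+    /* Register number, or -1 when the operand is in memory (see has_ea). */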
+    int8_t n;
+    MemOp ot;     /* For b/c/d/p/s/q/v/w/y/z */
+    X86OpUnit unit;
+    bool has_ea;
+} X86DecodedOp;
+
+struct X86DecodedInsn {
+    X86OpEntry e;
+    X86DecodedOp op[3];
+    target_ulong immediate;
+    AddressParts mem;
+
+    uint8_t b;
+};
+
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
new file mode 100644
index 0000000000..e86364ffc1
--- /dev/null
+++ b/target/i386/tcg/emit.c.inc
@@ -0,0 +1,31 @@
+/*
+ * New-style TCG opcode generator for i386 instructions
+ *
+ *  Copyright (c) 2022 Red Hat, Inc.
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+static void gen_illegal(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_illegal_opcode(s);
+}
+
+static void gen_load_ea(DisasContext *s, AddressParts *mem)
+{
+    TCGv ea = gen_lea_modrm_1(s, *mem);
+    gen_lea_v_seg(s, s->aflag, ea, mem->def_seg, s->override);
+}
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index f1aa830fcc..f66bf2ac79 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -85,6 +85,9 @@ typedef struct DisasContext {
     int8_t override; /* -1 if no override, else R_CS, R_DS, etc */
     uint8_t prefix;
 
+    bool has_modrm;
+    uint8_t modrm;
+
 #ifndef CONFIG_USER_ONLY
     uint8_t cpl;   /* code priv level */
     uint8_t iopl;  /* i/o priv level */
@@ -2356,6 +2359,31 @@ static inline uint32_t insn_get(CPUX86State *env, DisasContext *s, MemOp ot)
     return ret;
 }
 
+static inline target_long insn_get_signed(CPUX86State *env, DisasContext *s, MemOp ot)
+{
+    target_long ret;
+
+    switch (ot) {
+    case MO_8:
+        ret = (int8_t) x86_ldub_code(env, s);
+        break;
+    case MO_16:
+        ret = (int16_t) x86_lduw_code(env, s);
+        break;
+    case MO_32:
+        ret = (int32_t) x86_ldl_code(env, s);
+        break;
+#ifdef TARGET_X86_64
+    case MO_64:
+        ret = x86_ldq_code(env, s);
+        break;
+#endif
+    default:
+        tcg_abort();
+    }
+    return ret;
+}
+
 static inline int insn_const_size(MemOp ot)
 {
     if (ot <= MO_32) {
@@ -2845,6 +2873,11 @@ typedef void (*SSEFunc_0_ppi)(TCGv_ptr reg_a, TCGv_ptr reg_b, TCGv_i32 val);
 typedef void (*SSEFunc_0_eppt)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
                                TCGv val);
 
+static bool first = true;
+static unsigned long limit;
+#include "decode-new.h"
+#include "emit.c.inc"
+#include "decode-new.c.inc"
+
 #define SSE_OPF_CMP       (1 << 1) /* does not write for first operand */
 #define SSE_OPF_SPECIAL   (1 << 3) /* magic */
 #define SSE_OPF_3DNOW     (1 << 4) /* 3DNow! instruction */
@@ -4756,10 +4789,33 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 
     prefixes = 0;
 
+    if (first) {
+        first = false;
+        limit = getenv("LIMIT") ? atol(getenv("LIMIT")) : -1;
+    }
+    bool use_new = true;
+#ifdef CONFIG_USER_ONLY
+    use_new &= limit > 0;
+#endif
  next_byte:
+    s->prefix = prefixes;
     b = x86_ldub_code(env, s);
     /* Collect prefixes.  */
     switch (b) {
+    default:
+#ifndef CONFIG_USER_ONLY
+        use_new &= b <= limit;
+#endif
+        if (use_new && 0) {
+            return disas_insn_new(s, cpu, b);
+        }
+        break;
+    case 0x0f:
+        b = x86_ldub_code(env, s) + 0x100;
+#ifndef CONFIG_USER_ONLY
+        use_new &= b <= limit;
+#endif
+        if (use_new && 0) {
+            return disas_insn_new(s, cpu, b);
+        }
+        break;
     case 0xf3:
         prefixes |= PREFIX_REPZ;
         prefixes &= ~PREFIX_REPNZ;
@@ -4810,6 +4866,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 #endif
     case 0xc5: /* 2-byte VEX */
     case 0xc4: /* 3-byte VEX */
+        use_new = false;
         /* VEX prefixes cannot be used except in 32-bit mode.
            Otherwise the instruction is LES or LDS.  */
         if (CODE32(s) && !VM86(s)) {
@@ -4894,14 +4951,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
     s->dflag = dflag;
 
     /* now check op code */
- reswitch:
     switch(b) {
-    case 0x0f:
-        /**************************/
-        /* extended op code */
-        b = x86_ldub_code(env, s) | 0x100;
-        goto reswitch;
-
         /**************************/
         /* arith & logic */
     case 0x00 ... 0x05:
-- 
2.37.2

* [PATCH 06/37] target/i386: add ALU load/writeback core
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (4 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 05/37] target/i386: add core of new i386 decoder Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 10:02   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 07/37] target/i386: add CPUID[EAX=7, ECX=0].ECX to DisasContext Paolo Bonzini
                   ` (30 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

Add generic code generation that prepares operands around calls to
decode.e.gen in a table-driven manner, so that individual ALU emitters
need not repeat that work.
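
As an illustration (a hypothetical emitter, not code added by this
patch), an ALU operation then reduces to the operation itself:

    static void gen_AND(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
    {
        /* gen_load() has already placed the operands in s->T0 and s->T1.  */
        tcg_gen_and_tl(s->T0, s->T0, s->T1);
        gen_op_update1_cc(s);
        set_cc_op(s, CC_OP_LOGICB + decode->op[0].ot);
        /* gen_writeback() then stores s->T0 back into operand 0.  */
    }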

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc |  20 +++-
 target/i386/tcg/decode-new.h     |   1 +
 target/i386/tcg/emit.c.inc       | 152 +++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |  24 +++++
 4 files changed, 195 insertions(+), 2 deletions(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index de8ef51a2d..7f76051b2d 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -228,7 +228,7 @@ static bool decode_op_size(DisasContext *s, X86OpEntry *e, X86OpSize size, MemOp
             *ot = MO_64;
             return true;
         }
-        if (s->vex_l && e->s0 != X86_SIZE_qq) {
+        if (s->vex_l && e->s0 != X86_SIZE_qq && e->s1 != X86_SIZE_qq) {
             return false;
         }
         *ot = MO_128;
@@ -741,7 +741,23 @@ static target_ulong disas_insn_new(DisasContext *s, CPUState *cpu, int b)
     if (decode.op[0].has_ea || decode.op[1].has_ea || decode.op[2].has_ea) {
         gen_load_ea(s, &decode.mem);
     }
-    decode.e.gen(s, env, &decode);
+    if (s->prefix & PREFIX_LOCK) {
+        if (decode.op[0].unit != X86_OP_INT || !decode.op[0].has_ea) {
+            goto illegal_op;
+        }
+        gen_load(s, s->T1, NULL, &decode.op[2], decode.immediate);
+        decode.e.gen(s, env, &decode);
+    } else {
+        if (decode.op[0].unit == X86_OP_MMX) {
+            gen_mmx_offset(s->ptr0, &decode.op[0]);
+        } else if (decode.op[0].unit == X86_OP_SSE) {
+            gen_xmm_offset(s->ptr0, &decode.op[0]);
+        }
+        gen_load(s, s->T0, s->ptr1, &decode.op[1], decode.immediate);
+        gen_load(s, s->T1, s->ptr2, &decode.op[2], decode.immediate);
+        decode.e.gen(s, env, &decode);
+        gen_writeback(s, &decode.op[0]);
+    }
     return s->pc;
  illegal_op:
     gen_illegal_opcode(s);
diff --git a/target/i386/tcg/decode-new.h b/target/i386/tcg/decode-new.h
index fb44560aae..a2d3c3867f 100644
--- a/target/i386/tcg/decode-new.h
+++ b/target/i386/tcg/decode-new.h
@@ -168,6 +168,7 @@ typedef struct X86DecodedOp {
     MemOp ot;     /* For b/c/d/p/s/q/v/w/y/z */
     X86OpUnit unit;
     bool has_ea;
+    int offset;   /* For MMX and SSE */
 } X86DecodedOp;
 
 struct X86DecodedInsn {
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index e86364ffc1..6fa0062d6a 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -29,3 +29,155 @@ static void gen_load_ea(DisasContext *s, AddressParts *mem)
     TCGv ea = gen_lea_modrm_1(s, *mem);
     gen_lea_v_seg(s, s->aflag, ea, mem->def_seg, s->override);
 }
+
+static void gen_mmx_offset(TCGv_ptr ptr, X86DecodedOp *op)
+{
+    if (!op->has_ea) {
+        op->offset = offsetof(CPUX86State, fpregs[op->n].mmx);
+    } else {
+        op->offset = offsetof(CPUX86State, mmx_t0);
+    }
+    tcg_gen_addi_ptr(ptr, cpu_env, op->offset);
+
+    /*
+     * ptr is for passing to helpers, and points to the MMXReg; op->offset
+     * is for TCG ops and points to the operand.
+     */
+    if (op->ot == MO_32) {
+        op->offset += offsetof(MMXReg, MMX_L(0));
+    }
+}
+
+static int xmm_offset(MemOp ot)
+{
+    if (ot == MO_8) {
+        return offsetof(ZMMReg, ZMM_B(0));
+    } else if (ot == MO_16) {
+        return offsetof(ZMMReg, ZMM_W(0));
+    } else if (ot == MO_32) {
+        return offsetof(ZMMReg, ZMM_L(0));
+    } else if (ot == MO_64) {
+        return offsetof(ZMMReg, ZMM_Q(0));
+    } else if (ot == MO_128) {
+        return offsetof(ZMMReg, ZMM_X(0));
+    } else if (ot == MO_256) {
+        return offsetof(ZMMReg, ZMM_Y(0));
+    } else {
+       abort();
+    }
+}
+
+static void gen_xmm_offset(TCGv_ptr ptr, X86DecodedOp *op)
+{
+    if (!op->has_ea) {
+        op->offset = ZMM_OFFSET(op->n);
+    } else {
+        op->offset = offsetof(CPUX86State, xmm_t0);
+    }
+    /*
+     * ptr is for passing to helpers, and points to the ZMMReg; op->offset
+     * is for TCG ops (especially gvec) and points to the base of the vector.
+     */
+    tcg_gen_addi_ptr(ptr, cpu_env, op->offset);
+    op->offset += xmm_offset(op->ot);
+}
+
+static void gen_load_sse(DisasContext *s, TCGv temp, MemOp ot, int dest_ofs)
+{
+    if (ot == MO_8) {
+        gen_op_ld_v(s, MO_8, temp, s->A0);
+        tcg_gen_st8_tl(temp, cpu_env, dest_ofs);
+    } else if (ot == MO_16) {
+        gen_op_ld_v(s, MO_16, temp, s->A0);
+        tcg_gen_st16_tl(temp, cpu_env, dest_ofs);
+    } else if (ot == MO_32) {
+        gen_op_ld_v(s, MO_32, temp, s->A0);
+        tcg_gen_st32_tl(temp, cpu_env, dest_ofs);
+    } else if (ot == MO_64) {
+        gen_ldq_env_A0(s, dest_ofs);
+    } else if (ot == MO_128) {
+        gen_ldo_env_A0(s, dest_ofs);
+    } else if (ot == MO_256) {
+        gen_ldy_env_A0(s, dest_ofs);
+    }
+}
+
+static void gen_load(DisasContext *s, TCGv v, TCGv_ptr ptr, X86DecodedOp *op, uint64_t imm)
+{
+    switch (op->unit) {
+    case X86_OP_SKIP:
+        return;
+    case X86_OP_SEG:
+        tcg_gen_ld32u_tl(v, cpu_env,
+                         offsetof(CPUX86State,segs[op->n].selector));
+        break;
+    case X86_OP_CR:
+        tcg_gen_ld_tl(v, cpu_env, offsetof(CPUX86State, cr[op->n]));
+        break;
+    case X86_OP_DR:
+        tcg_gen_ld_tl(v, cpu_env, offsetof(CPUX86State, dr[op->n]));
+        break;
+    case X86_OP_INT:
+        if (op->has_ea) {
+            gen_op_ld_v(s, op->ot, v, s->A0);
+        } else {
+            gen_op_mov_v_reg(s, op->ot, v, op->n);
+        }
+        break;
+    case X86_OP_IMM:
+        tcg_gen_movi_tl(v, imm);
+        break;
+
+    case X86_OP_MMX:
+        gen_mmx_offset(ptr, op);
+        goto load_vector;
+
+    case X86_OP_SSE:
+        gen_xmm_offset(ptr, op);
+    load_vector:
+        if (op->has_ea) {
+            gen_load_sse(s, v, op->ot, op->offset);
+        }
+        break;
+
+    default:
+        abort();
+    }
+}
+
+static void gen_writeback(DisasContext *s, X86DecodedOp *op)
+{
+    switch (op->unit) {
+    case X86_OP_SKIP:
+        break;
+    case X86_OP_SEG:
+        /* Note that reg == R_SS in gen_movl_seg_T0 always sets is_jmp.  */
+        gen_movl_seg_T0(s, op->n);
+        if (s->base.is_jmp) {
+            gen_jmp_im(s, s->pc - s->cs_base);
+            if (op->n == R_SS) {
+                s->flags &= ~HF_TF_MASK;
+                gen_eob_inhibit_irq(s, true);
+            } else {
+                gen_eob(s);
+            }
+        }
+        break;
+    case X86_OP_CR:
+    case X86_OP_DR:
+        /* TBD */
+        break;
+    case X86_OP_INT:
+        if (op->has_ea) {
+            gen_op_st_v(s, op->ot, s->T0, s->A0);
+        } else {
+            gen_op_mov_reg_v(s, op->ot, op->n, s->T0);
+        }
+        break;
+    case X86_OP_MMX:
+    case X86_OP_SSE:
+        break;
+    default:
+        abort();
+    }
+}
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index f66bf2ac79..7e9920e29c 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2831,6 +2831,30 @@ static inline void gen_sto_env_A0(DisasContext *s, int offset)
     tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
 }
 
+static inline void gen_ldy_env_A0(DisasContext *s, int offset)
+{
+    int mem_index = s->mem_index;
+    gen_ldo_env_A0(s, offset);
+    tcg_gen_addi_tl(s->tmp0, s->A0, 16);
+    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(2)));
+    tcg_gen_addi_tl(s->tmp0, s->A0, 24);
+    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(3)));
+}
+
+static inline void gen_sty_env_A0(DisasContext *s, int offset)
+{
+    int mem_index = s->mem_index;
+    gen_sto_env_A0(s, offset);
+    tcg_gen_addi_tl(s->tmp0, s->A0, 16);
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(2)));
+    tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+    tcg_gen_addi_tl(s->tmp0, s->A0, 24);
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(3)));
+    tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+}
+
 static inline void gen_op_movo(DisasContext *s, int d_offset, int s_offset)
 {
     tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(XMMReg, XMM_Q(0)));
-- 
2.37.2

* [PATCH 07/37] target/i386: add CPUID[EAX=7, ECX=0].ECX to DisasContext
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (5 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 06/37] target/i386: add ALU load/writeback core Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 10:02   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 08/37] target/i386: add CPUID feature checks to new decoder Paolo Bonzini
                   ` (29 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

TCG will shortly implement VAES instructions, so add the relevant feature
word to the DisasContext.
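
As a sketch of the intended use (the check itself only lands in a
later patch of this series), the new field enables tests such as:

    if (s->vex_l) {
        /* 256-bit AES requires the VAES bit in CPUID[EAX=7,ECX=0].ECX.  */
        return (s->cpuid_7_0_ecx_features & CPUID_7_0_ECX_VAES);
    }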

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/translate.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 7e9920e29c..a92ef61527 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -115,6 +115,7 @@ typedef struct DisasContext {
     int cpuid_ext2_features;
     int cpuid_ext3_features;
     int cpuid_7_0_ebx_features;
+    int cpuid_7_0_ecx_features;
     int cpuid_xsave_features;
 
     /* TCG local temps */
@@ -8860,6 +8861,7 @@ static void i386_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cpu)
     dc->cpuid_ext2_features = env->features[FEAT_8000_0001_EDX];
     dc->cpuid_ext3_features = env->features[FEAT_8000_0001_ECX];
     dc->cpuid_7_0_ebx_features = env->features[FEAT_7_0_EBX];
+    dc->cpuid_7_0_ecx_features = env->features[FEAT_7_0_ECX];
     dc->cpuid_xsave_features = env->features[FEAT_XSAVE];
     dc->jmp_opt = !((cflags & CF_NO_GOTO_TB) ||
                     (flags & (HF_TF_MASK | HF_INHIBIT_IRQ_MASK)));
-- 
2.37.2

* [PATCH 08/37] target/i386: add CPUID feature checks to new decoder
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (6 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 07/37] target/i386: add CPUID[EAX=7, ECX=0].ECX to DisasContext Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 10:05   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 09/37] target/i386: add AVX_EN hflag Paolo Bonzini
                   ` (28 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel
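
Each decoding table entry declares the feature it depends on with a
cpuid(...) marker, for example (matching an entry added later in the
series):

    X86_OP_ENTRY3(MOVBE, G,y, M,y, None,None, cpuid(MOVBE)),

has_cpuid_feature() then raises #UD before translation if the feature
bit is not exposed to the guest.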

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc | 51 ++++++++++++++++++++++++++++++++
 target/i386/tcg/decode-new.h     | 20 +++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 7f76051b2d..a9b8b6c05f 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -83,6 +83,7 @@
 #define X86_OP_ENTRY0(op, ...)                                    \
     X86_OP_ENTRY3(op, None, None, None, None, None, None, ## __VA_ARGS__)
 
+#define cpuid(feat) .cpuid = X86_FEAT_##feat,
 #define i64 .special = X86_SPECIAL_i64,
 #define o64 .special = X86_SPECIAL_o64,
 #define xchg .special = X86_SPECIAL_Locked,
@@ -506,6 +507,52 @@ static bool decode_insn(DisasContext *s, CPUX86State *env, X86DecodeFunc decode_
     return true;
 }
 
+static bool has_cpuid_feature(DisasContext *s, X86CPUIDFeature cpuid)
+{
+    switch (cpuid) {
+    case X86_FEAT_None:
+        return true;
+    case X86_FEAT_MOVBE:
+        return (s->cpuid_ext_features & CPUID_EXT_MOVBE);
+    case X86_FEAT_PCLMULQDQ:
+        return (s->cpuid_ext_features & CPUID_EXT_PCLMULQDQ);
+    case X86_FEAT_SSE:
+        return (s->cpuid_features & CPUID_SSE);
+    case X86_FEAT_SSE2:
+        return (s->cpuid_features & CPUID_SSE2);
+    case X86_FEAT_SSE3:
+        return (s->cpuid_ext_features & CPUID_EXT_SSE3);
+    case X86_FEAT_SSSE3:
+        return (s->cpuid_ext_features & CPUID_EXT_SSSE3);
+    case X86_FEAT_SSE41:
+        return (s->cpuid_ext_features & CPUID_EXT_SSE41);
+    case X86_FEAT_SSE42:
+        return (s->cpuid_ext_features & CPUID_EXT_SSE42);
+    case X86_FEAT_AES:
+        if (s->vex_l) {
+            return (s->cpuid_7_0_ecx_features & CPUID_7_0_ECX_VAES);
+        } else {
+            return (s->cpuid_ext_features & CPUID_EXT_AES);
+        }
+    case X86_FEAT_AVX:
+        return (s->cpuid_ext_features & CPUID_EXT_AVX);
+
+    case X86_FEAT_SSE4A:
+        return (s->cpuid_ext3_features & CPUID_EXT3_SSE4A);
+
+    case X86_FEAT_ADX:
+        return (s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_ADX);
+    case X86_FEAT_BMI1:
+        return (s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI1);
+    case X86_FEAT_BMI2:
+        return (s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI2);
+    case X86_FEAT_AVX2:
+        return (s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_AVX2);
+    default:
+        abort();
+    }
+}
+
 /* convert one instruction. s->base.is_jmp is set if the translation must
    be stopped. Return the next pc value */
 static target_ulong disas_insn_new(DisasContext *s, CPUState *cpu, int b)
@@ -690,6 +737,10 @@ static target_ulong disas_insn_new(DisasContext *s, CPUState *cpu, int b)
         goto unknown_op;
     }
 
+    if (!has_cpuid_feature(s, decode.e.cpuid)) {
+        goto illegal_op;
+    }
+
     switch (decode.e.special) {
     case X86_SPECIAL_None:
         break;
diff --git a/target/i386/tcg/decode-new.h b/target/i386/tcg/decode-new.h
index a2d3c3867f..6fb2d9151e 100644
--- a/target/i386/tcg/decode-new.h
+++ b/target/i386/tcg/decode-new.h
@@ -93,6 +93,25 @@ typedef enum X86OpSize {
     X86_SIZE_f64,
 } X86OpSize;
 
+typedef enum X86CPUIDFeature {
+    X86_FEAT_None,
+    X86_FEAT_ADX,
+    X86_FEAT_AES,
+    X86_FEAT_AVX,
+    X86_FEAT_AVX2,
+    X86_FEAT_BMI1,
+    X86_FEAT_BMI2,
+    X86_FEAT_MOVBE,
+    X86_FEAT_PCLMULQDQ,
+    X86_FEAT_SSE,
+    X86_FEAT_SSE2,
+    X86_FEAT_SSE3,
+    X86_FEAT_SSSE3,
+    X86_FEAT_SSE41,
+    X86_FEAT_SSE42,
+    X86_FEAT_SSE4A,
+} X86CPUIDFeature;
+
 /* Execution flags */
 
 typedef enum X86OpUnit {
@@ -160,6 +179,7 @@ struct X86OpEntry {
     X86OpSize    s3  : 8;
 
     X86InsnSpecial special : 8;
+    X86CPUIDFeature cpuid : 8;
     bool         is_decode : 1;
 };
 
-- 
2.37.2

* [PATCH 09/37] target/i386: add AVX_EN hflag
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (7 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 08/37] target/i386: add CPUID feature checks to new decoder Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 10:06   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 10/37] target/i386: validate VEX prefixes via the instructions' exception classes Paolo Bonzini
                   ` (27 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

From: Paul Brook <paul@nowt.org>

Add a new hflag bit to determine whether AVX instructions are allowed
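
A later patch in the series consumes the flag in the decoder roughly
as follows (sketch for illustration only):

    if ((s->prefix & PREFIX_VEX) && !(s->flags & HF_AVX_EN_MASK)) {
        goto illegal;   /* AVX not enabled via CR4.OSXSAVE and XCR0 */
    }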

Signed-off-by: Paul Brook <paul@nowt.org>
Message-Id: <20220424220204.2493824-4-paul@nowt.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/cpu.h            |  3 +++
 target/i386/helper.c         | 12 ++++++++++++
 target/i386/tcg/fpu_helper.c |  1 +
 3 files changed, 16 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 8311b69c88..ff1df4ea53 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -169,6 +169,7 @@ typedef enum X86Seg {
 #define HF_MPX_EN_SHIFT     25 /* MPX Enabled (CR4+XCR0+BNDCFGx) */
 #define HF_MPX_IU_SHIFT     26 /* BND registers in-use */
 #define HF_UMIP_SHIFT       27 /* CR4.UMIP */
+#define HF_AVX_EN_SHIFT     28 /* AVX Enabled (CR4+XCR0) */
 
 #define HF_CPL_MASK          (3 << HF_CPL_SHIFT)
 #define HF_INHIBIT_IRQ_MASK  (1 << HF_INHIBIT_IRQ_SHIFT)
@@ -195,6 +196,7 @@ typedef enum X86Seg {
 #define HF_MPX_EN_MASK       (1 << HF_MPX_EN_SHIFT)
 #define HF_MPX_IU_MASK       (1 << HF_MPX_IU_SHIFT)
 #define HF_UMIP_MASK         (1 << HF_UMIP_SHIFT)
+#define HF_AVX_EN_MASK       (1 << HF_AVX_EN_SHIFT)
 
 /* hflags2 */
 
@@ -2121,6 +2123,7 @@ void host_cpuid(uint32_t function, uint32_t count,
 
 /* helper.c */
 void x86_cpu_set_a20(X86CPU *cpu, int a20_state);
+void cpu_sync_avx_hflag(CPUX86State *env);
 
 #ifndef CONFIG_USER_ONLY
 static inline int x86_asidx_from_attrs(CPUState *cs, MemTxAttrs attrs)
diff --git a/target/i386/helper.c b/target/i386/helper.c
index fa409e9c44..30083c9cff 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -29,6 +29,17 @@
 #endif
 #include "qemu/log.h"
 
+void cpu_sync_avx_hflag(CPUX86State *env)
+{
+    if ((env->cr[4] & CR4_OSXSAVE_MASK)
+        && (env->xcr0 & (XSTATE_SSE_MASK | XSTATE_YMM_MASK))
+            == (XSTATE_SSE_MASK | XSTATE_YMM_MASK)) {
+        env->hflags |= HF_AVX_EN_MASK;
+    } else {
+        env->hflags &= ~HF_AVX_EN_MASK;
+    }
+}
+
 void cpu_sync_bndcs_hflags(CPUX86State *env)
 {
     uint32_t hflags = env->hflags;
@@ -209,6 +220,7 @@ void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
     env->hflags = hflags;
 
     cpu_sync_bndcs_hflags(env);
+    cpu_sync_avx_hflag(env);
 }
 
 #if !defined(CONFIG_USER_ONLY)
diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index 30bc44fcf8..48bf0c5cf8 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -2943,6 +2943,7 @@ void helper_xsetbv(CPUX86State *env, uint32_t ecx, uint64_t mask)
 
     env->xcr0 = mask;
     cpu_sync_bndcs_hflags(env);
+    cpu_sync_avx_hflag(env);
     return;
 
  do_gpf:
-- 
2.37.2

* [PATCH 10/37] target/i386: validate VEX prefixes via the instructions' exception classes
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (8 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 09/37] target/i386: add AVX_EN hflag Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 10:39   ` Richard Henderson
  2022-09-12 10:42   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 11/37] target/i386: validate SSE prefixes directly in the decoding table Paolo Bonzini
                   ` (26 subsequent siblings)
  36 siblings, 2 replies; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel
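
Each table entry is tagged with its exception class from the manual,
and validate_vex() applies the class's prefix, alignment and #NM rules
in one place.  For illustration, a hypothetical entry (the real SSE/AVX
entries are only added later in the series):

    /* ADDPS/VADDPS belongs to exception class 2.  */
    X86_OP_ENTRY3(VADDPS, V,x, H,x, W,x, vex2),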

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc | 160 ++++++++++++++++++++++++++++++-
 target/i386/tcg/decode-new.h     |  32 +++++++
 target/i386/tcg/emit.c.inc       |  34 ++++++-
 target/i386/tcg/translate.c      |  17 ++--
 4 files changed, 232 insertions(+), 11 deletions(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index a9b8b6c05f..f6c032c694 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -91,6 +91,23 @@
 #define zext0 .special = X86_SPECIAL_ZExtOp0,
 #define zext2 .special = X86_SPECIAL_ZExtOp2,
 
+#define vex1 .vex_class = 1,
+#define vex1_rep3 .vex_class = 1, .vex_special = X86_VEX_REPScalar,
+#define vex2 .vex_class = 2,
+#define vex2_rep3 .vex_class = 2, .vex_special = X86_VEX_REPScalar,
+#define vex3 .vex_class = 3,
+#define vex4 .vex_class = 4,
+#define vex4_unal .vex_class = 4, .vex_special = X86_VEX_SSEUnaligned,
+#define vex5 .vex_class = 5,
+#define vex6 .vex_class = 6,
+#define vex7 .vex_class = 7,
+#define vex8 .vex_class = 8,
+#define vex11 .vex_class = 11,
+#define vex12 .vex_class = 12,
+#define vex13 .vex_class = 13,
+
+#define avx2_256 .vex_special = X86_VEX_AVX2_256,
+
 static uint8_t get_modrm(DisasContext *s, CPUX86State *env)
 {
     if (!s->has_modrm) {
@@ -155,6 +172,18 @@ static const X86OpEntry opcodes_root[256] = {
 };
 
 #undef mmx
+#undef vex1
+#undef vex2
+#undef vex3
+#undef vex4
+#undef vex4_unal
+#undef vex5
+#undef vex6
+#undef vex7
+#undef vex8
+#undef vex11
+#undef vex12
+#undef vex13
 
 /*
  * Decode the fixed part of the opcode and place the last
@@ -553,6 +582,132 @@ static bool has_cpuid_feature(DisasContext *s, X86CPUIDFeature cpuid)
     }
 }
 
+static bool validate_vex(DisasContext *s, X86DecodedInsn *decode)
+{
+    X86OpEntry *e = &decode->e;
+
+    switch (e->vex_special) {
+    case X86_VEX_REPScalar:
+        /*
+         * Instructions which differ between 00/66 and F2/F3 in the
+         * exception classification and the size of the memory operand.
+         */
+        assert(e->vex_class == 1 || e->vex_class == 2);
+        if (s->prefix & (PREFIX_REPZ | PREFIX_REPNZ)) {
+            e->vex_class = 3;
+            if (s->vex_l) {
+                goto illegal;
+            }
+            assert(decode->e.s2 == X86_SIZE_x);
+            if (decode->op[2].has_ea) {
+                decode->op[2].ot = s->prefix & PREFIX_REPZ ? MO_32 : MO_64;
+            }
+        }
+        break;
+
+    case X86_VEX_SSEUnaligned:
+        /* handled in sse_needs_alignment.  */
+        break;
+
+    case X86_VEX_AVX2_256:
+        if ((s->prefix & PREFIX_VEX) && s->vex_l && !has_cpuid_feature(s, X86_FEAT_AVX2)) {
+            goto illegal;
+        }
+    }
+
+    /* TODO: instructions that require VEX.W=0 (Table 2-16) */
+
+    switch (e->vex_class) {
+    case 0:
+        if (s->prefix & PREFIX_VEX) {
+            goto illegal;
+        }
+        return true;
+    case 1:
+    case 2:
+    case 3:
+    case 4:
+    case 5:
+    case 7:
+        if (s->prefix & PREFIX_VEX) {
+            if (!(s->flags & HF_AVX_EN_MASK)) {
+                goto illegal;
+            }
+        } else {
+            if (!(s->flags & HF_OSFXSR_MASK)) {
+                goto illegal;
+            }
+        }
+        break;
+    case 12:
+        assert(s->has_modrm);
+        /* Must have a VSIB byte and no address prefix.  */
+        if ((s->modrm & 7) != 4 || s->aflag == MO_16) {
+            goto illegal;
+        }
+        /* Check no overlap between registers.  */
+        if (decode->op[0].unit == decode->op[1].unit && decode->op[0].n == decode->op[1].n) {
+            goto illegal;
+        }
+        if (decode->op[0].unit == X86_OP_SSE && decode->op[0].n == decode->mem.index) {
+            goto illegal;
+        }
+        if (decode->op[1].unit == X86_OP_SSE && decode->op[1].n == decode->mem.index) {
+            goto illegal;
+        }
+        /* fall through */
+    case 6:
+    case 11:
+        if (!(s->prefix & PREFIX_VEX)) {
+            goto illegal;
+        }
+        if (!(s->flags & HF_AVX_EN_MASK)) {
+            goto illegal;
+        }
+        break;
+    case 8:
+        if (!(s->prefix & PREFIX_VEX)) {
+            /* EMMS */
+            return true;
+        }
+        if (!(s->flags & HF_AVX_EN_MASK)) {
+            goto illegal;
+        }
+        break;
+    case 13:
+        if (!(s->prefix & PREFIX_VEX)) {
+            goto illegal;
+        }
+        if (s->vex_l) {
+            goto illegal;
+        }
+        /* All integer instructions use VEX.vvvv, so exit.  */
+        return true;
+    }
+
+    if (s->vex_v != 0 &&
+        e->op0 != X86_TYPE_H && e->op0 != X86_TYPE_B &&
+        e->op1 != X86_TYPE_H && e->op1 != X86_TYPE_B &&
+        e->op2 != X86_TYPE_H && e->op2 != X86_TYPE_B) {
+        goto illegal;
+    }
+
+    if (s->flags & HF_TS_MASK) {
+        goto nm_exception;
+    }
+    if (s->flags & HF_EM_MASK) {
+        goto illegal;
+    }
+    return true;
+
+nm_exception:
+    gen_NM_exception(s);
+    return false;
+illegal:
+    gen_illegal_opcode(s);
+    return false;
+}
+
 /* convert one instruction. s->base.is_jmp is set if the translation must
    be stopped. Return the next pc value */
 static target_ulong disas_insn_new(DisasContext *s, CPUState *cpu, int b)
@@ -789,8 +944,11 @@ static target_ulong disas_insn_new(DisasContext *s, CPUState *cpu, int b)
         break;
     }
 
+    if (!validate_vex(s, &decode)) {
+        return s->pc;
+    }
     if (decode.op[0].has_ea || decode.op[1].has_ea || decode.op[2].has_ea) {
-        gen_load_ea(s, &decode.mem);
+        gen_load_ea(s, &decode.mem, decode.e.vex_class == 12);
     }
     if (s->prefix & PREFIX_LOCK) {
         if (decode.op[0].unit != X86_OP_INT || !decode.op[0].has_ea) {
diff --git a/target/i386/tcg/decode-new.h b/target/i386/tcg/decode-new.h
index 6fb2d9151e..b5299d0dd2 100644
--- a/target/i386/tcg/decode-new.h
+++ b/target/i386/tcg/decode-new.h
@@ -152,6 +152,36 @@ typedef enum X86InsnSpecial {
     X86_SPECIAL_o64,
 } X86InsnSpecial;
 
+/*
+ * Special cases for instructions that operate on XMM/YMM registers.  Intel
+ * retconned all of them to have VEX exception classes other than 0 and 13, so
+ * all these only matter for instructions that have a VEX exception class.
+ * Based on tables in the "AVX and SSE Instruction Exception Specification"
+ * section of the manual.
+ */
+typedef enum X86VEXSpecial {
+    /* Legacy SSE instructions that allow unaligned operands */
+    X86_VEX_SSEUnaligned,
+
+    /*
+     * Used for instructions that distinguish the XMM operand type with an
+     * instruction prefix; legacy SSE encodings will allow unaligned operands
+     * for scalar operands only (identified by a REP prefix).  In this case,
+     * the decoding table uses "x" for the vector operands instead of specifying
+     * pd/ps/sd/ss individually.
+     */
+    X86_VEX_REPScalar,
+
+    /*
+     * VEX instructions that only support 256-bit operands with AVX2 (Table 2-17
+     * column 3).  Columns 2 and 4 (instructions limited to 256- and 128-bit
+     * operands respectively) are implicit in the presence of dq and qq
+     * operands, and thus handled by decode_op_size.
+     */
+    X86_VEX_AVX2_256,
+} X86VEXSpecial;
+
+
 typedef struct X86OpEntry  X86OpEntry;
 typedef struct X86DecodedInsn X86DecodedInsn;
 
@@ -180,6 +210,8 @@ struct X86OpEntry {
 
     X86InsnSpecial special : 8;
     X86CPUIDFeature cpuid : 8;
+    uint8_t      vex_class : 8;
+    X86VEXSpecial vex_special : 8;
     bool         is_decode : 1;
 };
 
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 6fa0062d6a..ce0205e05a 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -19,14 +19,19 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 
+static void gen_NM_exception(DisasContext *s)
+{
+    gen_exception(s, EXCP07_PREX, s->pc_start - s->cs_base);
+}
+
 static void gen_illegal(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     gen_illegal_opcode(s);
 }
 
-static void gen_load_ea(DisasContext *s, AddressParts *mem)
+static void gen_load_ea(DisasContext *s, AddressParts *mem, bool is_vsib)
 {
-    TCGv ea = gen_lea_modrm_1(s, *mem);
+    TCGv ea = gen_lea_modrm_1(s, *mem, is_vsib);
     gen_lea_v_seg(s, s->aflag, ea, mem->def_seg, s->override);
 }
 
@@ -102,6 +107,25 @@ static void gen_load_sse(DisasContext *s, TCGv temp, MemOp ot, int dest_ofs)
     }
 }
 
+static inline bool sse_needs_alignment(DisasContext *s, X86DecodedInsn *decode, X86DecodedOp *op)
+{
+    switch (decode->e.vex_class) {
+    case 2:
+    case 4:
+        if ((s->prefix & PREFIX_VEX) ||
+            decode->e.vex_special == X86_VEX_SSEUnaligned) {
+            /*
+             * Most legacy SSE instructions require aligned memory
+             * operands, but not all of them do.
+             */
+            return false;
+        }
+        /* fall through */
+    case 1:
+        return op->has_ea && op->ot >= MO_128;
+
+    default:
+        return false;
+    }
+}
+
 static void gen_load(DisasContext *s, TCGv v, TCGv_ptr ptr, X86DecodedOp *op, uint64_t imm)
 {
     switch (op->unit) {
@@ -175,7 +199,13 @@ static void gen_writeback(DisasContext *s, X86DecodedOp *op)
         }
         break;
     case X86_OP_MMX:
+        break;
     case X86_OP_SSE:
+        if ((s->prefix & PREFIX_VEX) && op->ot == MO_128) {
+            tcg_gen_gvec_dup_imm(MO_64,
+                                 offsetof(CPUX86State, xmm_regs[op->n].ZMM_X(1)),
+                                 16, 16, 0);
+        }
         break;
     default:
         abort();
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index a92ef61527..4ecf75ede3 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2217,11 +2217,11 @@ static AddressParts gen_lea_modrm_0(CPUX86State *env, DisasContext *s,
 }
 
 /* Compute the address, with a minimum number of TCG ops.  */
-static TCGv gen_lea_modrm_1(DisasContext *s, AddressParts a)
+static TCGv gen_lea_modrm_1(DisasContext *s, AddressParts a, bool is_vsib)
 {
     TCGv ea = NULL;
 
-    if (a.index >= 0) {
+    if (a.index >= 0 && !is_vsib) {
         if (a.scale == 0) {
             ea = cpu_regs[a.index];
         } else {
@@ -2249,7 +2249,7 @@ static TCGv gen_lea_modrm_1(DisasContext *s, AddressParts a)
 static void gen_lea_modrm(CPUX86State *env, DisasContext *s, int modrm)
 {
     AddressParts a = gen_lea_modrm_0(env, s, modrm);
-    TCGv ea = gen_lea_modrm_1(s, a);
+    TCGv ea = gen_lea_modrm_1(s, a, false);
     gen_lea_v_seg(s, s->aflag, ea, a.def_seg, s->override);
 }
 
@@ -2262,7 +2262,8 @@ static void gen_nop_modrm(CPUX86State *env, DisasContext *s, int modrm)
 static void gen_bndck(CPUX86State *env, DisasContext *s, int modrm,
                       TCGCond cond, TCGv_i64 bndv)
 {
-    TCGv ea = gen_lea_modrm_1(s, gen_lea_modrm_0(env, s, modrm));
+    AddressParts a = gen_lea_modrm_0(env, s, modrm);
+    TCGv ea = gen_lea_modrm_1(s, a, false);
 
     tcg_gen_extu_tl_i64(s->tmp1_i64, ea);
     if (!CODE64(s)) {
@@ -5953,7 +5954,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
         reg = ((modrm >> 3) & 7) | REX_R(s);
         {
             AddressParts a = gen_lea_modrm_0(env, s, modrm);
-            TCGv ea = gen_lea_modrm_1(s, a);
+            TCGv ea = gen_lea_modrm_1(s, a, false);
             gen_lea_v_seg(s, s->aflag, ea, -1, -1);
             gen_op_mov_reg_v(s, dflag, reg, s->A0);
         }
@@ -6180,7 +6181,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             if (mod != 3) {
                 /* memory op */
                 AddressParts a = gen_lea_modrm_0(env, s, modrm);
-                TCGv ea = gen_lea_modrm_1(s, a);
+                TCGv ea = gen_lea_modrm_1(s, a, false);
                 TCGv last_addr = tcg_temp_new();
                 bool update_fdp = true;
 
@@ -7210,7 +7211,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             gen_exts(ot, s->T1);
             tcg_gen_sari_tl(s->tmp0, s->T1, 3 + ot);
             tcg_gen_shli_tl(s->tmp0, s->tmp0, ot);
-            tcg_gen_add_tl(s->A0, gen_lea_modrm_1(s, a), s->tmp0);
+            tcg_gen_add_tl(s->A0, gen_lea_modrm_1(s, a, false), s->tmp0);
             gen_lea_v_seg(s, s->aflag, s->A0, a.def_seg, s->override);
             if (!(s->prefix & PREFIX_LOCK)) {
                 gen_op_ld_v(s, ot, s->T0, s->A0);
@@ -8281,7 +8282,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
                     /* rip-relative generates #ud */
                     goto illegal_op;
                 }
-                tcg_gen_not_tl(s->A0, gen_lea_modrm_1(s, a));
+                tcg_gen_not_tl(s->A0, gen_lea_modrm_1(s, a, false));
                 if (!CODE64(s)) {
                     tcg_gen_ext32u_tl(s->A0, s->A0);
                 }
-- 
2.37.2

* [PATCH 11/37] target/i386: validate SSE prefixes directly in the decoding table
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (9 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 10/37] target/i386: validate VEX prefixes via the instructions' exception classes Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 10:51   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 12/37] target/i386: add scalar 0F 38 and 0F 3A instruction to new decoder Paolo Bonzini
                   ` (25 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

Many SSE and AVX instructions are only valid with specific prefixes
(none, 66, F3, F2).  Introduce a direct way to encode this in the
decoding table, to avoid overusing decode groups.
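
For example, the next patch uses this to restrict RORX to the F2
prefix directly in the 0F 3A table:

    [0xF0] = X86_OP_ENTRY3(RORX, G,y, E,y, I,b, vex13 cpuid(BMI2) p_f2),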

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc | 37 ++++++++++++++++++++++++++++++++
 target/i386/tcg/decode-new.h     |  1 +
 2 files changed, 38 insertions(+)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index f6c032c694..7b4fd9fb54 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -108,6 +108,22 @@
 
 #define avx2_256 .vex_special = X86_VEX_AVX2_256,
 
+#define P_00          1
+#define P_66          (1 << PREFIX_DATA)
+#define P_F3          (1 << PREFIX_REPZ)
+#define P_F2          (1 << PREFIX_REPNZ)
+
+#define p_00          .valid_prefix = P_00,
+#define p_66          .valid_prefix = P_66,
+#define p_f3          .valid_prefix = P_F3,
+#define p_f2          .valid_prefix = P_F2,
+#define p_00_66       .valid_prefix = P_00|P_66,
+#define p_00_f3       .valid_prefix = P_00|P_F3,
+#define p_66_f2       .valid_prefix = P_66|P_F2,
+#define p_00_66_f3    .valid_prefix = P_00|P_66|P_F3,
+#define p_66_f3_f2    .valid_prefix = P_66|P_F3|P_F2,
+#define p_00_66_f3_f2 .valid_prefix = P_00|P_66|P_F3|P_F2,
+
 static uint8_t get_modrm(DisasContext *s, CPUX86State *env)
 {
     if (!s->has_modrm) {
@@ -473,6 +489,23 @@ static bool decode_op(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
     return true;
 }
 
+static bool validate_sse_prefix(DisasContext *s, X86OpEntry *e)
+{
+    uint16_t sse_prefixes;
+
+    if (!e->valid_prefix) {
+        return true;
+    }
+    if (s->prefix & (PREFIX_REPZ | PREFIX_REPNZ)) {
+        /* In SSE instructions, 0xF3 and 0xF2 cancel 0x66.  */
+        s->prefix &= ~PREFIX_DATA;
+    }
+
+    /* Now, either zero or one bit is set in sse_prefixes.  */
+    sse_prefixes = s->prefix & (PREFIX_REPZ | PREFIX_REPNZ | PREFIX_DATA);
+    return e->valid_prefix & (1 << sse_prefixes);
+}
+
 static bool decode_insn(DisasContext *s, CPUX86State *env, X86DecodeFunc decode_func,
                         X86DecodedInsn *decode)
 {
@@ -484,6 +517,10 @@ static bool decode_insn(DisasContext *s, CPUX86State *env, X86DecodeFunc decode_
         e->decode(s, env, e, &decode->b);
     }
 
+    if (!validate_sse_prefix(s, e)) {
+        return false;
+    }
+
     /* First compute size of operands in order to initialize s->rip_offset.  */
     if (e->op0 != X86_TYPE_None) {
         if (!decode_op_size(s, e, e->s0, &decode->op[0].ot)) {
diff --git a/target/i386/tcg/decode-new.h b/target/i386/tcg/decode-new.h
index b5299d0dd2..3db7b82506 100644
--- a/target/i386/tcg/decode-new.h
+++ b/target/i386/tcg/decode-new.h
@@ -212,6 +212,7 @@ struct X86OpEntry {
     X86CPUIDFeature cpuid : 8;
     uint8_t      vex_class : 8;
     X86VEXSpecial vex_special : 8;
+    uint16_t     valid_prefix : 16;
     bool         is_decode : 1;
 };
 
-- 
2.37.2

* [PATCH 12/37] target/i386: add scalar 0F 38 and 0F 3A instruction to new decoder
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (10 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 11/37] target/i386: validate SSE prefixes directly in the decoding table Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 11:04   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 13/37] target/i386: remove scalar VEX instructions from old decoder Paolo Bonzini
                   ` (24 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

Because these are the only VEX instructions that QEMU supports, the
new decoder is entered on the first byte of a valid VEX prefix, and VEX
decoding only needs to be done in decode-new.c.inc.  The old decoder
now merely peeks at the byte that follows C4/C5 and rewinds the program
counter: in 32-bit mode, if the top two bits of that byte are not 11b,
the instruction is actually LES/LDS and still goes through the old path.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc |  59 +++++++
 target/i386/tcg/emit.c.inc       | 261 +++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |  49 +-----
 3 files changed, 323 insertions(+), 46 deletions(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 7b4fd9fb54..b31daecb90 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -133,11 +133,69 @@ static uint8_t get_modrm(DisasContext *s, CPUX86State *env)
     return s->modrm;
 }
 
+static void decode_group17(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86GenFunc group17_gen[8] = {
+        NULL, gen_BLSR, gen_BLSMSK, gen_BLSI,
+    };
+    int op = (get_modrm(s, env) >> 3) & 7;
+    entry->gen = group17_gen[op];
+}
+
 static const X86OpEntry opcodes_0F38_00toEF[240] = {
 };
 
 /* five rows for no prefix, 66, F3, F2, 66+F2  */
 static X86OpEntry opcodes_0F38_F0toFF[16][5] = {
+    [0] = {
+        X86_OP_ENTRY3(MOVBE, G,y, M,y, None,None, cpuid(MOVBE)),
+        X86_OP_ENTRY3(MOVBE, G,w, M,w, None,None, cpuid(MOVBE)),
+        {},
+        X86_OP_ENTRY2(CRC32, G,d, E,b, cpuid(SSE42)),
+        X86_OP_ENTRY2(CRC32, G,d, E,b, cpuid(SSE42)),
+    },
+    [1] = {
+        X86_OP_ENTRY3(MOVBE, M,y, G,y, None,None, cpuid(MOVBE)),
+        X86_OP_ENTRY3(MOVBE, M,w, G,w, None,None, cpuid(MOVBE)),
+        {},
+        X86_OP_ENTRY2(CRC32, G,d, E,y, cpuid(SSE42)),
+        X86_OP_ENTRY2(CRC32, G,d, E,w, cpuid(SSE42)),
+    },
+    [2] = {
+        X86_OP_ENTRY3(ANDN, G,y, B,y, E,y, vex13 cpuid(BMI1)),
+        {},
+        {},
+        {},
+        {},
+    },
+    [3] = {
+        X86_OP_GROUP3(group17, B,y, E,y, None,None, vex13 cpuid(BMI1)),
+        {},
+        {},
+        {},
+        {},
+    },
+    [5] = {
+        X86_OP_ENTRY3(BZHI, G,y, E,y, B,y, vex13 cpuid(BMI2)),
+        {},
+        X86_OP_ENTRY3(PEXT, G,y, B,y, E,y, vex13 cpuid(BMI2)),
+        X86_OP_ENTRY3(PDEP, G,y, B,y, E,y, vex13 cpuid(BMI2)),
+        {},
+    },
+    [6] = {
+        {},
+        X86_OP_ENTRY2(ADCX, G,y, E,y, cpuid(ADX)),
+        X86_OP_ENTRY2(ADOX, G,y, E,y, cpuid(ADX)),
+        X86_OP_ENTRY3(MULX, /* B,y, */ G,y, E,y, 2,y, vex13 cpuid(BMI2)),
+        {},
+    },
+    [7] = {
+        X86_OP_ENTRY3(BEXTR, G,y, E,y, B,y, vex13 cpuid(BMI1)),
+        X86_OP_ENTRY3(SHLX, G,y, E,y, B,y, vex13 cpuid(BMI2)),
+        X86_OP_ENTRY3(SARX, G,y, E,y, B,y, vex13 cpuid(BMI2)),
+        X86_OP_ENTRY3(SHRX, G,y, E,y, B,y, vex13 cpuid(BMI2)),
+        {},
+    },
 };
 
 static void decode_0F38(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
@@ -159,6 +217,7 @@ static void decode_0F38(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
 }
 
 static const X86OpEntry opcodes_0F3A[256] = {
+    [0xF0] = X86_OP_ENTRY3(RORX, G,y, E,y, I,b, vex13 cpuid(BMI2) p_f2),
 };
 
 static void decode_0F3A(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index ce0205e05a..36b963a0d3 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -211,3 +211,264 @@ static void gen_writeback(DisasContext *s, X86DecodedOp *op)
         abort();
     }
 }
+
+static void gen_ADCOX(DisasContext *s, CPUX86State *env, MemOp ot, int cc_op)
+{
+    TCGv carry_in = NULL;
+    TCGv carry_out = (cc_op == CC_OP_ADCX ? cpu_cc_dst : cpu_cc_src2);
+    TCGv zero;
+
+    if (cc_op == s->cc_op || s->cc_op == CC_OP_ADCOX) {
+        /* Re-use the carry-out from a previous round.  */
+        carry_in = carry_out;
+        cc_op = s->cc_op;
+    } else if (s->cc_op == CC_OP_ADCX || s->cc_op == CC_OP_ADOX) {
+        /* Merge with the carry-out from the opposite instruction.  */
+        cc_op = CC_OP_ADCOX;
+    }
+
+    /* If we don't have a carry-in, get it out of EFLAGS.  */
+    if (!carry_in) {
+        if (s->cc_op != CC_OP_ADCX && s->cc_op != CC_OP_ADOX) {
+            gen_compute_eflags(s);
+        }
+        carry_in = s->tmp0;
+        tcg_gen_extract_tl(carry_in, cpu_cc_src,
+            ctz32(cc_op == CC_OP_ADCX ? CC_C : CC_O), 1);
+    }
+
+    switch (ot) {
+#ifdef TARGET_X86_64
+    case MO_32:
+        /* If TL is 64-bit just do everything in 64-bit arithmetic.  */
+        tcg_gen_add_i64(s->T0, s->T0, s->T1);
+        tcg_gen_add_i64(s->T0, s->T0, carry_in);
+        tcg_gen_shri_i64(carry_out, s->T0, 32);
+        break;
+#endif
+    default:
+        zero = tcg_const_tl(0);
+        tcg_gen_add2_tl(s->T0, carry_out, s->T0, zero, carry_in, zero);
+        tcg_gen_add2_tl(s->T0, carry_out, s->T0, carry_out, s->T1, zero);
+        tcg_temp_free(zero);
+        break;
+    }
+    set_cc_op(s, cc_op);
+}
+
+static void gen_ADCX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_ADCOX(s, env, decode->op[0].ot, CC_OP_ADCX);
+}
+
+static void gen_ADOX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_ADCOX(s, env, decode->op[0].ot, CC_OP_ADOX);
+}
+
+static void gen_ANDN(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+
+    tcg_gen_andc_tl(s->T0, s->T1, s->T0);
+    gen_op_update1_cc(s);
+    set_cc_op(s, CC_OP_LOGICB + ot);
+}
+
+static void gen_BEXTR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+    TCGv bound, zero;
+
+    /*
+     * Extract START, and shift the operand.
+     * Shifts larger than operand size get zeros.
+     */
+    tcg_gen_ext8u_tl(s->A0, s->T1);
+    tcg_gen_shr_tl(s->T0, s->T0, s->A0);
+
+    bound = tcg_const_tl(ot == MO_64 ? 63 : 31);
+    zero = tcg_const_tl(0);
+    tcg_gen_movcond_tl(TCG_COND_LEU, s->T0, s->A0, bound, s->T0, zero);
+    tcg_temp_free(zero);
+
+    /*
+     * Extract the LEN into a mask.  Lengths larger than
+     * operand size get all ones.
+     */
+    tcg_gen_extract_tl(s->A0, s->T1, 8, 8);
+    tcg_gen_movcond_tl(TCG_COND_LEU, s->A0, s->A0, bound, s->A0, bound);
+    tcg_temp_free(bound);
+
+    tcg_gen_movi_tl(s->T1, 1);
+    tcg_gen_shl_tl(s->T1, s->T1, s->A0);
+    tcg_gen_subi_tl(s->T1, s->T1, 1);
+    tcg_gen_and_tl(s->T0, s->T0, s->T1);
+
+    gen_op_update1_cc(s);
+    set_cc_op(s, CC_OP_LOGICB + ot);
+}
+
+static void gen_BLSI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+
+    tcg_gen_neg_tl(s->T1, s->T0);
+    tcg_gen_and_tl(s->T0, s->T0, s->T1);
+    tcg_gen_mov_tl(cpu_cc_dst, s->T0);
+    set_cc_op(s, CC_OP_BMILGB + ot);
+}
+
+static void gen_BLSMSK(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+
+    tcg_gen_subi_tl(s->T1, s->T0, 1);
+    tcg_gen_xor_tl(s->T0, s->T0, s->T1);
+    tcg_gen_mov_tl(cpu_cc_dst, s->T0);
+    set_cc_op(s, CC_OP_BMILGB + ot);
+}
+
+static void gen_BLSR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+
+    tcg_gen_subi_tl(s->T1, s->T0, 1);
+    tcg_gen_and_tl(s->T0, s->T0, s->T1);
+    tcg_gen_mov_tl(cpu_cc_dst, s->T0);
+    set_cc_op(s, CC_OP_BMILGB + ot);
+}
+
+static void gen_BZHI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+    TCGv bound;
+
+    tcg_gen_ext8u_tl(s->T1, cpu_regs[s->vex_v]);
+    bound = tcg_const_tl(ot == MO_64 ? 63 : 31);
+
+    /*
+     * Note that since we're using BMILG (in order to get O
+     * cleared) we need to store the inverse into C.
+     */
+    tcg_gen_setcond_tl(TCG_COND_LT, cpu_cc_src, s->T1, bound);
+    tcg_gen_movcond_tl(TCG_COND_GT, s->T1, s->T1, bound, bound, s->T1);
+    tcg_temp_free(bound);
+
+    tcg_gen_movi_tl(s->A0, -1);
+    tcg_gen_shl_tl(s->A0, s->A0, s->T1);
+    tcg_gen_andc_tl(s->T0, s->T0, s->A0);
+
+    gen_op_update1_cc(s);
+    set_cc_op(s, CC_OP_BMILGB + ot);
+}
+
+static void gen_CRC32(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[2].ot;
+
+    tcg_gen_trunc_tl_i32(s->tmp2_i32, s->T0);
+    gen_helper_crc32(s->T0, s->tmp2_i32, s->T1, tcg_const_i32(8 << ot));
+}
+
+static void gen_MOVBE(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+
+    /* M operand type does not load/store */
+    if (decode->e.op0 == X86_TYPE_M) {
+        tcg_gen_qemu_st_tl(s->T0, s->A0, s->mem_index, ot | MO_BE);
+    } else {
+        tcg_gen_qemu_ld_tl(s->T0, s->A0, s->mem_index, ot | MO_BE);
+    }
+}
+
+static void gen_MULX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+
+    /* low part of result in VEX.vvvv, high in MODRM */
+    switch (ot) {
+    default:
+        tcg_gen_trunc_tl_i32(s->tmp2_i32, s->T0);
+        tcg_gen_trunc_tl_i32(s->tmp3_i32, s->T1);
+        tcg_gen_mulu2_i32(s->tmp2_i32, s->tmp3_i32,
+                          s->tmp2_i32, s->tmp3_i32);
+        tcg_gen_extu_i32_tl(cpu_regs[s->vex_v], s->tmp2_i32);
+        tcg_gen_extu_i32_tl(s->T0, s->tmp3_i32);
+        break;
+#ifdef TARGET_X86_64
+    case MO_64:
+        tcg_gen_mulu2_i64(cpu_regs[s->vex_v], s->T0, s->T0, s->T1);
+        break;
+#endif
+    }
+}
+
+static void gen_PDEP(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[1].ot;
+    if (ot < MO_64) {
+        tcg_gen_ext32u_tl(s->T0, s->T0);
+    }
+    gen_helper_pdep(s->T0, s->T0, s->T1);
+}
+
+static void gen_PEXT(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[1].ot;
+    if (ot < MO_64) {
+        tcg_gen_ext32u_tl(s->T0, s->T0);
+    }
+    gen_helper_pext(s->T0, s->T0, s->T1);
+}
+
+static void gen_RORX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+    int b = decode->immediate;
+
+    if (ot == MO_64) {
+        tcg_gen_rotri_tl(s->T0, s->T0, b & 63);
+    } else {
+        tcg_gen_trunc_tl_i32(s->tmp2_i32, s->T0);
+        tcg_gen_rotri_i32(s->tmp2_i32, s->tmp2_i32, b & 31);
+        tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
+    }
+}
+
+static void gen_SARX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+    int mask;
+
+    mask = ot == MO_64 ? 63 : 31;
+    tcg_gen_andi_tl(s->T1, s->T1, mask);
+    if (ot != MO_64) {
+        tcg_gen_ext32s_tl(s->T0, s->T0);
+    }
+    tcg_gen_sar_tl(s->T0, s->T0, s->T1);
+}
+
+static void gen_SHLX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+    int mask;
+
+    mask = ot == MO_64 ? 63 : 31;
+    tcg_gen_andi_tl(s->T1, s->T1, mask);
+    tcg_gen_shl_tl(s->T0, s->T0, s->T1);
+}
+
+static void gen_SHRX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+    int mask;
+
+    mask = ot == MO_64 ? 63 : 31;
+    tcg_gen_andi_tl(s->T1, s->T1, mask);
+    if (ot != MO_64) {
+        tcg_gen_ext32u_tl(s->T0, s->T0);
+    }
+    tcg_gen_shr_tl(s->T0, s->T0, s->T1);
+}
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 4ecf75ede3..7eed575f2e 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4892,59 +4892,16 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 #endif
     case 0xc5: /* 2-byte VEX */
     case 0xc4: /* 3-byte VEX */
-        use_new = false;
-        /* VEX prefixes cannot be used except in 32-bit mode.
-           Otherwise the instruction is LES or LDS.  */
         if (CODE32(s) && !VM86(s)) {
-            static const int pp_prefix[4] = {
-                0, PREFIX_DATA, PREFIX_REPZ, PREFIX_REPNZ
-            };
-            int vex3, vex2 = x86_ldub_code(env, s);
+            int vex2 = x86_ldub_code(env, s);
+            s->pc--; /* rewind the advance_pc() x86_ldub_code() did */
 
             if (!CODE64(s) && (vex2 & 0xc0) != 0xc0) {
                 /* 4.1.4.6: In 32-bit mode, bits [7:6] must be 11b,
                    otherwise the instruction is LES or LDS.  */
-                s->pc--; /* rewind the advance_pc() x86_ldub_code() did */
                 break;
             }
-
-            /* 4.1.1-4.1.3: No preceding lock, 66, f2, f3, or rex prefixes. */
-            if (prefixes & (PREFIX_REPZ | PREFIX_REPNZ
-                            | PREFIX_LOCK | PREFIX_DATA | PREFIX_REX)) {
-                goto illegal_op;
-            }
-#ifdef TARGET_X86_64
-            s->rex_r = (~vex2 >> 4) & 8;
-#endif
-            if (b == 0xc5) {
-                /* 2-byte VEX prefix: RVVVVlpp, implied 0f leading opcode byte */
-                vex3 = vex2;
-                b = x86_ldub_code(env, s) | 0x100;
-            } else {
-                /* 3-byte VEX prefix: RXBmmmmm wVVVVlpp */
-                vex3 = x86_ldub_code(env, s);
-#ifdef TARGET_X86_64
-                s->rex_x = (~vex2 >> 3) & 8;
-                s->rex_b = (~vex2 >> 2) & 8;
-                s->rex_w = (vex3 >> 7) & 1;
-#endif
-                switch (vex2 & 0x1f) {
-                case 0x01: /* Implied 0f leading opcode bytes.  */
-                    b = x86_ldub_code(env, s) | 0x100;
-                    break;
-                case 0x02: /* Implied 0f 38 leading opcode bytes.  */
-                    b = 0x138;
-                    break;
-                case 0x03: /* Implied 0f 3a leading opcode bytes.  */
-                    b = 0x13a;
-                    break;
-                default:   /* Reserved for future use.  */
-                    goto unknown_op;
-                }
-            }
-            s->vex_v = (~vex3 >> 3) & 0xf;
-            s->vex_l = (vex3 >> 2) & 1;
-            prefixes |= pp_prefix[vex3 & 3] | PREFIX_VEX;
+            return disas_insn_new(s, cpu, b);
         }
         break;
     }
-- 
2.37.2

* [PATCH 13/37] target/i386: remove scalar VEX instructions from old decoder
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (11 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 12/37] target/i386: add scalar 0F 38 and 0F 3A instruction to new decoder Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 11:06   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 14/37] target/i386: Prepare ops_sse_header.h for 256 bit AVX Paolo Bonzini
                   ` (23 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

This is all dead code, since the VEX prefix goes straight to the new decoder.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/translate.c | 243 ------------------------------------
 1 file changed, 243 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 7eed575f2e..240811bd49 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4119,151 +4119,6 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                                        s->mem_index, ot | MO_BE);
                 }
                 break;
-
-            case 0x0f2: /* andn Gy, By, Ey */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI1)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-                tcg_gen_andc_tl(s->T0, s->T0, cpu_regs[s->vex_v]);
-                gen_op_mov_reg_v(s, ot, reg, s->T0);
-                gen_op_update1_cc(s);
-                set_cc_op(s, CC_OP_LOGICB + ot);
-                break;
-
-            case 0x0f7: /* bextr Gy, Ey, By */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI1)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                {
-                    TCGv bound, zero;
-
-                    gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-                    /* Extract START, and shift the operand.
-                       Shifts larger than operand size get zeros.  */
-                    tcg_gen_ext8u_tl(s->A0, cpu_regs[s->vex_v]);
-                    tcg_gen_shr_tl(s->T0, s->T0, s->A0);
-
-                    bound = tcg_const_tl(ot == MO_64 ? 63 : 31);
-                    zero = tcg_const_tl(0);
-                    tcg_gen_movcond_tl(TCG_COND_LEU, s->T0, s->A0, bound,
-                                       s->T0, zero);
-                    tcg_temp_free(zero);
-
-                    /* Extract the LEN into a mask.  Lengths larger than
-                       operand size get all ones.  */
-                    tcg_gen_extract_tl(s->A0, cpu_regs[s->vex_v], 8, 8);
-                    tcg_gen_movcond_tl(TCG_COND_LEU, s->A0, s->A0, bound,
-                                       s->A0, bound);
-                    tcg_temp_free(bound);
-                    tcg_gen_movi_tl(s->T1, 1);
-                    tcg_gen_shl_tl(s->T1, s->T1, s->A0);
-                    tcg_gen_subi_tl(s->T1, s->T1, 1);
-                    tcg_gen_and_tl(s->T0, s->T0, s->T1);
-
-                    gen_op_mov_reg_v(s, ot, reg, s->T0);
-                    gen_op_update1_cc(s);
-                    set_cc_op(s, CC_OP_LOGICB + ot);
-                }
-                break;
-
-            case 0x0f5: /* bzhi Gy, Ey, By */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI2)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-                tcg_gen_ext8u_tl(s->T1, cpu_regs[s->vex_v]);
-                {
-                    TCGv bound = tcg_const_tl(ot == MO_64 ? 63 : 31);
-                    /* Note that since we're using BMILG (in order to get O
-                       cleared) we need to store the inverse into C.  */
-                    tcg_gen_setcond_tl(TCG_COND_LT, cpu_cc_src,
-                                       s->T1, bound);
-                    tcg_gen_movcond_tl(TCG_COND_GT, s->T1, s->T1,
-                                       bound, bound, s->T1);
-                    tcg_temp_free(bound);
-                }
-                tcg_gen_movi_tl(s->A0, -1);
-                tcg_gen_shl_tl(s->A0, s->A0, s->T1);
-                tcg_gen_andc_tl(s->T0, s->T0, s->A0);
-                gen_op_mov_reg_v(s, ot, reg, s->T0);
-                gen_op_update1_cc(s);
-                set_cc_op(s, CC_OP_BMILGB + ot);
-                break;
-
-            case 0x3f6: /* mulx By, Gy, rdx, Ey */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI2)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-                switch (ot) {
-                default:
-                    tcg_gen_trunc_tl_i32(s->tmp2_i32, s->T0);
-                    tcg_gen_trunc_tl_i32(s->tmp3_i32, cpu_regs[R_EDX]);
-                    tcg_gen_mulu2_i32(s->tmp2_i32, s->tmp3_i32,
-                                      s->tmp2_i32, s->tmp3_i32);
-                    tcg_gen_extu_i32_tl(cpu_regs[s->vex_v], s->tmp2_i32);
-                    tcg_gen_extu_i32_tl(cpu_regs[reg], s->tmp3_i32);
-                    break;
-#ifdef TARGET_X86_64
-                case MO_64:
-                    tcg_gen_mulu2_i64(s->T0, s->T1,
-                                      s->T0, cpu_regs[R_EDX]);
-                    tcg_gen_mov_i64(cpu_regs[s->vex_v], s->T0);
-                    tcg_gen_mov_i64(cpu_regs[reg], s->T1);
-                    break;
-#endif
-                }
-                break;
-
-            case 0x3f5: /* pdep Gy, By, Ey */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI2)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-                /* Note that by zero-extending the source operand, we
-                   automatically handle zero-extending the result.  */
-                if (ot == MO_64) {
-                    tcg_gen_mov_tl(s->T1, cpu_regs[s->vex_v]);
-                } else {
-                    tcg_gen_ext32u_tl(s->T1, cpu_regs[s->vex_v]);
-                }
-                gen_helper_pdep(cpu_regs[reg], s->T1, s->T0);
-                break;
-
-            case 0x2f5: /* pext Gy, By, Ey */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI2)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-                /* Note that by zero-extending the source operand, we
-                   automatically handle zero-extending the result.  */
-                if (ot == MO_64) {
-                    tcg_gen_mov_tl(s->T1, cpu_regs[s->vex_v]);
-                } else {
-                    tcg_gen_ext32u_tl(s->T1, cpu_regs[s->vex_v]);
-                }
-                gen_helper_pext(cpu_regs[reg], s->T1, s->T0);
-                break;
-
             case 0x1f6: /* adcx Gy, Ey */
             case 0x2f6: /* adox Gy, Ey */
                 CHECK_NO_VEX(s);
@@ -4343,73 +4198,6 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                 }
                 break;
 
-            case 0x1f7: /* shlx Gy, Ey, By */
-            case 0x2f7: /* sarx Gy, Ey, By */
-            case 0x3f7: /* shrx Gy, Ey, By */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI2)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-                if (ot == MO_64) {
-                    tcg_gen_andi_tl(s->T1, cpu_regs[s->vex_v], 63);
-                } else {
-                    tcg_gen_andi_tl(s->T1, cpu_regs[s->vex_v], 31);
-                }
-                if (b == 0x1f7) {
-                    tcg_gen_shl_tl(s->T0, s->T0, s->T1);
-                } else if (b == 0x2f7) {
-                    if (ot != MO_64) {
-                        tcg_gen_ext32s_tl(s->T0, s->T0);
-                    }
-                    tcg_gen_sar_tl(s->T0, s->T0, s->T1);
-                } else {
-                    if (ot != MO_64) {
-                        tcg_gen_ext32u_tl(s->T0, s->T0);
-                    }
-                    tcg_gen_shr_tl(s->T0, s->T0, s->T1);
-                }
-                gen_op_mov_reg_v(s, ot, reg, s->T0);
-                break;
-
-            case 0x0f3:
-            case 0x1f3:
-            case 0x2f3:
-            case 0x3f3: /* Group 17 */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI1)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-
-                tcg_gen_mov_tl(cpu_cc_src, s->T0);
-                switch (reg & 7) {
-                case 1: /* blsr By,Ey */
-                    tcg_gen_subi_tl(s->T1, s->T0, 1);
-                    tcg_gen_and_tl(s->T0, s->T0, s->T1);
-                    break;
-                case 2: /* blsmsk By,Ey */
-                    tcg_gen_subi_tl(s->T1, s->T0, 1);
-                    tcg_gen_xor_tl(s->T0, s->T0, s->T1);
-                    break;
-                case 3: /* blsi By, Ey */
-                    tcg_gen_neg_tl(s->T1, s->T0);
-                    tcg_gen_and_tl(s->T0, s->T0, s->T1);
-                    break;
-                default:
-                    goto unknown_op;
-                }
-                tcg_gen_mov_tl(cpu_cc_dst, s->T0);
-                gen_op_mov_reg_v(s, ot, s->vex_v, s->T0);
-                set_cc_op(s, CC_OP_BMILGB + ot);
-                break;
-
-            default:
-                goto unknown_op;
             }
             break;
 
@@ -4625,37 +4413,6 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
             }
             break;
 
-        case 0x33a:
-            /* Various integer extensions at 0f 3a f[0-f].  */
-            b = modrm | (b1 << 8);
-            modrm = x86_ldub_code(env, s);
-            reg = ((modrm >> 3) & 7) | REX_R(s);
-
-            switch (b) {
-            case 0x3f0: /* rorx Gy,Ey, Ib */
-                if (!(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI2)
-                    || !(s->prefix & PREFIX_VEX)
-                    || s->vex_l != 0) {
-                    goto illegal_op;
-                }
-                ot = mo_64_32(s->dflag);
-                gen_ldst_modrm(env, s, modrm, ot, OR_TMP0, 0);
-                b = x86_ldub_code(env, s);
-                if (ot == MO_64) {
-                    tcg_gen_rotri_tl(s->T0, s->T0, b & 63);
-                } else {
-                    tcg_gen_trunc_tl_i32(s->tmp2_i32, s->T0);
-                    tcg_gen_rotri_i32(s->tmp2_i32, s->tmp2_i32, b & 31);
-                    tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
-                }
-                gen_op_mov_reg_v(s, ot, reg, s->T0);
-                break;
-
-            default:
-                goto unknown_op;
-            }
-            break;
-
         default:
         unknown_op:
             gen_unknown_opcode(env, s);
-- 
2.37.2

* [PATCH 14/37] target/i386: Prepare ops_sse_header.h for 256 bit AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (12 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 13/37] target/i386: remove scalar VEX instructions from old decoder Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 11:09   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 15/37] target/i386: extend helpers to support VEX.V 3- and 4-operand encodings Paolo Bonzini
                   ` (22 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

From: Paul Brook <paul@nowt.org>

Adjust all #ifdefs to match the ones in ops_sse.h.
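
For readers who do not have ops_sse.h at hand: these headers are included
once per vector width, with SHIFT and SUFFIX defined by the includer, and a
register holds (8 << SHIFT) bytes -- SHIFT 0 is 64-bit MMX, SHIFT 1 is
128-bit XMM, and later in this series SHIFT 2 becomes 256-bit YMM.  A
condensed sketch of the pattern (pcmpgtq and dppd are real lines from the
hunks below; the comments are mine):

    #if SHIFT >= 1   /* declared for the XMM and the future YMM variant */
    DEF_HELPER_3(glue(pcmpgtq, SUFFIX), void, env, Reg, Reg)
    #endif
    #if SHIFT == 1   /* XMM only; e.g. there is no YMM form of dppd */
    DEF_HELPER_4(glue(dppd, SUFFIX), void, env, Reg, Reg, i32)
    #endif

So relaxing "#if SHIFT == 1" to "#if SHIFT >= 1" is exactly what makes a
declaration reappear when the header is instantiated for 256-bit registers.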

Signed-off-by: Paul Brook <paul@nowt.org>
Message-Id: <20220424220204.2493824-23-paul@nowt.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/ops_sse_header.h | 114 +++++++++++++++++++++++------------
 1 file changed, 75 insertions(+), 39 deletions(-)

diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index d99464afb0..7f57dab496 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -43,7 +43,7 @@ DEF_HELPER_3(glue(pslld, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(psrlq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(psllq, SUFFIX), void, env, Reg, Reg)
 
-#if SHIFT == 1
+#if SHIFT >= 1
 DEF_HELPER_3(glue(psrldq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pslldq, SUFFIX), void, env, Reg, Reg)
 #endif
@@ -101,7 +101,7 @@ SSE_HELPER_L(pcmpeql, FCMPEQ)
 
 SSE_HELPER_W(pmullw, FMULLW)
 #if SHIFT == 0
-SSE_HELPER_W(pmulhrw, FMULHRW)
+DEF_HELPER_3(glue(pmulhrw, SUFFIX), void, env, Reg, Reg)
 #endif
 SSE_HELPER_W(pmulhuw, FMULHUW)
 SSE_HELPER_W(pmulhw, FMULHW)
@@ -113,7 +113,9 @@ DEF_HELPER_3(glue(pmuludq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmaddwd, SUFFIX), void, env, Reg, Reg)
 
 DEF_HELPER_3(glue(psadbw, SUFFIX), void, env, Reg, Reg)
+#if SHIFT < 2
 DEF_HELPER_4(glue(maskmov, SUFFIX), void, env, Reg, Reg, tl)
+#endif
 DEF_HELPER_2(glue(movl_mm_T0, SUFFIX), void, Reg, i32)
 #ifdef TARGET_X86_64
 DEF_HELPER_2(glue(movq_mm_T0, SUFFIX), void, Reg, i64)
@@ -122,38 +124,63 @@ DEF_HELPER_2(glue(movq_mm_T0, SUFFIX), void, Reg, i64)
 #if SHIFT == 0
 DEF_HELPER_3(glue(pshufw, SUFFIX), void, Reg, Reg, int)
 #else
-DEF_HELPER_3(glue(shufps, SUFFIX), void, Reg, Reg, int)
-DEF_HELPER_3(glue(shufpd, SUFFIX), void, Reg, Reg, int)
 DEF_HELPER_3(glue(pshufd, SUFFIX), void, Reg, Reg, int)
 DEF_HELPER_3(glue(pshuflw, SUFFIX), void, Reg, Reg, int)
 DEF_HELPER_3(glue(pshufhw, SUFFIX), void, Reg, Reg, int)
 #endif
 
-#if SHIFT == 1
+#if SHIFT >= 1
 /* FPU ops */
 /* XXX: not accurate */
 
-#define SSE_HELPER_S(name, F)                            \
-    DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg)        \
-    DEF_HELPER_3(name ## ss, void, env, Reg, Reg)        \
-    DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg)        \
+#define SSE_HELPER_P4(name)                                             \
+    DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg)         \
+    DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg)
+
+#define SSE_HELPER_P3(name, ...)                                        \
+    DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg)         \
+    DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg)
+
+#if SHIFT == 1
+#define SSE_HELPER_S4(name)                                             \
+    SSE_HELPER_P4(name)                                                 \
+    DEF_HELPER_3(name ## ss, void, env, Reg, Reg)                       \
     DEF_HELPER_3(name ## sd, void, env, Reg, Reg)
+#define SSE_HELPER_S3(name)                                             \
+    SSE_HELPER_P3(name)                                                 \
+    DEF_HELPER_3(name ## ss, void, env, Reg, Reg)                       \
+    DEF_HELPER_3(name ## sd, void, env, Reg, Reg)
+#else
+#define SSE_HELPER_S4(name, ...) SSE_HELPER_P4(name)
+#define SSE_HELPER_S3(name, ...) SSE_HELPER_P3(name)
+#endif
 
-SSE_HELPER_S(add, FPU_ADD)
-SSE_HELPER_S(sub, FPU_SUB)
-SSE_HELPER_S(mul, FPU_MUL)
-SSE_HELPER_S(div, FPU_DIV)
-SSE_HELPER_S(min, FPU_MIN)
-SSE_HELPER_S(max, FPU_MAX)
-SSE_HELPER_S(sqrt, FPU_SQRT)
+DEF_HELPER_3(glue(shufps, SUFFIX), void, Reg, Reg, int)
+DEF_HELPER_3(glue(shufpd, SUFFIX), void, Reg, Reg, int)
 
+SSE_HELPER_S4(add)
+SSE_HELPER_S4(sub)
+SSE_HELPER_S4(mul)
+SSE_HELPER_S4(div)
+SSE_HELPER_S4(min)
+SSE_HELPER_S4(max)
+
+SSE_HELPER_S3(sqrt)
 
 DEF_HELPER_3(glue(cvtps2pd, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(cvtpd2ps, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(cvtss2sd, void, env, Reg, Reg)
-DEF_HELPER_3(cvtsd2ss, void, env, Reg, Reg)
 DEF_HELPER_3(glue(cvtdq2ps, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(cvtdq2pd, SUFFIX), void, env, Reg, Reg)
+
+DEF_HELPER_3(glue(cvtps2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
+DEF_HELPER_3(glue(cvtpd2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
+
+DEF_HELPER_3(glue(cvttps2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
+DEF_HELPER_3(glue(cvttpd2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
+
+#if SHIFT == 1
+DEF_HELPER_3(cvtss2sd, void, env, Reg, Reg)
+DEF_HELPER_3(cvtsd2ss, void, env, Reg, Reg)
 DEF_HELPER_3(cvtpi2ps, void, env, ZMMReg, MMXReg)
 DEF_HELPER_3(cvtpi2pd, void, env, ZMMReg, MMXReg)
 DEF_HELPER_3(cvtsi2ss, void, env, ZMMReg, i32)
@@ -164,8 +191,6 @@ DEF_HELPER_3(cvtsq2ss, void, env, ZMMReg, i64)
 DEF_HELPER_3(cvtsq2sd, void, env, ZMMReg, i64)
 #endif
 
-DEF_HELPER_3(glue(cvtps2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(cvtpd2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(cvtps2pi, void, env, MMXReg, ZMMReg)
 DEF_HELPER_3(cvtpd2pi, void, env, MMXReg, ZMMReg)
 DEF_HELPER_2(cvtss2si, s32, env, ZMMReg)
@@ -175,8 +200,6 @@ DEF_HELPER_2(cvtss2sq, s64, env, ZMMReg)
 DEF_HELPER_2(cvtsd2sq, s64, env, ZMMReg)
 #endif
 
-DEF_HELPER_3(glue(cvttps2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(cvttpd2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(cvttps2pi, void, env, MMXReg, ZMMReg)
 DEF_HELPER_3(cvttpd2pi, void, env, MMXReg, ZMMReg)
 DEF_HELPER_2(cvttss2si, s32, env, ZMMReg)
@@ -185,27 +208,24 @@ DEF_HELPER_2(cvttsd2si, s32, env, ZMMReg)
 DEF_HELPER_2(cvttss2sq, s64, env, ZMMReg)
 DEF_HELPER_2(cvttsd2sq, s64, env, ZMMReg)
 #endif
+#endif
 
 DEF_HELPER_3(glue(rsqrtps, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(rsqrtss, void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(glue(rcpps, SUFFIX), void, env, ZMMReg, ZMMReg)
+#if SHIFT == 1
+DEF_HELPER_3(rsqrtss, void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(rcpss, void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(extrq_r, void, env, ZMMReg, ZMMReg)
 DEF_HELPER_4(extrq_i, void, env, ZMMReg, int, int)
 DEF_HELPER_3(insertq_r, void, env, ZMMReg, ZMMReg)
 DEF_HELPER_4(insertq_i, void, env, ZMMReg, int, int)
-DEF_HELPER_3(glue(haddps, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(haddpd, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(hsubps, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(hsubpd, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(addsubps, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(addsubpd, SUFFIX), void, env, ZMMReg, ZMMReg)
+#endif
 
-#define SSE_HELPER_CMP(name, F, C)                              \
-    DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg) \
-    DEF_HELPER_3(name ## ss, void, env, Reg, Reg)               \
-    DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg) \
-    DEF_HELPER_3(name ## sd, void, env, Reg, Reg)
+SSE_HELPER_P4(hadd)
+SSE_HELPER_P4(hsub)
+SSE_HELPER_P4(addsub)
+
+#define SSE_HELPER_CMP(name, F, C) SSE_HELPER_S4(name)
 
 SSE_HELPER_CMP(cmpeq, FPU_CMPQ, FPU_EQ)
 SSE_HELPER_CMP(cmplt, FPU_CMPS, FPU_LT)
@@ -216,10 +236,13 @@ SSE_HELPER_CMP(cmpnlt, FPU_CMPS, !FPU_LT)
 SSE_HELPER_CMP(cmpnle, FPU_CMPS, !FPU_LE)
 SSE_HELPER_CMP(cmpord, FPU_CMPQ, !FPU_UNORD)
 
+#if SHIFT == 1
 DEF_HELPER_3(ucomiss, void, env, Reg, Reg)
 DEF_HELPER_3(comiss, void, env, Reg, Reg)
 DEF_HELPER_3(ucomisd, void, env, Reg, Reg)
 DEF_HELPER_3(comisd, void, env, Reg, Reg)
+#endif
+
 DEF_HELPER_2(glue(movmskps, SUFFIX), i32, env, Reg)
 DEF_HELPER_2(glue(movmskpd, SUFFIX), i32, env, Reg)
 #endif
@@ -236,7 +259,7 @@ DEF_HELPER_3(glue(packssdw, SUFFIX), void, env, Reg, Reg)
 UNPCK_OP(l, 0)
 UNPCK_OP(h, 1)
 
-#if SHIFT == 1
+#if SHIFT >= 1
 DEF_HELPER_3(glue(punpcklqdq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(punpckhqdq, SUFFIX), void, env, Reg, Reg)
 #endif
@@ -283,7 +306,7 @@ DEF_HELPER_3(glue(psignd, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_4(glue(palignr, SUFFIX), void, env, Reg, Reg, s32)
 
 /* SSE4.1 op helpers */
-#if SHIFT == 1
+#if SHIFT >= 1
 DEF_HELPER_3(glue(pblendvb, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(blendvps, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(blendvpd, SUFFIX), void, env, Reg, Reg)
@@ -312,22 +335,30 @@ DEF_HELPER_3(glue(pmaxsd, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmaxuw, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmaxud, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmulld, SUFFIX), void, env, Reg, Reg)
+#if SHIFT == 1
 DEF_HELPER_3(glue(phminposuw, SUFFIX), void, env, Reg, Reg)
+#endif
 DEF_HELPER_4(glue(roundps, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(roundpd, SUFFIX), void, env, Reg, Reg, i32)
+#if SHIFT == 1
 DEF_HELPER_4(glue(roundss, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(roundsd, SUFFIX), void, env, Reg, Reg, i32)
+#endif
 DEF_HELPER_4(glue(blendps, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(blendpd, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(pblendw, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(dpps, SUFFIX), void, env, Reg, Reg, i32)
+#if SHIFT == 1
 DEF_HELPER_4(glue(dppd, SUFFIX), void, env, Reg, Reg, i32)
+#endif
 DEF_HELPER_4(glue(mpsadbw, SUFFIX), void, env, Reg, Reg, i32)
 #endif
 
 /* SSE4.2 op helpers */
-#if SHIFT == 1
+#if SHIFT >= 1
 DEF_HELPER_3(glue(pcmpgtq, SUFFIX), void, env, Reg, Reg)
+#endif
+#if SHIFT == 1
 DEF_HELPER_4(glue(pcmpestri, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(pcmpestrm, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(pcmpistri, SUFFIX), void, env, Reg, Reg, i32)
@@ -336,13 +367,15 @@ DEF_HELPER_3(crc32, tl, i32, tl, i32)
 #endif
 
 /* AES-NI op helpers */
-#if SHIFT == 1
+#if SHIFT >= 1
 DEF_HELPER_3(glue(aesdec, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(aesdeclast, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(aesenc, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(aesenclast, SUFFIX), void, env, Reg, Reg)
+#if SHIFT == 1
 DEF_HELPER_3(glue(aesimc, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_4(glue(aeskeygenassist, SUFFIX), void, env, Reg, Reg, i32)
+#endif
 DEF_HELPER_4(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, i32)
 #endif
 
@@ -354,6 +387,9 @@ DEF_HELPER_4(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, i32)
 #undef SSE_HELPER_W
 #undef SSE_HELPER_L
 #undef SSE_HELPER_Q
-#undef SSE_HELPER_S
+#undef SSE_HELPER_S3
+#undef SSE_HELPER_S4
+#undef SSE_HELPER_P3
+#undef SSE_HELPER_P4
 #undef SSE_HELPER_CMP
 #undef UNPCK_OP
-- 
2.37.2

* [PATCH 15/37] target/i386: extend helpers to support VEX.V 3- and 4-operand encodings
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (13 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 14/37] target/i386: Prepare ops_sse_header.h for 256 bit AVX Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 11:11   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 16/37] target/i386: support operand merging in binary scalar helpers Paolo Bonzini
                   ` (21 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

Add to the helpers all the operands that are needed to implement AVX: the
second source register selected by VEX.V and, for the variable blends, an
explicit mask register in place of the implicit xmm0.

Extracted from a patch by Paul Brook <paul@nowt.org>.
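
The shape of the change, shown here on psrlw (both signatures are from the
first hunk below; the bodies are elided):

    /* before: two operands, the first source silently aliased to the
     * destination */
    void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
    {
        Reg *s = d;
        ...
    }

    /* after: the VEX.V source is a real parameter; presumably the non-VEX
     * SSE paths keep their old semantics by passing the same register for
     * both d and s */
    void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
    {
        ...
    }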

Message-Id: <20220424220204.2493824-26-paul@nowt.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/ops_sse.h        | 173 +++++++++++++--------------------
 target/i386/ops_sse_header.h | 149 ++++++++++++++--------------
 target/i386/tcg/translate.c  | 181 ++++++++++++++++++++++++-----------
 3 files changed, 265 insertions(+), 238 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index c0766de18d..fb8733f509 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -48,9 +48,8 @@
 #define FPSLL(x, c) ((x) << shift)
 #endif
 
-void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift;
     if (c->Q(0) > 15) {
         for (int i = 0; i < 1 << SHIFT; i++) {
@@ -64,9 +63,8 @@ void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 }
 
-void glue(helper_psllw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psllw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift;
     if (c->Q(0) > 15) {
         for (int i = 0; i < 1 << SHIFT; i++) {
@@ -80,9 +78,8 @@ void glue(helper_psllw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 }
 
-void glue(helper_psraw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psraw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift;
     if (c->Q(0) > 15) {
         shift = 15;
@@ -94,9 +91,8 @@ void glue(helper_psraw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 }
 
-void glue(helper_psrld, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrld, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift;
     if (c->Q(0) > 31) {
         for (int i = 0; i < 1 << SHIFT; i++) {
@@ -110,9 +106,8 @@ void glue(helper_psrld, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 }
 
-void glue(helper_pslld, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_pslld, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift;
     if (c->Q(0) > 31) {
         for (int i = 0; i < 1 << SHIFT; i++) {
@@ -126,9 +121,8 @@ void glue(helper_pslld, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 }
 
-void glue(helper_psrad, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrad, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift;
     if (c->Q(0) > 31) {
         shift = 31;
@@ -140,9 +134,8 @@ void glue(helper_psrad, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 }
 
-void glue(helper_psrlq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrlq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift;
     if (c->Q(0) > 63) {
         for (int i = 0; i < 1 << SHIFT; i++) {
@@ -156,9 +149,8 @@ void glue(helper_psrlq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 }
 
-void glue(helper_psllq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psllq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift;
     if (c->Q(0) > 63) {
         for (int i = 0; i < 1 << SHIFT; i++) {
@@ -173,9 +165,8 @@ void glue(helper_psllq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
 }
 
 #if SHIFT >= 1
-void glue(helper_psrldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift, i, j;
 
     shift = c->L(0);
@@ -192,9 +183,8 @@ void glue(helper_psrldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 }
 
-void glue(helper_pslldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_pslldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-    Reg *s = d;
     int shift, i, j;
 
     shift = c->L(0);
@@ -222,9 +212,8 @@ void glue(helper_pslldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
     }
 
 #define SSE_HELPER_2(name, elem, num, F)                        \
-    void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)   \
+    void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)   \
     {                                                           \
-        Reg *v = d;                                             \
         int n = num;                                            \
         for (int i = 0; i < n; i++) {                           \
             d->elem(i) = F(v->elem(i), s->elem(i));             \
@@ -362,18 +351,24 @@ SSE_HELPER_W(helper_pcmpeqw, FCMPEQ)
 SSE_HELPER_L(helper_pcmpeql, FCMPEQ)
 
 SSE_HELPER_W(helper_pmullw, FMULLW)
-#if SHIFT == 0
-SSE_HELPER_W(helper_pmulhrw, FMULHRW)
-#endif
 SSE_HELPER_W(helper_pmulhuw, FMULHUW)
 SSE_HELPER_W(helper_pmulhw, FMULHW)
 
+#if SHIFT == 0
+void glue(helper_pmulhrw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+    d->W(0) = FMULHRW(d->W(0), s->W(0));
+    d->W(1) = FMULHRW(d->W(1), s->W(1));
+    d->W(2) = FMULHRW(d->W(2), s->W(2));
+    d->W(3) = FMULHRW(d->W(3), s->W(3));
+}
+#endif
+
 SSE_HELPER_B(helper_pavgb, FAVG)
 SSE_HELPER_W(helper_pavgw, FAVG)
 
-void glue(helper_pmuludq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_pmuludq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     int i;
 
     for (i = 0; i < (1 << SHIFT); i++) {
@@ -381,9 +376,8 @@ void glue(helper_pmuludq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     }
 }
 
-void glue(helper_pmaddwd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_pmaddwd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     int i;
 
     for (i = 0; i < (2 << SHIFT); i++) {
@@ -402,10 +396,8 @@ static inline int abs1(int a)
     }
 }
 #endif
-
-void glue(helper_psadbw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_psadbw, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     int i;
 
     for (i = 0; i < (1 << SHIFT); i++) {
@@ -478,9 +470,8 @@ void glue(helper_pshufw, SUFFIX)(Reg *d, Reg *s, int order)
     SHUFFLE4(W, s, s, 0);
 }
 #else
-void glue(helper_shufps, SUFFIX)(Reg *d, Reg *s, int order)
+void glue(helper_shufps, SUFFIX)(Reg *d, Reg *v, Reg *s, int order)
 {
-    Reg *v = d;
     uint32_t r0, r1, r2, r3;
     int i;
 
@@ -489,9 +480,8 @@ void glue(helper_shufps, SUFFIX)(Reg *d, Reg *s, int order)
     }
 }
 
-void glue(helper_shufpd, SUFFIX)(Reg *d, Reg *s, int order)
+void glue(helper_shufpd, SUFFIX)(Reg *d, Reg *v, Reg *s, int order)
 {
-    Reg *v = d;
     uint64_t r0, r1;
     int i;
 
@@ -543,9 +533,8 @@ void glue(helper_pshufhw, SUFFIX)(Reg *d, Reg *s, int order)
 
 #define SSE_HELPER_P(name, F)                                           \
     void glue(helper_ ## name ## ps, SUFFIX)(CPUX86State *env,          \
-            Reg *d, Reg *s)                                             \
+            Reg *d, Reg *v, Reg *s)                                     \
     {                                                                   \
-        Reg *v = d;                                                     \
         int i;                                                          \
         for (i = 0; i < 2 << SHIFT; i++) {                              \
             d->ZMM_S(i) = F(32, v->ZMM_S(i), s->ZMM_S(i));              \
@@ -553,9 +542,8 @@ void glue(helper_pshufhw, SUFFIX)(Reg *d, Reg *s, int order)
     }                                                                   \
                                                                         \
     void glue(helper_ ## name ## pd, SUFFIX)(CPUX86State *env,          \
-            Reg *d, Reg *s)                                     \
+            Reg *d, Reg *v, Reg *s)                                     \
     {                                                                   \
-        Reg *v = d;                                                     \
         int i;                                                          \
         for (i = 0; i < 1 << SHIFT; i++) {                              \
             d->ZMM_D(i) = F(64, v->ZMM_D(i), s->ZMM_D(i));              \
@@ -567,15 +555,13 @@ void glue(helper_pshufhw, SUFFIX)(Reg *d, Reg *s, int order)
 #define SSE_HELPER_S(name, F)                                           \
     SSE_HELPER_P(name, F)                                               \
                                                                         \
-    void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *s)\
+    void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *v, Reg *s)\
     {                                                                   \
-        Reg *v = d;                                                     \
         d->ZMM_S(0) = F(32, v->ZMM_S(0), s->ZMM_S(0));                  \
     }                                                                   \
                                                                         \
-    void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *s)\
+    void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *v, Reg *s)\
     {                                                                   \
-        Reg *v = d;                                                     \
         d->ZMM_D(0) = F(64, v->ZMM_D(0), s->ZMM_D(0));                  \
     }
 
@@ -958,9 +944,8 @@ void helper_insertq_i(CPUX86State *env, ZMMReg *d, int index, int length)
 #endif
 
 #define SSE_HELPER_HPS(name, F)  \
-void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s) \
+void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s) \
 {                                                                 \
-    Reg *v = d;                                                   \
     float32 r[2 << SHIFT];                                        \
     int i, j, k;                                                  \
     for (k = 0; k < 2 << SHIFT; k += LANE_WIDTH / 4) {            \
@@ -980,9 +965,8 @@ SSE_HELPER_HPS(haddps, float32_add)
 SSE_HELPER_HPS(hsubps, float32_sub)
 
 #define SSE_HELPER_HPD(name, F)  \
-void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s) \
+void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s) \
 {                                                                 \
-    Reg *v = d;                                                   \
     float64 r[1 << SHIFT];                                        \
     int i, j, k;                                                  \
     for (k = 0; k < 1 << SHIFT; k += LANE_WIDTH / 8) {            \
@@ -1001,9 +985,8 @@ void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s) \
 SSE_HELPER_HPD(haddpd, float64_add)
 SSE_HELPER_HPD(hsubpd, float64_sub)
 
-void glue(helper_addsubps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_addsubps, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     int i;
     for (i = 0; i < 2 << SHIFT; i += 2) {
         d->ZMM_S(i) = float32_sub(v->ZMM_S(i), s->ZMM_S(i), &env->sse_status);
@@ -1011,9 +994,8 @@ void glue(helper_addsubps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     }
 }
 
-void glue(helper_addsubpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_addsubpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     int i;
     for (i = 0; i < 1 << SHIFT; i += 2) {
         d->ZMM_D(i) = float64_sub(v->ZMM_D(i), s->ZMM_D(i), &env->sse_status);
@@ -1023,9 +1005,8 @@ void glue(helper_addsubpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 
 #define SSE_HELPER_CMP_P(name, F, C)                                    \
     void glue(helper_ ## name ## ps, SUFFIX)(CPUX86State *env,          \
-                                             Reg *d, Reg *s)    \
+                                             Reg *d, Reg *v, Reg *s)    \
     {                                                                   \
-        Reg *v = d;                                                     \
         int i;                                                          \
         for (i = 0; i < 2 << SHIFT; i++) {                              \
             d->ZMM_L(i) = C(F(32, v->ZMM_S(i), s->ZMM_S(i))) ? -1 : 0;  \
@@ -1033,9 +1014,8 @@ void glue(helper_addsubpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     }                                                                   \
                                                                         \
     void glue(helper_ ## name ## pd, SUFFIX)(CPUX86State *env,          \
-                                             Reg *d, Reg *s)    \
+                                             Reg *d, Reg *v, Reg *s)    \
     {                                                                   \
-        Reg *v = d;                                                     \
         int i;                                                          \
         for (i = 0; i < 1 << SHIFT; i++) {                              \
             d->ZMM_Q(i) = C(F(64, v->ZMM_D(i), s->ZMM_D(i))) ? -1 : 0;  \
@@ -1045,15 +1025,13 @@ void glue(helper_addsubpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 #if SHIFT == 1
 #define SSE_HELPER_CMP(name, F, C)                                          \
     SSE_HELPER_CMP_P(name, F, C)                                            \
-    void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *s)    \
+    void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *v, Reg *s)    \
     {                                                                       \
-        Reg *v = d;                                                         \
         d->ZMM_L(0) = C(F(32, v->ZMM_S(0), s->ZMM_S(0))) ? -1 : 0;          \
     }                                                                       \
                                                                             \
-    void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *s)    \
+    void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *v, Reg *s)    \
     {                                                                       \
-        Reg *v = d;                                                         \
         d->ZMM_Q(0) = C(F(64, v->ZMM_D(0), s->ZMM_D(0))) ? -1 : 0;          \
     }
 
@@ -1179,9 +1157,8 @@ uint32_t glue(helper_pmovmskb, SUFFIX)(CPUX86State *env, Reg *s)
 
 #define PACK_HELPER_B(name, F) \
 void glue(helper_pack ## name, SUFFIX)(CPUX86State *env,      \
-        Reg *d, Reg *s)                                       \
+        Reg *d, Reg *v, Reg *s)                               \
 {                                                             \
-    Reg *v = d;                                               \
     uint8_t r[PACK_WIDTH * 2];                                \
     int j, k;                                                 \
     for (j = 0; j < 4 << SHIFT; j += PACK_WIDTH) {            \
@@ -1200,9 +1177,8 @@ void glue(helper_pack ## name, SUFFIX)(CPUX86State *env,      \
 PACK_HELPER_B(sswb, satsb)
 PACK_HELPER_B(uswb, satub)
 
-void glue(helper_packssdw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_packssdw, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     uint16_t r[PACK_WIDTH];
     int j, k;
 
@@ -1222,9 +1198,8 @@ void glue(helper_packssdw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 #define UNPCK_OP(base_name, base)                                    \
                                                                         \
     void glue(helper_punpck ## base_name ## bw, SUFFIX)(CPUX86State *env,\
-                                                Reg *d, Reg *s) \
+                                                Reg *d, Reg *v, Reg *s) \
     {                                                                   \
-        Reg *v = d;                                                     \
         uint8_t r[PACK_WIDTH * 2];                                      \
         int j, i;                                                       \
                                                                         \
@@ -1241,9 +1216,8 @@ void glue(helper_packssdw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     }                                                                   \
                                                                         \
     void glue(helper_punpck ## base_name ## wd, SUFFIX)(CPUX86State *env,\
-                                                Reg *d, Reg *s) \
+                                                Reg *d, Reg *v, Reg *s) \
     {                                                                   \
-        Reg *v = d;                                                     \
         uint16_t r[PACK_WIDTH];                                         \
         int j, i;                                                       \
                                                                         \
@@ -1260,9 +1234,8 @@ void glue(helper_packssdw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     }                                                                   \
                                                                         \
     void glue(helper_punpck ## base_name ## dq, SUFFIX)(CPUX86State *env,\
-                                                Reg *d, Reg *s) \
+                                                Reg *d, Reg *v, Reg *s) \
     {                                                                   \
-        Reg *v = d;                                                     \
         uint32_t r[PACK_WIDTH / 2];                                     \
         int j, i;                                                       \
                                                                         \
@@ -1280,9 +1253,8 @@ void glue(helper_packssdw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
                                                                         \
     XMM_ONLY(                                                           \
              void glue(helper_punpck ## base_name ## qdq, SUFFIX)(      \
-                        CPUX86State *env, Reg *d, Reg *s)       \
+                        CPUX86State *env, Reg *d, Reg *v, Reg *s)       \
              {                                                          \
-                 Reg *v = d;                                            \
                  uint64_t r[2];                                         \
                  int i;                                                 \
                                                                         \
@@ -1453,9 +1425,8 @@ void helper_pswapd(CPUX86State *env, MMXReg *d, MMXReg *s)
 #endif
 
 /* SSSE3 op helpers */
-void glue(helper_pshufb, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_pshufb, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     int i;
 #if SHIFT == 0
     uint8_t r[8];
@@ -1480,9 +1451,8 @@ void glue(helper_pshufb, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 }
 
 #define SSE_HELPER_HW(name, F)  \
-void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s) \
+void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s) \
 {                                                          \
-    Reg *v = d;                                            \
     uint16_t r[4 << SHIFT];                                \
     int i, j, k;                                           \
     for (k = 0; k < 4 << SHIFT; k += LANE_WIDTH / 2) {     \
@@ -1499,9 +1469,8 @@ void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s) \
 }
 
 #define SSE_HELPER_HL(name, F)  \
-void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s) \
+void glue(helper_ ## name, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s) \
 {                                                          \
-    Reg *v = d;                                            \
     uint32_t r[2 << SHIFT];                                \
     int i, j, k;                                           \
     for (k = 0; k < 2 << SHIFT; k += LANE_WIDTH / 4) {     \
@@ -1527,9 +1496,8 @@ SSE_HELPER_HL(phsubd, FSUB)
 #undef SSE_HELPER_HW
 #undef SSE_HELPER_HL
 
-void glue(helper_pmaddubsw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_pmaddubsw, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     int i;
     for (i = 0; i < 4 << SHIFT; i++) {
         d->W(i) = satsw((int8_t)s->B(i * 2) * (uint8_t)v->B(i * 2) +
@@ -1554,10 +1522,9 @@ SSE_HELPER_B(helper_psignb, FSIGNB)
 SSE_HELPER_W(helper_psignw, FSIGNW)
 SSE_HELPER_L(helper_psignd, FSIGNL)
 
-void glue(helper_palignr, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
+void glue(helper_palignr, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
                                   int32_t shift)
 {
-    Reg *v = d;
     int i;
 
     /* XXX could be checked during translation */
@@ -1594,10 +1561,9 @@ void glue(helper_palignr, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
 #if SHIFT >= 1
 
 #define SSE_HELPER_V(name, elem, num, F)                                \
-    void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)   \
+    void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,   \
+                            Reg *m)                                     \
     {                                                                   \
-        Reg *v = d;                                                     \
-        Reg *m = &env->xmm_regs[0];                                     \
         int i;                                                          \
         for (i = 0; i < num; i++) {                                     \
             d->elem(i) = F(v->elem(i), s->elem(i), m->elem(i));         \
@@ -1605,10 +1571,9 @@ void glue(helper_palignr, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
     }
 
 #define SSE_HELPER_I(name, elem, num, F)                                \
-    void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,   \
+    void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,   \
                             uint32_t imm)                               \
     {                                                                   \
-        Reg *v = d;                                                     \
         int i;                                                          \
         for (i = 0; i < num; i++) {                                     \
             int j = i & 7;                                              \
@@ -1660,9 +1625,8 @@ SSE_HELPER_F(helper_pmovzxwq, Q, 1 << SHIFT, s->W)
 SSE_HELPER_F(helper_pmovzxdq, Q, 1 << SHIFT, s->L)
 #endif
 
-void glue(helper_pmuldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_pmuldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     int i;
 
     for (i = 0; i < 1 << SHIFT; i++) {
@@ -1673,9 +1637,8 @@ void glue(helper_pmuldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 #define FCMPEQQ(d, s) (d == s ? -1 : 0)
 SSE_HELPER_Q(helper_pcmpeqq, FCMPEQQ)
 
-void glue(helper_packusdw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_packusdw, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
-    Reg *v = d;
     uint16_t r[8];
     int i, j, k;
 
@@ -1893,10 +1856,9 @@ SSE_HELPER_I(helper_blendps, L, 2 << SHIFT, FBLENDP)
 SSE_HELPER_I(helper_blendpd, Q, 1 << SHIFT, FBLENDP)
 SSE_HELPER_I(helper_pblendw, W, 4 << SHIFT, FBLENDP)
 
-void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
+void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
                                uint32_t mask)
 {
-    Reg *v = d;
     float32 prod1, prod2, temp2, temp3, temp4;
     int i;
 
@@ -1939,9 +1901,8 @@ void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
 #if SHIFT == 1
 /* Oddly, there is no ymm version of dppd */
 void glue(helper_dppd, SUFFIX)(CPUX86State *env,
-                               Reg *d, Reg *s, uint32_t mask)
+                               Reg *d, Reg *v, Reg *s, uint32_t mask)
 {
-    Reg *v = d;
     float64 prod1, prod2, temp2;
 
     if (mask & (1 << 4)) {
@@ -1960,10 +1921,9 @@ void glue(helper_dppd, SUFFIX)(CPUX86State *env,
 }
 #endif
 
-void glue(helper_mpsadbw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
+void glue(helper_mpsadbw, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
                                   uint32_t offset)
 {
-    Reg *v = d;
     int i, j;
     uint16_t r[8];
 
@@ -2236,10 +2196,9 @@ static void clmulq(uint64_t *dest_l, uint64_t *dest_h,
 }
 #endif
 
-void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
+void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
                                     uint32_t ctrl)
 {
-    Reg *v = d;
     uint64_t a, b;
     int i;
 
@@ -2250,10 +2209,10 @@ void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
     }
 }
 
-void glue(helper_aesdec, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_aesdec, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
     int i;
-    Reg st = *d;
+    Reg st = *v;
     Reg rk = *s;
 
     for (i = 0 ; i < 2 << SHIFT ; i++) {
@@ -2265,10 +2224,10 @@ void glue(helper_aesdec, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     }
 }
 
-void glue(helper_aesdeclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_aesdeclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
     int i;
-    Reg st = *d;
+    Reg st = *v;
     Reg rk = *s;
 
     for (i = 0; i < 8 << SHIFT; i++) {
@@ -2276,10 +2235,10 @@ void glue(helper_aesdeclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     }
 }
 
-void glue(helper_aesenc, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_aesenc, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
     int i;
-    Reg st = *d;
+    Reg st = *v;
     Reg rk = *s;
 
     for (i = 0 ; i < 2 << SHIFT ; i++) {
@@ -2291,10 +2250,10 @@ void glue(helper_aesenc, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     }
 }
 
-void glue(helper_aesenclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_aesenclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
     int i;
-    Reg st = *d;
+    Reg st = *v;
     Reg rk = *s;
 
     for (i = 0; i < 8 << SHIFT; i++) {
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 7f57dab496..21fed7fa05 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -34,31 +34,31 @@
 #define dh_typecode_ZMMReg dh_typecode_ptr
 #define dh_typecode_MMXReg dh_typecode_ptr
 
-DEF_HELPER_3(glue(psrlw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psraw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psllw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psrld, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psrad, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pslld, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psrlq, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psllq, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(psrlw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psraw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psllw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psrld, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psrad, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pslld, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psrlq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psllq, SUFFIX), void, env, Reg, Reg, Reg)
 
 #if SHIFT >= 1
-DEF_HELPER_3(glue(psrldq, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pslldq, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(psrldq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pslldq, SUFFIX), void, env, Reg, Reg, Reg)
 #endif
 
 #define SSE_HELPER_B(name, F)\
-    DEF_HELPER_3(glue(name, SUFFIX), void, env, Reg, Reg)
+    DEF_HELPER_4(glue(name, SUFFIX), void, env, Reg, Reg, Reg)
 
 #define SSE_HELPER_W(name, F)\
-    DEF_HELPER_3(glue(name, SUFFIX), void, env, Reg, Reg)
+    DEF_HELPER_4(glue(name, SUFFIX), void, env, Reg, Reg, Reg)
 
 #define SSE_HELPER_L(name, F)\
-    DEF_HELPER_3(glue(name, SUFFIX), void, env, Reg, Reg)
+    DEF_HELPER_4(glue(name, SUFFIX), void, env, Reg, Reg, Reg)
 
 #define SSE_HELPER_Q(name, F)\
-    DEF_HELPER_3(glue(name, SUFFIX), void, env, Reg, Reg)
+    DEF_HELPER_4(glue(name, SUFFIX), void, env, Reg, Reg, Reg)
 
 SSE_HELPER_B(paddb, FADD)
 SSE_HELPER_W(paddw, FADD)
@@ -109,10 +109,10 @@ SSE_HELPER_W(pmulhw, FMULHW)
 SSE_HELPER_B(pavgb, FAVG)
 SSE_HELPER_W(pavgw, FAVG)
 
-DEF_HELPER_3(glue(pmuludq, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmaddwd, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(pmuludq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pmaddwd, SUFFIX), void, env, Reg, Reg, Reg)
 
-DEF_HELPER_3(glue(psadbw, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(psadbw, SUFFIX), void, env, Reg, Reg, Reg)
 #if SHIFT < 2
 DEF_HELPER_4(glue(maskmov, SUFFIX), void, env, Reg, Reg, tl)
 #endif
@@ -134,8 +134,8 @@ DEF_HELPER_3(glue(pshufhw, SUFFIX), void, Reg, Reg, int)
 /* XXX: not accurate */
 
 #define SSE_HELPER_P4(name)                                             \
-    DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg)         \
-    DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg)
+    DEF_HELPER_4(glue(name ## ps, SUFFIX), void, env, Reg, Reg, Reg)    \
+    DEF_HELPER_4(glue(name ## pd, SUFFIX), void, env, Reg, Reg, Reg)
 
 #define SSE_HELPER_P3(name, ...)                                        \
     DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg)         \
@@ -144,8 +144,8 @@ DEF_HELPER_3(glue(pshufhw, SUFFIX), void, Reg, Reg, int)
 #if SHIFT == 1
 #define SSE_HELPER_S4(name)                                             \
     SSE_HELPER_P4(name)                                                 \
-    DEF_HELPER_3(name ## ss, void, env, Reg, Reg)                       \
-    DEF_HELPER_3(name ## sd, void, env, Reg, Reg)
+    DEF_HELPER_4(name ## ss, void, env, Reg, Reg, Reg)                  \
+    DEF_HELPER_4(name ## sd, void, env, Reg, Reg, Reg)
 #define SSE_HELPER_S3(name)                                             \
     SSE_HELPER_P3(name)                                                 \
     DEF_HELPER_3(name ## ss, void, env, Reg, Reg)                       \
@@ -155,8 +155,8 @@ DEF_HELPER_3(glue(pshufhw, SUFFIX), void, Reg, Reg, int)
 #define SSE_HELPER_S3(name, ...) SSE_HELPER_P3(name)
 #endif
 
-DEF_HELPER_3(glue(shufps, SUFFIX), void, Reg, Reg, int)
-DEF_HELPER_3(glue(shufpd, SUFFIX), void, Reg, Reg, int)
+DEF_HELPER_4(glue(shufps, SUFFIX), void, Reg, Reg, Reg, int)
+DEF_HELPER_4(glue(shufpd, SUFFIX), void, Reg, Reg, Reg, int)
 
 SSE_HELPER_S4(add)
 SSE_HELPER_S4(sub)
@@ -212,6 +212,7 @@ DEF_HELPER_2(cvttsd2sq, s64, env, ZMMReg)
 
 DEF_HELPER_3(glue(rsqrtps, SUFFIX), void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(glue(rcpps, SUFFIX), void, env, ZMMReg, ZMMReg)
+
 #if SHIFT == 1
 DEF_HELPER_3(rsqrtss, void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(rcpss, void, env, ZMMReg, ZMMReg)
@@ -248,20 +249,20 @@ DEF_HELPER_2(glue(movmskpd, SUFFIX), i32, env, Reg)
 #endif
 
 DEF_HELPER_2(glue(pmovmskb, SUFFIX), i32, env, Reg)
-DEF_HELPER_3(glue(packsswb, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(packuswb, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(packssdw, SUFFIX), void, env, Reg, Reg)
-#define UNPCK_OP(base_name, base)                                       \
-    DEF_HELPER_3(glue(punpck ## base_name ## bw, SUFFIX), void, env, Reg, Reg) \
-    DEF_HELPER_3(glue(punpck ## base_name ## wd, SUFFIX), void, env, Reg, Reg) \
-    DEF_HELPER_3(glue(punpck ## base_name ## dq, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(packsswb, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(packuswb, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(packssdw, SUFFIX), void, env, Reg, Reg, Reg)
+#define UNPCK_OP(base_name, base)                                       \
+    DEF_HELPER_4(glue(punpck ## base_name ## bw, SUFFIX), void, env, Reg, Reg, Reg) \
+    DEF_HELPER_4(glue(punpck ## base_name ## wd, SUFFIX), void, env, Reg, Reg, Reg) \
+    DEF_HELPER_4(glue(punpck ## base_name ## dq, SUFFIX), void, env, Reg, Reg, Reg)
 
 UNPCK_OP(l, 0)
 UNPCK_OP(h, 1)
 
 #if SHIFT >= 1
-DEF_HELPER_3(glue(punpcklqdq, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(punpckhqdq, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(punpcklqdq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(punpckhqdq, SUFFIX), void, env, Reg, Reg, Reg)
 #endif
 
 /* 3DNow! float ops */
@@ -288,28 +289,28 @@ DEF_HELPER_3(pswapd, void, env, MMXReg, MMXReg)
 #endif
 
 /* SSSE3 op helpers */
-DEF_HELPER_3(glue(phaddw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(phaddd, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(phaddsw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(phsubw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(phsubd, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(phsubsw, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(phaddw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(phaddd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(phaddsw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(phsubw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(phsubd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(phsubsw, SUFFIX), void, env, Reg, Reg, Reg)
 DEF_HELPER_3(glue(pabsb, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pabsw, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pabsd, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmaddubsw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmulhrsw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pshufb, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psignb, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psignw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(psignd, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_4(glue(palignr, SUFFIX), void, env, Reg, Reg, s32)
+DEF_HELPER_4(glue(pmaddubsw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pmulhrsw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pshufb, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psignb, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psignw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(psignd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_5(glue(palignr, SUFFIX), void, env, Reg, Reg, Reg, s32)
 
 /* SSE4.1 op helpers */
 #if SHIFT >= 1
-DEF_HELPER_3(glue(pblendvb, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(blendvps, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(blendvpd, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_5(glue(pblendvb, SUFFIX), void, env, Reg, Reg, Reg, Reg)
+DEF_HELPER_5(glue(blendvps, SUFFIX), void, env, Reg, Reg, Reg, Reg)
+DEF_HELPER_5(glue(blendvpd, SUFFIX), void, env, Reg, Reg, Reg, Reg)
 DEF_HELPER_3(glue(ptest, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmovsxbw, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmovsxbd, SUFFIX), void, env, Reg, Reg)
@@ -323,40 +324,40 @@ DEF_HELPER_3(glue(pmovzxbq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmovzxwd, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmovzxwq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmovzxdq, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmuldq, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pcmpeqq, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(packusdw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pminsb, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pminsd, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pminuw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pminud, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmaxsb, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmaxsd, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmaxuw, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmaxud, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(pmulld, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(pmuldq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pcmpeqq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(packusdw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pminsb, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pminsd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pminuw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pminud, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pmaxsb, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pmaxsd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pmaxuw, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pmaxud, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(pmulld, SUFFIX), void, env, Reg, Reg, Reg)
 #if SHIFT == 1
 DEF_HELPER_3(glue(phminposuw, SUFFIX), void, env, Reg, Reg)
 #endif
 DEF_HELPER_4(glue(roundps, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(roundpd, SUFFIX), void, env, Reg, Reg, i32)
 #if SHIFT == 1
-DEF_HELPER_4(glue(roundss, SUFFIX), void, env, Reg, Reg, i32)
-DEF_HELPER_4(glue(roundsd, SUFFIX), void, env, Reg, Reg, i32)
+DEF_HELPER_4(roundss_xmm, void, env, Reg, Reg, i32)
+DEF_HELPER_4(roundsd_xmm, void, env, Reg, Reg, i32)
 #endif
-DEF_HELPER_4(glue(blendps, SUFFIX), void, env, Reg, Reg, i32)
-DEF_HELPER_4(glue(blendpd, SUFFIX), void, env, Reg, Reg, i32)
-DEF_HELPER_4(glue(pblendw, SUFFIX), void, env, Reg, Reg, i32)
-DEF_HELPER_4(glue(dpps, SUFFIX), void, env, Reg, Reg, i32)
+DEF_HELPER_5(glue(blendps, SUFFIX), void, env, Reg, Reg, Reg, i32)
+DEF_HELPER_5(glue(blendpd, SUFFIX), void, env, Reg, Reg, Reg, i32)
+DEF_HELPER_5(glue(pblendw, SUFFIX), void, env, Reg, Reg, Reg, i32)
+DEF_HELPER_5(glue(dpps, SUFFIX), void, env, Reg, Reg, Reg, i32)
 #if SHIFT == 1
-DEF_HELPER_4(glue(dppd, SUFFIX), void, env, Reg, Reg, i32)
+DEF_HELPER_5(glue(dppd, SUFFIX), void, env, Reg, Reg, Reg, i32)
 #endif
-DEF_HELPER_4(glue(mpsadbw, SUFFIX), void, env, Reg, Reg, i32)
+DEF_HELPER_5(glue(mpsadbw, SUFFIX), void, env, Reg, Reg, Reg, i32)
 #endif
 
 /* SSE4.2 op helpers */
 #if SHIFT >= 1
-DEF_HELPER_3(glue(pcmpgtq, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(pcmpgtq, SUFFIX), void, env, Reg, Reg, Reg)
 #endif
 #if SHIFT == 1
 DEF_HELPER_4(glue(pcmpestri, SUFFIX), void, env, Reg, Reg, i32)
@@ -368,15 +369,15 @@ DEF_HELPER_3(crc32, tl, i32, tl, i32)
 
 /* AES-NI op helpers */
 #if SHIFT >= 1
-DEF_HELPER_3(glue(aesdec, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(aesdeclast, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(aesenc, SUFFIX), void, env, Reg, Reg)
-DEF_HELPER_3(glue(aesenclast, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(aesdec, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(aesdeclast, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(aesenc, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(aesenclast, SUFFIX), void, env, Reg, Reg, Reg)
 #if SHIFT == 1
 DEF_HELPER_3(glue(aesimc, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_4(glue(aeskeygenassist, SUFFIX), void, env, Reg, Reg, i32)
 #endif
-DEF_HELPER_4(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, i32)
+DEF_HELPER_5(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, Reg, i32)
 #endif
 
 #undef SHIFT
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 240811bd49..e996aab541 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -129,6 +129,7 @@ typedef struct DisasContext {
     TCGv tmp4;
     TCGv_ptr ptr0;
     TCGv_ptr ptr1;
+    TCGv_ptr ptr2;
     TCGv_i32 tmp2_i32;
     TCGv_i32 tmp3_i32;
     TCGv_i64 tmp1_i64;
@@ -2893,18 +2894,28 @@ typedef void (*SSEFunc_0_epl)(TCGv_ptr env, TCGv_ptr reg, TCGv_i64 val);
 typedef void (*SSEFunc_0_epp)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b);
 typedef void (*SSEFunc_0_eppp)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
                                TCGv_ptr reg_c);
+typedef void (*SSEFunc_0_epppp)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
+                                TCGv_ptr reg_c, TCGv_ptr reg_d);
 typedef void (*SSEFunc_0_eppi)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
                                TCGv_i32 val);
+typedef void (*SSEFunc_0_epppi)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
+                                TCGv_ptr reg_c, TCGv_i32 val);
 typedef void (*SSEFunc_0_ppi)(TCGv_ptr reg_a, TCGv_ptr reg_b, TCGv_i32 val);
+typedef void (*SSEFunc_0_pppi)(TCGv_ptr reg_a, TCGv_ptr reg_b, TCGv_ptr reg_c,
+                               TCGv_i32 val);
 typedef void (*SSEFunc_0_eppt)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
                                TCGv val);
+typedef void (*SSEFunc_0_epppt)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
+                                TCGv_ptr reg_c, TCGv val);
 
 static bool first = true; static unsigned long limit;
 #include "decode-new.h"
 #include "emit.c.inc"
 #include "decode-new.c.inc"
 
+#define SSE_OPF_V0        (1 << 0) /* vex.v must be 1111b (only 2 operands) */
 #define SSE_OPF_CMP       (1 << 1) /* does not write for first operand */
+#define SSE_OPF_BLENDV    (1 << 2) /* blendv* instruction */
 #define SSE_OPF_SPECIAL   (1 << 3) /* magic */
 #define SSE_OPF_3DNOW     (1 << 4) /* 3DNow! instruction */
 #define SSE_OPF_MMX       (1 << 5) /* MMX/integer/AVX2 instruction */
@@ -2914,10 +2925,10 @@ static bool first = true; static unsigned long limit;
 #define OP(op, flags, a, b, c, d)       \
     {flags, {{.op = a}, {.op = b}, {.op = c}, {.op = d} } }
 
-#define MMX_OP(x) OP(op1, SSE_OPF_MMX, \
+#define MMX_OP(x) OP(op2, SSE_OPF_MMX, \
         gen_helper_ ## x ## _mmx, gen_helper_ ## x ## _xmm, NULL, NULL)
 
-#define SSE_FOP(name) OP(op1, SSE_OPF_SCALAR, \
+#define SSE_FOP(name) OP(op2, SSE_OPF_SCALAR, \
         gen_helper_##name##ps##_xmm, gen_helper_##name##pd##_xmm, \
         gen_helper_##name##ss, gen_helper_##name##sd)
 #define SSE_OP(sname, dname, op, flags) OP(op, flags, \
@@ -2927,6 +2938,9 @@ typedef union SSEFuncs {
     SSEFunc_0_epp op1;
     SSEFunc_0_ppi op1i;
     SSEFunc_0_eppt op1t;
+    SSEFunc_0_eppp op2;
+    SSEFunc_0_pppi op2i;
+    SSEFunc_0_epppp op3;
 } SSEFuncs;
 
 struct SSEOpHelper_table1 {
@@ -2946,8 +2960,8 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0x11] = SSE_SPECIAL, /* movups, movupd, movss, movsd */
     [0x12] = SSE_SPECIAL, /* movlps, movlpd, movsldup, movddup */
     [0x13] = SSE_SPECIAL, /* movlps, movlpd */
-    [0x14] = SSE_OP(punpckldq, punpcklqdq, op1, 0), /* unpcklps, unpcklpd */
-    [0x15] = SSE_OP(punpckhdq, punpckhqdq, op1, 0), /* unpckhps, unpckhpd */
+    [0x14] = SSE_OP(punpckldq, punpcklqdq, op2, 0), /* unpcklps, unpcklpd */
+    [0x15] = SSE_OP(punpckhdq, punpckhqdq, op2, 0), /* unpckhps, unpckhpd */
     [0x16] = SSE_SPECIAL, /* movhps, movhpd, movshdup */
     [0x17] = SSE_SPECIAL, /* movhps, movhpd */
 
@@ -2957,28 +2971,28 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0x2b] = SSE_SPECIAL, /* movntps, movntpd, movntss, movntsd */
     [0x2c] = SSE_SPECIAL, /* cvttps2pi, cvttpd2pi, cvttsd2si, cvttss2si */
     [0x2d] = SSE_SPECIAL, /* cvtps2pi, cvtpd2pi, cvtsd2si, cvtss2si */
-    [0x2e] = OP(op1, SSE_OPF_CMP | SSE_OPF_SCALAR,
+    [0x2e] = OP(op1, SSE_OPF_CMP | SSE_OPF_SCALAR | SSE_OPF_V0,
             gen_helper_ucomiss, gen_helper_ucomisd, NULL, NULL),
-    [0x2f] = OP(op1, SSE_OPF_CMP | SSE_OPF_SCALAR,
+    [0x2f] = OP(op1, SSE_OPF_CMP | SSE_OPF_SCALAR | SSE_OPF_V0,
             gen_helper_comiss, gen_helper_comisd, NULL, NULL),
     [0x50] = SSE_SPECIAL, /* movmskps, movmskpd */
-    [0x51] = OP(op1, SSE_OPF_SCALAR,
+    [0x51] = OP(op1, SSE_OPF_SCALAR | SSE_OPF_V0,
                 gen_helper_sqrtps_xmm, gen_helper_sqrtpd_xmm,
                 gen_helper_sqrtss, gen_helper_sqrtsd),
-    [0x52] = OP(op1, SSE_OPF_SCALAR,
+    [0x52] = OP(op1, SSE_OPF_SCALAR | SSE_OPF_V0,
                 gen_helper_rsqrtps_xmm, NULL, gen_helper_rsqrtss, NULL),
-    [0x53] = OP(op1, SSE_OPF_SCALAR,
+    [0x53] = OP(op1, SSE_OPF_SCALAR | SSE_OPF_V0,
                 gen_helper_rcpps_xmm, NULL, gen_helper_rcpss, NULL),
-    [0x54] = SSE_OP(pand, pand, op1, 0), /* andps, andpd */
-    [0x55] = SSE_OP(pandn, pandn, op1, 0), /* andnps, andnpd */
-    [0x56] = SSE_OP(por, por, op1, 0), /* orps, orpd */
-    [0x57] = SSE_OP(pxor, pxor, op1, 0), /* xorps, xorpd */
+    [0x54] = SSE_OP(pand, pand, op2, 0), /* andps, andpd */
+    [0x55] = SSE_OP(pandn, pandn, op2, 0), /* andnps, andnpd */
+    [0x56] = SSE_OP(por, por, op2, 0), /* orps, orpd */
+    [0x57] = SSE_OP(pxor, pxor, op2, 0), /* xorps, xorpd */
     [0x58] = SSE_FOP(add),
     [0x59] = SSE_FOP(mul),
-    [0x5a] = OP(op1, SSE_OPF_SCALAR,
+    [0x5a] = OP(op1, SSE_OPF_SCALAR | SSE_OPF_V0,
                 gen_helper_cvtps2pd_xmm, gen_helper_cvtpd2ps_xmm,
                 gen_helper_cvtss2sd, gen_helper_cvtsd2ss),
-    [0x5b] = OP(op1, 0,
+    [0x5b] = OP(op1, SSE_OPF_V0,
                 gen_helper_cvtdq2ps_xmm, gen_helper_cvtps2dq_xmm,
                 gen_helper_cvttps2dq_xmm, NULL),
     [0x5c] = SSE_FOP(sub),
@@ -2987,7 +3001,7 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0x5f] = SSE_FOP(max),
 
     [0xc2] = SSE_FOP(cmpeq), /* sse_op_table4 */
-    [0xc6] = SSE_OP(shufps, shufpd, op1i, SSE_OPF_SHUF),
+    [0xc6] = SSE_OP(shufps, shufpd, op2i, SSE_OPF_SHUF),
 
     /* SSSE3, SSE4, MOVBE, CRC32, BMI1, BMI2, ADX.  */
     [0x38] = SSE_SPECIAL,
@@ -3006,13 +3020,13 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0x69] = MMX_OP(punpckhwd),
     [0x6a] = MMX_OP(punpckhdq),
     [0x6b] = MMX_OP(packssdw),
-    [0x6c] = OP(op1, SSE_OPF_MMX,
+    [0x6c] = OP(op2, SSE_OPF_MMX,
                 NULL, gen_helper_punpcklqdq_xmm, NULL, NULL),
-    [0x6d] = OP(op1, SSE_OPF_MMX,
+    [0x6d] = OP(op2, SSE_OPF_MMX,
                 NULL, gen_helper_punpckhqdq_xmm, NULL, NULL),
     [0x6e] = SSE_SPECIAL, /* movd mm, ea */
     [0x6f] = SSE_SPECIAL, /* movq, movdqa, , movdqu */
-    [0x70] = OP(op1i, SSE_OPF_SHUF | SSE_OPF_MMX,
+    [0x70] = OP(op1i, SSE_OPF_SHUF | SSE_OPF_MMX | SSE_OPF_V0,
             gen_helper_pshufw_mmx, gen_helper_pshufd_xmm,
             gen_helper_pshufhw_xmm, gen_helper_pshuflw_xmm),
     [0x71] = SSE_SPECIAL, /* shiftw */
@@ -3023,17 +3037,17 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0x76] = MMX_OP(pcmpeql),
     [0x77] = SSE_SPECIAL, /* emms */
     [0x78] = SSE_SPECIAL, /* extrq_i, insertq_i (sse4a) */
-    [0x79] = OP(op1, 0,
+    [0x79] = OP(op1, SSE_OPF_V0,
             NULL, gen_helper_extrq_r, NULL, gen_helper_insertq_r),
-    [0x7c] = OP(op1, 0,
+    [0x7c] = OP(op2, 0,
                 NULL, gen_helper_haddpd_xmm, NULL, gen_helper_haddps_xmm),
-    [0x7d] = OP(op1, 0,
+    [0x7d] = OP(op2, 0,
                 NULL, gen_helper_hsubpd_xmm, NULL, gen_helper_hsubps_xmm),
     [0x7e] = SSE_SPECIAL, /* movd, movd, , movq */
     [0x7f] = SSE_SPECIAL, /* movq, movdqa, movdqu */
     [0xc4] = SSE_SPECIAL, /* pinsrw */
     [0xc5] = SSE_SPECIAL, /* pextrw */
-    [0xd0] = OP(op1, 0,
+    [0xd0] = OP(op2, 0,
                 NULL, gen_helper_addsubpd_xmm, NULL, gen_helper_addsubps_xmm),
     [0xd1] = MMX_OP(psrlw),
     [0xd2] = MMX_OP(psrld),
@@ -3056,7 +3070,7 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0xe3] = MMX_OP(pavgw),
     [0xe4] = MMX_OP(pmulhuw),
     [0xe5] = MMX_OP(pmulhw),
-    [0xe6] = OP(op1, 0,
+    [0xe6] = OP(op1, SSE_OPF_V0,
             NULL, gen_helper_cvttpd2dq_xmm,
             gen_helper_cvtdq2pd_xmm, gen_helper_cvtpd2dq_xmm),
     [0xe7] = SSE_SPECIAL,  /* movntq, movntq */
@@ -3075,7 +3089,7 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0xf4] = MMX_OP(pmuludq),
     [0xf5] = MMX_OP(pmaddwd),
     [0xf6] = MMX_OP(psadbw),
-    [0xf7] = OP(op1t, SSE_OPF_MMX,
+    [0xf7] = OP(op1t, SSE_OPF_MMX | SSE_OPF_V0,
                 gen_helper_maskmov_mmx, gen_helper_maskmov_xmm, NULL, NULL),
     [0xf8] = MMX_OP(psubb),
     [0xf9] = MMX_OP(psubw),
@@ -3093,7 +3107,7 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
 
 #define MMX_OP2(x) { gen_helper_ ## x ## _mmx, gen_helper_ ## x ## _xmm }
 
-static const SSEFunc_0_epp sse_op_table2[3 * 8][2] = {
+static const SSEFunc_0_eppp sse_op_table2[3 * 8][2] = {
     [0 + 2] = MMX_OP2(psrlw),
-static const SSEFunc_0_epp sse_op_table4[8][4] = {
+static const SSEFunc_0_eppp sse_op_table4[8][4] = {
     SSE_CMP(cmpeq),
     SSE_CMP(cmplt),
     SSE_CMP(cmple),
@@ -3149,6 +3163,11 @@ static const SSEFunc_0_epp sse_op_table4[8][4] = {
 };
 #undef SSE_CMP
 
+static void gen_helper_pavgusb(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b)
+{
+    gen_helper_pavgb_mmx(env, reg_a, reg_a, reg_b);
+}
+
 static const SSEFunc_0_epp sse_op_table5[256] = {
     [0x0c] = gen_helper_pi2fw,
     [0x0d] = gen_helper_pi2fd,
@@ -3173,7 +3192,7 @@ static const SSEFunc_0_epp sse_op_table5[256] = {
     [0xb6] = gen_helper_movq, /* pfrcpit2 */
     [0xb7] = gen_helper_pmulhrw_mmx,
     [0xbb] = gen_helper_pswapd,
-    [0xbf] = gen_helper_pavgb_mmx,
+    [0xbf] = gen_helper_pavgusb,
 };
 
 struct SSEOpHelper_table6 {
@@ -3185,6 +3204,8 @@ struct SSEOpHelper_table6 {
 struct SSEOpHelper_table7 {
     union {
         SSEFunc_0_eppi op1;
+        SSEFunc_0_epppi op2;
+        SSEFunc_0_epppp op3;
     } fn[2];
     uint32_t ext_mask;
     int flags;
@@ -3196,15 +3217,15 @@ struct SSEOpHelper_table7 {
     {{{.op = mmx_name}, {.op = gen_helper_ ## name ## _xmm} }, \
         CPUID_EXT_ ## ext, flags}
 #define BINARY_OP_MMX(name, ext) \
-    OP(name, op1, SSE_OPF_MMX, ext, gen_helper_ ## name ## _mmx)
+    OP(name, op2, SSE_OPF_MMX, ext, gen_helper_ ## name ## _mmx)
 #define BINARY_OP(name, ext, flags) \
-    OP(name, op1, flags, ext, NULL)
+    OP(name, op2, flags, ext, NULL)
 #define UNARY_OP_MMX(name, ext) \
-    OP(name, op1, SSE_OPF_MMX, ext, gen_helper_ ## name ## _mmx)
+    OP(name, op1, SSE_OPF_V0 | SSE_OPF_MMX, ext, gen_helper_ ## name ## _mmx)
 #define UNARY_OP(name, ext, flags) \
-    OP(name, op1, flags, ext, NULL)
-#define BLENDV_OP(name, ext, flags) OP(name, op1, 0, ext, NULL)
-#define CMP_OP(name, ext) OP(name, op1, SSE_OPF_CMP, ext, NULL)
+    OP(name, op1, SSE_OPF_V0 | flags, ext, NULL)
+#define BLENDV_OP(name, ext, flags) OP(name, op3, SSE_OPF_BLENDV, ext, NULL)
+#define CMP_OP(name, ext) OP(name, op1, SSE_OPF_CMP | SSE_OPF_V0, ext, NULL)
 #define SPECIAL_OP(ext) OP(special, op1, SSE_OPF_SPECIAL, ext, NULL)
 
 /* prefix [66] 0f 38 */
@@ -3748,7 +3769,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                 op1_offset = offsetof(CPUX86State,mmx_t0);
             }
             assert(b1 < 2);
-            SSEFunc_0_epp fn = sse_op_table2[((b - 1) & 3) * 8 +
+            SSEFunc_0_eppp fn = sse_op_table2[((b - 1) & 3) * 8 +
                                        (((modrm >> 3)) & 7)][b1];
             if (!fn) {
                 goto unknown_op;
@@ -3761,8 +3782,9 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                 op2_offset = offsetof(CPUX86State,fpregs[rm].mmx);
             }
             tcg_gen_addi_ptr(s->ptr0, cpu_env, op2_offset);
-            tcg_gen_addi_ptr(s->ptr1, cpu_env, op1_offset);
-            fn(cpu_env, s->ptr0, s->ptr1);
+            tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
+            tcg_gen_addi_ptr(s->ptr2, cpu_env, op1_offset);
+            fn(cpu_env, s->ptr0, s->ptr1, s->ptr2);
             break;
         case 0x050: /* movmskps */
             rm = (modrm & 7) | REX_B(s);
@@ -4030,7 +4052,21 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                 }
                 tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
                 tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
-                op6->fn[b1].op1(cpu_env, s->ptr0, s->ptr1);
+                if (op6->flags & SSE_OPF_V0) {
+                    op6->fn[b1].op1(cpu_env, s->ptr0, s->ptr1);
+                } else {
+                    tcg_gen_addi_ptr(s->ptr2, cpu_env, op1_offset);
+                    if (op6->flags & SSE_OPF_BLENDV) {
+                        TCGv_ptr mask = tcg_temp_new_ptr();
+                        tcg_gen_addi_ptr(mask, cpu_env, ZMM_OFFSET(0));
+                        op6->fn[b1].op3(cpu_env, s->ptr0, s->ptr2, s->ptr1,
+                                       mask);
+                        tcg_temp_free_ptr(mask);
+                    } else {
+                        SSEFunc_0_eppp fn = op6->fn[b1].op2;
+                        fn(cpu_env, s->ptr0, s->ptr2, s->ptr1);
+                    }
+                }
             } else {
                 CHECK_NO_VEX(s);
                 if ((op6->flags & SSE_OPF_MMX) == 0) {
@@ -4046,7 +4082,11 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                 }
                 tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
                 tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
-                op6->fn[0].op1(cpu_env, s->ptr0, s->ptr1);
+                if (op6->flags & SSE_OPF_V0) {
+                    op6->fn[0].op1(cpu_env, s->ptr0, s->ptr1);
+                } else {
+                    op6->fn[0].op2(cpu_env, s->ptr0, s->ptr0, s->ptr1);
+                }
             }
 
             if (op6->flags & SSE_OPF_CMP) {
@@ -4380,7 +4420,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                 /* We only actually have one MMX instruction (palignr) */
                 assert(b == 0x0f);
 
-                op7->fn[0].op1(cpu_env, s->ptr0, s->ptr1,
+                op7->fn[0].op2(cpu_env, s->ptr0, s->ptr0, s->ptr1,
                                tcg_const_i32(val));
                 break;
             }
@@ -4407,7 +4447,13 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
 
             tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
             tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
-            op7->fn[b1].op1(cpu_env, s->ptr0, s->ptr1, tcg_const_i32(val));
+            if (op7->flags & SSE_OPF_V0) {
+                op7->fn[b1].op1(cpu_env, s->ptr0, s->ptr1, tcg_const_i32(val));
+            } else {
+                tcg_gen_addi_ptr(s->ptr2, cpu_env, op1_offset);
+                op7->fn[b1].op2(cpu_env, s->ptr0, s->ptr2, s->ptr1,
+                               tcg_const_i32(val));
+            }
             if (op7->flags & SSE_OPF_CMP) {
                 set_cc_op(s, CC_OP_EFLAGS);
             }
@@ -4499,26 +4545,46 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                 return;
             }
         }
+
+
         tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
         tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
-        if (sse_op_flags & SSE_OPF_SHUF) {
-            val = x86_ldub_code(env, s);
-            sse_op_fn.op1i(s->ptr0, s->ptr1, tcg_const_i32(val));
-        } else if (b == 0xf7) {
-            /* maskmov : we must prepare A0 */
-            if (mod != 3) {
-                goto illegal_op;
+        if (sse_op_flags & SSE_OPF_V0) {
+            if (sse_op_flags & SSE_OPF_SHUF) {
+                val = x86_ldub_code(env, s);
+                sse_op_fn.op1i(s->ptr0, s->ptr1, tcg_const_i32(val));
+            } else if (b == 0xf7) {
+                /* maskmov : we must prepare A0 */
+                if (mod != 3) {
+                    goto illegal_op;
+                }
+                tcg_gen_mov_tl(s->A0, cpu_regs[R_EDI]);
+                gen_extu(s->aflag, s->A0);
+                gen_add_A0_ds_seg(s);
+
+                tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
+                tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
+                sse_op_fn.op1t(cpu_env, s->ptr0, s->ptr1, s->A0);
+                /* Does not write to the first operand */
+                return;
+            } else {
+                sse_op_fn.op1(cpu_env, s->ptr0, s->ptr1);
             }
-            tcg_gen_mov_tl(s->A0, cpu_regs[R_EDI]);
-            gen_extu(s->aflag, s->A0);
-            gen_add_A0_ds_seg(s);
-            sse_op_fn.op1t(cpu_env, s->ptr0, s->ptr1, s->A0);
-        } else if (b == 0xc2) {
-            /* compare insns, bits 7:3 (7:5 for AVX) are ignored */
-            val = x86_ldub_code(env, s) & 7;
-            sse_op_table4[val][b1](cpu_env, s->ptr0, s->ptr1);
         } else {
-            sse_op_fn.op1(cpu_env, s->ptr0, s->ptr1);
+            tcg_gen_addi_ptr(s->ptr2, cpu_env, op1_offset);
+            if (sse_op_flags & SSE_OPF_SHUF) {
+                val = x86_ldub_code(env, s);
+                sse_op_fn.op2i(s->ptr0, s->ptr2, s->ptr1,
+                                   tcg_const_i32(val));
+            } else {
+                SSEFunc_0_eppp fn = sse_op_fn.op2;
+                if (b == 0xc2) {
+                    /* compare insns */
+                    val = x86_ldub_code(env, s) & 7;
+                    fn = sse_op_table4[val][b1];
+                }
+                fn(cpu_env, s->ptr0, s->ptr2, s->ptr1);
+            }
         }
 
         if (sse_op_flags & SSE_OPF_CMP) {
@@ -8598,6 +8664,7 @@ static void i386_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cpu)
     dc->tmp4 = tcg_temp_new();
     dc->ptr0 = tcg_temp_new_ptr();
     dc->ptr1 = tcg_temp_new_ptr();
+    dc->ptr2 = tcg_temp_new_ptr();
     dc->cc_srcT = tcg_temp_local_new();
 }
 
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 16/37] target/i386: support operand merging in binary scalar helpers
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (14 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 15/37] target/i386: extend helpers to support VEX.V 3- and 4- operand encodings Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 11:11   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 17/37] target/i386: provide 3-operand versions of unary " Paolo Bonzini
                   ` (20 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

Compared to Paul's implementation, the new decoder will use a different approach
to implement AVX's merging of dst with src1 on scalar operations.  Adjust the
helpers to provide this functionality.
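
As an illustration of the merging semantics (a standalone sketch with
made-up types; the real helpers use ZMMReg, the ZMM_S/ZMM_L accessors
and softfloat arithmetic):

    /* Sketch only: dst takes the scalar result in lane 0 and src1's
     * remaining lanes, instead of preserving dst's own upper lanes. */
    typedef struct { float lane[8]; } Vec;

    static void addss_merge(Vec *d, const Vec *v, const Vec *s, int nlanes)
    {
        int i;
        d->lane[0] = v->lane[0] + s->lane[0];  /* the scalar operation */
        for (i = 1; i < nlanes; i++) {
            d->lane[i] = v->lane[i];           /* merge from src1 */
        }
    }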

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/ops_sse.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index fb8733f509..527da59299 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -557,12 +557,20 @@ void glue(helper_pshufhw, SUFFIX)(Reg *d, Reg *s, int order)
                                                                         \
     void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *v, Reg *s)\
     {                                                                   \
+        int i;                                                          \
         d->ZMM_S(0) = F(32, v->ZMM_S(0), s->ZMM_S(0));                  \
+        for (i = 1; i < 2 << SHIFT; i++) {                              \
+            d->ZMM_L(i) = v->ZMM_L(i);                                  \
+        }                                                               \
     }                                                                   \
                                                                         \
     void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *v, Reg *s)\
     {                                                                   \
+        int i;                                                          \
         d->ZMM_D(0) = F(64, v->ZMM_D(0), s->ZMM_D(0));                  \
+        for (i = 1; i < 1 << SHIFT; i++) {                              \
+            d->ZMM_Q(i) = v->ZMM_Q(i);                                  \
+        }                                                               \
     }
 
 #else
@@ -1027,12 +1035,20 @@ void glue(helper_addsubpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
     SSE_HELPER_CMP_P(name, F, C)                                            \
     void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *v, Reg *s)    \
     {                                                                       \
+        int i;                                                              \
         d->ZMM_L(0) = C(F(32, v->ZMM_S(0), s->ZMM_S(0))) ? -1 : 0;          \
+        for (i = 1; i < 2 << SHIFT; i++) {                                  \
+            d->ZMM_L(i) = v->ZMM_L(i);                                      \
+        }                                                                   \
     }                                                                       \
                                                                             \
     void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *v, Reg *s)    \
     {                                                                       \
+        int i;                                                              \
         d->ZMM_Q(0) = C(F(64, v->ZMM_D(0), s->ZMM_D(0))) ? -1 : 0;          \
+        for (i = 1; i < 1 << SHIFT; i++) {                                  \
+            d->ZMM_Q(i) = v->ZMM_Q(i);                                      \
+        }                                                                   \
     }
 
 #define FPU_EQ(x) (x == float_relation_equal)
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 17/37] target/i386: provide 3-operand versions of unary scalar helpers
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (15 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 16/37] target/i386: support operand merging in binary scalar helpers Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 11:14   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 18/37] target/i386: implement additional AVX comparison operators Paolo Bonzini
                   ` (19 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

Compared to Paul's implementation, the new decoder will use a different approach
to implement AVX's merging of dst with src1 on scalar operations.  Adjust the
old SSE decoder to be compatible with new-style helpers.

The affected instructions are CVTSx2Sx, ROUNDSx, RSQRTSx, SQRTSx, RCPSx.
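
With merging in the helpers, the old decoder only needs to pass the
destination register for both d and v: copying d's upper lanes onto
themselves is a no-op, which is exactly the legacy SSE behaviour of
leaving them unchanged.  A sketch of the resulting call, using the
pointer temporaries that gen_sse already sets up:

    /* d == v, so the merge loop rewrites d's upper lanes with their
     * own values and the helper degrades to the 2-operand form. */
    fn(cpu_env, s->ptr0, s->ptr0, s->ptr1);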

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/ops_sse.h        | 48 ++++++++++++++++++++++++++++++------
 target/i386/ops_sse_header.h | 16 ++++++------
 target/i386/tcg/translate.c  | 22 ++++++++++-------
 3 files changed, 61 insertions(+), 25 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 527da59299..0d56f0949b 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -617,14 +617,22 @@ void glue(helper_sqrtpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 }
 
 #if SHIFT == 1
-void helper_sqrtss(CPUX86State *env, Reg *d, Reg *s)
+void helper_sqrtss(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
+    int i;
     d->ZMM_S(0) = float32_sqrt(s->ZMM_S(0), &env->sse_status);
+    for (i = 1; i < 2 << SHIFT; i++) {
+        d->ZMM_L(i) = v->ZMM_L(i);
+    }
 }
 
-void helper_sqrtsd(CPUX86State *env, Reg *d, Reg *s)
+void helper_sqrtsd(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
+    int i;
     d->ZMM_D(0) = float64_sqrt(s->ZMM_D(0), &env->sse_status);
+    for (i = 1; i < 1 << SHIFT; i++) {
+        d->ZMM_Q(i) = v->ZMM_Q(i);
+    }
 }
 #endif
 
@@ -649,14 +657,22 @@ void glue(helper_cvtpd2ps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 }
 
 #if SHIFT == 1
-void helper_cvtss2sd(CPUX86State *env, Reg *d, Reg *s)
+void helper_cvtss2sd(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
+    int i;
     d->ZMM_D(0) = float32_to_float64(s->ZMM_S(0), &env->sse_status);
+    for (i = 1; i < 1 << SHIFT; i++) {
+        d->ZMM_Q(i) = v->ZMM_Q(i);
+    }
 }
 
-void helper_cvtsd2ss(CPUX86State *env, Reg *d, Reg *s)
+void helper_cvtsd2ss(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 {
+    int i;
     d->ZMM_S(0) = float64_to_float32(s->ZMM_D(0), &env->sse_status);
+    for (i = 1; i < 2 << SHIFT; i++) {
+        d->ZMM_L(i) = v->ZMM_L(i);
+    }
 }
 #endif
 
@@ -876,13 +892,17 @@ void glue(helper_rsqrtps, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
 }
 
 #if SHIFT == 1
-void helper_rsqrtss(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void helper_rsqrtss(CPUX86State *env, ZMMReg *d, ZMMReg *v, ZMMReg *s)
 {
     uint8_t old_flags = get_float_exception_flags(&env->sse_status);
+    int i;
     d->ZMM_S(0) = float32_div(float32_one,
                               float32_sqrt(s->ZMM_S(0), &env->sse_status),
                               &env->sse_status);
     set_float_exception_flags(old_flags, &env->sse_status);
+    for (i = 1; i < 2 << SHIFT; i++) {
+        d->ZMM_L(i) = v->ZMM_L(i);
+    }
 }
 #endif
 
@@ -897,10 +917,14 @@ void glue(helper_rcpps, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
 }
 
 #if SHIFT == 1
-void helper_rcpss(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void helper_rcpss(CPUX86State *env, ZMMReg *d, ZMMReg *v, ZMMReg *s)
 {
     uint8_t old_flags = get_float_exception_flags(&env->sse_status);
+    int i;
     d->ZMM_S(0) = float32_div(float32_one, s->ZMM_S(0), &env->sse_status);
+    for (i = 1; i < 2 << SHIFT; i++) {
+        d->ZMM_L(i) = v->ZMM_L(i);
+    }
     set_float_exception_flags(old_flags, &env->sse_status);
 }
 #endif
@@ -1798,11 +1822,12 @@ void glue(helper_roundpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
 }
 
 #if SHIFT == 1
-void glue(helper_roundss, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
+void glue(helper_roundss, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
                                   uint32_t mode)
 {
     uint8_t old_flags = get_float_exception_flags(&env->sse_status);
     signed char prev_rounding_mode;
+    int i;
 
     prev_rounding_mode = env->sse_status.float_rounding_mode;
     if (!(mode & (1 << 2))) {
@@ -1823,6 +1848,9 @@ void glue(helper_roundss, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
     }
 
     d->ZMM_S(0) = float32_round_to_int(s->ZMM_S(0), &env->sse_status);
+    for (i = 1; i < 2 << SHIFT; i++) {
+        d->ZMM_L(i) = v->ZMM_L(i);
+    }
 
     if (mode & (1 << 3) && !(old_flags & float_flag_inexact)) {
         set_float_exception_flags(get_float_exception_flags(&env->sse_status) &
@@ -1832,11 +1860,12 @@ void glue(helper_roundss, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
     env->sse_status.float_rounding_mode = prev_rounding_mode;
 }
 
-void glue(helper_roundsd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
+void glue(helper_roundsd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
                                   uint32_t mode)
 {
     uint8_t old_flags = get_float_exception_flags(&env->sse_status);
     signed char prev_rounding_mode;
+    int i;
 
     prev_rounding_mode = env->sse_status.float_rounding_mode;
     if (!(mode & (1 << 2))) {
@@ -1857,6 +1886,9 @@ void glue(helper_roundsd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
     }
 
     d->ZMM_D(0) = float64_round_to_int(s->ZMM_D(0), &env->sse_status);
+    for (i = 1; i < 1 << SHIFT; i++) {
+        d->ZMM_Q(i) = v->ZMM_Q(i);
+    }
 
     if (mode & (1 << 3) && !(old_flags & float_flag_inexact)) {
         set_float_exception_flags(get_float_exception_flags(&env->sse_status) &
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 21fed7fa05..5d17146049 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -148,8 +148,8 @@ DEF_HELPER_3(glue(pshufhw, SUFFIX), void, Reg, Reg, int)
     DEF_HELPER_4(name ## sd, void, env, Reg, Reg, Reg)
 #define SSE_HELPER_S3(name)                                             \
     SSE_HELPER_P3(name)                                                 \
-    DEF_HELPER_3(name ## ss, void, env, Reg, Reg)                       \
-    DEF_HELPER_3(name ## sd, void, env, Reg, Reg)
+    DEF_HELPER_4(name ## ss, void, env, Reg, Reg, Reg)                  \
+    DEF_HELPER_4(name ## sd, void, env, Reg, Reg, Reg)
 #else
 #define SSE_HELPER_S4(name, ...) SSE_HELPER_P4(name)
 #define SSE_HELPER_S3(name, ...) SSE_HELPER_P3(name)
@@ -179,8 +179,8 @@ DEF_HELPER_3(glue(cvttps2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(glue(cvttpd2dq, SUFFIX), void, env, ZMMReg, ZMMReg)
 
 #if SHIFT == 1
-DEF_HELPER_3(cvtss2sd, void, env, Reg, Reg)
-DEF_HELPER_3(cvtsd2ss, void, env, Reg, Reg)
+DEF_HELPER_4(cvtss2sd, void, env, Reg, Reg, Reg)
+DEF_HELPER_4(cvtsd2ss, void, env, Reg, Reg, Reg)
 DEF_HELPER_3(cvtpi2ps, void, env, ZMMReg, MMXReg)
 DEF_HELPER_3(cvtpi2pd, void, env, ZMMReg, MMXReg)
 DEF_HELPER_3(cvtsi2ss, void, env, ZMMReg, i32)
@@ -214,8 +214,8 @@ DEF_HELPER_3(glue(rsqrtps, SUFFIX), void, env, ZMMReg, ZMMReg)
 DEF_HELPER_3(glue(rcpps, SUFFIX), void, env, ZMMReg, ZMMReg)
 
 #if SHIFT == 1
-DEF_HELPER_3(rsqrtss, void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(rcpss, void, env, ZMMReg, ZMMReg)
+DEF_HELPER_4(rsqrtss, void, env, ZMMReg, ZMMReg, ZMMReg)
+DEF_HELPER_4(rcpss, void, env, ZMMReg, ZMMReg, ZMMReg)
 DEF_HELPER_3(extrq_r, void, env, ZMMReg, ZMMReg)
 DEF_HELPER_4(extrq_i, void, env, ZMMReg, int, int)
 DEF_HELPER_3(insertq_r, void, env, ZMMReg, ZMMReg)
@@ -342,8 +342,8 @@ DEF_HELPER_3(glue(phminposuw, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_4(glue(roundps, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(roundpd, SUFFIX), void, env, Reg, Reg, i32)
 #if SHIFT == 1
-DEF_HELPER_4(roundss_xmm, void, env, Reg, Reg, i32)
-DEF_HELPER_4(roundsd_xmm, void, env, Reg, Reg, i32)
+DEF_HELPER_5(roundss_xmm, void, env, Reg, Reg, Reg, i32)
+DEF_HELPER_5(roundsd_xmm, void, env, Reg, Reg, Reg, i32)
 #endif
 DEF_HELPER_5(glue(blendps, SUFFIX), void, env, Reg, Reg, Reg, i32)
 DEF_HELPER_5(glue(blendpd, SUFFIX), void, env, Reg, Reg, Reg, i32)
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index e996aab541..e147a95c5f 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2934,6 +2934,9 @@ static bool first = true; static unsigned long limit;
 #define SSE_OP(sname, dname, op, flags) OP(op, flags, \
         gen_helper_##sname##_xmm, gen_helper_##dname##_xmm, NULL, NULL)
 
+#define SSE_OP_UNARY(a, b, c, d)       \
+    {SSE_OPF_SCALAR | SSE_OPF_V0, {{.op1 = a}, {.op1 = b}, {.op2 = c}, {.op2 = d} } }
+
 typedef union SSEFuncs {
     SSEFunc_0_epp op1;
     SSEFunc_0_ppi op1i;
@@ -2976,12 +2979,12 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0x2f] = OP(op1, SSE_OPF_CMP | SSE_OPF_SCALAR | SSE_OPF_V0,
             gen_helper_comiss, gen_helper_comisd, NULL, NULL),
     [0x50] = SSE_SPECIAL, /* movmskps, movmskpd */
-    [0x51] = OP(op1, SSE_OPF_SCALAR | SSE_OPF_V0,
+    [0x51] = SSE_OP_UNARY(
                 gen_helper_sqrtps_xmm, gen_helper_sqrtpd_xmm,
                 gen_helper_sqrtss, gen_helper_sqrtsd),
-    [0x52] = OP(op1, SSE_OPF_SCALAR | SSE_OPF_V0,
+    [0x52] = SSE_OP_UNARY(
                 gen_helper_rsqrtps_xmm, NULL, gen_helper_rsqrtss, NULL),
-    [0x53] = OP(op1, SSE_OPF_SCALAR | SSE_OPF_V0,
+    [0x53] = SSE_OP_UNARY(
                 gen_helper_rcpps_xmm, NULL, gen_helper_rcpss, NULL),
     [0x54] = SSE_OP(pand, pand, op2, 0), /* andps, andpd */
     [0x55] = SSE_OP(pandn, pandn, op2, 0), /* andnps, andnpd */
@@ -2989,9 +2992,9 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] = {
     [0x57] = SSE_OP(pxor, pxor, op2, 0), /* xorps, xorpd */
     [0x58] = SSE_FOP(add),
     [0x59] = SSE_FOP(mul),
-    [0x5a] = OP(op1, SSE_OPF_SCALAR | SSE_OPF_V0,
-                gen_helper_cvtps2pd_xmm, gen_helper_cvtpd2ps_xmm,
-                gen_helper_cvtss2sd, gen_helper_cvtsd2ss),
+    [0x5a] = SSE_OP_UNARY(
+                 gen_helper_cvtps2pd_xmm, gen_helper_cvtpd2ps_xmm,
+                 gen_helper_cvtss2sd, gen_helper_cvtsd2ss),
     [0x5b] = OP(op1, SSE_OPF_V0,
                 gen_helper_cvtdq2ps_xmm, gen_helper_cvtps2dq_xmm,
                 gen_helper_cvttps2dq_xmm, NULL),
@@ -3287,8 +3290,8 @@ static const struct SSEOpHelper_table6 sse_op_table6[256] = {
 static const struct SSEOpHelper_table7 sse_op_table7[256] = {
     [0x08] = UNARY_OP(roundps, SSE41, 0),
     [0x09] = UNARY_OP(roundpd, SSE41, 0),
-    [0x0a] = UNARY_OP(roundss, SSE41, SSE_OPF_SCALAR),
-    [0x0b] = UNARY_OP(roundsd, SSE41, SSE_OPF_SCALAR),
+    [0x0a] = BINARY_OP(roundss, SSE41, SSE_OPF_SCALAR),
+    [0x0b] = BINARY_OP(roundsd, SSE41, SSE_OPF_SCALAR),
     [0x0c] = BINARY_OP(blendps, SSE41, 0),
     [0x0d] = BINARY_OP(blendpd, SSE41, 0),
     [0x0e] = BINARY_OP(pblendw, SSE41, SSE_OPF_MMX),
@@ -4549,7 +4552,8 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
 
         tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
         tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
-        if (sse_op_flags & SSE_OPF_V0) {
+        if ((sse_op_flags & SSE_OPF_V0) &&
+            !((sse_op_flags & SSE_OPF_SCALAR) && b1 >= 2)) {
             if (sse_op_flags & SSE_OPF_SHUF) {
                 val = x86_ldub_code(env, s);
                 sse_op_fn.op1i(s->ptr0, s->ptr1, tcg_const_i32(val));
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 18/37] target/i386: implement additional AVX comparison operators
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (16 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 17/37] target/i386: provide 3-operand versions of unary " Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 11:19   ` Richard Henderson
  2022-09-11 23:03 ` [PATCH 19/37] target/i386: Introduce 256-bit vector helpers Paolo Bonzini
                   ` (18 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

The new implementation of SSE will cover AVX from the get go, so include
the 24 extra comparison operators that are only available with the VEX
prefix.
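
For reference, the predicate is selected by the instruction's imm8:
legacy SSE defines only values 0-7, while VEX-encoded compares honour
all five low bits.  A sketch of the mapping, with predicate names as
in the Intel SDM (vex_encoded below is illustrative, not a real field):

    int predicate = imm8 & (vex_encoded ? 0x1f : 7);
    /*
     * 0x00-0x07: EQ_OQ, LT_OS, LE_OS, UNORD_Q, NEQ_UQ, NLT_US, NLE_US, ORD_Q
     * 0x08-0x0f: EQ_UQ, NGE_US, NGT_US, FALSE_OQ, NEQ_OQ, GE_OS, GT_OS, TRUE_UQ
     * 0x10-0x17: EQ_OS, LT_OQ, LE_OQ, UNORD_S, NEQ_US, NLT_UQ, NLE_UQ, ORD_S
     * 0x18-0x1f: EQ_US, NGE_UQ, NGT_UQ, FALSE_OS, NEQ_OS, GE_OQ, GT_OQ, TRUE_US
     */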

Based on a patch by Paul Brook <paul@nowt.org>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/ops_sse.h        | 38 ++++++++++++++++++++++++++++++++++++
 target/i386/ops_sse_header.h | 27 +++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 0d56f0949b..93cee330d2 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -1075,10 +1075,21 @@ void glue(helper_addsubpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
         }                                                                   \
     }
 
+static inline bool FPU_EQU(FloatRelation x)
+{
+    return (x == float_relation_equal || x == float_relation_unordered);
+}
+static inline bool FPU_GE(FloatRelation x)
+{
+    return (x == float_relation_equal || x == float_relation_greater);
+}
 #define FPU_EQ(x) (x == float_relation_equal)
 #define FPU_LT(x) (x == float_relation_less)
 #define FPU_LE(x) (x <= float_relation_equal)
+#define FPU_GT(x) (x == float_relation_greater)
 #define FPU_UNORD(x) (x == float_relation_unordered)
+/* We must make sure we evaluate the argument in case it is a signalling NaN */
+#define FPU_FALSE(x) (x == float_relation_equal && 0)
 
 #define FPU_CMPQ(size, a, b) \
     float ## size ## _compare_quiet(a, b, &env->sse_status)
@@ -1098,6 +1109,33 @@ SSE_HELPER_CMP(cmpnlt, FPU_CMPS, !FPU_LT)
 SSE_HELPER_CMP(cmpnle, FPU_CMPS, !FPU_LE)
 SSE_HELPER_CMP(cmpord, FPU_CMPQ, !FPU_UNORD)
 
+SSE_HELPER_CMP(cmpequ, FPU_CMPQ, FPU_EQU)
+SSE_HELPER_CMP(cmpnge, FPU_CMPS, !FPU_GE)
+SSE_HELPER_CMP(cmpngt, FPU_CMPS, !FPU_GT)
+SSE_HELPER_CMP(cmpfalse, FPU_CMPQ,  FPU_FALSE)
+SSE_HELPER_CMP(cmpnequ, FPU_CMPQ, !FPU_EQU)
+SSE_HELPER_CMP(cmpge, FPU_CMPS, FPU_GE)
+SSE_HELPER_CMP(cmpgt, FPU_CMPS, FPU_GT)
+SSE_HELPER_CMP(cmptrue, FPU_CMPQ,  !FPU_FALSE)
+
+SSE_HELPER_CMP(cmpeqs, FPU_CMPS, FPU_EQ)
+SSE_HELPER_CMP(cmpltq, FPU_CMPQ, FPU_LT)
+SSE_HELPER_CMP(cmpleq, FPU_CMPQ, FPU_LE)
+SSE_HELPER_CMP(cmpunords, FPU_CMPS,  FPU_UNORD)
+SSE_HELPER_CMP(cmpneqq, FPU_CMPS, !FPU_EQ)
+SSE_HELPER_CMP(cmpnltq, FPU_CMPQ, !FPU_LT)
+SSE_HELPER_CMP(cmpnleq, FPU_CMPQ, !FPU_LE)
+SSE_HELPER_CMP(cmpords, FPU_CMPS, !FPU_UNORD)
+
+SSE_HELPER_CMP(cmpequs, FPU_CMPS, FPU_EQU)
+SSE_HELPER_CMP(cmpngeq, FPU_CMPQ, !FPU_GE)
+SSE_HELPER_CMP(cmpngtq, FPU_CMPQ, !FPU_GT)
+SSE_HELPER_CMP(cmpfalses, FPU_CMPS,  FPU_FALSE)
+SSE_HELPER_CMP(cmpnequs, FPU_CMPS, !FPU_EQU)
+SSE_HELPER_CMP(cmpgeq, FPU_CMPQ, FPU_GE)
+SSE_HELPER_CMP(cmpgtq, FPU_CMPQ, FPU_GT)
+SSE_HELPER_CMP(cmptrues, FPU_CMPS,  !FPU_FALSE)
+
 #undef SSE_HELPER_CMP
 
 #if SHIFT == 1
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 5d17146049..4bef536edb 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -237,6 +237,33 @@ SSE_HELPER_CMP(cmpnlt, FPU_CMPS, !FPU_LT)
 SSE_HELPER_CMP(cmpnle, FPU_CMPS, !FPU_LE)
 SSE_HELPER_CMP(cmpord, FPU_CMPQ, !FPU_UNORD)
 
+SSE_HELPER_CMP(cmpequ, FPU_CMPQ, FPU_EQU)
+SSE_HELPER_CMP(cmpnge, FPU_CMPS, !FPU_GE)
+SSE_HELPER_CMP(cmpngt, FPU_CMPS, !FPU_GT)
+SSE_HELPER_CMP(cmpfalse, FPU_CMPQ,  FPU_FALSE)
+SSE_HELPER_CMP(cmpnequ, FPU_CMPQ, !FPU_EQU)
+SSE_HELPER_CMP(cmpge, FPU_CMPS, FPU_GE)
+SSE_HELPER_CMP(cmpgt, FPU_CMPS, FPU_GT)
+SSE_HELPER_CMP(cmptrue, FPU_CMPQ,  !FPU_FALSE)
+
+SSE_HELPER_CMP(cmpeqs, FPU_CMPS, FPU_EQ)
+SSE_HELPER_CMP(cmpltq, FPU_CMPQ, FPU_LT)
+SSE_HELPER_CMP(cmpleq, FPU_CMPQ, FPU_LE)
+SSE_HELPER_CMP(cmpunords, FPU_CMPS,  FPU_UNORD)
+SSE_HELPER_CMP(cmpneqq, FPU_CMPS, !FPU_EQ)
+SSE_HELPER_CMP(cmpnltq, FPU_CMPQ, !FPU_LT)
+SSE_HELPER_CMP(cmpnleq, FPU_CMPQ, !FPU_LE)
+SSE_HELPER_CMP(cmpords, FPU_CMPS, !FPU_UNORD)
+
+SSE_HELPER_CMP(cmpequs, FPU_CMPS, FPU_EQU)
+SSE_HELPER_CMP(cmpngeq, FPU_CMPQ, !FPU_GE)
+SSE_HELPER_CMP(cmpngtq, FPU_CMPQ, !FPU_GT)
+SSE_HELPER_CMP(cmpfalses, FPU_CMPS,  FPU_FALSE)
+SSE_HELPER_CMP(cmpnequs, FPU_CMPS, !FPU_EQU)
+SSE_HELPER_CMP(cmpgeq, FPU_CMPQ, FPU_GE)
+SSE_HELPER_CMP(cmpgtq, FPU_CMPQ, FPU_GT)
+SSE_HELPER_CMP(cmptrues, FPU_CMPS,  !FPU_FALSE)
+
 #if SHIFT == 1
 DEF_HELPER_3(ucomiss, void, env, Reg, Reg)
 DEF_HELPER_3(comiss, void, env, Reg, Reg)
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 19/37] target/i386: Introduce 256-bit vector helpers
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (17 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 18/37] target/i386: implement additional AVX comparison operators Paolo Bonzini
@ 2022-09-11 23:03 ` Paolo Bonzini
  2022-09-12 11:19   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX Paolo Bonzini
                   ` (17 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:03 UTC (permalink / raw)
  To: qemu-devel

The new implementation of SSE will cover AVX from the get go, because
all the work for the helper functions is already done.  We just need to
build them.
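
The SHIFT parameter is what lets a single helper body expand to all
three register widths; summarizing the convention that ops_sse.h
already uses (not new code):

    /*
     * SHIFT 0 -> MMXReg,  64 bits,     SUFFIX _mmx
     * SHIFT 1 -> ZMMReg, low 128 bits, SUFFIX _xmm
     * SHIFT 2 -> ZMMReg, low 256 bits, SUFFIX _ymm
     *
     * so loop bounds scale automatically:
     */
    for (i = 0; i < 2 << SHIFT; i++) {   /* 2, 4 or 8 32-bit lanes */
        d->ZMM_L(i) = v->ZMM_L(i);
    }
    for (i = 0; i < 1 << SHIFT; i++) {   /* 1, 2 or 4 64-bit lanes */
        d->ZMM_Q(i) = v->ZMM_Q(i);
    }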

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/helper.h         | 2 ++
 target/i386/ops_sse.h        | 5 +++++
 target/i386/ops_sse_header.h | 4 ++++
 target/i386/tcg/fpu_helper.c | 3 +++
 4 files changed, 14 insertions(+)

diff --git a/target/i386/helper.h b/target/i386/helper.h
index ac3b4d1ee3..3da5df98b9 100644
--- a/target/i386/helper.h
+++ b/target/i386/helper.h
@@ -218,6 +218,8 @@ DEF_HELPER_3(movq, void, env, ptr, ptr)
 #include "ops_sse_header.h"
 #define SHIFT 1
 #include "ops_sse_header.h"
+#define SHIFT 2
+#include "ops_sse_header.h"
 
 DEF_HELPER_3(rclb, tl, env, tl, tl)
 DEF_HELPER_3(rclw, tl, env, tl, tl)
diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 93cee330d2..4f72164c0f 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -35,7 +35,11 @@
 #define W(n) ZMM_W(n)
 #define L(n) ZMM_L(n)
 #define Q(n) ZMM_Q(n)
+#if SHIFT == 1
 #define SUFFIX _xmm
+#else
+#define SUFFIX _ymm
+#endif
 #endif
 
 #define LANE_WIDTH (SHIFT ? 16 : 8)
@@ -2379,6 +2383,7 @@ void glue(helper_aeskeygenassist, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
 
 #undef SSE_HELPER_S
 
+#undef LANE_WIDTH
 #undef SHIFT
 #undef XMM_ONLY
 #undef Reg
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 4bef536edb..4041816945 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -21,7 +21,11 @@
 #define SUFFIX _mmx
 #else
 #define Reg ZMMReg
+#if SHIFT == 1
 #define SUFFIX _xmm
+#else
+#define SUFFIX _ymm
+#endif
 #endif
 
 #define dh_alias_Reg ptr
diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index 48bf0c5cf8..819e920ec6 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -3053,3 +3053,6 @@ void helper_movq(CPUX86State *env, void *d, void *s)
 
 #define SHIFT 1
 #include "ops_sse.h"
+
+#define SHIFT 2
+#include "ops_sse.h"
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (18 preceding siblings ...)
  2022-09-11 23:03 ` [PATCH 19/37] target/i386: Introduce 256-bit vector helpers Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-12 11:41   ` Richard Henderson
  2022-09-12 13:01   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 21/37] target/i386: reimplement 0x0f 0xd8-0xdf, 0xe8-0xef, 0xf8-0xff, " Paolo Bonzini
                   ` (16 subsequent siblings)
  36 siblings, 2 replies; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

These are both MMX and SSE/AVX instructions, except for vmovdqu.  In both
cases the inputs and output are in s->ptr{0,1,2}, so the only difference
between MMX, SSE, and AVX is which helper to call.

PCMPGT, MOVD and MOVQ are implemented using gvec.

The amount of macro magic for generating functions is kept to a minimum.
In particular, the gvec cases are easy enough and have no duplication within
each function, so they are spelled out one by one.
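
For instance, the three PCMPGT emitters below differ only in the element
size, so they could just as well come out of a macro in the style of
BINARY_INT_MMX; a sketch of that alternative (the patch keeps them
spelled out):

    #define PCMPGT_OP(uname, vece)                                             \
    static void gen_##uname(DisasContext *s, CPUX86State *env,                 \
                            X86DecodedInsn *decode)                            \
    {                                                                          \
        int vec_len = sse_vec_len(s, decode);                                  \
        tcg_gen_gvec_cmp(TCG_COND_GT, vece,                                    \
                         decode->op[0].offset, decode->op[1].offset,           \
                         decode->op[2].offset, vec_len, vec_len);              \
    }
    PCMPGT_OP(PCMPGTB, MO_8)
    PCMPGT_OP(PCMPGTW, MO_16)
    PCMPGT_OP(PCMPGTD, MO_32)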

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc |  35 ++++++++
 target/i386/tcg/emit.c.inc       | 148 +++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |   3 +-
 3 files changed, 185 insertions(+), 1 deletion(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index b31daecb90..f20587c096 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -142,6 +142,23 @@ static void decode_group17(DisasContext *s, CPUX86State *env, X86OpEntry *entry,
     entry->gen = group17_gen[op];
 }
 
+static void decode_0F6F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    if (s->prefix & PREFIX_REPNZ) {
+        entry->gen = NULL;
+    } else if (s->prefix & PREFIX_REPZ) {
+        /* movdqu */
+        entry->gen = gen_MOVDQ;
+        entry->vex_class = 4;
+        entry->vex_special = X86_VEX_SSEUnaligned;
+    } else {
+        /* MMX movq, movdqa */
+        entry->gen = gen_MOVDQ;
+        entry->vex_class = 1;
+        entry->special = X86_SPECIAL_MMX;
+    }
+}
+
 static const X86OpEntry opcodes_0F38_00toEF[240] = {
 };
 
@@ -227,8 +244,26 @@ static void decode_0F3A(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
 }
 
 static const X86OpEntry opcodes_0F[256] = {
+    [0x60] = X86_OP_ENTRY3(PUNPCKLBW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x61] = X86_OP_ENTRY3(PUNPCKLWD,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x62] = X86_OP_ENTRY3(PUNPCKLDQ,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x63] = X86_OP_ENTRY3(PACKSSWB,   V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x64] = X86_OP_ENTRY3(PCMPGTB,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x65] = X86_OP_ENTRY3(PCMPGTW,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x66] = X86_OP_ENTRY3(PCMPGTD,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x67] = X86_OP_ENTRY3(PACKUSWB,   V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+
     [0x38] = X86_OP_GROUP0(0F38),
     [0x3a] = X86_OP_GROUP0(0F3A),
+
+    [0x68] = X86_OP_ENTRY3(PUNPCKHBW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x69] = X86_OP_ENTRY3(PUNPCKHWD,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x6a] = X86_OP_ENTRY3(PUNPCKHDQ,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x6b] = X86_OP_ENTRY3(PACKSSDW,   V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x6c] = X86_OP_ENTRY3(PUNPCKLQDQ, V,x, H,x, W,x,  vex4 p_66 avx2_256),
+    [0x6d] = X86_OP_ENTRY3(PUNPCKHQDQ, V,x, H,x, W,x,  vex4 p_66 avx2_256),
+    [0x6e] = X86_OP_ENTRY3(MOVD_to,    V,x, None,None, E,y, vex5 mmx p_00_66),  /* wrong dest Vy on SDM! */
+    [0x6f] = X86_OP_GROUP3(0F6F,       V,x, None,None, W,x, vex5 mmx p_00_66_f3),
 };
 
 static void do_decode_0F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 36b963a0d3..3f89d3cf50 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -212,6 +212,97 @@ static void gen_writeback(DisasContext *s, X86DecodedOp *op)
     }
 }
 
+static inline int sse_vec_len(DisasContext *s, X86DecodedInsn *decode)
+{
+    if (decode->e.special == X86_SPECIAL_MMX &&
+        !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
+        return 8;
+    }
+    return s->vex_l ? 32 : 16;
+}
+
+static void gen_store_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode, int src_ofs)
+{
+    MemOp ot = decode->op[0].ot;
+    int vec_len = sse_vec_len(s, decode);
+
+    if (!decode->op[0].has_ea) {
+        tcg_gen_gvec_mov(MO_64, decode->op[0].offset, src_ofs, vec_len, vec_len);
+        return;
+    }
+
+    switch (ot) {
+    case MO_64:
+        gen_stq_env_A0(s, src_ofs);
+        break;
+    case MO_128:
+        gen_sto_env_A0(s, src_ofs);
+        break;
+    case MO_256:
+        gen_sty_env_A0(s, src_ofs);
+        break;
+    default:
+        abort();
+    }
+}
+
+/*
+ * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
+ * 66 = vp* Vx, Hx, Wx
+ *
+ * These are really the same encoding, because 1) V is the same as P when VEX.V
+ * is not present 2) P and Q are the same as H and W apart from MM/XMM
+ */
+static inline void gen_binary_int_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                      SSEFunc_0_eppp mmx, SSEFunc_0_eppp xmm, SSEFunc_0_eppp ymm)
+{
+    assert (!!mmx == !!(decode->e.special == X86_SPECIAL_MMX));
+
+    if (mmx && (s->prefix & PREFIX_VEX) && !(s->prefix & PREFIX_DATA)) {
+        /* VEX encoding is not applicable to MMX instructions.  */
+        gen_illegal_opcode(s);
+        return;
+    }
+    if (!(s->prefix & PREFIX_DATA)) {
+        mmx(cpu_env, s->ptr0, s->ptr1, s->ptr2);
+    } else if (!s->vex_l) {
+        xmm(cpu_env, s->ptr0, s->ptr1, s->ptr2);
+    } else {
+        ymm(cpu_env, s->ptr0, s->ptr1, s->ptr2);
+    }
+}
+
+#define BINARY_INT_MMX(uname, lname)                                               \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_binary_int_sse(s, env, decode,                                             \
+                          gen_helper_##lname##_mmx,                                \
+                          gen_helper_##lname##_xmm,                                \
+                          gen_helper_##lname##_ymm);                               \
+}
+BINARY_INT_MMX(PUNPCKLBW,  punpcklbw)
+BINARY_INT_MMX(PUNPCKLWD,  punpcklwd)
+BINARY_INT_MMX(PUNPCKLDQ,  punpckldq)
+BINARY_INT_MMX(PACKSSWB,   packsswb)
+BINARY_INT_MMX(PACKUSWB,   packuswb)
+BINARY_INT_MMX(PUNPCKHBW,  punpckhbw)
+BINARY_INT_MMX(PUNPCKHWD,  punpckhwd)
+BINARY_INT_MMX(PUNPCKHDQ,  punpckhdq)
+BINARY_INT_MMX(PACKSSDW,   packssdw)
+
+/* Instructions with no MMX equivalent.  */
+#define BINARY_INT_SSE(uname, lname)                                               \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_binary_int_sse(s, env, decode,                                             \
+                          NULL,                                                    \
+                          gen_helper_##lname##_xmm,                                \
+                          gen_helper_##lname##_ymm);                               \
+}
+
+BINARY_INT_SSE(PUNPCKLQDQ, punpcklqdq)
+BINARY_INT_SSE(PUNPCKHQDQ, punpckhqdq)
+
 static void gen_ADCOX(DisasContext *s, CPUX86State *env, MemOp ot, int cc_op)
 {
     TCGv carry_in = NULL;
@@ -382,6 +473,36 @@ static void gen_MOVBE(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     }
 }
 
+static void gen_MOVD_to(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[2].ot;
+    int vec_len = sse_vec_len(s, decode);
+    int lo_ofs = decode->op[0].offset
+        - xmm_offset(decode->op[0].ot)
+        + xmm_offset(ot);
+
+    tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+
+    switch (ot) {
+    case MO_32:
+#ifdef TARGET_X86_64
+        tcg_gen_trunc_tl_i32(s->tmp3_i32, s->T1);
+        tcg_gen_st_i32(s->tmp3_i32, cpu_env, lo_ofs);
+        break;
+    case MO_64:
+#endif
+        tcg_gen_st_tl(s->T1, cpu_env, lo_ofs);
+        break;
+    default:
+        abort();
+    }
+}
+
+static void gen_MOVDQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_store_sse(s, env, decode, decode->op[2].offset);
+}
+
 static void gen_MULX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     MemOp ot = decode->op[0].ot;
@@ -405,6 +526,33 @@ static void gen_MULX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 
 }
 
+static void gen_PCMPGTB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_cmp(TCG_COND_GT, MO_8,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PCMPGTW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_cmp(TCG_COND_GT, MO_16,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PCMPGTD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_cmp(TCG_COND_GT, MO_32,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
 static void gen_PDEP(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     MemOp ot = decode->op[1].ot;
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index e147a95c5f..cf18e12d38 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -23,6 +23,7 @@
 #include "disas/disas.h"
 #include "exec/exec-all.h"
 #include "tcg/tcg-op.h"
+#include "tcg/tcg-op-gvec.h"
 #include "exec/cpu_ldst.h"
 #include "exec/translator.h"
 
@@ -4665,7 +4666,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 #ifndef CONFIG_USER_ONLY
         use_new &= b <= limit;
 #endif
-        if (use_new && 0) {
+        if (use_new && (b >= 0x160 && b <= 0x16f)) {
             return disas_insn_new(s, cpu, b + 0x100);
         }
         break;
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 21/37] target/i386: reimplement 0x0f 0xd8-0xdf, 0xe8-0xef, 0xf8-0xff, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (19 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-12 13:19   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 22/37] target/i386: reimplement 0x0f 0x50-0x5f, " Paolo Bonzini
                   ` (15 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

These are more of the simple integer instructions present in both MMX and
SSE/AVX, with no holes that were later occupied by newer instructions.

Simple, non-saturating operations are implemented using gvec; apart
from this, there is not much to talk about.
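
For reference (not part of the patch), the gvec expander used below has
the following shape; dofs/aofs/bofs are byte offsets into CPUX86State,
and oprsz/maxsz give the operated-on and total vector sizes in bytes.
The call is an illustrative PADDW on hand-picked XMM offsets:

    void tcg_gen_gvec_add(unsigned vece, uint32_t dofs, uint32_t aofs,
                          uint32_t bofs, uint32_t oprsz, uint32_t maxsz);

    /* 16-bit lanes over one 16-byte XMM register */
    tcg_gen_gvec_add(MO_16,
                     offsetof(CPUX86State, xmm_regs[0]),
                     offsetof(CPUX86State, xmm_regs[1]),
                     offsetof(CPUX86State, xmm_regs[2]),
                     16, 16);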

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc |  28 ++++++++
 target/i386/tcg/emit.c.inc       | 113 +++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |   4 +-
 3 files changed, 144 insertions(+), 1 deletion(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index f20587c096..59f5637583 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -264,6 +264,34 @@ static const X86OpEntry opcodes_0F[256] = {
     [0x6d] = X86_OP_ENTRY3(PUNPCKHQDQ, V,x, H,x, W,x,  vex4 p_66 avx2_256),
     [0x6e] = X86_OP_ENTRY3(MOVD_to,    V,x, None,None, E,y, vex5 mmx p_00_66),  /* wrong dest Vy on SDM! */
     [0x6f] = X86_OP_GROUP3(0F6F,       V,x, None,None, W,x, vex5 mmx p_00_66_f3),
+
+    /* Incorrectly missing from 2-17 */
+    [0xd8] = X86_OP_ENTRY3(PSUBUSB,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xd9] = X86_OP_ENTRY3(PSUBUSW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xda] = X86_OP_ENTRY3(PMINUB,   V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xdb] = X86_OP_ENTRY3(PAND,     V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xdc] = X86_OP_ENTRY3(PADDUSB,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xdd] = X86_OP_ENTRY3(PADDUSW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xde] = X86_OP_ENTRY3(PMAXUB,   V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xdf] = X86_OP_ENTRY3(PANDN,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+
+    [0xe8] = X86_OP_ENTRY3(PSUBSB,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xe9] = X86_OP_ENTRY3(PSUBSW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xea] = X86_OP_ENTRY3(PMINSW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xeb] = X86_OP_ENTRY3(POR,     V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xec] = X86_OP_ENTRY3(PADDSB,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xed] = X86_OP_ENTRY3(PADDSW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xee] = X86_OP_ENTRY3(PMAXSW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xef] = X86_OP_ENTRY3(PXOR,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+
+    [0xf8] = X86_OP_ENTRY3(PSUBB,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xf9] = X86_OP_ENTRY3(PSUBW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xfa] = X86_OP_ENTRY3(PSUBD,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xfb] = X86_OP_ENTRY3(PSUBQ,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xfc] = X86_OP_ENTRY3(PADDB,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xfd] = X86_OP_ENTRY3(PADDW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0xfe] = X86_OP_ENTRY3(PADDD,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    /* 0xff = UD0 */
 };
 
 static void do_decode_0F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 3f89d3cf50..1ba7a45668 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -290,6 +290,20 @@ BINARY_INT_MMX(PUNPCKHWD,  punpckhwd)
 BINARY_INT_MMX(PUNPCKHDQ,  punpckhdq)
 BINARY_INT_MMX(PACKSSDW,   packssdw)
 
+BINARY_INT_MMX(PSUBUSB, psubusb)
+BINARY_INT_MMX(PSUBUSW, psubusw)
+BINARY_INT_MMX(PMINUB,  pminub)
+BINARY_INT_MMX(PADDUSB, paddusb)
+BINARY_INT_MMX(PADDUSW, paddusw)
+BINARY_INT_MMX(PMAXUB,  pmaxub)
+
+BINARY_INT_MMX(PSUBSB, psubsb)
+BINARY_INT_MMX(PSUBSW, psubsw)
+BINARY_INT_MMX(PMINSW, pminsw)
+BINARY_INT_MMX(PADDSB, paddsb)
+BINARY_INT_MMX(PADDSW, paddsw)
+BINARY_INT_MMX(PMAXSW, pmaxsw)
+
 /* Instructions with no MMX equivalent.  */
 #define BINARY_INT_SSE(uname, lname)                                               \
 static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
@@ -526,6 +540,51 @@ static void gen_MULX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 
 }
 
+static void gen_PADDB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_add(MO_8,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PADDW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_add(MO_16,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PADDD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_add(MO_32,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PAND(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_and(MO_64,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PANDN(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_andc(MO_64,
+                      decode->op[0].offset, decode->op[2].offset,
+                      decode->op[1].offset, vec_len, vec_len);
+}
+
 static void gen_PCMPGTB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
@@ -571,6 +630,60 @@ static void gen_PEXT(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     gen_helper_pext(s->T0, s->T0, s->T1);
 }
 
+static void gen_POR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_or(MO_64,
+                    decode->op[0].offset, decode->op[1].offset,
+                    decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PXOR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_xor(MO_64,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PSUBB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_sub(MO_8,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PSUBW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_sub(MO_16,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PSUBD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_sub(MO_32,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PSUBQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_sub(MO_64,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
 static void gen_RORX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     MemOp ot = decode->op[0].ot;
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index cf18e12d38..11c17258eb 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4666,7 +4666,9 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 #ifndef CONFIG_USER_ONLY
         use_new &= b <= limit;
 #endif
-        if (use_new && (b >= 0x160 && b <= 0x16f)) {
+        if (use_new &&
+            ((b >= 0x160 && b <= 0x16f) ||
+             (b >= 0x1d8 && b <= 0x1ff && (b & 8)))) {
             return disas_insn_new(s, cpu, b + 0x100);
         }
         break;
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 22/37] target/i386: reimplement 0x0f 0x50-0x5f, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (20 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 21/37] target/i386: reimplement 0x0f 0xd8-0xdf, 0xe8-0xef, 0xf8-0xff, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-12 13:46   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 23/37] target/i386: reimplement 0x0f 0x78-0x7f, " Paolo Bonzini
                   ` (14 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

These are mostly floating-point SSE operations.  The odd ones out
are MOVMSK and CVTxx2yy; the others are straightforward.

Unary operations are a bit special in AVX because they have 2 operands
for the PD/PS forms (VEX.vvvv must be 1111b) but 3 operands for SD/SS.
They are handled using X86_OP_GROUP3 for compactness.
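
As a reference for the merging behavior (illustrative C, not code from
this series; vsqrtss_ref is a made-up name):

    #include <math.h>

    /* VSQRTSS dst, src1, src2: the operation uses only the low element
     * of src2; the other XMM lanes are merged in from the H (VEX.vvvv)
     * operand, and the YMM upper half is zeroed (not shown here). */
    static void vsqrtss_ref(float dst[4], const float src1[4],
                            const float src2[4])
    {
        dst[0] = sqrtf(src2[0]);
        dst[1] = src1[1];
        dst[2] = src1[2];
        dst[3] = src1[3];
    }

The packed forms have no such merge operand, which is why VEX.vvvv must
be 1111b there.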

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc |  32 ++++++
 target/i386/tcg/emit.c.inc       | 175 +++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |   2 +-
 3 files changed, 208 insertions(+), 1 deletion(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 59f5637583..5a94e05d71 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -243,7 +243,30 @@ static void decode_0F3A(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
     *entry = opcodes_0F3A[*b];
 }
 
+static void decode_sse_unary(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    if (!(s->prefix & (PREFIX_REPZ | PREFIX_REPNZ))) {
+        entry->op1 = X86_TYPE_None;
+        entry->s1 = X86_SIZE_None;
+    }
+    switch (*b) {
+    case 0x51: entry->gen = gen_VSQRT; break;
+    case 0x52: entry->gen = gen_VRSQRT; break;
+    case 0x53: entry->gen = gen_VRCP; break;
+    case 0x5A: entry->gen = gen_VCVTfp2fp; break;
+    }
+}
+
 static const X86OpEntry opcodes_0F[256] = {
+    [0x50] = X86_OP_ENTRY3(MOVMSK,     G,y, None,None, U,x, vex7 p_00_66),
+    [0x51] = X86_OP_GROUP3(sse_unary,  V,x, H,x, W,x, vex2_rep3 p_00_66_f3_f2),
+    [0x52] = X86_OP_GROUP3(sse_unary,  V,x, H,x, W,x, vex5 p_00_f3),
+    [0x53] = X86_OP_GROUP3(sse_unary,  V,x, H,x, W,x, vex5 p_00_f3),
+    [0x54] = X86_OP_ENTRY3(VAND,       V,x, H,x, W,x,  vex4 p_00_66),
+    [0x55] = X86_OP_ENTRY3(VANDN,      V,x, H,x, W,x,  vex4 p_00_66),
+    [0x56] = X86_OP_ENTRY3(VOR,        V,x, H,x, W,x,  vex4 p_00_66),
+    [0x57] = X86_OP_ENTRY3(VXOR,       V,x, H,x, W,x,  vex4 p_00_66),
+
     [0x60] = X86_OP_ENTRY3(PUNPCKLBW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
     [0x61] = X86_OP_ENTRY3(PUNPCKLWD,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
     [0x62] = X86_OP_ENTRY3(PUNPCKLDQ,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
@@ -256,6 +279,15 @@ static const X86OpEntry opcodes_0F[256] = {
     [0x38] = X86_OP_GROUP0(0F38),
     [0x3a] = X86_OP_GROUP0(0F3A),
 
+    [0x58] = X86_OP_ENTRY3(VADD,       V,x, H,x, W,x, vex2_rep3 p_00_66_f3_f2),
+    [0x59] = X86_OP_ENTRY3(VMUL,       V,x, H,x, W,x, vex2_rep3 p_00_66_f3_f2),
+    [0x5a] = X86_OP_GROUP3(sse_unary,  V,x, H,x, W,x, vex3 p_00_66_f3_f2),
+    [0x5b] = X86_OP_ENTRY2(VCVTps_dq,  V,x, W,x,      vex2 p_00_66_f3),
+    [0x5c] = X86_OP_ENTRY3(VSUB,       V,x, H,x, W,x, vex2_rep3 p_00_66_f3_f2),
+    [0x5d] = X86_OP_ENTRY3(VMIN,       V,x, H,x, W,x, vex2_rep3 p_00_66_f3_f2),
+    [0x5e] = X86_OP_ENTRY3(VDIV,       V,x, H,x, W,x, vex2_rep3 p_00_66_f3_f2),
+    [0x5f] = X86_OP_ENTRY3(VMAX,       V,x, H,x, W,x, vex2_rep3 p_00_66_f3_f2),
+
     [0x68] = X86_OP_ENTRY3(PUNPCKHBW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
     [0x69] = X86_OP_ENTRY3(PUNPCKHWD,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
     [0x6a] = X86_OP_ENTRY3(PUNPCKHDQ,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 1ba7a45668..5feb50efdb 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -246,6 +246,140 @@ static void gen_store_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *dec
     }
 }
 
+static inline int sse_prefix(DisasContext *s)
+{
+    if (s->prefix & (PREFIX_REPZ | PREFIX_REPNZ)) {
+        return s->prefix & PREFIX_REPZ ? 0xf3 : 0xf2;
+    } else {
+        return s->prefix & PREFIX_DATA ? 0x66 : 0x00;
+    }
+}
+
+/*
+ * 00 = v*ps Vps, Hps, Wps
+ * 66 = v*pd Vpd, Hpd, Wpd
+ * f3 = v*ss Vss, Hss, Wss
+ * f2 = v*sd Vsd, Hsd, Wsd
+ */
+static inline void gen_unary_fp_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                              SSEFunc_0_epp pd_xmm, SSEFunc_0_epp ps_xmm,
+                              SSEFunc_0_epp pd_ymm, SSEFunc_0_epp ps_ymm,
+                              SSEFunc_0_eppp sd, SSEFunc_0_eppp ss)
+{
+    if ((s->prefix & (PREFIX_REPZ | PREFIX_REPNZ)) != 0) {
+        SSEFunc_0_eppp fn = s->prefix & PREFIX_REPZ ? ss : sd;
+        if (!fn) {
+            gen_illegal_opcode(s);
+            return;
+        }
+        fn(cpu_env, s->ptr0, s->ptr1, s->ptr2);
+    } else {
+        SSEFunc_0_epp ps, pd, fn;
+        ps = s->vex_l ? ps_ymm : ps_xmm;
+        pd = s->vex_l ? pd_ymm : pd_xmm;
+        fn = s->prefix & PREFIX_DATA ? pd : ps;
+        if (!fn) {
+            gen_illegal_opcode(s);
+            return;
+        }
+        fn(cpu_env, s->ptr0, s->ptr2);
+    }
+}
+#define UNARY_FP_SSE(uname, lname)                                                 \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_unary_fp_sse(s, env, decode,                                               \
+                     gen_helper_##lname##pd_xmm,                                   \
+                     gen_helper_##lname##ps_xmm,                                   \
+                     gen_helper_##lname##pd_ymm,                                   \
+                     gen_helper_##lname##ps_ymm,                                   \
+                     gen_helper_##lname##sd,                                       \
+                     gen_helper_##lname##ss);                                      \
+}
+UNARY_FP_SSE(VSQRT, sqrt)
+
+/*
+ * 00 = v*ps Vps, Hps, Wps
+ * 66 = v*pd Vpd, Hpd, Wpd
+ * f3 = v*ss Vss, Hss, Wss
+ * f2 = v*sd Vsd, Hsd, Wsd
+ */
+static inline void gen_fp_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                              SSEFunc_0_eppp pd_xmm, SSEFunc_0_eppp ps_xmm,
+                              SSEFunc_0_eppp pd_ymm, SSEFunc_0_eppp ps_ymm,
+                              SSEFunc_0_eppp sd, SSEFunc_0_eppp ss)
+{
+    SSEFunc_0_eppp ps, pd, fn;
+    if ((s->prefix & (PREFIX_REPZ | PREFIX_REPNZ)) != 0) {
+        fn = s->prefix & PREFIX_REPZ ? ss : sd;
+    } else {
+        ps = s->vex_l ? ps_ymm : ps_xmm;
+        pd = s->vex_l ? pd_ymm : pd_xmm;
+        fn = s->prefix & PREFIX_DATA ? pd : ps;
+    }
+    if (fn) {
+        fn(cpu_env, s->ptr0, s->ptr1, s->ptr2);
+    } else {
+        gen_illegal_opcode(s);
+    }
+}
+#define FP_SSE(uname, lname)                                                       \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_fp_sse(s, env, decode,                                                     \
+               gen_helper_##lname##pd_xmm,                                         \
+               gen_helper_##lname##ps_xmm,                                         \
+               gen_helper_##lname##pd_ymm,                                         \
+               gen_helper_##lname##ps_ymm,                                         \
+               gen_helper_##lname##sd,                                             \
+               gen_helper_##lname##ss);                                            \
+}
+FP_SSE(VADD, add)
+FP_SSE(VMUL, mul)
+FP_SSE(VSUB, sub)
+FP_SSE(VMIN, min)
+FP_SSE(VDIV, div)
+FP_SSE(VMAX, max)
+
+/*
+ * 00 = v*ps Vps, Wps
+ * f3 = v*ss Vss, Wss
+ */
+static inline void gen_unary_fp32_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                      SSEFunc_0_epp ps_xmm,
+                                      SSEFunc_0_epp ps_ymm,
+                                      SSEFunc_0_eppp ss)
+{
+    if ((s->prefix & (PREFIX_DATA | PREFIX_REPNZ)) != 0) {
+        goto illegal_op;
+    } else if (s->prefix & PREFIX_REPZ) {
+        if (!ss) {
+            goto illegal_op;
+        }
+        ss(cpu_env, s->ptr0, s->ptr1, s->ptr2);
+    } else {
+        SSEFunc_0_epp fn = s->vex_l ? ps_ymm : ps_xmm;
+        if (!fn) {
+            goto illegal_op;
+        }
+        fn(cpu_env, s->ptr0, s->ptr2);
+    }
+    return;
+
+illegal_op:
+    gen_illegal_opcode(s);
+}
+#define UNARY_FP32_SSE(uname, lname)                                               \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_unary_fp32_sse(s, env, decode,                                             \
+                       gen_helper_##lname##ps_xmm,                                 \
+                       gen_helper_##lname##ps_ymm,                                 \
+                       gen_helper_##lname##ss);                                    \
+}
+UNARY_FP32_SSE(VRSQRT, rsqrt)
+UNARY_FP32_SSE(VRCP, rcp)
+
 /*
  * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
  * 66 = vp* Vx, Hx, Wx
@@ -517,6 +651,16 @@ static void gen_MOVDQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     gen_store_sse(s, env, decode, decode->op[2].offset);
 }
 
+static void gen_MOVMSK(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    typeof(gen_helper_movmskps_ymm) *ps, *pd, *fn;
+    ps = s->vex_l ? gen_helper_movmskps_ymm : gen_helper_movmskps_xmm;
+    pd = s->vex_l ? gen_helper_movmskpd_ymm : gen_helper_movmskpd_xmm;
+    fn = s->prefix & PREFIX_DATA ? pd : ps;
+    fn(s->tmp2_i32, cpu_env, s->ptr2);
+    tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
+}
+
 static void gen_MULX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     MemOp ot = decode->op[0].ot;
@@ -733,3 +877,34 @@ static void gen_SHRX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     }
     tcg_gen_shr_tl(s->T0, s->T0, s->T1);
 }
+
+#define gen_VAND   gen_PAND
+#define gen_VANDN  gen_PANDN
+
+static void gen_VCVTfp2fp(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_unary_fp_sse(s, env, decode,
+                     gen_helper_cvtpd2ps_xmm, gen_helper_cvtps2pd_xmm,
+                     gen_helper_cvtpd2ps_ymm, gen_helper_cvtps2pd_ymm,
+                     gen_helper_cvtsd2ss, gen_helper_cvtss2sd);
+}
+
+static void gen_VCVTps_dq(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    SSEFunc_0_epp fn = NULL;
+    switch (sse_prefix(s)) {
+    case 0x00:
+        fn = s->vex_l ? gen_helper_cvtdq2ps_ymm : gen_helper_cvtdq2ps_xmm;
+        break;
+    case 0x66:
+        fn = s->vex_l ? gen_helper_cvtps2dq_ymm : gen_helper_cvtps2dq_xmm;
+        break;
+    case 0xf3:
+        fn = s->vex_l ? gen_helper_cvttps2dq_ymm : gen_helper_cvttps2dq_xmm;
+        break;
+    }
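+    /* The p_00_66_f3 decode flag limits the prefix to the three cases
+     * above, so fn cannot be NULL here. */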
+    fn(cpu_env, s->ptr0, s->ptr2);
+}
+
+#define gen_VOR   gen_POR
+#define gen_VXOR  gen_PXOR
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 11c17258eb..8ef419dd59 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4667,7 +4667,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
         use_new &= b <= limit;
 #endif
         if (use_new &&
-            ((b >= 0x160 && b <= 0x16f) ||
+            ((b >= 0x150 && b <= 0x16f) ||
              (b >= 0x1d8 && b <= 0x1ff && (b & 8)))) {
             return disas_insn_new(s, cpu, b + 0x100);
         }
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 23/37] target/i386: reimplement 0x0f 0x78-0x7f, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (21 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 22/37] target/i386: reimplement 0x0f 0x50-0x5f, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-12 13:56   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 24/37] target/i386: reimplement 0x0f 0x70-0x77, " Paolo Bonzini
                   ` (13 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

These are a mixed bag, including the first two horizontal operations
(66 and F2 prefixes only), more moves, and SSE4a extract/insert.

Because SSE4a is pretty rare, I chose to leave the helpers as they are,
but it is possible to unify them by loading index and length from the
source XMM register, as in the sketch below.
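
A minimal sketch of that unification for the EXTRQ case (hypothetical
code, not part of this patch; it assumes extrq_r takes length from byte 0
and index from byte 1 of the low quadword of its second operand, matching
the layout of the immediate form):

    static void gen_EXTRQ_i_unified(DisasContext *s, CPUX86State *env,
                                    X86DecodedInsn *decode)
    {
        /* Stage the length/index imm8 fields in the scratch register,
         * then reuse the register-form helper.  INSERTQ would also
         * need the insert data copied alongside the fields. */
        TCGv_i64 imm = tcg_const_i64(decode->immediate);
        TCGv_ptr ptr = tcg_temp_new_ptr();

        tcg_gen_addi_ptr(ptr, cpu_env, offsetof(CPUX86State, xmm_t0));
        tcg_gen_st_i64(imm, ptr, offsetof(ZMMReg, ZMM_Q(0)));
        gen_helper_extrq_r(cpu_env, s->ptr0, ptr);
        tcg_temp_free_ptr(ptr);
        tcg_temp_free_i64(imm);
    }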

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc | 23 +++++++++
 target/i386/tcg/emit.c.inc       | 81 ++++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |  1 +
 3 files changed, 105 insertions(+)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 5a94e05d71..6aa8bac74f 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -159,6 +159,22 @@ static void decode_0F6F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
     }
 }
 
+static void decode_0F7E(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry movd_from_vec =
+        X86_OP_ENTRY3(MOVD_from,  E,y, None,None, V,y, vex5 mmx);
+    static const X86OpEntry movq =
+        X86_OP_ENTRY3(MOVQ,       V,x, None,None, W,q, vex5);  /* wrong dest Vy on SDM! */
+
+    if (s->prefix & PREFIX_REPNZ) {
+        entry->gen = NULL;
+    } else if (s->prefix & PREFIX_REPZ) {
+        *entry = movq;
+    } else {
+        *entry = movd_from_vec;
+    }
+}
+
 static const X86OpEntry opcodes_0F38_00toEF[240] = {
 };
 
@@ -297,6 +313,13 @@ static const X86OpEntry opcodes_0F[256] = {
     [0x6e] = X86_OP_ENTRY3(MOVD_to,    V,x, None,None, E,y, vex5 mmx p_00_66),  /* wrong dest Vy on SDM! */
     [0x6f] = X86_OP_GROUP3(0F6F,       V,x, None,None, W,x, vex5 mmx p_00_66_f3),
 
+    [0x78] = X86_OP_ENTRY2(SSE4a_I,    V,x, I,w,       cpuid(SSE4A) p_66_f2),
+    [0x79] = X86_OP_ENTRY2(SSE4a_R,    V,x, W,x,       cpuid(SSE4A) p_66_f2),
+    [0x7c] = X86_OP_ENTRY3(VHADD,      V,x, H,x, W,x,  vex2 cpuid(SSE3) p_66_f2),
+    [0x7d] = X86_OP_ENTRY3(VHSUB,      V,x, H,x, W,x,  vex2 cpuid(SSE3) p_66_f2),
+    [0x7e] = X86_OP_GROUP0(0F7E),
+    [0x7f] = X86_OP_GROUP3(0F6F,       W,x, None,None, V,x, vex5 mmx p_00_66_f3),
+
     /* Incorrectly missing from 2-17 */
     [0xd8] = X86_OP_ENTRY3(PSUBUSB,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
     [0xd9] = X86_OP_ENTRY3(PSUBUSW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 5feb50efdb..2053c9d8fb 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -380,6 +380,30 @@ static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
 UNARY_FP32_SSE(VRSQRT, rsqrt)
 UNARY_FP32_SSE(VRCP, rcp)
 
+/*
+ * 66 = v*pd Vpd, Hpd, Wpd
+ * f2 = v*ps Vps, Hps, Wps
+ */
+static inline void gen_horizontal_fp_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                         SSEFunc_0_eppp pd_xmm, SSEFunc_0_eppp ps_xmm,
+                                         SSEFunc_0_eppp pd_ymm, SSEFunc_0_eppp ps_ymm)
+{
+    SSEFunc_0_eppp ps, pd, fn;
+    ps = s->vex_l ? ps_ymm : ps_xmm;
+    pd = s->vex_l ? pd_ymm : pd_xmm;
+    fn = s->prefix & PREFIX_DATA ? pd : ps;
+    fn(cpu_env, s->ptr0, s->ptr1, s->ptr2);
+}
+#define HORIZONTAL_FP_SSE(uname, lname)                                            \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_horizontal_fp_sse(s, env, decode,                                          \
+                          gen_helper_##lname##pd_xmm, gen_helper_##lname##ps_xmm,  \
+                          gen_helper_##lname##pd_ymm, gen_helper_##lname##ps_ymm); \
+}
+HORIZONTAL_FP_SSE(VHADD, hadd)
+HORIZONTAL_FP_SSE(VHSUB, hsub)
+
 /*
  * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
  * 66 = vp* Vx, Hx, Wx
@@ -621,6 +645,28 @@ static void gen_MOVBE(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     }
 }
 
+static void gen_MOVD_from(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[2].ot;
+    int lo_ofs = decode->op[2].offset
+        - xmm_offset(decode->op[2].ot)
+        + xmm_offset(ot);
+
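+    /* Mirror image of gen_MOVD_to: on 32-bit targets tl is 32 bits,
+     * so MO_32 can share the tcg_gen_ld_tl path below. */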
+    switch (ot) {
+    case MO_32:
+#ifdef TARGET_X86_64
+        tcg_gen_ld_i32(s->tmp2_i32, cpu_env, lo_ofs);
+        tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
+        break;
+    case MO_64:
+#endif
+        tcg_gen_ld_tl(s->T0, cpu_env, lo_ofs);
+        break;
+    default:
+        abort();
+    }
+}
+
 static void gen_MOVD_to(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     MemOp ot = decode->op[2].ot;
@@ -661,6 +707,18 @@ static void gen_MOVMSK(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode
     tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
 }
 
+static void gen_MOVQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+    int lo_ofs = decode->op[0].offset
+        - xmm_offset(decode->op[0].ot)
+        + xmm_offset(MO_64);
+
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset);
+    tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, lo_ofs);
+}
+
 static void gen_MULX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     MemOp ot = decode->op[0].ot;
@@ -878,6 +936,29 @@ static void gen_SHRX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     tcg_gen_shr_tl(s->T0, s->T0, s->T1);
 }
 
+static void gen_SSE4a_I(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 length = tcg_const_i32(decode->immediate & 255);
+    TCGv_i32 index = tcg_const_i32(decode->immediate >> 8);
+
+    if (s->prefix & PREFIX_DATA) {
+        gen_helper_extrq_i(cpu_env, s->ptr0, index, length);
+    } else {
+        gen_helper_insertq_i(cpu_env, s->ptr0, index, length);
+    }
+    tcg_temp_free_i32(length);
+    tcg_temp_free_i32(index);
+}
+
+static void gen_SSE4a_R(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    if (s->prefix & PREFIX_DATA) {
+        gen_helper_extrq_r(cpu_env, s->ptr0, s->ptr2);
+    } else {
+        gen_helper_insertq_r(cpu_env, s->ptr0, s->ptr2);
+    }
+}
+
 #define gen_VAND   gen_PAND
 #define gen_VANDN  gen_PANDN
 
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 8ef419dd59..53d693279a 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4668,6 +4668,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 #endif
         if (use_new &&
             ((b >= 0x150 && b <= 0x16f) ||
+             (b >= 0x178 && b <= 0x17f) ||
              (b >= 0x1d8 && b <= 0x1ff && (b & 8)))) {
             return disas_insn_new(s, cpu, b + 0x100);
         }
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 24/37] target/i386: reimplement 0x0f 0x70-0x77, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (22 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 23/37] target/i386: reimplement 0x0f 0x78-0x7f, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-12 14:29   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 25/37] target/i386: reimplement 0x0f 0xd0-0xd7, 0xe0-0xe7, 0xf0-0xf7, " Paolo Bonzini
                   ` (12 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

This includes shifts by immediate, which use bits 3-5 of the ModRM byte
as an opcode extension.  With the exception of the 128-bit shifts
PSRLDQ/PSLLDQ, which move whole bytes across element boundaries and thus
have no gvec equivalent, they are implemented using gvec.

This also covers VZEROALL and VZEROUPPER, which use the same opcode
as EMMS.  If we ever want to optimize out gen_clear_ymmh, this would
be one of the starting points.  The implementation of the VZEROALL
and VZEROUPPER helpers is by Paul Brook.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/helper.h             |   7 +
 target/i386/tcg/decode-new.c.inc |  76 ++++++++++
 target/i386/tcg/emit.c.inc       | 232 +++++++++++++++++++++++++++++++
 target/i386/tcg/fpu_helper.c     |  46 ++++++
 target/i386/tcg/translate.c      |   3 +-
 5 files changed, 362 insertions(+), 2 deletions(-)

diff --git a/target/i386/helper.h b/target/i386/helper.h
index 3da5df98b9..d7e6878263 100644
--- a/target/i386/helper.h
+++ b/target/i386/helper.h
@@ -221,6 +221,13 @@ DEF_HELPER_3(movq, void, env, ptr, ptr)
 #define SHIFT 2
 #include "ops_sse_header.h"
 
+DEF_HELPER_1(vzeroall, void, env)
+DEF_HELPER_1(vzeroupper, void, env)
+#ifdef TARGET_X86_64
+DEF_HELPER_1(vzeroall_hi8, void, env)
+DEF_HELPER_1(vzeroupper_hi8, void, env)
+#endif
+
 DEF_HELPER_3(rclb, tl, env, tl, tl)
 DEF_HELPER_3(rclw, tl, env, tl, tl)
 DEF_HELPER_3(rcll, tl, env, tl, tl)
diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 6aa8bac74f..0e2da85934 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -133,6 +133,19 @@ static uint8_t get_modrm(DisasContext *s, CPUX86State *env)
     return s->modrm;
 }
 
+static inline const X86OpEntry *decode_by_prefix(DisasContext *s, const X86OpEntry entries[4])
+{
+    if (s->prefix & PREFIX_REPNZ) {
+        return &entries[3];
+    } else if (s->prefix & PREFIX_REPZ) {
+        return &entries[2];
+    } else if (s->prefix & PREFIX_DATA) {
+        return &entries[1];
+    } else {
+        return &entries[0];
+    }
+}
+
 static void decode_group17(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
 {
     static const X86GenFunc group17_gen[8] = {
@@ -142,6 +155,48 @@ static void decode_group17(DisasContext *s, CPUX86State *env, X86OpEntry *entry,
     entry->gen = group17_gen[op];
 }
 
+static void decode_group12_13_14(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry group[3][8] = {
+        {
+            /* grp12 */
+            {},
+            {},
+            X86_OP_ENTRY3(PSRLW_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
+            {},
+            X86_OP_ENTRY3(PSRAW_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
+            {},
+            X86_OP_ENTRY3(PSLLW_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
+            {},
+        },
+        {
+            /* grp13 */
+            {},
+            {},
+            X86_OP_ENTRY3(PSRLD_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
+            {},
+            X86_OP_ENTRY3(PSRAD_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
+            {},
+            X86_OP_ENTRY3(PSLLD_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
+            {},
+        },
+        {
+            /* grp14 */
+            {},
+            {},
+            X86_OP_ENTRY3(PSRLQ_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
+            X86_OP_ENTRY3(PSRLDQ_i, H,x, U,x, I,b, vex7 avx2_256 p_66),
+            {},
+            {},
+            X86_OP_ENTRY3(PSLLQ_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
+            X86_OP_ENTRY3(PSLLDQ_i, H,x, U,x, I,b, vex7 avx2_256 p_66),
+        }
+    };
+
+    int op = (get_modrm(s, env) >> 3) & 7;
+    *entry = group[*b - 0x71][op];
+}
+
 static void decode_0F6F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
 {
     if (s->prefix & PREFIX_REPNZ) {
@@ -159,6 +214,18 @@ static void decode_0F6F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
     }
 }
 
+static void decode_0F70(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry pshufw[4] = {
+        X86_OP_ENTRY3(PSHUFW,  P,q, Q,q, I,b, vex4),
+        X86_OP_ENTRY3(PSHUFD,  V,x, W,x, I,b, vex4 avx2_256),
+        X86_OP_ENTRY3(PSHUFHW, V,x, W,x, I,b, vex4 avx2_256),
+        X86_OP_ENTRY3(PSHUFLW, V,x, W,x, I,b, vex4 avx2_256),
+    };
+
+    *entry = *decode_by_prefix(s, pshufw);
+}
+
 static void decode_0F7E(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
 {
     static const X86OpEntry movd_from_vec =
@@ -292,6 +359,15 @@ static const X86OpEntry opcodes_0F[256] = {
     [0x66] = X86_OP_ENTRY3(PCMPGTD,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
     [0x67] = X86_OP_ENTRY3(PACKUSWB,   V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
 
+    [0x70] = X86_OP_GROUP0(0F70),
+    [0x71] = X86_OP_GROUP0(group12_13_14),
+    [0x72] = X86_OP_GROUP0(group12_13_14),
+    [0x73] = X86_OP_GROUP0(group12_13_14),
+    [0x74] = X86_OP_ENTRY3(PCMPEQB,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x75] = X86_OP_ENTRY3(PCMPEQW,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x76] = X86_OP_ENTRY3(PCMPEQD,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
+    [0x77] = X86_OP_ENTRY0(EMMS_VZERO, vex8),
+
     [0x38] = X86_OP_GROUP0(0F38),
     [0x3a] = X86_OP_GROUP0(0F3A),
 
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 2053c9d8fb..fb01035d06 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -475,6 +475,30 @@ static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
 BINARY_INT_SSE(PUNPCKLQDQ, punpcklqdq)
 BINARY_INT_SSE(PUNPCKHQDQ, punpckhqdq)
 
+static inline void gen_unary_imm_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                     SSEFunc_0_ppi xmm, SSEFunc_0_ppi ymm)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    if (!s->vex_l) {
+        xmm(s->ptr0, s->ptr1, imm);
+    } else {
+        ymm(s->ptr0, s->ptr1, imm);
+    }
+    tcg_temp_free_i32(imm);
+}
+
+#define UNARY_IMM_SSE(uname, lname)                                                \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_unary_imm_sse(s, env, decode,                                              \
+                      gen_helper_##lname##_xmm,                                    \
+                      gen_helper_##lname##_ymm);                                   \
+}
+
+UNARY_IMM_SSE(PSHUFD,     pshufd)
+UNARY_IMM_SSE(PSHUFHW,    pshufhw)
+UNARY_IMM_SSE(PSHUFLW,    pshuflw)
+
 static void gen_ADCOX(DisasContext *s, CPUX86State *env, MemOp ot, int cc_op)
 {
     TCGv carry_in = NULL;
@@ -633,6 +657,29 @@ static void gen_CRC32(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     gen_helper_crc32(s->T0, s->tmp2_i32, s->T1, tcg_const_i32(8 << ot));
 }
 
+static void gen_EMMS_VZERO(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    if (!(s->prefix & PREFIX_VEX)) {
+        gen_helper_emms(cpu_env);
+        return;
+    }
+    if (s->vex_l) {
+        gen_helper_vzeroall(cpu_env);
+#ifdef TARGET_X86_64
+        if (CODE64(s)) {
+            gen_helper_vzeroall_hi8(cpu_env);
+        }
+#endif
+    } else {
+        gen_helper_vzeroupper(cpu_env);
+#ifdef TARGET_X86_64
+        if (CODE64(s)) {
+            gen_helper_vzeroupper_hi8(cpu_env);
+        }
+#endif
+    }
+}
+
 static void gen_MOVBE(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     MemOp ot = decode->op[0].ot;
@@ -787,6 +834,33 @@ static void gen_PANDN(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
                       decode->op[1].offset, vec_len, vec_len);
 }
 
+static void gen_PCMPEQB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_cmp(TCG_COND_EQ, MO_8,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PCMPEQW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_cmp(TCG_COND_EQ, MO_16,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PCMPEQD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_cmp(TCG_COND_EQ, MO_32,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
 static void gen_PCMPGTB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
@@ -841,6 +915,164 @@ static void gen_POR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
                     decode->op[2].offset, vec_len, vec_len);
 }
 
+static void gen_PSHUFW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    gen_helper_pshufw_mmx(s->ptr0, s->ptr1, imm);
+    tcg_temp_free_i32(imm);
+}
+
+static void gen_PSRLW_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    if (decode->immediate >= 16) {
+        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    } else {
+        tcg_gen_gvec_shri(MO_16,
+                          decode->op[0].offset, decode->op[1].offset,
+                          decode->immediate, vec_len, vec_len);
+    }
+}
+
+static void gen_PSLLW_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    if (decode->immediate >= 16) {
+        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    } else {
+        tcg_gen_gvec_shli(MO_16,
+                          decode->op[0].offset, decode->op[1].offset,
+                          decode->immediate, vec_len, vec_len);
+    }
+}
+
+static void gen_PSRAW_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    if (decode->immediate >= 16) {
+        decode->immediate = 15;
+    }
+    tcg_gen_gvec_sari(MO_16,
+                      decode->op[0].offset, decode->op[1].offset,
+                      decode->immediate, vec_len, vec_len);
+}
+
+static void gen_PSRLD_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    if (decode->immediate >= 32) {
+        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    } else {
+        tcg_gen_gvec_shri(MO_32,
+                          decode->op[0].offset, decode->op[1].offset,
+                          decode->immediate, vec_len, vec_len);
+    }
+}
+
+static void gen_PSLLD_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    if (decode->immediate >= 32) {
+        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    } else {
+        tcg_gen_gvec_shli(MO_32,
+                          decode->op[0].offset, decode->op[1].offset,
+                          decode->immediate, vec_len, vec_len);
+    }
+}
+
+static void gen_PSRAD_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    if (decode->immediate >= 32) {
+        decode->immediate = 31;
+    }
+    tcg_gen_gvec_sari(MO_32,
+                      decode->op[0].offset, decode->op[1].offset,
+                      decode->immediate, vec_len, vec_len);
+}
+
+static void gen_PSRLQ_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    if (decode->immediate >= 64) {
+        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    } else {
+        tcg_gen_gvec_shri(MO_64,
+                          decode->op[0].offset, decode->op[1].offset,
+                          decode->immediate, vec_len, vec_len);
+    }
+}
+
+static void gen_PSLLQ_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    if (decode->immediate >= 64) {
+        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    } else {
+        tcg_gen_gvec_shli(MO_64,
+                          decode->op[0].offset, decode->op[1].offset,
+                          decode->immediate, vec_len, vec_len);
+    }
+}
+
+static inline TCGv_ptr make_imm_mmx_vec(uint32_t imm)
+{
+    TCGv_i64 imm_v = tcg_const_i64(imm);
+    TCGv_ptr ptr = tcg_temp_new_ptr();
+    tcg_gen_addi_ptr(ptr, cpu_env, offsetof(CPUX86State, mmx_t0));
+    tcg_gen_st_i64(imm_v, ptr, offsetof(MMXReg, MMX_Q(0)));
+    return ptr;
+}
+
+static inline TCGv_ptr make_imm_xmm_vec(uint32_t imm, int vec_len)
+{
+    MemOp ot = vec_len == 16 ? MO_128 : MO_256;
+    TCGv_i32 imm_v = tcg_const_i32(imm);
+    TCGv_ptr ptr = tcg_temp_new_ptr();
+
+    tcg_gen_gvec_dup_imm(MO_64, offsetof(CPUX86State, xmm_t0) + xmm_offset(ot),
+                         vec_len, vec_len, 0);
+
+    tcg_gen_addi_ptr(ptr, cpu_env, offsetof(CPUX86State, xmm_t0));
+    tcg_gen_st_i32(imm_v, ptr, offsetof(ZMMReg, ZMM_L(0)));
+    return ptr;
+}
+
+static void gen_PSRLDQ_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+    TCGv_ptr imm_vec = make_imm_xmm_vec(decode->immediate, vec_len);
+
+    if (s->vex_l) {
+        gen_helper_psrldq_ymm(cpu_env, s->ptr0, s->ptr1, imm_vec);
+    } else {
+        gen_helper_psrldq_xmm(cpu_env, s->ptr0, s->ptr1, imm_vec);
+    }
+    tcg_temp_free_ptr(imm_vec);
+}
+
+static void gen_PSLLDQ_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+    TCGv_ptr imm_vec = make_imm_xmm_vec(decode->immediate, vec_len);
+
+    if (s->vex_l) {
+        gen_helper_pslldq_ymm(cpu_env, s->ptr0, s->ptr1, imm_vec);
+    } else {
+        gen_helper_pslldq_xmm(cpu_env, s->ptr0, s->ptr1, imm_vec);
+    }
+    tcg_temp_free_ptr(imm_vec);
+}
+
 static void gen_PXOR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index 819e920ec6..230907bc5c 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -3056,3 +3056,49 @@ void helper_movq(CPUX86State *env, void *d, void *s)
 
 #define SHIFT 2
 #include "ops_sse.h"
+
+void helper_vzeroall(CPUX86State *env)
+{
+    int i;
+
+    for (i = 0; i < 8; i++) {
+        env->xmm_regs[i].ZMM_Q(0) = 0;
+        env->xmm_regs[i].ZMM_Q(1) = 0;
+        env->xmm_regs[i].ZMM_Q(2) = 0;
+        env->xmm_regs[i].ZMM_Q(3) = 0;
+    }
+}
+
+void helper_vzeroupper(CPUX86State *env)
+{
+    int i;
+
+    for (i = 0; i < 8; i++) {
+        env->xmm_regs[i].ZMM_Q(2) = 0;
+        env->xmm_regs[i].ZMM_Q(3) = 0;
+    }
+}
+
+#ifdef TARGET_X86_64
+void helper_vzeroall_hi8(CPUX86State *env)
+{
+    int i;
+
+    for (i = 8; i < 16; i++) {
+        env->xmm_regs[i].ZMM_Q(0) = 0;
+        env->xmm_regs[i].ZMM_Q(1) = 0;
+        env->xmm_regs[i].ZMM_Q(2) = 0;
+        env->xmm_regs[i].ZMM_Q(3) = 0;
+    }
+}
+
+void helper_vzeroupper_hi8(CPUX86State *env)
+{
+    int i;
+
+    for (i = 8; i < 16; i++) {
+        env->xmm_regs[i].ZMM_Q(2) = 0;
+        env->xmm_regs[i].ZMM_Q(3) = 0;
+    }
+}
+#endif
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4667,8 +4667,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
         use_new &= b <= limit;
 #endif
         if (use_new &&
-            ((b >= 0x150 && b <= 0x16f) ||
-             (b >= 0x178 && b <= 0x17f) ||
+            ((b >= 0x150 && b <= 0x17f) ||
              (b >= 0x1d8 && b <= 0x1ff && (b & 8)))) {
             return disas_insn_new(s, cpu, b + 0x100);
         }
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 25/37] target/i386: reimplement 0x0f 0xd0-0xd7, 0xe0-0xe7, 0xf0-0xf7, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (23 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 24/37] target/i386: reimplement 0x0f 0x70-0x77, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-12 15:06   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 26/37] target/i386: reimplement 0x0f 0x3a, " Paolo Bonzini
                   ` (11 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

The more complicated ones here are 0xd6-0xd7, 0xe6-0xe7 and 0xf7; the
others are trivial.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc | 39 +++++++++++++
 target/i386/tcg/emit.c.inc       | 99 +++++++++++++++++++++++++++++---
 target/i386/tcg/translate.c      |  4 +-
 3 files changed, 133 insertions(+), 9 deletions(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 0e2da85934..e9a9981a7f 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -242,6 +242,18 @@ static void decode_0F7E(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
     }
 }
 
+static void decode_0FD6(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry movq[4] = {
+        {},
+        X86_OP_ENTRY3(MOVQ,    W,x,  None, None, V,q, vex5),
+        X86_OP_ENTRY3(MOVq_dq, V,dq, None, None, N,q),
+        X86_OP_ENTRY3(MOVq_dq, P,q,  None, None, U,q),
+    };
+
+    *entry = *decode_by_prefix(s, movq);
+}
+
 static const X86OpEntry opcodes_0F38_00toEF[240] = {
 };
 
@@ -396,6 +408,33 @@ static const X86OpEntry opcodes_0F[256] = {
     [0x7e] = X86_OP_GROUP0(0F7E),
     [0x7f] = X86_OP_GROUP3(0F6F,       W,x, None,None, V,x, vex5 mmx p_00_66_f3),
 
+    [0xd0] = X86_OP_ENTRY3(VADDSUB,   V,x, H,x, W,x,        vex2 cpuid(SSE3) p_66_f2),
+    [0xd1] = X86_OP_ENTRY3(PSRLW_r,   V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xd2] = X86_OP_ENTRY3(PSRLD_r,   V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xd3] = X86_OP_ENTRY3(PSRLQ_r,   V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xd4] = X86_OP_ENTRY3(PADDQ,     V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xd5] = X86_OP_ENTRY3(PMULLW,    V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xd6] = X86_OP_GROUP0(0FD6),
+    [0xd7] = X86_OP_ENTRY3(PMOVMSKB,  G,d, None,None, U,x,  vex7 mmx avx2_256 p_00_66), /* MOVNTQ/MOVNTDQ */
+
+    [0xe0] = X86_OP_ENTRY3(PAVGB,     V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xe1] = X86_OP_ENTRY3(PSRAW_r,   V,x, H,x, W,x,        vex7 mmx avx2_256 p_00_66),
+    [0xe2] = X86_OP_ENTRY3(PSRAD_r,   V,x, H,x, W,x,        vex7 mmx avx2_256 p_00_66),
+    [0xe3] = X86_OP_ENTRY3(PAVGW,     V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xe4] = X86_OP_ENTRY3(PMULHUW,   V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xe5] = X86_OP_ENTRY3(PMULHW,    V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
+    [0xe6] = X86_OP_ENTRY2(VCVTpd_dq, V,x, W,x,             vex2 p_66_f3_f2),
+    [0xe7] = X86_OP_ENTRY3(MOVDQ,     W,x, None,None, V,x,  vex1 mmx p_00_66), /* MOVNTQ/MOVNTDQ */
+
+    [0xf0] = X86_OP_ENTRY3(LDDQU,    V,x, None,None, M,x,   vex4_unal cpuid(SSE3) p_f2),
+    [0xf1] = X86_OP_ENTRY3(PSLLW_r,  V,x, H,x, W,x,         vex7 mmx avx2_256 p_00_66),
+    [0xf2] = X86_OP_ENTRY3(PSLLD_r,  V,x, H,x, W,x,         vex7 mmx avx2_256 p_00_66),
+    [0xf3] = X86_OP_ENTRY3(PSLLQ_r,  V,x, H,x, W,x,         vex7 mmx avx2_256 p_00_66),
+    [0xf4] = X86_OP_ENTRY3(PMULUDQ,  V,x, H,x, W,x,         vex4 mmx avx2_256 p_00_66),
+    [0xf5] = X86_OP_ENTRY3(PMADDWD,  V,x, H,x, W,x,         vex4 mmx avx2_256 p_00_66),
+    [0xf6] = X86_OP_ENTRY3(PSADBW,   V,x, H,x, W,x,         vex4 mmx avx2_256 p_00_66),
+    [0xf7] = X86_OP_ENTRY3(MASKMOV,  None,None, V,dq, U,dq, vex4_unal avx2_256 mmx p_00_66),
+
     /* Incorrectly missing from 2-17 */
     [0xd8] = X86_OP_ENTRY3(PSUBUSB,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
     [0xd9] = X86_OP_ENTRY3(PSUBUSW,  V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index fb01035d06..c90f933093 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -403,6 +403,7 @@ static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
 }
 HORIZONTAL_FP_SSE(VHADD, hadd)
 HORIZONTAL_FP_SSE(VHSUB, hsub)
+HORIZONTAL_FP_SSE(VADDSUB, addsub)
 
 /*
  * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
@@ -462,6 +463,24 @@ BINARY_INT_MMX(PADDSB, paddsb)
 BINARY_INT_MMX(PADDSW, paddsw)
 BINARY_INT_MMX(PMAXSW, pmaxsw)
 
+BINARY_INT_MMX(PAVGB,   pavgb)
+BINARY_INT_MMX(PAVGW,   pavgw)
+BINARY_INT_MMX(PMADDWD, pmaddwd)
+BINARY_INT_MMX(PMULHUW, pmulhuw)
+BINARY_INT_MMX(PMULHW,  pmulhw)
+BINARY_INT_MMX(PMULLW,  pmullw)
+BINARY_INT_MMX(PMULUDQ, pmuludq)
+BINARY_INT_MMX(PSADBW,  psadbw)
+
+BINARY_INT_MMX(PSLLW_r, psllw)
+BINARY_INT_MMX(PSLLD_r, pslld)
+BINARY_INT_MMX(PSLLQ_r, psllq)
+BINARY_INT_MMX(PSRLW_r, psrlw)
+BINARY_INT_MMX(PSRLD_r, psrld)
+BINARY_INT_MMX(PSRLQ_r, psrlq)
+BINARY_INT_MMX(PSRAW_r, psraw)
+BINARY_INT_MMX(PSRAD_r, psrad)
+
 /* Instructions with no MMX equivalent.  */
 #define BINARY_INT_SSE(uname, lname)                                               \
 static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
@@ -680,6 +699,24 @@ static void gen_EMMS_VZERO(DisasContext *s, CPUX86State *env, X86DecodedInsn *de
     }
 }
 
+static void gen_LDDQU(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_load_sse(s, s->T0, decode->op[0].ot, decode->op[0].offset);
+}
+
+static void gen_MASKMOV(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    tcg_gen_mov_tl(s->A0, cpu_regs[R_EDI]);
+    gen_extu(s->aflag, s->A0);
+    gen_add_A0_ds_seg(s);
+
+    if (s->prefix & PREFIX_DATA) {
+        gen_helper_maskmov_xmm(cpu_env, s->ptr1, s->ptr2, s->A0);
+    } else {
+        gen_helper_maskmov_mmx(cpu_env, s->ptr1, s->ptr2, s->A0);
+    }
+}
+
 static void gen_MOVBE(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     MemOp ot = decode->op[0].ot;
@@ -756,14 +793,26 @@ static void gen_MOVMSK(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode
 
 static void gen_MOVQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
-    int vec_len = sse_vec_len(s, decode);
-    int lo_ofs = decode->op[0].offset
-        - xmm_offset(decode->op[0].ot)
-        + xmm_offset(MO_64);
-
     tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset);
-    tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
-    tcg_gen_st_i64(s->tmp1_i64, cpu_env, lo_ofs);
+
+    if (decode->op[0].has_ea) {
+        gen_op_st_v(s, MO_64, s->tmp1_i64, s->A0);
+    } else {
+        int vec_len = sse_vec_len(s, decode);
+        int lo_ofs = decode->op[0].offset
+            - xmm_offset(decode->op[0].ot)
+            + xmm_offset(MO_64);
+
+        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+        tcg_gen_st_i64(s->tmp1_i64, cpu_env, lo_ofs);
+    }
+}
+
+static void gen_MOVq_dq(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_helper_enter_mmx(cpu_env);
+    /* Otherwise the same as any other movq.  */
+    return gen_MOVQ(s, env, decode);
 }
 
 static void gen_MULX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
@@ -816,6 +865,15 @@ static void gen_PADDD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
                      decode->op[2].offset, vec_len, vec_len);
 }
 
+static void gen_PADDQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_add(MO_64,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
 static void gen_PAND(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
@@ -906,6 +964,16 @@ static void gen_PEXT(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     gen_helper_pext(s->T0, s->T0, s->T1);
 }
 
+static void gen_PMOVMSKB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    if (s->prefix & PREFIX_DATA) {
+        gen_helper_pmovmskb_xmm(s->tmp2_i32, cpu_env, s->ptr2);
+    } else {
+        gen_helper_pmovmskb_mmx(s->tmp2_i32, cpu_env, s->ptr2);
+    }
+    tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
+}
+
 static void gen_POR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
@@ -1202,6 +1270,23 @@ static void gen_VCVTfp2fp(DisasContext *s, CPUX86State *env, X86DecodedInsn *dec
                      gen_helper_cvtsd2ss, gen_helper_cvtss2sd);
 }
 
+static void gen_VCVTpd_dq(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    SSEFunc_0_epp fn = NULL;
+    switch (sse_prefix(s)) {
+    case 0x66:
+        fn = s->vex_l ? gen_helper_cvttpd2dq_ymm : gen_helper_cvttpd2dq_xmm;
+        break;
+    case 0xf3:
+        fn = s->vex_l ? gen_helper_cvtdq2pd_ymm : gen_helper_cvtdq2pd_xmm;
+        break;
+    case 0xf2:
+        fn = s->vex_l ? gen_helper_cvtpd2dq_ymm : gen_helper_cvtpd2dq_xmm;
+        break;
+    }
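+    /* The p_66_f3_f2 decode flag limits the prefix to the three cases
+     * above, so fn cannot be NULL here. */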
+    fn(cpu_env, s->ptr0, s->ptr2);
+}
+
 static void gen_VCVTps_dq(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     SSEFunc_0_epp fn = NULL;
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 45287dfea2..d15e988891 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4668,8 +4668,8 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 #endif
         if (use_new &&
             ((b >= 0x150 && b <= 0x17f) ||
-             (b >= 0x1d8 && b <= 0x1ff && (b & 8)))) {
-            return disas_insn_new(s, cpu, b + 0x100);
+             (b >= 0x1d0 && b <= 0x1ff))) {
+            return disas_insn_new(s, cpu, b);
         }
         break;
     case 0xf3:
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 26/37] target/i386: reimplement 0x0f 0x3a, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (24 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 25/37] target/i386: reimplement 0x0f 0xd0-0xd7, 0xe0-0xe7, 0xf0-0xf7, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-12 15:33   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 27/37] target/i386: Use tcg gvec ops for pmovmskb Paolo Bonzini
                   ` (10 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

The more complicated operations here are insertions and extractions.
Otherwise, there are just more entries than usual because the PS/PD/SS/SD
variations are encoded in the opcode rather than in the prefixes (for
example, ROUNDPS/ROUNDPD/ROUNDSS/ROUNDSD occupy the four consecutive
opcodes 0x08-0x0b).

These three-byte opcodes also include AVX new instructions, whose
implementation in the helpers was originally done by Paul Brook
<paul@nowt.org>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/ops_sse.h            |  95 +++++++++
 target/i386/ops_sse_header.h     |  10 +
 target/i386/tcg/decode-new.c.inc |  75 +++++++
 target/i386/tcg/emit.c.inc       | 323 +++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |   3 +-
 5 files changed, 505 insertions(+), 1 deletion(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 4f72164c0f..7eba1cf0f1 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2381,6 +2381,101 @@ void glue(helper_aeskeygenassist, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
 #endif
 #endif
 
+#if SHIFT >= 1
+void glue(helper_vpermilpd_imm, SUFFIX)(Reg *d, Reg *s, uint32_t order)
+{
+    uint64_t r0, r1;
+    int i;
+
+    for (i = 0; i < 1 << SHIFT; i += 2) {
+        r0 = s->Q(i + ((order >> 0) & 1));
+        r1 = s->Q(i + ((order >> 1) & 1));
+        d->Q(i) = r0;
+        d->Q(i+1) = r1;
+
+        order >>= 2;
+    }
+}
+
+void glue(helper_vpermilps_imm, SUFFIX)(Reg *d, Reg *s, uint32_t order)
+{
+    uint32_t r0, r1, r2, r3;
+    int i;
+
+    for (i = 0; i < 2 << SHIFT; i += 4) {
+        r0 = s->L(i + ((order >> 0) & 3));
+        r1 = s->L(i + ((order >> 2) & 3));
+        r2 = s->L(i + ((order >> 4) & 3));
+        r3 = s->L(i + ((order >> 6) & 3));
+        d->L(i) = r0;
+        d->L(i+1) = r1;
+        d->L(i+2) = r2;
+        d->L(i+3) = r3;
+    }
+}
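
For review purposes, a scalar model of how the order byte drives the dword
permute above (a standalone sketch with made-up values, not patch code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t s[4] = { 10, 11, 12, 13 }, d[4];
        uint32_t order = 0x1b;    /* 0b00'01'10'11: pick lanes 3, 2, 1, 0 */
        for (int k = 0; k < 4; k++) {
            d[k] = s[(order >> (2 * k)) & 3];
        }
        /* prints "13 12 11 10": the 128-bit lane is reversed */
        printf("%u %u %u %u\n", d[0], d[1], d[2], d[3]);
        return 0;
    }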
+
+#if SHIFT >= 2
+void helper_vpermdq_ymm(Reg *d, Reg *v, Reg *s, uint32_t order)
+{
+    uint64_t r0, r1, r2, r3;
+
+    switch (order & 3) {
+    case 0:
+        r0 = v->Q(0);
+        r1 = v->Q(1);
+        break;
+    case 1:
+        r0 = v->Q(2);
+        r1 = v->Q(3);
+        break;
+    case 2:
+        r0 = s->Q(0);
+        r1 = s->Q(1);
+        break;
+    case 3:
+        r0 = s->Q(2);
+        r1 = s->Q(3);
+        break;
+    }
+    switch ((order >> 4) & 3) {
+    case 0:
+        r2 = v->Q(0);
+        r3 = v->Q(1);
+        break;
+    case 1:
+        r2 = v->Q(2);
+        r3 = v->Q(3);
+        break;
+    case 2:
+        r2 = s->Q(0);
+        r3 = s->Q(1);
+        break;
+    case 3:
+        r2 = s->Q(2);
+        r3 = s->Q(3);
+        break;
+    }
+    d->Q(0) = r0;
+    d->Q(1) = r1;
+    d->Q(2) = r2;
+    d->Q(3) = r3;
+}
+
+void helper_vpermq_ymm(Reg *d, Reg *s, uint32_t order)
+{
+    uint64_t r0, r1, r2, r3;
+    r0 = s->Q(order & 3);
+    r1 = s->Q((order >> 2) & 3);
+    r2 = s->Q((order >> 4) & 3);
+    r3 = s->Q((order >> 6) & 3);
+    d->Q(0) = r0;
+    d->Q(1) = r1;
+    d->Q(2) = r2;
+    d->Q(3) = r3;
+}
+#endif
+#endif
+
 #undef SSE_HELPER_S
 
 #undef LANE_WIDTH
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 4041816945..6b70d90734 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -411,6 +411,16 @@ DEF_HELPER_4(glue(aeskeygenassist, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_5(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, Reg, i32)
 #endif
 
+/* AVX helpers */
+#if SHIFT >= 1
+DEF_HELPER_3(glue(vpermilpd_imm, SUFFIX), void, Reg, Reg, i32)
+DEF_HELPER_3(glue(vpermilps_imm, SUFFIX), void, Reg, Reg, i32)
+#if SHIFT == 2
+DEF_HELPER_4(vpermdq_ymm, void, Reg, Reg, Reg, i32)
+DEF_HELPER_3(vpermq_ymm, void, Reg, Reg, i32)
+#endif
+#endif
+
 #undef SHIFT
 #undef Reg
 #undef SUFFIX
diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index e9a9981a7f..e7b406ff80 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -328,7 +328,78 @@ static void decode_0F38(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
     }
 }
 
+static void decode_VINSERTPS(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry
+        vinsertps_reg = X86_OP_ENTRY4(VINSERTPS_r, V,dq, H,dq, U,dq, vex5 cpuid(SSE41) p_66),
+        vinsertps_mem = X86_OP_ENTRY4(VINSERTPS_m, V,dq, H,dq, M,d,  vex5 cpuid(SSE41) p_66);
+
+    int modrm = get_modrm(s, env);
+    *entry = (modrm >> 6) == 3 ? vinsertps_reg : vinsertps_mem;
+}
+
 static const X86OpEntry opcodes_0F3A[256] = {
+    /*
+     * These are VEX-only, but incorrectly listed in the manual as exception type 4.
+     * Also the "qq" instructions are sometimes omitted from Table 2-17, but are VEX256
+     * only.
+     */
+    [0x00] = X86_OP_ENTRY3(VPERMQ,      V,qq, W,qq, I,b,  vex6 cpuid(AVX2) p_66),
+    [0x01] = X86_OP_ENTRY3(VPERMQ,      V,qq, W,qq, I,b,  vex6 cpuid(AVX2) p_66), /* VPERMPD */
+    [0x02] = X86_OP_ENTRY4(VBLENDPS,    V,x,  H,x,  W,x,  vex6 cpuid(AVX2) p_66), /* VPBLENDD */
+    [0x04] = X86_OP_ENTRY3(VPERMILPS_i, V,x,  W,x,  I,b,  vex6 cpuid(AVX) p_66),
+    [0x05] = X86_OP_ENTRY3(VPERMILPD_i, V,x,  W,x,  I,b,  vex6 cpuid(AVX) p_66),
+    [0x06] = X86_OP_ENTRY4(VPERM2x128,  V,qq, H,qq, W,qq, vex6 cpuid(AVX) p_66),
+
+    [0x14] = X86_OP_ENTRY3(PEXTRB,     E,b,  V,dq, I,b,  vex5 cpuid(SSE41) zext0 p_66),
+    [0x15] = X86_OP_ENTRY3(PEXTRW,     E,w,  V,dq, I,b,  vex5 cpuid(SSE41) zext0 p_66),
+    [0x16] = X86_OP_ENTRY3(PEXTR,      E,y,  V,dq, I,b,  vex5 cpuid(SSE41) p_66),
+    [0x17] = X86_OP_ENTRY3(VEXTRACTPS, E,d,  V,dq, I,b,  vex5 cpuid(SSE41) p_66),
+
+    [0x20] = X86_OP_ENTRY4(PINSRB,     V,dq, H,dq, E,b,  vex5 cpuid(SSE41) zext2 p_66),
+    [0x21] = X86_OP_GROUP0(VINSERTPS),
+    [0x22] = X86_OP_ENTRY4(PINSR,      V,dq, H,dq, E,y,  vex5 cpuid(SSE41) p_66),
+
+    [0x40] = X86_OP_ENTRY4(VDDPS,      V,x,  H,x,  W,x,  vex2 cpuid(SSE41) p_66),
+    [0x41] = X86_OP_ENTRY4(VDDPD,      V,dq, H,dq, W,dq, vex2 cpuid(SSE41) p_66),
+    [0x42] = X86_OP_ENTRY4(VMPSADBW,   V,x,  H,x,  W,x,  vex2 cpuid(SSE41) avx2_256 p_66),
+    [0x44] = X86_OP_ENTRY4(PCLMULQDQ,  V,dq, H,dq, W,dq, vex4 cpuid(PCLMULQDQ) p_66),
+    [0x46] = X86_OP_ENTRY4(VPERM2x128, V,qq, H,qq, W,qq, vex6 cpuid(AVX2) p_66),
+
+    [0x60] = X86_OP_ENTRY4(PCMPESTRM,  None,None, V,dq, W,dq, vex4_unal cpuid(SSE42) p_66),
+    [0x61] = X86_OP_ENTRY4(PCMPESTRI,  None,None, V,dq, W,dq, vex4_unal cpuid(SSE42) p_66),
+    [0x62] = X86_OP_ENTRY4(PCMPISTRM,  None,None, V,dq, W,dq, vex4_unal cpuid(SSE42) p_66),
+    [0x63] = X86_OP_ENTRY4(PCMPISTRI,  None,None, V,dq, W,dq, vex4_unal cpuid(SSE42) p_66),
+
+    [0x08] = X86_OP_ENTRY3(VROUNDPS,   V,x,  W,x,  I,b,  vex2 cpuid(SSE41) p_66),
+    [0x09] = X86_OP_ENTRY3(VROUNDPD,   V,x,  W,x,  I,b,  vex2 cpuid(SSE41) p_66),
+    /*
+     * Not listed as four operand in the manual.  Also writes and reads 128-bits
+     * from the first two operands due to the V operand picking higher entries of
+     * the H operand; the "Vss,Hss,Wss" description from the manual is incorrect.
+     * For other unary operations such as VSQRTSx this is hidden by the "REPScalar"
+     * value of vex_special, because the table lists the operand types of VSQRTPx.
+     */
+    [0x0a] = X86_OP_ENTRY4(VROUNDSS,   V,x,  H,x, W,ss, vex3 cpuid(SSE41) p_66),
+    [0x0b] = X86_OP_ENTRY4(VROUNDSD,   V,x,  H,x, W,sd, vex3 cpuid(SSE41) p_66),
+    [0x0c] = X86_OP_ENTRY4(VBLENDPS,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
+    [0x0d] = X86_OP_ENTRY4(VBLENDPD,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
+    [0x0e] = X86_OP_ENTRY4(VPBLENDW,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
+    [0x0f] = X86_OP_ENTRY4(PALIGNR,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx p_00_66),
+
+    [0x18] = X86_OP_ENTRY4(VINSERTx128,  V,qq, H,qq, W,qq, vex6 cpuid(AVX) p_66),
+    [0x19] = X86_OP_ENTRY3(VEXTRACTx128, W,dq, V,qq, I,b,  vex6 cpuid(AVX) p_66),
+
+    [0x38] = X86_OP_ENTRY4(VINSERTx128,  V,qq, H,qq, W,qq, vex6 cpuid(AVX2) p_66),
+    [0x39] = X86_OP_ENTRY3(VEXTRACTx128, W,dq, V,qq, I,b,  vex6 cpuid(AVX2) p_66),
+
+    /* Listed incorrectly as type 4 */
+    [0x4a] = X86_OP_ENTRY4(VBLENDVPS, V,x,  H,x,  W,x,   vex6 cpuid(AVX) p_66),
+    [0x4b] = X86_OP_ENTRY4(VBLENDVPD, V,x,  H,x,  W,x,   vex6 cpuid(AVX) p_66),
+    [0x4c] = X86_OP_ENTRY4(VPBLENDVB, V,x,  H,x,  W,x,   vex6 cpuid(AVX) p_66 avx2_256),
+
+    [0xdf] = X86_OP_ENTRY3(VAESKEYGEN, V,dq, W,dq, I,b,  vex4 cpuid(AES) p_66),
+
     [0xF0] = X86_OP_ENTRY3(RORX, G,y, E,y, I,b, vex13 cpuid(BMI2) p_f2),
 };
 
@@ -839,6 +910,10 @@ static bool decode_insn(DisasContext *s, CPUX86State *env, X86DecodeFunc decode_
         }
     }
     if (e->op3 != X86_TYPE_None) {
+        /*
+         * A couple of instructions actually use the extra immediate byte for an Lx
+         * register operand; those are handled in the gen_* functions as one-offs.
+         */
         assert(e->op3 == X86_TYPE_I && e->s3 == X86_SIZE_b);
         s->rip_offset += 1;
     }
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index c90f933093..dbf2c05e16 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -405,6 +405,56 @@ HORIZONTAL_FP_SSE(VHADD, hadd)
 HORIZONTAL_FP_SSE(VHSUB, hsub)
 HORIZONTAL_FP_SSE(VADDSUB, addsub)
 
+static inline void gen_ternary_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                   int op3, SSEFunc_0_epppp xmm, SSEFunc_0_epppp ymm)
+{
+    SSEFunc_0_epppp fn = s->vex_l ? ymm : xmm;
+    TCGv_ptr ptr3 = tcg_temp_new_ptr();
+
+    /* The format of the fourth input is Lx */
+    tcg_gen_addi_ptr(ptr3, cpu_env, ZMM_OFFSET(op3));
+    fn(cpu_env, s->ptr0, s->ptr1, s->ptr2, ptr3);
+    tcg_temp_free_ptr(ptr3);
+}
+#define TERNARY_SSE(uvname, lname)                                                 \
+static void gen_##uvname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_ternary_sse(s, env, decode, (uint8_t)decode->immediate >> 4,               \
+                    gen_helper_##lname##_xmm, gen_helper_##lname##_ymm);           \
+}
+TERNARY_SSE(VBLENDVPS, blendvps)
+TERNARY_SSE(VBLENDVPD, blendvpd)
+TERNARY_SSE(VPBLENDVB, pblendvb)
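
The fourth operand lives in the immediate byte (the VEX "is4" encoding),
which is why bits 7:4 are extracted above.  A tiny sketch (sample value,
not patch code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t is4 = 0xc0;                           /* sample imm8 */
        printf("fourth register: xmm%d\n", is4 >> 4); /* xmm12 */
        return 0;
    }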
+
+static inline void gen_binary_imm_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                      SSEFunc_0_epppi xmm, SSEFunc_0_epppi ymm)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    if (!s->vex_l) {
+        xmm(cpu_env, s->ptr0, s->ptr1, s->ptr2, imm);
+    } else {
+        ymm(cpu_env, s->ptr0, s->ptr1, s->ptr2, imm);
+    }
+    tcg_temp_free_i32(imm);
+}
+
+#define BINARY_IMM_SSE(uname, lname)                                               \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_binary_imm_sse(s, env, decode,                                             \
+                       gen_helper_##lname##_xmm,                                   \
+                       gen_helper_##lname##_ymm);                                  \
+}
+
+BINARY_IMM_SSE(VBLENDPD,   blendpd)
+BINARY_IMM_SSE(VBLENDPS,   blendps)
+BINARY_IMM_SSE(VPBLENDW,   pblendw)
+BINARY_IMM_SSE(VDDPS,      dpps)
+#define gen_helper_dppd_ymm NULL
+BINARY_IMM_SSE(VDDPD,      dppd)
+BINARY_IMM_SSE(VMPSADBW,   mpsadbw)
+BINARY_IMM_SSE(PCLMULQDQ,  pclmulqdq)
+
 /*
  * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
  * 66 = vp* Vx, Hx, Wx
@@ -517,6 +567,33 @@ static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
 UNARY_IMM_SSE(PSHUFD,     pshufd)
 UNARY_IMM_SSE(PSHUFHW,    pshufhw)
 UNARY_IMM_SSE(PSHUFLW,    pshuflw)
+#define gen_helper_vpermq_xmm NULL
+UNARY_IMM_SSE(VPERMQ,      vpermq)
+UNARY_IMM_SSE(VPERMILPS_i, vpermilps_imm)
+UNARY_IMM_SSE(VPERMILPD_i, vpermilpd_imm)
+
+static inline void gen_unary_imm_fp_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                        SSEFunc_0_eppi xmm, SSEFunc_0_eppi ymm)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    if (!s->vex_l) {
+        xmm(cpu_env, s->ptr0, s->ptr1, imm);
+    } else {
+        ymm(cpu_env, s->ptr0, s->ptr1, imm);
+    }
+    tcg_temp_free_i32(imm);
+}
+
+#define UNARY_IMM_FP_SSE(uname, lname)                                             \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_unary_imm_fp_sse(s, env, decode,                                           \
+                      gen_helper_##lname##_xmm,                                    \
+                      gen_helper_##lname##_ymm);                                   \
+}
+
+UNARY_IMM_FP_SSE(VROUNDPS,    roundps)
+UNARY_IMM_FP_SSE(VROUNDPD,    roundpd)
 
 static void gen_ADCOX(DisasContext *s, CPUX86State *env, MemOp ot, int cc_op)
 {
@@ -874,6 +951,19 @@ static void gen_PADDQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
                      decode->op[2].offset, vec_len, vec_len);
 }
 
+static void gen_PALIGNR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    if (!(s->prefix & PREFIX_DATA)) {
+        gen_helper_palignr_mmx(cpu_env, s->ptr0, s->ptr1, s->ptr2, imm);
+    } else if (!s->vex_l) {
+        gen_helper_palignr_xmm(cpu_env, s->ptr0, s->ptr1, s->ptr2, imm);
+    } else {
+        gen_helper_palignr_ymm(cpu_env, s->ptr0, s->ptr1, s->ptr2, imm);
+    }
+    tcg_temp_free_i32(imm);
+}
+
 static void gen_PAND(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
@@ -919,6 +1009,46 @@ static void gen_PCMPEQD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
                      decode->op[2].offset, vec_len, vec_len);
 }
 
+static void gen_PCMPESTRI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    gen_helper_pcmpestri_xmm(cpu_env, s->ptr1, s->ptr2, imm);
+    tcg_temp_free_i32(imm);
+    set_cc_op(s, CC_OP_EFLAGS);
+}
+
+static void gen_PCMPESTRM(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    gen_helper_pcmpestrm_xmm(cpu_env, s->ptr1, s->ptr2, imm);
+    tcg_temp_free_i32(imm);
+    set_cc_op(s, CC_OP_EFLAGS);
+    if ((s->prefix & PREFIX_VEX) && !s->vex_l) {
+        tcg_gen_gvec_dup_imm(MO_64, offsetof(CPUX86State, xmm_regs[0].ZMM_X(1)),
+                             16, 16, 0);
+    }
+}
+
+static void gen_PCMPISTRI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    gen_helper_pcmpistri_xmm(cpu_env, s->ptr1, s->ptr2, imm);
+    tcg_temp_free_i32(imm);
+    set_cc_op(s, CC_OP_EFLAGS);
+}
+
+static void gen_PCMPISTRM(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    gen_helper_pcmpistrm_xmm(cpu_env, s->ptr1, s->ptr2, imm);
+    tcg_temp_free_i32(imm);
+    set_cc_op(s, CC_OP_EFLAGS);
+    if ((s->prefix & PREFIX_VEX) && !s->vex_l) {
+        tcg_gen_gvec_dup_imm(MO_64, offsetof(CPUX86State, xmm_regs[0].ZMM_X(1)),
+                             16, 16, 0);
+    }
+}
+
 static void gen_PCMPGTB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
 
     tcg_gen_gvec_cmp(TCG_COND_GT, MO_8,
                      decode->op[0].offset, decode->op[1].offset,
                      decode->op[2].offset, vec_len, vec_len);
 }
 
+static inline void gen_pextr(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode, MemOp ot)
+{
+    int vec_len = sse_vec_len(s, decode);
+    int mask = (vec_len >> ot) - 1;
+    int val = decode->immediate & mask;
+
+    switch (ot) {
+    case MO_8:
+        tcg_gen_ld8u_tl(s->T0, s->ptr1, offsetof(ZMMReg, ZMM_B(val)));
+        break;
+    case MO_16:
+        tcg_gen_ld16u_tl(s->T0, s->ptr1, offsetof(ZMMReg, ZMM_W(val)));
+        break;
+    case MO_32:
+        tcg_gen_ld_i32(s->tmp2_i32, s->ptr1, offsetof(ZMMReg, ZMM_L(val)));
+        tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
+        break;
+#ifdef TARGET_X86_64
+    case MO_64:
+        tcg_gen_ld_tl(s->T0, s->ptr1, offsetof(ZMMReg, ZMM_Q(val)));
+        break;
+#endif
+    default:
+        abort();
+    }
+}
+
+static void gen_PEXTRB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_pextr(s, env, decode, MO_8);
+}
+
+static void gen_PEXTRW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_pextr(s, env, decode, MO_16);
+}
+
+static void gen_PEXTR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    MemOp ot = decode->op[0].ot;
+    gen_pextr(s, env, decode, ot);
+}
+
+static inline void gen_pinsr(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode, MemOp ot)
+{
+    int vec_len = sse_vec_len(s, decode);
+    int mask = (vec_len >> ot) - 1;
+    int val = decode->immediate & mask;
+
+    if (decode->op[1].offset != decode->op[0].offset) {
+        assert(vec_len == 16);
+        gen_store_sse(s, env, decode, decode->op[1].offset);
+    }
+
+    switch (ot) {
+    case MO_8:
+        tcg_gen_st8_tl(s->T1, s->ptr0, offsetof(ZMMReg, ZMM_B(val)));
+        break;
+    case MO_16:
+        tcg_gen_st16_tl(s->T1, s->ptr0, offsetof(ZMMReg, ZMM_W(val)));
+        break;
+    case MO_32:
+        tcg_gen_trunc_tl_i32(s->tmp2_i32, s->T1);
+        tcg_gen_st_i32(s->tmp2_i32, s->ptr0, offsetof(ZMMReg, ZMM_L(val)));
+        break;
+#ifdef TARGET_X86_64
+    case MO_64:
+        tcg_gen_st_i64(s->T1, s->ptr0, offsetof(ZMMReg, ZMM_Q(val)));
+        break;
+#endif
+    default:
+        abort();
+    }
+}
+
+static void gen_PINSRB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_pinsr(s, env, decode, MO_8);
+}
+
+static void gen_PINSR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_pinsr(s, env, decode, decode->op[2].ot);
+}
+
 static void gen_PMOVMSKB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     if (s->prefix & PREFIX_DATA) {
@@ -1259,6 +1474,14 @@ static void gen_SSE4a_R(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
     }
 }
 
+static inline void gen_VAESKEYGEN(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    assert(!s->vex_l);
+    gen_helper_aeskeygenassist_xmm(cpu_env, s->ptr0, s->ptr1, imm);
+    tcg_temp_free_i32(imm);
+}
+
 #define gen_VAND   gen_PAND
 #define gen_VANDN  gen_PANDN
 
@@ -1304,5 +1527,105 @@ static void gen_VCVTps_dq(DisasContext *s, CPUX86State *env, X86DecodedInsn *dec
     fn(cpu_env, s->ptr0, s->ptr2);
 }
 
+static void gen_VEXTRACTx128(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int mask = decode->immediate & 1;
+    int src_ofs = decode->op[1].offset + offsetof(YMMReg, YMM_X(mask));
+    if (decode->op[0].has_ea) {
+        gen_sto_env_A0(s, src_ofs);
+    } else {
+        tcg_gen_gvec_mov(MO_64, decode->op[0].offset + offsetof(YMMReg, YMM_X(0)), src_ofs, 16, 16);
+    }
+}
+
+static void gen_VEXTRACTPS(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_pextr(s, env, decode, MO_32);
+}
+
+static void gen_vinsertps(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 zero = tcg_const_i32(0); /* float32_zero */
+    int val = decode->immediate;
+    int dest_word = (val >> 4) & 3;
+    int new_mask = (val & 15) | (1 << dest_word);
+    int vec_len = 16;
+
+    assert(!s->vex_l);
+
+    if (new_mask == 15) {
+        /* All zeroes plus possibly from the inserted element */
+        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    } else if (decode->op[1].offset != decode->op[0].offset) {
+        gen_store_sse(s, env, decode, decode->op[1].offset);
+    }
+
+    if (new_mask != (val & 15)) {
+        tcg_gen_st_i32(s->tmp2_i32, s->ptr0, offsetof(ZMMReg, ZMM_L(dest_word)));
+    }
+
+    if (new_mask != 15) {
+        if ((val >> 0) & 1) {
+            tcg_gen_st_i32(zero, s->ptr0, offsetof(ZMMReg, ZMM_L(0)));
+        }
+        if ((val >> 1) & 1) {
+            tcg_gen_st_i32(zero, s->ptr0, offsetof(ZMMReg, ZMM_L(1)));
+        }
+        if ((val >> 2) & 1) {
+            tcg_gen_st_i32(zero, s->ptr0, offsetof(ZMMReg, ZMM_L(2)));
+        }
+        if ((val >> 3) & 1) {
+            tcg_gen_st_i32(zero, s->ptr0, offsetof(ZMMReg, ZMM_L(3)));
+        }
+    }
+
+    tcg_temp_free_i32(zero);
+}
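
For reference, the INSERTPS imm8 layout decoded above: bits 7:6 select the
source dword (register form only), bits 5:4 the destination dword, and bits
3:0 the zero mask.  A scalar model (standalone sketch, made-up values):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        float dst[4] = { 1, 2, 3, 4 }, src[4] = { 5, 6, 7, 8 };
        uint8_t imm = 0xb5;       /* count_s=2, count_d=3, zmask=0101 */

        dst[(imm >> 4) & 3] = src[(imm >> 6) & 3];
        for (int i = 0; i < 4; i++) {
            if (imm & (1 << i)) {
                dst[i] = 0.0f;    /* zmask zeroes the selected dwords */
            }
        }
        /* prints "0 2 0 7" */
        printf("%g %g %g %g\n", dst[0], dst[1], dst[2], dst[3]);
        return 0;
    }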
+
+static void gen_VINSERTPS_r(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int val = decode->immediate;
+    tcg_gen_ld_i32(s->tmp2_i32, s->ptr2, offsetof(ZMMReg, ZMM_L((val >> 6) & 3)));
+    gen_vinsertps(s, env, decode);
+}
+
+static void gen_VINSERTPS_m(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    tcg_gen_qemu_ld_i32(s->tmp2_i32, s->A0, s->mem_index, MO_LEUL);
+    gen_vinsertps(s, env, decode);
+}
+
+static void gen_VINSERTx128(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int mask = decode->immediate & 1;
+    tcg_gen_gvec_mov(MO_64,
+                     decode->op[0].offset + offsetof(YMMReg, YMM_X(mask)),
+                     decode->op[2].offset + offsetof(YMMReg, YMM_X(0)), 16, 16);
+    tcg_gen_gvec_mov(MO_64,
+                     decode->op[0].offset + offsetof(YMMReg, YMM_X(!mask)),
+                     decode->op[1].offset + offsetof(YMMReg, YMM_X(!mask)), 16, 16);
+}
+
 #define gen_VOR   gen_POR
+
+static inline void gen_VPERM2x128(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    assert(s->vex_l);
+    gen_helper_vpermdq_ymm(s->ptr0, s->ptr1, s->ptr2, imm);
+    tcg_temp_free_i32(imm);
+}
+
+static inline void gen_VROUNDSD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    assert(!s->vex_l);
+    gen_helper_roundsd_xmm(cpu_env, s->ptr0, s->ptr1, s->ptr2, imm);
+    tcg_temp_free_i32(imm);
+}
+
+static inline void gen_VROUNDSS(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    assert(!s->vex_l);
+    gen_helper_roundss_xmm(cpu_env, s->ptr0, s->ptr1, s->ptr2, imm);
+    tcg_temp_free_i32(imm);
+}
+
 #define gen_VXOR  gen_PXOR
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index d15e988891..556087b1e9 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4667,7 +4667,8 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
         use_new &= b <= limit;
 #endif
         if (use_new &&
-            ((b >= 0x150 && b <= 0x17f) ||
+            (b == 0x13a ||
+             (b >= 0x150 && b <= 0x17f) ||
              (b >= 0x1d0 && b <= 0x1ff))) {
             return disas_insn_new(s, cpu, b);
         }
-- 
2.37.2





* [PATCH 27/37] target/i386: Use tcg gvec ops for pmovmskb
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (25 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 26/37] target/i386: reimplement 0x0f 0x3a, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13  8:16   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 28/37] target/i386: reimplement 0x0f 0x38, add AVX Paolo Bonzini
                   ` (9 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

From: Richard Henderson <richard.henderson@linaro.org>

As pmovmskb is used by strlen et al, this is the third highest
overhead SSE operation, at 0.8%.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
[Reorganize to generate code for any vector size. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/emit.c.inc | 65 +++++++++++++++++++++++++++++++++++---
 1 file changed, 60 insertions(+), 5 deletions(-)

diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index dbf2c05e16..52c0a7fbe0 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -1179,14 +1179,69 @@ static void gen_PINSR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     gen_pinsr(s, env, decode, decode->op[2].ot);
 }
 
+static void gen_pmovmskb_i64(TCGv_i64 d, TCGv_i64 s)
+{
+    TCGv_i64 t = tcg_temp_new_i64();
+
+    tcg_gen_andi_i64(d, s, 0x8080808080808080ull);
+
+    /*
+     * After each shift+or pair:
+     * 0:  a.......b.......c.......d.......e.......f.......g.......h.......
+     * 7:  ab......bc......cd......de......ef......fg......gh......h.......
+     * 14: abcd....bcde....cdef....defg....efgh....fgh.....gh......h.......
+     * 28: abcdefghbcdefgh.cdefgh..defgh...efgh....fgh.....gh......h.......
+     * The result is left in the high bits of the word.
+     */
+    tcg_gen_shli_i64(t, d, 7);
+    tcg_gen_or_i64(d, d, t);
+    tcg_gen_shli_i64(t, d, 14);
+    tcg_gen_or_i64(d, d, t);
+    tcg_gen_shli_i64(t, d, 28);
+    tcg_gen_or_i64(d, d, t);
+}
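
The shift+or sequence can be checked against a naive loop with a quick
host-side program (a standalone sketch, not patch code):

    #include <stdint.h>
    #include <stdio.h>

    static unsigned ref(uint64_t s)
    {
        unsigned r = 0;
        for (int i = 0; i < 8; i++) {
            r |= ((s >> (i * 8 + 7)) & 1) << i;
        }
        return r;
    }

    static unsigned shiftor(uint64_t s)
    {
        uint64_t d = s & 0x8080808080808080ull;
        d |= d << 7;
        d |= d << 14;
        d |= d << 28;
        return d >> 56;        /* the mask bits end up in the top byte */
    }

    int main(void)
    {
        uint64_t x = 0x80ff7f0080002211ull;
        printf("ref=%02x shiftor=%02x\n", ref(x), shiftor(x)); /* both c8 */
        return 0;
    }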
+
+static void gen_pmovmskb_vec(unsigned vece, TCGv_vec d, TCGv_vec s)
+{
+    TCGv_vec t = tcg_temp_new_vec_matching(d);
+    TCGv_vec m = tcg_constant_vec_matching(d, MO_8, 0x80);
+
+    /* See above */
+    tcg_gen_and_vec(vece, d, s, m);
+    tcg_gen_shli_vec(vece, t, d, 7);
+    tcg_gen_or_vec(vece, d, d, t);
+    tcg_gen_shli_vec(vece, t, d, 14);
+    tcg_gen_or_vec(vece, d, d, t);
+    if (vece == MO_64) {
+        tcg_gen_shli_vec(vece, t, d, 28);
+        tcg_gen_or_vec(vece, d, d, t);
+    }
+}
+
 static void gen_PMOVMSKB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
-    if (s->prefix & PREFIX_DATA) {
-        gen_helper_pmovmskb_xmm(s->tmp2_i32, cpu_env, s->ptr2);
-    } else {
-        gen_helper_pmovmskb_mmx(s->tmp2_i32, cpu_env, s->ptr2);
+    static const TCGOpcode vecop_list[] = { INDEX_op_shli_vec, 0 };
+    static const GVecGen2 g = {
+        .fni8 = gen_pmovmskb_i64,
+        .fniv = gen_pmovmskb_vec,
+        .opt_opc = vecop_list,
+        .vece = MO_64,
+        .prefer_i64 = TCG_TARGET_REG_BITS == 64
+    };
+    MemOp ot = decode->op[0].ot;
+    int vec_len = sse_vec_len(s, decode);
+    TCGv t = tcg_temp_new();
+
+    tcg_gen_gvec_2(offsetof(CPUX86State, xmm_t0) + xmm_offset(ot), decode->op[2].offset,
+                   vec_len, vec_len, &g);
+    tcg_gen_ld8u_tl(s->T0, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
+    while (vec_len > 8) {
+        vec_len -= 8;
+        tcg_gen_shli_tl(s->T0, s->T0, 8);
+        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
+        tcg_gen_or_tl(s->T0, s->T0, t);
     }
-    tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
+    tcg_temp_free(t);
 }
 
 static void gen_POR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
-- 
2.37.2





* [PATCH 28/37] target/i386: reimplement 0x0f 0x38, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (26 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 27/37] target/i386: Use tcg gvec ops for pmovmskb Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13  9:31   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 29/37] target/i386: reimplement 0x0f 0xc2, 0xc4-0xc6, " Paolo Bonzini
                   ` (8 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

There are several special cases here:

1) extending moves have different widths for the helpers vs. for the
memory loads, and the width for memory loads depends on VEX.L too.
This is represented by X86_SPECIAL_AVXExtMov.

2) some instructions, such as variable-width shifts, select the vector element
size via REX.W.

3) VSIB instructions (VGATHERxPy, VPGATHERxy) are also part of this group,
and they have (among other things) two output operands; see the
address-computation sketch at the end of this message.

4) the macros for 4-operand blends (which are under 0x0f 0x3a) have to be
extended to support 2-operand blends.  The 2-operand variant actually
came a few years earlier, but it is clearer to implement them in the
opposite order.

5) some helpers accept a Reg* but have a M argument (i.e. a value of 11 in
the ModRM field causes an undefined opcode exception).  For these, a custom
X86_TYPE_WM value is added that does call gen_load(), unlike X86_TYPE_M.

These three-byte opcodes also include AVX new instructions, for which
the helpers were originally implemented by Paul Brook <paul@nowt.org>.
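
For item 3, the per-element address computation that helper_vpgatherdd and
friends implement is just base + (sign-extended index << scale); a
standalone sketch with made-up values:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t base = 0x1000;
        int32_t index[4] = { 0, 2, -1, 7 }; /* dwords of the VSIB vector */
        unsigned scale = 2;                 /* log2 of the scale factor */
        for (int i = 0; i < 4; i++) {
            uint64_t addr = base + ((uint64_t)(int64_t)index[i] << scale);
            printf("element %d -> 0x%" PRIx64 "\n", i, addr);
        }
        return 0;
    }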

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/ops_sse.h            | 185 +++++++++++++++++++-
 target/i386/ops_sse_header.h     |  19 ++
 target/i386/tcg/decode-new.c.inc | 115 +++++++++++-
 target/i386/tcg/decode-new.h     |   7 +
 target/i386/tcg/emit.c.inc       | 288 ++++++++++++++++++++++++++++++-
 target/i386/tcg/translate.c      |   2 +-
 6 files changed, 608 insertions(+), 8 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 7eba1cf0f1..fbbe82c6e7 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2382,6 +2382,36 @@ void glue(helper_aeskeygenassist, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
 #endif
 
 #if SHIFT >= 1
+void glue(helper_vpermilpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+    uint64_t r0, r1;
+    int i;
+
+    for (i = 0; i < 1 << SHIFT; i += 2) {
+        r0 = v->Q(i + ((s->Q(i) >> 1) & 1));
+        r1 = v->Q(i + ((s->Q(i+1) >> 1) & 1));
+        d->Q(i) = r0;
+        d->Q(i+1) = r1;
+    }
+}
+
+void glue(helper_vpermilps, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+    uint32_t r0, r1, r2, r3;
+    int i;
+
+    for (i = 0; i < 2 << SHIFT; i += 4) {
+        r0 = v->L(i + (s->L(i) & 3));
+        r1 = v->L(i + (s->L(i+1) & 3));
+        r2 = v->L(i + (s->L(i+2) & 3));
+        r3 = v->L(i + (s->L(i+3) & 3));
+        d->L(i) = r0;
+        d->L(i+1) = r1;
+        d->L(i+2) = r2;
+        d->L(i+3) = r3;
+    }
+}
+
 void glue(helper_vpermilpd_imm, SUFFIX)(Reg *d, Reg *s, uint32_t order)
 {
     uint64_t r0, r1;
@@ -2414,6 +2444,147 @@ void glue(helper_vpermilps_imm, SUFFIX)(Reg *d, Reg *s, uint32_t order)
     }
 }
 
+#if SHIFT == 1
+#define FPSRLVD(x, c) (c < 32 ? ((x) >> c) : 0)
+#define FPSRLVQ(x, c) (c < 64 ? ((x) >> c) : 0)
+#define FPSRAVD(x, c) ((int32_t)(x) >> (c < 32 ? c : 31))
+#define FPSRAVQ(x, c) ((int64_t)(x) >> (c < 64 ? c : 63))
+#define FPSLLVD(x, c) (c < 32 ? ((x) << c) : 0)
+#define FPSLLVQ(x, c) (c < 64 ? ((x) << c) : 0)
+#endif
+
+SSE_HELPER_L(helper_vpsrlvd, FPSRLVD)
+SSE_HELPER_L(helper_vpsravd, FPSRAVD)
+SSE_HELPER_L(helper_vpsllvd, FPSLLVD)
+
+SSE_HELPER_Q(helper_vpsrlvq, FPSRLVQ)
+SSE_HELPER_Q(helper_vpsravq, FPSRAVQ)
+SSE_HELPER_Q(helper_vpsllvq, FPSLLVQ)
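
The counts are per element, and out-of-range counts yield 0 for logical
shifts while arithmetic shifts clamp the count to width - 1.  A scalar
check of the FPSRLVD/FPSRAVD definitions (standalone sketch):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t x = 0x80000010u;
        uint32_t c = 40;                                 /* out of range */
        uint32_t srlv = c < 32 ? x >> c : 0;             /* FPSRLVD */
        int32_t srav = (int32_t)x >> (c < 32 ? c : 31);  /* FPSRAVD */
        /* prints "srlv=00000000 srav=ffffffff" */
        printf("srlv=%08x srav=%08x\n", srlv, (uint32_t)srav);
        return 0;
    }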
+
+void glue(helper_vtestps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+    uint64_t zf = 0, cf = 0;
+    int i;
+
+    for (i = 0; i < 2 << SHIFT; i++) {
+        zf |= (s->L(i) &  d->L(i));
+        cf |= (s->L(i) & ~d->L(i));
+    }
+    CC_SRC = ((zf >> 31) ? 0 : CC_Z) | ((cf >> 31) ? 0 : CC_C);
+}
+
+void glue(helper_vtestpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+    uint64_t zf = 0, cf = 0;
+    int i;
+
+    for (i = 0; i < 1 << SHIFT; i++) {
+        zf |= (s->Q(i) &  d->Q(i));
+        cf |= (s->Q(i) & ~d->Q(i));
+    }
+    CC_SRC = ((zf >> 63) ? 0 : CC_Z) | ((cf >> 63) ? 0 : CC_C);
+}
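
In other words, ZF reports that no sign bit is set in (s & d) and CF that
none is set in (s & ~d).  A scalar sketch for the packed-single case
(made-up values, not patch code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t d[4] = { 0x80000000, 0x00000000, 0, 0 };
        uint32_t s[4] = { 0x80000000, 0x7fffffff, 0, 0 };
        uint32_t zf = 0, cf = 0;
        for (int i = 0; i < 4; i++) {
            zf |= s[i] & d[i];
            cf |= s[i] & ~d[i];
        }
        /* prints "ZF=0 CF=1" */
        printf("ZF=%d CF=%d\n", !(zf >> 31), !(cf >> 31));
        return 0;
    }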
+
+void glue(helper_vpmaskmovd_st, SUFFIX)(CPUX86State *env,
+                                        Reg *v, Reg *s, target_ulong a0)
+{
+    int i;
+
+    for (i = 0; i < (2 << SHIFT); i++) {
+        if (v->L(i) >> 31) {
+            cpu_stl_data_ra(env, a0 + i * 4, s->L(i), GETPC());
+        }
+    }
+}
+
+void glue(helper_vpmaskmovq_st, SUFFIX)(CPUX86State *env,
+                                        Reg *v, Reg *s, target_ulong a0)
+{
+    int i;
+
+    for (i = 0; i < (1 << SHIFT); i++) {
+        if (v->Q(i) >> 63) {
+            cpu_stq_data_ra(env, a0 + i * 8, s->Q(i), GETPC());
+        }
+    }
+}
+
+void glue(helper_vpmaskmovd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+    int i;
+
+    for (i = 0; i < (2 << SHIFT); i++) {
+        d->L(i) = (v->L(i) >> 31) ? s->L(i) : 0;
+    }
+}
+
+void glue(helper_vpmaskmovq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+    int i;
+
+    for (i = 0; i < (1 << SHIFT); i++) {
+        d->Q(i) = (v->Q(i) >> 63) ? s->Q(i) : 0;
+    }
+}
+
+void glue(helper_vpgatherdd, SUFFIX)(CPUX86State *env,
+        Reg *d, Reg *v, Reg *s, target_ulong a0, unsigned scale)
+{
+    int i;
+    for (i = 0; i < (2 << SHIFT); i++) {
+        if (v->L(i) >> 31) {
+            target_ulong addr = a0
+                + ((target_ulong)(int32_t)s->L(i) << scale);
+            d->L(i) = cpu_ldl_data_ra(env, addr, GETPC());
+        }
+        v->L(i) = 0;
+    }
+}
+void glue(helper_vpgatherdq, SUFFIX)(CPUX86State *env,
+        Reg *d, Reg *v, Reg *s, target_ulong a0, unsigned scale)
+{
+    int i;
+    for (i = 0; i < (1 << SHIFT); i++) {
+        if (v->Q(i) >> 63) {
+            target_ulong addr = a0
+                + ((target_ulong)(int32_t)s->L(i) << scale);
+            d->Q(i) = cpu_ldq_data_ra(env, addr, GETPC());
+        }
+        v->Q(i) = 0;
+    }
+}
+void glue(helper_vpgatherqd, SUFFIX)(CPUX86State *env,
+        Reg *d, Reg *v, Reg *s, target_ulong a0, unsigned scale)
+{
+    int i;
+    for (i = 0; i < (1 << SHIFT); i++) {
+        if (v->L(i) >> 31) {
+            target_ulong addr = a0
+                + ((target_ulong)(int64_t)s->Q(i) << scale);
+            d->L(i) = cpu_ldl_data_ra(env, addr, GETPC());
+        }
+        v->L(i) = 0;
+    }
+    for (i /= 2; i < 1 << SHIFT; i++) {
+        d->Q(i) = 0;
+        v->Q(i) = 0;
+    }
+}
+void glue(helper_vpgatherqq, SUFFIX)(CPUX86State *env,
+        Reg *d, Reg *v, Reg *s, target_ulong a0, unsigned scale)
+{
+    int i;
+    for (i = 0; i < (1 << SHIFT); i++) {
+        if (v->Q(i) >> 63) {
+            target_ulong addr = a0
+                + ((target_ulong)(int64_t)s->Q(i) << scale);
+            d->Q(i) = cpu_ldq_data_ra(env, addr, GETPC());
+        }
+        v->Q(i) = 0;
+    }
+}
+#endif
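
A subtlety worth spelling out about the gather helpers (my reading of the
SDM, not something stated in the patch): the mask vector is itself an
output, cleared element by element as the gather proceeds, which is what
makes a faulting gather restartable:

    /* sketch of the architectural contract, not patch code:
     *
     * for (i = 0; i < n; i++) {
     *     if (mask[i] has its sign bit set) {
     *         dst[i] = load(addr(i));    // may fault here...
     *     }
     *     mask[i] = 0;                   // ...with mask[0..i-1] already 0
     * }
     */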
+
 #if SHIFT >= 2
 void helper_vpermdq_ymm(Reg *d, Reg *v, Reg *s, uint32_t order)
 {
@@ -2473,7 +2644,19 @@ void helper_vpermq_ymm(Reg *d, Reg *s, uint32_t order)
     d->Q(2) = r2;
     d->Q(3) = r3;
 }
-#endif
+
+void helper_vpermd_ymm(Reg *d, Reg *v, Reg *s)
+{
+    uint32_t r[8];
+    int i;
+
+    for (i = 0; i < 8; i++) {
+        r[i] = s->L(v->L(i) & 7);
+    }
+    for (i = 0; i < 8; i++) {
+        d->L(i) = r[i];
+    }
+}
 #endif
 
 #undef SSE_HELPER_S
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 6b70d90734..e188cbd87d 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -413,9 +413,28 @@ DEF_HELPER_5(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, Reg, i32)
 
 /* AVX helpers */
 #if SHIFT >= 1
+DEF_HELPER_4(glue(vpermilpd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpermilps, SUFFIX), void, env, Reg, Reg, Reg)
 DEF_HELPER_3(glue(vpermilpd_imm, SUFFIX), void, Reg, Reg, i32)
 DEF_HELPER_3(glue(vpermilps_imm, SUFFIX), void, Reg, Reg, i32)
+DEF_HELPER_4(glue(vpsrlvd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsravd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsllvd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsrlvq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsravq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsllvq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_3(glue(vtestps, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_3(glue(vtestpd, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(vpmaskmovd_st, SUFFIX), void, env, Reg, Reg, tl)
+DEF_HELPER_4(glue(vpmaskmovq_st, SUFFIX), void, env, Reg, Reg, tl)
+DEF_HELPER_4(glue(vpmaskmovd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpmaskmovq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_6(glue(vpgatherdd, SUFFIX), void, env, Reg, Reg, Reg, tl, i32)
+DEF_HELPER_6(glue(vpgatherdq, SUFFIX), void, env, Reg, Reg, Reg, tl, i32)
+DEF_HELPER_6(glue(vpgatherqd, SUFFIX), void, env, Reg, Reg, Reg, tl, i32)
+DEF_HELPER_6(glue(vpgatherqq, SUFFIX), void, env, Reg, Reg, Reg, tl, i32)
 #if SHIFT == 2
+DEF_HELPER_3(vpermd_ymm, void, Reg, Reg, Reg)
 DEF_HELPER_4(vpermdq_ymm, void, Reg, Reg, Reg, i32)
 DEF_HELPER_3(vpermq_ymm, void, Reg, Reg, i32)
 #endif
diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index e7b406ff80..7feb0eca4e 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -90,6 +90,7 @@
 #define mmx .special = X86_SPECIAL_MMX,
 #define zext0 .special = X86_SPECIAL_ZExtOp0,
 #define zext2 .special = X86_SPECIAL_ZExtOp2,
+#define avx_movx .special = X86_SPECIAL_AVXExtMov,
 
 #define vex1 .vex_class = 1,
 #define vex1_rep3 .vex_class = 1, .vex_special = X86_VEX_REPScalar,
@@ -255,6 +256,105 @@ static void decode_0FD6(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
 }
 
 static const X86OpEntry opcodes_0F38_00toEF[240] = {
+    [0x00] = X86_OP_ENTRY3(PSHUFB,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x01] = X86_OP_ENTRY3(PHADDW,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x02] = X86_OP_ENTRY3(PHADDD,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x03] = X86_OP_ENTRY3(PHADDSW,   V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x04] = X86_OP_ENTRY3(PMADDUBSW, V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x05] = X86_OP_ENTRY3(PHSUBW,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x06] = X86_OP_ENTRY3(PHSUBD,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x07] = X86_OP_ENTRY3(PHSUBSW,   V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+
+    [0x10] = X86_OP_ENTRY3(PBLENDVB,  V,x,  None,None, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x14] = X86_OP_ENTRY3(BLENDVPS,  V,x,  None,None, W,x,  vex4 cpuid(SSE41) p_66),
+    [0x15] = X86_OP_ENTRY3(BLENDVPD,  V,x,  None,None, W,x,  vex4 cpuid(SSE41) p_66),
+    /* Listed incorrectly as type 4 */
+    [0x16] = X86_OP_ENTRY3(VPERMD,    V,qq, H,qq,      W,qq,  vex6 cpuid(AVX2) p_66),
+    [0x17] = X86_OP_ENTRY3(VPTEST,    None,None, V,x,  W,x,   vex4 cpuid(SSE41) p_66),
+
+    /*
+     * Source operand listed as Mq/Ux and similar in the manual; incorrectly listed
+     * as 128-bit only in Table 2-17.
+     */
+    [0x20] = X86_OP_ENTRY3(VPMOVSXBW, V,x,  None,None, W,q,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x21] = X86_OP_ENTRY3(VPMOVSXBD, V,x,  None,None, W,d,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x22] = X86_OP_ENTRY3(VPMOVSXBQ, V,x,  None,None, W,w,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x23] = X86_OP_ENTRY3(VPMOVSXWD, V,x,  None,None, W,q,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x24] = X86_OP_ENTRY3(VPMOVSXWQ, V,x,  None,None, W,d,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x25] = X86_OP_ENTRY3(VPMOVSXDQ, V,x,  None,None, W,q,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+
+    /* Same as PMOVSX.  */
+    [0x30] = X86_OP_ENTRY3(VPMOVZXBW, V,x,  None,None, W,q,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x31] = X86_OP_ENTRY3(VPMOVZXBD, V,x,  None,None, W,d,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x32] = X86_OP_ENTRY3(VPMOVZXBQ, V,x,  None,None, W,w,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x33] = X86_OP_ENTRY3(VPMOVZXWD, V,x,  None,None, W,q,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x34] = X86_OP_ENTRY3(VPMOVZXWQ, V,x,  None,None, W,d,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x35] = X86_OP_ENTRY3(VPMOVZXDQ, V,x,  None,None, W,q,   vex5 cpuid(SSE41) avx_movx avx2_256 p_66),
+    [0x36] = X86_OP_ENTRY3(VPERMD,    V,qq, H,qq,      W,qq,  vex6 cpuid(AVX2) p_66),
+    [0x37] = X86_OP_ENTRY3(PCMPGTQ,   V,x,  H,x,       W,x,   vex4 cpuid(SSE42) avx2_256 p_66),
+
+    [0x40] = X86_OP_ENTRY3(VPMULLD,     V,x,  H,x,       W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x41] = X86_OP_ENTRY3(VPHMINPOSUW, V,dq, None,None, W,dq, vex4 cpuid(SSE41) p_66),
+    /* Listed incorrectly as type 4 */
+    [0x45] = X86_OP_ENTRY3(VPSRLV,      V,x,  H,x,       W,x,  vex6 cpuid(AVX2) p_66),
+    [0x46] = X86_OP_ENTRY3(VPSRAV,      V,x,  H,x,       W,x,  vex6 cpuid(AVX2) p_66),
+    [0x47] = X86_OP_ENTRY3(VPSLLV,      V,x,  H,x,       W,x,  vex6 cpuid(AVX2) p_66),
+
+    [0x90] = X86_OP_ENTRY3(VPGATHERD, V,x,  H,x,  M,d,  vex12 cpuid(AVX2) p_66), /* vpgatherdd/q */
+    [0x91] = X86_OP_ENTRY3(VPGATHERQ, V,x,  H,x,  M,q,  vex12 cpuid(AVX2) p_66), /* vpgatherqd/q */
+    [0x92] = X86_OP_ENTRY3(VPGATHERD, V,x,  H,x,  M,d,  vex12 cpuid(AVX2) p_66), /* vgatherdps/d */
+    [0x93] = X86_OP_ENTRY3(VPGATHERQ, V,x,  H,x,  M,q,  vex12 cpuid(AVX2) p_66), /* vgatherqps/d */
+
+    [0x08] = X86_OP_ENTRY3(PSIGNB,    V,x,        H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x09] = X86_OP_ENTRY3(PSIGNW,    V,x,        H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x0a] = X86_OP_ENTRY3(PSIGND,    V,x,        H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x0b] = X86_OP_ENTRY3(PMULHRSW,  V,x,        H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x0c] = X86_OP_ENTRY3(VPERMILPS, V,x,        H,x,  W,x,  vex4 cpuid(AVX) p_00_66),
+    [0x0d] = X86_OP_ENTRY3(VPERMILPD, V,x,        H,x,  W,x,  vex4 cpuid(AVX) p_66),
+    [0x0e] = X86_OP_ENTRY3(VTESTPS,   None,None,  V,x,  W,x,  vex4 cpuid(AVX) p_66),
+    [0x0f] = X86_OP_ENTRY3(VTESTPD,   None,None,  V,x,  W,x,  vex4 cpuid(AVX) p_66),
+
+    [0x18] = X86_OP_ENTRY3(VPBROADCASTD,   V,x,  None,None, W,d,  vex6 cpuid(AVX) p_66), /* vbroadcastss */
+    [0x19] = X86_OP_ENTRY3(VPBROADCASTQ,   V,qq, None,None, W,q,  vex6 cpuid(AVX) p_66), /* vbroadcastsd */
+    [0x1a] = X86_OP_ENTRY3(VBROADCASTx128, V,qq, None,None, WM,dq,vex6 cpuid(AVX) p_66),
+    [0x1c] = X86_OP_ENTRY3(PABSB,          V,x,  None,None, W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x1d] = X86_OP_ENTRY3(PABSW,          V,x,  None,None, W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+    [0x1e] = X86_OP_ENTRY3(PABSD,          V,x,  None,None, W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
+
+    [0x28] = X86_OP_ENTRY3(VPMULDQ,       V,x, H,x,       W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x29] = X86_OP_ENTRY3(PCMPEQQ,       V,x, H,x,       W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x2a] = X86_OP_ENTRY3(MOVNTDQA,      V,x, None,None, M,x,  vex1 cpuid(SSE41) avx2_256 p_66),
+    [0x2b] = X86_OP_ENTRY3(VPACKUSDW,     V,x, H,x,       W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x2c] = X86_OP_ENTRY3(VMASKMOVPS,    V,x, H,x,       WM,x, vex6 cpuid(AVX) p_66),
+    [0x2d] = X86_OP_ENTRY3(VMASKMOVPD,    V,x, H,x,       WM,x, vex6 cpuid(AVX) p_66),
+    /* Incorrectly listed as Mx,Hx,Vx in the manual */
+    [0x2e] = X86_OP_ENTRY3(VMASKMOVPS_st, M,x, V,x,       H,x,  vex6 cpuid(AVX) p_66),
+    [0x2f] = X86_OP_ENTRY3(VMASKMOVPD_st, M,x, V,x,       H,x,  vex6 cpuid(AVX) p_66),
+
+    [0x38] = X86_OP_ENTRY3(VPMINSB,  V,x,  H,x, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x39] = X86_OP_ENTRY3(VPMINSD,  V,x,  H,x, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x3a] = X86_OP_ENTRY3(VPMINUW,  V,x,  H,x, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x3b] = X86_OP_ENTRY3(VPMINUD,  V,x,  H,x, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x3c] = X86_OP_ENTRY3(VPMAXSB,  V,x,  H,x, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x3d] = X86_OP_ENTRY3(VPMAXSD,  V,x,  H,x, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x3e] = X86_OP_ENTRY3(VPMAXUW,  V,x,  H,x, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x3f] = X86_OP_ENTRY3(VPMAXUD,  V,x,  H,x, W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+
+    [0x58] = X86_OP_ENTRY3(VPBROADCASTD,   V,x,  None,None, W,d,  vex6 cpuid(AVX2) p_66),
+    [0x59] = X86_OP_ENTRY3(VPBROADCASTQ,   V,x,  None,None, W,q,  vex6 cpuid(AVX2) p_66),
+    [0x5a] = X86_OP_ENTRY3(VBROADCASTx128, V,qq, None,None, WM,dq,vex6 cpuid(AVX2) p_66),
+
+    [0x78] = X86_OP_ENTRY3(VPBROADCASTB,   V,x,  None,None, W,b,  vex6 cpuid(AVX2) p_66),
+    [0x79] = X86_OP_ENTRY3(VPBROADCASTW,   V,x,  None,None, W,w,  vex6 cpuid(AVX2) p_66),
+
+    [0x8c] = X86_OP_ENTRY3(VPMASKMOV,    V,x,  H,x, WM,x, vex6 cpuid(AVX2) p_66),
+    [0x8e] = X86_OP_ENTRY3(VPMASKMOV_st, M,x,  V,x, H,x,  vex6 cpuid(AVX2) p_66),
+
+    [0xdb] = X86_OP_ENTRY3(VAESIMC,     V,dq, None,None, W,dq, vex4 cpuid(AES) p_66),
+    [0xdc] = X86_OP_ENTRY3(VAESENC,     V,dq, H,dq,      W,dq, vex4 cpuid(AES) p_66),
+    [0xdd] = X86_OP_ENTRY3(VAESENCLAST, V,dq, H,dq,      W,dq, vex4 cpuid(AES) p_66),
+    [0xde] = X86_OP_ENTRY3(VAESDEC,     V,dq, H,dq,      W,dq, vex4 cpuid(AES) p_66),
+    [0xdf] = X86_OP_ENTRY3(VAESDECLAST, V,dq, H,dq,      W,dq, vex4 cpuid(AES) p_66),
 };
 
 /* five rows for no prefix, 66, F3, F2, 66+F2  */
@@ -384,8 +484,8 @@ static const X86OpEntry opcodes_0F3A[256] = {
     [0x0b] = X86_OP_ENTRY4(VROUNDSD,   V,x,  H,x, W,sd, vex3 cpuid(SSE41) p_66),
     [0x0c] = X86_OP_ENTRY4(VBLENDPS,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
     [0x0d] = X86_OP_ENTRY4(VBLENDPD,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
-    [0x0e] = X86_OP_ENTRY4(VPBLENDW,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
-    [0x0f] = X86_OP_ENTRY4(PALIGNR,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx p_00_66),
+    [0x0e] = X86_OP_ENTRY4(VPBLENDW,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
+    [0x0f] = X86_OP_ENTRY4(PALIGNR,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),
 
     [0x18] = X86_OP_ENTRY4(VINSERTx128,  V,qq, H,qq, W,qq, vex6 cpuid(AVX) p_66),
     [0x19] = X86_OP_ENTRY3(VEXTRACTx128, W,dq, V,qq, I,b,  vex6 cpuid(AVX) p_66),
@@ -754,6 +854,9 @@ static bool decode_op(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
         }
         goto get_modrm;
 
+    case X86_TYPE_WM:  /* modrm byte selects an XMM/YMM memory operand */
+        op->unit = X86_OP_SSE;
+        /* fall through */
     case X86_TYPE_M:  /* modrm byte selects a memory operand */
         modrm = get_modrm(s, env);
         if ((modrm >> 6) == 3) {
@@ -1341,6 +1444,14 @@ static target_ulong disas_insn_new(DisasContext *s, CPUState *cpu, int b)
         }
         break;
 
+    case X86_SPECIAL_AVXExtMov:
+        if (!decode.op[2].has_ea) {
+            decode.op[2].ot = s->vex_l ? MO_256 : MO_128;
+        } else if (s->vex_l) {
+            decode.op[2].ot++;
+        }
+        break;
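
To make the two branches concrete, using VPMOVSXBW as an example (the
widths below come from the SDM, they are not spelled out in the patch):

    /* VPMOVSXBW dst, src:
     *   register src         -> read at full vector width (XMM or YMM)
     *   memory src, VEX.L=0  -> reads W,q (8 bytes)
     *   memory src, VEX.L=1  -> ot++, i.e. 16 bytes
     */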
+
     case X86_SPECIAL_MMX:
         if (!(s->prefix & (PREFIX_REPZ | PREFIX_REPNZ | PREFIX_DATA))) {
             gen_helper_enter_mmx(cpu_env);
diff --git a/target/i386/tcg/decode-new.h b/target/i386/tcg/decode-new.h
index 3db7b82506..e86876b9a9 100644
--- a/target/i386/tcg/decode-new.h
+++ b/target/i386/tcg/decode-new.h
@@ -47,6 +47,7 @@ typedef enum X86OpType {
     X86_TYPE_Y, /* string destination */
 
     /* Custom */
+    X86_TYPE_WM, /* modrm byte selects an XMM/YMM memory operand */
     X86_TYPE_2op, /* 2-operand RMW instruction */
     X86_TYPE_LoBits, /* encoded in bits 0-2 of the operand + REX.B */
     X86_TYPE_0, /* Hard-coded GPRs (RAX..RDI) */
@@ -141,6 +142,12 @@ typedef enum X86InsnSpecial {
     X86_SPECIAL_ZExtOp0,
     X86_SPECIAL_ZExtOp2,
 
+    /*
+     * Register operand 2 is extended to full width, while a memory operand
+     * is doubled in size if VEX.L=1.
+     */
+    X86_SPECIAL_AVXExtMov,
+
     /*
      * MMX instruction exists with no prefix; if there is no prefix, V/H/W/U operands
      * become P/P/Q/N, and size "x" becomes "q".
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 52c0a7fbe0..7084875af6 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -19,6 +19,9 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 
+typedef void (*SSEFunc_0_epppti)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
+                                 TCGv_ptr reg_c, TCGv a0, TCGv_i32 scale);
+
 static void gen_NM_exception(DisasContext *s)
 {
     gen_exception(s, EXCP07_PREX, s->pc_start - s->cs_base);
@@ -416,15 +419,21 @@ static inline void gen_ternary_sse(DisasContext *s, CPUX86State *env, X86Decoded
     fn(cpu_env, s->ptr0, s->ptr1, s->ptr2, ptr3);
     tcg_temp_free_ptr(ptr3);
 }
-#define TERNARY_SSE(uvname, lname)                                                 \
+#define TERNARY_SSE(uname, uvname, lname)                                          \
 static void gen_##uvname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
 {                                                                                  \
     gen_ternary_sse(s, env, decode, (uint8_t)decode->immediate >> 4,               \
                     gen_helper_##lname##_xmm, gen_helper_##lname##_ymm);           \
+}                                                                                  \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    tcg_gen_mov_ptr(s->ptr1, s->ptr0);                                             \
+    gen_ternary_sse(s, env, decode, 0,                                             \
+                  gen_helper_##lname##_xmm, gen_helper_##lname##_ymm);             \
 }
-TERNARY_SSE(VBLENDVPS, blendvps)
-TERNARY_SSE(VBLENDVPD, blendvpd)
-TERNARY_SSE(VPBLENDVB, pblendvb)
+TERNARY_SSE(BLENDVPS, VBLENDVPS, blendvps)
+TERNARY_SSE(BLENDVPD, VBLENDVPD, blendvpd)
+TERNARY_SSE(PBLENDVB, VPBLENDVB, pblendvb)
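
The non-VEX forms pass op3 = 0, i.e. XMM0 as the implicit mask, and point
ptr1 at the destination so that it doubles as the first source.  A scalar
model of the legacy BLENDVPS (standalone sketch, made-up data):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t dst[4]  = { 1, 2, 3, 4 };     /* also the first source */
        uint32_t src[4]  = { 5, 6, 7, 8 };
        uint32_t xmm0[4] = { 0x80000000, 0, 0x80000000, 0 };
        for (int i = 0; i < 4; i++) {
            dst[i] = (xmm0[i] >> 31) ? src[i] : dst[i];
        }
        /* prints "5 2 7 4" */
        printf("%u %u %u %u\n", dst[0], dst[1], dst[2], dst[3]);
        return 0;
    }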
 
 static inline void gen_binary_imm_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
                                       SSEFunc_0_epppi xmm, SSEFunc_0_epppi ymm)
@@ -531,6 +540,19 @@ BINARY_INT_MMX(PSRLQ_r, psrlq)
 BINARY_INT_MMX(PSRAW_r, psraw)
 BINARY_INT_MMX(PSRAD_r, psrad)
 
+BINARY_INT_MMX(PHADDW,    phaddw)
+BINARY_INT_MMX(PHADDSW,   phaddsw)
+BINARY_INT_MMX(PHADDD,    phaddd)
+BINARY_INT_MMX(PHSUBW,    phsubw)
+BINARY_INT_MMX(PHSUBSW,   phsubsw)
+BINARY_INT_MMX(PHSUBD,    phsubd)
+BINARY_INT_MMX(PMADDUBSW, pmaddubsw)
+BINARY_INT_MMX(PSHUFB,    pshufb)
+BINARY_INT_MMX(PSIGNB,    psignb)
+BINARY_INT_MMX(PSIGNW,    psignw)
+BINARY_INT_MMX(PSIGND,    psignd)
+BINARY_INT_MMX(PMULHRSW,  pmulhrsw)
+
 /* Instructions with no MMX equivalent.  */
 #define BINARY_INT_SSE(uname, lname)                                               \
 static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
@@ -541,8 +563,75 @@ static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
                           gen_helper_##lname##_ymm);                               \
 }
 
+/* Instructions with no MMX equivalent.  */
 BINARY_INT_SSE(PUNPCKLQDQ, punpcklqdq)
 BINARY_INT_SSE(PUNPCKHQDQ, punpckhqdq)
+BINARY_INT_SSE(VPACKUSDW,  packusdw)
+BINARY_INT_SSE(VPMINSB,    pminsb)
+BINARY_INT_SSE(VPMINUW,    pminuw)
+BINARY_INT_SSE(VPMINUD,    pminud)
+BINARY_INT_SSE(VPMINSD,    pminsd)
+BINARY_INT_SSE(VPMAXSB,    pmaxsb)
+BINARY_INT_SSE(VPMAXUW,    pmaxuw)
+BINARY_INT_SSE(VPMAXUD,    pmaxud)
+BINARY_INT_SSE(VPMAXSD,    pmaxsd)
+BINARY_INT_SSE(VPMULLD,    pmulld)
+BINARY_INT_SSE(VPMULDQ,    pmuldq)
+BINARY_INT_SSE(VPERMILPS,  vpermilps)
+BINARY_INT_SSE(VPERMILPD,  vpermilpd)
+BINARY_INT_SSE(VMASKMOVPS, vpmaskmovd)
+BINARY_INT_SSE(VMASKMOVPD, vpmaskmovq)
+
+BINARY_INT_SSE(VAESDEC, aesdec)
+BINARY_INT_SSE(VAESDECLAST, aesdeclast)
+BINARY_INT_SSE(VAESENC, aesenc)
+BINARY_INT_SSE(VAESENCLAST, aesenclast)
+
+static inline void gen_unary_int_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                     SSEFunc_0_epp xmm, SSEFunc_0_epp ymm)
+{
+    if (!s->vex_l) {
+        xmm(cpu_env, s->ptr0, s->ptr2);
+    } else {
+        ymm(cpu_env, s->ptr0, s->ptr2);
+    }
+}
+
+#define UNARY_INT_SSE(uname, lname)                                                \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_unary_int_sse(s, env, decode,                                              \
+                      gen_helper_##lname##_xmm,                                    \
+                      gen_helper_##lname##_ymm);                                   \
+}
+
+UNARY_INT_SSE(VPMOVSXBW,    pmovsxbw)
+UNARY_INT_SSE(VPMOVSXBD,    pmovsxbd)
+UNARY_INT_SSE(VPMOVSXBQ,    pmovsxbq)
+UNARY_INT_SSE(VPMOVSXWD,    pmovsxwd)
+UNARY_INT_SSE(VPMOVSXWQ,    pmovsxwq)
+UNARY_INT_SSE(VPMOVSXDQ,    pmovsxdq)
+
+UNARY_INT_SSE(VPMOVZXBW,    pmovzxbw)
+UNARY_INT_SSE(VPMOVZXBD,    pmovzxbd)
+UNARY_INT_SSE(VPMOVZXBQ,    pmovzxbq)
+UNARY_INT_SSE(VPMOVZXWD,    pmovzxwd)
+UNARY_INT_SSE(VPMOVZXWQ,    pmovzxwq)
+UNARY_INT_SSE(VPMOVZXDQ,    pmovzxdq)
+
+#define UNARY_CMP_SSE(uname, lname)                                                \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    if (!s->vex_l) {                                                               \
+        gen_helper_##lname##_xmm(cpu_env, s->ptr1, s->ptr2);                       \
+    } else {                                                                       \
+        gen_helper_##lname##_ymm(cpu_env, s->ptr1, s->ptr2);                       \
+    }                                                                              \
+    set_cc_op(s, CC_OP_EFLAGS);                                                    \
+}
+UNARY_CMP_SSE(VPTEST,     ptest)
+UNARY_CMP_SSE(VTESTPS,    vtestps)
+UNARY_CMP_SSE(VTESTPD,    vtestpd)
 
 static inline void gen_unary_imm_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
                                      SSEFunc_0_ppi xmm, SSEFunc_0_ppi ymm)
@@ -595,6 +684,66 @@ static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
 UNARY_IMM_FP_SSE(VROUNDPS,    roundps)
 UNARY_IMM_FP_SSE(VROUNDPD,    roundpd)
 
+static inline void gen_rexw_avx(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                SSEFunc_0_eppp d_xmm, SSEFunc_0_eppp q_xmm,
+                                SSEFunc_0_eppp d_ymm, SSEFunc_0_eppp q_ymm)
+{
+    SSEFunc_0_eppp d = s->vex_l ? d_ymm : d_xmm;
+    SSEFunc_0_eppp q = s->vex_l ? q_ymm : q_xmm;
+    SSEFunc_0_eppp fn = s->rex_w ? q : d;
+    fn(cpu_env, s->ptr0, s->ptr1, s->ptr2);
+}
+
+/* REX.W affects whether to operate on 32- or 64-bit elements.  */
+#define REXW_AVX(uname, lname)                                                     \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_rexw_avx(s, env, decode,                                                   \
+                 gen_helper_##lname##d_xmm, gen_helper_##lname##q_xmm,             \
+                 gen_helper_##lname##d_ymm, gen_helper_##lname##q_ymm);            \
+}
+REXW_AVX(VPSLLV,    vpsllv)
+REXW_AVX(VPSRLV,    vpsrlv)
+REXW_AVX(VPSRAV,    vpsrav)
+REXW_AVX(VPMASKMOV, vpmaskmov)
+
+/* Same as above, but with extra arguments to the helper.  */
+static inline void gen_vsib_avx(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                SSEFunc_0_epppti d_xmm, SSEFunc_0_epppti q_xmm,
+                                SSEFunc_0_epppti d_ymm, SSEFunc_0_epppti q_ymm)
+{
+    SSEFunc_0_epppti d = s->vex_l ? d_ymm : d_xmm;
+    SSEFunc_0_epppti q = s->vex_l ? q_ymm : q_xmm;
+    SSEFunc_0_epppti fn = s->rex_w ? q : d;
+    TCGv_i32 scale = tcg_const_i32(decode->mem.scale);
+    TCGv_ptr index = tcg_temp_new_ptr();
+
+    /* Pass third input as (index, base, scale) */
+    tcg_gen_addi_ptr(index, cpu_env, ZMM_OFFSET(decode->mem.index));
+    fn(cpu_env, s->ptr0, s->ptr1, index, s->A0, scale);
+
+    /*
+     * There are two output operands, so zero OP1's high 128 bits
+     * in the VEX.128 case.
+     */
+    if (!s->vex_l) {
+        tcg_gen_gvec_dup_imm(MO_64,
+                             decode->op[1].offset + offsetof(ZMMReg, ZMM_X(1)),
+                             16, 16, 0);
+    }
+    tcg_temp_free_ptr(index);
+    tcg_temp_free_i32(scale);
+}
+#define VSIB_AVX(uname, lname)                                                     \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    gen_vsib_avx(s, env, decode,                                                   \
+                 gen_helper_##lname##d_xmm, gen_helper_##lname##q_xmm,             \
+                 gen_helper_##lname##d_ymm, gen_helper_##lname##q_ymm);            \
+}
+VSIB_AVX(VPGATHERD, vpgatherd)
+VSIB_AVX(VPGATHERQ, vpgatherq)
+
 static void gen_ADCOX(DisasContext *s, CPUX86State *env, MemOp ot, int cc_op)
 {
     TCGv carry_in = NULL;
@@ -868,6 +1017,11 @@ static void gen_MOVMSK(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode
     tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
 }
 
+static void gen_MOVNTDQA(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_load_sse(s, s->T0, decode->op[0].ot, decode->op[0].offset);
+}
+
 static void gen_MOVQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset);
@@ -915,6 +1069,27 @@ static void gen_MULX(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 
 }
 
+static void gen_PABSB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_abs(MO_8, decode->op[0].offset, decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PABSW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_abs(MO_16, decode->op[0].offset, decode->op[2].offset, vec_len, vec_len);
+}
+
+static void gen_PABSD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_abs(MO_32, decode->op[0].offset, decode->op[2].offset, vec_len, vec_len);
+}
+
 static void gen_PADDB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
@@ -1009,6 +1184,15 @@ static void gen_PCMPEQD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
                      decode->op[2].offset, vec_len, vec_len);
 }
 
+static void gen_PCMPEQQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_cmp(TCG_COND_EQ, MO_64,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
+
 static void gen_PCMPESTRI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     TCGv_i32 imm = tcg_const_i32(decode->immediate);
@@ -1076,6 +1260,15 @@ static void gen_PCMPGTD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
                      decode->op[2].offset, vec_len, vec_len);
 }
 
+static void gen_PCMPGTQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_cmp(TCG_COND_GT, MO_64,
+                     decode->op[0].offset, decode->op[1].offset,
+                     decode->op[2].offset, vec_len, vec_len);
+}
 
+static void gen_VPBROADCASTB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_ld8u_i32(s->tmp2_i32, s->ptr2, 0);
+    tcg_gen_gvec_dup_i32(MO_8, decode->op[0].offset, vec_len, vec_len, s->tmp2_i32);
+}
+
+static void gen_VPBROADCASTW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_ld16u_i32(s->tmp2_i32, s->ptr2, 0);
+    tcg_gen_gvec_dup_i32(MO_16, decode->op[0].offset, vec_len, vec_len, s->tmp2_i32);
+}
+
+static void gen_VPBROADCASTD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_ld_i32(s->tmp2_i32, s->ptr2, 0);
+    tcg_gen_gvec_dup_i32(MO_32, decode->op[0].offset, vec_len, vec_len, s->tmp2_i32);
+}
+
+static void gen_VPBROADCASTQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_ld_i64(s->tmp1_i64, s->ptr2, 0);
+    tcg_gen_gvec_dup_i64(MO_64, decode->op[0].offset, vec_len, vec_len, s->tmp1_i64);
+}
+
 static void gen_PXOR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     int vec_len = sse_vec_len(s, decode);
@@ -1529,6 +1754,12 @@ static void gen_SSE4a_R(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
     }
 }
 
+static inline void gen_VAESIMC(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    assert(!s->vex_l);
+    gen_helper_aesimc_xmm(cpu_env, s->ptr0, s->ptr2);
+}
+
 static inline void gen_VAESKEYGEN(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     TCGv_i32 imm = tcg_const_i32(decode->immediate);
@@ -1540,6 +1771,14 @@ static inline void gen_VAESKEYGEN(DisasContext *s, CPUX86State *env, X86DecodedI
 #define gen_VAND   gen_PAND
 #define gen_VANDN  gen_PANDN
 
+static inline void gen_VBROADCASTx128(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    tcg_gen_gvec_mov(MO_64, decode->op[0].offset,
+                     decode->op[2].offset, 16, 16);
+    tcg_gen_gvec_mov(MO_64, decode->op[0].offset + offsetof(YMMReg, YMM_X(1)),
+                     decode->op[2].offset, 16, 16);
+}
+
 static void gen_VCVTfp2fp(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     gen_unary_fp_sse(s, env, decode,
@@ -1657,8 +1896,43 @@ static void gen_VINSERTx128(DisasContext *s, CPUX86State *env, X86DecodedInsn *d
                      decode->op[1].offset + offsetof(YMMReg, YMM_X(!mask)), 16, 16);
 }
 
+static inline void gen_maskmov(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                               SSEFunc_0_eppt xmm, SSEFunc_0_eppt ymm)
+{
+    if (!s->vex_l) {
+        xmm(cpu_env, s->ptr2, s->ptr1, s->A0);
+    } else {
+        ymm(cpu_env, s->ptr2, s->ptr1, s->A0);
+    }
+}
+
+static void gen_VMASKMOVPD_st(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_maskmov(s, env, decode, gen_helper_vpmaskmovq_st_xmm, gen_helper_vpmaskmovq_st_ymm);
+}
+
+static void gen_VMASKMOVPS_st(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_maskmov(s, env, decode, gen_helper_vpmaskmovd_st_xmm, gen_helper_vpmaskmovd_st_ymm);
+}
+
+static void gen_VPMASKMOV_st(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    if (s->rex_w) {
+        gen_VMASKMOVPD_st(s, env, decode);
+    } else {
+        gen_VMASKMOVPS_st(s, env, decode);
+    }
+}
+
 #define gen_VOR   gen_POR
 
+static void gen_VPERMD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    assert(s->vex_l);
+    gen_helper_vpermd_ymm(s->ptr0, s->ptr1, s->ptr2);
+}
+
 static inline void gen_VPERM2x128(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     TCGv_i32 imm = tcg_const_i32(decode->immediate);
@@ -1667,6 +1941,12 @@ static inline void gen_VPERM2x128(DisasContext *s, CPUX86State *env, X86DecodedI
     tcg_temp_free_i32(imm);
 }
 
+static void gen_VPHMINPOSUW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    assert(!s->vex_l);
+    gen_helper_phminposuw_xmm(cpu_env, s->ptr0, s->ptr2);
+}
+
 static inline void gen_VROUNDSD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     TCGv_i32 imm = tcg_const_i32(decode->immediate);
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 556087b1e9..e42cb275a1 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4667,7 +4667,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
         use_new &= b <= limit;
 #endif
         if (use_new &&
-            (b == 0x13a ||
+            (b == 0x138 || b == 0x13a ||
              (b >= 0x150 && b <= 0x17f) ||
              (b >= 0x1d0 && b <= 0x1ff))) {
             return disas_insn_new(s, cpu, b);
-- 
2.37.2

* [PATCH 29/37] target/i386: reimplement 0x0f 0xc2, 0xc4-0xc6, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (27 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 28/37] target/i386: reimplement 0x0f 0x38, add AVX Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13  9:44   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, " Paolo Bonzini
                   ` (7 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

Nothing special going on here, for once.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc |  5 +++
 target/i386/tcg/emit.c.inc       | 76 ++++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |  1 +
 3 files changed, 82 insertions(+)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 7feb0eca4e..c51b59f721 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -579,6 +579,11 @@ static const X86OpEntry opcodes_0F[256] = {
     [0x7e] = X86_OP_GROUP0(0F7E),
     [0x7f] = X86_OP_GROUP3(0F6F,       W,x, None,None, V,x, vex5 mmx p_00_66_f3),
 
+    [0xc2] = X86_OP_ENTRY4(VCMP,       V,x, H,x, W,x,       vex2_rep3 p_00_66_f3_f2),
+    [0xc4] = X86_OP_ENTRY4(PINSRW,     V,dq,H,dq,E,w,       vex5 mmx p_00_66),
+    [0xc5] = X86_OP_ENTRY3(PEXTRW,     G,d, U,dq,I,b,       vex5 mmx p_00_66),
+    [0xc6] = X86_OP_ENTRY4(VSHUF,      V,x, H,x, W,x,       vex4 p_00_66),
+
     [0xd0] = X86_OP_ENTRY3(VADDSUB,   V,x, H,x, W,x,        vex2 cpuid(SSE3) p_66_f2),
     [0xd1] = X86_OP_ENTRY3(PSRLW_r,   V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
     [0xd2] = X86_OP_ENTRY3(PSRLD_r,   V,x, H,x, W,x,        vex4 mmx avx2_256 p_00_66),
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 7084875af6..d1819f3581 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -1367,6 +1367,11 @@ static void gen_PINSRB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode
     gen_pinsr(s, env, decode, MO_8);
 }
 
+static void gen_PINSRW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_pinsr(s, env, decode, MO_16);
+}
+
 static void gen_PINSR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     gen_pinsr(s, env, decode, decode->op[2].ot);
@@ -1779,6 +1784,66 @@ static inline void gen_VBROADCASTx128(DisasContext *s, CPUX86State *env, X86Deco
                      decode->op[2].offset, 16, 16);
 }
 
+/*
+ * 00 = v*ps Vps, Hps, Wps
+ * 66 = v*pd Vpd, Hpd, Wpd
+ * f3 = v*ss Vss, Hss, Wss
+ * f2 = v*sd Vsd, Hsd, Wsd
+ */
+#define SSE_CMP(x) { \
+    gen_helper_ ## x ## ps ## _xmm, gen_helper_ ## x ## pd ## _xmm, \
+    gen_helper_ ## x ## ss, gen_helper_ ## x ## sd, \
+    gen_helper_ ## x ## ps ## _ymm, gen_helper_ ## x ## pd ## _ymm}
+static const SSEFunc_0_eppp gen_helper_cmp_funcs[32][6] = {
+    SSE_CMP(cmpeq),
+    SSE_CMP(cmplt),
+    SSE_CMP(cmple),
+    SSE_CMP(cmpunord),
+    SSE_CMP(cmpneq),
+    SSE_CMP(cmpnlt),
+    SSE_CMP(cmpnle),
+    SSE_CMP(cmpord),
+
+    SSE_CMP(cmpequ),
+    SSE_CMP(cmpnge),
+    SSE_CMP(cmpngt),
+    SSE_CMP(cmpfalse),
+    SSE_CMP(cmpnequ),
+    SSE_CMP(cmpge),
+    SSE_CMP(cmpgt),
+    SSE_CMP(cmptrue),
+
+    SSE_CMP(cmpeqs),
+    SSE_CMP(cmpltq),
+    SSE_CMP(cmpleq),
+    SSE_CMP(cmpunords),
+    SSE_CMP(cmpneqq),
+    SSE_CMP(cmpnltq),
+    SSE_CMP(cmpnleq),
+    SSE_CMP(cmpords),
+
+    SSE_CMP(cmpequs),
+    SSE_CMP(cmpngeq),
+    SSE_CMP(cmpngtq),
+    SSE_CMP(cmpfalses),
+    SSE_CMP(cmpnequs),
+    SSE_CMP(cmpgeq),
+    SSE_CMP(cmpgtq),
+    SSE_CMP(cmptrues),
+};
+#undef SSE_CMP
+
+static inline void gen_VCMP(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
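+    /* Without VEX, only the 8 original SSE comparison predicates exist. */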
+    int index = decode->immediate & (s->prefix & PREFIX_VEX ? 31 : 7);
+    int b =
+        s->prefix & PREFIX_REPZ  ? 2 /* ss */ :
+        s->prefix & PREFIX_REPNZ ? 3 /* sd */ :
+        !!(s->prefix & PREFIX_DATA) + (s->vex_l << 2);
+
+    gen_helper_cmp_funcs[index][b](cpu_env, s->ptr0, s->ptr1, s->ptr2);
+}
+
 static void gen_VCVTfp2fp(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     gen_unary_fp_sse(s, env, decode,
@@ -1963,4 +2028,15 @@ static inline void gen_VROUNDSS(DisasContext *s, CPUX86State *env, X86DecodedIns
     tcg_temp_free_i32(imm);
 }
 
+static inline void gen_VSHUF(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv_i32 imm = tcg_const_i32(decode->immediate);
+    SSEFunc_0_pppi ps, pd, fn;
+    ps = s->vex_l ? gen_helper_shufps_ymm : gen_helper_shufps_xmm;
+    pd = s->vex_l ? gen_helper_shufpd_ymm : gen_helper_shufpd_xmm;
+    fn = s->prefix & PREFIX_DATA ? pd : ps;
+    fn(s->ptr0, s->ptr1, s->ptr2, imm);
+    tcg_temp_free_i32(imm);
+}
+
 #define gen_VXOR  gen_PXOR
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index e42cb275a1..468867afcf 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4669,6 +4669,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
         if (use_new &&
             (b == 0x138 || b == 0x13a ||
              (b >= 0x150 && b <= 0x17f) ||
+             b == 0x1c2 || (b >= 0x1c4 && b <= 0x1c6) ||
              (b >= 0x1d0 && b <= 0x1ff))) {
             return disas_insn_new(s, cpu, b);
         }
-- 
2.37.2

* [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (28 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 29/37] target/i386: reimplement 0x0f 0xc2, 0xc4-0xc6, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13 10:14   ` Richard Henderson
  2022-09-13 10:38   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 31/37] target/i386: reimplement 0x0f 0x28-0x2f, " Paolo Bonzini
                   ` (6 subsequent siblings)
  36 siblings, 2 replies; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

These are mostly moves, and yet are a total pain.  The main issues
are:

1) some instructions are selected by mod==11 (register operand)
vs. mod=00/01/10 (memory operand)

2) stores to memory are two-operand operations, while the 3-register
and load-from-memory versions operate on the entire contents of the
destination; this makes it easier to separate the gen_* function for
the store case

3) it's inefficient to load into xmm_T0 only to move the value out
again, so the gen_* function for the load case is separated too

The manual also has various mistakes in the operands here, for example
the store case of MOVHPS operates on a 128-bit source (albeit discarding
the bottom 64 bits) and therefore should be Mq,Vdq rather than Mq,Vq.
Likewise for the destination and source of MOVHLPS.

VUNPCK?PS and VUNPCK?PD are the same as VUNPCK?DQ and VUNPCK?QDQ,
but encoded as prefixes rather than separate operands.  The helpers
can be reused however.

For MOVSLDUP, MOVSHDUP and MOVDDUP I chose to reimplement them as
helpers.  I named the helper for MOVDDUP "movdldup" in preparation
for possible future introduction of MOVDHDUP and to clarify the
similarity with MOVSLDUP.
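
As a sketch of point (1) -- illustrative only, with made-up table names,
but using the get_modrm() and decode_by_prefix() helpers that the new
decoder already provides -- the dispatch looks like:

    static void decode_0Fxx(DisasContext *s, CPUX86State *env,
                            X86OpEntry *entry, uint8_t *b)
    {
        if ((get_modrm(s, env) >> 6) == 3) {
            *entry = *decode_by_prefix(s, opcodes_0Fxx_reg); /* mod==11 */
        } else {
            *entry = *decode_by_prefix(s, opcodes_0Fxx_mem); /* memory */
        }
    }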

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/ops_sse.h            |   7 ++
 target/i386/ops_sse_header.h     |   3 +
 target/i386/tcg/decode-new.c.inc | 121 ++++++++++++++++++++++++++++++
 target/i386/tcg/emit.c.inc       | 123 +++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |   1 +
 5 files changed, 255 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index fbbe82c6e7..52cae7ebe7 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -1683,6 +1683,10 @@ void glue(helper_ptest, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     CC_SRC = (zf ? 0 : CC_Z) | (cf ? 0 : CC_C);
 }
 
+#define FMOVSLDUP(i) s->L((i) & ~1)
+#define FMOVSHDUP(i) s->L((i) | 1)
+#define FMOVDLDUP(i) s->Q((i) & ~1)
+
 #define SSE_HELPER_F(name, elem, num, F)                        \
     void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)   \
     {                                                           \
@@ -1705,6 +1709,9 @@ SSE_HELPER_F(helper_pmovzxbq, Q, 1 << SHIFT, s->B)
 SSE_HELPER_F(helper_pmovzxwd, L, 2 << SHIFT, s->W)
 SSE_HELPER_F(helper_pmovzxwq, Q, 1 << SHIFT, s->W)
 SSE_HELPER_F(helper_pmovzxdq, Q, 1 << SHIFT, s->L)
+SSE_HELPER_F(helper_pmovsldup, L, 2 << SHIFT, FMOVSLDUP)
+SSE_HELPER_F(helper_pmovshdup, L, 2 << SHIFT, FMOVSHDUP)
+SSE_HELPER_F(helper_pmovdldup, Q, 1 << SHIFT, FMOVDLDUP)
 #endif
 
 void glue(helper_pmuldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index e188cbd87d..ed51f10eef 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -355,6 +355,9 @@ DEF_HELPER_3(glue(pmovzxbq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmovzxwd, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmovzxwq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pmovzxdq, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_3(glue(pmovsldup, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_3(glue(pmovshdup, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_3(glue(pmovdldup, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_4(glue(pmuldq, SUFFIX), void, env, Reg, Reg, Reg)
 DEF_HELPER_4(glue(pcmpeqq, SUFFIX), void, env, Reg, Reg, Reg)
 DEF_HELPER_4(glue(packusdw, SUFFIX), void, env, Reg, Reg, Reg)
diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index c51b59f721..268ccb886f 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -509,6 +509,117 @@ static void decode_0F3A(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
     *entry = opcodes_0F3A[*b];
 }
 
+/*
+ * There are some mistakes in the operands in the manual, and the load/store/register
+ * cases are easiest to keep separate, so the entries for 10-17 follow simplicity and
+ * efficiency of implementation rather than copying what the manual says.
+ *
+ * In particular:
+ *
+ * 1) "VMOVSS m32, xmm1" and "VMOVSD m64, xmm1" do not support VEX.vvvv != 1111b,
+ * but this is not mentioned in the tables.
+ *
+ * 2) MOVHLPS, MOVHPS, MOVHPD, MOVLPD, MOVLPS read the high quadword of one of their
+ * operands, which must therefore be dq; MOVLPD and MOVLPS also write the high
+ * quadword of the V operand.
+ */
+static void decode_0F10(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry opcodes_0F10_reg[4] = {
+        X86_OP_ENTRY3(MOVDQ,   V,x,  None,None, W,x, vex4_unal), /* MOVUPS */
+        X86_OP_ENTRY3(MOVDQ,   V,x,  None,None, W,x, vex4_unal), /* MOVUPD */
+        X86_OP_ENTRY3(VMOVSS,  V,x,  H,x,       W,x, vex4),
+        X86_OP_ENTRY3(VMOVLPx, V,x,  H,x,       W,x, vex4), /* MOVSD */
+    };
+
+    static const X86OpEntry opcodes_0F10_mem[4] = {
+        X86_OP_ENTRY3(MOVDQ,      V,x,  None,None, W,x,  vex4_unal), /* MOVUPS */
+        X86_OP_ENTRY3(MOVDQ,      V,x,  None,None, W,x,  vex4_unal), /* MOVUPD */
+        X86_OP_ENTRY3(VMOVSS_ld,  V,x,  H,x,       M,ss, vex4),
+        X86_OP_ENTRY3(VMOVSD_ld,  V,x,  H,x,       M,sd, vex4),
+    };
+
+    if ((get_modrm(s, env) >> 6) == 3) {
+        *entry = *decode_by_prefix(s, opcodes_0F10_reg);
+    } else {
+        *entry = *decode_by_prefix(s, opcodes_0F10_mem);
+    }
+}
+
+static void decode_0F11(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry opcodes_0F11_reg[4] = {
+        X86_OP_ENTRY3(MOVDQ,   W,x,  None,None, V,x, vex4), /* MOVUPS */
+        X86_OP_ENTRY3(MOVDQ,   W,x,  None,None, V,x, vex4), /* MOVUPD */
+        X86_OP_ENTRY3(VMOVSS,  W,x,  H,x,       V,x, vex4),
+        X86_OP_ENTRY3(VMOVLPx, W,x,  H,x,       V,q, vex4), /* MOVSD */
+    };
+
+    static const X86OpEntry opcodes_0F11_mem[4] = {
+        X86_OP_ENTRY3(MOVDQ,      W,x,  None,None, V,x, vex4), /* MOVUPS */
+        X86_OP_ENTRY3(MOVDQ,      W,x,  None,None, V,x, vex4), /* MOVUPD */
+        X86_OP_ENTRY3(VMOVSS_st,  M,ss, None,None, V,x, vex4),
+        X86_OP_ENTRY3(VMOVLPx_st, M,sd, None,None, V,x, vex4), /* MOVSD */
+    };
+
+    if ((get_modrm(s, env) >> 6) == 3) {
+        *entry = *decode_by_prefix(s, opcodes_0F11_reg);
+    } else {
+        *entry = *decode_by_prefix(s, opcodes_0F11_mem);
+    }
+}
+
+static void decode_0F12(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry opcodes_0F12_mem[4] = {
+        /*
+         * Use dq for the operand, for compatibility with gen_MOVSD and
+         * to allow VEX.128 only.
+         */
+        X86_OP_ENTRY3(VMOVLPx_ld, V,dq, H,dq,      M,q, vex4), /* MOVLPS */
+        X86_OP_ENTRY3(VMOVLPx_ld, V,dq, H,dq,      M,q, vex4), /* MOVLPD */
+        X86_OP_ENTRY3(VMOVSLDUP,  V,x,  None,None, W,x, vex4 cpuid(SSE3)),
+        X86_OP_ENTRY3(VMOVDDUP,   V,x,  None,None, WM,q, vex4 cpuid(SSE3)), /* qq if VEX.256 */
+    };
+    static const X86OpEntry opcodes_0F12_reg[4] = {
+        X86_OP_ENTRY3(VMOVHLPS,  V,dq, H,dq,       U,dq, vex4),
+        X86_OP_ENTRY3(VMOVLPx,   W,x,  H,x,        U,q,  vex4), /* MOVLPD */
+        X86_OP_ENTRY3(VMOVSLDUP, V,x,  None,None,  U,x,  vex4 cpuid(SSE3)),
+        X86_OP_ENTRY3(VMOVDDUP,  V,x,  None,None,  U,x,  vex4 cpuid(SSE3)),
+    };
+
+    if ((get_modrm(s, env) >> 6) == 3) {
+        *entry = *decode_by_prefix(s, opcodes_0F12_reg);
+    } else {
+        *entry = *decode_by_prefix(s, opcodes_0F12_mem);
+        if ((s->prefix & PREFIX_REPNZ) && s->vex_l) {
+            entry->s2 = X86_SIZE_qq;
+        }
+    }
+}
+
+static void decode_0F16(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry opcodes_0F16_mem[4] = {
+        X86_OP_ENTRY3(VMOVHPx_ld, V,dq, H,q,       M,q, vex4), /* MOVHPS */
+        X86_OP_ENTRY3(VMOVHPx_ld, V,dq, H,q,       M,q, vex4), /* MOVHPD */
+        X86_OP_ENTRY3(VMOVSHDUP,  V,x,  None,None, W,x, vex4 cpuid(SSE3)),
+        {},
+    };
+    static const X86OpEntry opcodes_0F16_reg[4] = {
+        X86_OP_ENTRY3(VMOVLHPS,  V,dq, H,q,       U,q, vex4),
+        X86_OP_ENTRY3(VMOVHPx,   V,x,  H,x,       U,x, vex4), /* MOVHPD */
+        X86_OP_ENTRY3(VMOVSHDUP, V,x,  None,None, U,x, vex4 cpuid(SSE3)),
+        {},
+    };
+
+    if ((get_modrm(s, env) >> 6) == 3) {
+        *entry = *decode_by_prefix(s, opcodes_0F16_reg);
+    } else {
+        *entry = *decode_by_prefix(s, opcodes_0F16_mem);
+    }
+}
+
 static void decode_sse_unary(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
 {
     if (!(s->prefix & (PREFIX_REPZ | PREFIX_REPNZ))) {
@@ -524,6 +635,16 @@ static void decode_sse_unary(DisasContext *s, CPUX86State *env, X86OpEntry *entr
 }
 
 static const X86OpEntry opcodes_0F[256] = {
+    [0x10] = X86_OP_GROUP0(0F10),
+    [0x11] = X86_OP_GROUP0(0F11),
+    [0x12] = X86_OP_GROUP0(0F12),
+    [0x13] = X86_OP_ENTRY3(VMOVLPx_st,  M,q, None,None, V,q,  vex4 p_00_66),
+    [0x14] = X86_OP_ENTRY3(VUNPCKLPx,   V,x, H,x, W,x,        vex4 p_00_66),
+    [0x15] = X86_OP_ENTRY3(VUNPCKHPx,   V,x, H,x, W,x,        vex4 p_00_66),
+    [0x16] = X86_OP_GROUP0(0F16),
+    /* Incorrectly listed as Mq,Vq in the manual */
+    [0x17] = X86_OP_ENTRY3(VMOVHPx_st,  M,q, None,None, V,dq, vex4 p_00_66),
+
     [0x50] = X86_OP_ENTRY3(MOVMSK,     G,y, None,None, U,x, vex7 p_00_66),
     [0x51] = X86_OP_GROUP3(sse_unary,  V,x, H,x, W,x, vex2_rep3 p_00_66_f3_f2),
     [0x52] = X86_OP_GROUP3(sse_unary,  V,x, H,x, W,x, vex5 p_00_f3),
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index d1819f3581..2319368cb5 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -326,6 +326,7 @@ static inline void gen_fp_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn
         gen_illegal_opcode(s);
     }
 }
+
 #define FP_SSE(uname, lname)                                                       \
 static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
 {                                                                                  \
@@ -344,6 +345,20 @@ FP_SSE(VMIN, min)
 FP_SSE(VDIV, div)
 FP_SSE(VMAX, max)
 
+#define FP_UNPACK_SSE(uname, lname)                                                \
+static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
+{                                                                                  \
+    /* PS maps to the DQ integer instruction, PD maps to QDQ.  */                  \
+    gen_fp_sse(s, env, decode,                                                     \
+               gen_helper_##lname##qdq_xmm,                                        \
+               gen_helper_##lname##dq_xmm,                                         \
+               gen_helper_##lname##qdq_ymm,                                        \
+               gen_helper_##lname##dq_ymm,                                         \
+               NULL, NULL);                                                        \
+}
+FP_UNPACK_SSE(VUNPCKLPx, punpckl)
+FP_UNPACK_SSE(VUNPCKHPx, punpckh)
+
 /*
  * 00 = v*ps Vps, Wpd
  * f3 = v*ss Vss, Wps
@@ -619,6 +634,10 @@ UNARY_INT_SSE(VPMOVZXWD,    pmovzxwd)
 UNARY_INT_SSE(VPMOVZXWQ,    pmovzxwq)
 UNARY_INT_SSE(VPMOVZXDQ,    pmovzxdq)
 
+UNARY_INT_SSE(VMOVSLDUP,     pmovsldup)
+UNARY_INT_SSE(VMOVSHDUP,     pmovshdup)
+UNARY_INT_SSE(VMOVDDUP,      pmovdldup)
+
 #define UNARY_CMP_SSE(uname, lname)                                                \
 static void gen_##uname(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode) \
 {                                                                                  \
@@ -1981,6 +2000,110 @@ static void gen_VMASKMOVPS_st(DisasContext *s, CPUX86State *env, X86DecodedInsn
     gen_maskmov(s, env, decode, gen_helper_vpmaskmovd_st_xmm, gen_helper_vpmaskmovd_st_ymm);
 }
 
+static void gen_VMOVHPx_ld(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    if (decode->op[0].offset != decode->op[1].offset) {
+        tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[1].offset + offsetof(XMMReg, XMM_Q(0)));
+        tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
+    }
+    gen_ldq_env_A0(s, decode->op[0].offset + offsetof(XMMReg, XMM_Q(1)));
+}
+
+static void gen_VMOVHPx_st(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_stq_env_A0(s, decode->op[2].offset + offsetof(XMMReg, XMM_Q(1)));
+}
+
+static void gen_VMOVHPx(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    if (decode->op[0].offset != decode->op[1].offset) {
+        tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[1].offset + offsetof(XMMReg, XMM_Q(0)));
+        tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
+    }
+    if (decode->op[0].offset != decode->op[2].offset) {
+        tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset + offsetof(XMMReg, XMM_Q(1)));
+        tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(1)));
+    }
+}
+
+static void gen_VMOVHLPS(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset + offsetof(XMMReg, XMM_Q(1)));
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
+    if (decode->op[0].offset != decode->op[1].offset) {
+        tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[1].offset + offsetof(XMMReg, XMM_Q(1)));
+        tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(1)));
+    }
+}
+
+static void gen_VMOVLHPS(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset + offsetof(XMMReg, XMM_Q(0)));
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(1)));
+    if (decode->op[0].offset != decode->op[1].offset) {
+        tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[1].offset + offsetof(XMMReg, XMM_Q(0)));
+        tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
+    }
+}
+
+static void gen_VMOVLPx(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset + offsetof(XMMReg, XMM_Q(0)));
+    tcg_gen_gvec_mov(MO_64, decode->op[0].offset, decode->op[1].offset, vec_len, vec_len);
+    tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
+}
+
+static void gen_VMOVLPx_ld(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_mov(MO_64, decode->op[0].offset, decode->op[1].offset, vec_len, vec_len);
+    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, s->mem_index, MO_64);
+    tcg_gen_st_i64(s->tmp1_i64, s->ptr0, offsetof(ZMMReg, ZMM_Q(0)));
+}
+
+static void gen_VMOVLPx_st(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    tcg_gen_ld_i64(s->tmp1_i64, s->ptr2, offsetof(ZMMReg, ZMM_Q(0)));
+    tcg_gen_qemu_st_i64(s->tmp1_i64, s->A0, s->mem_index, MO_64);
+}
+
+static void gen_VMOVSD_ld(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    TCGv zero = tcg_const_i64(0);
+
+    tcg_gen_st_i64(zero, s->ptr0, offsetof(ZMMReg, ZMM_Q(1)));
+    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, s->mem_index, MO_64);
+    tcg_gen_st_i64(s->tmp1_i64, s->ptr0, offsetof(ZMMReg, ZMM_Q(0)));
+    tcg_temp_free_i64(zero);
+}
+
+static void gen_VMOVSS(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_ld_i32(s->tmp2_i32, s->ptr2, offsetof(ZMMReg, ZMM_L(0)));
+    tcg_gen_gvec_mov(MO_64, decode->op[0].offset, decode->op[1].offset, vec_len, vec_len);
+    tcg_gen_st_i32(s->tmp2_i32, s->ptr0, offsetof(ZMMReg, ZMM_L(0)));
+}
+
+static void gen_VMOVSS_ld(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+
+    tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
+    tcg_gen_qemu_ld_i32(s->tmp2_i32, s->A0, s->mem_index, MO_32);
+    tcg_gen_st_i32(s->tmp2_i32, s->ptr0, offsetof(ZMMReg, ZMM_L(0)));
+}
+
+static void gen_VMOVSS_st(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    tcg_gen_ld_i32(s->tmp2_i32, s->ptr2, offsetof(ZMMReg, ZMM_L(0)));
+    tcg_gen_qemu_st_i32(s->tmp2_i32, s->A0, s->mem_index, MO_32);
+}
+
 static void gen_VPMASKMOV_st(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     if (s->rex_w) {
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 468867afcf..bb5f74140c 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4668,6 +4668,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 #endif
         if (use_new &&
             (b == 0x138 || b == 0x13a ||
+             (b >= 0x110 && b <= 0x117) ||
              (b >= 0x150 && b <= 0x17f) ||
              b == 0x1c2 || (b >= 0x1c4 && b <= 0x1c6) ||
              (b >= 0x1d0 && b <= 0x1ff))) {
-- 
2.37.2

* [PATCH 31/37] target/i386: reimplement 0x0f 0x28-0x2f, add AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (29 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13 10:24   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 32/37] target/i386: implement XSAVE and XRSTOR of AVX registers Paolo Bonzini
                   ` (5 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

Here the code is a bit uglier due to the truncation and extension
of registers to and from 32-bit.  Otherwise there is nothing special
going on.
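
The recurring shape is roughly this (a sketch only; gen_VCVTSI2Sx below
is the real thing):

    TCGv_i32 in;
    #ifdef TARGET_X86_64
    if (ot == MO_64) {
        /* 64-bit operand: pass the full register to the *sq helper. */
        gen_helper_cvtsq2sd(cpu_env, s->ptr0, s->T1);
        return;
    }
    in = s->tmp2_i32;
    tcg_gen_trunc_tl_i32(in, s->T1);   /* truncate target_ulong to 32 bits */
    #else
    in = s->T1;                        /* target_ulong is already 32 bits */
    #endif
    gen_helper_cvtsi2sd(cpu_env, s->ptr0, in);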

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc |  54 ++++++++++++++
 target/i386/tcg/emit.c.inc       | 120 +++++++++++++++++++++++++++++++
 target/i386/tcg/translate.c      |   1 +
 3 files changed, 175 insertions(+)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 268ccb886f..383a425ccd 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -620,6 +620,51 @@ static void decode_0F16(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
     }
 }
 
+static void decode_0F2A(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry opcodes_0F2A[4] = {
+        X86_OP_ENTRY3(CVTPI2Px,  V,x, None,None, Q,q, vex4),
+        X86_OP_ENTRY3(CVTPI2Px,  V,x, None,None, Q,q, vex4),
+        X86_OP_ENTRY3(VCVTSI2Sx, V,x,  H,x, E,y,        vex3),
+        X86_OP_ENTRY3(VCVTSI2Sx, V,x,  H,x, E,y,        vex3),
+    };
+    *entry = *decode_by_prefix(s, opcodes_0F2A);
+}
+
+static void decode_0F2B(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry opcodes_0F2B[4] = {
+        X86_OP_ENTRY3(MOVDQ,      M,x,  None,None, V,x, vex4), /* MOVNTPS */
+        X86_OP_ENTRY3(MOVDQ,      M,x,  None,None, V,x, vex4), /* MOVNTPD */
+        X86_OP_ENTRY3(VMOVSS_st,  M,ss, None,None, V,x, vex4),
+        X86_OP_ENTRY3(VMOVLPx_st, M,sd, None,None, V,x, vex4), /* MOVSD */
+    };
+
+    *entry = *decode_by_prefix(s, opcodes_0F2B);
+}
+
+static void decode_0F2C(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry opcodes_0F2C[4] = {
+        X86_OP_ENTRY3(CVTTPx2PI,  P,q,  None,None, W,x, vex4),
+        X86_OP_ENTRY3(CVTTPx2PI,  P,q,  None,None, W,x, vex4),
+        X86_OP_ENTRY3(VCVTTSx2SI, G,y,  None,None, W,x, vex3),
+        X86_OP_ENTRY3(VCVTTSx2SI, G,y,  None,None, W,x, vex3),
+    };
+    *entry = *decode_by_prefix(s, opcodes_0F2C);
+}
+
+static void decode_0F2D(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    static const X86OpEntry opcodes_0F2D[4] = {
+        X86_OP_ENTRY3(CVTPx2PI,  P,q,  None,None, W,x, vex4),
+        X86_OP_ENTRY3(CVTPx2PI,  P,q,  None,None, W,x, vex4),
+        X86_OP_ENTRY3(VCVTSx2SI, G,y,  None,None, W,x, vex3),
+        X86_OP_ENTRY3(VCVTSx2SI, G,y,  None,None, W,x, vex3),
+    };
+    *entry = *decode_by_prefix(s, opcodes_0F2D);
+}
+
 static void decode_sse_unary(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
 {
     if (!(s->prefix & (PREFIX_REPZ | PREFIX_REPNZ))) {
@@ -672,6 +717,15 @@ static const X86OpEntry opcodes_0F[256] = {
     [0x76] = X86_OP_ENTRY3(PCMPEQD,    V,x, H,x, W,x,  vex4 mmx avx2_256 p_00_66),
     [0x77] = X86_OP_ENTRY0(EMMS_VZERO, vex8),
 
+    [0x28] = X86_OP_ENTRY3(MOVDQ,      V,x,  None,None, W,x, vex1 p_00_66), /* MOVAPS */
+    [0x29] = X86_OP_ENTRY3(MOVDQ,      W,x,  None,None, V,x, vex1 p_00_66), /* MOVAPS */
+    [0x2A] = X86_OP_GROUP0(0F2A),
+    [0x2B] = X86_OP_GROUP0(0F2B),
+    [0x2C] = X86_OP_GROUP0(0F2C),
+    [0x2D] = X86_OP_GROUP0(0F2D),
+    [0x2E] = X86_OP_ENTRY3(VUCOMI,     None,None, V,x, W,x,  vex4 p_00_66),
+    [0x2F] = X86_OP_ENTRY3(VCOMI,      None,None, V,x, W,x,  vex4 p_00_66),
+
     [0x38] = X86_OP_GROUP0(0F38),
     [0x3a] = X86_OP_GROUP0(0F3A),
 
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index 2319368cb5..d61b43f21c 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -921,6 +921,36 @@ static void gen_CRC32(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     gen_helper_crc32(s->T0, s->tmp2_i32, s->T1, tcg_const_i32(8 << ot));
 }
 
+static void gen_CVTPI2Px(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_helper_enter_mmx(cpu_env);
+    if (s->prefix & PREFIX_DATA) {
+        gen_helper_cvtpi2pd(cpu_env, s->ptr0, s->ptr2);
+    } else {
+        gen_helper_cvtpi2ps(cpu_env, s->ptr0, s->ptr2);
+    }
+}
+
+static void gen_CVTPx2PI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_helper_enter_mmx(cpu_env);
+    if (s->prefix & PREFIX_DATA) {
+        gen_helper_cvtpd2pi(cpu_env, s->ptr0, s->ptr2);
+    } else {
+        gen_helper_cvtps2pi(cpu_env, s->ptr0, s->ptr2);
+    }
+}
+
+static void gen_CVTTPx2PI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_helper_enter_mmx(cpu_env);
+    if (s->prefix & PREFIX_DATA) {
+        gen_helper_cvttpd2pi(cpu_env, s->ptr0, s->ptr2);
+    } else {
+        gen_helper_cvttps2pi(cpu_env, s->ptr0, s->ptr2);
+    }
+}
+
 static void gen_EMMS_VZERO(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     if (!(s->prefix & PREFIX_VEX)) {
@@ -1863,6 +1893,14 @@ static inline void gen_VCMP(DisasContext *s, CPUX86State *env, X86DecodedInsn *d
     gen_helper_cmp_funcs[index][b](cpu_env, s->ptr0, s->ptr1, s->ptr2);
 }
 
+static void gen_VCOMI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    SSEFunc_0_epp fn;
+    fn = s->prefix & PREFIX_DATA ? gen_helper_comisd : gen_helper_comiss;
+    fn(cpu_env, s->ptr1, s->ptr2);
+    set_cc_op(s, CC_OP_EFLAGS);
+}
+
 static void gen_VCVTfp2fp(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     gen_unary_fp_sse(s, env, decode,
@@ -1871,6 +1909,80 @@ static void gen_VCVTfp2fp(DisasContext *s, CPUX86State *env, X86DecodedInsn *dec
                      gen_helper_cvtsd2ss, gen_helper_cvtss2sd);
 }
 
+static void gen_VCVTSI2Sx(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    int vec_len = sse_vec_len(s, decode);
+    MemOp ot = decode->op[2].ot;
+    TCGv_i32 in;
+
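+    /* Elements not written by the conversion are copied from the H operand. */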
+    tcg_gen_gvec_mov(MO_64, decode->op[0].offset, decode->op[1].offset, vec_len, vec_len);
+#ifdef TARGET_X86_64
+    if (ot == MO_64) {
+        if (s->prefix & PREFIX_REPNZ) {
+            gen_helper_cvtsq2sd(cpu_env, s->ptr0, s->T1);
+        } else {
+            gen_helper_cvtsq2ss(cpu_env, s->ptr0, s->T1);
+        }
+        return;
+    }
+    in = s->tmp2_i32;
+    tcg_gen_trunc_tl_i32(in, s->T1);
+#else
+    in = s->T1;
+#endif
+
+    if (s->prefix & PREFIX_REPNZ) {
+        gen_helper_cvtsi2sd(cpu_env, s->ptr0, in);
+    } else {
+        gen_helper_cvtsi2ss(cpu_env, s->ptr0, in);
+    }
+}
+
+static inline void gen_VCVTtSx2SI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
+                                  SSEFunc_i_ep ss2si, SSEFunc_l_ep ss2sq,
+                                  SSEFunc_i_ep sd2si, SSEFunc_l_ep sd2sq)
+{
+    MemOp ot = decode->op[0].ot;
+    TCGv_i32 out;
+
+#ifdef TARGET_X86_64
+    if (ot == MO_64) {
+        if (s->prefix & PREFIX_REPNZ) {
+            sd2sq(s->T0, cpu_env, s->ptr2);
+        } else {
+            ss2sq(s->T0, cpu_env, s->ptr2);
+        }
+        return;
+    }
+
+    out = s->tmp2_i32;
+#else
+    out = s->T0;
+#endif
+    if (s->prefix & PREFIX_REPNZ) {
+        sd2si(out, cpu_env, s->ptr2);
+    } else {
+        ss2si(out, cpu_env, s->ptr2);
+    }
+#ifdef TARGET_X86_64
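+    /* Writes to a 32-bit destination zero-extend into the full register. */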
+    tcg_gen_extu_i32_tl(s->T0, out);
+#endif
+}
+
+static void gen_VCVTSx2SI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_VCVTtSx2SI(s, env, decode,
+                   gen_helper_cvtss2si, gen_helper_cvtss2sq,
+                   gen_helper_cvtsd2si, gen_helper_cvtsd2sq);
+}
+
+static void gen_VCVTTSx2SI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    gen_VCVTtSx2SI(s, env, decode,
+                   gen_helper_cvttss2si, gen_helper_cvttss2sq,
+                   gen_helper_cvttsd2si, gen_helper_cvttsd2sq);
+}
+
 static void gen_VCVTpd_dq(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     SSEFunc_0_epp fn = NULL;
@@ -2162,4 +2274,12 @@ static inline void gen_VSHUF(DisasContext *s, CPUX86State *env, X86DecodedInsn *
     tcg_temp_free_i32(imm);
 }
 
+static void gen_VUCOMI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    SSEFunc_0_epp fn;
+    fn = s->prefix & PREFIX_DATA ? gen_helper_ucomisd : gen_helper_ucomiss;
+    fn(cpu_env, s->ptr1, s->ptr2);
+    set_cc_op(s, CC_OP_EFLAGS);
+}
+
 #define gen_VXOR  gen_PXOR
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index bb5f74140c..f312663110 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -4669,6 +4669,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
         if (use_new &&
             (b == 0x138 || b == 0x13a ||
              (b >= 0x110 && b <= 0x117) ||
+             (b >= 0x128 && b <= 0x12f) ||
              (b >= 0x150 && b <= 0x17f) ||
              b == 0x1c2 || (b >= 0x1c4 && b <= 0x1c6) ||
              (b >= 0x1d0 && b <= 0x1ff))) {
-- 
2.37.2

* [PATCH 32/37] target/i386: implement XSAVE and XRSTOR of AVX registers
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (30 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 31/37] target/i386: reimplement 0x0f 0x28-0x2f, " Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13 10:27   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 33/37] target/i386: Enable AVX cpuid bits when using TCG Paolo Bonzini
                   ` (4 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/fpu_helper.c | 84 ++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 3 deletions(-)

diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index 230907bc5c..1be620257e 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -2571,6 +2571,25 @@ static void do_xsave_sse(CPUX86State *env, target_ulong ptr, uintptr_t ra)
     }
 }
 
+static void do_xsave_ymmh(CPUX86State *env, target_ulong ptr, uintptr_t ra)
+{
+    int i, nb_xmm_regs;
+    target_ulong addr;
+
+    if (env->hflags & HF_CS64_MASK) {
+        nb_xmm_regs = 16;
+    } else {
+        nb_xmm_regs = 8;
+    }
+
+    addr = ptr + XO(avx_state);
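+    /* ZMM_Q(2) and ZMM_Q(3) hold bits 255:128, i.e. the YMM high halves. */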
+    for (i = 0; i < nb_xmm_regs; i++) {
+        cpu_stq_data_ra(env, addr, env->xmm_regs[i].ZMM_Q(2), ra);
+        cpu_stq_data_ra(env, addr + 8, env->xmm_regs[i].ZMM_Q(3), ra);
+        addr += 16;
+    }
+}
+
 static void do_xsave_bndregs(CPUX86State *env, target_ulong ptr, uintptr_t ra)
 {
     target_ulong addr = ptr + offsetof(XSaveBNDREG, bnd_regs);
@@ -2663,6 +2682,9 @@ static void do_xsave(CPUX86State *env, target_ulong ptr, uint64_t rfbm,
     if (opt & XSTATE_SSE_MASK) {
         do_xsave_sse(env, ptr, ra);
     }
+    if (opt & XSTATE_YMM_MASK) {
+        do_xsave_ymmh(env, ptr, ra);
+    }
     if (opt & XSTATE_BNDREGS_MASK) {
         do_xsave_bndregs(env, ptr + XO(bndreg_state), ra);
     }
@@ -2737,6 +2759,57 @@ static void do_xrstor_sse(CPUX86State *env, target_ulong ptr, uintptr_t ra)
     }
 }
 
+static void do_clear_sse(CPUX86State *env)
+{
+    int i, nb_xmm_regs;
+
+    if (env->hflags & HF_CS64_MASK) {
+        nb_xmm_regs = 16;
+    } else {
+        nb_xmm_regs = 8;
+    }
+
+    for (i = 0; i < nb_xmm_regs; i++) {
+        env->xmm_regs[i].ZMM_Q(0) = 0;
+        env->xmm_regs[i].ZMM_Q(1) = 0;
+    }
+}
+
+static void do_xrstor_ymmh(CPUX86State *env, target_ulong ptr, uintptr_t ra)
+{
+    int i, nb_xmm_regs;
+    target_ulong addr;
+
+    if (env->hflags & HF_CS64_MASK) {
+        nb_xmm_regs = 16;
+    } else {
+        nb_xmm_regs = 8;
+    }
+
+    addr = ptr + XO(avx_state);
+    for (i = 0; i < nb_xmm_regs; i++) {
+        env->xmm_regs[i].ZMM_Q(2) = cpu_ldq_data_ra(env, addr, ra);
+        env->xmm_regs[i].ZMM_Q(3) = cpu_ldq_data_ra(env, addr + 8, ra);
+        addr += 16;
+    }
+}
+
+static void do_clear_ymmh(CPUX86State *env)
+{
+    int i, nb_xmm_regs;
+
+    if (env->hflags & HF_CS64_MASK) {
+        nb_xmm_regs = 16;
+    } else {
+        nb_xmm_regs = 8;
+    }
+
+    for (i = 0; i < nb_xmm_regs; i++) {
+        env->xmm_regs[i].ZMM_Q(2) = 0;
+        env->xmm_regs[i].ZMM_Q(3) = 0;
+    }
+}
+
 static void do_xrstor_bndregs(CPUX86State *env, target_ulong ptr, uintptr_t ra)
 {
     target_ulong addr = ptr + offsetof(XSaveBNDREG, bnd_regs);
@@ -2856,9 +2929,14 @@ void helper_xrstor(CPUX86State *env, target_ulong ptr, uint64_t rfbm)
         if (xstate_bv & XSTATE_SSE_MASK) {
             do_xrstor_sse(env, ptr, ra);
         } else {
-            /* ??? When AVX is implemented, we may have to be more
-               selective in the clearing.  */
-            memset(env->xmm_regs, 0, sizeof(env->xmm_regs));
+            do_clear_sse(env);
+        }
+    }
+    if (rfbm & XSTATE_YMM_MASK) {
+        if (xstate_bv & XSTATE_YMM_MASK) {
+            do_xrstor_ymmh(env, ptr, ra);
+        } else {
+            do_clear_ymmh(env);
         }
     }
     if (rfbm & XSTATE_BNDREGS_MASK) {
-- 
2.37.2

* [PATCH 33/37] target/i386: Enable AVX cpuid bits when using TCG
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (31 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 32/37] target/i386: implement XSAVE and XRSTOR of AVX registers Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13 10:28   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 34/37] target/i386: implement VLDMXCSR/VSTMXCSR Paolo Bonzini
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

From: Paul Brook <paul@nowt.org>

Include AVX, AVX2 and VAES in the guest cpuid features supported by TCG.

Signed-off-by: Paul Brook <paul@nowt.org>
Message-Id: <20220424220204.2493824-40-paul@nowt.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/cpu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 1db1278a59..ec0817a61d 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -625,12 +625,12 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
           CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | \
           CPUID_EXT_XSAVE | /* CPUID_EXT_OSXSAVE is dynamic */   \
           CPUID_EXT_MOVBE | CPUID_EXT_AES | CPUID_EXT_HYPERVISOR | \
-          CPUID_EXT_RDRAND)
+          CPUID_EXT_RDRAND | CPUID_EXT_AVX)
           /* missing:
           CPUID_EXT_DTES64, CPUID_EXT_DSCPL, CPUID_EXT_VMX, CPUID_EXT_SMX,
           CPUID_EXT_EST, CPUID_EXT_TM2, CPUID_EXT_CID, CPUID_EXT_FMA,
           CPUID_EXT_XTPR, CPUID_EXT_PDCM, CPUID_EXT_PCID, CPUID_EXT_DCA,
-          CPUID_EXT_X2APIC, CPUID_EXT_TSC_DEADLINE_TIMER, CPUID_EXT_AVX,
+          CPUID_EXT_X2APIC, CPUID_EXT_TSC_DEADLINE_TIMER,
           CPUID_EXT_F16C */
 
 #ifdef TARGET_X86_64
@@ -653,14 +653,14 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
           CPUID_7_0_EBX_BMI1 | CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ADX | \
           CPUID_7_0_EBX_PCOMMIT | CPUID_7_0_EBX_CLFLUSHOPT |            \
           CPUID_7_0_EBX_CLWB | CPUID_7_0_EBX_MPX | CPUID_7_0_EBX_FSGSBASE | \
-          CPUID_7_0_EBX_ERMS)
+          CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_AVX2)
           /* missing:
-          CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
+          CPUID_7_0_EBX_HLE,
           CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
           CPUID_7_0_EBX_RDSEED */
 #define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_UMIP | CPUID_7_0_ECX_PKU | \
           /* CPUID_7_0_ECX_OSPKE is dynamic */ \
-          CPUID_7_0_ECX_LA57 | CPUID_7_0_ECX_PKS)
+          CPUID_7_0_ECX_LA57 | CPUID_7_0_ECX_PKS | CPUID_7_0_ECX_VAES)
 #define TCG_7_0_EDX_FEATURES 0
 #define TCG_7_1_EAX_FEATURES 0
 #define TCG_APM_FEATURES 0
-- 
2.37.2

* [PATCH 34/37] target/i386: implement VLDMXCSR/VSTMXCSR
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (32 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 33/37] target/i386: Enable AVX cpuid bits when using TCG Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13 10:32   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 35/37] tests/tcg: extend SSE tests to AVX Paolo Bonzini
                   ` (2 subsequent siblings)
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

These are exactly the same as the non-VEX version, but one has to be careful
that only VEX.L=0 is allowed.
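
Concretely, both generators open with the same guard (a sketch of the
check added below):

    if (s->vex_l) {
        gen_illegal_opcode(s);    /* VEX.L=1 is #UD for V{LD,ST}MXCSR */
        return;
    }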

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/decode-new.c.inc | 25 +++++++++++++++++++++++++
 target/i386/tcg/emit.c.inc       | 20 ++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 383a425ccd..e468a32787 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -80,6 +80,10 @@
 
 #define X86_OP_ENTRY2(op, op0, s0, op1, s1, ...)                  \
     X86_OP_ENTRY3(op, op0, s0, 2op, s0, op1, s1, ## __VA_ARGS__)
+#define X86_OP_ENTRYw(op, op0, s0, ...)                           \
+    X86_OP_ENTRY3(op, op0, s0, None, None, None, None, ## __VA_ARGS__)
+#define X86_OP_ENTRYr(op, op0, s0, ...)                           \
+    X86_OP_ENTRY3(op, None, None, None, None, op0, s0, ## __VA_ARGS__)
 #define X86_OP_ENTRY0(op, ...)                                    \
     X86_OP_ENTRY3(op, None, None, None, None, None, None, ## __VA_ARGS__)
 
@@ -147,6 +151,25 @@ static inline const X86OpEntry *decode_by_prefix(DisasContext *s, const X86OpEnt
     }
 }
 
+static void decode_group15(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
+{
+    /* only includes ldmxcsr and stmxcsr, because they have AVX variants.  */
+    static const X86OpEntry group15_reg[8] = {
+    };
+
+    static const X86OpEntry group15_mem[8] = {
+        [2] = X86_OP_ENTRYr(LDMXCSR,    E,d, vex5),
+        [3] = X86_OP_ENTRYw(STMXCSR,    E,d, vex5),
+    };
+
+    uint8_t modrm = get_modrm(s, env);
+    if ((modrm >> 6) == 3) {
+        *entry = group15_reg[(modrm >> 3) & 7];
+    } else {
+        *entry = group15_mem[(modrm >> 3) & 7];
+    }
+}
+
 static void decode_group17(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
 {
     static const X86GenFunc group17_gen[8] = {
@@ -754,6 +777,8 @@ static const X86OpEntry opcodes_0F[256] = {
     [0x7e] = X86_OP_GROUP0(0F7E),
     [0x7f] = X86_OP_GROUP3(0F6F,       W,x, None,None, V,x, vex5 mmx p_00_66_f3),
 
+    [0xae] = X86_OP_GROUP0(group15),
+
     [0xc2] = X86_OP_ENTRY4(VCMP,       V,x, H,x, W,x,       vex2_rep3 p_00_66_f3_f2),
     [0xc4] = X86_OP_ENTRY4(PINSRW,     V,dq,H,dq,E,w,       vex5 mmx p_00_66),
     [0xc5] = X86_OP_ENTRY3(PEXTRW,     G,d, U,dq,I,b,       vex5 mmx p_00_66),
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index d61b43f21c..942766de0f 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -979,6 +979,16 @@ static void gen_LDDQU(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     gen_load_sse(s, s->T0, decode->op[0].ot, decode->op[0].offset);
 }
 
+static void gen_LDMXCSR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    if (s->vex_l) {
+        gen_illegal_opcode(s);
+        return;
+    }
+    tcg_gen_trunc_tl_i32(s->tmp2_i32, s->T1);
+    gen_helper_ldmxcsr(cpu_env, s->tmp2_i32);
+}
+
 static void gen_MASKMOV(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     tcg_gen_mov_tl(s->A0, cpu_regs[R_EDI]);
@@ -1808,6 +1818,16 @@ static void gen_SSE4a_R(DisasContext *s, CPUX86State *env, X86DecodedInsn *decod
     }
 }
 
+static void gen_STMXCSR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
+{
+    if (s->vex_l) {
+        gen_illegal_opcode(s);
+        return;
+    }
+    gen_helper_update_mxcsr(cpu_env);
+    tcg_gen_ld32u_tl(s->T0, cpu_env, offsetof(CPUX86State, mxcsr));
+}
+
 static inline void gen_VAESIMC(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
     assert(!s->vex_l);
-- 
2.37.2

* [PATCH 35/37] tests/tcg: extend SSE tests to AVX
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (33 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 34/37] target/i386: implement VLDMXCSR/VSTMXCSR Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13 10:33   ` Richard Henderson
  2022-09-11 23:04 ` [PATCH 36/37] target/i386: move 3DNow completely out of gen_sse Paolo Bonzini
  2022-09-13 10:39 ` [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Richard Henderson
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

Extracted from a patch by Paul Brook <paul@nowt.org>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tests/tcg/i386/Makefile.target |   2 +-
 tests/tcg/i386/test-avx.c      | 201 ++++++++++++++++++---------------
 tests/tcg/i386/test-avx.py     |   3 +-
 3 files changed, 112 insertions(+), 94 deletions(-)

diff --git a/tests/tcg/i386/Makefile.target b/tests/tcg/i386/Makefile.target
index ae71e7f748..4139973255 100644
--- a/tests/tcg/i386/Makefile.target
+++ b/tests/tcg/i386/Makefile.target
@@ -98,5 +98,5 @@ test-3dnow: test-3dnow.h
 test-mmx: CFLAGS += -masm=intel -O -I.
 test-mmx: test-mmx.h
 
-test-avx: CFLAGS += -masm=intel -O -I.
+test-avx: CFLAGS += -mavx -masm=intel -O -I.
 test-avx: test-avx.h
diff --git a/tests/tcg/i386/test-avx.c b/tests/tcg/i386/test-avx.c
index 23c170dd79..953e2906fe 100644
--- a/tests/tcg/i386/test-avx.c
+++ b/tests/tcg/i386/test-avx.c
@@ -6,18 +6,18 @@
 typedef void (*testfn)(void);
 
 typedef struct {
-    uint64_t q0, q1;
-} __attribute__((aligned(16))) v2di;
+    uint64_t q0, q1, q2, q3;
+} __attribute__((aligned(32))) v4di;
 
 typedef struct {
     uint64_t mm[8];
-    v2di xmm[16];
+    v4di ymm[16];
     uint64_t r[16];
     uint64_t flags;
     uint32_t ff;
     uint64_t pad;
-    v2di mem[4];
-    v2di mem0[4];
+    v4di mem[4];
+    v4di mem0[4];
 } reg_state;
 
 typedef struct {
@@ -31,20 +31,20 @@ reg_state initI;
 reg_state initF32;
 reg_state initF64;
 
-static void dump_xmm(const char *name, int n, const v2di *r, int ff)
+static void dump_ymm(const char *name, int n, const v4di *r, int ff)
 {
-    printf("%s%d = %016lx %016lx\n",
-           name, n, r->q1, r->q0);
+    printf("%s%d = %016lx %016lx %016lx %016lx\n",
+           name, n, r->q3, r->q2, r->q1, r->q0);
     if (ff == 64) {
-        double v[2];
+        double v[4];
         memcpy(v, r, sizeof(v));
-        printf("        %16g %16g\n",
-                v[1], v[0]);
-    } else if (ff == 32) {
-        float v[4];
-        memcpy(v, r, sizeof(v));
-        printf(" %8g %8g %8g %8g\n",
+        printf("        %16g %16g %16g %16g\n",
                 v[3], v[2], v[1], v[0]);
+    } else if (ff == 32) {
+        float v[8];
+        memcpy(v, r, sizeof(v));
+        printf(" %8g %8g %8g %8g %8g %8g %8g %8g\n",
+                v[7], v[6], v[5], v[4], v[3], v[2], v[1], v[0]);
     }
 }
 
@@ -53,10 +53,10 @@ static void dump_regs(reg_state *s)
     int i;
 
     for (i = 0; i < 16; i++) {
-        dump_xmm("xmm", i, &s->xmm[i], 0);
+        dump_ymm("ymm", i, &s->ymm[i], 0);
     }
     for (i = 0; i < 4; i++) {
-        dump_xmm("mem", i, &s->mem0[i], 0);
+        dump_ymm("mem", i, &s->mem0[i], 0);
     }
 }
 
@@ -74,13 +74,13 @@ static void compare_state(const reg_state *a, const reg_state *b)
         }
     }
     for (i = 0; i < 16; i++) {
-        if (memcmp(&a->xmm[i], &b->xmm[i], 16)) {
-            dump_xmm("xmm", i, &b->xmm[i], a->ff);
+        if (memcmp(&a->ymm[i], &b->ymm[i], 32)) {
+            dump_ymm("ymm", i, &b->ymm[i], a->ff);
         }
     }
     for (i = 0; i < 4; i++) {
-        if (memcmp(&a->mem0[i], &a->mem[i], 16)) {
-            dump_xmm("mem", i, &a->mem[i], a->ff);
+        if (memcmp(&a->mem0[i], &a->mem[i], 32)) {
+            dump_ymm("mem", i, &a->mem[i], a->ff);
         }
     }
     if (a->flags != b->flags) {
@@ -89,9 +89,9 @@ static void compare_state(const reg_state *a, const reg_state *b)
 }
 
 #define LOADMM(r, o) "movq " #r ", " #o "[%0]\n\t"
-#define LOADXMM(r, o) "movdqa " #r ", " #o "[%0]\n\t"
+#define LOADYMM(r, o) "vmovdqa " #r ", " #o "[%0]\n\t"
 #define STOREMM(r, o) "movq " #o "[%1], " #r "\n\t"
-#define STOREXMM(r, o) "movdqa " #o "[%1], " #r "\n\t"
+#define STOREYMM(r, o) "vmovdqa " #o "[%1], " #r "\n\t"
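+/* vmovdqa faults on unaligned operands, hence aligned(32) on v4di above. */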
 #define MMREG(F) \
     F(mm0, 0x00) \
     F(mm1, 0x08) \
@@ -101,39 +101,39 @@ static void compare_state(const reg_state *a, const reg_state *b)
     F(mm5, 0x28) \
     F(mm6, 0x30) \
     F(mm7, 0x38)
-#define XMMREG(F) \
-    F(xmm0, 0x040) \
-    F(xmm1, 0x050) \
-    F(xmm2, 0x060) \
-    F(xmm3, 0x070) \
-    F(xmm4, 0x080) \
-    F(xmm5, 0x090) \
-    F(xmm6, 0x0a0) \
-    F(xmm7, 0x0b0) \
-    F(xmm8, 0x0c0) \
-    F(xmm9, 0x0d0) \
-    F(xmm10, 0x0e0) \
-    F(xmm11, 0x0f0) \
-    F(xmm12, 0x100) \
-    F(xmm13, 0x110) \
-    F(xmm14, 0x120) \
-    F(xmm15, 0x130)
+#define YMMREG(F) \
+    F(ymm0, 0x040) \
+    F(ymm1, 0x060) \
+    F(ymm2, 0x080) \
+    F(ymm3, 0x0a0) \
+    F(ymm4, 0x0c0) \
+    F(ymm5, 0x0e0) \
+    F(ymm6, 0x100) \
+    F(ymm7, 0x120) \
+    F(ymm8, 0x140) \
+    F(ymm9, 0x160) \
+    F(ymm10, 0x180) \
+    F(ymm11, 0x1a0) \
+    F(ymm12, 0x1c0) \
+    F(ymm13, 0x1e0) \
+    F(ymm14, 0x200) \
+    F(ymm15, 0x220)
 #define LOADREG(r, o) "mov " #r ", " #o "[rax]\n\t"
 #define STOREREG(r, o) "mov " #o "[rax], " #r "\n\t"
 #define REG(F) \
-    F(rbx, 0x148) \
-    F(rcx, 0x150) \
-    F(rdx, 0x158) \
-    F(rsi, 0x160) \
-    F(rdi, 0x168) \
-    F(r8, 0x180) \
-    F(r9, 0x188) \
-    F(r10, 0x190) \
-    F(r11, 0x198) \
-    F(r12, 0x1a0) \
-    F(r13, 0x1a8) \
-    F(r14, 0x1b0) \
-    F(r15, 0x1b8) \
+    F(rbx, 0x248) \
+    F(rcx, 0x250) \
+    F(rdx, 0x258) \
+    F(rsi, 0x260) \
+    F(rdi, 0x268) \
+    F(r8, 0x280) \
+    F(r9, 0x288) \
+    F(r10, 0x290) \
+    F(r11, 0x298) \
+    F(r12, 0x2a0) \
+    F(r13, 0x2a8) \
+    F(r14, 0x2b0) \
+    F(r15, 0x2b8) \
 
 static void run_test(const TestDef *t)
 {
@@ -143,7 +143,7 @@ static void run_test(const TestDef *t)
     printf("%5d %s\n", t->n, t->s);
     asm volatile(
             MMREG(LOADMM)
-            XMMREG(LOADXMM)
+            YMMREG(LOADYMM)
             "sub rsp, 128\n\t"
             "push rax\n\t"
             "push rbx\n\t"
@@ -156,26 +156,26 @@ static void run_test(const TestDef *t)
             "pop rbx\n\t"
             "shr rbx, 8\n\t"
             "shl rbx, 8\n\t"
-            "mov rcx, 0x1c0[rax]\n\t"
+            "mov rcx, 0x2c0[rax]\n\t"
             "and rcx, 0xff\n\t"
             "or rbx, rcx\n\t"
             "push rbx\n\t"
             "popf\n\t"
             REG(LOADREG)
-            "mov rax, 0x140[rax]\n\t"
+            "mov rax, 0x240[rax]\n\t"
             "call [rsp]\n\t"
             "mov [rsp], rax\n\t"
             "mov rax, 8[rsp]\n\t"
             REG(STOREREG)
             "mov rbx, [rsp]\n\t"
-            "mov 0x140[rax], rbx\n\t"
+            "mov 0x240[rax], rbx\n\t"
             "mov rbx, 0\n\t"
-            "mov 0x170[rax], rbx\n\t"
-            "mov 0x178[rax], rbx\n\t"
+            "mov 0x270[rax], rbx\n\t"
+            "mov 0x278[rax], rbx\n\t"
             "pushf\n\t"
             "pop rbx\n\t"
             "and rbx, 0xff\n\t"
-            "mov 0x1c0[rax], rbx\n\t"
+            "mov 0x2c0[rax], rbx\n\t"
             "add rsp, 16\n\t"
             "pop rdx\n\t"
             "pop rcx\n\t"
@@ -183,15 +183,15 @@ static void run_test(const TestDef *t)
             "pop rax\n\t"
             "add rsp, 128\n\t"
             MMREG(STOREMM)
-            XMMREG(STOREXMM)
+            YMMREG(STOREYMM)
             : : "r"(init), "r"(&result), "r"(t->fn)
             : "memory", "cc",
             "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
             "mm0", "mm1", "mm2", "mm3", "mm4", "mm5", "mm6", "mm7",
-            "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5",
-            "xmm6", "xmm7", "xmm8", "xmm9", "xmm10", "xmm11",
-            "xmm12", "xmm13", "xmm14", "xmm15"
+            "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5",
+            "ymm6", "ymm7", "ymm8", "ymm9", "ymm10", "ymm11",
+            "ymm12", "ymm13", "ymm14", "ymm15"
             );
     compare_state(init, &result);
 }
@@ -223,22 +223,30 @@ static void run_all(void)
 
 float val_f32[] = {2.0, -1.0, 4.8, 0.8, 3, -42.0, 5e6, 7.5, 8.3};
 double val_f64[] = {2.0, -1.0, 4.8, 0.8, 3, -42.0, 5e6, 7.5};
-v2di val_i64[] = {
-    {0x3d6b3b6a9e4118f2lu, 0x355ae76d2774d78clu},
-    {0xd851c54a56bf1f29lu, 0x4a84d1d50bf4c4fflu},
-    {0x5826475e2c5fd799lu, 0xfd32edc01243f5e9lu},
+v4di val_i64[] = {
+    {0x3d6b3b6a9e4118f2lu, 0x355ae76d2774d78clu,
+     0xac3ff76c4daa4b28lu, 0xe7fabd204cb54083lu},
+    {0xd851c54a56bf1f29lu, 0x4a84d1d50bf4c4fflu,
+     0x56621e553d52b56clu, 0xd0069553da8f584alu},
+    {0x5826475e2c5fd799lu, 0xfd32edc01243f5e9lu,
+     0x738ba2c66d3fe126lu, 0x5707219c6e6c26b4lu},
 };
 
-v2di deadbeef = {0xa5a5a5a5deadbeefull, 0xa5a5a5a5deadbeefull};
-v2di indexq = {0x000000000000001full, 0x000000000000008full};
-v2di indexd = {0x00000002000000efull, 0xfffffff500000010ull};
+v4di deadbeef = {0xa5a5a5a5deadbeefull, 0xa5a5a5a5deadbeefull,
+                 0xa5a5a5a5deadbeefull, 0xa5a5a5a5deadbeefull};
+v4di indexq = {0x000000000000001full, 0x000000000000008full,
+               0xffffffffffffffffull, 0xffffffffffffff5full};
+v4di indexd = {0x00000002000000efull, 0xfffffff500000010ull,
+               0x0000000afffffff0ull, 0x000000000000000eull};
 
-void init_f32reg(v2di *r)
+v4di gather_mem[0x20];
+
+void init_f32reg(v4di *r)
 {
     static int n;
-    float v[4];
+    float v[8];
     int i;
-    for (i = 0; i < 4; i++) {
+    for (i = 0; i < 8; i++) {
         v[i] = val_f32[n++];
         if (n == ARRAY_LEN(val_f32)) {
             n = 0;
@@ -247,12 +255,12 @@ void init_f32reg(v2di *r)
     memcpy(r, v, sizeof(*r));
 }
 
-void init_f64reg(v2di *r)
+void init_f64reg(v4di *r)
 {
     static int n;
-    double v[2];
+    double v[4];
     int i;
-    for (i = 0; i < 2; i++) {
+    for (i = 0; i < 4; i++) {
         v[i] = val_f64[n++];
         if (n == ARRAY_LEN(val_f64)) {
             n = 0;
@@ -261,13 +269,15 @@ void init_f64reg(v2di *r)
     memcpy(r, v, sizeof(*r));
 }
 
-void init_intreg(v2di *r)
+void init_intreg(v4di *r)
 {
     static uint64_t mask;
     static int n;
 
     r->q0 = val_i64[n].q0 ^ mask;
     r->q1 = val_i64[n].q1 ^ mask;
+    r->q2 = val_i64[n].q2 ^ mask;
+    r->q3 = val_i64[n].q3 ^ mask;
     n++;
     if (n == ARRAY_LEN(val_i64)) {
         n = 0;
@@ -280,46 +290,53 @@ static void init_all(reg_state *s)
     int i;
 
     s->r[3] = (uint64_t)&s->mem[0]; /* rdx */
+    s->r[4] = (uint64_t)&gather_mem[ARRAY_LEN(gather_mem) / 2]; /* rsi */
     s->r[5] = (uint64_t)&s->mem[2]; /* rdi */
     s->flags = 2;
-    for (i = 0; i < 8; i++) {
-        s->xmm[i] = deadbeef;
+    for (i = 0; i < 16; i++) {
+        s->ymm[i] = deadbeef;
     }
-    s->xmm[13] = indexd;
-    s->xmm[14] = indexq;
-    for (i = 0; i < 2; i++) {
+    s->ymm[13] = indexd;
+    s->ymm[14] = indexq;
+    for (i = 0; i < 4; i++) {
         s->mem0[i] = deadbeef;
     }
 }
 
 int main(int argc, char *argv[])
 {
+    int i;
+
     init_all(&initI);
-    init_intreg(&initI.xmm[10]);
-    init_intreg(&initI.xmm[11]);
-    init_intreg(&initI.xmm[12]);
+    init_intreg(&initI.ymm[10]);
+    init_intreg(&initI.ymm[11]);
+    init_intreg(&initI.ymm[12]);
     init_intreg(&initI.mem0[1]);
     printf("Int:\n");
     dump_regs(&initI);
 
     init_all(&initF32);
-    init_f32reg(&initF32.xmm[10]);
-    init_f32reg(&initF32.xmm[11]);
-    init_f32reg(&initF32.xmm[12]);
+    init_f32reg(&initF32.ymm[10]);
+    init_f32reg(&initF32.ymm[11]);
+    init_f32reg(&initF32.ymm[12]);
     init_f32reg(&initF32.mem0[1]);
     initF32.ff = 32;
     printf("F32:\n");
     dump_regs(&initF32);
 
     init_all(&initF64);
-    init_f64reg(&initF64.xmm[10]);
-    init_f64reg(&initF64.xmm[11]);
-    init_f64reg(&initF64.xmm[12]);
+    init_f64reg(&initF64.ymm[10]);
+    init_f64reg(&initF64.ymm[11]);
+    init_f64reg(&initF64.ymm[12]);
     init_f64reg(&initF64.mem0[1]);
     initF64.ff = 64;
     printf("F64:\n");
     dump_regs(&initF64);
 
+    for (i = 0; i < ARRAY_LEN(gather_mem); i++) {
+        init_intreg(&gather_mem[i]);
+    }
+
     if (argc > 1) {
         int n = atoi(argv[1]);
         run_test(&test_table[n]);
diff --git a/tests/tcg/i386/test-avx.py b/tests/tcg/i386/test-avx.py
index e16a3d8bee..cff3aed138 100755
--- a/tests/tcg/i386/test-avx.py
+++ b/tests/tcg/i386/test-avx.py
@@ -8,6 +8,7 @@
 
 archs = [
     "SSE", "SSE2", "SSE3", "SSSE3", "SSE4_1", "SSE4_2",
+    "AVX", "AVX2", "AES+AVX", # "VAES+AVX",
 ]
 
 ignore = set(["FISTTP",
@@ -85,7 +86,7 @@ def mem_w(w):
     else:
         raise Exception()
 
-    return t + " PTR 16[rdx]"
+    return t + " PTR 32[rdx]"
 
 class XMMArg():
     isxmm = True
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 36/37] target/i386: move 3DNow completely out of gen_sse
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (34 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 35/37] tests/tcg: extend SSE tests to AVX Paolo Bonzini
@ 2022-09-11 23:04 ` Paolo Bonzini
  2022-09-13 10:34   ` Richard Henderson
  2022-09-13 10:39 ` [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Richard Henderson
  36 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-11 23:04 UTC (permalink / raw)
  To: qemu-devel

Everything else has been converted to the new decoder, so separate the
part that survives.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 target/i386/tcg/translate.c | 104 +++++++++++++++++++++++-------------
 1 file changed, 68 insertions(+), 36 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index f312663110..0783b1e7ee 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2918,7 +2918,6 @@ static bool first = true; static unsigned long limit;
 #define SSE_OPF_CMP       (1 << 1) /* does not write for first operand */
 #define SSE_OPF_BLENDV    (1 << 2) /* blendv* instruction */
 #define SSE_OPF_SPECIAL   (1 << 3) /* magic */
-#define SSE_OPF_3DNOW     (1 << 4) /* 3DNow! instruction */
 #define SSE_OPF_MMX       (1 << 5) /* MMX/integer/AVX2 instruction */
 #define SSE_OPF_SCALAR    (1 << 6) /* Has SSE scalar variants */
 #define SSE_OPF_SHUF      (1 << 9) /* pshufx/shufpx */
@@ -2952,13 +2951,9 @@ struct SSEOpHelper_table1 {
     SSEFuncs fn[4];
 };
 
-#define SSE_3DNOW { SSE_OPF_3DNOW }
 #define SSE_SPECIAL { SSE_OPF_SPECIAL }
 
 static const struct SSEOpHelper_table1 sse_op_table1[256] = {
-    /* 3DNow! extensions */
-    [0x0e] = SSE_SPECIAL, /* femms */
-    [0x0f] = SSE_3DNOW, /* pf... (sse_op_table5) */
     /* pure SSE operations */
     [0x10] = SSE_SPECIAL, /* movups, movupd, movss, movsd */
     [0x11] = SSE_SPECIAL, /* movups, movupd, movss, movsd */
@@ -3172,7 +3167,7 @@ static void gen_helper_pavgusb(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b)
     gen_helper_pavgb_mmx(env, reg_a, reg_a, reg_b);
 }
 
-static const SSEFunc_0_epp sse_op_table5[256] = {
+static const SSEFunc_0_epp op_3dnow[256] = {
     [0x0c] = gen_helper_pi2fw,
     [0x0d] = gen_helper_pi2fd,
     [0x1c] = gen_helper_pf2iw,
@@ -3351,7 +3346,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
         b1 = 0;
     sse_op_flags = sse_op_table1[b].flags;
     sse_op_fn = sse_op_table1[b].fn[b1];
-    if ((sse_op_flags & (SSE_OPF_SPECIAL | SSE_OPF_3DNOW)) == 0
+    if ((sse_op_flags & SSE_OPF_SPECIAL) == 0
             && !sse_op_fn.op1) {
         goto unknown_op;
     }
@@ -3365,11 +3360,6 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
             is_xmm = 1;
         }
     }
-    if (sse_op_flags & SSE_OPF_3DNOW) {
-        if (!(s->cpuid_ext2_features & CPUID_EXT2_3DNOW)) {
-            goto illegal_op;
-        }
-    }
     /* simple MMX/SSE operation */
     if (s->flags & HF_TS_MASK) {
         gen_exception(s, EXCP07_PREX, pc_start - s->cs_base);
@@ -3385,15 +3375,6 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
         && (b != 0x38 && b != 0x3a)) {
         goto unknown_op;
     }
-    if (b == 0x0e) {
-        if (!(s->cpuid_ext2_features & CPUID_EXT2_3DNOW)) {
-            /* If we were fully decoding this we might use illegal_op.  */
-            goto unknown_op;
-        }
-        /* femms */
-        gen_helper_emms(cpu_env);
-        return;
-    }
     if (b == 0x77) {
         /* emms */
         gen_helper_emms(cpu_env);
@@ -4536,18 +4517,6 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
                 rm = (modrm & 7);
                 op2_offset = offsetof(CPUX86State,fpregs[rm].mmx);
             }
-            if (sse_op_flags & SSE_OPF_3DNOW) {
-                /* 3DNow! data insns */
-                val = x86_ldub_code(env, s);
-                SSEFunc_0_epp op_3dnow = sse_op_table5[val];
-                if (!op_3dnow) {
-                    goto unknown_op;
-                }
-                tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
-                tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
-                op_3dnow(cpu_env, s->ptr0, s->ptr1);
-                return;
-            }
         }
 
 
@@ -4598,6 +4567,70 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
     }
 }
 
+static void gen_3dnow(CPUX86State *env, DisasContext *s, int b,
+                      target_ulong pc_start)
+{
+    int op1_offset, op2_offset, val;
+    int modrm, mod, rm, reg;
+    SSEFunc_0_epp fn;
+
+    /* simple MMX/SSE operation */
+    if (s->flags & HF_TS_MASK) {
+        gen_exception(s, EXCP07_PREX, pc_start - s->cs_base);
+        return;
+    }
+    if (s->flags & HF_EM_MASK) {
+        goto illegal_op;
+    }
+    if (b == 0x10e) {
+        if (!(s->cpuid_ext2_features & CPUID_EXT2_3DNOW)) {
+            /* If we were fully decoding this we might use illegal_op.  */
+            goto unknown_op;
+        }
+        /* femms */
+        gen_helper_emms(cpu_env);
+        return;
+    }
+
+    if (!(s->cpuid_ext2_features & CPUID_EXT2_3DNOW)) {
+        goto illegal_op;
+    }
+
+    gen_helper_enter_mmx(cpu_env);
+
+    modrm = x86_ldub_code(env, s);
+    reg = ((modrm >> 3) & 7);
+    mod = (modrm >> 6) & 3;
+
+    op1_offset = offsetof(CPUX86State,fpregs[reg].mmx);
+    if (mod != 3) {
+        gen_lea_modrm(env, s, modrm);
+        op2_offset = offsetof(CPUX86State,mmx_t0);
+        gen_ldq_env_A0(s, op2_offset);
+    } else {
+        rm = (modrm & 7);
+        op2_offset = offsetof(CPUX86State,fpregs[rm].mmx);
+    }
+
+    val = x86_ldub_code(env, s);
+    fn = op_3dnow[val];
+    if (!fn) {
+        goto unknown_op;
+    }
+    tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
+    tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
+    fn(cpu_env, s->ptr0, s->ptr1);
+    return;
+
+illegal_op:
+    gen_illegal_opcode(s);
+    return;
+
+unknown_op:
+    gen_unknown_opcode(env, s);
+}
+
 /* convert one instruction. s->base.is_jmp is set if the translation must
    be stopped. Return the next pc value */
 static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
@@ -8505,9 +8538,8 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
         set_cc_op(s, CC_OP_POPCNT);
         break;
     case 0x10e ... 0x10f:
-        /* 3DNow! instructions, ignore prefixes */
-        s->prefix &= ~(PREFIX_REPZ | PREFIX_REPNZ | PREFIX_DATA);
-        /* fall through */
+        gen_3dnow(env, s, b, pc_start);
+        break;
     case 0x110 ... 0x117:
     case 0x128 ... 0x12f:
     case 0x138 ... 0x13a:
-- 
2.37.2




^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 02/37] target/i386: make ldo/sto operations consistent with ldq
  2022-09-11 23:03 ` [PATCH 02/37] target/i386: make ldo/sto operations consistent with ldq Paolo Bonzini
@ 2022-09-12  8:33   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12  8:33 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> ldq takes a pointer to the first byte to load the 64-bit word in;
> ldo takes a pointer to the first byte of the ZMMReg.  Make them
> consistent, which will be useful in the new SSE decoder's
> load/writeback routines.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   target/i386/tcg/translate.c | 44 +++++++++++++++++++------------------
>   1 file changed, 23 insertions(+), 21 deletions(-)
> 
> diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
> index 001af76663..9a85010dcd 100644
> --- a/target/i386/tcg/translate.c
> +++ b/target/i386/tcg/translate.c
> @@ -2761,28 +2761,29 @@ static inline void gen_ldo_env_A0(DisasContext *s, int offset)
>   {
>       int mem_index = s->mem_index;
>       tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, mem_index, MO_LEUQ);
> -    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(0)));
> +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(0)));
>       tcg_gen_addi_tl(s->tmp0, s->A0, 8);
>       tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
> -    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(1)));
> +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(1)));
>   }
>   
>   static inline void gen_sto_env_A0(DisasContext *s, int offset)
>   {
>       int mem_index = s->mem_index;
> -    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(0)));
> +    offset -= offsetof(ZMMReg, ZMM_Q(0));
> +    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(XMMReg, XMM_Q(0)));

What is this subtract?  You don't have it for ldo or movo, and it looks wrong.

The rest of it looks ok.


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 03/37] target/i386: REPZ and REPNZ are mutually exclusive
  2022-09-11 23:03 ` [PATCH 03/37] target/i386: REPZ and REPNZ are mutually exclusive Paolo Bonzini
@ 2022-09-12  8:37   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12  8:37 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> The later prefix wins if both are present, make it show in s->prefix too.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/tcg/translate.c | 2 ++
>   1 file changed, 2 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 04/37] target/i386: introduce insn_get_addr
  2022-09-11 23:03 ` [PATCH 04/37] target/i386: introduce insn_get_addr Paolo Bonzini
@ 2022-09-12  8:39   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12  8:39 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> The "O" operand type in the Intel SDM needs to load an 8- to 64-bit
> unsigned value, while insn_get is limited to 32 bits.  Extract the code
> out of disas_insn and into a separate function.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   target/i386/tcg/translate.c | 36 ++++++++++++++++++++++++++----------
>   1 file changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
> index f8fd93dae0..f1aa830fcc 100644
> --- a/target/i386/tcg/translate.c
> +++ b/target/i386/tcg/translate.c
> @@ -2308,6 +2308,31 @@ static void gen_ldst_modrm(CPUX86State *env, DisasContext *s, int modrm,
>       }
>   }
>   
> +static inline target_ulong insn_get_addr(CPUX86State *env, DisasContext *s, MemOp ot)

No need for inline.

> +    default:
> +        tcg_abort();

Standardize on g_assert_not_reached().
I should probably zap this macro entirely...

Otherwise,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 05/37] target/i386: add core of new i386 decoder
  2022-09-11 23:03 ` [PATCH 05/37] target/i386: add core of new i386 decoder Paolo Bonzini
@ 2022-09-12  9:27   ` Richard Henderson
  2022-09-12 10:54   ` Richard Henderson
  1 sibling, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12  9:27 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> +    case X86_TYPE_B:  /* VEX.vvvv selects a GPR */
> +        op->unit = X86_OP_INT;
> +        op->n = s->vex_v;
> +        break;

Could use a comment for where missing vex prefix is diagnosed.
I guess it's one of the "vexN" group markers in the insn table?

> +    case X86_TYPE_S:  /* reg selects a segment register */
> +        op->unit = X86_OP_SEG;
> +        goto get_reg;
> +
> +        goto get_reg;

Stray goto.

> +
> +    case X86_TYPE_V:  /* reg in the modrm byte selects an XMM/YMM register */
> +        if (decode->e.special == X86_SPECIAL_MMX &&
> +            !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
> +    case X86_TYPE_P:  /* reg in the modrm byte selects an MMX register */
> +            op->unit = X86_OP_MMX;
> +        } else {
> +            op->unit = X86_OP_SSE;
> +        }
> +    get_reg:

Nesting P into the if works, but it's ugly.
Better to separate it out as

     case X86_TYPE_P:
         op->unit = X86_OP_MMX;
         goto get_reg;

> +    case X86_TYPE_W:  /* XMM/YMM modrm operand */
> +        if (decode->e.special == X86_SPECIAL_MMX &&
> +            !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
> +    case X86_TYPE_Q:  /* MMX modrm operand */
> +            op->unit = X86_OP_MMX;
> +        } else {
> +            op->unit = X86_OP_SSE;
> +        }
> +        goto get_modrm;

Likewise.

> +    case X86_TYPE_U:  /* R/M in the modrm byte selects an XMM/YMM register */
> +        if (decode->e.special == X86_SPECIAL_MMX &&
> +            !(s->prefix & (PREFIX_DATA | PREFIX_REPZ | PREFIX_REPNZ))) {
> +    case X86_TYPE_N:  /* R/M in the modrm byte selects an MMX register */
> +            op->unit = X86_OP_MMX;
> +        } else {
> +            op->unit = X86_OP_SSE;
> +        }
> +        goto get_modrm_reg;

Likewise.

> +    case X86_TYPE_H:  /* For AVX, VEX.vvvv selects an XMM/YMM register */
> +        if ((s->prefix & PREFIX_VEX)) {
> +            op->unit = X86_OP_SSE;
> +            op->n = s->vex_v;
> +            break;

Similar to X86_TYPE_B, should this diagnose error if missing VEX?

> +  e X86_TYPE_J:  /* Relative offset for a jump */
> +        op->unit = X86_OP_IMM;
> +        decode->immediate = insn_get_signed(env, s, op->ot);

Mailer damage?

> +        decode->immediate += s->pc - s->cs_base;

Please consider

https://lore.kernel.org/qemu-devel/20220906100932.343523-1-richard.henderson@linaro.org/

or at least the first half of the patch set, which rationalizes and consolidates the 
handing of s->cs_base.

> +    default:
> +        abort();

g_assert_not_reached().

> +static bool decode_insn(DisasContext *s, CPUX86State *env, X86DecodeFunc decode_func,
> +                        X86DecodedInsn *decode)
> +{
> +    X86OpEntry *e = &decode->e;
> +
> +    decode_func(s, env, e, &decode->b);
> +    while (e->is_decode) {
> +        e->is_decode = false;
> +        e->decode(s, env, e, &decode->b);
> +    }
> +
> +    /* First compute size of operands in order to initialize s->rip_offset.  */
> +    if (e->op0 != X86_TYPE_None) {
> +        if (!decode_op_size(s, e, e->s0, &decode->op[0].ot)) {
> +            return false;
> +        }
> +        if (e->op0 == X86_TYPE_I) {
> +            s->rip_offset += 1 << decode->op[0].ot;
> +        }
> +    }
> +    if (e->op1 != X86_TYPE_None) {
> +        if (!decode_op_size(s, e, e->s1, &decode->op[1].ot)) {
> +            return false;
> +        }
> +        if (e->op1 == X86_TYPE_I) {
> +            s->rip_offset += 1 << decode->op[1].ot;
> +        }
> +    }
> +    if (e->op2 != X86_TYPE_None) {
> +        if (!decode_op_size(s, e, e->s2, &decode->op[2].ot)) {
> +            return false;
> +        }
> +        if (e->op2 == X86_TYPE_I) {
> +            s->rip_offset += 1 << decode->op[2].ot;
> +        }
> +    }
> +    if (e->op3 != X86_TYPE_None) {
> +        assert(e->op3 == X86_TYPE_I && e->s3 == X86_SIZE_b);
> +        s->rip_offset += 1;
> +    }
> +
> +    if (e->op0 != X86_TYPE_None &&
> +        !decode_op(s, env, decode, &decode->op[0], e->op0, decode->b)) {
> +        return false;
> +    }
> +
> +    if (e->op1 != X86_TYPE_None &&
> +        !decode_op(s, env, decode, &decode->op[1], e->op1, decode->b)) {
> +        return false;
> +    }
> +
> +    if (e->op2 != X86_TYPE_None &&
> +        !decode_op(s, env, decode, &decode->op[2], e->op2, decode->b)) {
> +        return false;
> +    }
> +
> +    if (e->op3 != X86_TYPE_None) {
> +        decode->immediate = insn_get_signed(env, s, MO_8);
> +    }
> +
> +    return true;
> +}
> +
> +/* convert one instruction. s->base.is_jmp is set if the translation must
> +   be stopped. Return the next pc value */
> +static target_ulong disas_insn_new(DisasContext *s, CPUState *cpu, int b)

Note patch 2 from the cs_base cleanup above changes the return type from disas_insn to bool.

> +{
> +    CPUX86State *env = cpu->env_ptr;
> +    bool first = true;
> +    X86DecodedInsn decode;
> +    X86DecodeFunc decode_func = decode_root;
> +
> +#ifdef CONFIG_USER_ONLY
> +    if (limit) { --limit; }
> +#endif
> +    s->has_modrm = false;
> +#if 0
> +    s->pc_start = s->pc = s->base.pc_next;
> +    s->override = -1;
> +#ifdef TARGET_X86_64
> +    s->rex_w = false;
> +    s->rex_r = 0;
> +    s->rex_x = 0;
> +    s->rex_b = 0;
> +#endif
> +    s->prefix = 0;
> +    s->rip_offset = 0; /* for relative ip address */
> +    s->vex_l = 0;
> +    s->vex_v = 0;
> +    if (sigsetjmp(s->jmpbuf, 0) != 0) {
> +        gen_exception_gpf(s);
> +        return s->pc;
> +    }

Mainline has two longjmp error paths:
(1) insn too long: raise #GP,
(2) insn crosses page boundary, and isn't first in the TB:
     undo processing and defer insn to next TB.
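
From memory, the sigsetjmp site then has roughly this shape (a sketch only,
not exact mainline code):

     switch (sigsetjmp(s->jmpbuf, 0)) {
     case 0:
         break;
     case 1:
         /* (1) instruction longer than 15 bytes */
         gen_exception_gpf(s);
         return s->pc;
     case 2:
         /* (2) instruction crosses a page boundary and is not the first
            in the TB: back out and retranslate it as the first
            instruction of the next TB.  */
         break;
     default:
         g_assert_not_reached();
     }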

> +static inline target_long insn_get_signed(CPUX86State *env, DisasContext *s, MemOp ot)

No need for inline.


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 06/37] target/i386: add ALU load/writeback core
  2022-09-11 23:03 ` [PATCH 06/37] target/i386: add ALU load/writeback core Paolo Bonzini
@ 2022-09-12 10:02   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 10:02 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> Add generic code generation that takes care of preparing operands
> around calls to decode.e.gen in a table-driven manner, so that ALU
> operations need not take care of that.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   target/i386/tcg/decode-new.c.inc |  20 +++-
>   target/i386/tcg/decode-new.h     |   1 +
>   target/i386/tcg/emit.c.inc       | 152 +++++++++++++++++++++++++++++++
>   target/i386/tcg/translate.c      |  24 +++++
>   4 files changed, 195 insertions(+), 2 deletions(-)
> 
> diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
> index de8ef51a2d..7f76051b2d 100644
> --- a/target/i386/tcg/decode-new.c.inc
> +++ b/target/i386/tcg/decode-new.c.inc
> @@ -228,7 +228,7 @@ static bool decode_op_size(DisasContext *s, X86OpEntry *e, X86OpSize size, MemOp
>               *ot = MO_64;
>               return true;
>           }
> -        if (s->vex_l && e->s0 != X86_SIZE_qq) {
> +        if (s->vex_l && e->s0 != X86_SIZE_qq && e->s1 != X86_SIZE_qq) {
>               return false;
>           }

Squash back?

> diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
> index e86364ffc1..6fa0062d6a 100644
> --- a/target/i386/tcg/emit.c.inc
> +++ b/target/i386/tcg/emit.c.inc
> @@ -29,3 +29,155 @@ static void gen_load_ea(DisasContext *s, AddressParts *mem)
>       TCGv ea = gen_lea_modrm_1(s, *mem);
>       gen_lea_v_seg(s, s->aflag, ea, mem->def_seg, s->override);
>   }
> +
> +static void gen_mmx_offset(TCGv_ptr ptr, X86DecodedOp *op)
> +{
> +    if (!op->has_ea) {
> +        op->offset = offsetof(CPUX86State, fpregs[op->n].mmx);
> +    } else {
> +        op->offset = offsetof(CPUX86State, mmx_t0);
> +    }
> +    tcg_gen_addi_ptr(ptr, cpu_env, op->offset);

It's a shame to generate this so early, when you don't know if you'll need it. Better to 
build these in the gen_binary_int_sse helper, immediately before they're required?

> +
> +    /*
> +     * ptr is for passing to helpers, and points to the MMXReg; op->offset
> +     * is for TCG ops and points to the operand.
> +     */
> +    if (op->ot == MO_32) {
> +        op->offset += offsetof(MMXReg, MMX_L(0));
> +    }

I guess you'd need an op->offset_base if you do the above...
Switch and g_assert_not_reached on invalid ot?
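
E.g. (a sketch; gen_binary_int_sse is from a later patch, and op->offset_base
is the hypothetical field mentioned above):

static void gen_binary_int_sse(DisasContext *s, CPUX86State *env,
                               X86DecodedInsn *decode, SSEFunc_0_eppp fn)
{
    tcg_gen_addi_ptr(s->ptr0, cpu_env, decode->op[0].offset_base);
    tcg_gen_addi_ptr(s->ptr1, cpu_env, decode->op[1].offset_base);
    tcg_gen_addi_ptr(s->ptr2, cpu_env, decode->op[2].offset_base);
    fn(cpu_env, s->ptr0, s->ptr1, s->ptr2);
}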

> +static int xmm_offset(MemOp ot)
> +{
> +    if (ot == MO_8) {
> +        return offsetof(ZMMReg, ZMM_B(0));
> +    } else if (ot == MO_16) {
> +        return offsetof(ZMMReg, ZMM_W(0));
> +    } else if (ot == MO_32) {
> +        return offsetof(ZMMReg, ZMM_L(0));
> +    } else if (ot == MO_64) {
> +        return offsetof(ZMMReg, ZMM_Q(0));
> +    } else if (ot == MO_128) {
> +        return offsetof(ZMMReg, ZMM_X(0));
> +    } else if (ot == MO_256) {
> +        return offsetof(ZMMReg, ZMM_Y(0));
> +    } else {
> +       abort();

Switch, g_assert_not_reached().
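
I.e. the same chain, restructured:

static int xmm_offset(MemOp ot)
{
    switch (ot) {
    case MO_8:
        return offsetof(ZMMReg, ZMM_B(0));
    case MO_16:
        return offsetof(ZMMReg, ZMM_W(0));
    case MO_32:
        return offsetof(ZMMReg, ZMM_L(0));
    case MO_64:
        return offsetof(ZMMReg, ZMM_Q(0));
    case MO_128:
        return offsetof(ZMMReg, ZMM_X(0));
    case MO_256:
        return offsetof(ZMMReg, ZMM_Y(0));
    default:
        g_assert_not_reached();
    }
}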

> +static void gen_load_sse(DisasContext *s, TCGv temp, MemOp ot, int dest_ofs)
> +{
> +    if (ot == MO_8) {
> +        gen_op_ld_v(s, MO_8, temp, s->A0);
> +        tcg_gen_st8_tl(temp, cpu_env, dest_ofs);
> +    } else if (ot == MO_16) {
> +        gen_op_ld_v(s, MO_16, temp, s->A0);
> +        tcg_gen_st16_tl(temp, cpu_env, dest_ofs);
> +    } else if (ot == MO_32) {
> +        gen_op_ld_v(s, MO_32, temp, s->A0);
> +        tcg_gen_st32_tl(temp, cpu_env, dest_ofs);
> +    } else if (ot == MO_64) {
> +        gen_ldq_env_A0(s, dest_ofs);
> +    } else if (ot == MO_128) {
> +        gen_ldo_env_A0(s, dest_ofs);
> +    } else if (ot == MO_256) {
> +        gen_ldy_env_A0(s, dest_ofs);
> +    }

Likewise.

> +static void gen_writeback(DisasContext *s, X86DecodedOp *op)
> +{
> +    switch (op->unit) {
> +    case X86_OP_SKIP:
> +        break;
> +    case X86_OP_SEG:
> +        /* Note that reg == R_SS in gen_movl_seg_T0 always sets is_jmp.  */
> +        gen_movl_seg_T0(s, op->n);
> +        if (s->base.is_jmp) {
> +            gen_jmp_im(s, s->pc - s->cs_base);
> +            if (op->n == R_SS) {
> +                s->flags &= ~HF_TF_MASK;
> +                gen_eob_inhibit_irq(s, true);
> +            } else {
> +                gen_eob(s);
> +            }
> +        }
> +        break;
> +    case X86_OP_CR:
> +    case X86_OP_DR:
> +        /* TBD */
> +        break;

Leave these adjacent with default abort until needed?

> +    default:
> +        abort();
> +    }

g_assert_not_reached.

> +static inline void gen_ldy_env_A0(DisasContext *s, int offset)
> +{
> +    int mem_index = s->mem_index;
> +    gen_ldo_env_A0(s, offset);
> +    tcg_gen_addi_tl(s->tmp0, s->A0, 16);
> +    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
> +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(2)));
> +    tcg_gen_addi_tl(s->tmp0, s->A0, 24);
> +    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
> +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(3)));
> +}
> +
> +static inline void gen_sty_env_A0(DisasContext *s, int offset)
> +{
> +    int mem_index = s->mem_index;
> +    gen_sto_env_A0(s, offset);
> +    tcg_gen_addi_tl(s->tmp0, s->A0, 16);
> +    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(2)));
> +    tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
> +    tcg_gen_addi_tl(s->tmp0, s->A0, 24);
> +    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(3)));
> +    tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
> +}

No need for inline markers.

Note that there's an outstanding patch set that enforces alignment restrictions (for 
ldy/sty it would only be for vmovdqa etc):

https://lore.kernel.org/qemu-devel/20220830034816.57091-2-ricky@rzhou.org/

but it's definitely something that ought to build into the new decoder from the start.


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 07/37] target/i386: add CPUID[EAX=7, ECX=0].ECX to DisasContext
  2022-09-11 23:03 ` [PATCH 07/37] target/i386: add CPUID[EAX=7, ECX=0].ECX to DisasContext Paolo Bonzini
@ 2022-09-12 10:02   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 10:02 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> TCG will shortly implement VAES instructions, so add the relevant feature
> word to the DisasContext.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/tcg/translate.c | 2 ++
>   1 file changed, 2 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 08/37] target/i386: add CPUID feature checks to new decoder
  2022-09-11 23:03 ` [PATCH 08/37] target/i386: add CPUID feature checks to new decoder Paolo Bonzini
@ 2022-09-12 10:05   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 10:05 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   target/i386/tcg/decode-new.c.inc | 51 ++++++++++++++++++++++++++++++++
>   target/i386/tcg/decode-new.h     | 20 +++++++++++++
>   2 files changed, 71 insertions(+)
> 
> diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
> index 7f76051b2d..a9b8b6c05f 100644
> --- a/target/i386/tcg/decode-new.c.inc
> +++ b/target/i386/tcg/decode-new.c.inc
> @@ -83,6 +83,7 @@
>   #define X86_OP_ENTRY0(op, ...)                                    \
>       X86_OP_ENTRY3(op, None, None, None, None, None, None, ## __VA_ARGS__)
>   
> +#define cpuid(feat) .cpuid = X86_FEAT_##feat,
>   #define i64 .special = X86_SPECIAL_i64,
>   #define o64 .special = X86_SPECIAL_o64,
>   #define xchg .special = X86_SPECIAL_Locked,
> @@ -506,6 +507,52 @@ static bool decode_insn(DisasContext *s, CPUX86State *env, X86DecodeFunc decode_
>       return true;
>   }
>   
> +static bool has_cpuid_feature(DisasContext *s, X86CPUIDFeature cpuid)
> +{
> +    switch (cpuid) {
> +    case X86_FEAT_None:
> +        return true;
> +    case X86_FEAT_MOVBE:
> +        return (s->cpuid_ext_features & CPUID_EXT_MOVBE);
> +    case X86_FEAT_PCLMULQDQ:
> +        return (s->cpuid_ext_features & CPUID_EXT_PCLMULQDQ);
> +    case X86_FEAT_SSE:
> +        return (s->cpuid_ext_features & CPUID_SSE);
> +    case X86_FEAT_SSE2:
> +        return (s->cpuid_ext_features & CPUID_SSE2);
> +    case X86_FEAT_SSE3:
> +        return (s->cpuid_ext_features & CPUID_EXT_SSE3);
> +    case X86_FEAT_SSSE3:
> +        return (s->cpuid_ext_features & CPUID_EXT_SSSE3);
> +    case X86_FEAT_SSE41:
> +        return (s->cpuid_ext_features & CPUID_EXT_SSE41);
> +    case X86_FEAT_SSE42:
> +        return (s->cpuid_ext_features & CPUID_EXT_SSE42);
> +    case X86_FEAT_AES:
> +        if (s->vex_l) {
> +            return (s->cpuid_7_0_ecx_features & CPUID_7_0_ECX_VAES);
> +        } else {
> +            return (s->cpuid_ext_features & CPUID_EXT_AES);
> +        }
> +    case X86_FEAT_AVX:
> +        return (s->cpuid_ext_features & CPUID_EXT_AVX);
> +
> +    case X86_FEAT_SSE4A:
> +        return (s->cpuid_ext3_features & CPUID_EXT3_SSE4A);
> +
> +    case X86_FEAT_ADX:
> +        return (s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_ADX);
> +    case X86_FEAT_BMI1:
> +        return (s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI1);
> +    case X86_FEAT_BMI2:
> +        return (s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_BMI2);
> +    case X86_FEAT_AVX2:
> +        return (s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_AVX2);
> +    default:
> +        abort();
> +    }

g_assert_not_reached().

I'll also note that for cases like this, where every case returns and every enumerator is
meant to be handled, it can be better to leave the default case outside the switch, so that
the compiler can -Werror early on missing enumerator entries instead of aborting at runtime.
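
I.e. (sketch, middle cases elided):

    switch (cpuid) {
    case X86_FEAT_None:
        return true;
    case X86_FEAT_MOVBE:
        return (s->cpuid_ext_features & CPUID_EXT_MOVBE);
    /* ... every other enumerator, each case returning ... */
    }
    g_assert_not_reached();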

Otherwise,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 09/37] target/i386: add AVX_EN hflag
  2022-09-11 23:03 ` [PATCH 09/37] target/i386: add AVX_EN hflag Paolo Bonzini
@ 2022-09-12 10:06   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 10:06 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: Paul Brook

On 9/12/22 00:03, Paolo Bonzini wrote:
> From: Paul Brook<paul@nowt.org>
> 
> Add a new hflag bit to determine whether AVX instructions are allowed
> 
> Signed-off-by: Paul Brook<paul@nowt.org>
> Message-Id:<20220424220204.2493824-4-paul@nowt.org>
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/cpu.h            |  3 +++
>   target/i386/helper.c         | 12 ++++++++++++
>   target/i386/tcg/fpu_helper.c |  1 +
>   3 files changed, 16 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 10/37] target/i386: validate VEX prefixes via the instructions' exception classes
  2022-09-11 23:03 ` [PATCH 10/37] target/i386: validate VEX prefixes via the instructions' exception classes Paolo Bonzini
@ 2022-09-12 10:39   ` Richard Henderson
  2022-09-12 10:42   ` Richard Henderson
  1 sibling, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 10:39 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> @@ -102,6 +107,25 @@ static void gen_load_sse(DisasContext *s, TCGv temp, MemOp ot, int dest_ofs)
> +static inline bool sse_needs_alignment(DisasContext *s, X86DecodedInsn *decode, X86DecodedOp *op)
> +{

Drop inline.  You may need to add G_GNUC_UNUSED temporarily, because it isn't used in
this patch...

> @@ -175,7 +199,13 @@ static void gen_writeback(DisasContext *s, X86DecodedOp *op)
>           }
>           break;
>       case X86_OP_MMX:
> +        break;
>       case X86_OP_SSE:
> +        if ((s->prefix & PREFIX_VEX) && op->ot == MO_128) {
> +            tcg_gen_gvec_dup_imm(MO_64,
> +                                 offsetof(CPUX86State, xmm_regs[op->n].ZMM_X(1)),
> +                                 16, 16, 0);
> +        }

So... gvec supports doing this zeroing within the operation.  E.g.

static void gen_PADDB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
{
     tcg_gen_gvec_add(MO_8, decode->op[0].offset,
                      decode->op[1].offset, decode->op[2].offset,
                      sse_vec_len(s, decode), sse_vec_len_max(s, decode));
}

The only catch is that gvec expects the zeroing to be at the end of the range, so this 
requires reorganizing ZMM for big-endian. Instead of reversing the entire ZMM register, we 
would keep only each 16-byte lane in host-endian order.  Like so:

   #if HOST_BIG_ENDIAN

- #define ZMM_B(n) _b_ZMMReg[63 - (n)]

+ #define ZMM_B(n) _b_ZMMReg[(n) ^ 15]

etc.

Ideally this zeroing above would move into each operation.  For our current set of 
helpers, it should be easy enough to do in gen_binary_int_sse and friends.


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 10/37] target/i386: validate VEX prefixes via the instructions' exception classes
  2022-09-11 23:03 ` [PATCH 10/37] target/i386: validate VEX prefixes via the instructions' exception classes Paolo Bonzini
  2022-09-12 10:39   ` Richard Henderson
@ 2022-09-12 10:42   ` Richard Henderson
  1 sibling, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 10:42 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> @@ -180,6 +210,8 @@ struct X86OpEntry {
>   
>       X86InsnSpecial special : 8;
>       X86CPUIDFeature cpuid : 8;
> +    uint8_t      vex_class : 8;

Since uint8_t expands to 'unsigned char', it's friendlier to use 'unsigned' with 
bitfields, so that gdb doesn't try to render the field as a character.
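
I.e.:

    unsigned     vex_class : 8;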


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 11/37] target/i386: validate SSE prefixes directly in the decoding table
  2022-09-11 23:03 ` [PATCH 11/37] target/i386: validate SSE prefixes directly in the decoding table Paolo Bonzini
@ 2022-09-12 10:51   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 10:51 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> Many SSE and AVX instructions are only valid with specific prefixes
> (none, 66, F3, F2).  Introduce a direct way to encode this in the
> decoding table to avoid using decode groups too much.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   target/i386/tcg/decode-new.c.inc | 37 ++++++++++++++++++++++++++++++++
>   target/i386/tcg/decode-new.h     |  1 +
>   2 files changed, 38 insertions(+)
> 
> diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
> index f6c032c694..7b4fd9fb54 100644
> --- a/target/i386/tcg/decode-new.c.inc
> +++ b/target/i386/tcg/decode-new.c.inc
> @@ -108,6 +108,22 @@
>   
>   #define avx2_256 .vex_special = X86_VEX_AVX2_256,
>   
> +#define P_00          1
> +#define P_66          (1 << PREFIX_DATA)
> +#define P_F3          (1 << PREFIX_REPZ)
> +#define P_F2          (1 << PREFIX_REPNZ)

These prefixes are already flags.  Do you really need to shift the shifted value?
I guess you need to choose a value for "no prefix", but I think you could also (ab,re)use 
PREFIX_LOCK or something...
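
Something like this, perhaps (a sketch of the reuse; whether PREFIX_LOCK is
acceptable as the "no prefix" stand-in is the open question):

#define P_00          PREFIX_LOCK   /* cannot be set on a valid SSE insn */
#define P_66          PREFIX_DATA
#define P_F3          PREFIX_REPZ
#define P_F2          PREFIX_REPNZ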

> @@ -212,6 +212,7 @@ struct X86OpEntry {
>       X86CPUIDFeature cpuid : 8;
>       uint8_t      vex_class : 8;
>       X86VEXSpecial vex_special : 8;
> +    uint16_t     valid_prefix : 16;

Anyway, if you did, you'd only need 4 bits instead of 16.

That said, the logic is sound, and saving a few bits doesn't matter much.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 05/37] target/i386: add core of new i386 decoder
  2022-09-11 23:03 ` [PATCH 05/37] target/i386: add core of new i386 decoder Paolo Bonzini
  2022-09-12  9:27   ` Richard Henderson
@ 2022-09-12 10:54   ` Richard Henderson
  1 sibling, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 10:54 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> +/* five rows for no prefix, 66, F3, F2, 66+F2  */
> +static X86OpEntry opcodes_0F38_F0toFF[16][5] = {

const.


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 12/37] target/i386: add scalar 0F 38 and 0F 3A instruction to new decoder
  2022-09-11 23:03 ` [PATCH 12/37] target/i386: add scalar 0F 38 and 0F 3A instruction to new decoder Paolo Bonzini
@ 2022-09-12 11:04   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:04 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> Because these are the only VEX instructions that QEMU supports, the
> new decoder is entered on the first byte of a valid VEX prefix, and VEX
> decoding only needs to be done in decode-new.c.inc.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/tcg/decode-new.c.inc |  59 +++++++
>   target/i386/tcg/emit.c.inc       | 261 +++++++++++++++++++++++++++++++
>   target/i386/tcg/translate.c      |  49 +-----
>   3 files changed, 323 insertions(+), 46 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 13/37] target/i386: remove scalar VEX instructions from old decoder
  2022-09-11 23:03 ` [PATCH 13/37] target/i386: remove scalar VEX instructions from old decoder Paolo Bonzini
@ 2022-09-12 11:06   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:06 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> This is all dead code, since the VEX prefix goes straight to the new decoder.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/tcg/translate.c | 243 ------------------------------------
>   1 file changed, 243 deletions(-)

Could be squashed with previous, but...

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 14/37] target/i386: Prepare ops_sse_header.h for 256 bit AVX
  2022-09-11 23:03 ` [PATCH 14/37] target/i386: Prepare ops_sse_header.h for 256 bit AVX Paolo Bonzini
@ 2022-09-12 11:09   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:09 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: Paul Brook

On 9/12/22 00:03, Paolo Bonzini wrote:
> From: Paul Brook<paul@nowt.org>
> 
> Adjust all #ifdefs to match the ones in ops_sse.h.
> 
> Signed-off-by: Paul Brook<paul@nowt.org>
> Message-Id:<20220424220204.2493824-23-paul@nowt.org>
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/ops_sse_header.h | 114 +++++++++++++++++++++++------------
>   1 file changed, 75 insertions(+), 39 deletions(-)

Acked-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 15/37] target/i386: extend helpers to support VEX.V 3- and 4- operand encodings
  2022-09-11 23:03 ` [PATCH 15/37] target/i386: extend helpers to support VEX.V 3- and 4- operand encodings Paolo Bonzini
@ 2022-09-12 11:11   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:11 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> Add to the helpers all the operands that are needed to implement AVX.
> 
> Extracted from a patch by Paul Brook<paul@nowt.org>.
> 
> Message-Id:<20220424220204.2493824-26-paul@nowt.org>
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/ops_sse.h        | 173 +++++++++++++--------------------
>   target/i386/ops_sse_header.h | 149 ++++++++++++++--------------
>   target/i386/tcg/translate.c  | 181 ++++++++++++++++++++++++-----------
>   3 files changed, 265 insertions(+), 238 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 16/37] target/i386: support operand merging in binary scalar helpers
  2022-09-11 23:03 ` [PATCH 16/37] target/i386: support operand merging in binary scalar helpers Paolo Bonzini
@ 2022-09-12 11:11   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:11 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> Compared to Paul's implementation, the new decoder will use a different approach
> to implement AVX's merging of dst with src1 on scalar operations.  Adjust the
> helpers to provide this functionality.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/ops_sse.h | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 17/37] target/i386: provide 3-operand versions of unary scalar helpers
  2022-09-11 23:03 ` [PATCH 17/37] target/i386: provide 3-operand versions of unary " Paolo Bonzini
@ 2022-09-12 11:14   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:14 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> Compared to Paul's implementation, the new decoder will use a different approach
> to implement AVX's merging of dst with src1 on scalar operations.  Adjust the
> old SSE decoder to be compatible with new-style helpers.
> 
> The affected instructions are CVTSx2Sx, ROUNDSx, RSQRTSx, SQRTSx, RCPSx.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/ops_sse.h        | 48 ++++++++++++++++++++++++++++++------
>   target/i386/ops_sse_header.h | 16 ++++++------
>   target/i386/tcg/translate.c  | 22 ++++++++++-------
>   3 files changed, 61 insertions(+), 25 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 18/37] target/i386: implement additional AVX comparison operators
  2022-09-11 23:03 ` [PATCH 18/37] target/i386: implement additional AVX comparison operators Paolo Bonzini
@ 2022-09-12 11:19   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:19 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> The new implementation of SSE will cover AVX from the get go, so include
> the 24 extra comparison operators that are only available with the VEX
> prefix.
> 
> Based on a patch by Paul Brook<paul@nowt.org>.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/ops_sse.h        | 38 ++++++++++++++++++++++++++++++++++++
>   target/i386/ops_sse_header.h | 27 +++++++++++++++++++++++++
>   2 files changed, 65 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 19/37] target/i386: Introduce 256-bit vector helpers
  2022-09-11 23:03 ` [PATCH 19/37] target/i386: Introduce 256-bit vector helpers Paolo Bonzini
@ 2022-09-12 11:19   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:19 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
> The new implementation of SSE will cover AVX from the get go, because
> all the work for the helper functions is already done.  We just need to
> build them.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/helper.h         | 2 ++
>   target/i386/ops_sse.h        | 5 +++++
>   target/i386/ops_sse_header.h | 4 ++++
>   target/i386/tcg/fpu_helper.c | 3 +++
>   4 files changed, 14 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX
  2022-09-11 23:04 ` [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX Paolo Bonzini
@ 2022-09-12 11:41   ` Richard Henderson
  2022-09-13 10:56     ` Paolo Bonzini
  2022-09-12 13:01   ` Richard Henderson
  1 sibling, 1 reply; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 11:41 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +/*
> + * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
> + * 66 = vp* Vx, Hx, Wx
> + *
> + * These are really the same encoding, because 1) V is the same as P when VEX.V
> + * is not present 2) P and Q are the same as H and W apart from MM/XMM
> + */
> +static inline void gen_binary_int_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
> +                                      SSEFunc_0_eppp mmx, SSEFunc_0_eppp xmm, SSEFunc_0_eppp ymm)

No need to inline.

> +{
> +    assert (!!mmx == !!(decode->e.special == X86_SPECIAL_MMX));
> +
> +    if (mmx && (s->prefix & PREFIX_VEX) && !(s->prefix & PREFIX_DATA)) {
> +        /* VEX encoding is not applicable to MMX instructions.  */
> +        gen_illegal_opcode(s);
> +        return;
> +    }
> +    if (!(s->prefix & PREFIX_DATA)) {
> +        mmx(cpu_env, s->ptr0, s->ptr1, s->ptr2);
> +    } else if (!s->vex_l) {
> +        xmm(cpu_env, s->ptr0, s->ptr1, s->ptr2);
> +    } else {
> +        ymm(cpu_env, s->ptr0, s->ptr1, s->ptr2);
> +    }

And a reminder from earlier patches that generating the pointers here would be better, as 
well as zeroing the high ymm bits for vex xmm insns.

> +static void gen_MOVD_to(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    MemOp ot = decode->op[2].ot;
> +    int vec_len = sse_vec_len(s, decode);
> +    int lo_ofs = decode->op[0].offset
> +        - xmm_offset(decode->op[0].ot)
> +        + xmm_offset(ot);
> +
> +    tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
> +
> +    switch (ot) {
> +    case MO_32:
> +#ifdef TARGET_X86_64
> +        tcg_gen_trunc_tl_i32(s->tmp3_i32, s->T1);
> +        tcg_gen_st_i32(s->tmp3_i32, cpu_env, lo_ofs);
> +        break;

Use tcg_gen_st32_tl and omit the trunc.
Alternately, zero extend in T1 and fall through...
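
I.e. something like (a sketch; it assumes lo_ofs is then computed for the
full 64-bit slot, since the store writes 8 bytes for MO_32 as well):

     case MO_32:
 #ifdef TARGET_X86_64
         tcg_gen_ext32u_tl(s->T1, s->T1);
         /* fall through */
     case MO_64:
 #endif
         tcg_gen_st_tl(s->T1, cpu_env, lo_ofs);
         break;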

> +    case MO_64:
> +#endif
> +        tcg_gen_st_tl(s->T1, cpu_env, lo_ofs);

This could also be

     tcg_gen_gvec_dup_i64(MO_64, offset, 8, sse_vec_max_len, s->T1);

to do the store and clear in one call.



r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX
  2022-09-11 23:04 ` [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX Paolo Bonzini
  2022-09-12 11:41   ` Richard Henderson
@ 2022-09-12 13:01   ` Richard Henderson
  1 sibling, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 13:01 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +static void decode_0F6F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
> +{
> +    if (s->prefix & PREFIX_REPNZ) {
> +        entry->gen = NULL;

Are these lines really required with the p_00_66_f3 spec on the group entry?

> +    } else if (s->prefix & PREFIX_REPZ) {
> +        /* movdqu */
> +        entry->gen = gen_MOVDQ;
> +        entry->vex_class = 4;
> +        entry->vex_special = X86_VEX_SSEUnaligned;
> +    } else {
> +        /* MMX movq, movdqa */
> +        entry->gen = gen_MOVDQ;
> +        entry->vex_class = 1;
> +        entry->special = X86_SPECIAL_MMX;
> +    }

Also, you're overriding vex_class for both valid entries, so why does the group specify
vex5?  Clearer to use X86_OP_ENTRY3 within this function and copy from static const data 
instead of overriding individual fields?
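
I.e. something along these lines (a sketch; the vex/special shorthands are
assumed to exist in the decoder's macro set):

static void decode_0F6F(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
{
    static const X86OpEntry movdqa = X86_OP_ENTRY3(MOVDQ, V,x, None,None, W,x, vex1 mmx);
    static const X86OpEntry movdqu = X86_OP_ENTRY3(MOVDQ, V,x, None,None, W,x, vex4_unal);

    if (!(s->prefix & PREFIX_REPNZ)) {
        *entry = (s->prefix & PREFIX_REPZ) ? movdqu : movdqa;
    }
}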


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 21/37] target/i386: reimplement 0x0f 0xd8-0xdf, 0xe8-0xef,  0xf8-0xff, add AVX
  2022-09-11 23:04 ` [PATCH 21/37] target/i386: reimplement 0x0f 0xd8-0xdf, 0xe8-0xef, 0xf8-0xff, " Paolo Bonzini
@ 2022-09-12 13:19   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 13:19 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> --- a/target/i386/tcg/emit.c.inc
> +++ b/target/i386/tcg/emit.c.inc
> @@ -290,6 +290,20 @@ BINARY_INT_MMX(PUNPCKHWD,  punpckhwd)
>   BINARY_INT_MMX(PUNPCKHDQ,  punpckhdq)
>   BINARY_INT_MMX(PACKSSDW,   packssdw)
>   
> +BINARY_INT_MMX(PSUBUSB, psubusb)
> +BINARY_INT_MMX(PSUBUSW, psubusw)

tcg_gen_gvec_ussub

> +BINARY_INT_MMX(PMINUB,  pminub)

tcg_gen_gvec_umin

> +BINARY_INT_MMX(PADDUSB, paddusb)
> +BINARY_INT_MMX(PADDUSW, paddusw)

tcg_gen_gvec_usadd

> +BINARY_INT_MMX(PMAXUB,  pmaxub)

tcg_gen_gvec_umax

> +BINARY_INT_MMX(PSUBSB, psubsb)
> +BINARY_INT_MMX(PSUBSW, psubsw)

tcg_gen_gvec_sssub

> +BINARY_INT_MMX(PMINSW, pminsw)

tcg_gen_gvec_smin

> +BINARY_INT_MMX(PADDSB, paddsb)
> +BINARY_INT_MMX(PADDSW, paddsw)

tcg_gen_gvec_ssadd

> +BINARY_INT_MMX(PMAXSW, pmaxsw)

tcg_gen_gvec_smax
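
For example, PSUBUSB then becomes:

static void gen_PSUBUSB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
{
    int vec_len = sse_vec_len(s, decode);

    tcg_gen_gvec_ussub(MO_8, decode->op[0].offset, decode->op[1].offset,
                       decode->op[2].offset, vec_len, vec_len);
}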

> +static void gen_PADDB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    int vec_len = sse_vec_len(s, decode);
> +
> +    tcg_gen_gvec_add(MO_8,
> +                     decode->op[0].offset, decode->op[1].offset,
> +                     decode->op[2].offset, vec_len, vec_len);
> +}

Worth the creation of a helper and/or macro to reduce duplication?
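
Perhaps (a sketch; the macro name is invented here):

#define BINARY_GVEC(uname, fn, vece)                                       \
static void gen_##uname(DisasContext *s, CPUX86State *env,                 \
                        X86DecodedInsn *decode)                            \
{                                                                          \
    int vec_len = sse_vec_len(s, decode);                                  \
    fn(vece, decode->op[0].offset, decode->op[1].offset,                   \
       decode->op[2].offset, vec_len, vec_len);                            \
}

BINARY_GVEC(PADDB,   tcg_gen_gvec_add,   MO_8)
BINARY_GVEC(PSUBUSB, tcg_gen_gvec_ussub, MO_8)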


r~


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 22/37] target/i386: reimplement 0x0f 0x50-0x5f, add AVX
  2022-09-11 23:04 ` [PATCH 22/37] target/i386: reimplement 0x0f 0x50-0x5f, " Paolo Bonzini
@ 2022-09-12 13:46   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 13:46 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> These are mostly floating-point SSE operations.  The odd ones out
> are MOVMSK and CVTxx2yy, the others are straightforward.
> 
> Unary operations are a bit special in AVX because they have 2 operands
> for PD/PS operands (VEX.vvvv must be 1111b), and 3 operands for SD/SS.
> They are handled using X86_OP_GROUP3 for compactness.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   target/i386/tcg/decode-new.c.inc |  32 ++++++
>   target/i386/tcg/emit.c.inc       | 175 +++++++++++++++++++++++++++++++
>   target/i386/tcg/translate.c      |   2 +-
>   3 files changed, 208 insertions(+), 1 deletion(-)
> 
> diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
> index 59f5637583..5a94e05d71 100644
> --- a/target/i386/tcg/decode-new.c.inc
> +++ b/target/i386/tcg/decode-new.c.inc
> @@ -243,7 +243,30 @@ static void decode_0F3A(DisasContext *s, CPUX86State *env, X86OpEntry *entry, ui
>       *entry = opcodes_0F3A[*b];
>   }
>   
> +static void decode_sse_unary(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
> +{
> +    if (!(s->prefix & (PREFIX_REPZ | PREFIX_REPNZ))) {
> +        entry->op1 = X86_TYPE_None;
> +        entry->s1 = X86_SIZE_None;
> +    }
> +    switch (*b) {
> +    case 0x51: entry->gen = gen_VSQRT; break;
> +    case 0x52: entry->gen = gen_VRSQRT; break;
> +    case 0x53: entry->gen = gen_VRCP; break;
> +    case 0x5A: entry->gen = gen_VCVTfp2fp; break;
> +    }
> +}

I wonder if a .special would be cleaner here, but I guess this isn't horrible.

> +    [0x54] = X86_OP_ENTRY3(VAND,       V,x, H,x, W,x,  vex4 p_00_66),
> +    [0x55] = X86_OP_ENTRY3(VANDN,      V,x, H,x, W,x,  vex4 p_00_66),
> +    [0x56] = X86_OP_ENTRY3(VOR,        V,x, H,x, W,x,  vex4 p_00_66),
> +    [0x57] = X86_OP_ENTRY3(VXOR,       V,x, H,x, W,x,  vex4 p_00_66),

Just reuse PAND et al with a comment?  I see there's a define later, but why?
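
E.g. (sketch):

     [0x54] = X86_OP_ENTRY3(PAND,   V,x, H,x, W,x,  vex4 p_00_66), /* vandps/d */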

> +static void gen_VCVTfp2fp(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    gen_unary_fp_sse(s, env, decode,
> +                     gen_helper_cvtpd2ps_xmm, gen_helper_cvtps2pd_xmm,
> +                     gen_helper_cvtpd2ps_ymm, gen_helper_cvtps2pd_ymm,
> +                     gen_helper_cvtsd2ss, gen_helper_cvtss2sd);
> +}
> +
> +static void gen_VCVTps_dq(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    SSEFunc_0_epp fn = NULL;
> +    switch (sse_prefix(s)) {
> +    case 0x00:
> +        fn = s->vex_l ? gen_helper_cvtdq2ps_ymm : gen_helper_cvtdq2ps_xmm;
> +        break;
> +    case 0x66:
> +        fn = s->vex_l ? gen_helper_cvtps2dq_ymm : gen_helper_cvtps2dq_xmm;
> +        break;
> +    case 0xf3:
> +        fn = s->vex_l ? gen_helper_cvttps2dq_ymm : gen_helper_cvttps2dq_xmm;
> +        break;
> +    }
> +    fn(cpu_env, s->ptr0, s->ptr2);
> +}

Only use of sse_prefix?  We directly look at prefix bits elsewhere...

Also, while these are all converts, it doesn't seem conceptually different from 
decode_group_twobyte_6F.  Why are we waiting until generation in this case?


r~



* Re: [PATCH 23/37] target/i386: reimplement 0x0f 0x78-0x7f, add AVX
  2022-09-11 23:04 ` [PATCH 23/37] target/i386: reimplement 0x0f 0x78-0x7f, " Paolo Bonzini
@ 2022-09-12 13:56   ` Richard Henderson
  2022-09-14 16:17     ` Paolo Bonzini
  0 siblings, 1 reply; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 13:56 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +static void gen_MOVD_from(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    MemOp ot = decode->op[2].ot;
> +    int lo_ofs = decode->op[2].offset
> +        - xmm_offset(decode->op[2].ot)
> +        + xmm_offset(ot);
> +
> +    switch (ot) {
> +    case MO_32:
> +#ifdef TARGET_X86_64
> +        tcg_gen_ld_i32(s->tmp2_i32, cpu_env, lo_ofs);
> +        tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);

tcg_gen_ld32u_tl(s->T0, cpu_env, lo_ofs);

> +static void gen_MOVQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    int vec_len = sse_vec_len(s, decode);
> +    int lo_ofs = decode->op[0].offset
> +        - xmm_offset(decode->op[0].ot)
> +        + xmm_offset(MO_64);
> +
> +    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset);
> +    tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
> +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, lo_ofs);

tcg_gen_gvec_dup_i64(MO_64, offset, 8, sse_vec_max_len, s->tmp1_i64);


> +static void gen_SSE4a_I(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    TCGv_i32 length = tcg_const_i32(decode->immediate & 255);
> +    TCGv_i32 index = tcg_const_i32(decode->immediate >> 8);
> +
> +    if (s->prefix & PREFIX_DATA) {
> +        gen_helper_extrq_i(cpu_env, s->ptr0, index, length);
> +    } else {
> +        gen_helper_insertq_i(cpu_env, s->ptr0, index, length);
> +    }
> +    tcg_temp_free_i32(length);
> +    tcg_temp_free_i32(index);

Again, why the choice of delayed decode?  I guess it doesn't matter, but it's odd.


r~



* Re: [PATCH 24/37] target/i386: reimplement 0x0f 0x70-0x77, add AVX
  2022-09-11 23:04 ` [PATCH 24/37] target/i386: reimplement 0x0f 0x70-0x77, " Paolo Bonzini
@ 2022-09-12 14:29   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 14:29 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> This includes shifts by immediate, which use bits 3-5 of the ModRM byte
> as an opcode extension.  With the exception of 128-bit shifts, they are
> implemented using gvec.
> 
> This also covers VZEROALL and VZEROUPPER, which use the same opcode
> as EMMS.  If we were wanting to optimize out gen_clear_ymmh then this
> would be one of the starting points.  The implementation of the VZEROALL
> and VZEROUPPER helpers is by Paul Brook.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   target/i386/helper.h             |   7 +
>   target/i386/tcg/decode-new.c.inc |  76 ++++++++++
>   target/i386/tcg/emit.c.inc       | 232 +++++++++++++++++++++++++++++++
>   target/i386/tcg/fpu_helper.c     |  46 ++++++
>   target/i386/tcg/translate.c      |   3 +-
>   5 files changed, 362 insertions(+), 2 deletions(-)
> 
> diff --git a/target/i386/helper.h b/target/i386/helper.h
> index 3da5df98b9..d7e6878263 100644
> --- a/target/i386/helper.h
> +++ b/target/i386/helper.h
> @@ -221,6 +221,13 @@ DEF_HELPER_3(movq, void, env, ptr, ptr)
>   #define SHIFT 2
>   #include "ops_sse_header.h"
>   
> +DEF_HELPER_1(vzeroall, void, env)
> +DEF_HELPER_1(vzeroupper, void, env)
> +#ifdef TARGET_X86_64
> +DEF_HELPER_1(vzeroall_hi8, void, env)
> +DEF_HELPER_1(vzeroupper_hi8, void, env)
> +#endif
> +
>   DEF_HELPER_3(rclb, tl, env, tl, tl)
>   DEF_HELPER_3(rclw, tl, env, tl, tl)
>   DEF_HELPER_3(rcll, tl, env, tl, tl)
> diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
> index 6aa8bac74f..0e2da85934 100644
> --- a/target/i386/tcg/decode-new.c.inc
> +++ b/target/i386/tcg/decode-new.c.inc
> @@ -133,6 +133,19 @@ static uint8_t get_modrm(DisasContext *s, CPUX86State *env)
>       return s->modrm;
>   }
>   
> +static inline const X86OpEntry *decode_by_prefix(DisasContext *s, const X86OpEntry entries[4])
> +{
> +    if (s->prefix & PREFIX_REPNZ) {
> +        return &entries[3];
> +    } else if (s->prefix & PREFIX_REPZ) {
> +        return &entries[2];
> +    } else if (s->prefix & PREFIX_DATA) {
> +        return &entries[1];
> +    } else {
> +        return &entries[0];
> +    }
> +}

This is the sort of thing I would have expected for some of the other insns for which the 
distinction was delayed until generation, like SSE4a_{R,I}.
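
E.g. a hypothetical decode function for the SSE4a group -- the entry
names and operand encodings below are made up for illustration, not the
series' actual macros:

    static void decode_SSE4a(DisasContext *s, CPUX86State *env,
                             X86OpEntry *entry, uint8_t *b)
    {
        /* illustrative only: EXTRQ/INSERTQ immediate forms */
        static const X86OpEntry sse4a[4] = {
            {},                                                          /* none */
            X86_OP_ENTRY3(EXTRQ_i,   V,x, None,None, I,w, cpuid(SSE4A)), /* 66 */
            {},                                                          /* F3 */
            X86_OP_ENTRY3(INSERTQ_i, V,x, None,None, I,w, cpuid(SSE4A)), /* F2 */
        };
        *entry = *decode_by_prefix(s, sse4a);
    }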

> +static void decode_group12_13_14(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
> +{
> +    static const X86OpEntry group[3][8] = {
> +        {
> +            /* grp12 */
> +            {},
> +            {},
> +            X86_OP_ENTRY3(PSRLW_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
> +            {},
> +            X86_OP_ENTRY3(PSRAW_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
> +            {},
> +            X86_OP_ENTRY3(PSLLW_i,  H,x, U,x, I,b, vex7 mmx avx2_256 p_00_66),
> +            {},
> +        },

Why combine these 3 groups?

> +    *entry = group[*b - 0x71][op];

Split them and you avoid this magic number.

> +static inline void gen_unary_imm_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
> +                                     SSEFunc_0_ppi xmm, SSEFunc_0_ppi ymm)
> +{
> +    TCGv_i32 imm = tcg_const_i32(decode->immediate);

Use tcg_constant_i32, which need not be freed.
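
I.e.:

    TCGv_i32 imm = tcg_constant_i32(decode->immediate);
    ...
    /* no tcg_temp_free_i32(imm): constants are interned, never freed */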

> +static void gen_EMMS_VZERO(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    if (!(s->prefix & PREFIX_VEX)) {
> +        gen_helper_emms(cpu_env);
> +        return;
> +    }

Split in decode?  That would make vex8 simpler too.

> +static inline TCGv_ptr make_imm_mmx_vec(uint32_t imm)

Unused?  Please do drop all of the inline markers, and/or do build testing with clang, 
which will Werror on this.

> +static inline TCGv_ptr make_imm_xmm_vec(uint32_t imm, int vec_len)
> +{
> +    MemOp ot = vec_len == 16 ? MO_128 : MO_256;
> +    TCGv_i32 imm_v = tcg_const_i32(imm);

tcg_constant_i32, however I think this use can go away too.

> +static void gen_PSRLDQ_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    int vec_len = sse_vec_len(s, decode);
> +    TCGv_ptr imm_vec = make_imm_xmm_vec(decode->immediate, vec_len);
> +
> +    if (s->vex_l) {
> +        gen_helper_psrldq_ymm(cpu_env, s->ptr0, s->ptr1, imm_vec);
> +    } else {
> +        gen_helper_psrldq_xmm(cpu_env, s->ptr0, s->ptr1, imm_vec);
> +    }
> +    tcg_temp_free_ptr(imm_vec);

Let's just do this inline:

     int shift = decode->immediate * 8;

     if (shift >= 128) {
         zero;
         return;
     }

     for (lane = 0; lane <= s->vex_l; lane++) {
         TCGv_i64 q0 = tcg_temp_new_i64();
         TCGv_i64 q1 = tcg_temp_new_i64();

         tcg_gen_ld_i64(q0, cpu_env, offset + lane * 16 + offsetof(XMMReg, MMX_Q(0)));
         tcg_gen_ld_i64(q1, ...);

         if (shift >= 64) {
             tcg_gen_shri_i64(q0, q1, shift - 64);
             tcg_gen_movi_i64(q1, 0);
         } else {
             tcg_gen_extract2_i64(q0, q0, q1, shift);
             tcg_gen_shri_i64(q1, q1, shift);
         }

         tcg_gen_st_i64(q0, cpu_env, offset + lane * 16 + offsetof(XMMReg, MMX_Q(0)));
         tcg_gen_st_i64(q1, ...);
     }


> +static void gen_PSLLDQ_i(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    int vec_len = sse_vec_len(s, decode);
> +    TCGv_ptr imm_vec = make_imm_xmm_vec(decode->immediate, vec_len);
> +
> +    if (s->vex_l) {
> +        gen_helper_pslldq_ymm(cpu_env, s->ptr0, s->ptr1, imm_vec);
> +    } else {
> +        gen_helper_pslldq_xmm(cpu_env, s->ptr0, s->ptr1, imm_vec);
> +    }
> +    tcg_temp_free_ptr(imm_vec);
> +}

Similar, but the extract2 becomes

     tcg_gen_extract2_i64(q1, q0, q1, 64 - shift);

> +void helper_vzeroall(CPUX86State *env)
> +{
> +    int i;
> +
> +    for (i = 0; i < 8; i++) {
> +        env->xmm_regs[i].ZMM_Q(0) = 0;
> +        env->xmm_regs[i].ZMM_Q(1) = 0;
> +        env->xmm_regs[i].ZMM_Q(2) = 0;
> +        env->xmm_regs[i].ZMM_Q(3) = 0;
> +    }
> +}

Better with memset, I think, available as gen_helper_memset().

> +#ifdef TARGET_X86_64
> +void helper_vzeroall_hi8(CPUX86State *env)
> +{
> +    int i;
> +
> +    for (i = 8; i < 16; i++) {
> +        env->xmm_regs[i].ZMM_Q(0) = 0;
> +        env->xmm_regs[i].ZMM_Q(1) = 0;
> +        env->xmm_regs[i].ZMM_Q(2) = 0;
> +        env->xmm_regs[i].ZMM_Q(3) = 0;
> +    }
> +}

Likewise.


> +
> +void helper_vzeroupper_hi8(CPUX86State *ense_new &&
> -            ((b >= 0x150 && b <= 0x16f) ||
> -             (b >= 0x178 && b <= 0x17f) ||
> +            ((b >= 0x150 && b <= 0x17f) ||
>                (b >= 0x1d8 && b <= 0x1ff && (b & 8)))) {
>               return disas_insn_new(s, cpu, b + 0x100);
>           }

More mailer lossage?


r~



* Re: [PATCH 25/37] target/i386: reimplement 0x0f 0xd0-0xd7, 0xe0-0xe7,  0xf0-0xf7, add AVX
  2022-09-11 23:04 ` [PATCH 25/37] target/i386: reimplement 0x0f 0xd0-0xd7, 0xe0-0xe7, 0xf0-0xf7, " Paolo Bonzini
@ 2022-09-12 15:06   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 15:06 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +    [0xd7] = X86_OP_ENTRY3(PMOVMSKB,  G,d, None,None, U,x,  vex7 mmx avx2_256 p_00_66), /* MOVNTQ/MOVNTDQ */

Cut and paste comment?

> +BINARY_INT_MMX(PMULLW,  pmullw)

tcg_gen_gvec_mul

> +static void gen_VCVTpd_dq(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    SSEFunc_0_epp fn = NULL;
> +    switch (sse_prefix(s)) {
> +    case 0x66:
> +        fn = s->vex_l ? gen_helper_cvttpd2dq_ymm : gen_helper_cvttpd2dq_xmm;
> +        break;
> +    case 0xf3:
> +        fn = s->vex_l ? gen_helper_cvtdq2pd_ymm : gen_helper_cvtdq2pd_xmm;
> +        break;
> +    case 0xf2:
> +        fn = s->vex_l ? gen_helper_cvtpd2dq_ymm : gen_helper_cvtpd2dq_xmm;
> +        break;
> +    }
> +    fn(cpu_env, s->ptr0, s->ptr2);
> +}

Earlier decode?


r~



* Re: [PATCH 26/37] target/i386: reimplement 0x0f 0x3a, add AVX
  2022-09-11 23:04 ` [PATCH 26/37] target/i386: reimplement 0x0f 0x3a, " Paolo Bonzini
@ 2022-09-12 15:33   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-12 15:33 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> @@ -839,6 +910,10 @@ static bool decode_insn(DisasContext *s, CPUX86State *env, X86DecodeFunc decode_
>           }
>       }
>       if (e->op3 != X86_TYPE_None) {
> +        /*
> +         * A couple instructions actually use the extra immediate byte for an Lx
> +         * register operand; those are handled in the gen_* functions as one off.
> +         */
>           assert(e->op3 == X86_TYPE_I && e->s3 == X86_SIZE_b);
>           s->rip_offset += 1;
>       }

Comment should be squashed back with the code.

> +static inline void gen_binary_imm_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
> +                                      SSEFunc_0_epppi xmm, SSEFunc_0_epppi ymm)
> +{
> +    TCGv_i32 imm = tcg_const_i32(decode->immediate);

tcg_constant_i32.

> +static inline void gen_unary_imm_fp_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
> +                                        SSEFunc_0_eppi xmm, SSEFunc_0_eppi ymm)
> +{
> +    TCGv_i32 imm = tcg_const_i32(decode->immediate);

Likewise.

> +static void gen_PALIGNR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    TCGv_i32 imm = tcg_const_i32(decode->immediate);

Likewise, but this one could simply be implemented inline with extract2.
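
For a single 64-bit lane that would look something like this rough,
untested sketch (offsets as elsewhere in the series; the ymm lanes would
loop as in the PSRLDQ suggestion earlier in the thread):

    int shift = decode->immediate * 8;
    TCGv_i64 lo = tcg_temp_new_i64();
    TCGv_i64 hi = tcg_temp_new_i64();

    /* dst = low 64 bits of (src1:src2) >> shift */
    tcg_gen_ld_i64(lo, cpu_env, decode->op[2].offset);
    tcg_gen_ld_i64(hi, cpu_env, decode->op[1].offset);
    if (shift >= 128) {
        tcg_gen_movi_i64(lo, 0);
    } else if (shift >= 64) {
        tcg_gen_shri_i64(lo, hi, shift - 64);
    } else {
        tcg_gen_extract2_i64(lo, lo, hi, shift);
    }
    tcg_gen_st_i64(lo, cpu_env, decode->op[0].offset);
    tcg_temp_free_i64(lo);
    tcg_temp_free_i64(hi);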

> +static void gen_PCMPESTRI(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    TCGv_i32 imm = tcg_const_i32(decode->immediate);

tcg_constant_i32.

> +static void gen_PCMPESTRM(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    TCGv_i32 imm = tcg_const_i32(decode->immediate);

Likewise.

>   static void gen_PCMPGTB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
>   {
>       int vec_len = sse_vec_len(s, decode);
>         tcg_gen_ld8u_tl(s->T0, s->ptr1, offsetof(ZMMReg, ZMM_B(val)));
> +        break;
> +    case MO_16:
> +        tcg_gen_ld16u_tl(s->T0, s->ptr1, offsetof(ZMMReg, ZMM_W(val)));
> +        break;

Mailer breakage?

> +    if (new_mask != 15) {
> +        if ((val >> 0) & 1)
> +            tcg_gen_st_i32(zero, s->ptr0, offsetof(ZMMReg, ZMM_L(0)));
> +        if ((val >> 1) & 1)
> +            tcg_gen_st_i32(zero, s->ptr0, offsetof(ZMMReg, ZMM_L(1)));
> +        if ((val >> 2) & 1)
> +            tcg_gen_st_i32(zero, s->ptr0, offsetof(ZMMReg, ZMM_L(2)));
> +        if ((val >> 3) & 1)
> +            tcg_gen_st_i32(zero, s->ptr0, offsetof(ZMMReg, ZMM_L(3)));
> +    }

Missing braces.


r~



* Re: [PATCH 27/37] target/i386: Use tcg gvec ops for pmovmskb
  2022-09-11 23:04 ` [PATCH 27/37] target/i386: Use tcg gvec ops for pmovmskb Paolo Bonzini
@ 2022-09-13  8:16   ` Richard Henderson
  2022-09-14 22:59     ` Paolo Bonzini
  0 siblings, 1 reply; 86+ messages in thread
From: Richard Henderson @ 2022-09-13  8:16 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +    while (vec_len > 8) {
> +        vec_len -= 8;
> +        tcg_gen_shli_tl(s->T0, s->T0, 8);
> +        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
> +        tcg_gen_or_tl(s->T0, s->T0, t);
>       }

The shl + or is deposit, for those hosts that have it,
and will be re-expanded to shl + or for those that don't:

     tcg_gen_ld8u_tl(t, ...);
     tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);


r~



* Re: [PATCH 28/37] target/i386: reimplement 0x0f 0x38, add AVX
  2022-09-11 23:04 ` [PATCH 28/37] target/i386: reimplement 0x0f 0x38, add AVX Paolo Bonzini
@ 2022-09-13  9:31   ` Richard Henderson
  2022-09-14 17:04     ` Paolo Bonzini
  0 siblings, 1 reply; 86+ messages in thread
From: Richard Henderson @ 2022-09-13  9:31 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +void glue(helper_vtestps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
> +{
> +    uint64_t zf = 0, cf = 0;

uint32_t, to match the size of the operation.

> +    int i;
> +
> +    for (i = 0; i < 2 << SHIFT; i++) {
> +        zf |= (s->L(i) &  d->L(i));
> +        cf |= (s->L(i) & ~d->L(i));
> +    }


> +void glue(helper_vpmaskmovd_st, SUFFIX)(CPUX86State *env,
> +                                        Reg *v, Reg *s, target_ulong a0)
> +{
> +    int i;
> +
> +    for (i = 0; i < (2 << SHIFT); i++) {
> +        if (v->L(i) >> 31) {
> +            cpu_stl_data_ra(env, a0 + i * 4, s->L(i), GETPC());
> +        }
> +    }
> +}
> +
> +void glue(helper_vpmaskmovq_st, SUFFIX)(CPUX86State *env,
> +                                        Reg *v, Reg *s, target_ulong a0)
> +{
> +    int i;
> +
> +    for (i = 0; i < (1 << SHIFT); i++) {
> +        if (v->Q(i) >> 63) {
> +            cpu_stq_data_ra(env, a0 + i * 8, s->Q(i), GETPC());
> +        }
> +    }
> +}

Any idea if hw will write incomplete data if the pieces cross page boundaries, and the 
second page is invalid?  We're not good at that for any other vector sized write, though, 
so not critical.

> +void glue(helper_vpmaskmovd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
> +{
> +    int i;
> +
> +    for (i = 0; i < (2 << SHIFT); i++) {
> +        d->L(i) = (v->L(i) >> 31) ? s->L(i) : 0;
> +    }
> +}

This is tcg_gen_cmpsel_vec(TCG_COND_LT, d, v, zero, s, zero).

> +void glue(helper_vpgatherdd, SUFFIX)(CPUX86State *env,
> +        Reg *d, Reg *v, Reg *s, target_ulong a0, unsigned scale)
> +{
> +    int i;
> +    for (i = 0; i < (2 << SHIFT); i++) {
> +        if (v->L(i) >> 31) {
> +            target_ulong addr = a0
> +                + ((target_ulong)(int32_t)s->L(i) << scale);
> +            d->L(i) = cpu_ldl_data_ra(env, addr, GETPC());
> +        }
> +        v->L(i) = 0;
> +    }
> +}

Better to not modify registers until all potential #GP are raised.
Also, some missing whitespace between functions.

> +    [0x2f] = X86_OP_ENTRY3(,x,  vex4 cpuid(SSE41) avx2_256 p_66),

Whee! Mailer really chomped down on this series.

> @@ -384,8 +484,8 @@ static const X86OpEntry opcodes_0F3A[256] = {
>       [0x0b] = X86_OP_ENTRY4(VROUNDSD,   V,x,  H,x, W,sd, vex3 cpuid(SSE41) p_66),
>       [0x0c] = X86_OP_ENTRY4(VBLENDPS,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
>       [0x0d] = X86_OP_ENTRY4(VBLENDPD,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
> -    [0x0e] = X86_OP_ENTRY4(VPBLENDW,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) p_66),
> -    [0x0f] = X86_OP_ENTRY4(PALIGNR,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx p_00_66),
> +    [0x0e] = X86_OP_ENTRY4(VPBLENDW,   V,x,  H,x,  W,x,  vex4 cpuid(SSE41) avx2_256 p_66),
> +    [0x0f] = X86_OP_ENTRY4(PALIGNR,    V,x,  H,x,  W,x,  vex4 cpuid(SSSE3) mmx avx2_256 p_00_66),

Squash back.

> +    case X86_SPECIAL_AVXExtMov:
> +        if (!decode.op[2].has_ea) {
> +            decode.op[2].ot = s->vex_l ? MO_128 : MO_256;
> +        } else if (s->vex_l) {
> +            decode.op[2].ot++;
> +        }

Clever.

> +BINARY_INT_SSE(VPMINSB,    pminsb)
> +BINARY_INT_SSE(VPMINUW,    pminuw)
> +BINARY_INT_SSE(VPMINUD,    pminud)
> +BINARY_INT_SSE(VPMINSD,    pminsd)
> +BINARY_INT_SSE(VPMAXSB,    pmaxsb)
> +BINARY_INT_SSE(VPMAXUW,    pmaxuw)
> +BINARY_INT_SSE(VPMAXUD,    pmaxud)
> +BINARY_INT_SSE(VPMAXSD,    pmaxsd)

tcg_gen_gvec_{u,s}{min,max}.

> +/* Same as above, but with extra arguments to the helper.  */
> +static inline void gen_vsib_avx(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
> +                                SSEFunc_0_epppti d_xmm, SSEFunc_0_epppti q_xmm,
> +                                SSEFunc_0_epppti d_ymm, SSEFunc_0_epppti q_ymm)
> +{
> +    SSEFunc_0_epppti d = s->vex_l ? d_ymm : d_xmm;
> +    SSEFunc_0_epppti q = s->vex_l ? q_ymm : q_xmm;
> +    SSEFunc_0_epppti fn = s->rex_w ? q : d;
> +    TCGv_i32 scale = tcg_const_i32(decode->mem.scale);

tcg_constant_i32.

> +static void gen_VPBROADCASTB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    int vec_len = sse_vec_len(s, decode);
> +
> +    tcg_gen_ld8u_i32(s->tmp2_i32, s->ptr2, 0);
> +    tcg_gen_gvec_dup_i32(MO_8, decode->op[0].offset, vec_len, vec_len, s->tmp2_i32);
> +}

This is better done with tcg_gen_gvec_dup_mem, where you pass the cpu_env offset of the 
source data.  This lets the host use mem->reg broadcast, which turns out to be more 
available than reg->reg broadcast.
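
I.e., roughly (untested sketch):

    static void gen_VPBROADCASTB(DisasContext *s, CPUX86State *env,
                                 X86DecodedInsn *decode)
    {
        int vec_len = sse_vec_len(s, decode);

        /* replicate the byte at op[2].offset across the whole vector */
        tcg_gen_gvec_dup_mem(MO_8, decode->op[0].offset,
                             decode->op[2].offset, vec_len, vec_len);
    }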

> +static void gen_VPBROADCASTW(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +static void gen_VPBROADCASTD(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +static void gen_VPBROADCASTQ(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)

Likewise.

> +static inline void gen_VBROADCASTx128(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    tcg_gen_gvec_mov(MO_64, decode->op[0].offset,
> +                     decode->op[2].offset, 16, 16);
> +    tcg_gen_gvec_mov(MO_64, decode->op[0].offset + offsetof(YMMReg, YMM_X(1)),
> +                     decode->op[2].offset, 16, 16);
> +}

tcg_gen_gvec_dup_mem(MO_128, ...);


r~



* Re: [PATCH 29/37] target/i386: reimplement 0x0f 0xc2, 0xc4-0xc6, add AVX
  2022-09-11 23:04 ` [PATCH 29/37] target/i386: reimplement 0x0f 0xc2, 0xc4-0xc6, " Paolo Bonzini
@ 2022-09-13  9:44   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13  9:44 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> Nothing special going on here, for once.

Hooray!

> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/tcg/decode-new.c.inc |  5 +++
>   target/i386/tcg/emit.c.inc       | 76 ++++++++++++++++++++++++++++++++
>   target/i386/tcg/translate.c      |  1 +
>   3 files changed, 82 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~



* Re: [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, add AVX
  2022-09-11 23:04 ` [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, " Paolo Bonzini
@ 2022-09-13 10:14   ` Richard Henderson
  2022-09-14 22:45     ` Paolo Bonzini
  2022-09-13 10:38   ` Richard Henderson
  1 sibling, 1 reply; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:14 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +static void gen_VMOVHPx_ld(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    if (decode->op[0].offset != decode->op[1].offset) {
> +        tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[1].offset + offsetof(XMMReg, XMM_Q(0)));
> +        tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
> +    }
> +    gen_ldq_env_A0(s, decode->op[0].offset + offsetof(XMMReg, XMM_Q(1)));
> +}

Don't modify op0 before the load fault.

> +static void gen_VMOVHLPS(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset + offsetof(XMMReg, XMM_Q(1)));
> +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
> +    if (decode->op[0].offset != decode->op[1].offset) {
> +        tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[1].offset + offsetof(XMMReg, XMM_Q(1)));
> +        tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(1)));
> +    }
> +}
> +
> +static void gen_VMOVLHPS(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset + offsetof(XMMReg, XMM_Q(0)));
> +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(1)));
> +    if (decode->op[0].offset != decode->op[1].offset) {
> +        tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[1].offset + offsetof(XMMReg, XMM_Q(0)));
> +        tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
> +    }
> +}
> +
> +static void gen_VMOVLPx(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    int vec_len = sse_vec_len(s, decode);
> +
> +    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset + offsetof(XMMReg, XMM_Q(0)));
> +    tcg_gen_gvec_mov(MO_64, decode->op[0].offset, decode->op[1].offset, vec_len, vec_len);
> +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
> +}

You've just been moving i64 pieces in the other functions, why is this one different using 
a gvec move in the middle?  I do wonder if a generic helper moving offset->offset, with 
the comparison wouldn't be helpful within these functions, even when you know off1 != 
off2, due to Q(0) vs Q(1).
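
E.g. a hypothetical:

    /* illustrative helper, not in the series */
    static void gen_mov_q(DisasContext *s, int dofs, int sofs)
    {
        if (dofs != sofs) {
            tcg_gen_ld_i64(s->tmp1_i64, cpu_env, sofs);
            tcg_gen_st_i64(s->tmp1_i64, cpu_env, dofs);
        }
    }

with which VMOVHLPS/VMOVLHPS above collapse to two calls each.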

> +static void gen_VMOVLPx_ld(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    int vec_len = sse_vec_len(s, decode);
> +
> +    tcg_gen_gvec_mov(MO_64, decode->op[0].offset, decode->op[1].offset, vec_len, vec_len);
> +    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, s->mem_index, MO_64);
> +    tcg_gen_st_i64(s->tmp1_i64, s->ptr0, offsetof(ZMMReg, ZMM_Q(0)));
> +}

Don't modify op0 before load fault.

> +static void gen_VMOVSD_ld(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    TCGv zero = tcg_const_i64(0);
> +
> +    tcg_gen_st_i64(zero, s->ptr0, offsetof(ZMMReg, ZMM_Q(1)));
> +    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, s->mem_index, MO_64);
> +    tcg_gen_st_i64(s->tmp1_i64, s->ptr0, offsetof(ZMMReg, ZMM_Q(0)));
> +    tcg_temp_free_i64(zero);
> +}

Likewise.

> +static void gen_VMOVSS_ld(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> +{
> +    int vec_len = sse_vec_len(s, decode);
> +
> +    tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len, vec_len, 0);
> +    tcg_gen_qemu_ld_i32(s->tmp2_i32, s->A0, s->mem_index, MO_32);
> +    tcg_gen_st_i32(s->tmp2_i32, s->ptr0, offsetof(ZMMReg, ZMM_L(0)));
> +}

Likewise.


r~



* Re: [PATCH 31/37] target/i386: reimplement 0x0f 0x28-0x2f, add AVX
  2022-09-11 23:04 ` [PATCH 31/37] target/i386: reimplement 0x0f 0x28-0x2f, " Paolo Bonzini
@ 2022-09-13 10:24   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:24 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +static void decode_0F2B(DisasContext *s, CPUX86State *env, X86OpEntry *entry, uint8_t *b)
> +{
> +    static const X86OpEntry opcodes_0F2B[4] = {
> +        X86_OP_ENTRY3(MOVDQ,      M,x,  None,None, V,x, vex4), /* MOVNTPS */
> +        X86_OP_ENTRY3(MOVDQ,      M,x,  None,None, V,x, vex4), /* MOVNTPD */
> +        X86_OP_ENTRY3(VMOVSS_st,  M,ss, None,None, V,x, vex4),
> +        X86_OP_ENTRY3(VMOVLPx_st, M,sd, None,None, V,x, vex4), /* MOVSD */

These last two look wrong.  And if you don't have those, you use ENTRY3 instead of GROUP0 
in the main table?


r~



* Re: [PATCH 32/37] target/i386: implement XSAVE and XRSTOR of AVX registers
  2022-09-11 23:04 ` [PATCH 32/37] target/i386: implement XSAVE and XRSTOR of AVX registers Paolo Bonzini
@ 2022-09-13 10:27   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:27 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +    if (rfbm & XSTATE_YMM_MASK) {
> +        if (xstate_bv & XSTATE_BNDREGS_MASK) {
> +            do_xrstor_ymmh(env, ptr, ra);

Paste-o on second line.

Otherwise,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



* Re: [PATCH 33/37] target/i386: Enable AVX cpuid bits when using TCG
  2022-09-11 23:04 ` [PATCH 33/37] target/i386: Enable AVX cpuid bits when using TCG Paolo Bonzini
@ 2022-09-13 10:28   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:28 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: Paul Brook

On 9/12/22 00:04, Paolo Bonzini wrote:
> From: Paul Brook<paul@nowt.org>
> 
> Include AVX, AVX2 and VAES in the guest cpuid features supported by TCG.
> 
> Signed-off-by: Paul Brook<paul@nowt.org>
> Message-Id:<20220424220204.2493824-40-paul@nowt.org>
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/cpu.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



* Re: [PATCH 34/37] target/i386: implement VLDMXCSR/VSTMXCSR
  2022-09-11 23:04 ` [PATCH 34/37] target/i386: implement VLDMXCSR/VSTMXCSR Paolo Bonzini
@ 2022-09-13 10:32   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:32 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> These are exactly the same as the non-VEX version, but one has to be careful
> that only VEX.L=0 is allowed.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/tcg/decode-new.c.inc | 25 +++++++++++++++++++++++++
>   target/i386/tcg/emit.c.inc       | 20 ++++++++++++++++++++
>   2 files changed, 45 insertions(+)

Needs to be sorted before patch 33, enabling AVX.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~



* Re: [PATCH 35/37] tests/tcg: extend SSE tests to AVX
  2022-09-11 23:04 ` [PATCH 35/37] tests/tcg: extend SSE tests to AVX Paolo Bonzini
@ 2022-09-13 10:33   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:33 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> Extracted from a patch by Paul Brook<paul@nowt.org>.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   tests/tcg/i386/Makefile.target |   2 +-
>   tests/tcg/i386/test-avx.c      | 201 ++++++++++++++++++---------------
>   tests/tcg/i386/test-avx.py     |   3 +-
>   3 files changed, 112 insertions(+), 94 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~



* Re: [PATCH 36/37] target/i386: move 3DNow completely out of gen_sse
  2022-09-11 23:04 ` [PATCH 36/37] target/i386: move 3DNow completely out of gen_sse Paolo Bonzini
@ 2022-09-13 10:34   ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:34 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> Everything else has been converted to the new decoder, so separate the
> part that survives.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   target/i386/tcg/translate.c | 104 +++++++++++++++++++++++-------------
>   1 file changed, 68 insertions(+), 36 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



* Re: [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, add AVX
  2022-09-11 23:04 ` [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, " Paolo Bonzini
  2022-09-13 10:14   ` Richard Henderson
@ 2022-09-13 10:38   ` Richard Henderson
  1 sibling, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:38 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:04, Paolo Bonzini wrote:
> +    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, s->mem_index, MO_64);

I just noticed this here, but please examine any other direct loads: you've forgotten the 
endian specification: MO_64 | MO_LE, or MO_LEUQ for short.
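
I.e.:

    tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, s->mem_index, MO_LEUQ);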


r~



* Re: [RFC PATCH 00/37] target/i386: new decoder + AVX implementation
  2022-09-11 23:03 [RFC PATCH 00/37] target/i386: new decoder + AVX implementation Paolo Bonzini
                   ` (35 preceding siblings ...)
  2022-09-11 23:04 ` [PATCH 36/37] target/i386: move 3DNow completely out of gen_sse Paolo Bonzini
@ 2022-09-13 10:39 ` Richard Henderson
  36 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 10:39 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 9/12/22 00:03, Paolo Bonzini wrote:
>    target/i386: remove old SSE decoder

Patch 37 never arrived, but I can imagine what it looked like.  :-)

The series is looking good.  All of the nits were minor.


r~



* Re: [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX
  2022-09-12 11:41   ` Richard Henderson
@ 2022-09-13 10:56     ` Paolo Bonzini
  2022-09-13 11:35       ` Richard Henderson
  0 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-13 10:56 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Sep 12, 2022 at 1:41 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/12/22 00:04, Paolo Bonzini wrote:
> > +/*
> > + * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
> > + * 66 = vp* Vx, Hx, Wx
> > + *
> > + * These are really the same encoding, because 1) V is the same as P when VEX.V
> > + * is not present 2) P and Q are the same as H and W apart from MM/XMM
> > + */
> > +static inline void gen_binary_int_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
> > +                                      SSEFunc_0_eppp mmx, SSEFunc_0_eppp xmm, SSEFunc_0_eppp ymm)
>
> No need to inline.

Yes and no: the compiler should indeed be able to figure it out, but
both the assert() and the calls are meant to be optimized out by
inlining. So this kind of function would even be an always_inline
candidate.

> > +{
> > +    assert (!!mmx == !!(decode->e.special == X86_SPECIAL_MMX));
> > +
> > +    if (mmx && (s->prefix & PREFIX_VEX) && !(s->prefix & PREFIX_DATA)) {
> > +        /* VEX encoding is not applicable to MMX instructions.  */
> > +        gen_illegal_opcode(s);
> > +        return;
> > +    }
> > +    if (!(s->prefix & PREFIX_DATA)) {
> > +        mmx(cpu_env, s->ptr0, s->ptr1, s->ptr2);
> > +    } else if (!s->vex_l) {
> > +        xmm(cpu_env, s->ptr0, s->ptr1, s->ptr2);
> > +    } else {
> > +        ymm(cpu_env, s->ptr0, s->ptr1, s->ptr2);
> > +    }
>
> And a reminder from earlier patches that generating the pointers here would be better, as
> well as zeroing the high ymm bits for vex xmm insns.

I'm not sure about that, because there are quite a few cases handled
by more complex gen_* functions, which are helper-based (so not simple
calls to gvec functions where you have maxsz/oprsz) and are not
handled by the common gen_*_sse. For example gen_CVTPI2Px,
gen_MASKMOV, gen_PSRLDQ_i, gen_SSE4a_I, gen_VCVTSI2Sx, ...  All of
these would have to add extra code to set the pointers and to clear
the high ymm bits.

For gen_load, however, I can delay the generation using something like:

static inline TCGv_ptr get_ptr0(DisasContext *s)
{
    if (s->ptr0) {
        return s->ptr0;
    }
    s->ptr0 = tcg_temp_new_ptr();
    tcg_gen_addi_ptr(s->ptr0, cpu_env, ...);
    return s->ptr0;
}

Most of the changes to this series are mechanical, so if you dislike
relying on DCE then why not.

For gen_writeback, keeping it eliminates duplicated code
and keeps the phases of disas_insn_new separated, so I prefer it
slightly. For now I'd rather leave it as is; with the above get_ptr0()
function that creates s->ptr0 lazily, perhaps gen_writeback() could do
it only if s->ptr0 is set (suggesting that a helper was used), while
gvec helpers would use the oprsz<maxsz feature. There's something to
be said for keeping the initial implementation simple of course,
especially since it's already slightly better than the code produced
by the existing decoder.

> > +    switch (ot) {
> > +    case MO_32:
> > +#ifdef TARGET_X86_64
> > +        tcg_gen_trunc_tl_i32(s->tmp3_i32, s->T1);
> > +        tcg_gen_st_i32(s->tmp3_i32, cpu_env, lo_ofs);
> > +        break;
>
> This could also be
>
>      tcg_gen_gvec_dup_i64(MO_64, offset, 8, sse_vec_max_len, s->T1);

Yeah, it can be something like

    case MO_32:
        tcg_gen_trunc_tl_i32(s->tmp3_i32, s->T1);
        tcg_gen_gvec_dup_i32(MO_32, decode->op[0].offset, 4, vec_len,
s->tmp3_i32);
        break;
#ifdef TARGET_X86_64
    case MO_64:
        tcg_gen_gvec_dup_i64(MO_64, decode->op[0].offset, 8, vec_len, s->T1);
        break;
#endif

and in this case of course it's not possible to use st32_tl.

Paolo




* Re: [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX
  2022-09-13 10:56     ` Paolo Bonzini
@ 2022-09-13 11:35       ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-13 11:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 9/13/22 11:56, Paolo Bonzini wrote:
> On Mon, Sep 12, 2022 at 1:41 PM Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 9/12/22 00:04, Paolo Bonzini wrote:
>>> +/*
>>> + * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
>>> + * 66 = vp* Vx, Hx, Wx
>>> + *
>>> + * These are really the same encoding, because 1) V is the same as P when VEX.V
>>> + * is not present 2) P and Q are the same as H and W apart from MM/XMM
>>> + */
>>> +static inline void gen_binary_int_sse(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode,
>>> +                                      SSEFunc_0_eppp mmx, SSEFunc_0_eppp xmm, SSEFunc_0_eppp ymm)
>>
>> No need to inline.
> 
> Yes and no, the compiler should indeed be able to figure it out, but
> both the assert() and the calls are meant to be optimized out by
> inlining. So this kind of function would be even an always_inline
> candidate.

Yes, I get that, I just prefer by default to allow the compiler to figure it out. 
Obviously there are parts of the code base where we use always_inline and more, but this 
part is never going to be performance critical.

Over-use of inline generally leads to Werror from clang, for the unused function case.

> I'm not sure about that, because there are quite a few cases handled
> by more complex gen_* functions, which are helper-based (so not simple
> calls to gvec functions where you have maxsz/oprsz) and are not
> handled by the common gen_*_sse. For example gen_CVTPI2Px,
> gen_MASKMOV, gen_PSRLDQ_i, gen_SSE4a_I, gen_VCVTSI2Sx, ...  All of
> these would have to add extra code to set the pointers and to clear
> the high ymm bits.

Fair.

> For gen_load, however, i can delay the generation using something like
> 
> static inline TCGv_ptr get_ptr0(DisasContext *s)
> {
>      if (s->ptr0) {
>          return s->ptr0;
>      }
>      s->ptr0 = tcg_temp_new_ptr();
>      tcg_gen_addi_ptr(s->ptr0, cpu_env, ...);
>      return s->ptr0;
> }

Sure.

> For gen_writeback, keeping gen_writeback eliminates duplicated code
> and keeps the phases of disas_insn_new separated, so I prefer it
> slightly. For now I'd rather leave it as is; with the above get_ptr0()
> function that creates s->ptr0 lazily, perhaps gen_writeback() could do
> it only if s->ptr0 is set (suggesting that a helper was used), while
> gvec helpers would use the oprsz<maxsz feature. There's something to
> be said for keeping the initial implementation simple of course,
> especially since it's already slightly better than the code produced
> by the existing decoder.

Also fair.  Let's ignore the max argument for now, and address it in a subsequent phase, 
where we also convert more operations to gvec.

>> This could also be
>>
>>       tcg_gen_gvec_dup_i64(MO_64, offset, 8, sse_vec_max_len, s->T1);
> 
> Yeah, it can be something like
> 
>      case MO_32:
>          tcg_gen_trunc_tl_i32(s->tmp3_i32, s->T1);
>          tcg_gen_gvec_dup_i32(MO_32, decode->op[0].offset, 4, vec_len,
> s->tmp3_i32);
>          break;


Actually, this doesn't work, because minimum vector size is 8.
This will hit the assert in check_size_align().

I've just realized that we can't just extend i32 to i64, as I was suggesting, because that 
will fall foul of big-endian host (L(0) is at the top half of Q(0)).  So best to keep your 
zero + store.
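
I.e., keeping the shape of the quoted code (sketch; vec_len and the
op[] offsets as elsewhere in the series):

    case MO_32:
        tcg_gen_gvec_dup_imm(MO_64, decode->op[0].offset, vec_len,
                             vec_len, 0);
        tcg_gen_trunc_tl_i32(s->tmp3_i32, s->T1);
        tcg_gen_st_i32(s->tmp3_i32, cpu_env,
                       decode->op[0].offset + offsetof(ZMMReg, ZMM_L(0)));
        break;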


r~



* Re: [PATCH 23/37] target/i386: reimplement 0x0f 0x78-0x7f, add AVX
  2022-09-12 13:56   ` Richard Henderson
@ 2022-09-14 16:17     ` Paolo Bonzini
  0 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-14 16:17 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Sep 12, 2022 at 3:56 PM Richard Henderson
<richard.henderson@linaro.org> wrote:

> > +static void gen_SSE4a_I(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> > +{
> > +    TCGv_i32 length = tcg_const_i32(decode->immediate & 255);
> > +    TCGv_i32 index = tcg_const_i32(decode->immediate >> 8);
> > +
> > +    if (s->prefix & PREFIX_DATA) {
> > +        gen_helper_extrq_i(cpu_env, s->ptr0, index, length);
> > +    } else {
> > +        gen_helper_insertq_i(cpu_env, s->ptr0, index, length);
> > +    }
> > +    tcg_temp_free_i32(length);
> > +    tcg_temp_free_i32(index);
>
> Again, why the choice of delayed decode?  I guess it doesn't matter, but it's odd.

Mostly because I wasn't sure which would be preferable, so I tried
different things. I think now I have a better picture.

I will mostly switch to decode_by_prefix.

Paolo




* Re: [PATCH 28/37] target/i386: reimplement 0x0f 0x38, add AVX
  2022-09-13  9:31   ` Richard Henderson
@ 2022-09-14 17:04     ` Paolo Bonzini
  2022-09-15  6:50       ` Richard Henderson
  0 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-14 17:04 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Tue, Sep 13, 2022 at 11:31 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
> > +void glue(helper_vpmaskmovq_st, SUFFIX)(CPUX86State *env,
> > +                                        Reg *v, Reg *s, target_ulong a0)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < (1 << SHIFT); i++) {
> > +        if (v->Q(i) >> 63) {
> > +            cpu_stq_data_ra(env, a0 + i * 8, s->Q(i), GETPC());
> > +        }
> > +    }
> > +}
>
> Any idea if hw will write incomplete data if the pieces cross page boundaries, and the
> second page is invalid?  We're not good at that for any other vector sized write, though,
> so not critical.

No, I will check.

> > +void glue(helper_vpgatherdd, SUFFIX)(CPUX86State *env,
> > +        Reg *d, Reg *v, Reg *s, target_ulong a0, unsigned scale)
> > +{
> > +    int i;
> > +    for (i = 0; i < (2 << SHIFT); i++) {
> > +        if (v->L(i) >> 31) {
> > +            target_ulong addr = a0
> > +                + ((target_ulong)(int32_t)s->L(i) << scale);
> > +            d->L(i) = cpu_ldl_data_ra(env, addr, GETPC());
> > +        }
> > +        v->L(i) = 0;
> > +    }
> > +}
>
> Better to not modify registers until all potential #GP are raised.

This is actually intentional: elements of v are zeroed whenever an
element is read successfully, so that values are not reread when the
instruction restarts. The manual says "If a fault is triggered by an
element and delivered, all elements closer to the LSB of the
destination zmm will be completed".

Paolo




* Re: [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, add AVX
  2022-09-13 10:14   ` Richard Henderson
@ 2022-09-14 22:45     ` Paolo Bonzini
  2022-09-15  6:51       ` Richard Henderson
  0 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-14 22:45 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Tue, Sep 13, 2022 at 12:14 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
> > +static void gen_VMOVLPx(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
> > +{
> > +    int vec_len = sse_vec_len(s, decode);
> > +
> > +    tcg_gen_ld_i64(s->tmp1_i64, cpu_env, decode->op[2].offset + offsetof(XMMReg, XMM_Q(0)));
> > +    tcg_gen_gvec_mov(MO_64, decode->op[0].offset, decode->op[1].offset, vec_len, vec_len);
> > +    tcg_gen_st_i64(s->tmp1_i64, cpu_env, decode->op[0].offset + offsetof(XMMReg, XMM_Q(0)));
> > +}
>
> You've just been moving i64 pieces in the other functions, why is this one different using
> a gvec move in the middle?  I do wonder if a generic helper moving offset->offset, with
> the comparison wouldn't be helpful within these functions, even when you know off1 !=
> off2, due to Q(0) vs Q(1).

Because this one is the only one that has a VEX.256 version (the
operand is type "x" rather than "dq" as in MOVHLPS, MOVLHPS, MOVHPx).

Paolo




* Re: [PATCH 27/37] target/i386: Use tcg gvec ops for pmovmskb
  2022-09-13  8:16   ` Richard Henderson
@ 2022-09-14 22:59     ` Paolo Bonzini
  2022-09-15  6:48       ` Richard Henderson
  0 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2022-09-14 22:59 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Tue, Sep 13, 2022 at 10:17 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/12/22 00:04, Paolo Bonzini wrote:
> > +    while (vec_len > 8) {
> > +        vec_len -= 8;
> > +        tcg_gen_shli_tl(s->T0, s->T0, 8);
> > +        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
> > +        tcg_gen_or_tl(s->T0, s->T0, t);
> >       }
>
> The shl + or is deposit, for those hosts that have it,
> and will be re-expanded to shl + or for those that don't:
>
>      tcg_gen_ld8u_tl(t, ...);
>      tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);

What you get from that is an shl(t, 56) followed by extract2 (i.e.
SHRD). Yeah there are targets with a native deposit (x86 itself could
add PDEP/PEXT support I guess) but I find it hard to believe that it
outperforms a simple shl + or.

If we want to get clever, I should instead load ZMM_B(vec_len - 1)
directly into the *high* byte of t, using ZMM_L or ZMM_Q, and then
issue the extract2 myself.

Paolo




* Re: [PATCH 27/37] target/i386: Use tcg gvec ops for pmovmskb
  2022-09-14 22:59     ` Paolo Bonzini
@ 2022-09-15  6:48       ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-15  6:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 9/14/22 23:59, Paolo Bonzini wrote:
> On Tue, Sep 13, 2022 at 10:17 AM Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 9/12/22 00:04, Paolo Bonzini wrote:
>>> +    while (vec_len > 8) {
>>> +        vec_len -= 8;
>>> +        tcg_gen_shli_tl(s->T0, s->T0, 8);
>>> +        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
>>> +        tcg_gen_or_tl(s->T0, s->T0, t);
>>>        }
>>
>> The shl + or is deposit, for those hosts that have it,
>> and will be re-expanded to shl + or for those that don't:
>>
>>       tcg_gen_ld8u_tl(t, ...);
>>       tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);
> 
> What you get from that is an shl(t, 56) followed by extract2 (i.e.
> SHRD). Yeah there are targets with a native deposit (x86 itself could
> add PDEP/PEXT support I guess) but I find it hard to believe that it
> outperforms a simple shl + or.

Perhaps the shl+shrd (or shrd+rol if the deposit is slightly different) is over-cleverness 
on my part in the expansion, and pdep requires a constant mask.

But for other hosts, deposit is the same cost as shift.


r~



* Re: [PATCH 28/37] target/i386: reimplement 0x0f 0x38, add AVX
  2022-09-14 17:04     ` Paolo Bonzini
@ 2022-09-15  6:50       ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-15  6:50 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 9/14/22 18:04, Paolo Bonzini wrote:
>>> +void glue(helper_vpgatherdd, SUFFIX)(CPUX86State *env,
>>> +        Reg *d, Reg *v, Reg *s, target_ulong a0, unsigned scale)
>>> +{
>>> +    int i;
>>> +    for (i = 0; i < (2 << SHIFT); i++) {
>>> +        if (v->L(i) >> 31) {
>>> +            target_ulong addr = a0
>>> +                + ((target_ulong)(int32_t)s->L(i) << scale);
>>> +            d->L(i) = cpu_ldl_data_ra(env, addr, GETPC());
>>> +        }
>>> +        v->L(i) = 0;
>>> +    }
>>> +}
>>
>> Better to not modify registers until all potential #GP are raised.
> 
> This is actually intentional: elements of v are zeroes whenever an
> element is read successfully, so that values are not reread when the
> instruction restarts. The manual says "If a fault is triggered by an
> element and delivered, all elements closer to the LSB of the
> destination zmm will be completed".

Ooo, I had never noticed that.


r~




* Re: [PATCH 30/37] target/i386: reimplement 0x0f 0x10-0x17, add AVX
  2022-09-14 22:45     ` Paolo Bonzini
@ 2022-09-15  6:51       ` Richard Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Richard Henderson @ 2022-09-15  6:51 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 9/14/22 23:45, Paolo Bonzini wrote:
>> You've just been moving i64 pieces in the other functions, why is this one different using
>> a gvec move in the middle?  I do wonder if a generic helper moving offset->offset, with
>> the comparison wouldn't be helpful within these functions, even when you know off1 !=
>> off2, due to Q(0) vs Q(1).
> 
> Because this one is the only one that has a VEX.256 version (the
> operand is type "x"
> rather than "dq" as in MOVHLPS, MOVLHPS, MOVHPx).

Ok, then a comment would help.


r~

