* [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series
@ 2021-10-12 10:10 Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 01/30] Hexagon HVX (target/hexagon) README Taylor Simpson
                   ` (29 more replies)
  0 siblings, 30 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

This series adds support for the Hexagon Vector eXtensions (HVX).

These instructions are documented here:
https://developer.qualcomm.com/downloads/qualcomm-hexagon-v66-hvx-programmer-s-reference-manual

Hexagon HVX is a wide vector engine with 128-byte vectors.

See patch 01 Hexagon HVX README for more information.
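
For orientation, a 128-byte vector can be viewed as 32 word lanes, 64
halfword lanes, or 128 byte lanes.  The following is a minimal plain-C
sketch of that lane view and of a word-wise vector add.  The names Vec128
and vec_add_w are illustrative only and are not part of the series (patch 02
defines the real MMVector type).

```c
#include <stdint.h>

#define VEC_BYTES 128  /* HVX vector size in bytes */

/* Illustrative lane view of one 128-byte vector (hypothetical type) */
typedef union {
    uint32_t uw[VEC_BYTES / 4];
    uint16_t uh[VEC_BYTES / 2];
    uint8_t  ub[VEC_BYTES];
} Vec128;

/* Word-wise add: each of the 32 lanes wraps around independently */
static void vec_add_w(Vec128 *d, const Vec128 *u, const Vec128 *v)
{
    for (int i = 0; i < VEC_BYTES / 4; i++) {
        d->uw[i] = u->uw[i] + v->uw[i];
    }
}
```

Unsigned lanes are used here so that wraparound is well defined in C; the
bit pattern of the result is the same as for a signed word add.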


*** Known checkpatch issues ***

The following are known checkpatch errors in the series:
    target/hexagon/gen_semantics.c    Suspicious ; after while (0)
    tests/tcg/hexagon/hvx_misc.c      Spaces around operator in macro invocation


*** Changes in v4 ***
Address feedback from Richard Henderson <richard.henderson@linaro.org>
- Use tcg_gen_gvec_orc/tcg_gen_gvec_andc
Add tests for the following opcodes
    V6_pred_or
    V6_pred_or_n
    V6_pred_and
    V6_pred_and_n
    V6_pred_xor
Probe the HVX stores at the beginning of gen_commit_packet
    See patches 11 (helper functions) and 12 (TCG generation)
Use tcg_constant_* instead of tcg_const_*

*** Changes in v3 ***
Clean up gen_log_vreg_write
- Remove has_vhist parameter
Remove VRegs_updated_tmp from runtime state
- Check there is exactly one tmp for vhist at TCG generation time
Remove VRegs_select from runtime state
Add test_max_temps test to tests/tcg/hexagon/hvx_misc.c
Don't pass slot to HVX helpers

*** Changes in v2 ***
Address feedback from Richard Henderson <richard.henderson@linaro.org>
- Remove zero_vector from CPUHexagonState
- Remove gather_issued from CPUHexagonState
- Remove is_gather_store_insn from DisasContext and CPUHexagonState
- Change VStoreLog.mask to a bitmap
- Change VTCMStoreLog.mask to a bitmap
- Convert future_VRegs, tmp_Vregs to allocate as-needed
- Don't cast away const
- Remove/simplify count_leading_ones_2
- Control HVX dump with CPU_DUMP_FPU
Remove HVX support from target/hexagon/gdbstub.c
- Hexagon uses lldb, which will require support for qRegisterInfo in the
  target-independent gdbstub.  Will contribute this separately.
Convert the histogram instructions to execute at the end of packet commit
- This is necessary to allocate future_VRegs as-needed
Additional tests added in tests/tcg/hexagon
Added helper overrides for several instructions
- As a result, cleaned up utility functions
Additional cleanup
- Change TCGv_ptr to not _local_
- Remove env argument from gen_commit_hvx


Taylor Simpson (30):
  Hexagon HVX (target/hexagon) README
  Hexagon HVX (target/hexagon) add Hexagon Vector eXtensions (HVX) to
    core
  Hexagon HVX (target/hexagon) register names
  Hexagon HVX (target/hexagon) instruction attributes
  Hexagon HVX (target/hexagon) macros
  Hexagon HVX (target/hexagon) import macro definitions
  Hexagon HVX (target/hexagon) semantics generator
  Hexagon HVX (target/hexagon) semantics generator - part 2
  Hexagon HVX (target/hexagon) C preprocessor for decode tree
  Hexagon HVX (target/hexagon) instruction utility functions
  Hexagon HVX (target/hexagon) helper functions
  Hexagon HVX (target/hexagon) TCG generation
  Hexagon HVX (target/hexagon) helper overrides infrastructure
  Hexagon HVX (target/hexagon) helper overrides for histogram
    instructions
  Hexagon HVX (target/hexagon) helper overrides - vector assign & cmov
  Hexagon HVX (target/hexagon) helper overrides - vector add & sub
  Hexagon HVX (target/hexagon) helper overrides - vector shifts
  Hexagon HVX (target/hexagon) helper overrides - vector max/min
  Hexagon HVX (target/hexagon) helper overrides - vector logical ops
  Hexagon HVX (target/hexagon) helper overrides - vector compares
  Hexagon HVX (target/hexagon) helper overrides - vector splat and abs
  Hexagon HVX (target/hexagon) helper overrides - vector loads
  Hexagon HVX (target/hexagon) helper overrides - vector stores
  Hexagon HVX (target/hexagon) import semantics
  Hexagon HVX (target/hexagon) instruction decoding
  Hexagon HVX (target/hexagon) import instruction encodings
  Hexagon HVX (tests/tcg/hexagon) vector_add_int test
  Hexagon HVX (tests/tcg/hexagon) hvx_misc test
  Hexagon HVX (tests/tcg/hexagon) scatter_gather test
  Hexagon HVX (tests/tcg/hexagon) histogram test

 target/hexagon/cpu.h                         |   35 +-
 target/hexagon/gen_tcg_hvx.h                 |  903 +++++++++
 target/hexagon/helper.h                      |   16 +
 target/hexagon/hex_arch_types.h              |    5 +
 target/hexagon/hex_regs.h                    |    1 +
 target/hexagon/insn.h                        |    3 +
 target/hexagon/internal.h                    |    3 +
 target/hexagon/macros.h                      |   22 +
 target/hexagon/mmvec/decode_ext_mmvec.h      |   24 +
 target/hexagon/mmvec/macros.h                |  354 ++++
 target/hexagon/mmvec/mmvec.h                 |   83 +
 target/hexagon/mmvec/system_ext_mmvec.h      |   29 +
 target/hexagon/translate.h                   |   61 +
 tests/tcg/hexagon/hvx_histogram_input.h      |  717 +++++++
 tests/tcg/hexagon/hvx_histogram_row.h        |   24 +
 target/hexagon/attribs_def.h.inc             |   22 +
 target/hexagon/cpu.c                         |   80 +-
 target/hexagon/decode.c                      |   28 +-
 target/hexagon/gen_dectree_import.c          |   13 +
 target/hexagon/gen_semantics.c               |   33 +
 target/hexagon/genptr.c                      |  188 ++
 target/hexagon/mmvec/decode_ext_mmvec.c      |  236 +++
 target/hexagon/mmvec/system_ext_mmvec.c      |   66 +
 target/hexagon/op_helper.c                   |  282 ++-
 target/hexagon/translate.c                   |  243 ++-
 tests/tcg/hexagon/hvx_histogram.c            |   88 +
 tests/tcg/hexagon/hvx_misc.c                 |  469 +++++
 tests/tcg/hexagon/scatter_gather.c           | 1011 ++++++++++
 tests/tcg/hexagon/vector_add_int.c           |   61 +
 target/hexagon/README                        |   81 +-
 target/hexagon/gen_helper_funcs.py           |  115 +-
 target/hexagon/gen_helper_protos.py          |   19 +-
 target/hexagon/gen_tcg_funcs.py              |  257 ++-
 target/hexagon/hex_common.py                 |   13 +
 target/hexagon/imported/allext.idef          |   25 +
 target/hexagon/imported/allext_macros.def    |   25 +
 target/hexagon/imported/allextenc.def        |   20 +
 target/hexagon/imported/allidefs.def         |    1 +
 target/hexagon/imported/encode.def           |    1 +
 target/hexagon/imported/macros.def           |   88 +
 target/hexagon/imported/mmvec/encode_ext.def |  794 ++++++++
 target/hexagon/imported/mmvec/ext.idef       | 2606 ++++++++++++++++++++++++++
 target/hexagon/imported/mmvec/macros.def     |  842 +++++++++
 target/hexagon/meson.build                   |   15 +-
 tests/tcg/hexagon/Makefile.target            |   12 +
 tests/tcg/hexagon/hvx_histogram_row.S        |  294 +++
 46 files changed, 10261 insertions(+), 47 deletions(-)
 create mode 100644 target/hexagon/gen_tcg_hvx.h
 create mode 100644 target/hexagon/mmvec/decode_ext_mmvec.h
 create mode 100644 target/hexagon/mmvec/macros.h
 create mode 100644 target/hexagon/mmvec/mmvec.h
 create mode 100644 target/hexagon/mmvec/system_ext_mmvec.h
 create mode 100644 tests/tcg/hexagon/hvx_histogram_input.h
 create mode 100644 tests/tcg/hexagon/hvx_histogram_row.h
 create mode 100644 target/hexagon/mmvec/decode_ext_mmvec.c
 create mode 100644 target/hexagon/mmvec/system_ext_mmvec.c
 create mode 100644 tests/tcg/hexagon/hvx_histogram.c
 create mode 100644 tests/tcg/hexagon/hvx_misc.c
 create mode 100644 tests/tcg/hexagon/scatter_gather.c
 create mode 100644 tests/tcg/hexagon/vector_add_int.c
 create mode 100644 target/hexagon/imported/allext.idef
 create mode 100644 target/hexagon/imported/allext_macros.def
 create mode 100644 target/hexagon/imported/allextenc.def
 create mode 100644 target/hexagon/imported/mmvec/encode_ext.def
 create mode 100644 target/hexagon/imported/mmvec/ext.idef
 create mode 100755 target/hexagon/imported/mmvec/macros.def
 create mode 100644 tests/tcg/hexagon/hvx_histogram_row.S

-- 
2.7.4



* [PATCH v4 01/30] Hexagon HVX (target/hexagon) README
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 02/30] Hexagon HVX (target/hexagon) add Hexagon Vector eXtensions (HVX) to core Taylor Simpson
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/README | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/target/hexagon/README b/target/hexagon/README
index b0b2435..372e247 100644
--- a/target/hexagon/README
+++ b/target/hexagon/README
@@ -1,9 +1,13 @@
 Hexagon is Qualcomm's very long instruction word (VLIW) digital signal
-processor(DSP).
+processor(DSP).  We also support Hexagon Vector eXtensions (HVX).  HVX
+is a wide vector coprocessor designed for high performance computer vision,
+image processing, machine learning, and other workloads.
 
 The following versions of the Hexagon core are supported
     Scalar core: v67
     https://developer.qualcomm.com/downloads/qualcomm-hexagon-v67-programmer-s-reference-manual
+    HVX extension: v66
+    https://developer.qualcomm.com/downloads/qualcomm-hexagon-v66-hvx-programmer-s-reference-manual
 
 We presented an overview of the project at the 2019 KVM Forum.
     https://kvmforum2019.sched.com/event/Tmwc/qemu-hexagon-automatic-translation-of-the-isa-manual-pseudcode-to-tiny-code-instructions-of-a-vliw-architecture-niccolo-izzo-revng-taylor-simpson-qualcomm-innovation-center
@@ -124,6 +128,71 @@ There are also cases where we brute force the TCG code generation.
 Instructions with multiple definitions are examples.  These require special
 handling because qemu helpers can only return a single value.
 
+For HVX vectors, the generator behaves slightly differently.  The wide vectors
+won't fit in a TCGv or TCGv_i64, so we use TCGv_ptr variables to pass their
+addresses to helper functions.  Here's an example for an HVX vector-add-word
+instruction.
+    static void generate_V6_vaddw(
+                    CPUHexagonState *env,
+                    DisasContext *ctx,
+                    Insn *insn,
+                    Packet *pkt)
+    {
+        const int VdN = insn->regno[0];
+        const intptr_t VdV_off =
+            ctx_future_vreg_off(ctx, VdN, 1, true);
+        TCGv_ptr VdV = tcg_temp_local_new_ptr();
+        tcg_gen_addi_ptr(VdV, cpu_env, VdV_off);
+        const int VuN = insn->regno[1];
+        const intptr_t VuV_off =
+            vreg_src_off(ctx, VuN);
+        TCGv_ptr VuV = tcg_temp_local_new_ptr();
+        const int VvN = insn->regno[2];
+        const intptr_t VvV_off =
+            vreg_src_off(ctx, VvN);
+        TCGv_ptr VvV = tcg_temp_local_new_ptr();
+        tcg_gen_addi_ptr(VuV, cpu_env, VuV_off);
+        tcg_gen_addi_ptr(VvV, cpu_env, VvV_off);
+        TCGv slot = tcg_constant_tl(insn->slot);
+        gen_helper_V6_vaddw(cpu_env, VdV, VuV, VvV, slot);
+        tcg_temp_free(slot);
+        gen_log_vreg_write(ctx, VdV_off, VdN, EXT_DFL, insn->slot, false);
+        ctx_log_vreg_write(ctx, VdN, EXT_DFL, false);
+        tcg_temp_free_ptr(VdV);
+        tcg_temp_free_ptr(VuV);
+        tcg_temp_free_ptr(VvV);
+    }
+
+Notice that we also generate a variable named <operand>_off for each operand of
+the instruction.  This makes it easy to override the instruction semantics with
+functions from tcg-op-gvec.h.  Here's the override for this instruction.
+    #define fGEN_TCG_V6_vaddw(SHORTCODE) \
+        tcg_gen_gvec_add(MO_32, VdV_off, VuV_off, VvV_off, \
+                         sizeof(MMVector), sizeof(MMVector))
+
+Finally, we notice that the override doesn't use the TCGv_ptr variables, so
+we don't generate them when an override is present.  Here is what we generate
+when the override is present.
+    static void generate_V6_vaddw(
+                    CPUHexagonState *env,
+                    DisasContext *ctx,
+                    Insn *insn,
+                    Packet *pkt)
+    {
+        const int VdN = insn->regno[0];
+        const intptr_t VdV_off =
+            ctx_future_vreg_off(ctx, VdN, 1, true);
+        const int VuN = insn->regno[1];
+        const intptr_t VuV_off =
+            vreg_src_off(ctx, VuN);
+        const int VvN = insn->regno[2];
+        const intptr_t VvV_off =
+            vreg_src_off(ctx, VvN);
+        fGEN_TCG_V6_vaddw({ fHIDE(int i;) fVFOREACH(32, i) { VdV.w[i] = VuV.w[i] + VvV.w[i] ; } });
+        gen_log_vreg_write(ctx, VdV_off, VdN, EXT_DFL, insn->slot, false);
+        ctx_log_vreg_write(ctx, VdN, EXT_DFL, false);
+    }
+
 In addition to instruction semantics, we use a generator to create the decode
 tree.  This generation is also a two step process.  The first step is to run
 target/hexagon/gen_dectree_import.c to produce
@@ -140,6 +209,7 @@ runtime information for each thread and contains stuff like the GPR and
 predicate registers.
 
 macros.h
+mmvec/macros.h
 
 The Hexagon arch lib relies heavily on macros for the instruction semantics.
 This is a great advantage for qemu because we can override them for different
@@ -203,6 +273,15 @@ During runtime, the following fields in CPUHexagonState (see cpu.h) are used
     pred_written          boolean indicating if predicate was written
     mem_log_stores        record of the stores (indexed by slot)
 
+For Hexagon Vector eXtensions (HVX), the following fields are used
+    VRegs                       Vector registers
+    future_VRegs                Registers to be stored during packet commit
+    tmp_VRegs                   Temporary registers *not* stored during commit
+    VRegs_updated               Mask of predicated vector writes
+    QRegs                       Q (vector predicate) registers
+    future_QRegs                Registers to be stored during packet commit
+    QRegs_updated               Mask of predicated vector writes
+
 *** Debugging ***
 
 You can turn on a lot of debugging by changing the HEX_DEBUG macro to 1 in
-- 
2.7.4



* [PATCH v4 02/30] Hexagon HVX (target/hexagon) add Hexagon Vector eXtensions (HVX) to core
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 01/30] Hexagon HVX (target/hexagon) README Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 03/30] Hexagon HVX (target/hexagon) register names Taylor Simpson
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

HVX is a set of wide vector instructions.  Machine state includes
    vector registers (VRegs)
    vector predicate registers (QRegs)
    temporary registers for intermediate values
    store buffer (masked stores and scatter/gather)
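
To illustrate the store buffer entry for masked stores, here is a simplified
sketch of how a logged vector store with a per-byte mask could be applied to
memory at packet commit.  It is not the code from this series:
StoreLogSketch, commit_masked_store, and the flat mem array are hypothetical,
and the mask is a plain bool array rather than a QEMU bitmap.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 128

/* Hypothetical, simplified store-log entry */
typedef struct {
    size_t  va;                 /* destination offset into mem */
    int     size;               /* number of bytes logged */
    bool    mask[VEC_BYTES];    /* true = this byte is written */
    uint8_t data[VEC_BYTES];    /* bytes captured during the packet */
} StoreLogSketch;

/* Apply the logged store: only bytes whose mask entry is set reach memory */
static void commit_masked_store(uint8_t *mem, const StoreLogSketch *log)
{
    for (int i = 0; i < log->size; i++) {
        if (log->mask[i]) {
            mem[log->va + i] = log->data[i];
        }
    }
}
```

Logging the data and mask first and writing memory only at commit keeps the
packet atomic: nothing is stored until every instruction in the packet has
executed.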

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/cpu.h            | 35 ++++++++++++++++-
 target/hexagon/hex_arch_types.h |  5 +++
 target/hexagon/insn.h           |  3 ++
 target/hexagon/internal.h       |  3 ++
 target/hexagon/mmvec/mmvec.h    | 83 +++++++++++++++++++++++++++++++++++++++++
 target/hexagon/cpu.c            | 78 ++++++++++++++++++++++++++++++++++++--
 6 files changed, 201 insertions(+), 6 deletions(-)
 create mode 100644 target/hexagon/mmvec/mmvec.h

diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h
index f7d0438..e696699 100644
--- a/target/hexagon/cpu.h
+++ b/target/hexagon/cpu.h
@@ -26,6 +26,7 @@ typedef struct CPUHexagonState CPUHexagonState;
 #include "qemu-common.h"
 #include "exec/cpu-defs.h"
 #include "hex_regs.h"
+#include "mmvec/mmvec.h"
 
 #define NUM_PREGS 4
 #define TOTAL_PER_THREAD_REGS 64
@@ -34,6 +35,7 @@ typedef struct CPUHexagonState CPUHexagonState;
 #define STORES_MAX 2
 #define REG_WRITES_MAX 32
 #define PRED_WRITES_MAX 5                   /* 4 insns + endloop */
+#define VSTORES_MAX 2
 
 #define TYPE_HEXAGON_CPU "hexagon-cpu"
 
@@ -52,6 +54,13 @@ typedef struct {
     uint64_t data64;
 } MemLog;
 
+typedef struct {
+    target_ulong va;
+    int size;
+    DECLARE_BITMAP(mask, MAX_VEC_SIZE_BYTES / 8) QEMU_ALIGNED(16);
+    MMVector data QEMU_ALIGNED(16);
+} VStoreLog;
+
 #define EXEC_STATUS_OK          0x0000
 #define EXEC_STATUS_STOP        0x0002
 #define EXEC_STATUS_REPLAY      0x0010
@@ -64,6 +73,9 @@ typedef struct {
 #define CLEAR_EXCEPTION         (env->status &= (~EXEC_STATUS_EXCEPTION))
 #define SET_EXCEPTION           (env->status |= EXEC_STATUS_EXCEPTION)
 
+/* Maximum number of vector temps in a packet */
+#define VECTOR_TEMPS_MAX            4
+
 struct CPUHexagonState {
     target_ulong gpr[TOTAL_PER_THREAD_REGS];
     target_ulong pred[NUM_PREGS];
@@ -97,8 +109,27 @@ struct CPUHexagonState {
     target_ulong llsc_val;
     uint64_t     llsc_val_i64;
 
-    target_ulong is_gather_store_insn;
-    target_ulong gather_issued;
+    MMVector VRegs[NUM_VREGS] QEMU_ALIGNED(16);
+    MMVector future_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
+    MMVector tmp_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
+
+    VRegMask VRegs_updated;
+
+    MMQReg QRegs[NUM_QREGS] QEMU_ALIGNED(16);
+    MMQReg future_QRegs[NUM_QREGS] QEMU_ALIGNED(16);
+    QRegMask QRegs_updated;
+
+    /* Temporaries used within instructions */
+    MMVectorPair VuuV QEMU_ALIGNED(16);
+    MMVectorPair VvvV QEMU_ALIGNED(16);
+    MMVectorPair VxxV QEMU_ALIGNED(16);
+    MMVector     vtmp QEMU_ALIGNED(16);
+    MMQReg       qtmp QEMU_ALIGNED(16);
+
+    VStoreLog vstore[VSTORES_MAX];
+    target_ulong vstore_pending[VSTORES_MAX];
+    bool vtcm_pending;
+    VTCMStoreLog vtcm_log;
 };
 
 #define HEXAGON_CPU_CLASS(klass) \
diff --git a/target/hexagon/hex_arch_types.h b/target/hexagon/hex_arch_types.h
index d721e1f..78ad607 100644
--- a/target/hexagon/hex_arch_types.h
+++ b/target/hexagon/hex_arch_types.h
@@ -19,6 +19,7 @@
 #define HEXAGON_ARCH_TYPES_H
 
 #include "qemu/osdep.h"
+#include "mmvec/mmvec.h"
 #include "qemu/int128.h"
 
 /*
@@ -35,4 +36,8 @@ typedef uint64_t    size8u_t;
 typedef int64_t     size8s_t;
 typedef Int128      size16s_t;
 
+typedef MMVector          mmvector_t;
+typedef MMVectorPair      mmvector_pair_t;
+typedef MMQReg            mmqret_t;
+
 #endif
diff --git a/target/hexagon/insn.h b/target/hexagon/insn.h
index 2e34591..aa26389 100644
--- a/target/hexagon/insn.h
+++ b/target/hexagon/insn.h
@@ -67,6 +67,9 @@ struct Packet {
     bool pkt_has_store_s0;
     bool pkt_has_store_s1;
 
+    bool pkt_has_hvx;
+    Insn *vhist_insn;
+
     Insn insn[INSTRUCTIONS_MAX];
 };
 
diff --git a/target/hexagon/internal.h b/target/hexagon/internal.h
index 6b20aff..82ac304 100644
--- a/target/hexagon/internal.h
+++ b/target/hexagon/internal.h
@@ -31,6 +31,9 @@
 
 int hexagon_gdb_read_register(CPUState *cpu, GByteArray *buf, int reg);
 int hexagon_gdb_write_register(CPUState *cpu, uint8_t *buf, int reg);
+
+void hexagon_debug_vreg(CPUHexagonState *env, int regnum);
+void hexagon_debug_qreg(CPUHexagonState *env, int regnum);
 void hexagon_debug(CPUHexagonState *env);
 
 extern const char * const hexagon_regnames[TOTAL_PER_THREAD_REGS];
diff --git a/target/hexagon/mmvec/mmvec.h b/target/hexagon/mmvec/mmvec.h
new file mode 100644
index 0000000..6196c52
--- /dev/null
+++ b/target/hexagon/mmvec/mmvec.h
@@ -0,0 +1,83 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HEXAGON_MMVEC_H
+#define HEXAGON_MMVEC_H
+
+#define MAX_VEC_SIZE_LOGBYTES 7
+#define MAX_VEC_SIZE_BYTES  (1 << MAX_VEC_SIZE_LOGBYTES)
+
+#define NUM_VREGS           32
+#define NUM_QREGS           4
+
+typedef uint32_t VRegMask; /* at least NUM_VREGS bits */
+typedef uint32_t QRegMask; /* at least NUM_QREGS bits */
+
+#define VECTOR_SIZE_BYTE    (fVECSIZE())
+
+typedef union {
+    uint64_t ud[MAX_VEC_SIZE_BYTES / 8];
+    int64_t   d[MAX_VEC_SIZE_BYTES / 8];
+    uint32_t uw[MAX_VEC_SIZE_BYTES / 4];
+    int32_t   w[MAX_VEC_SIZE_BYTES / 4];
+    uint16_t uh[MAX_VEC_SIZE_BYTES / 2];
+    int16_t   h[MAX_VEC_SIZE_BYTES / 2];
+    uint8_t  ub[MAX_VEC_SIZE_BYTES / 1];
+    int8_t    b[MAX_VEC_SIZE_BYTES / 1];
+} MMVector;
+
+typedef union {
+    uint64_t ud[2 * MAX_VEC_SIZE_BYTES / 8];
+    int64_t   d[2 * MAX_VEC_SIZE_BYTES / 8];
+    uint32_t uw[2 * MAX_VEC_SIZE_BYTES / 4];
+    int32_t   w[2 * MAX_VEC_SIZE_BYTES / 4];
+    uint16_t uh[2 * MAX_VEC_SIZE_BYTES / 2];
+    int16_t   h[2 * MAX_VEC_SIZE_BYTES / 2];
+    uint8_t  ub[2 * MAX_VEC_SIZE_BYTES / 1];
+    int8_t    b[2 * MAX_VEC_SIZE_BYTES / 1];
+    MMVector v[2];
+} MMVectorPair;
+
+typedef union {
+    uint64_t ud[MAX_VEC_SIZE_BYTES / 8 / 8];
+    int64_t   d[MAX_VEC_SIZE_BYTES / 8 / 8];
+    uint32_t uw[MAX_VEC_SIZE_BYTES / 4 / 8];
+    int32_t   w[MAX_VEC_SIZE_BYTES / 4 / 8];
+    uint16_t uh[MAX_VEC_SIZE_BYTES / 2 / 8];
+    int16_t   h[MAX_VEC_SIZE_BYTES / 2 / 8];
+    uint8_t  ub[MAX_VEC_SIZE_BYTES / 1 / 8];
+    int8_t    b[MAX_VEC_SIZE_BYTES / 1 / 8];
+} MMQReg;
+
+typedef struct {
+    MMVector data;
+    DECLARE_BITMAP(mask, MAX_VEC_SIZE_BYTES);
+    int size;
+    target_ulong va[MAX_VEC_SIZE_BYTES];
+    bool op;
+    int op_size;
+} VTCMStoreLog;
+
+
+/* Types of vector register assignment */
+typedef enum {
+    EXT_DFL,      /* Default */
+    EXT_NEW,      /* New - value used in the same packet */
+    EXT_TMP       /* Temp - value used but not stored to register */
+} VRegWriteType;
+
+#endif
diff --git a/target/hexagon/cpu.c b/target/hexagon/cpu.c
index 3338365..989bd76 100644
--- a/target/hexagon/cpu.c
+++ b/target/hexagon/cpu.c
@@ -113,7 +113,66 @@ static void print_reg(FILE *f, CPUHexagonState *env, int regnum)
                  hexagon_regnames[regnum], value);
 }
 
-static void hexagon_dump(CPUHexagonState *env, FILE *f)
+static void print_vreg(FILE *f, CPUHexagonState *env, int regnum,
+                       bool skip_if_zero)
+{
+    if (skip_if_zero) {
+        bool nonzero_found = false;
+        for (int i = 0; i < MAX_VEC_SIZE_BYTES; i++) {
+            if (env->VRegs[regnum].ub[i] != 0) {
+                nonzero_found = true;
+                break;
+            }
+        }
+        if (!nonzero_found) {
+            return;
+        }
+    }
+
+    qemu_fprintf(f, "  v%d = ( ", regnum);
+    qemu_fprintf(f, "0x%02x", env->VRegs[regnum].ub[MAX_VEC_SIZE_BYTES - 1]);
+    for (int i = MAX_VEC_SIZE_BYTES - 2; i >= 0; i--) {
+        qemu_fprintf(f, ", 0x%02x", env->VRegs[regnum].ub[i]);
+    }
+    qemu_fprintf(f, " )\n");
+}
+
+void hexagon_debug_vreg(CPUHexagonState *env, int regnum)
+{
+    print_vreg(stdout, env, regnum, false);
+}
+
+static void print_qreg(FILE *f, CPUHexagonState *env, int regnum,
+                       bool skip_if_zero)
+{
+    if (skip_if_zero) {
+        bool nonzero_found = false;
+        for (int i = 0; i < MAX_VEC_SIZE_BYTES / 8; i++) {
+            if (env->QRegs[regnum].ub[i] != 0) {
+                nonzero_found = true;
+                break;
+            }
+        }
+        if (!nonzero_found) {
+            return;
+        }
+    }
+
+    qemu_fprintf(f, "  q%d = ( ", regnum);
+    qemu_fprintf(f, "0x%02x",
+                 env->QRegs[regnum].ub[MAX_VEC_SIZE_BYTES / 8 - 1]);
+    for (int i = MAX_VEC_SIZE_BYTES / 8 - 2; i >= 0; i--) {
+        qemu_fprintf(f, ", 0x%02x", env->QRegs[regnum].ub[i]);
+    }
+    qemu_fprintf(f, " )\n");
+}
+
+void hexagon_debug_qreg(CPUHexagonState *env, int regnum)
+{
+    print_qreg(stdout, env, regnum, false);
+}
+
+static void hexagon_dump(CPUHexagonState *env, FILE *f, int flags)
 {
     HexagonCPU *cpu = env_archcpu(env);
 
@@ -159,6 +218,17 @@ static void hexagon_dump(CPUHexagonState *env, FILE *f)
     print_reg(f, env, HEX_REG_CS1);
 #endif
     qemu_fprintf(f, "}\n");
+
+    if (flags & CPU_DUMP_FPU) {
+        qemu_fprintf(f, "Vector Registers = {\n");
+        for (int i = 0; i < NUM_VREGS; i++) {
+            print_vreg(f, env, i, true);
+        }
+        for (int i = 0; i < NUM_QREGS; i++) {
+            print_qreg(f, env, i, true);
+        }
+        qemu_fprintf(f, "}\n");
+    }
 }
 
 static void hexagon_dump_state(CPUState *cs, FILE *f, int flags)
@@ -166,12 +236,12 @@ static void hexagon_dump_state(CPUState *cs, FILE *f, int flags)
     HexagonCPU *cpu = HEXAGON_CPU(cs);
     CPUHexagonState *env = &cpu->env;
 
-    hexagon_dump(env, f);
+    hexagon_dump(env, f, flags);
 }
 
 void hexagon_debug(CPUHexagonState *env)
 {
-    hexagon_dump(env, stdout);
+    hexagon_dump(env, stdout, CPU_DUMP_FPU);
 }
 
 static void hexagon_cpu_set_pc(CPUState *cs, vaddr value)
@@ -292,7 +362,7 @@ static void hexagon_cpu_class_init(ObjectClass *c, void *data)
     cc->set_pc = hexagon_cpu_set_pc;
     cc->gdb_read_register = hexagon_gdb_read_register;
     cc->gdb_write_register = hexagon_gdb_write_register;
-    cc->gdb_num_core_regs = TOTAL_PER_THREAD_REGS;
+    cc->gdb_num_core_regs = TOTAL_PER_THREAD_REGS + NUM_VREGS + NUM_QREGS;
     cc->gdb_stop_before_watchpoint = true;
     cc->disas_set_info = hexagon_cpu_disas_set_info;
     cc->tcg_ops = &hexagon_tcg_ops;
-- 
2.7.4



* [PATCH v4 03/30] Hexagon HVX (target/hexagon) register names
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 01/30] Hexagon HVX (target/hexagon) README Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 02/30] Hexagon HVX (target/hexagon) add Hexagon Vector eXtensions (HVX) to core Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 04/30] Hexagon HVX (target/hexagon) instruction attributes Taylor Simpson
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/hex_regs.h | 1 +
 target/hexagon/cpu.c      | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/hexagon/hex_regs.h b/target/hexagon/hex_regs.h
index f291911..e1b3149 100644
--- a/target/hexagon/hex_regs.h
+++ b/target/hexagon/hex_regs.h
@@ -76,6 +76,7 @@ enum {
     /* Use reserved control registers for qemu execution counts */
     HEX_REG_QEMU_PKT_CNT      = 52,
     HEX_REG_QEMU_INSN_CNT     = 53,
+    HEX_REG_QEMU_HVX_CNT      = 54,
     HEX_REG_UTIMERLO          = 62,
     HEX_REG_UTIMERHI          = 63,
 };
diff --git a/target/hexagon/cpu.c b/target/hexagon/cpu.c
index 989bd76..3bd3f10 100644
--- a/target/hexagon/cpu.c
+++ b/target/hexagon/cpu.c
@@ -59,7 +59,7 @@ const char * const hexagon_regnames[TOTAL_PER_THREAD_REGS] = {
   "r24", "r25", "r26", "r27", "r28",  "r29", "r30", "r31",
   "sa0", "lc0", "sa1", "lc1", "p3_0", "c5",  "m0",  "m1",
   "usr", "pc",  "ugp", "gp",  "cs0",  "cs1", "c14", "c15",
-  "c16", "c17", "c18", "c19", "pkt_cnt",  "insn_cnt", "c22", "c23",
+  "c16", "c17", "c18", "c19", "pkt_cnt",  "insn_cnt", "hvx_cnt", "c23",
   "c24", "c25", "c26", "c27", "c28",  "c29", "c30", "c31",
 };
 
-- 
2.7.4



* [PATCH v4 04/30] Hexagon HVX (target/hexagon) instruction attributes
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (2 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 03/30] Hexagon HVX (target/hexagon) register names Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 05/30] Hexagon HVX (target/hexagon) macros Taylor Simpson
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/attribs_def.h.inc | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/target/hexagon/attribs_def.h.inc b/target/hexagon/attribs_def.h.inc
index 3815509..4138a7a 100644
--- a/target/hexagon/attribs_def.h.inc
+++ b/target/hexagon/attribs_def.h.inc
@@ -41,6 +41,27 @@ DEF_ATTRIB(STORE, "Stores to memory", "", "")
 DEF_ATTRIB(MEMLIKE, "Memory-like instruction", "", "")
 DEF_ATTRIB(MEMLIKE_PACKET_RULES, "follows Memory-like packet rules", "", "")
 
+/* V6 Vector attributes */
+DEF_ATTRIB(CVI, "Executes on the HVX extension", "", "")
+
+DEF_ATTRIB(CVI_NEW, "New value memory instruction executes on HVX", "", "")
+DEF_ATTRIB(CVI_VM, "Memory instruction executes on HVX", "", "")
+DEF_ATTRIB(CVI_VP, "Permute instruction executes on HVX", "", "")
+DEF_ATTRIB(CVI_VP_VS, "Double vector permute/shft insn executes on HVX", "", "")
+DEF_ATTRIB(CVI_VX, "Multiply instruction executes on HVX", "", "")
+DEF_ATTRIB(CVI_VX_DV, "Double vector multiply insn executes on HVX", "", "")
+DEF_ATTRIB(CVI_VS, "Shift instruction executes on HVX", "", "")
+DEF_ATTRIB(CVI_VS_VX, "Permute/shift and multiply insn executes on HVX", "", "")
+DEF_ATTRIB(CVI_VA, "ALU instruction executes on HVX", "", "")
+DEF_ATTRIB(CVI_VA_DV, "Double vector alu instruction executes on HVX", "", "")
+DEF_ATTRIB(CVI_4SLOT, "Consumes all the vector execution resources", "", "")
+DEF_ATTRIB(CVI_TMP, "Transient Memory Load not written to register", "", "")
+DEF_ATTRIB(CVI_GATHER, "CVI Gather operation", "", "")
+DEF_ATTRIB(CVI_SCATTER, "CVI Scatter operation", "", "")
+DEF_ATTRIB(CVI_SCATTER_RELEASE, "CVI Store Release for scatter", "", "")
+DEF_ATTRIB(CVI_TMP_DST, "CVI instruction that doesn't write a register", "", "")
+DEF_ATTRIB(CVI_SLOT23, "Can execute in slot 2 or slot 3 (HVX)", "", "")
+
 
 /* Change-of-flow attributes */
 DEF_ATTRIB(JUMP, "Jump-type instruction", "", "")
@@ -86,6 +107,7 @@ DEF_ATTRIB(HWLOOP1_END, "Ends HW loop1", "", "")
 DEF_ATTRIB(DCZEROA, "dczeroa type", "", "")
 DEF_ATTRIB(ICFLUSHOP, "icflush op type", "", "")
 DEF_ATTRIB(DCFLUSHOP, "dcflush op type", "", "")
+DEF_ATTRIB(L2FLUSHOP, "l2flush op type", "", "")
 DEF_ATTRIB(DCFETCH, "dcfetch type", "", "")
 
 DEF_ATTRIB(L2FETCH, "Instruction is l2fetch type", "", "")
-- 
2.7.4



* [PATCH v4 05/30] Hexagon HVX (target/hexagon) macros
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (3 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 04/30] Hexagon HVX (target/hexagon) instruction attributes Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 06/30] Hexagon HVX (target/hexagon) import macro definitions Taylor Simpson
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Add macros to interface with the generator
Add macros referenced in instruction semantics

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
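Note for readers (illustration only, not part of the patch): the fVSAT*
macros added to macros.h apply the same clamp as the existing fSAT* macros
but do not call fSET_OVERFLOW(). The standalone sketch below restates that
clamp in plain C. The helper names (sxtn, zxtn, vsat_n, vsatu_n) are
hypothetical stand-ins, not QEMU functions, and the sketch avoids
extract64/sextract64 so it compiles on its own. It assumes 1 <= n <= 32.

```c
#include <stdint.h>

/* Sign-extend the low n bits of val (hypothetical stand-in for fSXTN). */
static int64_t sxtn(int n, int64_t val)
{
    uint64_t low = (uint64_t)val & ((1ULL << n) - 1);
    uint64_t sign = 1ULL << (n - 1);
    /* XOR/subtract trick: flips the sign bit, then removes its weight. */
    return (int64_t)((low ^ sign) - sign);
}

/* Zero-extend the low n bits of val (hypothetical stand-in for fZXTN). */
static int64_t zxtn(int n, int64_t val)
{
    return (int64_t)((uint64_t)val & ((1ULL << n) - 1));
}

/* Signed saturation to n bits, following the shape of fVSATN/fVSATVALN. */
static int64_t vsat_n(int n, int64_t val)
{
    if (sxtn(n, val) == val) {
        return val;                      /* already representable */
    }
    return (val < 0) ? -(1LL << (n - 1)) : (1LL << (n - 1)) - 1;
}

/* Unsigned saturation to n bits, following fVSATUN/fVSATUVALN. */
static int64_t vsatu_n(int n, int64_t val)
{
    if (zxtn(n, val) == val) {
        return val;                      /* already representable */
    }
    return (val < 0) ? 0 : (1LL << n) - 1;
}
```

With these helpers, vsat_n(8, 200) yields 127 and vsatu_n(8, -1) yields 0,
which is what fVSATB and fVSATUB are expected to produce for the same inputs.
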
 target/hexagon/macros.h       |  22 +++
 target/hexagon/mmvec/macros.h | 354 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 376 insertions(+)
 create mode 100644 target/hexagon/mmvec/macros.h

diff --git a/target/hexagon/macros.h b/target/hexagon/macros.h
index 44e9b85..4421285 100644
--- a/target/hexagon/macros.h
+++ b/target/hexagon/macros.h
@@ -266,6 +266,10 @@ static inline void gen_pred_cancel(TCGv pred, int slot_num)
 
 #define fNEWREG_ST(VAL) (VAL)
 
+#define fVSATUVALN(N, VAL) \
+    ({ \
+        (((int)(VAL)) < 0) ? 0 : ((1LL << (N)) - 1); \
+    })
 #define fSATUVALN(N, VAL) \
     ({ \
         fSET_OVERFLOW(); \
@@ -276,10 +280,16 @@ static inline void gen_pred_cancel(TCGv pred, int slot_num)
         fSET_OVERFLOW(); \
         ((VAL) < 0) ? (-(1LL << ((N) - 1))) : ((1LL << ((N) - 1)) - 1); \
     })
+#define fVSATVALN(N, VAL) \
+    ({ \
+        ((VAL) < 0) ? (-(1LL << ((N) - 1))) : ((1LL << ((N) - 1)) - 1); \
+    })
 #define fZXTN(N, M, VAL) (((N) != 0) ? extract64((VAL), 0, (N)) : 0LL)
 #define fSXTN(N, M, VAL) (((N) != 0) ? sextract64((VAL), 0, (N)) : 0LL)
 #define fSATN(N, VAL) \
     ((fSXTN(N, 64, VAL) == (VAL)) ? (VAL) : fSATVALN(N, VAL))
+#define fVSATN(N, VAL) \
+    ((fSXTN(N, 64, VAL) == (VAL)) ? (VAL) : fVSATVALN(N, VAL))
 #define fADDSAT64(DST, A, B) \
     do { \
         uint64_t __a = fCAST8u(A); \
@@ -302,12 +312,18 @@ static inline void gen_pred_cancel(TCGv pred, int slot_num)
             DST = __sum; \
         } \
     } while (0)
+#define fVSATUN(N, VAL) \
+    ((fZXTN(N, 64, VAL) == (VAL)) ? (VAL) : fVSATUVALN(N, VAL))
 #define fSATUN(N, VAL) \
     ((fZXTN(N, 64, VAL) == (VAL)) ? (VAL) : fSATUVALN(N, VAL))
 #define fSATH(VAL) (fSATN(16, VAL))
 #define fSATUH(VAL) (fSATUN(16, VAL))
+#define fVSATH(VAL) (fVSATN(16, VAL))
+#define fVSATUH(VAL) (fVSATUN(16, VAL))
 #define fSATUB(VAL) (fSATUN(8, VAL))
 #define fSATB(VAL) (fSATN(8, VAL))
+#define fVSATUB(VAL) (fVSATUN(8, VAL))
+#define fVSATB(VAL) (fVSATN(8, VAL))
 #define fIMMEXT(IMM) (IMM = IMM)
 #define fMUST_IMMEXT(IMM) fIMMEXT(IMM)
 
@@ -414,6 +430,8 @@ static inline TCGv gen_read_ireg(TCGv result, TCGv val, int shift)
 #define fCAST4s(A) ((int32_t)(A))
 #define fCAST8u(A) ((uint64_t)(A))
 #define fCAST8s(A) ((int64_t)(A))
+#define fCAST2_2s(A) ((int16_t)(A))
+#define fCAST2_2u(A) ((uint16_t)(A))
 #define fCAST4_4s(A) ((int32_t)(A))
 #define fCAST4_4u(A) ((uint32_t)(A))
 #define fCAST4_8s(A) ((int64_t)((int32_t)(A)))
@@ -511,7 +529,9 @@ static inline TCGv gen_read_ireg(TCGv result, TCGv val, int shift)
 #define fPM_M(REG, MVAL)    do { REG = REG + (MVAL); } while (0)
 #endif
 #define fSCALE(N, A) (((int64_t)(A)) << N)
+#define fVSATW(A) fVSATN(32, ((long long)A))
 #define fSATW(A) fSATN(32, ((long long)A))
+#define fVSAT(A) fVSATN(32, (A))
 #define fSAT(A) fSATN(32, (A))
 #define fSAT_ORIG_SHL(A, ORIG_REG) \
     ((((int32_t)((fSAT(A)) ^ ((int32_t)(ORIG_REG)))) < 0) \
@@ -648,12 +668,14 @@ static inline TCGv gen_read_ireg(TCGv result, TCGv val, int shift)
             fSETBIT(j, DST, VAL); \
         } \
     } while (0)
+#define fCOUNTONES_2(VAL) ctpop16(VAL)
 #define fCOUNTONES_4(VAL) ctpop32(VAL)
 #define fCOUNTONES_8(VAL) ctpop64(VAL)
 #define fBREV_8(VAL) revbit64(VAL)
 #define fBREV_4(VAL) revbit32(VAL)
 #define fCL1_8(VAL) clo64(VAL)
 #define fCL1_4(VAL) clo32(VAL)
+#define fCL1_2(VAL) (clz32(~(uint16_t)(VAL) & 0xffff) - 16)
 #define fINTERLEAVE(ODD, EVEN) interleave(ODD, EVEN)
 #define fDEINTERLEAVE(MIXED) deinterleave(MIXED)
 #define fHIDE(A) A
diff --git a/target/hexagon/mmvec/macros.h b/target/hexagon/mmvec/macros.h
new file mode 100644
index 0000000..eff882c
--- /dev/null
+++ b/target/hexagon/mmvec/macros.h
@@ -0,0 +1,354 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HEXAGON_MMVEC_MACROS_H
+#define HEXAGON_MMVEC_MACROS_H
+
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+#include "arch.h"
+#include "mmvec/system_ext_mmvec.h"
+
+#ifndef QEMU_GENERATE
+#define VdV      (*(MMVector *)(VdV_void))
+#define VsV      (*(MMVector *)(VsV_void))
+#define VuV      (*(MMVector *)(VuV_void))
+#define VvV      (*(MMVector *)(VvV_void))
+#define VwV      (*(MMVector *)(VwV_void))
+#define VxV      (*(MMVector *)(VxV_void))
+#define VyV      (*(MMVector *)(VyV_void))
+
+#define VddV     (*(MMVectorPair *)(VddV_void))
+#define VuuV     (*(MMVectorPair *)(VuuV_void))
+#define VvvV     (*(MMVectorPair *)(VvvV_void))
+#define VxxV     (*(MMVectorPair *)(VxxV_void))
+
+#define QeV      (*(MMQReg *)(QeV_void))
+#define QdV      (*(MMQReg *)(QdV_void))
+#define QsV      (*(MMQReg *)(QsV_void))
+#define QtV      (*(MMQReg *)(QtV_void))
+#define QuV      (*(MMQReg *)(QuV_void))
+#define QvV      (*(MMQReg *)(QvV_void))
+#define QxV      (*(MMQReg *)(QxV_void))
+#endif
+
+#define LOG_VTCM_BYTE(VA, MASK, VAL, IDX) \
+    do { \
+        env->vtcm_log.data.ub[IDX] = (VAL); \
+        if (MASK) { \
+            set_bit((IDX), env->vtcm_log.mask); \
+        } else { \
+            clear_bit((IDX), env->vtcm_log.mask); \
+        } \
+        env->vtcm_log.va[IDX] = (VA); \
+    } while (0)
+
+#define fNOTQ(VAL) \
+    ({ \
+        MMQReg _ret;  \
+        int _i_;  \
+        for (_i_ = 0; _i_ < fVECSIZE() / 64; _i_++) { \
+            _ret.ud[_i_] = ~VAL.ud[_i_]; \
+        } \
+        _ret;\
+     })
+#define fGETQBITS(REG, WIDTH, MASK, BITNO) \
+    ((MASK) & (REG.w[(BITNO) >> 5] >> ((BITNO) & 0x1f)))
+#define fGETQBIT(REG, BITNO) fGETQBITS(REG, 1, 1, BITNO)
+#define fGENMASKW(QREG, IDX) \
+    (((fGETQBIT(QREG, (IDX * 4 + 0)) ? 0xFF : 0x0) << 0)  | \
+     ((fGETQBIT(QREG, (IDX * 4 + 1)) ? 0xFF : 0x0) << 8)  | \
+     ((fGETQBIT(QREG, (IDX * 4 + 2)) ? 0xFF : 0x0) << 16) | \
+     ((fGETQBIT(QREG, (IDX * 4 + 3)) ? 0xFF : 0x0) << 24))
+#define fGETNIBBLE(IDX, SRC) (fSXTN(4, 8, (SRC >> (4 * IDX)) & 0xF))
+#define fGETCRUMB(IDX, SRC) (fSXTN(2, 8, (SRC >> (2 * IDX)) & 0x3))
+#define fGETCRUMB_SYMMETRIC(IDX, SRC) \
+    ((fGETCRUMB(IDX, SRC) >= 0 ? (2 - fGETCRUMB(IDX, SRC)) \
+                               : fGETCRUMB(IDX, SRC)))
+#define fGENMASKH(QREG, IDX) \
+    (((fGETQBIT(QREG, (IDX * 2 + 0)) ? 0xFF : 0x0) << 0) | \
+     ((fGETQBIT(QREG, (IDX * 2 + 1)) ? 0xFF : 0x0) << 8))
+#define fGETMASKW(VREG, QREG, IDX) (VREG.w[IDX] & fGENMASKW((QREG), IDX))
+#define fGETMASKH(VREG, QREG, IDX) (VREG.h[IDX] & fGENMASKH((QREG), IDX))
+#define fCONDMASK8(QREG, IDX, YESVAL, NOVAL) \
+    (fGETQBIT(QREG, IDX) ? (YESVAL) : (NOVAL))
+#define fCONDMASK16(QREG, IDX, YESVAL, NOVAL) \
+    ((fGENMASKH(QREG, IDX) & (YESVAL)) | \
+     (fGENMASKH(fNOTQ(QREG), IDX) & (NOVAL)))
+#define fCONDMASK32(QREG, IDX, YESVAL, NOVAL) \
+    ((fGENMASKW(QREG, IDX) & (YESVAL)) | \
+     (fGENMASKW(fNOTQ(QREG), IDX) & (NOVAL)))
+#define fSETQBITS(REG, WIDTH, MASK, BITNO, VAL) \
+    do { \
+        uint32_t __TMP = (VAL); \
+        REG.w[(BITNO) >> 5] &= ~((MASK) << ((BITNO) & 0x1f)); \
+        REG.w[(BITNO) >> 5] |= (((__TMP) & (MASK)) << ((BITNO) & 0x1f)); \
+    } while (0)
+#define fSETQBIT(REG, BITNO, VAL) fSETQBITS(REG, 1, 1, BITNO, VAL)
+#define fVBYTES() (fVECSIZE())
+#define fVALIGN(ADDR, LOG2_ALIGNMENT) (ADDR = ADDR & ~(LOG2_ALIGNMENT - 1))
+#define fVLASTBYTE(ADDR, LOG2_ALIGNMENT) (ADDR = ADDR | (LOG2_ALIGNMENT - 1))
+#define fVELEM(WIDTH) ((fVECSIZE() * 8) / WIDTH)
+#define fVECLOGSIZE() (7)
+#define fVECSIZE() (1 << fVECLOGSIZE())
+#define fSWAPB(A, B) do { uint8_t tmp = A; A = B; B = tmp; } while (0)
+#define fV_AL_CHECK(EA, MASK) \
+    if ((EA) & (MASK)) { \
+        warn("aligning misaligned vector. EA=%08x", (EA)); \
+    }
+#define fSCATTER_INIT(REGION_START, LENGTH, ELEMENT_SIZE) \
+    mem_vector_scatter_init(env, slot, REGION_START, LENGTH, ELEMENT_SIZE)
+#define fGATHER_INIT(REGION_START, LENGTH, ELEMENT_SIZE) \
+    mem_vector_gather_init(env, REGION_START, LENGTH, ELEMENT_SIZE)
+#define fSCATTER_FINISH(OP)
+#define fGATHER_FINISH()
+#define fLOG_SCATTER_OP(SIZE) \
+    do { \
+        env->vtcm_log.op = true; \
+        env->vtcm_log.op_size = SIZE; \
+    } while (0)
+#define fVLOG_VTCM_WORD_INCREMENT(EA, OFFSET, INC, IDX, ALIGNMENT, LEN) \
+    do { \
+        int log_byte = 0; \
+        target_ulong va = EA; \
+        target_ulong va_high = EA + LEN; \
+        for (int i0 = 0; i0 < 4; i0++) { \
+            log_byte = (va + i0) <= va_high; \
+            LOG_VTCM_BYTE(va + i0, log_byte, INC.ub[4 * IDX + i0], \
+                          4 * IDX + i0); \
+        } \
+    } while (0)
+#define fVLOG_VTCM_HALFWORD_INCREMENT(EA, OFFSET, INC, IDX, ALIGNMENT, LEN) \
+    do { \
+        int log_byte = 0; \
+        target_ulong va = EA; \
+        target_ulong va_high = EA + LEN; \
+        for (int i0 = 0; i0 < 2; i0++) { \
+            log_byte = (va + i0) <= va_high; \
+            LOG_VTCM_BYTE(va + i0, log_byte, INC.ub[2 * IDX + i0], \
+                          2 * IDX + i0); \
+        } \
+    } while (0)
+
+#define fVLOG_VTCM_HALFWORD_INCREMENT_DV(EA, OFFSET, INC, IDX, IDX2, IDX_H, \
+                                         ALIGNMENT, LEN) \
+    do { \
+        int log_byte = 0; \
+        target_ulong va = EA; \
+        target_ulong va_high = EA + LEN; \
+        for (int i0 = 0; i0 < 2; i0++) { \
+            log_byte = (va + i0) <= va_high; \
+            LOG_VTCM_BYTE(va + i0, log_byte, INC.ub[2 * IDX + i0], \
+                          2 * IDX + i0); \
+        } \
+    } while (0)
+
+/* NOTE - Will this always be tmp_VRegs[0]? */
+#define GATHER_FUNCTION(EA, OFFSET, IDX, LEN, ELEMENT_SIZE, BANK_IDX, QVAL) \
+    do { \
+        int i0; \
+        target_ulong va = EA; \
+        target_ulong va_high = EA + LEN; \
+        uintptr_t ra = GETPC(); \
+        int log_bank = 0; \
+        int log_byte = 0; \
+        for (i0 = 0; i0 < ELEMENT_SIZE; i0++) { \
+            log_byte = ((va + i0) <= va_high) && QVAL; \
+            log_bank |= (log_byte << i0); \
+            uint8_t B; \
+            B = cpu_ldub_data_ra(env, EA + i0, ra); \
+            env->tmp_VRegs[0].ub[ELEMENT_SIZE * IDX + i0] = B; \
+            LOG_VTCM_BYTE(va + i0, log_byte, B, ELEMENT_SIZE * IDX + i0); \
+        } \
+    } while (0)
+#define fVLOG_VTCM_GATHER_WORD(EA, OFFSET, IDX, LEN) \
+    do { \
+        GATHER_FUNCTION(EA, OFFSET, IDX, LEN, 4, IDX, 1); \
+    } while (0)
+#define fVLOG_VTCM_GATHER_HALFWORD(EA, OFFSET, IDX, LEN) \
+    do { \
+        GATHER_FUNCTION(EA, OFFSET, IDX, LEN, 2, IDX, 1); \
+    } while (0)
+#define fVLOG_VTCM_GATHER_HALFWORD_DV(EA, OFFSET, IDX, IDX2, IDX_H, LEN) \
+    do { \
+        GATHER_FUNCTION(EA, OFFSET, IDX, LEN, 2, (2 * IDX2 + IDX_H), 1); \
+    } while (0)
+#define fVLOG_VTCM_GATHER_WORDQ(EA, OFFSET, IDX, Q, LEN) \
+    do { \
+        GATHER_FUNCTION(EA, OFFSET, IDX, LEN, 4, IDX, \
+                        fGETQBIT(QsV, 4 * IDX + i0)); \
+    } while (0)
+#define fVLOG_VTCM_GATHER_HALFWORDQ(EA, OFFSET, IDX, Q, LEN) \
+    do { \
+        GATHER_FUNCTION(EA, OFFSET, IDX, LEN, 2, IDX, \
+                        fGETQBIT(QsV, 2 * IDX + i0)); \
+    } while (0)
+#define fVLOG_VTCM_GATHER_HALFWORDQ_DV(EA, OFFSET, IDX, IDX2, IDX_H, Q, LEN) \
+    do { \
+        GATHER_FUNCTION(EA, OFFSET, IDX, LEN, 2, (2 * IDX2 + IDX_H), \
+                        fGETQBIT(QsV, 2 * IDX + i0)); \
+    } while (0)
+#define SCATTER_OP_WRITE_TO_MEM(TYPE) \
+    do { \
+        uintptr_t ra = GETPC(); \
+        for (int i = 0; i < env->vtcm_log.size; i += sizeof(TYPE)) { \
+            if (test_bit(i, env->vtcm_log.mask)) { \
+                TYPE dst = 0; \
+                TYPE inc = 0; \
+                for (int j = 0; j < sizeof(TYPE); j++) { \
+                    uint8_t val; \
+                    val = cpu_ldub_data_ra(env, env->vtcm_log.va[i + j], ra); \
+                    dst |= val << (8 * j); \
+                    inc |= env->vtcm_log.data.ub[j + i] << (8 * j); \
+                    clear_bit(j + i, env->vtcm_log.mask); \
+                    env->vtcm_log.data.ub[j + i] = 0; \
+                } \
+                dst += inc; \
+                for (int j = 0; j < sizeof(TYPE); j++) { \
+                    cpu_stb_data_ra(env, env->vtcm_log.va[i + j], \
+                                    (dst >> (8 * j)) & 0xFF, ra); \
+                } \
+            } \
+        } \
+    } while (0)
+#define SCATTER_OP_PROBE_MEM(TYPE, MMU_IDX, RETADDR) \
+    do { \
+        for (int i = 0; i < env->vtcm_log.size; i += sizeof(TYPE)) { \
+            if (test_bit(i, env->vtcm_log.mask)) { \
+                for (int j = 0; j < sizeof(TYPE); j++) { \
+                    probe_read(env, env->vtcm_log.va[i + j], 1, \
+                               MMU_IDX, RETADDR); \
+                    probe_write(env, env->vtcm_log.va[i + j], 1, \
+                                MMU_IDX, RETADDR); \
+                } \
+            } \
+        } \
+    } while (0)
+#define SCATTER_FUNCTION(EA, OFFSET, IDX, LEN, ELEM_SIZE, BANK_IDX, QVAL, IN) \
+    do { \
+        int i0; \
+        target_ulong va = EA; \
+        target_ulong va_high = EA + LEN; \
+        int log_bank = 0; \
+        int log_byte = 0; \
+        for (i0 = 0; i0 < ELEM_SIZE; i0++) { \
+            log_byte = ((va + i0) <= va_high) && QVAL; \
+            log_bank |= (log_byte << i0); \
+            LOG_VTCM_BYTE(va + i0, log_byte, IN.ub[ELEM_SIZE * IDX + i0], \
+                          ELEM_SIZE * IDX + i0); \
+        } \
+    } while (0)
+#define fVLOG_VTCM_HALFWORD(EA, OFFSET, IN, IDX, LEN) \
+    do { \
+        SCATTER_FUNCTION(EA, OFFSET, IDX, LEN, 2, IDX, 1, IN); \
+    } while (0)
+#define fVLOG_VTCM_WORD(EA, OFFSET, IN, IDX, LEN) \
+    do { \
+        SCATTER_FUNCTION(EA, OFFSET, IDX, LEN, 4, IDX, 1, IN); \
+    } while (0)
+#define fVLOG_VTCM_HALFWORDQ(EA, OFFSET, IN, IDX, Q, LEN) \
+    do { \
+        SCATTER_FUNCTION(EA, OFFSET, IDX, LEN, 2, IDX, \
+                         fGETQBIT(QsV, 2 * IDX + i0), IN); \
+    } while (0)
+#define fVLOG_VTCM_WORDQ(EA, OFFSET, IN, IDX, Q, LEN) \
+    do { \
+        SCATTER_FUNCTION(EA, OFFSET, IDX, LEN, 4, IDX, \
+                         fGETQBIT(QsV, 4 * IDX + i0), IN); \
+    } while (0)
+#define fVLOG_VTCM_HALFWORD_DV(EA, OFFSET, IN, IDX, IDX2, IDX_H, LEN) \
+    do { \
+        SCATTER_FUNCTION(EA, OFFSET, IDX, LEN, 2, \
+                         (2 * IDX2 + IDX_H), 1, IN); \
+    } while (0)
+#define fVLOG_VTCM_HALFWORDQ_DV(EA, OFFSET, IN, IDX, Q, IDX2, IDX_H, LEN) \
+    do { \
+        SCATTER_FUNCTION(EA, OFFSET, IDX, LEN, 2, (2 * IDX2 + IDX_H), \
+                         fGETQBIT(QsV, 2 * IDX + i0), IN); \
+    } while (0)
+#define fSTORERELEASE(EA, TYPE) \
+    do { \
+        fV_AL_CHECK(EA, fVECSIZE() - 1); \
+    } while (0)
+#ifdef QEMU_GENERATE
+#define fLOADMMV(EA, DST) gen_vreg_load(ctx, DST##_off, EA, true)
+#endif
+#ifdef QEMU_GENERATE
+#define fLOADMMVU(EA, DST) gen_vreg_load(ctx, DST##_off, EA, false)
+#endif
+#ifdef QEMU_GENERATE
+#define fSTOREMMV(EA, SRC) \
+    gen_vreg_store(ctx, insn, pkt, EA, SRC##_off, insn->slot, true)
+#endif
+#ifdef QEMU_GENERATE
+#define fSTOREMMVQ(EA, SRC, MASK) \
+    gen_vreg_masked_store(ctx, EA, SRC##_off, MASK##_off, insn->slot, false)
+#endif
+#ifdef QEMU_GENERATE
+#define fSTOREMMVNQ(EA, SRC, MASK) \
+    gen_vreg_masked_store(ctx, EA, SRC##_off, MASK##_off, insn->slot, true)
+#endif
+#ifdef QEMU_GENERATE
+#define fSTOREMMVU(EA, SRC) \
+    gen_vreg_store(ctx, insn, pkt, EA, SRC##_off, insn->slot, false)
+#endif
+#define fVFOREACH(WIDTH, VAR) for (VAR = 0; VAR < fVELEM(WIDTH); VAR++)
+#define fVARRAY_ELEMENT_ACCESS(ARRAY, TYPE, INDEX) \
+    ARRAY.v[(INDEX) / (fVECSIZE() / (sizeof(ARRAY.TYPE[0])))].TYPE[(INDEX) % \
+    (fVECSIZE() / (sizeof(ARRAY.TYPE[0])))]
+
+#define fVSATDW(U, V) fVSATW(((((long long)U) << 32) | fZXTN(32, 64, V)))
+#define fVASL_SATHI(U, V) fVSATW(((U) << 1) | ((V) >> 31))
+#define fVUADDSAT(WIDTH, U, V) \
+    fVSATUN(WIDTH, fZXTN(WIDTH, 2 * WIDTH, U) + fZXTN(WIDTH, 2 * WIDTH, V))
+#define fVSADDSAT(WIDTH, U, V) \
+    fVSATN(WIDTH, fSXTN(WIDTH, 2 * WIDTH, U) + fSXTN(WIDTH, 2 * WIDTH, V))
+#define fVUSUBSAT(WIDTH, U, V) \
+    fVSATUN(WIDTH, fZXTN(WIDTH, 2 * WIDTH, U) - fZXTN(WIDTH, 2 * WIDTH, V))
+#define fVSSUBSAT(WIDTH, U, V) \
+    fVSATN(WIDTH, fSXTN(WIDTH, 2 * WIDTH, U) - fSXTN(WIDTH, 2 * WIDTH, V))
+#define fVAVGU(WIDTH, U, V) \
+    ((fZXTN(WIDTH, 2 * WIDTH, U) + fZXTN(WIDTH, 2 * WIDTH, V)) >> 1)
+#define fVAVGURND(WIDTH, U, V) \
+    ((fZXTN(WIDTH, 2 * WIDTH, U) + fZXTN(WIDTH, 2 * WIDTH, V) + 1) >> 1)
+#define fVNAVGU(WIDTH, U, V) \
+    ((fZXTN(WIDTH, 2 * WIDTH, U) - fZXTN(WIDTH, 2 * WIDTH, V)) >> 1)
+#define fVNAVGURNDSAT(WIDTH, U, V) \
+    fVSATUN(WIDTH, ((fZXTN(WIDTH, 2 * WIDTH, U) - \
+                     fZXTN(WIDTH, 2 * WIDTH, V) + 1) >> 1))
+#define fVAVGS(WIDTH, U, V) \
+    ((fSXTN(WIDTH, 2 * WIDTH, U) + fSXTN(WIDTH, 2 * WIDTH, V)) >> 1)
+#define fVAVGSRND(WIDTH, U, V) \
+    ((fSXTN(WIDTH, 2 * WIDTH, U) + fSXTN(WIDTH, 2 * WIDTH, V) + 1) >> 1)
+#define fVNAVGS(WIDTH, U, V) \
+    ((fSXTN(WIDTH, 2 * WIDTH, U) - fSXTN(WIDTH, 2 * WIDTH, V)) >> 1)
+#define fVNAVGSRND(WIDTH, U, V) \
+    ((fSXTN(WIDTH, 2 * WIDTH, U) - fSXTN(WIDTH, 2 * WIDTH, V) + 1) >> 1)
+#define fVNAVGSRNDSAT(WIDTH, U, V) \
+    fVSATN(WIDTH, ((fSXTN(WIDTH, 2 * WIDTH, U) - \
+                    fSXTN(WIDTH, 2 * WIDTH, V) + 1) >> 1))
+#define fVNOROUND(VAL, SHAMT) VAL
+#define fVNOSAT(VAL) VAL
+#define fVROUND(VAL, SHAMT) \
+    ((VAL) + (((SHAMT) > 0) ? (1LL << ((SHAMT) - 1)) : 0))
+#define fCARRY_FROM_ADD32(A, B, C) \
+    (((fZXTN(32, 64, A) + fZXTN(32, 64, B) + C) >> 32) & 1)
+#define fUARCH_NOTE_PUMP_4X()
+#define fUARCH_NOTE_PUMP_2X()
+
+#define IV1DEAD()
+#endif
-- 
2.7.4



* [PATCH v4 06/30] Hexagon HVX (target/hexagon) import macro definitions
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (4 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 05/30] Hexagon HVX (target/hexagon) macros Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 07/30] Hexagon HVX (target/hexagon) semantics generator Taylor Simpson
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Imported from the Hexagon architecture library
    imported/allext_macros.def       Top level macro include for all extensions
    imported/macros.def              Scalar core macros (some HVX here)
    imported/mmvec/macros.def        HVX macro definitions
The macro definition files specify instruction attributes that are applied
to each instruction that references the macro.

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
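Note for readers (illustration only, not part of the patch): the .def files
follow the usual X-macro pattern, where the consumer decides what DEF_MACRO
expands to before including the file. The sketch below is a simplified,
self-contained approximation. MACRO_LIST, AS_TABLE_ENTRY, macro_table and
macro_attribs are hypothetical names, the two-argument DEF_MACRO shape is a
simplification of the three-argument form used in the imported files, and
the way the real generator consumes the attributes is not shown here. The
attribute strings are copied from fSCATTER_INIT, fGATHER_INIT and fVBYTES in
imported/mmvec/macros.def.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the contents of a .def file (normally pulled in by #include). */
#define MACRO_LIST(DEF_MACRO) \
    DEF_MACRO(fSCATTER_INIT, "A_STORE,A_MEMLIKE,A_RESTRICT_SLOT0ONLY") \
    DEF_MACRO(fGATHER_INIT,  "A_LOAD,A_MEMLIKE,A_RESTRICT_SLOT1ONLY") \
    DEF_MACRO(fVBYTES,       "")

struct macro_info {
    const char *name;      /* macro name as a string */
    const char *attribs;   /* comma-separated attribute list */
};

/* One possible expansion: turn every entry into a table row. */
#define AS_TABLE_ENTRY(NAME, ATTRIBS) { #NAME, ATTRIBS },
static const struct macro_info macro_table[] = {
    MACRO_LIST(AS_TABLE_ENTRY)
};

/* Look up the attributes attached to a macro; NULL if it is unknown. */
static const char *macro_attribs(const char *name)
{
    size_t count = sizeof(macro_table) / sizeof(macro_table[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(macro_table[i].name, name) == 0) {
            return macro_table[i].attribs;
        }
    }
    return NULL;
}
```

An instruction whose semantics reference fGATHER_INIT would, under this
sketch, pick up the attribute list returned by macro_attribs("fGATHER_INIT").
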
 target/hexagon/imported/allext_macros.def |  25 +
 target/hexagon/imported/macros.def        |  88 ++++
 target/hexagon/imported/mmvec/macros.def  | 842 ++++++++++++++++++++++++++++++
 3 files changed, 955 insertions(+)
 create mode 100644 target/hexagon/imported/allext_macros.def
 create mode 100755 target/hexagon/imported/mmvec/macros.def

diff --git a/target/hexagon/imported/allext_macros.def b/target/hexagon/imported/allext_macros.def
new file mode 100644
index 0000000..9c91199
--- /dev/null
+++ b/target/hexagon/imported/allext_macros.def
@@ -0,0 +1,25 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Top level file for all instruction set extensions
+ */
+#define EXTNAME mmvec
+#define EXTSTR "mmvec"
+#include "mmvec/macros.def"
+#undef EXTNAME
+#undef EXTSTR
diff --git a/target/hexagon/imported/macros.def b/target/hexagon/imported/macros.def
index 32ed3bf..e23f915 100755
--- a/target/hexagon/imported/macros.def
+++ b/target/hexagon/imported/macros.def
@@ -177,6 +177,12 @@ DEF_MACRO(
 )
 
 DEF_MACRO(
+    fVSATUVALN,
+    ({ ((VAL) < 0) ? 0 : ((1LL<<(N))-1);}),
+    ()
+)
+
+DEF_MACRO(
     fSATUVALN,
     ({fSET_OVERFLOW(); ((VAL) < 0) ? 0 : ((1LL<<(N))-1);}),
     ()
@@ -189,6 +195,12 @@ DEF_MACRO(
 )
 
 DEF_MACRO(
+    fVSATVALN,
+    ({((VAL) < 0) ? (-(1LL<<((N)-1))) : ((1LL<<((N)-1))-1);}),
+    ()
+)
+
+DEF_MACRO(
     fZXTN, /* macro name */
     ((VAL) & ((1LL<<(N))-1)),
     /* attribs */
@@ -205,6 +217,11 @@ DEF_MACRO(
     ((fSXTN(N,64,VAL) == (VAL)) ? (VAL) : fSATVALN(N,VAL)),
     ()
 )
+DEF_MACRO(
+    fVSATN,
+    ((fSXTN(N,64,VAL) == (VAL)) ? (VAL) : fVSATVALN(N,VAL)),
+    ()
+)
 
 DEF_MACRO(
     fADDSAT64,
@@ -235,6 +252,12 @@ DEF_MACRO(
 )
 
 DEF_MACRO(
+    fVSATUN,
+    ((fZXTN(N,64,VAL) == (VAL)) ? (VAL) : fVSATUVALN(N,VAL)),
+    ()
+)
+
+DEF_MACRO(
     fSATUN,
     ((fZXTN(N,64,VAL) == (VAL)) ? (VAL) : fSATUVALN(N,VAL)),
     ()
@@ -254,6 +277,19 @@ DEF_MACRO(
 )
 
 DEF_MACRO(
+    fVSATH,
+    (fVSATN(16,VAL)),
+    ()
+)
+
+DEF_MACRO(
+    fVSATUH,
+    (fVSATUN(16,VAL)),
+    ()
+)
+
+
+DEF_MACRO(
     fSATUB,
     (fSATUN(8,VAL)),
     ()
@@ -265,6 +301,20 @@ DEF_MACRO(
 )
 
 
+DEF_MACRO(
+    fVSATUB,
+    (fVSATUN(8,VAL)),
+    ()
+)
+DEF_MACRO(
+    fVSATB,
+    (fVSATN(8,VAL)),
+    ()
+)
+
+
+
+
 /*************************************/
 /* immediate extension               */
 /*************************************/
@@ -557,6 +607,18 @@ DEF_MACRO(
 )
 
 DEF_MACRO(
+    fCAST2_2s, /* macro name */
+    ((size2s_t)(A)),
+    /* optional attributes */
+)
+
+DEF_MACRO(
+    fCAST2_2u, /* macro name */
+    ((size2u_t)(A)),
+    /* optional attributes */
+)
+
+DEF_MACRO(
     fCAST4_4s, /* macro name */
     ((size4s_t)(A)),
     /* optional attributes */
@@ -876,6 +938,11 @@ DEF_MACRO(
     (((size8s_t)(A))<<N),
     /* optional attributes */
 )
+DEF_MACRO(
+    fVSATW, /* saturating to 32-bits*/
+    fVSATN(32,((long long)A)),
+    ()
+)
 
 DEF_MACRO(
     fSATW, /* saturating to 32-bits*/
@@ -884,6 +951,12 @@ DEF_MACRO(
 )
 
 DEF_MACRO(
+    fVSAT, /* saturating to 32-bits*/
+    fVSATN(32,(A)),
+    ()
+)
+
+DEF_MACRO(
     fSAT, /* saturating to 32-bits*/
     fSATN(32,(A)),
     ()
@@ -1389,6 +1462,11 @@ DEF_MACRO(fSETBITS,
 /*************************************/
 /* Used for parity, etc........      */
 /*************************************/
+DEF_MACRO(fCOUNTONES_2,
+    count_ones_2(VAL),
+    /* nothing */
+)
+
 DEF_MACRO(fCOUNTONES_4,
     count_ones_4(VAL),
     /* nothing */
@@ -1419,6 +1497,11 @@ DEF_MACRO(fCL1_4,
     /* nothing */
 )
 
+DEF_MACRO(fCL1_2,
+    count_leading_ones_2(VAL),
+    /* nothing */
+)
+
 DEF_MACRO(fINTERLEAVE,
     interleave(ODD,EVEN),
     /* nothing */
@@ -1576,3 +1659,8 @@ DEF_MACRO(fBRANCH_SPECULATE_STALL,
     },
     ()
 )
+
+DEF_MACRO(IV1DEAD,
+    ,
+    ()
+)
diff --git a/target/hexagon/imported/mmvec/macros.def b/target/hexagon/imported/mmvec/macros.def
new file mode 100755
index 0000000..7e5438a
--- /dev/null
+++ b/target/hexagon/imported/mmvec/macros.def
@@ -0,0 +1,842 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+DEF_MACRO(fDUMPQ,
+	do {
+		printf(STR ":" #REG ": 0x%016llx\n",REG.ud[0]);
+	} while (0),
+	()
+)
+
+DEF_MACRO(fUSE_LOOKUP_ADDRESS_BY_REV,
+	PROC->arch_proc_options->mmvec_use_full_va_for_lookup,
+	()
+)
+
+DEF_MACRO(fUSE_LOOKUP_ADDRESS,
+	1,
+	()
+)
+
+DEF_MACRO(fNOTQ,
+	({mmqreg_t _ret = {0}; int _i_; for (_i_ = 0; _i_ < fVECSIZE()/64; _i_++) _ret.ud[_i_] = ~VAL.ud[_i_]; _ret;}),
+	()
+)
+
+DEF_MACRO(fGETQBITS,
+	((MASK) & (REG.w[(BITNO)>>5] >> ((BITNO) & 0x1f))),
+	()
+)
+
+DEF_MACRO(fGETQBIT,
+	fGETQBITS(REG,1,1,BITNO),
+	()
+)
+
+DEF_MACRO(fGENMASKW,
+	(((fGETQBIT(QREG,(IDX*4+0)) ? 0xFF : 0x0) << 0)
+	|((fGETQBIT(QREG,(IDX*4+1)) ? 0xFF : 0x0) << 8)
+	|((fGETQBIT(QREG,(IDX*4+2)) ? 0xFF : 0x0) << 16)
+	|((fGETQBIT(QREG,(IDX*4+3)) ? 0xFF : 0x0) << 24)),
+	()
+)
+DEF_MACRO(fGET10BIT,
+	{
+		COE = (((((fGETUBYTE(3,VAL) >> (2 * POS)) & 3) << 8) | fGETUBYTE(POS,VAL)) << 6);
+		COE >>= 6;
+	},
+	()
+)
+
+DEF_MACRO(fVMAX,
+	(X>Y) ? X : Y,
+	()
+)
+
+
+DEF_MACRO(fGETNIBBLE,
+    ( fSXTN(4,8,(SRC >> (4*IDX)) & 0xF) ),
+    ()
+)
+
+DEF_MACRO(fGETCRUMB,
+    ( fSXTN(2,8,(SRC >> (2*IDX)) & 0x3) ),
+    ()
+)
+
+DEF_MACRO(fGETCRUMB_SYMMETRIC,
+    ( (fGETCRUMB(IDX,SRC)>=0 ? (2-fGETCRUMB(IDX,SRC)) : fGETCRUMB(IDX,SRC) ) ),
+    ()
+)
+
+#define ZERO_OFFSET_2B +
+
+DEF_MACRO(fGENMASKH,
+	(((fGETQBIT(QREG,(IDX*2+0)) ? 0xFF : 0x0) << 0)
+	|((fGETQBIT(QREG,(IDX*2+1)) ? 0xFF : 0x0) << 8)),
+	()
+)
+
+DEF_MACRO(fGETMASKW,
+	(VREG.w[IDX] & fGENMASKW((QREG),IDX)),
+	()
+)
+
+DEF_MACRO(fGETMASKH,
+	(VREG.h[IDX] & fGENMASKH((QREG),IDX)),
+	()
+)
+
+DEF_MACRO(fCONDMASK8,
+	(fGETQBIT(QREG,IDX) ? (YESVAL) : (NOVAL)),
+	()
+)
+
+DEF_MACRO(fCONDMASK16,
+	((fGENMASKH(QREG,IDX) & (YESVAL)) | (fGENMASKH(fNOTQ(QREG),IDX) & (NOVAL))),
+	()
+)
+
+DEF_MACRO(fCONDMASK32,
+	((fGENMASKW(QREG,IDX) & (YESVAL)) | (fGENMASKW(fNOTQ(QREG),IDX) & (NOVAL))),
+	()
+)
+
+
+DEF_MACRO(fSETQBITS,
+	do {
+		size4u_t __TMP = (VAL);
+		REG.w[(BITNO)>>5] &= ~((MASK) << ((BITNO) & 0x1f));
+		REG.w[(BITNO)>>5] |= (((__TMP) & (MASK)) << ((BITNO) & 0x1f));
+	} while (0),
+	()
+)
+
+DEF_MACRO(fSETQBIT,
+	fSETQBITS(REG,1,1,BITNO,VAL),
+	()
+)
+
+DEF_MACRO(fVBYTES,
+	(fVECSIZE()),
+	()
+)
+
+DEF_MACRO(fVHALVES,
+	(fVECSIZE()/2),
+	()
+)
+
+DEF_MACRO(fVWORDS,
+	(fVECSIZE()/4),
+	()
+)
+
+DEF_MACRO(fVDWORDS,
+	(fVECSIZE()/8),
+	()
+)
+
+DEF_MACRO(fVALIGN,
+    ( ADDR = ADDR & ~(LOG2_ALIGNMENT-1)),
+    ()
+)
+
+DEF_MACRO(fVLASTBYTE,
+    ( ADDR = ADDR | (LOG2_ALIGNMENT-1)),
+    ()
+)
+
+
+DEF_MACRO(fVELEM,
+    ((fVECSIZE()*8)/WIDTH),
+    ()
+)
+
+DEF_MACRO(fVECLOGSIZE,
+    (mmvec_current_veclogsize(thread)),
+    ()
+)
+
+DEF_MACRO(fVECSIZE,
+    (1<<fVECLOGSIZE()),
+    ()
+)
+
+DEF_MACRO(fSWAPB,
+    {
+		size1u_t tmp = A;
+		A = B;
+		B = tmp;
+	},
+    /* NOTHING */
+)
+
+DEF_MACRO(
+	fVZERO,
+	mmvec_zero_vector(),
+	()
+)
+
+DEF_MACRO(
+    fNEWVREG,
+    ((THREAD2STRUCT->VRegs_updated & (((VRegMask)1)<<VNUM)) ? THREAD2STRUCT->future_VRegs[VNUM] : mmvec_zero_vector()),
+    (A_DOTNEWVALUE,A_RESTRICT_SLOT0ONLY)
+)
+
+DEF_MACRO(
+	fV_AL_CHECK,
+	if ((EA) & (MASK)) {
+		warn("aligning misaligned vector. PC=%08x EA=%08x",thread->Regs[REG_PC],(EA));
+	},
+	()
+)
+DEF_MACRO(fSCATTER_INIT,
+    {
+    mem_vector_scatter_init(thread, insn,   REGION_START, LENGTH, ELEMENT_SIZE);
+	if (EXCEPTION_DETECTED) return;
+    },
+    (A_STORE,A_MEMLIKE,A_RESTRICT_SLOT0ONLY)
+)
+
+DEF_MACRO(fGATHER_INIT,
+    {
+    mem_vector_gather_init(thread, insn,   REGION_START, LENGTH, ELEMENT_SIZE);
+	if (EXCEPTION_DETECTED) return;
+    },
+    (A_LOAD,A_MEMLIKE,A_RESTRICT_SLOT1ONLY)
+)
+
+DEF_MACRO(fSCATTER_FINISH,
+    {
+	if (EXCEPTION_DETECTED) return;
+    mem_vector_scatter_finish(thread, insn, OP);
+    },
+    ()
+)
+
+DEF_MACRO(fGATHER_FINISH,
+    {
+	if (EXCEPTION_DETECTED) return;
+    mem_vector_gather_finish(thread, insn);
+    },
+    ()
+)
+
+
+DEF_MACRO(CHECK_VTCM_PAGE,
+     {
+        int slot = insn->slot;
+        paddr_t pa = thread->mem_access[slot].paddr+OFFSET;
+        pa = pa & ~(ALIGNMENT-1);
+        FLAG = (pa < (thread->mem_access[slot].paddr+LENGTH));
+     },
+    ()
+)
+DEF_MACRO(COUNT_OUT_OF_BOUNDS,
+     {
+        if (!FLAG)
+        {
+               THREAD2STRUCT->vtcm_log.oob_access += SIZE;
+               warn("Scatter/Gather out of bounds of region");
+        }
+     },
+    ()
+)
+
+DEF_MACRO(fLOG_SCATTER_OP,
+    {
+        // Log the size and indicate that the extension ext.c file needs to increment right before memory write
+        THREAD2STRUCT->vtcm_log.op = 1;
+        THREAD2STRUCT->vtcm_log.op_size = SIZE;
+    },
+    ()
+)
+
+
+
+DEF_MACRO(fVLOG_VTCM_WORD_INCREMENT,
+    {
+        int slot = insn->slot;
+        int log_bank = 0;
+        int log_byte =0;
+        paddr_t pa = thread->mem_access[slot].paddr+(OFFSET & ~(ALIGNMENT-1));
+        paddr_t pa_high = thread->mem_access[slot].paddr+LEN;
+        for(int i0 = 0; i0 < 4; i0++)
+        {
+            log_byte =  ((OFFSET>=0)&&((pa+i0)<=pa_high));
+            log_bank |= (log_byte<<i0);
+            LOG_VTCM_BYTE(pa+i0,log_byte,INC.ub[4*IDX+i0],4*IDX+i0);
+        }
+        { LOG_VTCM_BANK(pa, log_bank, IDX); }
+    },
+    ()
+)
+
+DEF_MACRO(fVLOG_VTCM_HALFWORD_INCREMENT,
+    {
+        int slot = insn->slot;
+        int log_bank = 0;
+        int log_byte = 0;
+        paddr_t pa = thread->mem_access[slot].paddr+(OFFSET & ~(ALIGNMENT-1));
+        paddr_t pa_high = thread->mem_access[slot].paddr+LEN;
+        for(int i0 = 0; i0 < 2; i0++) {
+            log_byte =  ((OFFSET>=0)&&((pa+i0)<=pa_high));
+            log_bank |= (log_byte<<i0);
+            LOG_VTCM_BYTE(pa+i0,log_byte,INC.ub[2*IDX+i0],2*IDX+i0);
+        }
+        { LOG_VTCM_BANK(pa, log_bank,IDX); }
+    },
+    ()
+)
+
+DEF_MACRO(fVLOG_VTCM_HALFWORD_INCREMENT_DV,
+    {
+        int slot = insn->slot;
+        int log_bank = 0;
+        int log_byte = 0;
+        paddr_t pa = thread->mem_access[slot].paddr+(OFFSET & ~(ALIGNMENT-1));
+        paddr_t pa_high = thread->mem_access[slot].paddr+LEN;
+        for(int i0 = 0; i0 < 2; i0++) {
+            log_byte =  ((OFFSET>=0)&&((pa+i0)<=pa_high));
+            log_bank |= (log_byte<<i0);
+            LOG_VTCM_BYTE(pa+i0,log_byte,INC.ub[2*IDX+i0],2*IDX+i0);
+        }
+        { LOG_VTCM_BANK(pa, log_bank,(2*IDX2+IDX_H));}
+    },
+    ()
+)
+
+
+
+DEF_MACRO(GATHER_FUNCTION,
+{
+        int slot = insn->slot;
+        int i0;
+        paddr_t pa = thread->mem_access[slot].paddr+OFFSET;
+        paddr_t pa_high = thread->mem_access[slot].paddr+LEN;
+        int log_bank = 0;
+        int log_byte = 0;
+        for(i0 = 0; i0 < ELEMENT_SIZE; i0++)
+        {
+            log_byte =  ((OFFSET>=0)&&((pa+i0)<=pa_high)) && QVAL;
+            log_bank |= (log_byte<<i0);
+            size1u_t B  = sim_mem_read1(thread->system_ptr, thread->threadId, thread->mem_access[slot].paddr+OFFSET+i0);
+            THREAD2STRUCT->tmp_VRegs[0].ub[ELEMENT_SIZE*IDX+i0] = B;
+            LOG_VTCM_BYTE(pa+i0,log_byte,B,ELEMENT_SIZE*IDX+i0);
+        }
+        LOG_VTCM_BANK(pa, log_bank,BANK_IDX);
+},
+()
+)
+
+
+
+DEF_MACRO(fVLOG_VTCM_GATHER_WORD,
+    {
+		GATHER_FUNCTION(EA,OFFSET,IDX, LEN, 4, IDX, 1);
+    },
+    ()
+)
+DEF_MACRO(fVLOG_VTCM_GATHER_HALFWORD,
+    {
+		GATHER_FUNCTION(EA,OFFSET,IDX, LEN, 2, IDX, 1);
+    },
+    ()
+)
+DEF_MACRO(fVLOG_VTCM_GATHER_HALFWORD_DV,
+    {
+		GATHER_FUNCTION(EA,OFFSET,IDX, LEN, 2, (2*IDX2+IDX_H), 1);
+    },
+    ()
+)
+DEF_MACRO(fVLOG_VTCM_GATHER_WORDQ,
+    {
+		GATHER_FUNCTION(EA,OFFSET,IDX, LEN, 4, IDX, fGETQBIT(QsV,4*IDX+i0));
+    },
+    ()
+)
+DEF_MACRO(fVLOG_VTCM_GATHER_HALFWORDQ,
+    {
+		GATHER_FUNCTION(EA,OFFSET,IDX, LEN, 2, IDX, fGETQBIT(QsV,2*IDX+i0));
+    },
+    ()
+)
+
+DEF_MACRO(fVLOG_VTCM_GATHER_HALFWORDQ_DV,
+    {
+		GATHER_FUNCTION(EA,OFFSET,IDX, LEN, 2, (2*IDX2+IDX_H), fGETQBIT(QsV,2*IDX+i0));
+    },
+    ()
+)
+
+
+DEF_MACRO(DEBUG_LOG_ADDR,
+    {
+
+        if (thread->processor_ptr->arch_proc_options->mmvec_network_addr_log2)
+        {
+
+            int slot = insn->slot;
+            paddr_t pa = thread->mem_access[slot].paddr+OFFSET;
+        }
+    },
+    ()
+)
+
+
+
+
+
+
+
+DEF_MACRO(SCATTER_OP_WRITE_TO_MEM,
+    {
+        for (int i = 0; i < mmvecx->vtcm_log.size; i+=sizeof(TYPE))
+        {
+            if ( mmvecx->vtcm_log.mask.ub[i] != 0) {
+                TYPE dst = 0;
+                TYPE inc = 0;
+                for(int j = 0; j < sizeof(TYPE); j++) {
+                    dst |= (sim_mem_read1(thread->system_ptr, thread->threadId, mmvecx->vtcm_log.pa[i+j]) << (8*j));
+                    inc |= mmvecx->vtcm_log.data.ub[j+i] << (8*j);
+
+                    mmvecx->vtcm_log.mask.ub[j+i] = 0;
+                    mmvecx->vtcm_log.data.ub[j+i] = 0;
+                    mmvecx->vtcm_log.offsets.ub[j+i] = 0;
+                }
+                dst += inc;
+                for(int j = 0; j < sizeof(TYPE); j++) {
+                    sim_mem_write1(thread->system_ptr,thread->threadId, mmvecx->vtcm_log.pa[i+j], (dst >> (8*j))& 0xFF );
+                }
+        }
+
+    }
+    },
+    ()
+)
+
+DEF_MACRO(SCATTER_FUNCTION,
+{
+        int slot = insn->slot;
+        int i0;
+        paddr_t pa = thread->mem_access[slot].paddr+OFFSET;
+        paddr_t pa_high = thread->mem_access[slot].paddr+LEN;
+        int log_bank = 0;
+        int log_byte = 0;
+        for(i0 = 0; i0 < ELEMENT_SIZE; i0++) {
+            log_byte = ((OFFSET>=0)&&((pa+i0)<=pa_high)) && QVAL;
+            log_bank |= (log_byte<<i0);
+            LOG_VTCM_BYTE(pa+i0,log_byte,IN.ub[ELEMENT_SIZE*IDX+i0],ELEMENT_SIZE*IDX+i0);
+        }
+        LOG_VTCM_BANK(pa, log_bank,BANK_IDX);
+
+},
+()
+)
+
+DEF_MACRO(fVLOG_VTCM_HALFWORD,
+    {
+		SCATTER_FUNCTION (EA,OFFSET,IDX, LEN, 2, IDX, 1, IN);
+    },
+    ()
+)
+DEF_MACRO(fVLOG_VTCM_WORD,
+    {
+		SCATTER_FUNCTION (EA,OFFSET,IDX, LEN, 4, IDX, 1, IN);
+    },
+    ()
+)
+
+DEF_MACRO(fVLOG_VTCM_HALFWORDQ,
+    {
+		SCATTER_FUNCTION (EA,OFFSET,IDX, LEN, 2, IDX, fGETQBIT(QsV,2*IDX+i0), IN);
+    },
+    ()
+)
+DEF_MACRO(fVLOG_VTCM_WORDQ,
+    {
+		SCATTER_FUNCTION (EA,OFFSET,IDX, LEN, 4, IDX, fGETQBIT(QsV,4*IDX+i0), IN);
+    },
+    ()
+)
+
+
+
+
+
+DEF_MACRO(fVLOG_VTCM_HALFWORD_DV,
+    {
+		SCATTER_FUNCTION (EA,OFFSET,IDX, LEN, 2, (2*IDX2+IDX_H), 1, IN);
+    },
+    ()
+)
+
+DEF_MACRO(fVLOG_VTCM_HALFWORDQ_DV,
+    {
+		SCATTER_FUNCTION (EA,OFFSET,IDX, LEN, 2, (2*IDX2+IDX_H), fGETQBIT(QsV,2*IDX+i0), IN);
+    },
+    ()
+)
+
+
+
+
+
+
+DEF_MACRO(fSTORERELEASE,
+    {
+        fV_AL_CHECK(EA,fVECSIZE()-1);
+
+        mem_store_release(thread, insn, fVECSIZE(), EA&~(fVECSIZE()-1), EA, TYPE, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    },
+	(A_STORE,A_MEMLIKE)
+)
+
+DEF_MACRO(fVFETCH_AL,
+    {
+    fV_AL_CHECK(EA,fVECSIZE()-1);
+    mem_fetch_vector(thread, insn, EA&~(fVECSIZE()-1), insn->slot, fVECSIZE());
+    },
+    (A_LOAD,A_MEMLIKE)
+)
+
+
+DEF_MACRO(fLOADMMV_AL,
+    {
+    fV_AL_CHECK(EA,ALIGNMENT-1);
+	thread->last_pkt->double_access_vec = 0;
+    mem_load_vector_oddva(thread, insn, EA&~(ALIGNMENT-1), EA, insn->slot, LEN, &DST.ub[0], LEN, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    },
+    (A_LOAD,A_MEMLIKE)
+)
+
+DEF_MACRO(fLOADMMV,
+	fLOADMMV_AL(EA,fVECSIZE(),fVECSIZE(),DST),
+	()
+)
+
+DEF_MACRO(fLOADMMVQ,
+	do {
+		int __i;
+		fLOADMMV_AL(EA,fVECSIZE(),fVECSIZE(),DST);
+		fVFOREACH(8,__i) if (!fGETQBIT(QVAL,__i)) DST.b[__i] = 0;
+	} while (0),
+	()
+)
+
+DEF_MACRO(fLOADMMVNQ,
+	do {
+		int __i;
+		fLOADMMV_AL(EA,fVECSIZE(),fVECSIZE(),DST);
+		fVFOREACH(8,__i) if (fGETQBIT(QVAL,__i)) DST.b[__i] = 0;
+	} while (0),
+	()
+)
+
+DEF_MACRO(fLOADMMVU_AL,
+    {
+    size4u_t size2 = (EA)&(ALIGNMENT-1);
+    size4u_t size1 = LEN-size2;
+	thread->last_pkt->double_access_vec = 1;
+    mem_load_vector_oddva(thread, insn, EA+size1, EA+fVECSIZE(), /* slot */ 1, size2, &DST.ub[size1], size2, fUSE_LOOKUP_ADDRESS());
+    mem_load_vector_oddva(thread, insn, EA, EA,/* slot */ 0, size1, &DST.ub[0], size1, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    },
+    (A_LOAD,A_MEMLIKE)
+)
+
+DEF_MACRO(fLOADMMVU,
+	{
+		/* if address happens to be aligned, only do aligned load */
+        thread->last_pkt->pkt_has_vtcm_access = 0;
+        thread->last_pkt->pkt_access_count = 0;
+		if ( (EA & (fVECSIZE()-1)) == 0) {
+            thread->last_pkt->pkt_has_vmemu_access = 0;
+			thread->last_pkt->double_access = 0;
+
+			fLOADMMV_AL(EA,fVECSIZE(),fVECSIZE(),DST);
+		} else {
+            thread->last_pkt->pkt_has_vmemu_access = 1;
+			thread->last_pkt->double_access = 1;
+
+			fLOADMMVU_AL(EA,fVECSIZE(),fVECSIZE(),DST);
+		}
+	},
+	()
+)
+
+DEF_MACRO(fSTOREMMV_AL,
+    {
+    fV_AL_CHECK(EA,ALIGNMENT-1);
+    mem_store_vector_oddva(thread, insn, EA&~(ALIGNMENT-1), EA, insn->slot, LEN, &SRC.ub[0], 0, 0, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    },
+    (A_STORE,A_MEMLIKE)
+)
+
+DEF_MACRO(fSTOREMMV,
+	fSTOREMMV_AL(EA,fVECSIZE(),fVECSIZE(),SRC),
+	()
+)
+
+DEF_MACRO(fSTOREMMVQ_AL,
+    do {
+	mmvector_t maskvec;
+	int i;
+	for (i = 0; i < fVECSIZE(); i++) maskvec.ub[i] = fGETQBIT(MASK,i);
+	mem_store_vector_oddva(thread, insn, EA&~(ALIGNMENT-1), EA, insn->slot, LEN, &SRC.ub[0], &maskvec.ub[0], 0, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    } while (0),
+    (A_STORE,A_MEMLIKE)
+)
+
+DEF_MACRO(fSTOREMMVQ,
+	fSTOREMMVQ_AL(EA,fVECSIZE(),fVECSIZE(),SRC,MASK),
+	()
+)
+
+DEF_MACRO(fSTOREMMVNQ_AL,
+    {
+	mmvector_t maskvec;
+	int i;
+	for (i = 0; i < fVECSIZE(); i++) maskvec.ub[i] = fGETQBIT(MASK,i);
+        fV_AL_CHECK(EA,ALIGNMENT-1);
+	mem_store_vector_oddva(thread, insn, EA&~(ALIGNMENT-1), EA, insn->slot, LEN, &SRC.ub[0], &maskvec.ub[0], 1, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    },
+    (A_STORE,A_MEMLIKE)
+)
+
+DEF_MACRO(fSTOREMMVNQ,
+	fSTOREMMVNQ_AL(EA,fVECSIZE(),fVECSIZE(),SRC,MASK),
+	()
+)
+
+DEF_MACRO(fSTOREMMVU_AL,
+    {
+    size4u_t size1 = ALIGNMENT-((EA)&(ALIGNMENT-1));
+    size4u_t size2;
+    if (size1>LEN) size1 = LEN;
+    size2 = LEN-size1;
+    mem_store_vector_oddva(thread, insn, EA+size1, EA+fVECSIZE(), /* slot */ 1, size2, &SRC.ub[size1], 0, 0, fUSE_LOOKUP_ADDRESS());
+    mem_store_vector_oddva(thread, insn, EA, EA, /* slot */ 0, size1, &SRC.ub[0], 0, 0, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    },
+    (A_STORE,A_MEMLIKE)
+)
+
+DEF_MACRO(fSTOREMMVU,
+	{
+        thread->last_pkt->pkt_has_vtcm_access = 0;
+        thread->last_pkt->pkt_access_count = 0;
+		if ( (EA & (fVECSIZE()-1)) == 0) {
+			thread->last_pkt->double_access = 0;
+			fSTOREMMV_AL(EA,fVECSIZE(),fVECSIZE(),SRC);
+		} else {
+			thread->last_pkt->double_access = 1;
+            thread->last_pkt->pkt_has_vmemu_access = 1;
+			fSTOREMMVU_AL(EA,fVECSIZE(),fVECSIZE(),SRC);
+		}
+	},
+	()
+)
+
+DEF_MACRO(fSTOREMMVQU_AL,
+    {
+	size4u_t size1 = ALIGNMENT-((EA)&(ALIGNMENT-1));
+	size4u_t size2;
+	mmvector_t maskvec;
+	int i;
+	for (i = 0; i < fVECSIZE(); i++) maskvec.ub[i] = fGETQBIT(MASK,i);
+	if (size1>LEN) size1 = LEN;
+	size2 = LEN-size1;
+	mem_store_vector_oddva(thread, insn, EA+size1, EA+fVECSIZE(),/* slot */ 1, size2, &SRC.ub[size1], &maskvec.ub[size1], 0, fUSE_LOOKUP_ADDRESS());
+	mem_store_vector_oddva(thread, insn, EA, EA, /* slot */ 0, size1, &SRC.ub[0], &maskvec.ub[0], 0, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    },
+    (A_STORE,A_MEMLIKE)
+)
+
+DEF_MACRO(fSTOREMMVQU,
+	{
+        thread->last_pkt->pkt_has_vtcm_access = 0;
+        thread->last_pkt->pkt_access_count = 0;
+		if ( (EA & (fVECSIZE()-1)) == 0) {
+			thread->last_pkt->double_access = 0;
+			fSTOREMMVQ_AL(EA,fVECSIZE(),fVECSIZE(),SRC,MASK);
+		} else {
+			thread->last_pkt->double_access = 1;
+            thread->last_pkt->pkt_has_vmemu_access = 1;
+			fSTOREMMVQU_AL(EA,fVECSIZE(),fVECSIZE(),SRC,MASK);
+		}
+	},
+	()
+)
+
+DEF_MACRO(fSTOREMMVNQU_AL,
+    {
+	size4u_t size1 = ALIGNMENT-((EA)&(ALIGNMENT-1));
+	size4u_t size2;
+	mmvector_t maskvec;
+	int i;
+	for (i = 0; i < fVECSIZE(); i++) maskvec.ub[i] = fGETQBIT(MASK,i);
+	if (size1>LEN) size1 = LEN;
+	size2 = LEN-size1;
+	mem_store_vector_oddva(thread, insn, EA+size1, EA+fVECSIZE(), /* slot */ 1, size2, &SRC.ub[size1], &maskvec.ub[size1], 1, fUSE_LOOKUP_ADDRESS());
+	mem_store_vector_oddva(thread, insn, EA, EA, /* slot */ 0, size1, &SRC.ub[0], &maskvec.ub[0], 1, fUSE_LOOKUP_ADDRESS_BY_REV(thread->processor_ptr));
+    },
+    (A_STORE,A_MEMLIKE)
+)
+
+DEF_MACRO(fSTOREMMVNQU,
+	{
+        thread->last_pkt->pkt_has_vtcm_access = 0;
+        thread->last_pkt->pkt_access_count = 0;
+		if ( (EA & (fVECSIZE()-1)) == 0) {
+			thread->last_pkt->double_access = 0;
+			fSTOREMMVNQ_AL(EA,fVECSIZE(),fVECSIZE(),SRC,MASK);
+		} else {
+			thread->last_pkt->double_access = 1;
+            thread->last_pkt->pkt_has_vmemu_access = 1;
+			fSTOREMMVNQU_AL(EA,fVECSIZE(),fVECSIZE(),SRC,MASK);
+		}
+	},
+	()
+)
+
+
+
+
+DEF_MACRO(fVFOREACH,
+    for (VAR = 0; VAR < fVELEM(WIDTH); VAR++),
+    /* NOTHING */
+)
+
+DEF_MACRO(fVARRAY_ELEMENT_ACCESS,
+    ARRAY.v[(INDEX) / (fVECSIZE()/(sizeof(ARRAY.TYPE[0])))].TYPE[(INDEX) % (fVECSIZE()/(sizeof(ARRAY.TYPE[0])))],
+    ()
+)
+
+DEF_MACRO(fVNEWCANCEL,
+	do { THREAD2STRUCT->VRegs_select &= ~(1<<(REGNUM)); } while (0),
+	()
+)
+
+DEF_MACRO(fTMPVDATA,
+	mmvec_vtmp_data(thread),
+	(A_CVI)
+)
+
+DEF_MACRO(fVSATDW,
+    fVSATW( ( ( ((long long)U)<<32 ) | fZXTN(32,64,V) ) ),
+    /* attribs */
+)
+
+DEF_MACRO(fVASL_SATHI,
+    fVSATW(((U)<<1) | ((V)>>31)),
+    /* attribs */
+)
+
+DEF_MACRO(fVUADDSAT,
+	fVSATUN( WIDTH, fZXTN(WIDTH, 2*WIDTH, U)  + fZXTN(WIDTH, 2*WIDTH, V)),
+	/* attribs */
+)
+
+DEF_MACRO(fVSADDSAT,
+	fVSATN(  WIDTH, fSXTN(WIDTH, 2*WIDTH, U)  + fSXTN(WIDTH, 2*WIDTH, V)),
+	/* attribs */
+)
+
+DEF_MACRO(fVUSUBSAT,
+	fVSATUN( WIDTH, fZXTN(WIDTH, 2*WIDTH, U)  - fZXTN(WIDTH, 2*WIDTH, V)),
+	/* attribs */
+)
+
+DEF_MACRO(fVSSUBSAT,
+	fVSATN(  WIDTH, fSXTN(WIDTH, 2*WIDTH, U)  - fSXTN(WIDTH, 2*WIDTH, V)),
+	/* attribs */
+)
+
+DEF_MACRO(fVAVGU,
+	((fZXTN(WIDTH, 2*WIDTH, U) + fZXTN(WIDTH, 2*WIDTH, V))>>1),
+	/* attribs */
+)
+
+DEF_MACRO(fVAVGURND,
+	((fZXTN(WIDTH, 2*WIDTH, U) + fZXTN(WIDTH, 2*WIDTH, V)+1)>>1),
+	/* attribs */
+)
+
+DEF_MACRO(fVNAVGU,
+	((fZXTN(WIDTH, 2*WIDTH, U) - fZXTN(WIDTH, 2*WIDTH, V))>>1),
+	/* attribs */
+)
+
+DEF_MACRO(fVNAVGURNDSAT,
+	fVSATUN(WIDTH,((fZXTN(WIDTH, 2*WIDTH, U) - fZXTN(WIDTH, 2*WIDTH, V)+1)>>1)),
+	/* attribs */
+)
+
+DEF_MACRO(fVAVGS,
+	((fSXTN(WIDTH, 2*WIDTH, U) + fSXTN(WIDTH, 2*WIDTH, V))>>1),
+	/* attribs */
+)
+
+DEF_MACRO(fVAVGSRND,
+	((fSXTN(WIDTH, 2*WIDTH, U) + fSXTN(WIDTH, 2*WIDTH, V)+1)>>1),
+	/* attribs */
+)
+
+DEF_MACRO(fVNAVGS,
+	((fSXTN(WIDTH, 2*WIDTH, U) - fSXTN(WIDTH, 2*WIDTH, V))>>1),
+	/* attribs */
+)
+
+DEF_MACRO(fVNAVGSRND,
+	((fSXTN(WIDTH, 2*WIDTH, U) - fSXTN(WIDTH, 2*WIDTH, V)+1)>>1),
+	/* attribs */
+)
+
+DEF_MACRO(fVNAVGSRNDSAT,
+	fVSATN(WIDTH,((fSXTN(WIDTH, 2*WIDTH, U) - fSXTN(WIDTH, 2*WIDTH, V)+1)>>1)),
+	/* attribs */
+)
+
+
+DEF_MACRO(fVNOROUND,
+	VAL,
+	/* NOTHING */
+)
+DEF_MACRO(fVNOSAT,
+	VAL,
+	/* NOTHING */
+)
+
+DEF_MACRO(fVROUND,
+	((VAL) + (((SHAMT)>0)?(1LL<<((SHAMT)-1)):0)),
+	/* NOTHING */
+)
+
+DEF_MACRO(fCARRY_FROM_ADD32,
+	(((fZXTN(32,64,A)+fZXTN(32,64,B)+C) >> 32) & 1),
+	/* NOTHING */
+)
+
+DEF_MACRO(fUARCH_NOTE_PUMP_4X,
+	,
+	()
+)
+
+DEF_MACRO(fUARCH_NOTE_PUMP_2X,
+	,
+	()
+)
-- 
2.7.4
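
Note for readers of the macro file above: the arithmetic and unaligned-access
macros are easier to check against a small executable model. The sketch below
is Python, not part of the patch. It assumes that fZXTN/fSXTN zero/sign-extend
an N-bit value and that fVSATUN/fVSATN clamp to the unsigned/signed N-bit
range; those macros are defined elsewhere in the imported files, so the
stand-ins here are assumptions. All function names are hypothetical.

```python
# Hypothetical Python stand-ins for macros defined outside this excerpt.
def zxtn(n, val):
    """Assumed fZXTN: keep the low n bits as an unsigned value."""
    return val & ((1 << n) - 1)


def sxtn(n, val):
    """Assumed fSXTN: interpret the low n bits as a signed value."""
    val &= (1 << n) - 1
    return val - (1 << n) if val & (1 << (n - 1)) else val


def vsatun(n, val):
    """Assumed fVSATUN: clamp to [0, 2**n - 1]."""
    return max(0, min(val, (1 << n) - 1))


def vsatn(n, val):
    """Assumed fVSATN: clamp to [-2**(n-1), 2**(n-1) - 1]."""
    return max(-(1 << (n - 1)), min(val, (1 << (n - 1)) - 1))


# Models of macros that do appear in this patch.
def vuaddsat(width, u, v):
    """fVUADDSAT: unsigned saturating add."""
    return vsatun(width, zxtn(width, u) + zxtn(width, v))


def vssubsat(width, u, v):
    """fVSSUBSAT: signed saturating subtract."""
    return vsatn(width, sxtn(width, u) - sxtn(width, v))


def vavgurnd(width, u, v):
    """fVAVGURND: unsigned average, rounding up on a tie."""
    return (zxtn(width, u) + zxtn(width, v) + 1) >> 1


def vround(val, shamt):
    """fVROUND: add half of the unit that a later shift by shamt discards."""
    return val + ((1 << (shamt - 1)) if shamt > 0 else 0)


def carry_from_add32(a, b, c):
    """fCARRY_FROM_ADD32: carry out of bit 31 of a + b + c."""
    return ((zxtn(32, a) + zxtn(32, b) + c) >> 32) & 1


def split_unaligned(ea, alignment, length):
    """Mirror the size1/size2 computation in fSTOREMMVU_AL.

    Returns ((start, size1), (start + size1, size2)): the bytes up to the
    next alignment boundary, then the remainder.
    """
    size1 = alignment - (ea & (alignment - 1))
    if size1 > length:
        size1 = length
    size2 = length - size1
    return (ea, size1), (ea + size1, size2)


if __name__ == "__main__":
    print(vuaddsat(8, 200, 100))
    print(split_unaligned(0x1010, 128, 128))
```

With 128-byte vectors, a store at an address 16 bytes past a vector boundary
splits into a 112-byte piece and a 16-byte piece, which matches the two
mem_store_vector_oddva calls in the macro.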



* [PATCH v4 07/30] Hexagon HVX (target/hexagon) semantics generator
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (5 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 06/30] Hexagon HVX (target/hexagon) import macro definitions Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 08/30] Hexagon HVX (target/hexagon) semantics generator - part 2 Taylor Simpson
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Add HVX support to the semantics generator

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_semantics.c | 33 +++++++++++++++++++++++++++++++++
 target/hexagon/hex_common.py   | 13 +++++++++++++
 2 files changed, 46 insertions(+)
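
Reviewer aid, not part of the commit: a rough Python model of what the new
EXTINSN macro prints and of the two kinds of predicate added to hex_common.py.
The text layout follows the fprintf format strings in the diff below. The
exact stringified form of ATTRIBS(...) depends on the C preprocessor, so treat
that field as an assumption. emit_extinsn and the toy attribdict entries are
hypothetical names used only for illustration.

```python
def emit_extinsn(tag, beh, attribs, sem):
    """Return text shaped like the output of EXTINSN in gen_semantics.c."""
    return (
        "SEMANTICS( \\\n"
        '    "%s", \\\n'
        '    "%s", \\\n'
        '    """%s""" \\\n'
        ")\n"
        "ATTRIBUTES( \\\n"
        '    "%s", \\\n'
        '    "%s" \\\n'
        ")\n"
    ) % (tag, beh, sem, tag, attribs)


# Toy attribute table; the real one is built by hex_common.py from the
# generator output. "V6_toy_tmp" is a made-up tag.
attribdict = {
    "V6_vinsertwr": {"A_EXTENSION", "A_CVI", "A_CVI_VX"},
    "V6_toy_tmp": {"A_EXTENSION", "A_CVI", "A_CVI_TMP"},
}


def is_hvx_reg(regtype):
    """Same test as the patch: V (vector) or Q (predicate vector)."""
    return regtype in "VQ"


def is_tmp_result(tag):
    """Same test as the patch, applied to the toy table above."""
    return ("A_CVI_TMP" in attribdict[tag] or
            "A_CVI_TMP_DST" in attribdict[tag])


if __name__ == "__main__":
    print(emit_extinsn("V6_vinsertwr",
                       "Vx32.w=vinsert(Rt32)",
                       "ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX)",
                       "VxV.uw[0] = RtV;"), end="")
    print(is_hvx_reg("V"), is_hvx_reg("R"), is_tmp_result("V6_toy_tmp"))
```

The generated text is itself Python-like (SEMANTICS(...) and ATTRIBUTES(...)
calls with triple-quoted semantics), which is why the later generator scripts
can consume it directly.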

diff --git a/target/hexagon/gen_semantics.c b/target/hexagon/gen_semantics.c
index c5fccec..4a2bdd7 100644
--- a/target/hexagon/gen_semantics.c
+++ b/target/hexagon/gen_semantics.c
@@ -44,6 +44,11 @@ int main(int argc, char *argv[])
  *         Q6INSN(A2_add,"Rd32=add(Rs32,Rt32)",ATTRIBS(),
  *         "Add 32-bit registers",
  *         { RdV=RsV+RtV;})
+ *     HVX instructions have the following form
+ *         EXTINSN(V6_vinsertwr, "Vx32.w=vinsert(Rt32)",
+ *         ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX),
+ *         "Insert Word Scalar into Vector",
+ *         VxV.uw[0] = RtV;)
  */
 #define Q6INSN(TAG, BEH, ATTRIBS, DESCR, SEM) \
     do { \
@@ -59,8 +64,23 @@ int main(int argc, char *argv[])
                          ")\n", \
                 #TAG, STRINGIZE(ATTRIBS)); \
     } while (0);
+#define EXTINSN(TAG, BEH, ATTRIBS, DESCR, SEM) \
+    do { \
+        fprintf(outfile, "SEMANTICS( \\\n" \
+                         "    \"%s\", \\\n" \
+                         "    %s, \\\n" \
+                         "    \"\"\"%s\"\"\" \\\n" \
+                         ")\n", \
+                #TAG, STRINGIZE(BEH), STRINGIZE(SEM)); \
+        fprintf(outfile, "ATTRIBUTES( \\\n" \
+                         "    \"%s\", \\\n" \
+                         "    \"%s\" \\\n" \
+                         ")\n", \
+                #TAG, STRINGIZE(ATTRIBS)); \
+    } while (0);
 #include "imported/allidefs.def"
 #undef Q6INSN
+#undef EXTINSN
 
 /*
  * Process the macro definitions
@@ -83,6 +103,19 @@ int main(int argc, char *argv[])
 #include "imported/macros.def"
 #undef DEF_MACRO
 
+/*
+ * Process the macros for HVX
+ */
+#define DEF_MACRO(MNAME, BEH, ATTRS) \
+    fprintf(outfile, "MACROATTRIB( \\\n" \
+                     "    \"%s\", \\\n" \
+                     "    \"\"\"%s\"\"\", \\\n" \
+                     "    \"%s\" \\\n" \
+                     ")\n", \
+            #MNAME, STRINGIZE(BEH), STRINGIZE(ATTRS));
+#include "imported/allext_macros.def"
+#undef DEF_MACRO
+
     fclose(outfile);
     return 0;
 }
diff --git a/target/hexagon/hex_common.py b/target/hexagon/hex_common.py
index b3b5340..47fb628 100755
--- a/target/hexagon/hex_common.py
+++ b/target/hexagon/hex_common.py
@@ -143,6 +143,9 @@ def compute_tag_immediates(tag):
 ##          P                predicate register
 ##          R                GPR register
 ##          M                modifier register
+##          Q                HVX predicate vector
+##          V                HVX vector register
+##          O                HVX new vector register
 ##      regid can be one of the following
 ##          d, e             destination register
 ##          dd               destination register pair
@@ -178,6 +181,9 @@ def is_readwrite(regid):
 def is_scalar_reg(regtype):
     return regtype in "RPC"
 
+def is_hvx_reg(regtype):
+    return regtype in "VQ"
+
 def is_old_val(regtype, regid, tag):
     return regtype+regid+'V' in semdict[tag]
 
@@ -201,6 +207,13 @@ def need_ea(tag):
 def skip_qemu_helper(tag):
     return tag in overrides.keys()
 
+def is_tmp_result(tag):
+    return ('A_CVI_TMP' in attribdict[tag] or
+            'A_CVI_TMP_DST' in attribdict[tag])
+
+def is_new_result(tag):
+    return ('A_CVI_NEW' in attribdict[tag])
+
 def imm_name(immlett):
     return "%siV" % immlett
 
-- 
2.7.4



* [PATCH v4 08/30] Hexagon HVX (target/hexagon) semantics generator - part 2
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (6 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 07/30] Hexagon HVX (target/hexagon) semantics generator Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 09/30] Hexagon HVX (target/hexagon) C preprocessor for decode tree Taylor Simpson
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_helper_funcs.py  | 112 ++++++++++++++--
 target/hexagon/gen_helper_protos.py |  16 ++-
 target/hexagon/gen_tcg_funcs.py     | 254 ++++++++++++++++++++++++++++++++++--
 3 files changed, 360 insertions(+), 22 deletions(-)
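
Reviewer aid, not part of the commit: the helper-signature changes below are
easier to follow with a small model of the argument ordering. Scalar
destinations remain the C return value, HVX destinations become void *
arguments right after env, and an HVX read/write register (x, y) is passed
only once, as a destination. The sketch is simplified Python under stated
assumptions: the is_written/is_read/is_readwrite stand-ins guess at the
existing hex_common helpers, and immediates, new-value operands and the slot
argument are left out. helper_signature is a hypothetical name.

```python
# Simplified stand-ins for hex_common helpers (assumed behaviour).
def is_written(regid):
    return regid[0] in "dexy"


def is_read(regid):
    return regid[0] in "stuvwxy"


def is_readwrite(regid):
    return regid[0] in "xy"


def is_pair(regid):
    return len(regid) == 2


def is_hvx_reg(regtype):
    return regtype in "VQ"


def helper_signature(tag, regs):
    """Build a C-like prototype string for a helper.

    regs is a list of (regtype, regid) tuples, e.g. [("V", "x"), ("R", "t")].
    """
    ret = "void"
    args = ["CPUHexagonState *env"]

    # A scalar destination is returned by value.
    for regtype, regid in regs:
        if is_written(regid) and not is_hvx_reg(regtype):
            ret = "int64_t" if is_pair(regid) else "int32_t"

    # Vector destinations come first, passed by pointer.
    for regtype, regid in regs:
        if is_written(regid) and is_hvx_reg(regtype):
            args.append("void *%s%sV_void" % (regtype, regid))

    # Then the sources; HVX read/write registers were already emitted.
    for regtype, regid in regs:
        if not is_read(regid):
            continue
        if is_hvx_reg(regtype):
            if is_readwrite(regid):
                continue
            args.append("void *%s%sV_void" % (regtype, regid))
        else:
            ctype = "int64_t" if is_pair(regid) else "int32_t"
            args.append("%s %s%sV" % (ctype, regtype, regid))

    return "%s HELPER(%s)(%s)" % (ret, tag, ", ".join(args))


if __name__ == "__main__":
    # Register lists here are illustrative, derived from the syntax strings.
    print(helper_signature("V6_vinsertwr", [("V", "x"), ("R", "t")]))
    print(helper_signature("V6_vaddw", [("V", "d"), ("V", "u"), ("V", "v")]))
```

Because the helper receives pointers into CPUHexagonState, the generated
helper body needs no return statement for HVX results, which is why
gen_helper_dst_write_ext and gen_helper_dst_write_ext_pair are empty.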

diff --git a/target/hexagon/gen_helper_funcs.py b/target/hexagon/gen_helper_funcs.py
index 2b1c5d8..ac5ce10 100755
--- a/target/hexagon/gen_helper_funcs.py
+++ b/target/hexagon/gen_helper_funcs.py
@@ -48,12 +48,26 @@ def gen_helper_arg_pair(f,regtype,regid,regno):
     if regno >= 0 : f.write(", ")
     f.write("int64_t %s%sV" % (regtype,regid))
 
+def gen_helper_arg_ext(f,regtype,regid,regno):
+    if regno > 0 : f.write(", ")
+    f.write("void *%s%sV_void" % (regtype,regid))
+
+def gen_helper_arg_ext_pair(f,regtype,regid,regno):
+    if regno > 0 : f.write(", ")
+    f.write("void *%s%sV_void" % (regtype,regid))
+
 def gen_helper_arg_opn(f,regtype,regid,i,tag):
     if (hex_common.is_pair(regid)):
-        gen_helper_arg_pair(f,regtype,regid,i)
+        if (hex_common.is_hvx_reg(regtype)):
+            gen_helper_arg_ext_pair(f,regtype,regid,i)
+        else:
+            gen_helper_arg_pair(f,regtype,regid,i)
     elif (hex_common.is_single(regid)):
         if hex_common.is_old_val(regtype, regid, tag):
-            gen_helper_arg(f,regtype,regid,i)
+            if (hex_common.is_hvx_reg(regtype)):
+                gen_helper_arg_ext(f,regtype,regid,i)
+            else:
+                gen_helper_arg(f,regtype,regid,i)
         elif hex_common.is_new_val(regtype, regid, tag):
             gen_helper_arg_new(f,regtype,regid,i)
         else:
@@ -72,25 +86,67 @@ def gen_helper_dest_decl_pair(f,regtype,regid,regno,subfield=""):
     f.write("    int64_t %s%sV%s = 0;\n" % \
         (regtype,regid,subfield))
 
+def gen_helper_dest_decl_ext(f,regtype,regid):
+    if (regtype == "Q"):
+        f.write("    /* %s%sV is *(MMQReg *)(%s%sV_void) */\n" % \
+            (regtype,regid,regtype,regid))
+    else:
+        f.write("    /* %s%sV is *(MMVector *)(%s%sV_void) */\n" % \
+            (regtype,regid,regtype,regid))
+
+def gen_helper_dest_decl_ext_pair(f,regtype,regid,regno):
+    f.write("    /* %s%sV is *(MMVectorPair *)(%s%sV_void) */\n" % \
+        (regtype,regid,regtype, regid))
+
 def gen_helper_dest_decl_opn(f,regtype,regid,i):
     if (hex_common.is_pair(regid)):
-        gen_helper_dest_decl_pair(f,regtype,regid,i)
+        if (hex_common.is_hvx_reg(regtype)):
+            gen_helper_dest_decl_ext_pair(f,regtype,regid, i)
+        else:
+            gen_helper_dest_decl_pair(f,regtype,regid,i)
     elif (hex_common.is_single(regid)):
-        gen_helper_dest_decl(f,regtype,regid,i)
+        if (hex_common.is_hvx_reg(regtype)):
+            gen_helper_dest_decl_ext(f,regtype,regid)
+        else:
+            gen_helper_dest_decl(f,regtype,regid,i)
     else:
         print("Bad register parse: ",regtype,regid,toss,numregs)
 
+def gen_helper_src_var_ext(f,regtype,regid):
+    if (regtype == "Q"):
+        f.write("    /* %s%sV is *(MMQReg *)(%s%sV_void) */\n" % \
+            (regtype,regid,regtype,regid))
+    else:
+        f.write("    /* %s%sV is *(MMVector *)(%s%sV_void) */\n" % \
+            (regtype,regid,regtype,regid))
+
+def gen_helper_src_var_ext_pair(f,regtype,regid,regno):
+    f.write("    /* %s%sV%s is *(MMVectorPair *)(%s%sV%s_void) */\n" % \
+        (regtype,regid,regno,regtype,regid,regno))
+
 def gen_helper_return(f,regtype,regid,regno):
     f.write("    return %s%sV;\n" % (regtype,regid))
 
 def gen_helper_return_pair(f,regtype,regid,regno):
     f.write("    return %s%sV;\n" % (regtype,regid))
 
+def gen_helper_dst_write_ext(f,regtype,regid):
+    return
+
+def gen_helper_dst_write_ext_pair(f,regtype,regid):
+    return
+
 def gen_helper_return_opn(f, regtype, regid, i):
     if (hex_common.is_pair(regid)):
-        gen_helper_return_pair(f,regtype,regid,i)
+        if (hex_common.is_hvx_reg(regtype)):
+            gen_helper_dst_write_ext_pair(f,regtype,regid)
+        else:
+            gen_helper_return_pair(f,regtype,regid,i)
     elif (hex_common.is_single(regid)):
-        gen_helper_return(f,regtype,regid,i)
+        if (hex_common.is_hvx_reg(regtype)):
+            gen_helper_dst_write_ext(f,regtype,regid)
+        else:
+            gen_helper_return(f,regtype,regid,i)
     else:
         print("Bad register parse: ",regtype,regid,toss,numregs)
 
@@ -129,14 +185,20 @@ def gen_helper_function(f, tag, tagregs, tagimms):
                 % (tag, tag))
     else:
         ## The return type of the function is the type of the destination
-        ## register
+        ## register (if scalar)
         i=0
         for regtype,regid,toss,numregs in regs:
             if (hex_common.is_written(regid)):
                 if (hex_common.is_pair(regid)):
-                    gen_helper_return_type_pair(f,regtype,regid,i)
+                    if (hex_common.is_hvx_reg(regtype)):
+                        continue
+                    else:
+                        gen_helper_return_type_pair(f,regtype,regid,i)
                 elif (hex_common.is_single(regid)):
-                    gen_helper_return_type(f,regtype,regid,i)
+                    if (hex_common.is_hvx_reg(regtype)):
+                        continue
+                    else:
+                        gen_helper_return_type(f,regtype,regid,i)
                 else:
                     print("Bad register parse: ",regtype,regid,toss,numregs)
             i += 1
@@ -145,16 +207,37 @@ def gen_helper_function(f, tag, tagregs, tagimms):
             f.write("void")
         f.write(" HELPER(%s)(CPUHexagonState *env" % tag)
 
+        ## Arguments include the vector destination operands
         i = 1
+        for regtype,regid,toss,numregs in regs:
+            if (hex_common.is_written(regid)):
+                if (hex_common.is_pair(regid)):
+                    if (hex_common.is_hvx_reg(regtype)):
+                        gen_helper_arg_ext_pair(f,regtype,regid,i)
+                    else:
+                        continue
+                elif (hex_common.is_single(regid)):
+                    if (hex_common.is_hvx_reg(regtype)):
+                        gen_helper_arg_ext(f,regtype,regid,i)
+                    else:
+                        # This is the return value of the function
+                        continue
+                else:
+                    print("Bad register parse: ",regtype,regid,toss,numregs)
+                i += 1
 
         ## Arguments to the helper function are the source regs and immediates
         for regtype,regid,toss,numregs in regs:
             if (hex_common.is_read(regid)):
+                if (hex_common.is_hvx_reg(regtype) and
+                    hex_common.is_readwrite(regid)):
+                    continue
                 gen_helper_arg_opn(f,regtype,regid,i,tag)
                 i += 1
         for immlett,bits,immshift in imms:
             gen_helper_arg_imm(f,immlett)
             i += 1
+
         if hex_common.need_slot(tag):
             if i > 0: f.write(", ")
             f.write("uint32_t slot")
@@ -173,6 +256,17 @@ def gen_helper_function(f, tag, tagregs, tagimms):
                 gen_helper_dest_decl_opn(f,regtype,regid,i)
             i += 1
 
+        for regtype,regid,toss,numregs in regs:
+            if (hex_common.is_read(regid)):
+                if (hex_common.is_pair(regid)):
+                    if (hex_common.is_hvx_reg(regtype)):
+                        gen_helper_src_var_ext_pair(f,regtype,regid,i)
+                elif (hex_common.is_single(regid)):
+                    if (hex_common.is_hvx_reg(regtype)):
+                        gen_helper_src_var_ext(f,regtype,regid)
+                else:
+                    print("Bad register parse: ",regtype,regid,toss,numregs)
+
         if 'A_FPOP' in hex_common.attribdict[tag]:
             f.write('    arch_fpop_start(env);\n');
 
diff --git a/target/hexagon/gen_helper_protos.py b/target/hexagon/gen_helper_protos.py
index ea41007..229ef8d 100755
--- a/target/hexagon/gen_helper_protos.py
+++ b/target/hexagon/gen_helper_protos.py
@@ -94,19 +94,33 @@ def gen_helper_prototype(f, tag, tagregs, tagimms):
             f.write('DEF_HELPER_%s(%s' % (def_helper_size, tag))
 
         ## Generate the qemu DEF_HELPER type for each result
+        ## Iterate over this list twice
+        ## - Emit the scalar result
+        ## - Emit the vector result
         i=0
         for regtype,regid,toss,numregs in regs:
             if (hex_common.is_written(regid)):
-                gen_def_helper_opn(f, tag, regtype, regid, toss, numregs, i)
+                if (not hex_common.is_hvx_reg(regtype)):
+                    gen_def_helper_opn(f, tag, regtype, regid, toss, numregs, i)
                 i += 1
 
         ## Put the env between the outputs and inputs
         f.write(', env' )
         i += 1
 
+        # Second pass
+        for regtype,regid,toss,numregs in regs:
+            if (hex_common.is_written(regid)):
+                if (hex_common.is_hvx_reg(regtype)):
+                    gen_def_helper_opn(f, tag, regtype, regid, toss, numregs, i)
+                    i += 1
+
         ## Generate the qemu type for each input operand (regs and immediates)
         for regtype,regid,toss,numregs in regs:
             if (hex_common.is_read(regid)):
+                if (hex_common.is_hvx_reg(regtype) and
+                    hex_common.is_readwrite(regid)):
+                    continue
                 gen_def_helper_opn(f, tag, regtype, regid, toss, numregs, i)
                 i += 1
         for immlett,bits,immshift in imms:
diff --git a/target/hexagon/gen_tcg_funcs.py b/target/hexagon/gen_tcg_funcs.py
index ca8a801..48bcf89 100755
--- a/target/hexagon/gen_tcg_funcs.py
+++ b/target/hexagon/gen_tcg_funcs.py
@@ -119,10 +119,95 @@ def genptr_decl(f, tag, regtype, regid, regno):
                 (regtype, regid, regtype, regid))
         else:
             print("Bad register parse: ", regtype, regid)
+    elif (regtype == "V"):
+        if (regid in {"dd"}):
+            f.write("    const int %s%sN = insn->regno[%d];\n" %\
+                (regtype, regid, regno))
+            f.write("    const intptr_t %s%sV_off =\n" %\
+                 (regtype, regid))
+            if (hex_common.is_tmp_result(tag)):
+                f.write("        ctx_tmp_vreg_off(ctx, %s%sN, 2, true);\n" % \
+                     (regtype, regid))
+            else:
+                f.write("        ctx_future_vreg_off(ctx, %s%sN," % \
+                     (regtype, regid))
+                f.write(" 2, true);\n")
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    TCGv_ptr %s%sV = tcg_temp_new_ptr();\n" % \
+                    (regtype, regid))
+                f.write("    tcg_gen_addi_ptr(%s%sV, cpu_env, %s%sV_off);\n" % \
+                    (regtype, regid, regtype, regid))
+        elif (regid in {"uu", "vv", "xx"}):
+            f.write("    const int %s%sN = insn->regno[%d];\n" %\
+                (regtype, regid, regno))
+            f.write("    const intptr_t %s%sV_off =\n" % \
+                 (regtype, regid))
+            f.write("        offsetof(CPUHexagonState, %s%sV);\n" % \
+                 (regtype, regid))
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    TCGv_ptr %s%sV = tcg_temp_new_ptr();\n" % \
+                    (regtype, regid))
+                f.write("    tcg_gen_addi_ptr(%s%sV, cpu_env, %s%sV_off);\n" % \
+                    (regtype, regid, regtype, regid))
+        elif (regid in {"s", "u", "v", "w"}):
+            f.write("    const int %s%sN = insn->regno[%d];\n" % \
+                (regtype, regid, regno))
+            f.write("    const intptr_t %s%sV_off =\n" % \
+                              (regtype, regid))
+            f.write("        vreg_src_off(ctx, %s%sN);\n" % \
+                              (regtype, regid))
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    TCGv_ptr %s%sV = tcg_temp_new_ptr();\n" % \
+                    (regtype, regid))
+        elif (regid in {"d", "x", "y"}):
+            f.write("    const int %s%sN = insn->regno[%d];\n" % \
+                (regtype, regid, regno))
+            f.write("    const intptr_t %s%sV_off =\n" % \
+                (regtype, regid))
+            if (hex_common.is_tmp_result(tag)):
+                f.write("        ctx_tmp_vreg_off(ctx, %s%sN, 1, true);\n" % \
+                    (regtype, regid))
+            else:
+                f.write("        ctx_future_vreg_off(ctx, %s%sN," %\
+                    (regtype, regid))
+                f.write(" 1, true);\n");
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    TCGv_ptr %s%sV = tcg_temp_new_ptr();\n" % \
+                    (regtype, regid))
+                f.write("    tcg_gen_addi_ptr(%s%sV, cpu_env, %s%sV_off);\n" % \
+                    (regtype, regid, regtype, regid))
+        else:
+            print("Bad register parse: ", regtype, regid)
+    elif (regtype == "Q"):
+        if (regid in {"d", "e", "x"}):
+            f.write("    const int %s%sN = insn->regno[%d];\n" % \
+                (regtype, regid, regno))
+            f.write("    const intptr_t %s%sV_off =\n" % \
+                (regtype, regid))
+            f.write("        offsetof(CPUHexagonState,\n")
+            f.write("                 future_QRegs[%s%sN]);\n" % \
+                (regtype, regid))
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    TCGv_ptr %s%sV = tcg_temp_new_ptr();\n" % \
+                    (regtype, regid))
+                f.write("    tcg_gen_addi_ptr(%s%sV, cpu_env, %s%sV_off);\n" % \
+                    (regtype, regid, regtype, regid))
+        elif (regid in {"s", "t", "u", "v"}):
+            f.write("    const int %s%sN = insn->regno[%d];\n" % \
+                (regtype, regid, regno))
+            f.write("    const intptr_t %s%sV_off =\n" %\
+                (regtype, regid))
+            f.write("        offsetof(CPUHexagonState, QRegs[%s%sN]);\n" % \
+                (regtype, regid))
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    TCGv_ptr %s%sV = tcg_temp_new_ptr();\n" % \
+                    (regtype, regid))
+        else:
+            print("Bad register parse: ", regtype, regid)
     else:
         print("Bad register parse: ", regtype, regid)
 
-def genptr_decl_new(f,regtype,regid,regno):
+def genptr_decl_new(f, tag, regtype, regid, regno):
     if (regtype == "N"):
         if (regid in {"s", "t"}):
             f.write("    TCGv %s%sN = hex_new_value[insn->regno[%d]];\n" % \
@@ -135,6 +220,21 @@ def genptr_decl_new(f,regtype,regid,regno):
                 (regtype, regid, regno))
         else:
             print("Bad register parse: ", regtype, regid)
+    elif (regtype == "O"):
+        if (regid == "s"):
+            f.write("    const intptr_t %s%sN_num = insn->regno[%d];\n" % \
+                (regtype, regid, regno))
+            if (hex_common.skip_qemu_helper(tag)):
+                f.write("    const intptr_t %s%sN_off =\n" % \
+                    (regtype, regid))
+                f.write("         ctx_future_vreg_off(ctx, %s%sN_num," % \
+                    (regtype, regid))
+                f.write(" 1, true);\n")
+            else:
+                f.write("    TCGv %s%sN = tcg_constant_tl(%s%sN_num);\n" % \
+                    (regtype, regid, regtype, regid))
+        else:
+            print("Bad register parse: ", regtype, regid)
     else:
         print("Bad register parse: ", regtype, regid)
 
@@ -145,7 +245,7 @@ def genptr_decl_opn(f, tag, regtype, regid, toss, numregs, i):
         if hex_common.is_old_val(regtype, regid, tag):
             genptr_decl(f,tag, regtype, regid, i)
         elif hex_common.is_new_val(regtype, regid, tag):
-            genptr_decl_new(f,regtype,regid,i)
+            genptr_decl_new(f, tag, regtype, regid, i)
         else:
             print("Bad register parse: ",regtype,regid,toss,numregs)
     else:
@@ -159,7 +259,7 @@ def genptr_decl_imm(f,immlett):
     f.write("    int %s = insn->immed[%d];\n" % \
         (hex_common.imm_name(immlett), i))
 
-def genptr_free(f,regtype,regid,regno):
+def genptr_free(f, tag, regtype, regid, regno):
     if (regtype == "R"):
         if (regid in {"dd", "ss", "tt", "xx", "yy"}):
             f.write("    tcg_temp_free_i64(%s%sV);\n" % (regtype, regid))
@@ -182,33 +282,51 @@ def genptr_free(f,regtype,regid,regno):
     elif (regtype == "M"):
         if (regid != "u"):
             print("Bad register parse: ", regtype, regid)
+    elif (regtype == "V"):
+        if (regid in {"dd", "uu", "vv", "xx", \
+                      "d", "s", "u", "v", "w", "x", "y"}):
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    tcg_temp_free_ptr(%s%sV);\n" % \
+                    (regtype, regid))
+        else:
+            print("Bad register parse: ", regtype, regid)
+    elif (regtype == "Q"):
+        if (regid in {"d", "e", "s", "t", "u", "v", "x"}):
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    tcg_temp_free_ptr(%s%sV);\n" % \
+                    (regtype, regid))
+        else:
+            print("Bad register parse: ", regtype, regid)
     else:
         print("Bad register parse: ", regtype, regid)
 
-def genptr_free_new(f,regtype,regid,regno):
+def genptr_free_new(f, tag, regtype, regid, regno):
     if (regtype == "N"):
         if (regid not in {"s", "t"}):
             print("Bad register parse: ", regtype, regid)
     elif (regtype == "P"):
         if (regid not in {"t", "u", "v"}):
             print("Bad register parse: ", regtype, regid)
+    elif (regtype == "O"):
+        if (regid != "s"):
+            print("Bad register parse: ", regtype, regid)
     else:
         print("Bad register parse: ", regtype, regid)
 
 def genptr_free_opn(f,regtype,regid,i,tag):
     if (hex_common.is_pair(regid)):
-        genptr_free(f,regtype,regid,i)
+        genptr_free(f, tag, regtype, regid, i)
     elif (hex_common.is_single(regid)):
         if hex_common.is_old_val(regtype, regid, tag):
-            genptr_free(f,regtype,regid,i)
+            genptr_free(f, tag, regtype, regid, i)
         elif hex_common.is_new_val(regtype, regid, tag):
-            genptr_free_new(f,regtype,regid,i)
+            genptr_free_new(f, tag, regtype, regid, i)
         else:
             print("Bad register parse: ",regtype,regid,toss,numregs)
     else:
         print("Bad register parse: ",regtype,regid,toss,numregs)
 
-def genptr_src_read(f,regtype,regid):
+def genptr_src_read(f, tag, regtype, regid):
     if (regtype == "R"):
         if (regid in {"ss", "tt", "xx", "yy"}):
             f.write("    tcg_gen_concat_i32_i64(%s%sV, hex_gpr[%s%sN],\n" % \
@@ -238,6 +356,47 @@ def genptr_src_read(f,regtype,regid):
     elif (regtype == "M"):
         if (regid != "u"):
             print("Bad register parse: ", regtype, regid)
+    elif (regtype == "V"):
+        if (regid in {"uu", "vv", "xx"}):
+            f.write("    tcg_gen_gvec_mov(MO_64, %s%sV_off,\n" % \
+                (regtype, regid))
+            f.write("        vreg_src_off(ctx, %s%sN),\n" % \
+                (regtype, regid))
+            f.write("        sizeof(MMVector), sizeof(MMVector));\n")
+            f.write("    tcg_gen_gvec_mov(MO_64,\n")
+            f.write("        %s%sV_off + sizeof(MMVector),\n" % \
+                (regtype, regid))
+            f.write("        vreg_src_off(ctx, %s%sN ^ 1),\n" % \
+                (regtype, regid))
+            f.write("        sizeof(MMVector), sizeof(MMVector));\n")
+        elif (regid in {"s", "u", "v", "w"}):
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    tcg_gen_addi_ptr(%s%sV, cpu_env, %s%sV_off);\n" % \
+                                 (regtype, regid, regtype, regid))
+        elif (regid in {"x", "y"}):
+            f.write("    tcg_gen_gvec_mov(MO_64, %s%sV_off,\n" % \
+                             (regtype, regid))
+            f.write("        vreg_src_off(ctx, %s%sN),\n" % \
+                             (regtype, regid))
+            f.write("        sizeof(MMVector), sizeof(MMVector));\n")
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    tcg_gen_addi_ptr(%s%sV, cpu_env, %s%sV_off);\n" % \
+                                 (regtype, regid, regtype, regid))
+        else:
+            print("Bad register parse: ", regtype, regid)
+    elif (regtype == "Q"):
+        if (regid in {"s", "t", "u", "v"}):
+            if (not hex_common.skip_qemu_helper(tag)):
+                f.write("    tcg_gen_addi_ptr(%s%sV, cpu_env, %s%sV_off);\n" % \
+                    (regtype, regid, regtype, regid))
+        elif (regid in {"x"}):
+            f.write("    tcg_gen_gvec_mov(MO_64, %s%sV_off,\n" % \
+                (regtype, regid))
+            f.write("        offsetof(CPUHexagonState, QRegs[%s%sN]),\n" % \
+                (regtype, regid))
+            f.write("        sizeof(MMQReg), sizeof(MMQReg));\n")
+        else:
+            print("Bad register parse: ", regtype, regid)
     else:
         print("Bad register parse: ", regtype, regid)
 
@@ -248,15 +407,18 @@ def genptr_src_read_new(f,regtype,regid):
     elif (regtype == "P"):
         if (regid not in {"t", "u", "v"}):
             print("Bad register parse: ", regtype, regid)
+    elif (regtype == "O"):
+        if (regid != "s"):
+            print("Bad register parse: ", regtype, regid)
     else:
         print("Bad register parse: ", regtype, regid)
 
 def genptr_src_read_opn(f,regtype,regid,tag):
     if (hex_common.is_pair(regid)):
-        genptr_src_read(f,regtype,regid)
+        genptr_src_read(f, tag, regtype, regid)
     elif (hex_common.is_single(regid)):
         if hex_common.is_old_val(regtype, regid, tag):
-            genptr_src_read(f,regtype,regid)
+            genptr_src_read(f, tag, regtype, regid)
         elif hex_common.is_new_val(regtype, regid, tag):
             genptr_src_read_new(f,regtype,regid)
         else:
@@ -334,11 +496,68 @@ def genptr_dst_write(f, tag, regtype, regid):
     else:
         print("Bad register parse: ", regtype, regid)
 
+def genptr_dst_write_ext(f, tag, regtype, regid, newv="0"):
+    if (regtype == "V"):
+        if (regid in {"dd", "xx", "yy"}):
+            if ('A_CONDEXEC' in hex_common.attribdict[tag]):
+                is_predicated = "true"
+            else:
+                is_predicated = "false"
+            f.write("    gen_log_vreg_write_pair(ctx, %s%sV_off, %s%sN, " % \
+                (regtype, regid, regtype, regid))
+            f.write("%s, insn->slot, %s);\n" % \
+                (newv, is_predicated))
+            f.write("    ctx_log_vreg_write_pair(ctx, %s%sN, %s,\n" % \
+                (regtype, regid, newv))
+            f.write("        %s);\n" % (is_predicated))
+        elif (regid in {"d", "x", "y"}):
+            if ('A_CONDEXEC' in hex_common.attribdict[tag]):
+                is_predicated = "true"
+            else:
+                is_predicated = "false"
+            f.write("    gen_log_vreg_write(ctx, %s%sV_off, %s%sN, %s, " % \
+                (regtype, regid, regtype, regid, newv))
+            f.write("insn->slot, %s);\n" % \
+                (is_predicated))
+            f.write("    ctx_log_vreg_write(ctx, %s%sN, %s, %s);\n" % \
+                (regtype, regid, newv, is_predicated))
+        else:
+            print("Bad register parse: ", regtype, regid)
+    elif (regtype == "Q"):
+        if (regid in {"d", "e", "x"}):
+            if ('A_CONDEXEC' in hex_common.attribdict[tag]):
+                is_predicated = "true"
+            else:
+                is_predicated = "false"
+            f.write("    gen_log_qreg_write(%s%sV_off, %s%sN, %s, " % \
+                (regtype, regid, regtype, regid, newv))
+            f.write("insn->slot, %s);\n" % (is_predicated))
+            f.write("    ctx_log_qreg_write(ctx, %s%sN, %s);\n" % \
+                (regtype, regid, is_predicated))
+        else:
+            print("Bad register parse: ", regtype, regid)
+    else:
+        print("Bad register parse: ", regtype, regid)
+
 def genptr_dst_write_opn(f,regtype, regid, tag):
     if (hex_common.is_pair(regid)):
-        genptr_dst_write(f, tag, regtype, regid)
+        if (hex_common.is_hvx_reg(regtype)):
+            if (hex_common.is_tmp_result(tag)):
+                genptr_dst_write_ext(f, tag, regtype, regid, "EXT_TMP")
+            else:
+                genptr_dst_write_ext(f, tag, regtype, regid)
+        else:
+            genptr_dst_write(f, tag, regtype, regid)
     elif (hex_common.is_single(regid)):
-        genptr_dst_write(f, tag, regtype, regid)
+        if (hex_common.is_hvx_reg(regtype)):
+            if (hex_common.is_new_result(tag)):
+                genptr_dst_write_ext(f, tag, regtype, regid, "EXT_NEW")
+            elif (hex_common.is_tmp_result(tag)):
+                genptr_dst_write_ext(f, tag, regtype, regid, "EXT_TMP")
+            else:
+                genptr_dst_write_ext(f, tag, regtype, regid, "EXT_DFL")
+        else:
+            genptr_dst_write(f, tag, regtype, regid)
     else:
         print("Bad register parse: ",regtype,regid,toss,numregs)
 
@@ -409,13 +628,24 @@ def gen_tcg_func(f, tag, regs, imms):
         ## If there is a scalar result, it is the return type
         for regtype,regid,toss,numregs in regs:
             if (hex_common.is_written(regid)):
+                if (hex_common.is_hvx_reg(regtype)):
+                    continue
                 gen_helper_call_opn(f, tag, regtype, regid, toss, numregs, i)
                 i += 1
         if (i > 0): f.write(", ")
         f.write("cpu_env")
         i=1
         for regtype,regid,toss,numregs in regs:
+            if (hex_common.is_written(regid)):
+                if (not hex_common.is_hvx_reg(regtype)):
+                    continue
+                gen_helper_call_opn(f, tag, regtype, regid, toss, numregs, i)
+                i += 1
+        for regtype,regid,toss,numregs in regs:
             if (hex_common.is_read(regid)):
+                if (hex_common.is_hvx_reg(regtype) and
+                    hex_common.is_readwrite(regid)):
+                    continue
                 gen_helper_call_opn(f, tag, regtype, regid, toss, numregs, i)
                 i += 1
         for immlett,bits,immshift in imms:
-- 
2.7.4
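
Note on the helper-call argument order produced by the gen_tcg_func hunk above: a scalar destination is emitted first (it becomes the helper's return value), then cpu_env, then HVX destinations (passed by pointer), then sources, skipping HVX read/write operands that were already passed as destinations. The following is a minimal sketch of that ordering only, not the generator itself. The predicate bodies are simplified assumptions about hex_common, and helper_call_args and the "<type><id>V" argument naming are hypothetical and used purely for illustration.

```python
# Simplified stand-ins for the hex_common predicates (assumed behaviour):
# the first letter of the register id encodes how the operand is used.
def is_written(regid):
    return regid[0] in "dexy"


def is_read(regid):
    return regid[0] in "stuvwxy"


def is_readwrite(regid):
    return regid[0] in "xy"


def is_hvx_reg(regtype):
    return regtype in "VQ"


def helper_call_args(regs, imms=()):
    """Return argument names in the order the three loops emit them.

    regs is a list of (regtype, regid) pairs; imms is a list of
    immediate names appended at the end (hypothetical naming).
    """
    args = []
    # 1. A scalar destination comes first; it is the helper's result.
    for regtype, regid in regs:
        if is_written(regid) and not is_hvx_reg(regtype):
            args.append(f"{regtype}{regid}V")
    args.append("cpu_env")
    # 2. HVX destinations are passed by pointer, right after cpu_env.
    for regtype, regid in regs:
        if is_written(regid) and is_hvx_reg(regtype):
            args.append(f"{regtype}{regid}V")
    # 3. Sources follow; an HVX read/write operand was passed in step 2.
    for regtype, regid in regs:
        if is_read(regid):
            if is_hvx_reg(regtype) and is_readwrite(regid):
                continue
            args.append(f"{regtype}{regid}V")
    args.extend(imms)
    return args
```

For an operand list shaped like "Vx32.w=vinsert(Rt32)", this yields cpu_env, the Vx pointer once, then Rt.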



* [PATCH v4 09/30] Hexagon HVX (target/hexagon) C preprocessor for decode tree
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (7 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 08/30] Hexagon HVX (target/hexagon) semantics generator - part 2 Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 10/30] Hexagon HVX (target/hexagon) instruction utility functions Taylor Simpson
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_dectree_import.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/target/hexagon/gen_dectree_import.c b/target/hexagon/gen_dectree_import.c
index 5b7ecfc..ee35467 100644
--- a/target/hexagon/gen_dectree_import.c
+++ b/target/hexagon/gen_dectree_import.c
@@ -40,6 +40,11 @@ const char * const opcode_names[] = {
  *         Q6INSN(A2_add,"Rd32=add(Rs32,Rt32)",ATTRIBS(),
  *         "Add 32-bit registers",
  *         { RdV=RsV+RtV;})
+ *     HVX instructions have the following form
+ *         EXTINSN(V6_vinsertwr, "Vx32.w=vinsert(Rt32)",
+ *         ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX,A_CVI_LATE),
+ *         "Insert Word Scalar into Vector",
+ *         VxV.uw[0] = RtV;)
  */
 const char * const opcode_syntax[XX_LAST_OPCODE] = {
 #define Q6INSN(TAG, BEH, ATTRIBS, DESCR, SEM) \
@@ -105,6 +110,14 @@ static const char *get_opcode_enc(int opcode)
 
 static const char *get_opcode_enc_class(int opcode)
 {
+    const char *tmp = opcode_encodings[opcode].encoding;
+    if (tmp == NULL) {
+        const char *test = "V6_";        /* HVX */
+        const char *name = opcode_names[opcode];
+        if (strncmp(name, test, strlen(test)) == 0) {
+            return "EXT_mmvec";
+        }
+    }
     return opcode_enc_class_names[opcode_encodings[opcode].enc_class];
 }
 
-- 
2.7.4
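
Note on the get_opcode_enc_class change above: an opcode with no encoding string whose name starts with "V6_" is reported as class "EXT_mmvec"; everything else goes through the existing class-name table. A small Python model of that decision, assuming the table is a plain list indexed by class number (enc_class_name and the sample table entries are hypothetical):

```python
def enc_class_name(name, encoding, enc_class, enc_class_names):
    """Model of the fallback: no encoding plus a "V6_" name means HVX."""
    if encoding is None and name.startswith("V6_"):
        return "EXT_mmvec"
    # Otherwise use the regular class-name table, as before the patch.
    return enc_class_names[enc_class]
```

Note that an HVX opcode that does have an encoding string still takes the table path.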



* [PATCH v4 10/30] Hexagon HVX (target/hexagon) instruction utility functions
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (8 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 09/30] Hexagon HVX (target/hexagon) C preprocessor for decode tree Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-29 18:53   ` Richard Henderson
  2021-10-12 10:10 ` [PATCH v4 11/30] Hexagon HVX (target/hexagon) helper functions Taylor Simpson
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Functions to support scatter/gather
Add new file to target/hexagon/meson.build

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/mmvec/system_ext_mmvec.h | 29 +++++++++++++++
 target/hexagon/mmvec/system_ext_mmvec.c | 66 +++++++++++++++++++++++++++++++++
 target/hexagon/meson.build              |  1 +
 3 files changed, 96 insertions(+)
 create mode 100644 target/hexagon/mmvec/system_ext_mmvec.h
 create mode 100644 target/hexagon/mmvec/system_ext_mmvec.c

diff --git a/target/hexagon/mmvec/system_ext_mmvec.h b/target/hexagon/mmvec/system_ext_mmvec.h
new file mode 100644
index 0000000..2963061
--- /dev/null
+++ b/target/hexagon/mmvec/system_ext_mmvec.h
@@ -0,0 +1,29 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HEXAGON_SYSTEM_EXT_MMVEC_H
+#define HEXAGON_SYSTEM_EXT_MMVEC_H
+
+void mem_gather_store(CPUHexagonState *env, target_ulong vaddr, int slot);
+void mem_vector_scatter_init(CPUHexagonState *env, int slot,
+                             target_ulong base_vaddr, int length,
+                             int element_size);
+void mem_vector_gather_init(CPUHexagonState *env,
+                            target_ulong base_vaddr, int length,
+                            int element_size);
+
+#endif
diff --git a/target/hexagon/mmvec/system_ext_mmvec.c b/target/hexagon/mmvec/system_ext_mmvec.c
new file mode 100644
index 0000000..9de1a25
--- /dev/null
+++ b/target/hexagon/mmvec/system_ext_mmvec.c
@@ -0,0 +1,66 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "cpu.h"
+#include "mmvec/system_ext_mmvec.h"
+
+void mem_gather_store(CPUHexagonState *env, target_ulong vaddr, int slot)
+{
+    size_t size = sizeof(MMVector);
+
+    env->vstore_pending[slot] = 1;
+    env->vstore[slot].va   = vaddr;
+    env->vstore[slot].size = size;
+    memcpy(&env->vstore[slot].data.ub[0], &env->tmp_VRegs[0], size);
+
+    /* On a gather store, overwrite the store mask to emulate dropped gathers */
+    bitmap_copy(env->vstore[slot].mask, env->vtcm_log.mask, size);
+}
+
+void mem_vector_scatter_init(CPUHexagonState *env, int slot,
+                             target_ulong base_vaddr,
+                             int length, int element_size)
+{
+    int i;
+
+    for (i = 0; i < sizeof(MMVector); i++) {
+        env->vtcm_log.data.ub[i] = 0;
+    }
+    bitmap_zero(env->vtcm_log.mask, MAX_VEC_SIZE_BYTES);
+
+    env->vtcm_pending = true;
+    env->vtcm_log.op = false;
+    env->vtcm_log.op_size = 0;
+    env->vtcm_log.size = sizeof(MMVector);
+}
+
+void mem_vector_gather_init(CPUHexagonState *env,
+                            target_ulong base_vaddr,
+                            int length, int element_size)
+{
+    int i;
+
+    for (i = 0; i < sizeof(MMVector); i++) {
+        env->vtcm_log.data.ub[i] = 0;
+        env->vtcm_log.va[i] = 0;
+        env->tmp_VRegs[0].ub[i] = 0;
+    }
+    bitmap_zero(env->vtcm_log.mask, MAX_VEC_SIZE_BYTES);
+    env->vtcm_log.op = false;
+    env->vtcm_log.op_size = 0;
+}
diff --git a/target/hexagon/meson.build b/target/hexagon/meson.build
index c6d858f..0bfaa41 100644
--- a/target/hexagon/meson.build
+++ b/target/hexagon/meson.build
@@ -174,6 +174,7 @@ hexagon_ss.add(files(
     'printinsn.c',
     'arch.c',
     'fma_emu.c',
+    'mmvec/system_ext_mmvec.c',
 ))
 
 target_arch += {'hexagon': hexagon_ss}
-- 
2.7.4
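
Note on mem_gather_store above: it queues the temporary vector as a pending vector store and replaces the store's byte mask with the gather log's mask, so bytes whose gather was dropped are never written. The sketch below models that in Python together with a per-byte commit loop of the kind used for vector stores elsewhere in this series. It is an illustration under stated assumptions, not the QEMU code: StoreLog, gather_store, commit, and the dict used as memory are all hypothetical, and the mask is a list of booleans rather than a bitmap.

```python
VEC_BYTES = 128  # sizeof(MMVector) for 128-byte HVX vectors


class StoreLog:
    """Hypothetical stand-in for one pending vector store slot."""

    def __init__(self):
        self.va = 0
        self.size = 0
        self.data = bytearray(VEC_BYTES)
        self.mask = [False] * VEC_BYTES  # one flag per byte of the vector


def gather_store(log, vaddr, tmp_vreg, gather_mask):
    """Queue tmp_vreg for storing, keeping only the gathered bytes."""
    log.va = vaddr
    log.size = VEC_BYTES
    log.data[:] = tmp_vreg
    # Copying the gather mask drops bytes whose gather did not happen.
    log.mask = list(gather_mask)


def commit(log, memory):
    """Write only the bytes whose mask flag is set (memory is a dict)."""
    for j in range(log.size):
        if log.mask[j]:
            memory[log.va + j] = log.data[j]
```

With a mask that selects every other byte, only half of the 128 bytes reach memory.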



* [PATCH v4 11/30] Hexagon HVX (target/hexagon) helper functions
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (9 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 10/30] Hexagon HVX (target/hexagon) instruction utility functions Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-29 18:58   ` Richard Henderson
  2021-10-12 10:10 ` [PATCH v4 12/30] Hexagon HVX (target/hexagon) TCG generation Taylor Simpson
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Probe and commit vector stores (masked and scatter/gather)
Log vector register writes
Add the execution counters to the debug log
Histogram instructions

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/helper.h    |  16 +++
 target/hexagon/op_helper.c | 282 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 296 insertions(+), 2 deletions(-)

diff --git a/target/hexagon/helper.h b/target/hexagon/helper.h
index 89de2a3..c89aa4e 100644
--- a/target/hexagon/helper.h
+++ b/target/hexagon/helper.h
@@ -23,6 +23,8 @@ DEF_HELPER_1(debug_start_packet, void, env)
 DEF_HELPER_FLAGS_3(debug_check_store_width, TCG_CALL_NO_WG, void, env, int, int)
 DEF_HELPER_FLAGS_3(debug_commit_end, TCG_CALL_NO_WG, void, env, int, int)
 DEF_HELPER_2(commit_store, void, env, int)
+DEF_HELPER_3(gather_store, void, env, i32, int)
+DEF_HELPER_1(commit_hvx_stores, void, env)
 DEF_HELPER_FLAGS_4(fcircadd, TCG_CALL_NO_RWG_SE, s32, s32, s32, s32, s32)
 DEF_HELPER_FLAGS_1(fbrev, TCG_CALL_NO_RWG_SE, i32, i32)
 DEF_HELPER_3(sfrecipa, i64, env, f32, f32)
@@ -90,4 +92,18 @@ DEF_HELPER_4(sffms_lib, f32, env, f32, f32, f32)
 DEF_HELPER_3(dfmpyfix, f64, env, f64, f64)
 DEF_HELPER_4(dfmpyhh, f64, env, f64, f64, f64)
 
+/* Histogram instructions */
+DEF_HELPER_1(vhist, void, env)
+DEF_HELPER_1(vhistq, void, env)
+DEF_HELPER_1(vwhist256, void, env)
+DEF_HELPER_1(vwhist256q, void, env)
+DEF_HELPER_1(vwhist256_sat, void, env)
+DEF_HELPER_1(vwhist256q_sat, void, env)
+DEF_HELPER_1(vwhist128, void, env)
+DEF_HELPER_1(vwhist128q, void, env)
+DEF_HELPER_2(vwhist128m, void, env, s32)
+DEF_HELPER_2(vwhist128qm, void, env, s32)
+
 DEF_HELPER_2(probe_pkt_scalar_store_s0, void, env, int)
+DEF_HELPER_2(probe_hvx_stores, void, env, int)
+DEF_HELPER_3(probe_pkt_scalar_hvx_stores, void, env, int, int)
diff --git a/target/hexagon/op_helper.c b/target/hexagon/op_helper.c
index af32de4..a67a148 100644
--- a/target/hexagon/op_helper.c
+++ b/target/hexagon/op_helper.c
@@ -27,6 +27,8 @@
 #include "arch.h"
 #include "hex_arch_types.h"
 #include "fma_emu.h"
+#include "mmvec/mmvec.h"
+#include "mmvec/macros.h"
 
 #define SF_BIAS        127
 #define SF_MANTBITS    23
@@ -164,6 +166,57 @@ void HELPER(commit_store)(CPUHexagonState *env, int slot_num)
     }
 }
 
+void HELPER(gather_store)(CPUHexagonState *env, uint32_t addr, int slot)
+{
+    mem_gather_store(env, addr, slot);
+}
+
+void HELPER(commit_hvx_stores)(CPUHexagonState *env)
+{
+    uintptr_t ra = GETPC();
+    int i;
+
+    /* Normal (possibly masked) vector store */
+    for (i = 0; i < VSTORES_MAX; i++) {
+        if (env->vstore_pending[i]) {
+            env->vstore_pending[i] = 0;
+            target_ulong va = env->vstore[i].va;
+            int size = env->vstore[i].size;
+            for (int j = 0; j < size; j++) {
+                if (test_bit(j, env->vstore[i].mask)) {
+                    cpu_stb_data_ra(env, va + j, env->vstore[i].data.ub[j], ra);
+                }
+            }
+        }
+    }
+
+    /* Scatter store */
+    if (env->vtcm_pending) {
+        env->vtcm_pending = false;
+        if (env->vtcm_log.op) {
+            /* Need to perform the scatter read/modify/write at commit time */
+            if (env->vtcm_log.op_size == 2) {
+                SCATTER_OP_WRITE_TO_MEM(uint16_t);
+            } else if (env->vtcm_log.op_size == 4) {
+                /* Word Scatter += */
+                SCATTER_OP_WRITE_TO_MEM(uint32_t);
+            } else {
+                g_assert_not_reached();
+            }
+        } else {
+            for (i = 0; i < env->vtcm_log.size; i++) {
+                if (test_bit(i, env->vtcm_log.mask)) {
+                    cpu_stb_data_ra(env, env->vtcm_log.va[i],
+                                    env->vtcm_log.data.ub[i], ra);
+                    clear_bit(i, env->vtcm_log.mask);
+                    env->vtcm_log.data.ub[i] = 0;
+                }
+
+            }
+        }
+    }
+}
+
 static void print_store(CPUHexagonState *env, int slot)
 {
     if (!(env->slot_cancelled & (1 << slot))) {
@@ -242,9 +295,10 @@ void HELPER(debug_commit_end)(CPUHexagonState *env, int has_st0, int has_st1)
     HEX_DEBUG_LOG("Next PC = " TARGET_FMT_lx "\n", env->next_PC);
     HEX_DEBUG_LOG("Exec counters: pkt = " TARGET_FMT_lx
                   ", insn = " TARGET_FMT_lx
-                  "\n",
+                  ", hvx = " TARGET_FMT_lx "\n",
                   env->gpr[HEX_REG_QEMU_PKT_CNT],
-                  env->gpr[HEX_REG_QEMU_INSN_CNT]);
+                  env->gpr[HEX_REG_QEMU_INSN_CNT],
+                  env->gpr[HEX_REG_QEMU_HVX_CNT]);
 
 }
 
@@ -393,6 +447,65 @@ void HELPER(probe_pkt_scalar_store_s0)(CPUHexagonState *env, int mmu_idx)
     probe_store(env, 0, mmu_idx);
 }
 
+void HELPER(probe_hvx_stores)(CPUHexagonState *env, int mmu_idx)
+{
+    uintptr_t retaddr = GETPC();
+    int i;
+
+    /* Normal (possibly masked) vector store */
+    for (i = 0; i < VSTORES_MAX; i++) {
+        if (env->vstore_pending[i]) {
+            target_ulong va = env->vstore[i].va;
+            int size = env->vstore[i].size;
+            for (int j = 0; j < size; j++) {
+                if (test_bit(j, env->vstore[i].mask)) {
+                    probe_write(env, va + j, 1, mmu_idx, retaddr);
+                }
+            }
+        }
+    }
+
+    /* Scatter store */
+    if (env->vtcm_pending) {
+        if (env->vtcm_log.op) {
+            /* Need to perform the scatter read/modify/write at commit time */
+            if (env->vtcm_log.op_size == 2) {
+                SCATTER_OP_PROBE_MEM(size2u_t, mmu_idx, retaddr);
+            } else if (env->vtcm_log.op_size == 4) {
+                /* Word Scatter += */
+                SCATTER_OP_PROBE_MEM(size4u_t, mmu_idx, retaddr);
+            } else {
+                g_assert_not_reached();
+            }
+        } else {
+            for (i = 0; i < env->vtcm_log.size; i++) {
+                if (test_bit(i, env->vtcm_log.mask)) {
+                    probe_write(env, env->vtcm_log.va[i], 1, mmu_idx, retaddr);
+                }
+
+            }
+        }
+    }
+}
+
+void HELPER(probe_pkt_scalar_hvx_stores)(CPUHexagonState *env, int mask,
+                                         int mmu_idx)
+{
+    bool has_st0        = (mask >> 0) & 1;
+    bool has_st1        = (mask >> 1) & 1;
+    bool has_hvx_stores = (mask >> 2) & 1;
+
+    if (has_st0) {
+        probe_store(env, 0, mmu_idx);
+    }
+    if (has_st1) {
+        probe_store(env, 1, mmu_idx);
+    }
+    if (has_hvx_stores) {
+        HELPER(probe_hvx_stores)(env, mmu_idx);
+    }
+}
+
 /*
  * mem_noshuf
  * Section 5.5 of the Hexagon V67 Programmer's Reference Manual
@@ -1181,6 +1294,171 @@ float64 HELPER(dfmpyhh)(CPUHexagonState *env, float64 RxxV,
     return RxxV;
 }
 
+/* Histogram instructions */
+
+void HELPER(vhist)(CPUHexagonState *env)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int lane = 0; lane < 8; lane++) {
+        for (int i = 0; i < sizeof(MMVector) / 8; ++i) {
+            unsigned char value = input->ub[(sizeof(MMVector) / 8) * lane + i];
+            unsigned char regno = value >> 3;
+            unsigned char element = value & 7;
+
+            env->VRegs[regno].uh[(sizeof(MMVector) / 16) * lane + element]++;
+        }
+    }
+}
+
+void HELPER(vhistq)(CPUHexagonState *env)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int lane = 0; lane < 8; lane++) {
+        for (int i = 0; i < sizeof(MMVector) / 8; ++i) {
+            unsigned char value = input->ub[(sizeof(MMVector) / 8) * lane + i];
+            unsigned char regno = value >> 3;
+            unsigned char element = value & 7;
+
+            if (fGETQBIT(env->qtmp, sizeof(MMVector) / 8 * lane + i)) {
+                env->VRegs[regno].uh[
+                    (sizeof(MMVector) / 16) * lane + element]++;
+            }
+        }
+    }
+}
+
+void HELPER(vwhist256)(CPUHexagonState *env)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int i = 0; i < (sizeof(MMVector) / 2); i++) {
+        unsigned int bucket = fGETUBYTE(0, input->h[i]);
+        unsigned int weight = fGETUBYTE(1, input->h[i]);
+        unsigned int vindex = (bucket >> 3) & 0x1F;
+        unsigned int elindex = ((i >> 0) & (~7)) | ((bucket >> 0) & 7);
+
+        env->VRegs[vindex].uh[elindex] =
+            env->VRegs[vindex].uh[elindex] + weight;
+    }
+}
+
+void HELPER(vwhist256q)(CPUHexagonState *env)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int i = 0; i < (sizeof(MMVector) / 2); i++) {
+        unsigned int bucket = fGETUBYTE(0, input->h[i]);
+        unsigned int weight = fGETUBYTE(1, input->h[i]);
+        unsigned int vindex = (bucket >> 3) & 0x1F;
+        unsigned int elindex = ((i >> 0) & (~7)) | ((bucket >> 0) & 7);
+
+        if (fGETQBIT(env->qtmp, 2 * i)) {
+            env->VRegs[vindex].uh[elindex] =
+                env->VRegs[vindex].uh[elindex] + weight;
+        }
+    }
+}
+
+void HELPER(vwhist256_sat)(CPUHexagonState *env)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int i = 0; i < (sizeof(MMVector) / 2); i++) {
+        unsigned int bucket = fGETUBYTE(0, input->h[i]);
+        unsigned int weight = fGETUBYTE(1, input->h[i]);
+        unsigned int vindex = (bucket >> 3) & 0x1F;
+        unsigned int elindex = ((i >> 0) & (~7)) | ((bucket >> 0) & 7);
+
+        env->VRegs[vindex].uh[elindex] =
+            fVSATUH(env->VRegs[vindex].uh[elindex] + weight);
+    }
+}
+
+void HELPER(vwhist256q_sat)(CPUHexagonState *env)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int i = 0; i < (sizeof(MMVector) / 2); i++) {
+        unsigned int bucket = fGETUBYTE(0, input->h[i]);
+        unsigned int weight = fGETUBYTE(1, input->h[i]);
+        unsigned int vindex = (bucket >> 3) & 0x1F;
+        unsigned int elindex = ((i >> 0) & (~7)) | ((bucket >> 0) & 7);
+
+        if (fGETQBIT(env->qtmp, 2 * i)) {
+            env->VRegs[vindex].uh[elindex] =
+                fVSATUH(env->VRegs[vindex].uh[elindex] + weight);
+        }
+    }
+}
+
+void HELPER(vwhist128)(CPUHexagonState *env)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int i = 0; i < (sizeof(MMVector) / 2); i++) {
+        unsigned int bucket = fGETUBYTE(0, input->h[i]);
+        unsigned int weight = fGETUBYTE(1, input->h[i]);
+        unsigned int vindex = (bucket >> 3) & 0x1F;
+        unsigned int elindex = ((i >> 1) & (~3)) | ((bucket >> 1) & 3);
+
+        env->VRegs[vindex].uw[elindex] =
+            env->VRegs[vindex].uw[elindex] + weight;
+    }
+}
+
+void HELPER(vwhist128q)(CPUHexagonState *env)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int i = 0; i < (sizeof(MMVector) / 2); i++) {
+        unsigned int bucket = fGETUBYTE(0, input->h[i]);
+        unsigned int weight = fGETUBYTE(1, input->h[i]);
+        unsigned int vindex = (bucket >> 3) & 0x1F;
+        unsigned int elindex = ((i >> 1) & (~3)) | ((bucket >> 1) & 3);
+
+        if (fGETQBIT(env->qtmp, 2 * i)) {
+            env->VRegs[vindex].uw[elindex] =
+                env->VRegs[vindex].uw[elindex] + weight;
+        }
+    }
+}
+
+void HELPER(vwhist128m)(CPUHexagonState *env, int32_t uiV)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int i = 0; i < (sizeof(MMVector) / 2); i++) {
+        unsigned int bucket = fGETUBYTE(0, input->h[i]);
+        unsigned int weight = fGETUBYTE(1, input->h[i]);
+        unsigned int vindex = (bucket >> 3) & 0x1F;
+        unsigned int elindex = ((i >> 1) & (~3)) | ((bucket >> 1) & 3);
+
+        if ((bucket & 1) == uiV) {
+            env->VRegs[vindex].uw[elindex] =
+                env->VRegs[vindex].uw[elindex] + weight;
+        }
+    }
+}
+
+void HELPER(vwhist128qm)(CPUHexagonState *env, int32_t uiV)
+{
+    MMVector *input = &env->tmp_VRegs[0];
+
+    for (int i = 0; i < (sizeof(MMVector) / 2); i++) {
+        unsigned int bucket = fGETUBYTE(0, input->h[i]);
+        unsigned int weight = fGETUBYTE(1, input->h[i]);
+        unsigned int vindex = (bucket >> 3) & 0x1F;
+        unsigned int elindex = ((i >> 1) & (~3)) | ((bucket >> 1) & 3);
+
+        if (((bucket & 1) == uiV) && fGETQBIT(env->qtmp, 2 * i)) {
+            env->VRegs[vindex].uw[elindex] =
+                env->VRegs[vindex].uw[elindex] + weight;
+        }
+    }
+}
+
 static void cancel_slot(CPUHexagonState *env, uint32_t slot)
 {
     HEX_DEBUG_LOG("Slot %d cancelled\n", slot);
-- 
2.7.4
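
Reader's note on the vwhist helpers above: the index arithmetic can be looked at in isolation. The following is a standalone sketch, not QEMU code. The HistTarget type and the whist256_target/whist128_target names are hypothetical, and a 128-byte vector (64 halfwords) is assumed; the shifts and masks are copied from the helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: where one input halfword lands. */
typedef struct {
    unsigned vindex;   /* destination vector register, 0..31 */
    unsigned elindex;  /* element index inside that register */
    unsigned weight;   /* amount added to the element */
} HistTarget;

/* Same arithmetic as the vwhist256 helpers (halfword counters). */
static HistTarget whist256_target(unsigned i, uint16_t halfword)
{
    unsigned bucket = halfword & 0xff;           /* byte 0 */
    HistTarget t;

    assert(i < 64);                              /* assumed 64 halfwords */
    t.weight = (halfword >> 8) & 0xff;           /* byte 1 */
    t.vindex = (bucket >> 3) & 0x1f;
    t.elindex = (i & ~7u) | (bucket & 7);
    return t;
}

/* Same arithmetic as the vwhist128 helpers (word counters). */
static HistTarget whist128_target(unsigned i, uint16_t halfword)
{
    unsigned bucket = halfword & 0xff;
    HistTarget t;

    assert(i < 64);
    t.weight = (halfword >> 8) & 0xff;
    t.vindex = (bucket >> 3) & 0x1f;
    t.elindex = ((i >> 1) & ~3u) | ((bucket >> 1) & 3);
    return t;
}
```

In the 128 variants the low bit of the bucket does not enter elindex; the vwhist128m/vwhist128qm helpers compare that bit against uiV instead.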



* [PATCH v4 12/30] Hexagon HVX (target/hexagon) TCG generation
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (10 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 11/30] Hexagon HVX (target/hexagon) helper functions Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-29 18:59   ` Richard Henderson
  2021-10-12 10:10 ` [PATCH v4 13/30] Hexagon HVX (target/hexagon) helper overrides infrastructure Taylor Simpson
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
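Reader's note, not part of the commit: ctx_future_vreg_off and ctx_tmp_vreg_off in this patch share one lookup-or-allocate pattern over a small per-packet table. A minimal standalone sketch of that pattern follows. SlotTable, slot_lookup and TEMPS_MAX are hypothetical names, the capacity of 4 is an assumption, and the sketch returns a slot index where the patch returns an offsetof() into CPUHexagonState. It also checks capacity before writing, whereas the patch asserts afterwards.

```c
#include <assert.h>
#include <stdbool.h>

#define TEMPS_MAX 4   /* assumed capacity, stands in for VECTOR_TEMPS_MAX */

typedef struct {
    int idx;               /* number of slots currently in use */
    int num[TEMPS_MAX];    /* register number held by each slot */
} SlotTable;

/*
 * Return the slot holding regnum.  If it is not present and alloc_ok is
 * true, claim `count` consecutive slots for regnum, regnum + 1, ...
 * and return the first of them.
 */
static int slot_lookup(SlotTable *t, int regnum, int count, bool alloc_ok)
{
    /* See if it is already allocated */
    for (int i = 0; i < t->idx; i++) {
        if (t->num[i] == regnum) {
            return i;
        }
    }

    assert(alloc_ok);
    assert(t->idx + count <= TEMPS_MAX);
    int first = t->idx;
    for (int i = 0; i < count; i++) {
        t->num[first + i] = regnum++;
    }
    t->idx += count;
    return first;
}
```

Resetting idx to 0 at the start of each packet, as gen_start_packet does for future_vregs_idx and tmp_vregs_idx, empties the table without touching the array.
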
 target/hexagon/translate.h |  61 ++++++++++++
 target/hexagon/genptr.c    |  15 +++
 target/hexagon/translate.c | 243 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 315 insertions(+), 4 deletions(-)

diff --git a/target/hexagon/translate.h b/target/hexagon/translate.h
index 703fd13..fccfb94 100644
--- a/target/hexagon/translate.h
+++ b/target/hexagon/translate.h
@@ -29,6 +29,7 @@ typedef struct DisasContext {
     uint32_t mem_idx;
     uint32_t num_packets;
     uint32_t num_insns;
+    uint32_t num_hvx_insns;
     int reg_log[REG_WRITES_MAX];
     int reg_log_idx;
     DECLARE_BITMAP(regs_written, TOTAL_PER_THREAD_REGS);
@@ -37,6 +38,20 @@ typedef struct DisasContext {
     DECLARE_BITMAP(pregs_written, NUM_PREGS);
     uint8_t store_width[STORES_MAX];
     bool s1_store_processed;
+    int future_vregs_idx;
+    int future_vregs_num[VECTOR_TEMPS_MAX];
+    int tmp_vregs_idx;
+    int tmp_vregs_num[VECTOR_TEMPS_MAX];
+    int vreg_log[NUM_VREGS];
+    bool vreg_is_predicated[NUM_VREGS];
+    int vreg_log_idx;
+    DECLARE_BITMAP(vregs_updated_tmp, NUM_VREGS);
+    DECLARE_BITMAP(vregs_updated, NUM_VREGS);
+    DECLARE_BITMAP(vregs_select, NUM_VREGS);
+    int qreg_log[NUM_QREGS];
+    bool qreg_is_predicated[NUM_QREGS];
+    int qreg_log_idx;
+    bool pre_commit;
 } DisasContext;
 
 static inline void ctx_log_reg_write(DisasContext *ctx, int rnum)
@@ -67,6 +82,46 @@ static inline bool is_preloaded(DisasContext *ctx, int num)
     return test_bit(num, ctx->regs_written);
 }
 
+intptr_t ctx_future_vreg_off(DisasContext *ctx, int regnum,
+                             int num, bool alloc_ok);
+intptr_t ctx_tmp_vreg_off(DisasContext *ctx, int regnum,
+                          int num, bool alloc_ok);
+
+static inline void ctx_log_vreg_write(DisasContext *ctx,
+                                      int rnum, VRegWriteType type,
+                                      bool is_predicated)
+{
+    if (type != EXT_TMP) {
+        ctx->vreg_log[ctx->vreg_log_idx] = rnum;
+        ctx->vreg_is_predicated[ctx->vreg_log_idx] = is_predicated;
+        ctx->vreg_log_idx++;
+
+        set_bit(rnum, ctx->vregs_updated);
+    }
+    if (type == EXT_NEW) {
+        set_bit(rnum, ctx->vregs_select);
+    }
+    if (type == EXT_TMP) {
+        set_bit(rnum, ctx->vregs_updated_tmp);
+    }
+}
+
+static inline void ctx_log_vreg_write_pair(DisasContext *ctx,
+                                           int rnum, VRegWriteType type,
+                                           bool is_predicated)
+{
+    ctx_log_vreg_write(ctx, rnum ^ 0, type, is_predicated);
+    ctx_log_vreg_write(ctx, rnum ^ 1, type, is_predicated);
+}
+
+static inline void ctx_log_qreg_write(DisasContext *ctx,
+                                      int rnum, bool is_predicated)
+{
+    ctx->qreg_log[ctx->qreg_log_idx] = rnum;
+    ctx->qreg_is_predicated[ctx->qreg_log_idx] = is_predicated;
+    ctx->qreg_log_idx++;
+}
+
 extern TCGv hex_gpr[TOTAL_PER_THREAD_REGS];
 extern TCGv hex_pred[NUM_PREGS];
 extern TCGv hex_next_PC;
@@ -85,6 +140,12 @@ extern TCGv hex_dczero_addr;
 extern TCGv hex_llsc_addr;
 extern TCGv hex_llsc_val;
 extern TCGv_i64 hex_llsc_val_i64;
+extern TCGv hex_VRegs_updated;
+extern TCGv hex_QRegs_updated;
+extern TCGv hex_vstore_addr[VSTORES_MAX];
+extern TCGv hex_vstore_size[VSTORES_MAX];
+extern TCGv hex_vstore_pending[VSTORES_MAX];
 
+bool is_gather_store_insn(Insn *insn, Packet *pkt);
 void process_store(DisasContext *ctx, Packet *pkt, int slot_num);
 #endif
diff --git a/target/hexagon/genptr.c b/target/hexagon/genptr.c
index 4a21fa5..d16ff74 100644
--- a/target/hexagon/genptr.c
+++ b/target/hexagon/genptr.c
@@ -165,6 +165,9 @@ static inline void gen_read_ctrl_reg(DisasContext *ctx, const int reg_num,
     } else if (reg_num == HEX_REG_QEMU_INSN_CNT) {
         tcg_gen_addi_tl(dest, hex_gpr[HEX_REG_QEMU_INSN_CNT],
                         ctx->num_insns);
+    } else if (reg_num == HEX_REG_QEMU_HVX_CNT) {
+        tcg_gen_addi_tl(dest, hex_gpr[HEX_REG_QEMU_HVX_CNT],
+                        ctx->num_hvx_insns);
     } else {
         tcg_gen_mov_tl(dest, hex_gpr[reg_num]);
     }
@@ -191,6 +194,12 @@ static inline void gen_read_ctrl_reg_pair(DisasContext *ctx, const int reg_num,
         tcg_gen_concat_i32_i64(dest, pkt_cnt, insn_cnt);
         tcg_temp_free(pkt_cnt);
         tcg_temp_free(insn_cnt);
+    } else if (reg_num == HEX_REG_QEMU_HVX_CNT) {
+        TCGv hvx_cnt = tcg_temp_new();
+        tcg_gen_addi_tl(hvx_cnt, hex_gpr[HEX_REG_QEMU_HVX_CNT],
+                        ctx->num_hvx_insns);
+        tcg_gen_concat_i32_i64(dest, hvx_cnt, hex_gpr[reg_num + 1]);
+        tcg_temp_free(hvx_cnt);
     } else {
         tcg_gen_concat_i32_i64(dest,
             hex_gpr[reg_num],
@@ -226,6 +235,9 @@ static inline void gen_write_ctrl_reg(DisasContext *ctx, int reg_num,
         if (reg_num == HEX_REG_QEMU_INSN_CNT) {
             ctx->num_insns = 0;
         }
+        if (reg_num == HEX_REG_QEMU_HVX_CNT) {
+            ctx->num_hvx_insns = 0;
+        }
     }
 }
 
@@ -247,6 +259,9 @@ static inline void gen_write_ctrl_reg_pair(DisasContext *ctx, int reg_num,
             ctx->num_packets = 0;
             ctx->num_insns = 0;
         }
+        if (reg_num == HEX_REG_QEMU_HVX_CNT) {
+            ctx->num_hvx_insns = 0;
+        }
     }
 }
 
diff --git a/target/hexagon/translate.c b/target/hexagon/translate.c
index 4f05ce3..e33e39c 100644
--- a/target/hexagon/translate.c
+++ b/target/hexagon/translate.c
@@ -19,6 +19,7 @@
 #include "qemu/osdep.h"
 #include "cpu.h"
 #include "tcg/tcg-op.h"
+#include "tcg/tcg-op-gvec.h"
 #include "exec/cpu_ldst.h"
 #include "exec/log.h"
 #include "internal.h"
@@ -47,11 +48,60 @@ TCGv hex_dczero_addr;
 TCGv hex_llsc_addr;
 TCGv hex_llsc_val;
 TCGv_i64 hex_llsc_val_i64;
+TCGv hex_VRegs_updated;
+TCGv hex_QRegs_updated;
+TCGv hex_vstore_addr[VSTORES_MAX];
+TCGv hex_vstore_size[VSTORES_MAX];
+TCGv hex_vstore_pending[VSTORES_MAX];
 
 static const char * const hexagon_prednames[] = {
   "p0", "p1", "p2", "p3"
 };
 
+intptr_t ctx_future_vreg_off(DisasContext *ctx, int regnum,
+                             int num, bool alloc_ok)
+{
+    intptr_t offset;
+
+    /* See if it is already allocated */
+    for (int i = 0; i < ctx->future_vregs_idx; i++) {
+        if (ctx->future_vregs_num[i] == regnum) {
+            return offsetof(CPUHexagonState, future_VRegs[i]);
+        }
+    }
+
+    g_assert(alloc_ok);
+    offset = offsetof(CPUHexagonState, future_VRegs[ctx->future_vregs_idx]);
+    for (int i = 0; i < num; i++) {
+        ctx->future_vregs_num[ctx->future_vregs_idx + i] = regnum++;
+    }
+    ctx->future_vregs_idx += num;
+    g_assert(ctx->future_vregs_idx <= VECTOR_TEMPS_MAX);
+    return offset;
+}
+
+intptr_t ctx_tmp_vreg_off(DisasContext *ctx, int regnum,
+                          int num, bool alloc_ok)
+{
+    intptr_t offset;
+
+    /* See if it is already allocated */
+    for (int i = 0; i < ctx->tmp_vregs_idx; i++) {
+        if (ctx->tmp_vregs_num[i] == regnum) {
+            return offsetof(CPUHexagonState, tmp_VRegs[i]);
+        }
+    }
+
+    g_assert(alloc_ok);
+    offset = offsetof(CPUHexagonState, tmp_VRegs[ctx->tmp_vregs_idx]);
+    for (int i = 0; i < num; i++) {
+        ctx->tmp_vregs_num[ctx->tmp_vregs_idx + i] = regnum++;
+    }
+    ctx->tmp_vregs_idx += num;
+    g_assert(ctx->tmp_vregs_idx <= VECTOR_TEMPS_MAX);
+    return offset;
+}
+
 static void gen_exception_raw(int excp)
 {
     gen_helper_raise_exception(cpu_env, tcg_constant_i32(excp));
@@ -63,6 +113,8 @@ static void gen_exec_counters(DisasContext *ctx)
                     hex_gpr[HEX_REG_QEMU_PKT_CNT], ctx->num_packets);
     tcg_gen_addi_tl(hex_gpr[HEX_REG_QEMU_INSN_CNT],
                     hex_gpr[HEX_REG_QEMU_INSN_CNT], ctx->num_insns);
+    tcg_gen_addi_tl(hex_gpr[HEX_REG_QEMU_HVX_CNT],
+                    hex_gpr[HEX_REG_QEMU_HVX_CNT], ctx->num_hvx_insns);
 }
 
 static void gen_end_tb(DisasContext *ctx)
@@ -171,11 +223,19 @@ static void gen_start_packet(DisasContext *ctx, Packet *pkt)
     bitmap_zero(ctx->regs_written, TOTAL_PER_THREAD_REGS);
     ctx->preg_log_idx = 0;
     bitmap_zero(ctx->pregs_written, NUM_PREGS);
+    ctx->future_vregs_idx = 0;
+    ctx->tmp_vregs_idx = 0;
+    ctx->vreg_log_idx = 0;
+    bitmap_zero(ctx->vregs_updated_tmp, NUM_VREGS);
+    bitmap_zero(ctx->vregs_updated, NUM_VREGS);
+    bitmap_zero(ctx->vregs_select, NUM_VREGS);
+    ctx->qreg_log_idx = 0;
     for (i = 0; i < STORES_MAX; i++) {
         ctx->store_width[i] = 0;
     }
     tcg_gen_movi_tl(hex_pkt_has_store_s1, pkt->pkt_has_store_s1);
     ctx->s1_store_processed = false;
+    ctx->pre_commit = true;
 
     if (HEX_DEBUG) {
         /* Handy place to set a breakpoint before the packet executes */
@@ -197,6 +257,26 @@ static void gen_start_packet(DisasContext *ctx, Packet *pkt)
     if (need_pred_written(pkt)) {
         tcg_gen_movi_tl(hex_pred_written, 0);
     }
+
+    if (pkt->pkt_has_hvx) {
+        tcg_gen_movi_tl(hex_VRegs_updated, 0);
+        tcg_gen_movi_tl(hex_QRegs_updated, 0);
+    }
+}
+
+bool is_gather_store_insn(Insn *insn, Packet *pkt)
+{
+    if (GET_ATTRIB(insn->opcode, A_CVI_NEW) &&
+        insn->new_value_producer_slot == 1) {
+        /* Look for gather instruction */
+        for (int i = 0; i < pkt->num_insns; i++) {
+            Insn *in = &pkt->insn[i];
+            if (GET_ATTRIB(in->opcode, A_CVI_GATHER) && in->slot == 1) {
+                return true;
+            }
+        }
+    }
+    return false;
 }
 
 /*
@@ -445,10 +525,102 @@ static void process_dczeroa(DisasContext *ctx, Packet *pkt)
     }
 }
 
+static bool pkt_has_hvx_store(Packet *pkt)
+{
+    int i;
+    for (i = 0; i < pkt->num_insns; i++) {
+        int opcode = pkt->insn[i].opcode;
+        if (GET_ATTRIB(opcode, A_CVI) && GET_ATTRIB(opcode, A_STORE)) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static void gen_commit_hvx(DisasContext *ctx, Packet *pkt)
+{
+    int i;
+
+    /*
+     *    for (i = 0; i < ctx->vreg_log_idx; i++) {
+     *        int rnum = ctx->vreg_log[i];
+     *        if (ctx->vreg_is_predicated[i]) {
+     *            if (env->VRegs_updated & (1 << rnum)) {
+     *                env->VRegs[rnum] = env->future_VRegs[rnum];
+     *            }
+     *        } else {
+     *            env->VRegs[rnum] = env->future_VRegs[rnum];
+     *        }
+     *    }
+     */
+    for (i = 0; i < ctx->vreg_log_idx; i++) {
+        int rnum = ctx->vreg_log[i];
+        bool is_predicated = ctx->vreg_is_predicated[i];
+        intptr_t dstoff = offsetof(CPUHexagonState, VRegs[rnum]);
+        intptr_t srcoff = ctx_future_vreg_off(ctx, rnum, 1, false);
+        size_t size = sizeof(MMVector);
+
+        if (is_predicated) {
+            TCGv cmp = tcg_temp_local_new();
+            TCGLabel *label_skip = gen_new_label();
+
+            tcg_gen_andi_tl(cmp, hex_VRegs_updated, 1 << rnum);
+            tcg_gen_brcondi_tl(TCG_COND_EQ, cmp, 0, label_skip);
+            {
+                tcg_gen_gvec_mov(MO_64, dstoff, srcoff, size, size);
+            }
+            gen_set_label(label_skip);
+            tcg_temp_free(cmp);
+        } else {
+            tcg_gen_gvec_mov(MO_64, dstoff, srcoff, size, size);
+        }
+    }
+
+    /*
+     *    for (i = 0; i < ctx->qreg_log_idx; i++) {
+     *        int rnum = ctx->qreg_log[i];
+     *        if (ctx->qreg_is_predicated[i]) {
+     *            if (env->QRegs_updated & (1 << rnum)) {
+     *                env->QRegs[rnum] = env->future_QRegs[rnum];
+     *            }
+     *        } else {
+     *            env->QRegs[rnum] = env->future_QRegs[rnum];
+     *        }
+     *    }
+     */
+    for (i = 0; i < ctx->qreg_log_idx; i++) {
+        int rnum = ctx->qreg_log[i];
+        bool is_predicated = ctx->qreg_is_predicated[i];
+        intptr_t dstoff = offsetof(CPUHexagonState, QRegs[rnum]);
+        intptr_t srcoff = offsetof(CPUHexagonState, future_QRegs[rnum]);
+        size_t size = sizeof(MMQReg);
+
+        if (is_predicated) {
+            TCGv cmp = tcg_temp_local_new();
+            TCGLabel *label_skip = gen_new_label();
+
+            tcg_gen_andi_tl(cmp, hex_QRegs_updated, 1 << rnum);
+            tcg_gen_brcondi_tl(TCG_COND_EQ, cmp, 0, label_skip);
+            {
+                tcg_gen_gvec_mov(MO_64, dstoff, srcoff, size, size);
+            }
+            gen_set_label(label_skip);
+            tcg_temp_free(cmp);
+        } else {
+            tcg_gen_gvec_mov(MO_64, dstoff, srcoff, size, size);
+        }
+    }
+
+    if (pkt_has_hvx_store(pkt)) {
+        gen_helper_commit_hvx_stores(cpu_env);
+    }
+}
+
 static void update_exec_counters(DisasContext *ctx, Packet *pkt)
 {
     int num_insns = pkt->num_insns;
     int num_real_insns = 0;
+    int num_hvx_insns = 0;
 
     for (int i = 0; i < num_insns; i++) {
         if (!pkt->insn[i].is_endloop &&
@@ -456,13 +628,18 @@ static void update_exec_counters(DisasContext *ctx, Packet *pkt)
             !GET_ATTRIB(pkt->insn[i].opcode, A_IT_NOP)) {
             num_real_insns++;
         }
+        if (GET_ATTRIB(pkt->insn[i].opcode, A_CVI)) {
+            num_hvx_insns++;
+        }
     }
 
     ctx->num_packets++;
     ctx->num_insns += num_real_insns;
+    ctx->num_hvx_insns += num_hvx_insns;
 }
 
-static void gen_commit_packet(DisasContext *ctx, Packet *pkt)
+static void gen_commit_packet(CPUHexagonState *env, DisasContext *ctx,
+                              Packet *pkt)
 {
     /*
      * If there is more than one store in a packet, make sure they are all OK
@@ -471,6 +648,10 @@ static void gen_commit_packet(DisasContext *ctx, Packet *pkt)
      * dczeroa has to be the only store operation in the packet, so we go
      * ahead and process that first.
      *
+     * When there is an HVX store, there can also be a scalar store in either
+     * slot 0 or slot 1, so we create a mask for the helper to indicate what
+     * work to do.
+     *
      * When there are two scalar stores, we probe the one in slot 0.
      *
      * Note that we don't call the probe helper for packets with only one
@@ -479,13 +660,35 @@ static void gen_commit_packet(DisasContext *ctx, Packet *pkt)
      */
     bool has_store_s0 = pkt->pkt_has_store_s0;
     bool has_store_s1 = (pkt->pkt_has_store_s1 && !ctx->s1_store_processed);
+    bool has_hvx_store = pkt_has_hvx_store(pkt);
     if (pkt->pkt_has_dczeroa) {
         /*
          * The dczeroa will be the store in slot 0, check that we don't have
-         * a store in slot 1.
+         * a store in slot 1 or an HVX store.
          */
-        g_assert(has_store_s0 && !has_store_s1);
+        g_assert(has_store_s0 && !has_store_s1 && !has_hvx_store);
         process_dczeroa(ctx, pkt);
+    } else if (has_hvx_store) {
+        TCGv mem_idx = tcg_constant_tl(ctx->mem_idx);
+
+        if (!has_store_s0 && !has_store_s1) {
+            gen_helper_probe_hvx_stores(cpu_env, mem_idx);
+        } else {
+            int mask = 0;
+            TCGv mask_tcgv;
+
+            if (has_store_s0) {
+                mask |= (1 << 0);
+            }
+            if (has_store_s1) {
+                mask |= (1 << 1);
+            }
+            if (has_hvx_store) {
+                mask |= (1 << 2);
+            }
+            mask_tcgv = tcg_constant_tl(mask);
+            gen_helper_probe_pkt_scalar_hvx_stores(cpu_env, mask_tcgv, mem_idx);
+        }
     } else if (has_store_s0 && has_store_s1) {
         /*
          * process_store_log will execute the slot 1 store first,
@@ -500,6 +703,9 @@ static void gen_commit_packet(DisasContext *ctx, Packet *pkt)
 
     gen_reg_writes(ctx);
     gen_pred_writes(ctx, pkt);
+    if (pkt->pkt_has_hvx) {
+        gen_commit_hvx(ctx, pkt);
+    }
     update_exec_counters(ctx, pkt);
     if (HEX_DEBUG) {
         TCGv has_st0 =
@@ -511,6 +717,11 @@ static void gen_commit_packet(DisasContext *ctx, Packet *pkt)
         gen_helper_debug_commit_end(cpu_env, has_st0, has_st1);
     }
 
+    if (pkt->vhist_insn != NULL) {
+        ctx->pre_commit = false;
+        pkt->vhist_insn->generate(env, ctx, pkt->vhist_insn, pkt);
+    }
+
     if (pkt->pkt_has_cof) {
         gen_end_tb(ctx);
     }
@@ -535,7 +746,7 @@ static void decode_and_translate_packet(CPUHexagonState *env, DisasContext *ctx)
         for (i = 0; i < pkt.num_insns; i++) {
             gen_insn(env, ctx, &pkt.insn[i], &pkt);
         }
-        gen_commit_packet(ctx, &pkt);
+        gen_commit_packet(env, ctx, &pkt);
         ctx->base.pc_next += pkt.encod_pkt_size_in_bytes;
     } else {
         gen_exception_end_tb(ctx, HEX_EXCP_INVALID_PACKET);
@@ -550,6 +761,7 @@ static void hexagon_tr_init_disas_context(DisasContextBase *dcbase,
     ctx->mem_idx = MMU_USER_IDX;
     ctx->num_packets = 0;
     ctx->num_insns = 0;
+    ctx->num_hvx_insns = 0;
 }
 
 static void hexagon_tr_tb_start(DisasContextBase *db, CPUState *cpu)
@@ -658,6 +870,9 @@ static char store_addr_names[STORES_MAX][NAME_LEN];
 static char store_width_names[STORES_MAX][NAME_LEN];
 static char store_val32_names[STORES_MAX][NAME_LEN];
 static char store_val64_names[STORES_MAX][NAME_LEN];
+static char vstore_addr_names[VSTORES_MAX][NAME_LEN];
+static char vstore_size_names[VSTORES_MAX][NAME_LEN];
+static char vstore_pending_names[VSTORES_MAX][NAME_LEN];
 
 void hexagon_translate_init(void)
 {
@@ -720,6 +935,10 @@ void hexagon_translate_init(void)
         offsetof(CPUHexagonState, llsc_val), "llsc_val");
     hex_llsc_val_i64 = tcg_global_mem_new_i64(cpu_env,
         offsetof(CPUHexagonState, llsc_val_i64), "llsc_val_i64");
+    hex_VRegs_updated = tcg_global_mem_new(cpu_env,
+        offsetof(CPUHexagonState, VRegs_updated), "VRegs_updated");
+    hex_QRegs_updated = tcg_global_mem_new(cpu_env,
+        offsetof(CPUHexagonState, QRegs_updated), "QRegs_updated");
     for (i = 0; i < STORES_MAX; i++) {
         snprintf(store_addr_names[i], NAME_LEN, "store_addr_%d", i);
         hex_store_addr[i] = tcg_global_mem_new(cpu_env,
@@ -741,4 +960,20 @@ void hexagon_translate_init(void)
             offsetof(CPUHexagonState, mem_log_stores[i].data64),
             store_val64_names[i]);
     }
+    for (int i = 0; i < VSTORES_MAX; i++) {
+        snprintf(vstore_addr_names[i], NAME_LEN, "vstore_addr_%d", i);
+        hex_vstore_addr[i] = tcg_global_mem_new(cpu_env,
+            offsetof(CPUHexagonState, vstore[i].va),
+            vstore_addr_names[i]);
+
+        snprintf(vstore_size_names[i], NAME_LEN, "vstore_size_%d", i);
+        hex_vstore_size[i] = tcg_global_mem_new(cpu_env,
+            offsetof(CPUHexagonState, vstore[i].size),
+            vstore_size_names[i]);
+
+        snprintf(vstore_pending_names[i], NAME_LEN, "vstore_pending_%d", i);
+        hex_vstore_pending[i] = tcg_global_mem_new(cpu_env,
+            offsetof(CPUHexagonState, vstore_pending[i]),
+            vstore_pending_names[i]);
+    }
 }
-- 
2.7.4
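
Reader's note on gen_commit_hvx above: the pseudo-code in its comments can be run as an ordinary C model. This is a sketch under assumptions, not the generated TCG. VecState, LogEntry and commit_vregs are hypothetical names, 32 registers of 128 bytes are assumed, and the future copy is indexed by register number as in the comment, whereas the patch itself finds the source through ctx_future_vreg_off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NREGS 32    /* assumed number of vector registers */
#define VLEN  128   /* assumed vector length in bytes */

typedef struct {
    uint8_t regs[NREGS][VLEN];    /* architectural registers */
    uint8_t future[NREGS][VLEN];  /* values produced by the packet */
    uint32_t updated;             /* bit n: predicated write to reg n happened */
} VecState;

typedef struct {
    int rnum;
    bool is_predicated;
} LogEntry;

/* Copy each logged register from future to regs, honoring predication. */
static void commit_vregs(VecState *s, const LogEntry *log, int n)
{
    for (int i = 0; i < n; i++) {
        int rnum = log[i].rnum;

        assert(rnum >= 0 && rnum < NREGS);
        if (log[i].is_predicated &&
            !(s->updated & (UINT32_C(1) << rnum))) {
            continue;   /* predicate was false at run time: keep old value */
        }
        memcpy(s->regs[rnum], s->future[rnum], VLEN);
    }
}
```

The log and the is_predicated flag are known at translation time, while the updated bits are only known at run time, which is why the patch emits a branch for the predicated case only.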



* [PATCH v4 13/30] Hexagon HVX (target/hexagon) helper overrides infrastructure
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (11 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 12/30] Hexagon HVX (target/hexagon) TCG generation Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-29 16:48   ` Philippe Mathieu-Daudé
  2021-10-29 19:00   ` Richard Henderson
  2021-10-12 10:10 ` [PATCH v4 14/30] Hexagon HVX (target/hexagon) helper overrides for histogram instructions Taylor Simpson
                   ` (16 subsequent siblings)
  29 siblings, 2 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Build the infrastructure to create overrides for HVX instructions.
We create a new empty file (gen_tcg_hvx.h) that will be populated
in subsequent patches.

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
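Reader's note, not part of the commit: an override replaces the generic helper call for one instruction tag with hand-written code supplied as a macro. The diff only shows that the generator scripts now read a second override header; how they use it is not visible here. The sketch below is therefore an assumption-laden model that uses the C preprocessor to get the same effect. TAG_A, TAG_B, fGEN_TAG_A and the strings are all hypothetical.

```c
#include <assert.h>
#include <string.h>

/* An override header defines a macro only for the tags it handles. */
#define fGEN_TAG_A(out) strcpy((out), "inline override for A")

/* TAG_A has an override, so the macro body is used. */
static void generate_TAG_A(char *out)
{
#ifdef fGEN_TAG_A
    fGEN_TAG_A(out);
#else
    strcpy(out, "call helper for A");
#endif
}

/* TAG_B has no override, so the generic helper path is used. */
static void generate_TAG_B(char *out)
{
#ifdef fGEN_TAG_B
    fGEN_TAG_B(out);
#else
    strcpy(out, "call helper for B");
#endif
}
```

With the header still empty, as it is after this patch, every tag takes the helper path.
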
 target/hexagon/gen_tcg_hvx.h        | 21 +++++++++++++++++++++
 target/hexagon/genptr.c             |  1 +
 target/hexagon/gen_helper_funcs.py  |  3 ++-
 target/hexagon/gen_helper_protos.py |  3 ++-
 target/hexagon/gen_tcg_funcs.py     |  3 ++-
 target/hexagon/meson.build          | 13 +++++++------
 6 files changed, 35 insertions(+), 9 deletions(-)
 create mode 100644 target/hexagon/gen_tcg_hvx.h

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
new file mode 100644
index 0000000..b5c6cad
--- /dev/null
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -0,0 +1,21 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HEXAGON_GEN_TCG_HVX_H
+#define HEXAGON_GEN_TCG_HVX_H
+
+#endif
diff --git a/target/hexagon/genptr.c b/target/hexagon/genptr.c
index d16ff74..473438a 100644
--- a/target/hexagon/genptr.c
+++ b/target/hexagon/genptr.c
@@ -26,6 +26,7 @@
 #include "macros.h"
 #undef QEMU_GENERATE
 #include "gen_tcg.h"
+#include "gen_tcg_hvx.h"
 
 static inline void gen_log_predicated_reg_write(int rnum, TCGv val, int slot)
 {
diff --git a/target/hexagon/gen_helper_funcs.py b/target/hexagon/gen_helper_funcs.py
index ac5ce10..a446c45 100755
--- a/target/hexagon/gen_helper_funcs.py
+++ b/target/hexagon/gen_helper_funcs.py
@@ -286,11 +286,12 @@ def main():
     hex_common.read_semantics_file(sys.argv[1])
     hex_common.read_attribs_file(sys.argv[2])
     hex_common.read_overrides_file(sys.argv[3])
+    hex_common.read_overrides_file(sys.argv[4])
     hex_common.calculate_attribs()
     tagregs = hex_common.get_tagregs()
     tagimms = hex_common.get_tagimms()
 
-    with open(sys.argv[4], 'w') as f:
+    with open(sys.argv[5], 'w') as f:
         for tag in hex_common.tags:
             ## Skip the priv instructions
             if ( "A_PRIV" in hex_common.attribdict[tag] ) :
diff --git a/target/hexagon/gen_helper_protos.py b/target/hexagon/gen_helper_protos.py
index 229ef8d..3b4e993 100755
--- a/target/hexagon/gen_helper_protos.py
+++ b/target/hexagon/gen_helper_protos.py
@@ -135,11 +135,12 @@ def main():
     hex_common.read_semantics_file(sys.argv[1])
     hex_common.read_attribs_file(sys.argv[2])
     hex_common.read_overrides_file(sys.argv[3])
+    hex_common.read_overrides_file(sys.argv[4])
     hex_common.calculate_attribs()
     tagregs = hex_common.get_tagregs()
     tagimms = hex_common.get_tagimms()
 
-    with open(sys.argv[4], 'w') as f:
+    with open(sys.argv[5], 'w') as f:
         for tag in hex_common.tags:
             ## Skip the priv instructions
             if ( "A_PRIV" in hex_common.attribdict[tag] ) :
diff --git a/target/hexagon/gen_tcg_funcs.py b/target/hexagon/gen_tcg_funcs.py
index 48bcf89..5bee1ca 100755
--- a/target/hexagon/gen_tcg_funcs.py
+++ b/target/hexagon/gen_tcg_funcs.py
@@ -682,11 +682,12 @@ def main():
     hex_common.read_semantics_file(sys.argv[1])
     hex_common.read_attribs_file(sys.argv[2])
     hex_common.read_overrides_file(sys.argv[3])
+    hex_common.read_overrides_file(sys.argv[4])
     hex_common.calculate_attribs()
     tagregs = hex_common.get_tagregs()
     tagimms = hex_common.get_tagimms()
 
-    with open(sys.argv[4], 'w') as f:
+    with open(sys.argv[5], 'w') as f:
         f.write("#ifndef HEXAGON_TCG_FUNCS_H\n")
         f.write("#define HEXAGON_TCG_FUNCS_H\n\n")
 
diff --git a/target/hexagon/meson.build b/target/hexagon/meson.build
index 0bfaa41..a35eb28 100644
--- a/target/hexagon/meson.build
+++ b/target/hexagon/meson.build
@@ -20,6 +20,7 @@ hexagon_ss = ss.source_set()
 hex_common_py = 'hex_common.py'
 attribs_def = meson.current_source_dir() / 'attribs_def.h.inc'
 gen_tcg_h = meson.current_source_dir() / 'gen_tcg.h'
+gen_tcg_hvx_h = meson.current_source_dir() / 'gen_tcg_hvx.h'
 
 #
 #  Step 1
@@ -63,8 +64,8 @@ helper_protos_generated = custom_target(
     'helper_protos_generated.h.inc',
     output: 'helper_protos_generated.h.inc',
     depends: [semantics_generated],
-    depend_files: [hex_common_py, attribs_def, gen_tcg_h],
-    command: [python, files('gen_helper_protos.py'), semantics_generated, attribs_def, gen_tcg_h, '@OUTPUT@'],
+    depend_files: [hex_common_py, attribs_def, gen_tcg_h, gen_tcg_hvx_h],
+    command: [python, files('gen_helper_protos.py'), semantics_generated, attribs_def, gen_tcg_h, gen_tcg_hvx_h, '@OUTPUT@'],
 )
 hexagon_ss.add(helper_protos_generated)
 
@@ -72,8 +73,8 @@ tcg_funcs_generated = custom_target(
     'tcg_funcs_generated.c.inc',
     output: 'tcg_funcs_generated.c.inc',
     depends: [semantics_generated],
-    depend_files: [hex_common_py, attribs_def, gen_tcg_h],
-    command: [python, files('gen_tcg_funcs.py'), semantics_generated, attribs_def, gen_tcg_h, '@OUTPUT@'],
+    depend_files: [hex_common_py, attribs_def, gen_tcg_h, gen_tcg_hvx_h],
+    command: [python, files('gen_tcg_funcs.py'), semantics_generated, attribs_def, gen_tcg_h, gen_tcg_hvx_h, '@OUTPUT@'],
 )
 hexagon_ss.add(tcg_funcs_generated)
 
@@ -90,8 +91,8 @@ helper_funcs_generated = custom_target(
     'helper_funcs_generated.c.inc',
     output: 'helper_funcs_generated.c.inc',
     depends: [semantics_generated],
-    depend_files: [hex_common_py, attribs_def, gen_tcg_h],
-    command: [python, files('gen_helper_funcs.py'), semantics_generated, attribs_def, gen_tcg_h, '@OUTPUT@'],
+    depend_files: [hex_common_py, attribs_def, gen_tcg_h, gen_tcg_hvx_h],
+    command: [python, files('gen_helper_funcs.py'), semantics_generated, attribs_def, gen_tcg_h, gen_tcg_hvx_h, '@OUTPUT@'],
 )
 hexagon_ss.add(helper_funcs_generated)
 
-- 
2.7.4



* [PATCH v4 14/30] Hexagon HVX (target/hexagon) helper overrides for histogram instructions
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (12 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 13/30] Hexagon HVX (target/hexagon) helper overrides infrastructure Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-29 19:04   ` Richard Henderson
  2021-10-12 10:10 ` [PATCH v4 15/30] Hexagon HVX (target/hexagon) helper overrides - vector assign & cmov Taylor Simpson
                   ` (15 subsequent siblings)
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
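Reader's note, not part of the commit: the histogram overrides in this patch are generated twice per packet, with ctx->pre_commit selecting the phase. A minimal standalone model of that control flow follows. Ctx, emit, generate_vhistq and translate_packet are hypothetical names, and the trace string merely records the order of steps where the real code would emit TCG ops.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    bool pre_commit;
    char trace[128];   /* ordered record of emitted steps, illustration only */
} Ctx;

static void emit(Ctx *ctx, const char *step)
{
    assert(strlen(ctx->trace) + strlen(step) + 2 <= sizeof(ctx->trace));
    strcat(ctx->trace, step);
    strcat(ctx->trace, ";");
}

/* Model of a masked histogram override (compare fGEN_TCG_V6_vhistq). */
static void generate_vhistq(Ctx *ctx)
{
    if (ctx->pre_commit) {
        emit(ctx, "save-mask");     /* copy the Q register to qtmp */
    } else {
        emit(ctx, "call-helper");   /* operate on the committed registers */
    }
}

static void translate_packet(Ctx *ctx)
{
    ctx->trace[0] = '\0';
    ctx->pre_commit = true;
    generate_vhistq(ctx);           /* normal per-instruction pass */
    emit(ctx, "commit");            /* register writes become visible */
    ctx->pre_commit = false;
    generate_vhistq(ctx);           /* second call, after commit */
}
```

For the unmasked variants the first call emits nothing, so only the commit and the helper call would appear in the trace.
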
 target/hexagon/gen_tcg_hvx.h | 106 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index b5c6cad..a560504 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -18,4 +18,110 @@
 #ifndef HEXAGON_GEN_TCG_HVX_H
 #define HEXAGON_GEN_TCG_HVX_H
 
+/*
+ * Histogram instructions
+ *
+ * Note that these instructions operate directly on the vector registers
+ * and therefore happen after commit.
+ *
+ * The generate_<tag> function is called twice
+ *     The first time is during the normal TCG generation
+ *         ctx->pre_commit is true
+ *         In the masked cases, we save the mask to the qtmp temporary
+ *         Otherwise, there is nothing to do
+ *     The second call is at the end of gen_commit_packet
+ *         ctx->pre_commit is false
+ *         Generate the call to the helper
+ */
+
+static inline void assert_vhist_tmp(DisasContext *ctx)
+{
+    /* vhist instructions require exactly one .tmp to be defined */
+    g_assert(ctx->tmp_vregs_idx == 1);
+}
+
+#define fGEN_TCG_V6_vhist(SHORTCODE) \
+    if (!ctx->pre_commit) { \
+        assert_vhist_tmp(ctx); \
+        gen_helper_vhist(cpu_env); \
+    }
+#define fGEN_TCG_V6_vhistq(SHORTCODE) \
+    do { \
+        if (ctx->pre_commit) { \
+            intptr_t dstoff = offsetof(CPUHexagonState, qtmp); \
+            tcg_gen_gvec_mov(MO_64, dstoff, QvV_off, \
+                             sizeof(MMVector), sizeof(MMVector)); \
+        } else { \
+            assert_vhist_tmp(ctx); \
+            gen_helper_vhistq(cpu_env); \
+        } \
+    } while (0)
+#define fGEN_TCG_V6_vwhist256(SHORTCODE) \
+    if (!ctx->pre_commit) { \
+        assert_vhist_tmp(ctx); \
+        gen_helper_vwhist256(cpu_env); \
+    }
+#define fGEN_TCG_V6_vwhist256q(SHORTCODE) \
+    do { \
+        if (ctx->pre_commit) { \
+            intptr_t dstoff = offsetof(CPUHexagonState, qtmp); \
+            tcg_gen_gvec_mov(MO_64, dstoff, QvV_off, \
+                             sizeof(MMVector), sizeof(MMVector)); \
+        } else { \
+            assert_vhist_tmp(ctx); \
+            gen_helper_vwhist256q(cpu_env); \
+        } \
+    } while (0)
+#define fGEN_TCG_V6_vwhist256_sat(SHORTCODE) \
+    if (!ctx->pre_commit) { \
+        assert_vhist_tmp(ctx); \
+        gen_helper_vwhist256_sat(cpu_env); \
+    }
+#define fGEN_TCG_V6_vwhist256q_sat(SHORTCODE) \
+    do { \
+        if (ctx->pre_commit) { \
+            intptr_t dstoff = offsetof(CPUHexagonState, qtmp); \
+            tcg_gen_gvec_mov(MO_64, dstoff, QvV_off, \
+                             sizeof(MMVector), sizeof(MMVector)); \
+        } else { \
+            assert_vhist_tmp(ctx); \
+            gen_helper_vwhist256q_sat(cpu_env); \
+        } \
+    } while (0)
+#define fGEN_TCG_V6_vwhist128(SHORTCODE) \
+    if (!ctx->pre_commit) { \
+        assert_vhist_tmp(ctx); \
+        gen_helper_vwhist128(cpu_env); \
+    }
+#define fGEN_TCG_V6_vwhist128q(SHORTCODE) \
+    do { \
+        if (ctx->pre_commit) { \
+            intptr_t dstoff = offsetof(CPUHexagonState, qtmp); \
+            tcg_gen_gvec_mov(MO_64, dstoff, QvV_off, \
+                             sizeof(MMVector), sizeof(MMVector)); \
+        } else { \
+            assert_vhist_tmp(ctx); \
+            gen_helper_vwhist128q(cpu_env); \
+        } \
+    } while (0)
+#define fGEN_TCG_V6_vwhist128m(SHORTCODE) \
+    if (!ctx->pre_commit) { \
+        TCGv tcgv_uiV = tcg_constant_tl(uiV); \
+        assert_vhist_tmp(ctx); \
+        gen_helper_vwhist128m(cpu_env, tcgv_uiV); \
+    }
+#define fGEN_TCG_V6_vwhist128qm(SHORTCODE) \
+    do { \
+        if (ctx->pre_commit) { \
+            intptr_t dstoff = offsetof(CPUHexagonState, qtmp); \
+            tcg_gen_gvec_mov(MO_64, dstoff, QvV_off, \
+                             sizeof(MMVector), sizeof(MMVector)); \
+        } else { \
+            TCGv tcgv_uiV = tcg_constant_tl(uiV); \
+            assert_vhist_tmp(ctx); \
+            gen_helper_vwhist128qm(cpu_env, tcgv_uiV); \
+        } \
+    } while (0)
+
+
 #endif
-- 
2.7.4
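
A note for readers following the two-call scheme described in the comment at the top of this patch: the stand-alone sketch below models only the control flow, in plain C. It does not use the QEMU TCG API. ModelCtx, ModelOp, model_gen_vhistq and model_translate_packet are hypothetical names made up for illustration; only pre_commit and tmp_vregs_idx mirror fields named in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the QEMU definitions. */
enum ModelOp { OP_SAVE_MASK, OP_CALL_HELPER };

typedef struct {
    bool pre_commit;      /* true during normal generation, false at commit */
    int tmp_vregs_idx;    /* number of .tmp vector registers in the packet */
    enum ModelOp ops[4];  /* log of the operations that were "emitted" */
    int num_ops;
} ModelCtx;

/* Models the shape of fGEN_TCG_V6_vhistq: invoked once per phase. */
static void model_gen_vhistq(ModelCtx *ctx)
{
    assert(ctx->num_ops < 4);
    if (ctx->pre_commit) {
        /* First call: only save the mask (qtmp in the patch). */
        ctx->ops[ctx->num_ops++] = OP_SAVE_MASK;
    } else {
        /* Second call: exactly one .tmp must exist, then call the helper. */
        assert(ctx->tmp_vregs_idx == 1);
        ctx->ops[ctx->num_ops++] = OP_CALL_HELPER;
    }
}

/* Drives both phases in the order the comment describes. */
static void model_translate_packet(ModelCtx *ctx)
{
    ctx->pre_commit = true;
    model_gen_vhistq(ctx);    /* normal TCG generation */
    ctx->pre_commit = false;
    model_gen_vhistq(ctx);    /* end of gen_commit_packet */
}
```

In this model the unmasked forms would simply emit nothing in the first phase, which matches the "Otherwise, there is nothing to do" case in the comment.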



* [PATCH v4 15/30] Hexagon HVX (target/hexagon) helper overrides - vector assign & cmov
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (13 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 14/30] Hexagon HVX (target/hexagon) helper overrides for histogram instructions Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 16/30] Hexagon HVX (target/hexagon) helper overrides - vector add & sub Taylor Simpson
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index a560504..916230e 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -124,4 +124,35 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
     } while (0)
 
 
+#define fGEN_TCG_V6_vassign(SHORTCODE) \
+    tcg_gen_gvec_mov(MO_64, VdV_off, VuV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+/* Vector conditional move */
+#define fGEN_TCG_VEC_CMOV(PRED) \
+    do { \
+        TCGv lsb = tcg_temp_new(); \
+        TCGLabel *false_label = gen_new_label(); \
+        TCGLabel *end_label = gen_new_label(); \
+        tcg_gen_andi_tl(lsb, PsV, 1); \
+        tcg_gen_brcondi_tl(TCG_COND_NE, lsb, PRED, false_label); \
+        tcg_temp_free(lsb); \
+        tcg_gen_gvec_mov(MO_64, VdV_off, VuV_off, \
+                         sizeof(MMVector), sizeof(MMVector)); \
+        tcg_gen_br(end_label); \
+        gen_set_label(false_label); \
+        tcg_gen_ori_tl(hex_slot_cancelled, hex_slot_cancelled, \
+                       1 << insn->slot); \
+        gen_set_label(end_label); \
+    } while (0)
+
+
+/* Vector conditional move (true) */
+#define fGEN_TCG_V6_vcmov(SHORTCODE) \
+    fGEN_TCG_VEC_CMOV(1)
+
+/* Vector conditional move (false) */
+#define fGEN_TCG_V6_vncmov(SHORTCODE) \
+    fGEN_TCG_VEC_CMOV(0)
+
 #endif
-- 
2.7.4
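
As a reading aid for fGEN_TCG_VEC_CMOV, here is a minimal plain-C model of what the generated code appears to do at run time: test the least significant bit of the predicate, copy the vector when it matches the requested sense, and otherwise mark the slot as cancelled. ModelEnv and model_vec_cmov are hypothetical names, the 128-byte vector length is taken from the cover letter, and the four-slot bound is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_VEC_BYTES 128  /* vector length stated in the cover letter */

/* Hypothetical state; not CPUHexagonState. */
typedef struct {
    uint8_t vd[MODEL_VEC_BYTES];
    uint8_t vu[MODEL_VEC_BYTES];
    uint32_t slot_cancelled;
} ModelEnv;

/*
 * pred_sense mirrors the PRED macro argument:
 * 1 models vcmov, 0 models vncmov.
 */
static void model_vec_cmov(ModelEnv *env, uint32_t ps,
                           uint32_t pred_sense, int slot)
{
    uint32_t lsb = ps & 1;

    assert(slot >= 0 && slot < 4);  /* assumed slot range */
    if (lsb != pred_sense) {
        /* Predicate does not match: no copy, record the cancelled slot. */
        env->slot_cancelled |= 1u << slot;
    } else {
        memcpy(env->vd, env->vu, MODEL_VEC_BYTES);
    }
}
```

Only bit 0 of the predicate is consulted, so a predicate value of 0x2 behaves like "false" here.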



* [PATCH v4 16/30] Hexagon HVX (target/hexagon) helper overrides - vector add & sub
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (14 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 15/30] Hexagon HVX (target/hexagon) helper overrides - vector assign & cmov Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 17/30] Hexagon HVX (target/hexagon) helper overrides - vector shifts Taylor Simpson
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 50 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index 916230e..ac2143e 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -155,4 +155,54 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
 #define fGEN_TCG_V6_vncmov(SHORTCODE) \
     fGEN_TCG_VEC_CMOV(0)
 
+/* Vector add - various forms */
+#define fGEN_TCG_V6_vaddb(SHORTCODE) \
+    tcg_gen_gvec_add(MO_8, VdV_off, VuV_off, VvV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vaddh(SHORTCODE) \
+    tcg_gen_gvec_add(MO_16, VdV_off, VuV_off, VvV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vaddw(SHORTCODE) \
+    tcg_gen_gvec_add(MO_32, VdV_off, VuV_off, VvV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vaddb_dv(SHORTCODE) \
+    tcg_gen_gvec_add(MO_8, VddV_off, VuuV_off, VvvV_off, \
+                     sizeof(MMVector) * 2, sizeof(MMVector) * 2)
+
+#define fGEN_TCG_V6_vaddh_dv(SHORTCODE) \
+    tcg_gen_gvec_add(MO_16, VddV_off, VuuV_off, VvvV_off, \
+                     sizeof(MMVector) * 2, sizeof(MMVector) * 2)
+
+#define fGEN_TCG_V6_vaddw_dv(SHORTCODE) \
+    tcg_gen_gvec_add(MO_32, VddV_off, VuuV_off, VvvV_off, \
+                     sizeof(MMVector) * 2, sizeof(MMVector) * 2)
+
+/* Vector sub - various forms */
+#define fGEN_TCG_V6_vsubb(SHORTCODE) \
+    tcg_gen_gvec_sub(MO_8, VdV_off, VuV_off, VvV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vsubh(SHORTCODE) \
+    tcg_gen_gvec_sub(MO_16, VdV_off, VuV_off, VvV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vsubw(SHORTCODE) \
+    tcg_gen_gvec_sub(MO_32, VdV_off, VuV_off, VvV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vsubb_dv(SHORTCODE) \
+    tcg_gen_gvec_sub(MO_8, VddV_off, VuuV_off, VvvV_off, \
+                     sizeof(MMVector) * 2, sizeof(MMVector) * 2)
+
+#define fGEN_TCG_V6_vsubh_dv(SHORTCODE) \
+    tcg_gen_gvec_sub(MO_16, VddV_off, VuuV_off, VvvV_off, \
+                     sizeof(MMVector) * 2, sizeof(MMVector) * 2)
+
+#define fGEN_TCG_V6_vsubw_dv(SHORTCODE) \
+    tcg_gen_gvec_sub(MO_32, VddV_off, VuuV_off, VvvV_off, \
+                     sizeof(MMVector) * 2, sizeof(MMVector) * 2)
+
 #endif
-- 
2.7.4
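
To make the role of the MO_8/MO_16/MO_32 argument concrete, the sketch below is a plain-C reference model of the 16-bit case: each lane is added independently and wraps modulo 2^16, with no carry into the neighbouring lane. model_vaddh is a hypothetical name and is not part of the patch. The byte count plays the role of the size arguments, so passing 256 would model the _dv (vector pair) forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Lane-wise add of 16-bit elements, the behaviour selected by MO_16.
 * bytes is the total operand size: 128 for one vector, 256 for a pair.
 */
static void model_vaddh(uint16_t *vd, const uint16_t *vu,
                        const uint16_t *vv, size_t bytes)
{
    assert(bytes % sizeof(uint16_t) == 0);
    for (size_t i = 0; i < bytes / sizeof(uint16_t); i++) {
        /* The cast makes the modulo-2^16 wrap explicit. */
        vd[i] = (uint16_t)(vu[i] + vv[i]);
    }
}
```

The subtract forms follow the same pattern with the operator changed.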



* [PATCH v4 17/30] Hexagon HVX (target/hexagon) helper overrides - vector shifts
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (15 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 16/30] Hexagon HVX (target/hexagon) helper overrides - vector add & sub Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 18/30] Hexagon HVX (target/hexagon) helper overrides - vector max/min Taylor Simpson
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 122 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 122 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index ac2143e..e865410 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -205,4 +205,126 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
     tcg_gen_gvec_sub(MO_32, VddV_off, VuuV_off, VvvV_off, \
                      sizeof(MMVector) * 2, sizeof(MMVector) * 2)
 
+/* Vector shift right - various forms */
+#define fGEN_TCG_V6_vasrh(SHORTCODE) \
+    do { \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 15); \
+        tcg_gen_gvec_sars(MO_16, VdV_off, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vasrh_acc(SHORTCODE) \
+    do { \
+        intptr_t tmpoff = offsetof(CPUHexagonState, vtmp); \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 15); \
+        tcg_gen_gvec_sars(MO_16, tmpoff, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_gen_gvec_add(MO_16, VxV_off, VxV_off, tmpoff, \
+                         sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vasrw(SHORTCODE) \
+    do { \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 31); \
+        tcg_gen_gvec_sars(MO_32, VdV_off, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vasrw_acc(SHORTCODE) \
+    do { \
+        intptr_t tmpoff = offsetof(CPUHexagonState, vtmp); \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 31); \
+        tcg_gen_gvec_sars(MO_32, tmpoff, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_gen_gvec_add(MO_32, VxV_off, VxV_off, tmpoff, \
+                         sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vlsrb(SHORTCODE) \
+    do { \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 7); \
+        tcg_gen_gvec_shrs(MO_8, VdV_off, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vlsrh(SHORTCODE) \
+    do { \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 15); \
+        tcg_gen_gvec_shrs(MO_16, VdV_off, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vlsrw(SHORTCODE) \
+    do { \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 31); \
+        tcg_gen_gvec_shrs(MO_32, VdV_off, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+/* Vector shift left - various forms */
+#define fGEN_TCG_V6_vaslb(SHORTCODE) \
+    do { \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 7); \
+        tcg_gen_gvec_shls(MO_8, VdV_off, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vaslh(SHORTCODE) \
+    do { \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 15); \
+        tcg_gen_gvec_shls(MO_16, VdV_off, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vaslh_acc(SHORTCODE) \
+    do { \
+        intptr_t tmpoff = offsetof(CPUHexagonState, vtmp); \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 15); \
+        tcg_gen_gvec_shls(MO_16, tmpoff, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_gen_gvec_add(MO_16, VxV_off, VxV_off, tmpoff, \
+                         sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vaslw(SHORTCODE) \
+    do { \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 31); \
+        tcg_gen_gvec_shls(MO_32, VdV_off, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
+#define fGEN_TCG_V6_vaslw_acc(SHORTCODE) \
+    do { \
+        intptr_t tmpoff = offsetof(CPUHexagonState, vtmp); \
+        TCGv shift = tcg_temp_new(); \
+        tcg_gen_andi_tl(shift, RtV, 31); \
+        tcg_gen_gvec_shls(MO_32, tmpoff, VuV_off, shift, \
+                          sizeof(MMVector), sizeof(MMVector)); \
+        tcg_gen_gvec_add(MO_32, VxV_off, VxV_off, tmpoff, \
+                         sizeof(MMVector), sizeof(MMVector)); \
+        tcg_temp_free(shift); \
+    } while (0)
+
 #endif
-- 
2.7.4
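
The shift overrides all follow one pattern: mask the scalar shift amount to the element width, shift every lane, and for the _acc forms add the result into the destination. The plain-C sketch below models vasrw_acc under that reading. model_sar32 and model_vasrw_acc are hypothetical names; the conversions from uint32_t back to int32_t assume the usual two's-complement behaviour of mainstream compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arithmetic shift right that does not rely on signed >> behaviour. */
static int32_t model_sar32(int32_t v, unsigned s)
{
    uint32_t u;

    assert(s < 32);
    u = (uint32_t)v >> s;
    if (v < 0 && s > 0) {
        u |= ~(UINT32_MAX >> s);  /* fill vacated bits with the sign */
    }
    return (int32_t)u;
}

/* Models vasrw_acc: Vx.w += (Vu.w >> (Rt & 31)), lane by lane. */
static void model_vasrw_acc(int32_t *vx, const int32_t *vu,
                            size_t lanes, uint32_t rt)
{
    unsigned shift = rt & 31;  /* same mask as the tcg_gen_andi_tl call */

    for (size_t i = 0; i < lanes; i++) {
        uint32_t tmp = (uint32_t)model_sar32(vu[i], shift);
        vx[i] = (int32_t)((uint32_t)vx[i] + tmp);  /* wrapping add */
    }
}
```

Because of the mask, a shift amount of 33 behaves like a shift by 1 in this model.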



* [PATCH v4 18/30] Hexagon HVX (target/hexagon) helper overrides - vector max/min
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (16 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 17/30] Hexagon HVX (target/hexagon) helper overrides - vector shifts Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 19/30] Hexagon HVX (target/hexagon) helper overrides - vector logical ops Taylor Simpson
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index e865410..f548404 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -327,4 +327,38 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
         tcg_temp_free(shift); \
     } while (0)
 
+/* Vector max - various forms */
+#define fGEN_TCG_V6_vmaxw(SHORTCODE) \
+    tcg_gen_gvec_smax(MO_32, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+#define fGEN_TCG_V6_vmaxh(SHORTCODE) \
+    tcg_gen_gvec_smax(MO_16, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+#define fGEN_TCG_V6_vmaxuh(SHORTCODE) \
+    tcg_gen_gvec_umax(MO_16, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+#define fGEN_TCG_V6_vmaxb(SHORTCODE) \
+    tcg_gen_gvec_smax(MO_8, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+#define fGEN_TCG_V6_vmaxub(SHORTCODE) \
+    tcg_gen_gvec_umax(MO_8, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+
+/* Vector min - various forms */
+#define fGEN_TCG_V6_vminw(SHORTCODE) \
+    tcg_gen_gvec_smin(MO_32, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+#define fGEN_TCG_V6_vminh(SHORTCODE) \
+    tcg_gen_gvec_smin(MO_16, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+#define fGEN_TCG_V6_vminuh(SHORTCODE) \
+    tcg_gen_gvec_umin(MO_16, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+#define fGEN_TCG_V6_vminb(SHORTCODE) \
+    tcg_gen_gvec_smin(MO_8, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+#define fGEN_TCG_V6_vminub(SHORTCODE) \
+    tcg_gen_gvec_umin(MO_8, VdV_off, VuV_off, VvV_off, \
+                      sizeof(MMVector), sizeof(MMVector))
+
 #endif
-- 
2.7.4
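
The only difference between the vmaxb/vmaxub style pairs is whether the lanes are compared as signed or unsigned values, which is what selects smax versus umax. A small plain-C model (hypothetical function names, not part of the patch) shows why the distinction matters for a byte such as 0x80:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed byte maximum, the behaviour of smax with MO_8. */
static void model_vmaxb(int8_t *vd, const int8_t *vu,
                        const int8_t *vv, size_t lanes)
{
    assert(vd && vu && vv);
    for (size_t i = 0; i < lanes; i++) {
        vd[i] = vu[i] > vv[i] ? vu[i] : vv[i];
    }
}

/* Unsigned byte maximum, the behaviour of umax with MO_8. */
static void model_vmaxub(uint8_t *vd, const uint8_t *vu,
                         const uint8_t *vv, size_t lanes)
{
    assert(vd && vu && vv);
    for (size_t i = 0; i < lanes; i++) {
        vd[i] = vu[i] > vv[i] ? vu[i] : vv[i];
    }
}
```

With inputs 0x80 and 0x01, the signed form picks 0x01 (since 0x80 is -128) and the unsigned form picks 0x80.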



* [PATCH v4 19/30] Hexagon HVX (target/hexagon) helper overrides - vector logical ops
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (17 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 18/30] Hexagon HVX (target/hexagon) helper overrides - vector max/min Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-29 19:06   ` Richard Henderson
  2021-10-12 10:10 ` [PATCH v4 20/30] Hexagon HVX (target/hexagon) helper overrides - vector compares Taylor Simpson
                   ` (10 subsequent siblings)
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index f548404..f53a7f2 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -361,4 +361,46 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
     tcg_gen_gvec_umin(MO_8, VdV_off, VuV_off, VvV_off, \
                       sizeof(MMVector), sizeof(MMVector))
 
+/* Vector logical ops */
+#define fGEN_TCG_V6_vxor(SHORTCODE) \
+    tcg_gen_gvec_xor(MO_64, VdV_off, VuV_off, VvV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vand(SHORTCODE) \
+    tcg_gen_gvec_and(MO_64, VdV_off, VuV_off, VvV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vor(SHORTCODE) \
+    tcg_gen_gvec_or(MO_64, VdV_off, VuV_off, VvV_off, \
+                    sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vnot(SHORTCODE) \
+    tcg_gen_gvec_not(MO_64, VdV_off, VuV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+/* Q register logical ops */
+#define fGEN_TCG_V6_pred_or(SHORTCODE) \
+    tcg_gen_gvec_or(MO_64, QdV_off, QsV_off, QtV_off, \
+                    sizeof(MMQReg), sizeof(MMQReg))
+
+#define fGEN_TCG_V6_pred_and(SHORTCODE) \
+    tcg_gen_gvec_and(MO_64, QdV_off, QsV_off, QtV_off, \
+                     sizeof(MMQReg), sizeof(MMQReg))
+
+#define fGEN_TCG_V6_pred_xor(SHORTCODE) \
+    tcg_gen_gvec_xor(MO_64, QdV_off, QsV_off, QtV_off, \
+                     sizeof(MMQReg), sizeof(MMQReg))
+
+#define fGEN_TCG_V6_pred_or_n(SHORTCODE) \
+    tcg_gen_gvec_orc(MO_64, QdV_off, QsV_off, QtV_off, \
+                     sizeof(MMQReg), sizeof(MMQReg))
+
+#define fGEN_TCG_V6_pred_and_n(SHORTCODE) \
+    tcg_gen_gvec_andc(MO_64, QdV_off, QsV_off, QtV_off, \
+                      sizeof(MMQReg), sizeof(MMQReg))
+
+#define fGEN_TCG_V6_pred_not(SHORTCODE) \
+    tcg_gen_gvec_not(MO_64, QdV_off, QsV_off, \
+                     sizeof(MMQReg), sizeof(MMQReg))
+
 #endif
-- 
2.7.4
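
For the two negated predicate forms, the orc and andc operations complement the second source operand before combining, so pred_or_n computes Qs | ~Qt and pred_and_n computes Qs & ~Qt. The plain-C sketch below models that on a predicate register held as two 64-bit words. The 128-bit width (one bit per vector byte) is an assumption of this sketch, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_QREG_WORDS 2  /* assumed: 128 predicate bits */

/* Models pred_or_n: Qd = Qs | ~Qt. */
static void model_pred_or_n(uint64_t *qd, const uint64_t *qs,
                            const uint64_t *qt)
{
    assert(qd && qs && qt);
    for (int i = 0; i < MODEL_QREG_WORDS; i++) {
        qd[i] = qs[i] | ~qt[i];
    }
}

/* Models pred_and_n: Qd = Qs & ~Qt. */
static void model_pred_and_n(uint64_t *qd, const uint64_t *qs,
                             const uint64_t *qt)
{
    assert(qd && qs && qt);
    for (int i = 0; i < MODEL_QREG_WORDS; i++) {
        qd[i] = qs[i] & ~qt[i];
    }
}
```

Since the operations are purely bitwise, the element size passed to the gvec call does not affect the result; the model simply walks whole words.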



* [PATCH v4 20/30] Hexagon HVX (target/hexagon) helper overrides - vector compares
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (18 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 19/30] Hexagon HVX (target/hexagon) helper overrides - vector logical ops Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:10 ` [PATCH v4 21/30] Hexagon HVX (target/hexagon) helper overrides - vector splat and abs Taylor Simpson
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 103 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index f53a7f2..32f8e20 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -403,4 +403,107 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
     tcg_gen_gvec_not(MO_64, QdV_off, QsV_off, \
                      sizeof(MMQReg), sizeof(MMQReg))
 
+/* Vector compares */
+#define fGEN_TCG_VEC_CMP(COND, TYPE, SIZE) \
+    do { \
+        intptr_t tmpoff = offsetof(CPUHexagonState, vtmp); \
+        tcg_gen_gvec_cmp(COND, TYPE, tmpoff, VuV_off, VvV_off, \
+                         sizeof(MMVector), sizeof(MMVector)); \
+        vec_to_qvec(SIZE, QdV_off, tmpoff); \
+    } while (0)
+
+#define fGEN_TCG_V6_vgtw(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_GT, MO_32, 4)
+#define fGEN_TCG_V6_vgth(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_GT, MO_16, 2)
+#define fGEN_TCG_V6_vgtb(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_GT, MO_8, 1)
+
+#define fGEN_TCG_V6_vgtuw(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_GTU, MO_32, 4)
+#define fGEN_TCG_V6_vgtuh(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_GTU, MO_16, 2)
+#define fGEN_TCG_V6_vgtub(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_GTU, MO_8, 1)
+
+#define fGEN_TCG_V6_veqw(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_EQ, MO_32, 4)
+#define fGEN_TCG_V6_veqh(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_EQ, MO_16, 2)
+#define fGEN_TCG_V6_veqb(SHORTCODE) \
+    fGEN_TCG_VEC_CMP(TCG_COND_EQ, MO_8, 1)
+
+#define fGEN_TCG_VEC_CMP_OP(COND, TYPE, SIZE, OP) \
+    do { \
+        intptr_t tmpoff = offsetof(CPUHexagonState, vtmp); \
+        intptr_t qoff = offsetof(CPUHexagonState, qtmp); \
+        tcg_gen_gvec_cmp(COND, TYPE, tmpoff, VuV_off, VvV_off, \
+                         sizeof(MMVector), sizeof(MMVector)); \
+        vec_to_qvec(SIZE, qoff, tmpoff); \
+        OP(MO_64, QxV_off, QxV_off, qoff, sizeof(MMQReg), sizeof(MMQReg)); \
+    } while (0)
+
+#define fGEN_TCG_V6_vgtw_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_32, 4, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_vgtw_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_32, 4, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_vgtw_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_32, 4, tcg_gen_gvec_xor)
+
+#define fGEN_TCG_V6_vgtuw_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_32, 4, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_vgtuw_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_32, 4, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_vgtuw_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_32, 4, tcg_gen_gvec_xor)
+
+#define fGEN_TCG_V6_vgth_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_16, 2, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_vgth_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_16, 2, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_vgth_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_16, 2, tcg_gen_gvec_xor)
+
+#define fGEN_TCG_V6_vgtuh_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_16, 2, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_vgtuh_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_16, 2, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_vgtuh_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_16, 2, tcg_gen_gvec_xor)
+
+#define fGEN_TCG_V6_vgtb_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_8, 1, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_vgtb_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_8, 1, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_vgtb_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GT, MO_8, 1, tcg_gen_gvec_xor)
+
+#define fGEN_TCG_V6_vgtub_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_8, 1, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_vgtub_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_8, 1, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_vgtub_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_GTU, MO_8, 1, tcg_gen_gvec_xor)
+
+#define fGEN_TCG_V6_veqw_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_32, 4, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_veqw_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_32, 4, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_veqw_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_32, 4, tcg_gen_gvec_xor)
+
+#define fGEN_TCG_V6_veqh_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_16, 2, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_veqh_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_16, 2, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_veqh_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_16, 2, tcg_gen_gvec_xor)
+
+#define fGEN_TCG_V6_veqb_and(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_8, 1, tcg_gen_gvec_and)
+#define fGEN_TCG_V6_veqb_or(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_8, 1, tcg_gen_gvec_or)
+#define fGEN_TCG_V6_veqb_xor(SHORTCODE) \
+    fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_8, 1, tcg_gen_gvec_xor)
+
 #endif
-- 
2.7.4
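
The compare overrides work in two steps: a lane-wise compare into vtmp, then vec_to_qvec to turn that into a predicate register, optionally combined with the old Qx. vec_to_qvec is defined in an earlier patch that is not shown here, so the sketch below rests on an assumption about it: that SIZE is the element width in bytes and that each element contributes SIZE identical predicate bits (one bit per vector byte). All names in the sketch are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_VEC_WORDS 32   /* 128-byte vector as 32-bit lanes */
#define MODEL_QREG_BYTES 16  /* assumed: one predicate bit per vector byte */

/* Models vgtw: signed 32-bit compare, 4 predicate bits per true lane. */
static void model_vgtw(uint8_t *q, const int32_t *vu, const int32_t *vv)
{
    assert(q && vu && vv);
    memset(q, 0, MODEL_QREG_BYTES);
    for (int i = 0; i < MODEL_VEC_WORDS; i++) {
        if (vu[i] > vv[i]) {
            for (int b = 0; b < 4; b++) {
                int bit = i * 4 + b;
                q[bit / 8] |= (uint8_t)(1u << (bit % 8));
            }
        }
    }
}

/* Models vgtw_xor: Qx ^= (Vu.w > Vv.w), using a scratch predicate. */
static void model_vgtw_xor(uint8_t *qx, const int32_t *vu,
                           const int32_t *vv)
{
    uint8_t qtmp[MODEL_QREG_BYTES];

    model_vgtw(qtmp, vu, vv);
    for (int i = 0; i < MODEL_QREG_BYTES; i++) {
        qx[i] ^= qtmp[i];
    }
}
```

The _and and _or variants would differ only in the operator applied in the final loop, which is what the OP argument of fGEN_TCG_VEC_CMP_OP selects.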



* [PATCH v4 21/30] Hexagon HVX (target/hexagon) helper overrides - vector splat and abs
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (19 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 20/30] Hexagon HVX (target/hexagon) helper overrides - vector compares Taylor Simpson
@ 2021-10-12 10:10 ` Taylor Simpson
  2021-10-12 10:11 ` [PATCH v4 22/30] Hexagon HVX (target/hexagon) helper overrides - vector loads Taylor Simpson
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index 32f8e20..435c7b5 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -506,4 +506,30 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
 #define fGEN_TCG_V6_veqb_xor(SHORTCODE) \
     fGEN_TCG_VEC_CMP_OP(TCG_COND_EQ, MO_8, 1, tcg_gen_gvec_xor)
 
+/* Vector splat - various forms */
+#define fGEN_TCG_V6_lvsplatw(SHORTCODE) \
+    tcg_gen_gvec_dup_i32(MO_32, VdV_off, \
+                         sizeof(MMVector), sizeof(MMVector), RtV)
+
+#define fGEN_TCG_V6_lvsplath(SHORTCODE) \
+    tcg_gen_gvec_dup_i32(MO_16, VdV_off, \
+                         sizeof(MMVector), sizeof(MMVector), RtV)
+
+#define fGEN_TCG_V6_lvsplatb(SHORTCODE) \
+    tcg_gen_gvec_dup_i32(MO_8, VdV_off, \
+                         sizeof(MMVector), sizeof(MMVector), RtV)
+
+/* Vector absolute value - various forms */
+#define fGEN_TCG_V6_vabsb(SHORTCODE) \
+    tcg_gen_gvec_abs(MO_8, VdV_off, VuV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vabsh(SHORTCODE) \
+    tcg_gen_gvec_abs(MO_16, VdV_off, VuV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
+#define fGEN_TCG_V6_vabsw(SHORTCODE) \
+    tcg_gen_gvec_abs(MO_32, VdV_off, VuV_off, \
+                     sizeof(MMVector), sizeof(MMVector))
+
 #endif
-- 
2.7.4
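
Two details of the splat and abs overrides are easy to miss: the narrower splats replicate only the low bits of the scalar, and the absolute value is the wrapping kind, so the most negative element maps to itself. The plain-C model below illustrates both for byte elements. The names are hypothetical, and the wrapping behaviour is stated as my understanding of tcg_gen_gvec_abs rather than something this patch spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_VEC_BYTES 128  /* vector length stated in the cover letter */

/* Models lvsplatb: replicate the low byte of Rt across the vector. */
static void model_lvsplatb(uint8_t *vd, uint32_t rt)
{
    assert(vd);
    memset(vd, (int)(rt & 0xff), MODEL_VEC_BYTES);
}

/* Models vabsb: lane-wise absolute value without saturation. */
static void model_vabsb(int8_t *vd, const int8_t *vu, size_t lanes)
{
    assert(vd && vu);
    for (size_t i = 0; i < lanes; i++) {
        uint8_t u = (uint8_t)vu[i];
        /* Negate in unsigned arithmetic so -128 wraps back to -128. */
        vd[i] = (int8_t)(vu[i] < 0 ? (uint8_t)(0u - u) : u);
    }
}
```

For example, splatting 0x12345678 with the byte form fills the vector with 0x78.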



* [PATCH v4 22/30] Hexagon HVX (target/hexagon) helper overrides - vector loads
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (20 preceding siblings ...)
  2021-10-12 10:10 ` [PATCH v4 21/30] Hexagon HVX (target/hexagon) helper overrides - vector splat and abs Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-12 10:11 ` [PATCH v4 23/30] Hexagon HVX (target/hexagon) helper overrides - vector stores Taylor Simpson
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 150 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index 435c7b5..2d1d778 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -532,4 +532,154 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
     tcg_gen_gvec_abs(MO_32, VdV_off, VuV_off, \
                      sizeof(MMVector), sizeof(MMVector))
 
+/* Vector loads */
+#define fGEN_TCG_V6_vL32b_pi(SHORTCODE)                    SHORTCODE
+#define fGEN_TCG_V6_vL32Ub_pi(SHORTCODE)                   SHORTCODE
+#define fGEN_TCG_V6_vL32b_cur_pi(SHORTCODE)                SHORTCODE
+#define fGEN_TCG_V6_vL32b_tmp_pi(SHORTCODE)                SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_pi(SHORTCODE)                 SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_cur_pi(SHORTCODE)             SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_tmp_pi(SHORTCODE)             SHORTCODE
+#define fGEN_TCG_V6_vL32b_ai(SHORTCODE)                    SHORTCODE
+#define fGEN_TCG_V6_vL32Ub_ai(SHORTCODE)                   SHORTCODE
+#define fGEN_TCG_V6_vL32b_cur_ai(SHORTCODE)                SHORTCODE
+#define fGEN_TCG_V6_vL32b_tmp_ai(SHORTCODE)                SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_ai(SHORTCODE)                 SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_cur_ai(SHORTCODE)             SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_tmp_ai(SHORTCODE)             SHORTCODE
+#define fGEN_TCG_V6_vL32b_ppu(SHORTCODE)                   SHORTCODE
+#define fGEN_TCG_V6_vL32Ub_ppu(SHORTCODE)                  SHORTCODE
+#define fGEN_TCG_V6_vL32b_cur_ppu(SHORTCODE)               SHORTCODE
+#define fGEN_TCG_V6_vL32b_tmp_ppu(SHORTCODE)               SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_ppu(SHORTCODE)                SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_cur_ppu(SHORTCODE)            SHORTCODE
+#define fGEN_TCG_V6_vL32b_nt_tmp_ppu(SHORTCODE)            SHORTCODE
+
+/* Predicated vector loads */
+#define fGEN_TCG_PRED_VEC_LOAD(PRED, GET_EA, DSTOFF, INC) \
+    do { \
+        TCGv LSB = tcg_temp_new(); \
+        TCGLabel *false_label = gen_new_label(); \
+        TCGLabel *end_label = gen_new_label(); \
+        GET_EA; \
+        PRED; \
+        tcg_gen_brcondi_tl(TCG_COND_EQ, LSB, 0, false_label); \
+        tcg_temp_free(LSB); \
+        gen_vreg_load(ctx, DSTOFF, EA, true); \
+        INC; \
+        tcg_gen_br(end_label); \
+        gen_set_label(false_label); \
+        tcg_gen_ori_tl(hex_slot_cancelled, hex_slot_cancelled, \
+                       1 << insn->slot); \
+        gen_set_label(end_label); \
+    } while (0)
+
+#define fGEN_TCG_PRED_VEC_LOAD_pred_pi \
+    fGEN_TCG_PRED_VEC_LOAD(fLSBOLD(PvV), \
+                           fEA_REG(RxV), \
+                           VdV_off, \
+                           fPM_I(RxV, siV * sizeof(MMVector)))
+#define fGEN_TCG_PRED_VEC_LOAD_npred_pi \
+    fGEN_TCG_PRED_VEC_LOAD(fLSBOLDNOT(PvV), \
+                           fEA_REG(RxV), \
+                           VdV_off, \
+                           fPM_I(RxV, siV * sizeof(MMVector)))
+
+#define fGEN_TCG_V6_vL32b_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_pi
+#define fGEN_TCG_V6_vL32b_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_pi
+#define fGEN_TCG_V6_vL32b_cur_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_pi
+#define fGEN_TCG_V6_vL32b_cur_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_pi
+#define fGEN_TCG_V6_vL32b_tmp_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_pi
+#define fGEN_TCG_V6_vL32b_tmp_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_pi
+#define fGEN_TCG_V6_vL32b_nt_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_pi
+#define fGEN_TCG_V6_vL32b_nt_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_pi
+#define fGEN_TCG_V6_vL32b_nt_cur_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_pi
+#define fGEN_TCG_V6_vL32b_nt_cur_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_pi
+#define fGEN_TCG_V6_vL32b_nt_tmp_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_pi
+#define fGEN_TCG_V6_vL32b_nt_tmp_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_pi
+
+#define fGEN_TCG_PRED_VEC_LOAD_pred_ai \
+    fGEN_TCG_PRED_VEC_LOAD(fLSBOLD(PvV), \
+                           fEA_RI(RtV, siV * sizeof(MMVector)), \
+                           VdV_off, \
+                           do {} while (0))
+#define fGEN_TCG_PRED_VEC_LOAD_npred_ai \
+    fGEN_TCG_PRED_VEC_LOAD(fLSBOLDNOT(PvV), \
+                           fEA_RI(RtV, siV * sizeof(MMVector)), \
+                           VdV_off, \
+                           do {} while (0))
+
+#define fGEN_TCG_V6_vL32b_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ai
+#define fGEN_TCG_V6_vL32b_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ai
+#define fGEN_TCG_V6_vL32b_cur_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ai
+#define fGEN_TCG_V6_vL32b_cur_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ai
+#define fGEN_TCG_V6_vL32b_tmp_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ai
+#define fGEN_TCG_V6_vL32b_tmp_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ai
+#define fGEN_TCG_V6_vL32b_nt_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ai
+#define fGEN_TCG_V6_vL32b_nt_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ai
+#define fGEN_TCG_V6_vL32b_nt_cur_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ai
+#define fGEN_TCG_V6_vL32b_nt_cur_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ai
+#define fGEN_TCG_V6_vL32b_nt_tmp_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ai
+#define fGEN_TCG_V6_vL32b_nt_tmp_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ai
+
+#define fGEN_TCG_PRED_VEC_LOAD_pred_ppu \
+    fGEN_TCG_PRED_VEC_LOAD(fLSBOLD(PvV), \
+                           fEA_REG(RxV), \
+                           VdV_off, \
+                           fPM_M(RxV, MuV))
+#define fGEN_TCG_PRED_VEC_LOAD_npred_ppu \
+    fGEN_TCG_PRED_VEC_LOAD(fLSBOLDNOT(PvV), \
+                           fEA_REG(RxV), \
+                           VdV_off, \
+                           fPM_M(RxV, MuV))
+
+#define fGEN_TCG_V6_vL32b_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ppu
+#define fGEN_TCG_V6_vL32b_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ppu
+#define fGEN_TCG_V6_vL32b_cur_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ppu
+#define fGEN_TCG_V6_vL32b_cur_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ppu
+#define fGEN_TCG_V6_vL32b_tmp_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ppu
+#define fGEN_TCG_V6_vL32b_tmp_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ppu
+#define fGEN_TCG_V6_vL32b_nt_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ppu
+#define fGEN_TCG_V6_vL32b_nt_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ppu
+#define fGEN_TCG_V6_vL32b_nt_cur_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ppu
+#define fGEN_TCG_V6_vL32b_nt_cur_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ppu
+#define fGEN_TCG_V6_vL32b_nt_tmp_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_pred_ppu
+#define fGEN_TCG_V6_vL32b_nt_tmp_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_LOAD_npred_ppu
+
 #endif
-- 
2.7.4


* [PATCH v4 23/30] Hexagon HVX (target/hexagon) helper overrides - vector stores
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (21 preceding siblings ...)
  2021-10-12 10:11 ` [PATCH v4 22/30] Hexagon HVX (target/hexagon) helper overrides - vector loads Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-12 10:11 ` [PATCH v4 24/30] Hexagon HVX (target/hexagon) import semantics Taylor Simpson
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/gen_tcg_hvx.h | 218 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 218 insertions(+)
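
Reading aid, not part of the patch: at run time, the code emitted by
fGEN_TCG_PRED_VEC_STORE behaves roughly like the plain C model below.
This is only a sketch under stated assumptions. ModelState,
model_pred_vec_store and VEC_BYTES are hypothetical names that do not
exist in QEMU; VEC_BYTES assumes the 128 byte vectors mentioned in the
cover letter, and pred_lsb stands for the value that fLSBOLD or
fLSBOLDNOT leaves in LSB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VEC_BYTES 128u          /* assumed sizeof(MMVector) */

/* Hypothetical stand-in for the runtime state the generated code touches */
typedef struct {
    uint32_t slot_cancelled;    /* one bit per packet slot */
    bool     store_pending;     /* models hex_vstore_pending[slot] */
    uint32_t store_addr;        /* models hex_vstore_addr[slot] */
} ModelState;

/*
 * Model of one predicated, post-incrementing vector store.
 * Returns the new value of the address register.
 */
static uint32_t model_pred_vec_store(ModelState *st, bool pred_lsb,
                                     uint32_t rx, int32_t inc,
                                     int slot, bool aligned)
{
    assert(slot >= 0 && slot < 4);      /* assumes four packet slots */

    if (!pred_lsb) {
        /* false_label path: cancel the slot, skip store and increment */
        st->slot_cancelled |= 1u << slot;
        return rx;
    }

    /* Record the store; aligned forms clear the low address bits */
    st->store_pending = true;
    st->store_addr = aligned ? (rx & ~(VEC_BYTES - 1u)) : rx;

    /* INC runs only on the taken path */
    return rx + (uint32_t)inc;
}
```

With a true predicate, rx = 0x1234 and aligned = true, the model logs a
store at 0x1200 and returns rx + inc.  With a false predicate it leaves
rx unchanged and only sets the slot's bit in slot_cancelled.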

diff --git a/target/hexagon/gen_tcg_hvx.h b/target/hexagon/gen_tcg_hvx.h
index 2d1d778..cdcc938 100644
--- a/target/hexagon/gen_tcg_hvx.h
+++ b/target/hexagon/gen_tcg_hvx.h
@@ -682,4 +682,222 @@ static inline void assert_vhist_tmp(DisasContext *ctx)
 #define fGEN_TCG_V6_vL32b_nt_tmp_npred_ppu(SHORTCODE) \
     fGEN_TCG_PRED_VEC_LOAD_npred_ppu
 
+/* Vector stores */
+#define fGEN_TCG_V6_vS32b_pi(SHORTCODE)                    SHORTCODE
+#define fGEN_TCG_V6_vS32Ub_pi(SHORTCODE)                   SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_pi(SHORTCODE)                 SHORTCODE
+#define fGEN_TCG_V6_vS32b_ai(SHORTCODE)                    SHORTCODE
+#define fGEN_TCG_V6_vS32Ub_ai(SHORTCODE)                   SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_ai(SHORTCODE)                 SHORTCODE
+#define fGEN_TCG_V6_vS32b_ppu(SHORTCODE)                   SHORTCODE
+#define fGEN_TCG_V6_vS32Ub_ppu(SHORTCODE)                  SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_ppu(SHORTCODE)                SHORTCODE
+
+/* New value vector stores */
+#define fGEN_TCG_NEWVAL_VEC_STORE(GET_EA, INC) \
+    do { \
+        GET_EA; \
+        gen_vreg_store(ctx, insn, pkt, EA, OsN_off, insn->slot, true); \
+        INC; \
+    } while (0)
+
+#define fGEN_TCG_NEWVAL_VEC_STORE_pi \
+    fGEN_TCG_NEWVAL_VEC_STORE(fEA_REG(RxV), fPM_I(RxV, siV * sizeof(MMVector)))
+
+#define fGEN_TCG_V6_vS32b_new_pi(SHORTCODE) \
+    fGEN_TCG_NEWVAL_VEC_STORE_pi
+#define fGEN_TCG_V6_vS32b_nt_new_pi(SHORTCODE) \
+    fGEN_TCG_NEWVAL_VEC_STORE_pi
+
+#define fGEN_TCG_NEWVAL_VEC_STORE_ai \
+    fGEN_TCG_NEWVAL_VEC_STORE(fEA_RI(RtV, siV * sizeof(MMVector)), \
+                              do { } while (0))
+
+#define fGEN_TCG_V6_vS32b_new_ai(SHORTCODE) \
+    fGEN_TCG_NEWVAL_VEC_STORE_ai
+#define fGEN_TCG_V6_vS32b_nt_new_ai(SHORTCODE) \
+    fGEN_TCG_NEWVAL_VEC_STORE_ai
+
+#define fGEN_TCG_NEWVAL_VEC_STORE_ppu \
+    fGEN_TCG_NEWVAL_VEC_STORE(fEA_REG(RxV), fPM_M(RxV, MuV))
+
+#define fGEN_TCG_V6_vS32b_new_ppu(SHORTCODE) \
+    fGEN_TCG_NEWVAL_VEC_STORE_ppu
+#define fGEN_TCG_V6_vS32b_nt_new_ppu(SHORTCODE) \
+    fGEN_TCG_NEWVAL_VEC_STORE_ppu
+
+/* Predicated vector stores */
+#define fGEN_TCG_PRED_VEC_STORE(PRED, GET_EA, SRCOFF, ALIGN, INC) \
+    do { \
+        TCGv LSB = tcg_temp_new(); \
+        TCGLabel *false_label = gen_new_label(); \
+        TCGLabel *end_label = gen_new_label(); \
+        PRED; \
+        GET_EA; \
+        tcg_gen_brcondi_tl(TCG_COND_EQ, LSB, 0, false_label); \
+        tcg_temp_free(LSB); \
+        gen_vreg_store(ctx, insn, pkt, EA, SRCOFF, insn->slot, ALIGN); \
+        INC; \
+        tcg_gen_br(end_label); \
+        gen_set_label(false_label); \
+        tcg_gen_ori_tl(hex_slot_cancelled, hex_slot_cancelled, \
+                       1 << insn->slot); \
+        gen_set_label(end_label); \
+    } while (0)
+
+#define fGEN_TCG_PRED_VEC_STORE_pred_pi(ALIGN) \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLD(PvV), \
+                            fEA_REG(RxV), \
+                            VsV_off, ALIGN, \
+                            fPM_I(RxV, siV * sizeof(MMVector)))
+#define fGEN_TCG_PRED_VEC_STORE_npred_pi(ALIGN) \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLDNOT(PvV), \
+                            fEA_REG(RxV), \
+                            VsV_off, ALIGN, \
+                            fPM_I(RxV, siV * sizeof(MMVector)))
+#define fGEN_TCG_PRED_VEC_STORE_new_pred_pi \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLD(PvV), \
+                            fEA_REG(RxV), \
+                            OsN_off, true, \
+                            fPM_I(RxV, siV * sizeof(MMVector)))
+#define fGEN_TCG_PRED_VEC_STORE_new_npred_pi \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLDNOT(PvV), \
+                            fEA_REG(RxV), \
+                            OsN_off, true, \
+                            fPM_I(RxV, siV * sizeof(MMVector)))
+
+#define fGEN_TCG_V6_vS32b_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_pi(true)
+#define fGEN_TCG_V6_vS32b_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_pi(true)
+#define fGEN_TCG_V6_vS32Ub_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_pi(false)
+#define fGEN_TCG_V6_vS32Ub_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_pi(false)
+#define fGEN_TCG_V6_vS32b_nt_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_pi(true)
+#define fGEN_TCG_V6_vS32b_nt_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_pi(true)
+#define fGEN_TCG_V6_vS32b_new_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_pred_pi
+#define fGEN_TCG_V6_vS32b_new_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_npred_pi
+#define fGEN_TCG_V6_vS32b_nt_new_pred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_pred_pi
+#define fGEN_TCG_V6_vS32b_nt_new_npred_pi(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_npred_pi
+
+#define fGEN_TCG_PRED_VEC_STORE_pred_ai(ALIGN) \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLD(PvV), \
+                            fEA_RI(RtV, siV * sizeof(MMVector)), \
+                            VsV_off, ALIGN, \
+                            do { } while (0))
+#define fGEN_TCG_PRED_VEC_STORE_npred_ai(ALIGN) \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLDNOT(PvV), \
+                            fEA_RI(RtV, siV * sizeof(MMVector)), \
+                            VsV_off, ALIGN, \
+                            do { } while (0))
+#define fGEN_TCG_PRED_VEC_STORE_new_pred_ai \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLD(PvV), \
+                            fEA_RI(RtV, siV * sizeof(MMVector)), \
+                            OsN_off, true, \
+                            do { } while (0))
+#define fGEN_TCG_PRED_VEC_STORE_new_npred_ai \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLDNOT(PvV), \
+                            fEA_RI(RtV, siV * sizeof(MMVector)), \
+                            OsN_off, true, \
+                            do { } while (0))
+
+#define fGEN_TCG_V6_vS32b_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_ai(true)
+#define fGEN_TCG_V6_vS32b_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_ai(true)
+#define fGEN_TCG_V6_vS32Ub_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_ai(false)
+#define fGEN_TCG_V6_vS32Ub_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_ai(false)
+#define fGEN_TCG_V6_vS32b_nt_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_ai(true)
+#define fGEN_TCG_V6_vS32b_nt_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_ai(true)
+#define fGEN_TCG_V6_vS32b_new_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_pred_ai
+#define fGEN_TCG_V6_vS32b_new_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_npred_ai
+#define fGEN_TCG_V6_vS32b_nt_new_pred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_pred_ai
+#define fGEN_TCG_V6_vS32b_nt_new_npred_ai(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_npred_ai
+
+#define fGEN_TCG_PRED_VEC_STORE_pred_ppu(ALIGN) \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLD(PvV), \
+                            fEA_REG(RxV), \
+                            VsV_off, ALIGN, \
+                            fPM_M(RxV, MuV))
+#define fGEN_TCG_PRED_VEC_STORE_npred_ppu(ALIGN) \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLDNOT(PvV), \
+                            fEA_REG(RxV), \
+                            VsV_off, ALIGN, \
+                            fPM_M(RxV, MuV))
+#define fGEN_TCG_PRED_VEC_STORE_new_pred_ppu \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLD(PvV), \
+                            fEA_REG(RxV), \
+                            OsN_off, true, \
+                            fPM_M(RxV, MuV))
+#define fGEN_TCG_PRED_VEC_STORE_new_npred_ppu \
+    fGEN_TCG_PRED_VEC_STORE(fLSBOLDNOT(PvV), \
+                            fEA_REG(RxV), \
+                            OsN_off, true, \
+                            fPM_M(RxV, MuV))
+
+#define fGEN_TCG_V6_vS32b_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_ppu(true)
+#define fGEN_TCG_V6_vS32b_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_ppu(true)
+#define fGEN_TCG_V6_vS32Ub_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_ppu(false)
+#define fGEN_TCG_V6_vS32Ub_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_ppu(false)
+#define fGEN_TCG_V6_vS32b_nt_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_pred_ppu(true)
+#define fGEN_TCG_V6_vS32b_nt_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_npred_ppu(true)
+#define fGEN_TCG_V6_vS32b_new_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_pred_ppu
+#define fGEN_TCG_V6_vS32b_new_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_npred_ppu
+#define fGEN_TCG_V6_vS32b_nt_new_pred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_pred_ppu
+#define fGEN_TCG_V6_vS32b_nt_new_npred_ppu(SHORTCODE) \
+    fGEN_TCG_PRED_VEC_STORE_new_npred_ppu
+
+/* Masked vector stores */
+#define fGEN_TCG_V6_vS32b_qpred_pi(SHORTCODE)              SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_qpred_pi(SHORTCODE)           SHORTCODE
+#define fGEN_TCG_V6_vS32b_qpred_ai(SHORTCODE)              SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_qpred_ai(SHORTCODE)           SHORTCODE
+#define fGEN_TCG_V6_vS32b_qpred_ppu(SHORTCODE)             SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_qpred_ppu(SHORTCODE)          SHORTCODE
+#define fGEN_TCG_V6_vS32b_nqpred_pi(SHORTCODE)             SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_nqpred_pi(SHORTCODE)          SHORTCODE
+#define fGEN_TCG_V6_vS32b_nqpred_ai(SHORTCODE)             SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_nqpred_ai(SHORTCODE)          SHORTCODE
+#define fGEN_TCG_V6_vS32b_nqpred_ppu(SHORTCODE)            SHORTCODE
+#define fGEN_TCG_V6_vS32b_nt_nqpred_ppu(SHORTCODE)         SHORTCODE
+
+/* Store release is not modelled in QEMU; self-assign to suppress warnings */
+#define fGEN_TCG_V6_vS32b_srls_pi(SHORTCODE) \
+    do { \
+        siV = siV; \
+    } while (0)
+#define fGEN_TCG_V6_vS32b_srls_ai(SHORTCODE) \
+    do { \
+        RtV = RtV; \
+        siV = siV; \
+    } while (0)
+#define fGEN_TCG_V6_vS32b_srls_ppu(SHORTCODE) \
+    do { \
+        MuV = MuV; \
+    } while (0)
+
 #endif
-- 
2.7.4


* [PATCH v4 24/30] Hexagon HVX (target/hexagon) import semantics
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (22 preceding siblings ...)
  2021-10-12 10:11 ` [PATCH v4 23/30] Hexagon HVX (target/hexagon) helper overrides - vector stores Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-12 10:11 ` [PATCH v4 25/30] Hexagon HVX (target/hexagon) instruction decoding Taylor Simpson
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Imported from the Hexagon architecture library
    imported/allext.idef           Top level file for all extensions
    imported/mmvec/ext.idef        HVX instruction definitions

Support functions added to target/hexagon/genptr.c

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/genptr.c                |  172 +++
 target/hexagon/imported/allext.idef    |   25 +
 target/hexagon/imported/allidefs.def   |    1 +
 target/hexagon/imported/mmvec/ext.idef | 2606 ++++++++++++++++++++++++++++++++
 4 files changed, 2804 insertions(+)
 create mode 100644 target/hexagon/imported/allext.idef
 create mode 100644 target/hexagon/imported/mmvec/ext.idef
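
Reading aid, not part of the patch: vec_to_qvec in genptr.c condenses a
vector into a predicate with one bit per vector byte, setting all the
bits that belong to an element when that element is non-zero.  A plain C
model of the same computation might look like the sketch below.  It
assumes 128 byte vectors and a layout where byte j of each 64-bit chunk
occupies bits j*8 .. j*8+7 (little-endian).  model_vec_to_qvec,
VEC_BYTES and QREG_BYTES are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES  128                  /* assumed sizeof(MMVector) */
#define QREG_BYTES (VEC_BYTES / 8)      /* one predicate bit per vector byte */

/*
 * size is the element width in bytes.  Each group of 8 vector bytes
 * yields one predicate byte, mirroring the per-chunk loop in the patch.
 */
static void model_vec_to_qvec(size_t size, uint8_t *q, const uint8_t *v)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);

    for (size_t i = 0; i < VEC_BYTES / 8; i++) {
        unsigned mask = 0;

        for (size_t j = 0; j < 8; j += size) {
            bool nonzero = false;

            /* Corresponds to extracting the element and testing != 0 */
            for (size_t k = 0; k < size; k++) {
                nonzero |= v[i * 8 + j + k] != 0;
            }
            /* Corresponds to depositing 'size' one bits at position j */
            if (nonzero) {
                mask |= ((1u << size) - 1u) << j;
            }
        }
        q[i] = (uint8_t)mask;
    }
}
```

For halfword elements (size = 2), a vector whose only non-zero bytes are
bytes 0 and 5 gives 0x33 in the first predicate byte: bits 0-1 for the
element covering bytes 0-1 and bits 4-5 for the element covering
bytes 4-5.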

diff --git a/target/hexagon/genptr.c b/target/hexagon/genptr.c
index 473438a..4419d30 100644
--- a/target/hexagon/genptr.c
+++ b/target/hexagon/genptr.c
@@ -19,11 +19,13 @@
 #include "cpu.h"
 #include "internal.h"
 #include "tcg/tcg-op.h"
+#include "tcg/tcg-op-gvec.h"
 #include "insn.h"
 #include "opcodes.h"
 #include "translate.h"
 #define QEMU_GENERATE       /* Used internally by macros.h */
 #include "macros.h"
+#include "mmvec/macros.h"
 #undef QEMU_GENERATE
 #include "gen_tcg.h"
 #include "gen_tcg_hvx.h"
@@ -462,5 +464,175 @@ static TCGv gen_8bitsof(TCGv result, TCGv value)
     return result;
 }
 
+static intptr_t vreg_src_off(DisasContext *ctx, int num)
+{
+    intptr_t offset = offsetof(CPUHexagonState, VRegs[num]);
+
+    if (test_bit(num, ctx->vregs_select)) {
+        offset = ctx_future_vreg_off(ctx, num, 1, false);
+    }
+    if (test_bit(num, ctx->vregs_updated_tmp)) {
+        offset = ctx_tmp_vreg_off(ctx, num, 1, false);
+    }
+    return offset;
+}
+
+static void gen_log_vreg_write(DisasContext *ctx, intptr_t srcoff, int num,
+                               VRegWriteType type, int slot_num,
+                               bool is_predicated)
+{
+    TCGLabel *label_end = NULL;
+    intptr_t dstoff;
+
+    if (is_predicated) {
+        TCGv cancelled = tcg_temp_local_new();
+        label_end = gen_new_label();
+
+        /* Don't do anything if the slot was cancelled */
+        tcg_gen_extract_tl(cancelled, hex_slot_cancelled, slot_num, 1);
+        tcg_gen_brcondi_tl(TCG_COND_NE, cancelled, 0, label_end);
+        tcg_temp_free(cancelled);
+    }
+
+    if (type != EXT_TMP) {
+        dstoff = ctx_future_vreg_off(ctx, num, 1, true);
+        tcg_gen_gvec_mov(MO_64, dstoff, srcoff,
+                         sizeof(MMVector), sizeof(MMVector));
+        tcg_gen_ori_tl(hex_VRegs_updated, hex_VRegs_updated, 1 << num);
+    } else {
+        dstoff = ctx_tmp_vreg_off(ctx, num, 1, false);
+        tcg_gen_gvec_mov(MO_64, dstoff, srcoff,
+                         sizeof(MMVector), sizeof(MMVector));
+    }
+
+    if (is_predicated) {
+        gen_set_label(label_end);
+    }
+}
+
+static void gen_log_vreg_write_pair(DisasContext *ctx, intptr_t srcoff, int num,
+                                    VRegWriteType type, int slot_num,
+                                    bool is_predicated)
+{
+    gen_log_vreg_write(ctx, srcoff, num ^ 0, type, slot_num, is_predicated);
+    srcoff += sizeof(MMVector);
+    gen_log_vreg_write(ctx, srcoff, num ^ 1, type, slot_num, is_predicated);
+}
+
+static void gen_log_qreg_write(intptr_t srcoff, int num, int vnew,
+                               int slot_num, bool is_predicated)
+{
+    TCGLabel *label_end = NULL;
+    intptr_t dstoff;
+
+    if (is_predicated) {
+        TCGv cancelled = tcg_temp_local_new();
+        label_end = gen_new_label();
+
+        /* Don't do anything if the slot was cancelled */
+        tcg_gen_extract_tl(cancelled, hex_slot_cancelled, slot_num, 1);
+        tcg_gen_brcondi_tl(TCG_COND_NE, cancelled, 0, label_end);
+        tcg_temp_free(cancelled);
+    }
+
+    dstoff = offsetof(CPUHexagonState, future_QRegs[num]);
+    tcg_gen_gvec_mov(MO_64, dstoff, srcoff, sizeof(MMQReg), sizeof(MMQReg));
+
+    if (is_predicated) {
+        tcg_gen_ori_tl(hex_QRegs_updated, hex_QRegs_updated, 1 << num);
+        gen_set_label(label_end);
+    }
+}
+
+static void gen_vreg_load(DisasContext *ctx, intptr_t dstoff, TCGv src,
+                          bool aligned)
+{
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    if (aligned) {
+        tcg_gen_andi_tl(src, src, ~((int32_t)sizeof(MMVector) - 1));
+    }
+    for (int i = 0; i < sizeof(MMVector) / 8; i++) {
+        tcg_gen_qemu_ld64(tmp, src, ctx->mem_idx);
+        tcg_gen_addi_tl(src, src, 8);
+        tcg_gen_st_i64(tmp, cpu_env, dstoff + i * 8);
+    }
+    tcg_temp_free_i64(tmp);
+}
+
+static void gen_vreg_store(DisasContext *ctx, Insn *insn, Packet *pkt,
+                           TCGv EA, intptr_t srcoff, int slot, bool aligned)
+{
+    intptr_t dstoff = offsetof(CPUHexagonState, vstore[slot].data);
+    intptr_t maskoff = offsetof(CPUHexagonState, vstore[slot].mask);
+
+    if (is_gather_store_insn(insn, pkt)) {
+        TCGv sl = tcg_constant_tl(slot);
+        gen_helper_gather_store(cpu_env, EA, sl);
+        return;
+    }
+
+    tcg_gen_movi_tl(hex_vstore_pending[slot], 1);
+    if (aligned) {
+        tcg_gen_andi_tl(hex_vstore_addr[slot], EA,
+                        ~((int32_t)sizeof(MMVector) - 1));
+    } else {
+        tcg_gen_mov_tl(hex_vstore_addr[slot], EA);
+    }
+    tcg_gen_movi_tl(hex_vstore_size[slot], sizeof(MMVector));
+
+    /* Copy the data to the vstore buffer */
+    tcg_gen_gvec_mov(MO_64, dstoff, srcoff, sizeof(MMVector), sizeof(MMVector));
+    /* Set the mask to all 1's */
+    tcg_gen_gvec_dup_imm(MO_64, maskoff, sizeof(MMQReg), sizeof(MMQReg), ~0LL);
+}
+
+static void gen_vreg_masked_store(DisasContext *ctx, TCGv EA, intptr_t srcoff,
+                                  intptr_t bitsoff, int slot, bool invert)
+{
+    intptr_t dstoff = offsetof(CPUHexagonState, vstore[slot].data);
+    intptr_t maskoff = offsetof(CPUHexagonState, vstore[slot].mask);
+
+    tcg_gen_movi_tl(hex_vstore_pending[slot], 1);
+    tcg_gen_andi_tl(hex_vstore_addr[slot], EA,
+                    ~((int32_t)sizeof(MMVector) - 1));
+    tcg_gen_movi_tl(hex_vstore_size[slot], sizeof(MMVector));
+
+    /* Copy the data to the vstore buffer */
+    tcg_gen_gvec_mov(MO_64, dstoff, srcoff, sizeof(MMVector), sizeof(MMVector));
+    /* Copy the mask */
+    tcg_gen_gvec_mov(MO_64, maskoff, bitsoff, sizeof(MMQReg), sizeof(MMQReg));
+    if (invert) {
+        tcg_gen_gvec_not(MO_64, maskoff, maskoff,
+                         sizeof(MMQReg), sizeof(MMQReg));
+    }
+}
+
+static void vec_to_qvec(size_t size, intptr_t dstoff, intptr_t srcoff)
+{
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 word = tcg_temp_new_i64();
+    TCGv_i64 bits = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_temp_new_i64();
+    TCGv_i64 zero = tcg_constant_i64(0);
+    TCGv_i64 ones = tcg_constant_i64(~0);
+
+    for (int i = 0; i < sizeof(MMVector) / 8; i++) {
+        tcg_gen_ld_i64(tmp, cpu_env, srcoff + i * 8);
+        tcg_gen_movi_i64(mask, 0);
+
+        for (int j = 0; j < 8; j += size) {
+            tcg_gen_extract_i64(word, tmp, j * 8, size * 8);
+            tcg_gen_movcond_i64(TCG_COND_NE, bits, word, zero, ones, zero);
+            tcg_gen_deposit_i64(mask, mask, bits, j, size);
+        }
+
+        tcg_gen_st8_i64(mask, cpu_env, dstoff + i);
+    }
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(word);
+    tcg_temp_free_i64(bits);
+    tcg_temp_free_i64(mask);
+}
+
 #include "tcg_funcs_generated.c.inc"
 #include "tcg_func_table_generated.c.inc"
diff --git a/target/hexagon/imported/allext.idef b/target/hexagon/imported/allext.idef
new file mode 100644
index 0000000..9d4b23e
--- /dev/null
+++ b/target/hexagon/imported/allext.idef
@@ -0,0 +1,25 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Top level file for all instruction set extensions
+ */
+#define EXTNAME mmvec
+#define EXTSTR "mmvec"
+#include "mmvec/ext.idef"
+#undef EXTNAME
+#undef EXTSTR
diff --git a/target/hexagon/imported/allidefs.def b/target/hexagon/imported/allidefs.def
index 2aace29..ee253b8 100644
--- a/target/hexagon/imported/allidefs.def
+++ b/target/hexagon/imported/allidefs.def
@@ -28,3 +28,4 @@
 #include "shift.idef"
 #include "system.idef"
 #include "subinsns.idef"
+#include "allext.idef"
diff --git a/target/hexagon/imported/mmvec/ext.idef b/target/hexagon/imported/mmvec/ext.idef
new file mode 100644
index 0000000..8ca5a60
--- /dev/null
+++ b/target/hexagon/imported/mmvec/ext.idef
@@ -0,0 +1,2606 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/******************************************************************************
+ *
+ *     HOYA: MULTI MEDIA INSTRUCTIONS
+ *
+ ******************************************************************************/
+
+#ifndef EXTINSN
+#define EXTINSN Q6INSN
+#define __SELF_DEF_EXTINSN 1
+#endif
+
+#ifndef NO_MMVEC
+
+#define DO_FOR_EACH_CODE(WIDTH, CODE) \
+{ \
+    fHIDE(int i;) \
+    fVFOREACH(WIDTH, i) {\
+        CODE ;\
+    } \
+}
+
+
+
+
+#define ITERATOR_INSN_ANY_SLOT(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+
+
+#define ITERATOR_INSN2_ANY_SLOT(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_ANY_SLOT(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA_DV),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+
+#define ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+
+#define ITERATOR_INSN_SHIFT_SLOT(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VS),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+
+
+#define ITERATOR_INSN_SHIFT_SLOT_VV_LATE(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VS),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_SHIFT_SLOT(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_SHIFT_SLOT(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define ITERATOR_INSN_PERMUTE_SLOT(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_PERMUTE_SLOT(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_PERMUTE_SLOT(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define ITERATOR_INSN_PERMUTE_SLOT_DEP(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),
+
+
+#define ITERATOR_INSN2_PERMUTE_SLOT_DEP(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_PERMUTE_SLOT_DEP(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP_VS),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC_DEP(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP_VS),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define ITERATOR_INSN_MPY_SLOT(WIDTH,TAG, SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, \
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN_MPY_SLOT_LATE(WIDTH,TAG, SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, \
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_MPY_SLOT(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_MPY_SLOT(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define ITERATOR_INSN2_MPY_SLOT_LATE(WIDTH,TAG, SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_MPY_SLOT_LATE(WIDTH,TAG, SYNTAX2,DESCR,CODE)
+
+
+#define ITERATOR_INSN_MPY_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX_DV),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_MPY_SLOT_DOUBLE_VEC(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+
+
+
+#define ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC2(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX_DV,A_CVI_VX_VSRC0_IS_DST), DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN_SLOT2_DOUBLE_VEC(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX_DV,A_RESTRICT_SLOT2ONLY), DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN_VHISTLIKE(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_4SLOT),  \
+DESCR, fHIDE(mmvector_t input;) input = fTMPVDATA(); DO_FOR_EACH_CODE(WIDTH, CODE))
+
+
+
+
+
+/******************************************************************************************
+*
+* MMVECTOR MEMORY OPERATIONS - NO NAPALI V1
+*
+*******************************************************************************************/
+
+
+
+#define ITERATOR_INSN_MPY_SLOT_DOUBLE_VEC_NOV1(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX_DV),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC_NOV1(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_MPY_SLOT_DOUBLE_VEC_NOV1(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+
+
+#define ITERATOR_INSN_SHIFT_SLOT_NOV1(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VS),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_SHIFT_SLOT_NOV1(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_SHIFT_SLOT_NOV1(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+
+#define ITERATOR_INSN_ANY_SLOT_NOV1(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_ANY_SLOT_NOV1(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_ANY_SLOT_NOV1(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+
+#define ITERATOR_INSN_MPY_SLOT_NOV1(WIDTH,TAG, SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, \
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN_PERMUTE_SLOT_NOV1(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_PERMUTE_SLOT_NOV1(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_PERMUTE_SLOT_NOV1(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define ITERATOR_INSN_PERMUTE_SLOT_DEPT_NOV1(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_PERMUTE_SLOT_DEPT_NOV1(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_PERMUTE_SLOT_DEPT_NOV1(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC_NOV1(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP_VS),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC_DEPT_NOV1(WIDTH,TAG,SYNTAX,DESCR,CODE) \
+EXTINSN(V6_##TAG, SYNTAX, ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP_VS),  \
+DESCR, DO_FOR_EACH_CODE(WIDTH, CODE))
+
+#define ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC_NOV1(WIDTH,TAG,SYNTAX,SYNTAX2,DESCR,CODE) \
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC_NOV1(WIDTH,TAG,SYNTAX2,DESCR,CODE)
+
+#define NARROWING_SHIFT_NOV1(ITERSIZE,TAG,DSTM,DSTTYPE,SRCTYPE,SYNOPTS,SATFUNC,RNDFUNC,SHAMTMASK) \
+ITERATOR_INSN_SHIFT_SLOT_NOV1(ITERSIZE,TAG, \
+"Vd32." #DSTTYPE "=vasr(Vu32." #SRCTYPE ",Vv32." #SRCTYPE ",Rt8)" #SYNOPTS, \
+"Vector shift right and shuffle", \
+    fHIDE(int )shamt = RtV & SHAMTMASK; \
+    DSTM(0,VdV.SRCTYPE[i],SATFUNC(RNDFUNC(VvV.SRCTYPE[i],shamt) >> shamt)); \
+    DSTM(1,VdV.SRCTYPE[i],SATFUNC(RNDFUNC(VuV.SRCTYPE[i],shamt) >> shamt)))
+
+#define MMVEC_AVGS_NOV1(TYPE,TYPE2,DESCR, WIDTH, DEST,SRC)\
+ITERATOR_INSN2_ANY_SLOT_NOV1(WIDTH,vavg##TYPE,                        "Vd32=vavg"TYPE2"(Vu32,Vv32)",          "Vd32."#DEST"=vavg(Vu32."#SRC",Vv32."#SRC")",          "Vector Average "DESCR,                                      VdV.DEST[i]  = fVAVGS(       WIDTH,  VuV.SRC[i], VvV.SRC[i])) \
+ITERATOR_INSN2_ANY_SLOT_NOV1(WIDTH,vavg##TYPE##rnd,                   "Vd32=vavg"TYPE2"(Vu32,Vv32):rnd",      "Vd32."#DEST"=vavg(Vu32."#SRC",Vv32."#SRC"):rnd",      "Vector Average & Round "DESCR,                              VdV.DEST[i]  = fVAVGSRND(    WIDTH,  VuV.SRC[i], VvV.SRC[i])) \
+ITERATOR_INSN2_ANY_SLOT_NOV1(WIDTH,vnavg##TYPE,                       "Vd32=vnavg"TYPE2"(Vu32,Vv32)",         "Vd32."#DEST"=vnavg(Vu32."#SRC",Vv32."#SRC")",         "Vector Negative Average "DESCR,                             VdV.DEST[i]  = fVNAVGS(      WIDTH,  VuV.SRC[i], VvV.SRC[i]))
+
+#define MMVEC_AVGU_NOV1(TYPE,TYPE2,DESCR, WIDTH, DEST,SRC)\
+ITERATOR_INSN2_ANY_SLOT_NOV1(WIDTH,vavg##TYPE,                        "Vd32=vavg"TYPE2"(Vu32,Vv32)",         "Vd32."#DEST"=vavg(Vu32."#SRC",Vv32."#SRC")",        "Vector Average "DESCR,                                      VdV.DEST[i] = fVAVGU(   WIDTH,  VuV.SRC[i], VvV.SRC[i])) \
+ITERATOR_INSN2_ANY_SLOT_NOV1(WIDTH,vavg##TYPE##rnd,                   "Vd32=vavg"TYPE2"(Vu32,Vv32):rnd",     "Vd32."#DEST"=vavg(Vu32."#SRC",Vv32."#SRC"):rnd",    "Vector Average & Round "DESCR,                              VdV.DEST[i] = fVAVGURND(WIDTH,  VuV.SRC[i], VvV.SRC[i]))
+
+
+
+/******************************************************************************************
+*
+* MMVECTOR MEMORY OPERATIONS
+*
+*******************************************************************************************/
+
+#define MMVEC_EACH_EA(TAG,DESCR,ATTRIB,NT,SYNTAXA,SYNTAXB,BEH) \
+EXTINSN(V6_##TAG##_pi,      SYNTAXA "(Rx32++#s3)" NT SYNTAXB,ATTRIB,DESCR,{ fEA_REG(RxV); BEH; fPM_I(RxV,VEC_SCALE(siV)); }) \
+EXTINSN(V6_##TAG##_ai,      SYNTAXA "(Rt32+#s4)" NT SYNTAXB,ATTRIB,DESCR,{ fEA_RI(RtV,VEC_SCALE(siV)); BEH;}) \
+EXTINSN(V6_##TAG##_ppu,      SYNTAXA "(Rx32++Mu2)" NT SYNTAXB,ATTRIB,DESCR,{ fEA_REG(RxV); BEH; fPM_M(RxV,MuV); }) \
+
+
+#define MMVEC_COND_EACH_EA_TRUE(TAG,DESCR,ATTRIB,NT,SYNTAXA,SYNTAXB,SYNTAXP,BEH) \
+EXTINSN(V6_##TAG##_pred_pi,      "if (" #SYNTAXP "4) " SYNTAXA "(Rx32++#s3)" NT SYNTAXB, ATTRIB,DESCR, { if (fLSBOLD(SYNTAXP##V)) { fEA_REG(RxV); BEH; fPM_I(RxV,siV*fVECSIZE()); } else {CANCEL;}}) \
+EXTINSN(V6_##TAG##_pred_ai,      "if (" #SYNTAXP "4) " SYNTAXA "(Rt32+#s4)" NT SYNTAXB, ATTRIB,DESCR,  { if (fLSBOLD(SYNTAXP##V)) { fEA_RI(RtV,siV*fVECSIZE()); BEH;} else {CANCEL;}}) \
+EXTINSN(V6_##TAG##_pred_ppu,     "if (" #SYNTAXP "4) " SYNTAXA "(Rx32++Mu2)" NT SYNTAXB,ATTRIB,DESCR,  { if (fLSBOLD(SYNTAXP##V)) { fEA_REG(RxV); BEH; fPM_M(RxV,MuV); } else {CANCEL;}}) \
+
+#define MMVEC_COND_EACH_EA_FALSE(TAG,DESCR,ATTRIB,NT,SYNTAXA,SYNTAXB,SYNTAXP,BEH) \
+EXTINSN(V6_##TAG##_npred_pi,     "if (!" #SYNTAXP "4) " SYNTAXA "(Rx32++#s3)" NT SYNTAXB,ATTRIB,DESCR,{ if (fLSBOLDNOT(SYNTAXP##V)) { fEA_REG(RxV); BEH; fPM_I(RxV,siV*fVECSIZE()); } else {CANCEL;}}) \
+EXTINSN(V6_##TAG##_npred_ai,     "if (!" #SYNTAXP "4) " SYNTAXA "(Rt32+#s4)" NT SYNTAXB,ATTRIB,DESCR, { if (fLSBOLDNOT(SYNTAXP##V)) { fEA_RI(RtV,siV*fVECSIZE()); BEH;} else {CANCEL;}}) \
+EXTINSN(V6_##TAG##_npred_ppu,    "if (!" #SYNTAXP "4) " SYNTAXA "(Rx32++Mu2)" NT SYNTAXB,ATTRIB,DESCR,{ if (fLSBOLDNOT(SYNTAXP##V)) { fEA_REG(RxV); BEH; fPM_M(RxV,MuV); } else {CANCEL;}})
+
+#define MMVEC_COND_EACH_EA(TAG,DESCR,ATTRIB,NT,SYNTAXA,SYNTAXB,SYNTAXP,BEH) \
+MMVEC_COND_EACH_EA_TRUE(TAG,DESCR,ATTRIB,NT,SYNTAXA,SYNTAXB,SYNTAXP,BEH) \
+MMVEC_COND_EACH_EA_FALSE(TAG,DESCR,ATTRIB,NT,SYNTAXA,SYNTAXB,SYNTAXP,BEH)
+
+
+#define VEC_SCALE(X) X*fVECSIZE()
+
+
+#define MMVEC_LD(TAG,DESCR,ATTRIB,NT) MMVEC_EACH_EA(TAG,DESCR,ATTRIB,NT,"Vd32=vmem","",fLOADMMV(EA,VdV))
+#define MMVEC_LDC(TAG,DESCR,ATTRIB,NT) MMVEC_EACH_EA(TAG##_cur,DESCR,ATTRIB,NT,"Vd32.cur=vmem","",fLOADMMV(EA,VdV))
+#define MMVEC_LDT(TAG,DESCR,ATTRIB,NT) MMVEC_EACH_EA(TAG##_tmp,DESCR,ATTRIB,NT,"Vd32.tmp=vmem","",fLOADMMV(EA,VdV))
+#define MMVEC_LDU(TAG,DESCR,ATTRIB,NT) MMVEC_EACH_EA(TAG,DESCR,ATTRIB,NT,"Vd32=vmemu","",fLOADMMVU(EA,VdV))
+
+
+#define MMVEC_STQ(TAG,DESCR,ATTRIB,NT) \
+MMVEC_EACH_EA(TAG##_qpred,DESCR,ATTRIB,NT,"if (Qv4) vmem","=Vs32",fSTOREMMVQ(EA,VsV,QvV)) \
+MMVEC_EACH_EA(TAG##_nqpred,DESCR,ATTRIB,NT,"if (!Qv4) vmem","=Vs32",fSTOREMMVNQ(EA,VsV,QvV))
+
+/****************************************************************
+* MAPPING FOR VMEMs
+****************************************************************/
+
+#define ATTR_VMEM A_EXTENSION,A_CVI,A_CVI_VM
+#define ATTR_VMEMU A_EXTENSION,A_CVI,A_CVI_VM,A_CVI_VP
+
+
+MMVEC_LD(vL32b,  "Aligned Vector Load",        ATTRIBS(ATTR_VMEM,A_LOAD,A_CVI_VA),)
+MMVEC_LDC(vL32b,  "Aligned Vector Load Cur",	ATTRIBS(ATTR_VMEM,A_LOAD,A_CVI_NEW,A_CVI_VA),)
+MMVEC_LDT(vL32b,  "Aligned Vector Load Tmp",	ATTRIBS(ATTR_VMEM,A_LOAD,A_CVI_TMP),)
+
+MMVEC_COND_EACH_EA(vL32b,"Conditional Aligned Vector Load",ATTRIBS(ATTR_VMEM,A_LOAD,A_CVI_VA),,"Vd32=vmem",,Pv,fLOADMMV(EA,VdV);)
+MMVEC_COND_EACH_EA(vL32b_cur,"Conditional Aligned Vector Load Cur",ATTRIBS(ATTR_VMEM,A_LOAD,A_CVI_VA,A_CVI_NEW),,"Vd32.cur=vmem",,Pv,fLOADMMV(EA,VdV);)
+MMVEC_COND_EACH_EA(vL32b_tmp,"Conditional Aligned Vector Load Tmp",ATTRIBS(ATTR_VMEM,A_LOAD,A_CVI_TMP),,"Vd32.tmp=vmem",,Pv,fLOADMMV(EA,VdV);)
+
+MMVEC_EACH_EA(vS32b,"Aligned Vector Store",ATTRIBS(ATTR_VMEM,A_STORE,A_RESTRICT_SLOT0ONLY,A_CVI_VA),,"vmem","=Vs32",fSTOREMMV(EA,VsV))
+MMVEC_COND_EACH_EA(vS32b,"Aligned Vector Store",ATTRIBS(ATTR_VMEM,A_STORE,A_RESTRICT_SLOT0ONLY,A_CVI_VA),,"vmem","=Vs32",Pv,fSTOREMMV(EA,VsV))
+
+
+MMVEC_STQ(vS32b,  "Aligned Vector Store",      ATTRIBS(ATTR_VMEM,A_STORE,A_RESTRICT_SLOT0ONLY,A_CVI_VA),)
+
+MMVEC_LDU(vL32Ub, "Unaligned Vector Load",     ATTRIBS(ATTR_VMEMU,A_LOAD,A_RESTRICT_NOSLOT1),)
+
+MMVEC_EACH_EA(vS32Ub,"Unaligned Vector Store",ATTRIBS(ATTR_VMEMU,A_STORE,A_RESTRICT_NOSLOT1),,"vmemu","=Vs32",fSTOREMMVU(EA,VsV))
+
+MMVEC_COND_EACH_EA(vS32Ub,"Unaligned Vector Store",ATTRIBS(ATTR_VMEMU,A_STORE,A_RESTRICT_NOSLOT1),,"vmemu","=Vs32",Pv,fSTOREMMVU(EA,VsV))
+
+MMVEC_EACH_EA(vS32b_new,"Aligned Vector Store New",ATTRIBS(ATTR_VMEM,A_STORE,A_CVI_NEW,A_DOTNEWVALUE,A_RESTRICT_SLOT0ONLY),,"vmem","=Os8.new",fSTOREMMV(EA,fNEWVREG(OsN)))
+
+// V65 store release, zero byte store
+MMVEC_EACH_EA(vS32b_srls,"Aligned Vector Scatter Release",ATTRIBS(ATTR_VMEM,A_STORE,A_CVI_SCATTER_RELEASE,A_CVI_NEW,A_RESTRICT_SLOT0ONLY),,"vmem",":scatter_release",fSTORERELEASE(EA,0))
+
+
+
+MMVEC_COND_EACH_EA(vS32b_new,"Aligned Vector Store New",ATTRIBS(ATTR_VMEM,A_STORE,A_CVI_NEW,A_DOTNEWVALUE,A_RESTRICT_SLOT0ONLY),,"vmem","=Os8.new",Pv,fSTOREMMV(EA,fNEWVREG(OsN)))
+
+
+/******************************************************************************************
+*
+* MMVECTOR MEMORY OPERATIONS - NON TEMPORAL
+*
+*******************************************************************************************/
+
+#define ATTR_VMEM_NT A_EXTENSION,A_CVI,A_CVI_VM
+
+MMVEC_EACH_EA(vS32b_nt,"Aligned Vector Store - Non temporal",ATTRIBS(ATTR_VMEM_NT,A_STORE,A_RESTRICT_SLOT0ONLY,A_CVI_VA),":nt","vmem","=Vs32",fSTOREMMV(EA,VsV))
+MMVEC_COND_EACH_EA(vS32b_nt,"Aligned Vector Store - Non temporal",ATTRIBS(ATTR_VMEM_NT,A_STORE,A_RESTRICT_SLOT0ONLY,A_CVI_VA),":nt","vmem","=Vs32",Pv,fSTOREMMV(EA,VsV))
+
+MMVEC_EACH_EA(vS32b_nt_new,"Aligned Vector Store New - Non temporal",ATTRIBS(ATTR_VMEM_NT,A_STORE,A_CVI_NEW,A_DOTNEWVALUE,A_RESTRICT_SLOT0ONLY),":nt","vmem","=Os8.new",fSTOREMMV(EA,fNEWVREG(OsN)))
+MMVEC_COND_EACH_EA(vS32b_nt_new,"Aligned Vector Store New - Non temporal",ATTRIBS(ATTR_VMEM_NT,A_STORE,A_CVI_NEW,A_DOTNEWVALUE,A_RESTRICT_SLOT0ONLY),":nt","vmem","=Os8.new",Pv,fSTOREMMV(EA,fNEWVREG(OsN)))
+
+
+MMVEC_STQ(vS32b_nt,  "Aligned Vector Store - Non temporal",      ATTRIBS(ATTR_VMEM_NT,A_STORE,A_RESTRICT_SLOT0ONLY,A_CVI_VA),":nt")
+
+MMVEC_LD(vL32b_nt,  "Aligned Vector Load - Non temporal",       ATTRIBS(ATTR_VMEM_NT,A_LOAD,A_CVI_VA),":nt")
+MMVEC_LDC(vL32b_nt,  "Aligned Vector Load Cur - Non temporal",	ATTRIBS(ATTR_VMEM_NT,A_LOAD,A_CVI_NEW,A_CVI_VA),":nt")
+MMVEC_LDT(vL32b_nt,  "Aligned Vector Load Tmp - Non temporal",	ATTRIBS(ATTR_VMEM_NT,A_LOAD,A_CVI_TMP),":nt")
+
+MMVEC_COND_EACH_EA(vL32b_nt,"Conditional Aligned Vector Load",ATTRIBS(ATTR_VMEM_NT,A_CVI_VA),,"Vd32=vmem",":nt",Pv,fLOADMMV(EA,VdV);)
+MMVEC_COND_EACH_EA(vL32b_nt_cur,"Conditional Aligned Vector Load Cur",ATTRIBS(ATTR_VMEM_NT,A_CVI_VA,A_CVI_NEW),,"Vd32.cur=vmem",":nt",Pv,fLOADMMV(EA,VdV);)
+MMVEC_COND_EACH_EA(vL32b_nt_tmp,"Conditional Aligned Vector Load Tmp",ATTRIBS(ATTR_VMEM_NT,A_CVI_TMP),,"Vd32.tmp=vmem",":nt",Pv,fLOADMMV(EA,VdV);)
+
+
+#undef VEC_SCALE
+
+
+/***************************************************
+ * Vector Alignment
+ ************************************************/
+
+#define VALIGNB(SHIFT)  \
+    fHIDE(int i;) \
+    for(i = 0; i < fVBYTES(); i++) {\
+        VdV.ub[i] = (i+SHIFT>=fVBYTES()) ? VuV.ub[i+SHIFT-fVBYTES()] : VvV.ub[i+SHIFT];\
+	}
+
+EXTINSN(V6_valignb,  "Vd32=valign(Vu32,Vv32,Rt8)",  ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),"Align Two vectors by Rt8 as control",
+{
+	unsigned shift = RtV & (fVBYTES()-1);
+	VALIGNB(shift)
+})
+EXTINSN(V6_vlalignb, "Vd32=vlalign(Vu32,Vv32,Rt8)", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),"Align Two vectors by Rt8 as control",
+{
+	unsigned shift = fVBYTES() - (RtV & (fVBYTES()-1));
+	VALIGNB(shift)
+})
+EXTINSN(V6_valignbi, "Vd32=valign(Vu32,Vv32,#u3)", 	ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),"Align Two vectors by #u3 as control",
+{
+	VALIGNB(uiV)
+})
+EXTINSN(V6_vlalignbi,"Vd32=vlalign(Vu32,Vv32,#u3)", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),"Align Two vectors by #u3 as control",
+{
+	unsigned shift = fVBYTES() - uiV;
+	VALIGNB(shift)
+})
+
+EXTINSN(V6_vror, "Vd32=vror(Vu32,Rt32)", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),
+"Rotate vector right by Rt32 bytes",
+{
+	fHIDE(int k;)
+	for (k=0;k<fVBYTES();k++) {
+		VdV.ub[k] = VuV.ub[(k+RtV)&(fVBYTES()-1)];
+	}
+	})
+
+
+
+
+
+
+
+/**************************************************************
+* Unpack elements with zero/sign extend and cross lane permute
+***************************************************************/
+
+ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC(8,vunpackub,  "Vdd32=vunpackub(Vu32)", "Vdd32.uh=vunpack(Vu32.ub)", "Unpack byte with zero-extend",     fVARRAY_ELEMENT_ACCESS(VddV, uh, i)  = fZE8_16( VuV.ub[i]))
+ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC(8,vunpackb,   "Vdd32=vunpackb(Vu32)",  "Vdd32.h=vunpack(Vu32.b)",   "Unpack bytes with sign-extend",    fVARRAY_ELEMENT_ACCESS(VddV, h,  i)  = fSE8_16( VuV.b[i] ))
+ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC(16,vunpackuh, "Vdd32=vunpackuh(Vu32)", "Vdd32.uw=vunpack(Vu32.uh)", "Unpack halves with zero-extend",   fVARRAY_ELEMENT_ACCESS(VddV, uw, i)  = fZE16_32(VuV.uh[i]))
+ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC(16,vunpackh,  "Vdd32=vunpackh(Vu32)",  "Vdd32.w=vunpack(Vu32.h)",   "Unpack halves with sign-extend",   fVARRAY_ELEMENT_ACCESS(VddV, w,  i)  = fSE16_32(VuV.h[i] ))
+
+ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC(8, vunpackob, "Vxx32|=vunpackob(Vu32)", "Vxx32.h|=vunpacko(Vu32.b)", "Unpack byte to odd bytes ",       fVARRAY_ELEMENT_ACCESS(VxxV, uh, i) |= fZE8_16( VuV.ub[i])<<8)
+ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC(16,vunpackoh, "Vxx32|=vunpackoh(Vu32)", "Vxx32.w|=vunpacko(Vu32.h)", "Unpack halves to odd halves",     fVARRAY_ELEMENT_ACCESS(VxxV, uw, i) |= fZE16_32(VuV.uh[i])<<16)
+
+
+/**************************************************************
+* Pack elements and cross lane permute
+***************************************************************/
+
+ ITERATOR_INSN2_PERMUTE_SLOT(16, vpackeb,  "Vd32=vpackeb(Vu32,Vv32)", "Vd32.b=vpacke(Vu32.h,Vv32.h)",
+ "Pack even bytes",
+    VdV.ub[i]               = fGETUBYTE(0, VvV.uh[i]);
+    VdV.ub[i+fVELEM(16)]    = fGETUBYTE(0, VuV.uh[i]))
+
+ ITERATOR_INSN2_PERMUTE_SLOT(32, vpackeh,  "Vd32=vpackeh(Vu32,Vv32)", "Vd32.h=vpacke(Vu32.w,Vv32.w)",
+ "Pack even halfwords",
+    VdV.uh[i]               = fGETUHALF(0, VvV.uw[i]);
+    VdV.uh[i+fVELEM(32)]    = fGETUHALF(0, VuV.uw[i]))
+
+  ITERATOR_INSN2_PERMUTE_SLOT(16, vpackob,  "Vd32=vpackob(Vu32,Vv32)", "Vd32.b=vpacko(Vu32.h,Vv32.h)",
+ "Pack odd bytes",
+    VdV.ub[i]               = fGETUBYTE(1, VvV.uh[i]);
+    VdV.ub[i+fVELEM(16)]    = fGETUBYTE(1, VuV.uh[i]))
+
+ ITERATOR_INSN2_PERMUTE_SLOT(32, vpackoh,  "Vd32=vpackoh(Vu32,Vv32)", "Vd32.h=vpacko(Vu32.w,Vv32.w)",
+ "Pack odd halfwords",
+    VdV.uh[i]               = fGETUHALF(1, VvV.uw[i]);
+    VdV.uh[i+fVELEM(32)]    = fGETUHALF(1, VuV.uw[i]))
+
+
+
+ITERATOR_INSN2_PERMUTE_SLOT(16, vpackhub_sat,  "Vd32=vpackhub(Vu32,Vv32):sat", "Vd32.ub=vpack(Vu32.h,Vv32.h):sat",
+ "Pack ubytes with saturation",
+    VdV.ub[i]               = fVSATUB(VvV.h[i]);
+    VdV.ub[i+fVELEM(16)]    = fVSATUB(VuV.h[i]))
+
+
+ITERATOR_INSN2_PERMUTE_SLOT(16, vpackhb_sat,  "Vd32=vpackhb(Vu32,Vv32):sat", "Vd32.b=vpack(Vu32.h,Vv32.h):sat",
+ "Pack bytes with saturation",
+    VdV.b[i]               = fVSATB(VvV.h[i]);
+    VdV.b[i+fVELEM(16)]    = fVSATB(VuV.h[i]))
+
+
+ITERATOR_INSN2_PERMUTE_SLOT(32, vpackwuh_sat,  "Vd32=vpackwuh(Vu32,Vv32):sat", "Vd32.uh=vpack(Vu32.w,Vv32.w):sat",
+ "Pack uhalfwords with saturation",
+    VdV.uh[i]               = fVSATUH(VvV.w[i]);
+    VdV.uh[i+fVELEM(32)]    = fVSATUH(VuV.w[i]))
+
+ITERATOR_INSN2_PERMUTE_SLOT(32, vpackwh_sat,  "Vd32=vpackwh(Vu32,Vv32):sat", "Vd32.h=vpack(Vu32.w,Vv32.w):sat",
+ "Pack halfwords with saturation",
+    VdV.h[i]               = fVSATH(VvV.w[i]);
+    VdV.h[i+fVELEM(32)]    = fVSATH(VuV.w[i]))
+
+
+
+
+
+/**************************************************************
+* Zero/Sign Extend with in-lane permute
+***************************************************************/
+
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(16,vzb,"Vdd32=vzxtb(Vu32)","Vdd32.uh=vzxt(Vu32.ub)",
+"Vector Zero Extend Bytes",
+    VddV.v[0].uh[i] = fZE8_16(fGETUBYTE(0, VuV.uh[i]));
+    VddV.v[1].uh[i] = fZE8_16(fGETUBYTE(1, VuV.uh[i])))
+
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(16,vsb,"Vdd32=vsxtb(Vu32)","Vdd32.h=vsxt(Vu32.b)",
+"Vector Sign Extend Bytes",
+    VddV.v[0].h[i] = fSE8_16(fGETBYTE(0, VuV.h[i]));
+    VddV.v[1].h[i] = fSE8_16(fGETBYTE(1, VuV.h[i])))
+
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(32,vzh,"Vdd32=vzxth(Vu32)","Vdd32.uw=vzxt(Vu32.uh)",
+"Vector Zero Extend halfwords",
+    VddV.v[0].uw[i] = fZE16_32(fGETUHALF(0, VuV.uw[i]));
+    VddV.v[1].uw[i] = fZE16_32(fGETUHALF(1, VuV.uw[i])))
+
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(32,vsh,"Vdd32=vsxth(Vu32)","Vdd32.w=vsxt(Vu32.h)",
+"Vector Sign Extend halfwords",
+    VddV.v[0].w[i] = fSE16_32(fGETHALF(0, VuV.w[i]));
+    VddV.v[1].w[i] = fSE16_32(fGETHALF(1, VuV.w[i])))
+
+
+/**********************************************************************
+*
+*
+*
+*               MMVECTOR REDUCTION
+*
+*
+*
+**********************************************************************/
+
+/********************************************
+*  2-WAY REDUCTION - UNSIGNED BYTE BY BYTE
+********************************************/
+
+
+ITERATOR_INSN2_MPY_SLOT(16,vdmpybus,"Vd32=vdmpybus(Vu32,Rt32)","Vd32.h=vdmpy(Vu32.ub,Rt32.b)",
+"Vector Dual Multiply-Accumulates unsigned bytes by bytes",
+    VdV.h[i]   = fMPY8US( fGETUBYTE(0, VuV.uh[i]), fGETBYTE((2*i) % 4, RtV));
+    VdV.h[i]  += fMPY8US( fGETUBYTE(1, VuV.uh[i]), fGETBYTE((2*i+1)%4, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT(16,vdmpybus_acc,"Vx32+=vdmpybus(Vu32,Rt32)","Vx32.h+=vdmpy(Vu32.ub,Rt32.b)",
+"Vector Dual Multiply-Accumulates unsigned bytes by bytes, and accumulate",
+    VxV.h[i] += fMPY8US( fGETUBYTE(0, VuV.uh[i]), fGETBYTE((2*i) % 4, RtV));
+    VxV.h[i] += fMPY8US( fGETUBYTE(1, VuV.uh[i]), fGETBYTE((2*i+1)%4, RtV)))
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vdmpybus_dv,"Vdd32=vdmpybus(Vuu32,Rt32)","Vdd32.h=vdmpy(Vuu32.ub,Rt32.b)",
+"Vector Dual Multiply-Accumulates unsigned bytes by bytes, Sliding Window Reduction",
+    VddV.v[0].h[i]  = fMPY8US(fGETUBYTE(0, VuuV.v[0].uh[i]),fGETBYTE((2*i) % 4, RtV));
+    VddV.v[0].h[i] += fMPY8US(fGETUBYTE(1, VuuV.v[0].uh[i]),fGETBYTE((2*i+1)%4, RtV));
+
+    VddV.v[1].h[i]  = fMPY8US(fGETUBYTE(1, VuuV.v[0].uh[i]),fGETBYTE((2*i) % 4, RtV));
+    VddV.v[1].h[i] += fMPY8US(fGETUBYTE(0, VuuV.v[1].uh[i]),fGETBYTE((2*i+1)%4, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vdmpybus_dv_acc,"Vxx32+=vdmpybus(Vuu32,Rt32)","Vxx32.h+=vdmpy(Vuu32.ub,Rt32.b)",
+"Vector Dual Multiply-Accumulates unsigned bytes by bytes, and accumulate, Sliding Window Reduction",
+    VxxV.v[0].h[i] += fMPY8US(fGETUBYTE(0, VuuV.v[0].uh[i]),fGETBYTE((2*i) % 4, RtV));
+    VxxV.v[0].h[i] += fMPY8US(fGETUBYTE(1, VuuV.v[0].uh[i]),fGETBYTE((2*i+1)%4, RtV));
+
+    VxxV.v[1].h[i] += fMPY8US(fGETUBYTE(1, VuuV.v[0].uh[i]),fGETBYTE((2*i) % 4, RtV));
+    VxxV.v[1].h[i] += fMPY8US(fGETUBYTE(0, VuuV.v[1].uh[i]),fGETBYTE((2*i+1)%4, RtV)))
+
+
+
+/********************************************
+*  2-WAY REDUCTION - HALF BY BYTE
+********************************************/
+ITERATOR_INSN2_MPY_SLOT(32,vdmpyhb,"Vd32=vdmpyhb(Vu32,Rt32)","Vd32.w=vdmpy(Vu32.h,Rt32.b)",
+"Dual-Vector 2-Element Half x Byte Reduction with Sliding Window Overlap",
+    VdV.w[i]  = fMPY16SS(fGETHALF(0, VuV.w[i]),fGETBYTE((2*i+0)%4, RtV));
+    VdV.w[i] += fMPY16SS(fGETHALF(1, VuV.w[i]),fGETBYTE((2*i+1)%4, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT(32,vdmpyhb_acc,"Vx32+=vdmpyhb(Vu32,Rt32)","Vx32.w+=vdmpy(Vu32.h,Rt32.b)",
+"Dual-Vector 2-Element Half x Byte Reduction with Sliding Window Overlap",
+    VxV.w[i] += fMPY16SS(fGETHALF(0, VuV.w[i]),fGETBYTE((2*i+0)%4, RtV));
+    VxV.w[i] += fMPY16SS(fGETHALF(1, VuV.w[i]),fGETBYTE((2*i+1)%4, RtV)))
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhb_dv,"Vdd32=vdmpyhb(Vuu32,Rt32)","Vdd32.w=vdmpy(Vuu32.h,Rt32.b)",
+"Dual-Vector 2-Element Half x Byte Reduction with Sliding Window Overlap",
+    VddV.v[0].w[i]  = fMPY16SS(fGETHALF(0, VuuV.v[0].w[i]),fGETBYTE((2*i+0)%4, RtV));
+    VddV.v[0].w[i] += fMPY16SS(fGETHALF(1, VuuV.v[0].w[i]),fGETBYTE((2*i+1)%4, RtV));
+
+    VddV.v[1].w[i]  = fMPY16SS(fGETHALF(1, VuuV.v[0].w[i]),fGETBYTE((2*i+0)%4, RtV));
+    VddV.v[1].w[i] += fMPY16SS(fGETHALF(0, VuuV.v[1].w[i]),fGETBYTE((2*i+1)%4, RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhb_dv_acc,"Vxx32+=vdmpyhb(Vuu32,Rt32)","Vxx32.w+=vdmpy(Vuu32.h,Rt32.b)",
+"Dual-Vector 2-Element Half x Byte Reduction with Sliding Window Overlap",
+    VxxV.v[0].w[i] += fMPY16SS(fGETHALF(0, VuuV.v[0].w[i]),fGETBYTE((2*i+0)%4, RtV));
+    VxxV.v[0].w[i] += fMPY16SS(fGETHALF(1, VuuV.v[0].w[i]),fGETBYTE((2*i+1)%4, RtV));
+
+    VxxV.v[1].w[i] += fMPY16SS(fGETHALF(1, VuuV.v[0].w[i]),fGETBYTE((2*i+0)%4, RtV));
+    VxxV.v[1].w[i] += fMPY16SS(fGETHALF(0, VuuV.v[1].w[i]),fGETBYTE((2*i+1)%4, RtV)))
+
+
+
+
+
+/********************************************
+*  2-WAY REDUCTION - HALF BY HALF
+********************************************/
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhvsat,"Vd32=vdmpyh(Vu32,Vv32):sat","Vd32.w=vdmpy(Vu32.h,Vv32.h):sat",
+"Vector halfword multiply, accumulate pairs, sat to word",
+    fHIDE(size8s_t accum;)
+    accum    = fMPY16SS(fGETHALF(0,VuV.w[i]),fGETHALF(0, VvV.w[i]));
+    accum   += fMPY16SS(fGETHALF(1,VuV.w[i]),fGETHALF(1, VvV.w[i]));
+    VdV.w[i] = fVSATW(accum))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhvsat_acc,"Vx32+=vdmpyh(Vu32,Vv32):sat","Vx32.w+=vdmpy(Vu32.h,Vv32.h):sat",
+"Vector halfword multiply, accumulate pairs, sat to word",
+    fHIDE(size8s_t accum;)
+    accum    = fMPY16SS(fGETHALF(0,VuV.w[i]),fGETHALF(0, VvV.w[i]));
+    accum   += fMPY16SS(fGETHALF(1,VuV.w[i]),fGETHALF(1, VvV.w[i]));
+    VxV.w[i] = fVSATW(VxV.w[i]+accum))
+
+
+/* VDMPYH */
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhsat,"Vd32=vdmpyh(Vu32,Rt32):sat","Vd32.w=vdmpy(Vu32.h,Rt32.h):sat",
+"Vector halfword multiply, accumulate pairs, saturate to word",
+    fHIDE(size8s_t accum;)
+    accum    = fMPY16SS(fGETHALF(0, VuV.w[i]),fGETHALF(0, RtV));
+    accum   += fMPY16SS(fGETHALF(1, VuV.w[i]),fGETHALF(1, RtV));
+    VdV.w[i] = fVSATW(accum))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhsat_acc,"Vx32+=vdmpyh(Vu32,Rt32):sat","Vx32.w+=vdmpy(Vu32.h,Rt32.h):sat",
+"Vector halfword multiply, accumulate pairs, saturate to word",
+    fHIDE(size8s_t) accum = VxV.w[i];
+    accum   += fMPY16SS(fGETHALF(0, VuV.w[i]),fGETHALF(0, RtV));
+    accum   += fMPY16SS(fGETHALF(1, VuV.w[i]),fGETHALF(1, RtV));
+    VxV.w[i] = fVSATW(accum))
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhisat,"Vd32=vdmpyh(Vuu32,Rt32):sat","Vd32.w=vdmpy(Vuu32.h,Rt32.h):sat",
+"Dual Vector Signed Halfword by Signed Halfword 2-Way Reduction to Word with saturation",
+    fHIDE(size8s_t accum;)
+    accum    = fMPY16SS(fGETHALF(1,VuuV.v[0].w[i]),fGETHALF(0,RtV));
+    accum   += fMPY16SS(fGETHALF(0,VuuV.v[1].w[i]),fGETHALF(1,RtV));
+    VdV.w[i] = fVSATW(accum))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhisat_acc,"Vx32+=vdmpyh(Vuu32,Rt32):sat","Vx32.w+=vdmpy(Vuu32.h,Rt32.h):sat",
+"Dual Vector Signed Halfword by Signed Halfword 2-Way Reduction to Word with accumulation and saturation",
+    fHIDE(size8s_t) accum = VxV.w[i];
+    accum   += fMPY16SS(fGETHALF(1,VuuV.v[0].w[i]),fGETHALF(0,RtV));
+    accum   += fMPY16SS(fGETHALF(0,VuuV.v[1].w[i]),fGETHALF(1,RtV));
+    VxV.w[i] = fVSATW(accum))
+
+
+
+
+
+
+
+/* VDMPYHSU */
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhsusat,"Vd32=vdmpyhsu(Vu32,Rt32):sat","Vd32.w=vdmpy(Vu32.h,Rt32.uh):sat",
+"Vector halfword multiply, accumulate pairs, saturate to word",
+    fHIDE(size8s_t accum;)
+    accum    = fMPY16SU(fGETHALF(0, VuV.w[i]),fGETUHALF(0, RtV));
+    accum   += fMPY16SU(fGETHALF(1, VuV.w[i]),fGETUHALF(1, RtV));
+    VdV.w[i] = fVSATW(accum))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhsusat_acc,"Vx32+=vdmpyhsu(Vu32,Rt32):sat","Vx32.w+=vdmpy(Vu32.h,Rt32.uh):sat",
+"Vector halfword multiply, accumulate pairs, saturate to word",
+    fHIDE(size8s_t) accum=VxV.w[i];
+    accum   += fMPY16SU(fGETHALF(0, VuV.w[i]),fGETUHALF(0, RtV));
+    accum   += fMPY16SU(fGETHALF(1, VuV.w[i]),fGETUHALF(1, RtV));
+    VxV.w[i] = fVSATW(accum))
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhsuisat,"Vd32=vdmpyhsu(Vuu32,Rt32,#1):sat","Vd32.w=vdmpy(Vuu32.h,Rt32.uh,#1):sat",
+"Dual Vector Signed Halfword by Unsigned Halfword 2-Way Reduction to Word with saturation",
+    fHIDE(size8s_t accum;)
+    accum    = fMPY16SU(fGETHALF(1,VuuV.v[0].w[i]),fGETUHALF(0,RtV));
+    accum   += fMPY16SU(fGETHALF(0,VuuV.v[1].w[i]),fGETUHALF(1,RtV));
+    VdV.w[i] = fVSATW(accum))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdmpyhsuisat_acc,"Vx32+=vdmpyhsu(Vuu32,Rt32,#1):sat","Vx32.w+=vdmpy(Vuu32.h,Rt32.uh,#1):sat",
+"Dual Vector Signed Halfword by Unsigned Halfword 2-Way Reduction to Word with accumulation and saturation",
+    fHIDE(size8s_t) accum=VxV.w[i];
+    accum   += fMPY16SU(fGETHALF(1, VuuV.v[0].w[i]),fGETUHALF(0,RtV));
+    accum   += fMPY16SU(fGETHALF(0, VuuV.v[1].w[i]),fGETUHALF(1,RtV));
+    VxV.w[i] = fVSATW(accum))
+
+
+
+/********************************************
+*  3-WAY REDUCTION - UNSIGNED BYTE BY BYTE
+********************************************/
+
+ ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vtmpyb, "Vdd32=vtmpyb(Vuu32,Rt32)", "Vdd32.h=vtmpy(Vuu32.b,Rt32.b)",
+"Dual Vector 3x1 Reduction",
+    VddV.v[0].h[i]  = fMPY8SS(fGETBYTE(0,VuuV.v[0].h[i]), fGETBYTE((2*i  )%4, RtV));
+    VddV.v[0].h[i] += fMPY8SS(fGETBYTE(1,VuuV.v[0].h[i]), fGETBYTE((2*i+1)%4, RtV));
+    VddV.v[0].h[i] += fGETBYTE(0,VuuV.v[1].h[i]);
+
+    VddV.v[1].h[i]  = fMPY8SS(fGETBYTE(1,VuuV.v[0].h[i]), fGETBYTE((2*i  )%4, RtV));
+    VddV.v[1].h[i] += fMPY8SS(fGETBYTE(0,VuuV.v[1].h[i]), fGETBYTE((2*i+1)%4, RtV));
+    VddV.v[1].h[i] += fGETBYTE(1,VuuV.v[1].h[i]))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vtmpyb_acc, "Vxx32+=vtmpyb(Vuu32,Rt32)", "Vxx32.h+=vtmpy(Vuu32.b,Rt32.b)",
+"Dual Vector 3x1 Reduction",
+    VxxV.v[0].h[i] += fMPY8SS(fGETBYTE(0,VuuV.v[0].h[i]), fGETBYTE((2*i  )%4, RtV));
+    VxxV.v[0].h[i] += fMPY8SS(fGETBYTE(1,VuuV.v[0].h[i]), fGETBYTE((2*i+1)%4, RtV));
+    VxxV.v[0].h[i] += fGETBYTE(0,VuuV.v[1].h[i]);
+
+    VxxV.v[1].h[i] += fMPY8SS(fGETBYTE(1,VuuV.v[0].h[i]), fGETBYTE((2*i  )%4, RtV));
+    VxxV.v[1].h[i] += fMPY8SS(fGETBYTE(0,VuuV.v[1].h[i]), fGETBYTE((2*i+1)%4, RtV));
+    VxxV.v[1].h[i] += fGETBYTE(1,VuuV.v[1].h[i]))
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vtmpybus, "Vdd32=vtmpybus(Vuu32,Rt32)", "Vdd32.h=vtmpy(Vuu32.ub,Rt32.b)",
+"Dual Vector 3x1 Reduction",
+    VddV.v[0].h[i]  = fMPY8US(fGETUBYTE(0,VuuV.v[0].uh[i]), fGETBYTE((2*i  )%4, RtV));
+    VddV.v[0].h[i] += fMPY8US(fGETUBYTE(1,VuuV.v[0].uh[i]), fGETBYTE((2*i+1)%4, RtV));
+    VddV.v[0].h[i] += fGETUBYTE(0,VuuV.v[1].uh[i]);
+
+    VddV.v[1].h[i]  = fMPY8US(fGETUBYTE(1,VuuV.v[0].uh[i]), fGETBYTE((2*i  )%4, RtV));
+    VddV.v[1].h[i] += fMPY8US(fGETUBYTE(0,VuuV.v[1].uh[i]), fGETBYTE((2*i+1)%4, RtV));
+    VddV.v[1].h[i] += fGETUBYTE(1,VuuV.v[1].uh[i]))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vtmpybus_acc, "Vxx32+=vtmpybus(Vuu32,Rt32)", "Vxx32.h+=vtmpy(Vuu32.ub,Rt32.b)",
+"Dual Vector 3x1 Reduction",
+    VxxV.v[0].h[i] += fMPY8US(fGETUBYTE(0,VuuV.v[0].uh[i]), fGETBYTE((2*i  )%4, RtV));
+    VxxV.v[0].h[i] += fMPY8US(fGETUBYTE(1,VuuV.v[0].uh[i]), fGETBYTE((2*i+1)%4, RtV));
+    VxxV.v[0].h[i] += fGETUBYTE(0,VuuV.v[1].uh[i]);
+
+    VxxV.v[1].h[i] += fMPY8US(fGETUBYTE(1,VuuV.v[0].uh[i]), fGETBYTE((2*i  )%4, RtV));
+    VxxV.v[1].h[i] += fMPY8US(fGETUBYTE(0,VuuV.v[1].uh[i]), fGETBYTE((2*i+1)%4, RtV));
+    VxxV.v[1].h[i] += fGETUBYTE(1,VuuV.v[1].uh[i]))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vtmpyhb, "Vdd32=vtmpyhb(Vuu32,Rt32)", "Vdd32.w=vtmpy(Vuu32.h,Rt32.b)",
+"Dual Vector 3x1 Reduction",
+    VddV.v[0].w[i] = fMPY16SS(fGETHALF(0,VuuV.v[0].w[i]), fSE8_16(fGETBYTE((2*i+0)%4, RtV)));
+    VddV.v[0].w[i]+= fMPY16SS(fGETHALF(1,VuuV.v[0].w[i]), fSE8_16(fGETBYTE((2*i+1)%4, RtV)));
+    VddV.v[0].w[i]+= fGETHALF(0,VuuV.v[1].w[i]);
+
+    VddV.v[1].w[i] = fMPY16SS(fGETHALF(1,VuuV.v[0].w[i]), fSE8_16(fGETBYTE((2*i+0)%4, RtV)));
+    VddV.v[1].w[i]+= fMPY16SS(fGETHALF(0,VuuV.v[1].w[i]), fSE8_16(fGETBYTE((2*i+1)%4, RtV)));
+    VddV.v[1].w[i]+= fGETHALF(1,VuuV.v[1].w[i]))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vtmpyhb_acc, "Vxx32+=vtmpyhb(Vuu32,Rt32)", "Vxx32.w+=vtmpy(Vuu32.h,Rt32.b)",
+"Dual Vector 3x1 Reduction",
+    VxxV.v[0].w[i]+= fMPY16SS(fGETHALF(0,VuuV.v[0].w[i]), fSE8_16(fGETBYTE((2*i+0)%4, RtV)));
+    VxxV.v[0].w[i]+= fMPY16SS(fGETHALF(1,VuuV.v[0].w[i]), fSE8_16(fGETBYTE((2*i+1)%4, RtV)));
+    VxxV.v[0].w[i]+= fGETHALF(0,VuuV.v[1].w[i]);
+
+    VxxV.v[1].w[i]+= fMPY16SS(fGETHALF(1,VuuV.v[0].w[i]), fSE8_16(fGETBYTE((2*i+0)%4, RtV)));
+    VxxV.v[1].w[i]+= fMPY16SS(fGETHALF(0,VuuV.v[1].w[i]), fSE8_16(fGETBYTE((2*i+1)%4, RtV)));
+    VxxV.v[1].w[i]+= fGETHALF(1,VuuV.v[1].w[i]))
+
+
+/********************************************
+*  4-WAY REDUCTION - UNSIGNED BYTE BY UNSIGNED BYTE
+********************************************/
+
+
+
+ITERATOR_INSN2_MPY_SLOT(32,vrmpyub,"Vd32=vrmpyub(Vu32,Rt32)","Vd32.uw=vrmpy(Vu32.ub,Rt32.ub)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients",
+    VdV.uw[i]  = fMPY8UU(fGETUBYTE(0,VuV.uw[i]), fGETUBYTE(0,RtV));
+    VdV.uw[i] += fMPY8UU(fGETUBYTE(1,VuV.uw[i]), fGETUBYTE(1,RtV));
+    VdV.uw[i] += fMPY8UU(fGETUBYTE(2,VuV.uw[i]), fGETUBYTE(2,RtV));
+    VdV.uw[i] += fMPY8UU(fGETUBYTE(3,VuV.uw[i]), fGETUBYTE(3,RtV)))
+
+ITERATOR_INSN2_MPY_SLOT(32,vrmpyub_acc,"Vx32+=vrmpyub(Vu32,Rt32)","Vx32.uw+=vrmpy(Vu32.ub,Rt32.ub)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients Accumulate",
+    VxV.uw[i] += fMPY8UU(fGETUBYTE(0,VuV.uw[i]), fGETUBYTE(0,RtV));
+    VxV.uw[i] += fMPY8UU(fGETUBYTE(1,VuV.uw[i]), fGETUBYTE(1,RtV));
+    VxV.uw[i] += fMPY8UU(fGETUBYTE(2,VuV.uw[i]), fGETUBYTE(2,RtV));
+    VxV.uw[i] += fMPY8UU(fGETUBYTE(3,VuV.uw[i]), fGETUBYTE(3,RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT(32,vrmpyubv,"Vd32=vrmpyub(Vu32,Vv32)","Vd32.uw=vrmpy(Vu32.ub,Vv32.ub)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients",
+    VdV.uw[i]  = fMPY8UU(fGETUBYTE(0,VuV.uw[i]), fGETUBYTE(0,VvV.uw[i]));
+    VdV.uw[i] += fMPY8UU(fGETUBYTE(1,VuV.uw[i]), fGETUBYTE(1,VvV.uw[i]));
+    VdV.uw[i] += fMPY8UU(fGETUBYTE(2,VuV.uw[i]), fGETUBYTE(2,VvV.uw[i]));
+    VdV.uw[i] += fMPY8UU(fGETUBYTE(3,VuV.uw[i]), fGETUBYTE(3,VvV.uw[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrmpyubv_acc,"Vx32+=vrmpyub(Vu32,Vv32)","Vx32.uw+=vrmpy(Vu32.ub,Vv32.ub)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients Accumulate",
+    VxV.uw[i] += fMPY8UU(fGETUBYTE(0,VuV.uw[i]), fGETUBYTE(0,VvV.uw[i]));
+    VxV.uw[i] += fMPY8UU(fGETUBYTE(1,VuV.uw[i]), fGETUBYTE(1,VvV.uw[i]));
+    VxV.uw[i] += fMPY8UU(fGETUBYTE(2,VuV.uw[i]), fGETUBYTE(2,VvV.uw[i]));
+    VxV.uw[i] += fMPY8UU(fGETUBYTE(3,VuV.uw[i]), fGETUBYTE(3,VvV.uw[i])))
+
+ITERATOR_INSN2_MPY_SLOT(32,vrmpybv,"Vd32=vrmpyb(Vu32,Vv32)","Vd32.w=vrmpy(Vu32.b,Vv32.b)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients",
+    VdV.w[i]  = fMPY8SS(fGETBYTE(0, VuV.w[i]), fGETBYTE(0, VvV.w[i]));
+    VdV.w[i] += fMPY8SS(fGETBYTE(1, VuV.w[i]), fGETBYTE(1, VvV.w[i]));
+    VdV.w[i] += fMPY8SS(fGETBYTE(2, VuV.w[i]), fGETBYTE(2, VvV.w[i]));
+    VdV.w[i] += fMPY8SS(fGETBYTE(3, VuV.w[i]), fGETBYTE(3, VvV.w[i])))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrmpybv_acc,"Vx32+=vrmpyb(Vu32,Vv32)","Vx32.w+=vrmpy(Vu32.b,Vv32.b)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients",
+    VxV.w[i] += fMPY8SS(fGETBYTE(0, VuV.w[i]), fGETBYTE(0, VvV.w[i]));
+    VxV.w[i] += fMPY8SS(fGETBYTE(1, VuV.w[i]), fGETBYTE(1, VvV.w[i]));
+    VxV.w[i] += fMPY8SS(fGETBYTE(2, VuV.w[i]), fGETBYTE(2, VvV.w[i]));
+    VxV.w[i] += fMPY8SS(fGETBYTE(3, VuV.w[i]), fGETBYTE(3, VvV.w[i])))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrmpyubi,"Vdd32=vrmpyub(Vuu32,Rt32,#u1)","Vdd32.uw=vrmpy(Vuu32.ub,Rt32.ub,#u1)",
+"Dual Vector Unsigned Byte By Unsigned Byte 4-way Reduction to Word",
+    VddV.v[0].uw[i]  = fMPY8UU(fGETUBYTE(0, VuuV.v[uiV ? 1:0].uw[i]),fGETUBYTE((0-uiV) & 0x3,RtV));
+    VddV.v[0].uw[i] += fMPY8UU(fGETUBYTE(1, VuuV.v[0        ].uw[i]),fGETUBYTE((1-uiV) & 0x3,RtV));
+    VddV.v[0].uw[i] += fMPY8UU(fGETUBYTE(2, VuuV.v[0        ].uw[i]),fGETUBYTE((2-uiV) & 0x3,RtV));
+    VddV.v[0].uw[i] += fMPY8UU(fGETUBYTE(3, VuuV.v[0        ].uw[i]),fGETUBYTE((3-uiV) & 0x3,RtV));
+
+    VddV.v[1].uw[i]  = fMPY8UU(fGETUBYTE(0, VuuV.v[1        ].uw[i]),fGETUBYTE((2-uiV) & 0x3,RtV));
+    VddV.v[1].uw[i] += fMPY8UU(fGETUBYTE(1, VuuV.v[1        ].uw[i]),fGETUBYTE((3-uiV) & 0x3,RtV));
+    VddV.v[1].uw[i] += fMPY8UU(fGETUBYTE(2, VuuV.v[uiV ? 1:0].uw[i]),fGETUBYTE((0-uiV) & 0x3,RtV));
+    VddV.v[1].uw[i] += fMPY8UU(fGETUBYTE(3, VuuV.v[0        ].uw[i]),fGETUBYTE((1-uiV) & 0x3,RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrmpyubi_acc,"Vxx32+=vrmpyub(Vuu32,Rt32,#u1)","Vxx32.uw+=vrmpy(Vuu32.ub,Rt32.ub,#u1)",
+"Dual Vector Unsigned Byte By Unsigned Byte 4-way Reduction with accumulate to Word",
+    VxxV.v[0].uw[i] += fMPY8UU(fGETUBYTE(0, VuuV.v[uiV ? 1:0].uw[i]),fGETUBYTE((0-uiV) & 0x3,RtV));
+    VxxV.v[0].uw[i] += fMPY8UU(fGETUBYTE(1, VuuV.v[0        ].uw[i]),fGETUBYTE((1-uiV) & 0x3,RtV));
+    VxxV.v[0].uw[i] += fMPY8UU(fGETUBYTE(2, VuuV.v[0        ].uw[i]),fGETUBYTE((2-uiV) & 0x3,RtV));
+    VxxV.v[0].uw[i] += fMPY8UU(fGETUBYTE(3, VuuV.v[0        ].uw[i]),fGETUBYTE((3-uiV) & 0x3,RtV));
+
+    VxxV.v[1].uw[i] += fMPY8UU(fGETUBYTE(0, VuuV.v[1        ].uw[i]),fGETUBYTE((2-uiV) & 0x3,RtV));
+    VxxV.v[1].uw[i] += fMPY8UU(fGETUBYTE(1, VuuV.v[1        ].uw[i]),fGETUBYTE((3-uiV) & 0x3,RtV));
+    VxxV.v[1].uw[i] += fMPY8UU(fGETUBYTE(2, VuuV.v[uiV ? 1:0].uw[i]),fGETUBYTE((0-uiV) & 0x3,RtV));
+    VxxV.v[1].uw[i] += fMPY8UU(fGETUBYTE(3, VuuV.v[0        ].uw[i]),fGETUBYTE((1-uiV) & 0x3,RtV)))
+
+
+
+
+/********************************************
+*  4-WAY REDUCTION - UNSIGNED BYTE BY SIGNED BYTE
+********************************************/
+
+ITERATOR_INSN2_MPY_SLOT(32,vrmpybus,"Vd32=vrmpybus(Vu32,Rt32)","Vd32.w=vrmpy(Vu32.ub,Rt32.b)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients",
+    VdV.w[i]  = fMPY8US(fGETUBYTE(0,VuV.uw[i]), fGETBYTE(0,RtV));
+    VdV.w[i] += fMPY8US(fGETUBYTE(1,VuV.uw[i]), fGETBYTE(1,RtV));
+    VdV.w[i] += fMPY8US(fGETUBYTE(2,VuV.uw[i]), fGETBYTE(2,RtV));
+    VdV.w[i] += fMPY8US(fGETUBYTE(3,VuV.uw[i]), fGETBYTE(3,RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT(32,vrmpybus_acc,"Vx32+=vrmpybus(Vu32,Rt32)","Vx32.w+=vrmpy(Vu32.ub,Rt32.b)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients",
+    VxV.w[i] += fMPY8US(fGETUBYTE(0,VuV.uw[i]), fGETBYTE(0,RtV));
+    VxV.w[i] += fMPY8US(fGETUBYTE(1,VuV.uw[i]), fGETBYTE(1,RtV));
+    VxV.w[i] += fMPY8US(fGETUBYTE(2,VuV.uw[i]), fGETBYTE(2,RtV));
+    VxV.w[i] += fMPY8US(fGETUBYTE(3,VuV.uw[i]), fGETBYTE(3,RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrmpybusi,"Vdd32=vrmpybus(Vuu32,Rt32,#u1)","Vdd32.w=vrmpy(Vuu32.ub,Rt32.b,#u1)",
+"Dual Vector Unsigned Byte By Signed Byte 4-way Reduction to Word",
+    VddV.v[0].w[i]  = fMPY8US(fGETUBYTE(0, VuuV.v[uiV ? 1:0].uw[i]),fGETBYTE((0-uiV) & 0x3,RtV));
+    VddV.v[0].w[i] += fMPY8US(fGETUBYTE(1, VuuV.v[0        ].uw[i]),fGETBYTE((1-uiV) & 0x3,RtV));
+    VddV.v[0].w[i] += fMPY8US(fGETUBYTE(2, VuuV.v[0        ].uw[i]),fGETBYTE((2-uiV) & 0x3,RtV));
+    VddV.v[0].w[i] += fMPY8US(fGETUBYTE(3, VuuV.v[0        ].uw[i]),fGETBYTE((3-uiV) & 0x3,RtV));
+
+    VddV.v[1].w[i]  = fMPY8US(fGETUBYTE(0, VuuV.v[1        ].uw[i]),fGETBYTE((2-uiV) & 0x3,RtV));
+    VddV.v[1].w[i] += fMPY8US(fGETUBYTE(1, VuuV.v[1        ].uw[i]),fGETBYTE((3-uiV) & 0x3,RtV));
+    VddV.v[1].w[i] += fMPY8US(fGETUBYTE(2, VuuV.v[uiV ? 1:0].uw[i]),fGETBYTE((0-uiV) & 0x3,RtV));
+    VddV.v[1].w[i] += fMPY8US(fGETUBYTE(3, VuuV.v[0        ].uw[i]),fGETBYTE((1-uiV) & 0x3,RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrmpybusi_acc,"Vxx32+=vrmpybus(Vuu32,Rt32,#u1)","Vxx32.w+=vrmpy(Vuu32.ub,Rt32.b,#u1)",
+"Dual Vector Unsigned Byte By Signed Byte 4-way Reduction with accumulate to Word",
+    VxxV.v[0].w[i] += fMPY8US(fGETUBYTE(0, VuuV.v[uiV ? 1:0].uw[i]),fGETBYTE((0-uiV) & 0x3,RtV));
+    VxxV.v[0].w[i] += fMPY8US(fGETUBYTE(1, VuuV.v[0        ].uw[i]),fGETBYTE((1-uiV) & 0x3,RtV));
+    VxxV.v[0].w[i] += fMPY8US(fGETUBYTE(2, VuuV.v[0        ].uw[i]),fGETBYTE((2-uiV) & 0x3,RtV));
+    VxxV.v[0].w[i] += fMPY8US(fGETUBYTE(3, VuuV.v[0        ].uw[i]),fGETBYTE((3-uiV) & 0x3,RtV));
+
+    VxxV.v[1].w[i] += fMPY8US(fGETUBYTE(0, VuuV.v[1        ].uw[i]),fGETBYTE((2-uiV) & 0x3,RtV));
+    VxxV.v[1].w[i] += fMPY8US(fGETUBYTE(1, VuuV.v[1        ].uw[i]),fGETBYTE((3-uiV) & 0x3,RtV));
+    VxxV.v[1].w[i] += fMPY8US(fGETUBYTE(2, VuuV.v[uiV ? 1:0].uw[i]),fGETBYTE((0-uiV) & 0x3,RtV));
+    VxxV.v[1].w[i] += fMPY8US(fGETUBYTE(3, VuuV.v[0        ].uw[i]),fGETBYTE((1-uiV) & 0x3,RtV)))
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT(32,vrmpybusv,"Vd32=vrmpybus(Vu32,Vv32)","Vd32.w=vrmpy(Vu32.ub,Vv32.b)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients",
+    VdV.w[i]  = fMPY8US(fGETUBYTE(0,VuV.uw[i]), fGETBYTE(0,VvV.w[i]));
+    VdV.w[i] += fMPY8US(fGETUBYTE(1,VuV.uw[i]), fGETBYTE(1,VvV.w[i]));
+    VdV.w[i] += fMPY8US(fGETUBYTE(2,VuV.uw[i]), fGETBYTE(2,VvV.w[i]));
+    VdV.w[i] += fMPY8US(fGETUBYTE(3,VuV.uw[i]), fGETBYTE(3,VvV.w[i])))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrmpybusv_acc,"Vx32+=vrmpybus(Vu32,Vv32)","Vx32.w+=vrmpy(Vu32.ub,Vv32.b)",
+"Vector Multiply-Accumulate Reduce with 4 byte coefficients",
+    VxV.w[i] += fMPY8US(fGETUBYTE(0,VuV.uw[i]), fGETBYTE(0,VvV.w[i]));
+    VxV.w[i] += fMPY8US(fGETUBYTE(1,VuV.uw[i]), fGETBYTE(1,VvV.w[i]));
+    VxV.w[i] += fMPY8US(fGETUBYTE(2,VuV.uw[i]), fGETBYTE(2,VvV.w[i]));
+    VxV.w[i] += fMPY8US(fGETUBYTE(3,VuV.uw[i]), fGETBYTE(3,VvV.w[i])))
+
+
+
+
+
+
+
+
+
+
+
+/********************************************
+*  2-WAY REDUCTION - SAD
+********************************************/
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdsaduh,"Vdd32=vdsaduh(Vuu32,Rt32)","Vdd32.uw=vdsad(Vuu32.uh,Rt32.uh)",
+"Dual Vector Unsigned Halfword 2-Way Sum of Absolute Differences to Word",
+    VddV.v[0].uw[i]  = fABS(fGETUHALF(0, VuuV.v[0].uw[i]) - fGETUHALF(0,RtV));
+    VddV.v[0].uw[i] += fABS(fGETUHALF(1, VuuV.v[0].uw[i]) - fGETUHALF(1,RtV));
+    VddV.v[1].uw[i]  = fABS(fGETUHALF(1, VuuV.v[0].uw[i]) - fGETUHALF(0,RtV));
+    VddV.v[1].uw[i] += fABS(fGETUHALF(0, VuuV.v[1].uw[i]) - fGETUHALF(1,RtV)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vdsaduh_acc,"Vxx32+=vdsaduh(Vuu32,Rt32)","Vxx32.uw+=vdsad(Vuu32.uh,Rt32.uh)",
+"Dual Vector Unsigned Halfword 2-Way Sum of Absolute Differences with accumulate to Word",
+    VxxV.v[0].uw[i] += fABS(fGETUHALF(0, VuuV.v[0].uw[i]) - fGETUHALF(0,RtV));
+    VxxV.v[0].uw[i] += fABS(fGETUHALF(1, VuuV.v[0].uw[i]) - fGETUHALF(1,RtV));
+    VxxV.v[1].uw[i] += fABS(fGETUHALF(1, VuuV.v[0].uw[i]) - fGETUHALF(0,RtV));
+    VxxV.v[1].uw[i] += fABS(fGETUHALF(0, VuuV.v[1].uw[i]) - fGETUHALF(1,RtV)))
+
+
+
+
+/********************************************
+*  4-WAY REDUCTION - SAD
+********************************************/
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrsadubi,"Vdd32=vrsadub(Vuu32,Rt32,#u1)","Vdd32.uw=vrsad(Vuu32.ub,Rt32.ub,#u1)",
+"Dual Vector Unsigned Byte 4-Way Sum of Absolute Differences to Word",
+    VddV.v[0].uw[i]  = fABS(fZE8_16(fGETUBYTE(0, VuuV.v[uiV?1:0].uw[i])) - fZE8_16(fGETUBYTE((0-uiV)&3,RtV)));
+    VddV.v[0].uw[i] += fABS(fZE8_16(fGETUBYTE(1, VuuV.v[0      ].uw[i])) - fZE8_16(fGETUBYTE((1-uiV)&3,RtV)));
+    VddV.v[0].uw[i] += fABS(fZE8_16(fGETUBYTE(2, VuuV.v[0      ].uw[i])) - fZE8_16(fGETUBYTE((2-uiV)&3,RtV)));
+    VddV.v[0].uw[i] += fABS(fZE8_16(fGETUBYTE(3, VuuV.v[0      ].uw[i])) - fZE8_16(fGETUBYTE((3-uiV)&3,RtV)));
+
+    VddV.v[1].uw[i]  = fABS(fZE8_16(fGETUBYTE(0, VuuV.v[1      ].uw[i])) - fZE8_16(fGETUBYTE((2-uiV)&3,RtV)));
+    VddV.v[1].uw[i] += fABS(fZE8_16(fGETUBYTE(1, VuuV.v[1      ].uw[i])) - fZE8_16(fGETUBYTE((3-uiV)&3,RtV)));
+    VddV.v[1].uw[i] += fABS(fZE8_16(fGETUBYTE(2, VuuV.v[uiV?1:0].uw[i])) - fZE8_16(fGETUBYTE((0-uiV)&3,RtV)));
+    VddV.v[1].uw[i] += fABS(fZE8_16(fGETUBYTE(3, VuuV.v[0      ].uw[i])) - fZE8_16(fGETUBYTE((1-uiV)&3,RtV))))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vrsadubi_acc,"Vxx32+=vrsadub(Vuu32,Rt32,#u1)","Vxx32.uw+=vrsad(Vuu32.ub,Rt32.ub,#u1)",
+"Dual Vector Unsigned Byte 4-Way Sum of Absolute Differences with accumulate to Word",
+    VxxV.v[0].uw[i] += fABS(fZE8_16(fGETUBYTE(0, VuuV.v[uiV?1:0].uw[i])) - fZE8_16(fGETUBYTE((0-uiV)&3,RtV)));
+    VxxV.v[0].uw[i] += fABS(fZE8_16(fGETUBYTE(1, VuuV.v[0      ].uw[i])) - fZE8_16(fGETUBYTE((1-uiV)&3,RtV)));
+    VxxV.v[0].uw[i] += fABS(fZE8_16(fGETUBYTE(2, VuuV.v[0      ].uw[i])) - fZE8_16(fGETUBYTE((2-uiV)&3,RtV)));
+    VxxV.v[0].uw[i] += fABS(fZE8_16(fGETUBYTE(3, VuuV.v[0      ].uw[i])) - fZE8_16(fGETUBYTE((3-uiV)&3,RtV)));
+
+    VxxV.v[1].uw[i] += fABS(fZE8_16(fGETUBYTE(0, VuuV.v[1      ].uw[i])) - fZE8_16(fGETUBYTE((2-uiV)&3,RtV)));
+    VxxV.v[1].uw[i] += fABS(fZE8_16(fGETUBYTE(1, VuuV.v[1      ].uw[i])) - fZE8_16(fGETUBYTE((3-uiV)&3,RtV)));
+    VxxV.v[1].uw[i] += fABS(fZE8_16(fGETUBYTE(2, VuuV.v[uiV?1:0].uw[i])) - fZE8_16(fGETUBYTE((0-uiV)&3,RtV)));
+    VxxV.v[1].uw[i] += fABS(fZE8_16(fGETUBYTE(3, VuuV.v[0      ].uw[i])) - fZE8_16(fGETUBYTE((1-uiV)&3,RtV))))
+
+
+
+
+
+
+
+
+
+
+/*********************************************************************
+ * MMVECTOR SHIFTING
+ * ******************************************************************/
+// Macro generating arithmetic left/right and logical right shifts, by either Rt or Vv
+
+#define V_SHIFT(TYPE, DESC, SIZE, LOGSIZE, CASTTYPE)   \
+ITERATOR_INSN2_SHIFT_SLOT(SIZE,vasr##TYPE,   "Vd32=vasr" #TYPE "(Vu32,Rt32)","Vd32."#TYPE"=vasr(Vu32."#TYPE",Rt32)",         "Vector arithmetic shift right " DESC,    VdV.TYPE[i]     = (VuV.TYPE[i]    >> (RtV & (SIZE-1)))) \
+ITERATOR_INSN2_SHIFT_SLOT(SIZE,vasl##TYPE,   "Vd32=vasl" #TYPE "(Vu32,Rt32)","Vd32."#TYPE"=vasl(Vu32."#TYPE",Rt32)",         "Vector arithmetic shift left  " DESC,    VdV.TYPE[i]     = (VuV.TYPE[i]    << (RtV & (SIZE-1)))) \
+ITERATOR_INSN2_SHIFT_SLOT(SIZE,vlsr##TYPE,   "Vd32=vlsr" #TYPE "(Vu32,Rt32)","Vd32.u"#TYPE"=vlsr(Vu32.u"#TYPE",Rt32)",       "Vector logical shift right "    DESC,    VdV.u##TYPE[i]  = (VuV.u##TYPE[i] >> (RtV & (SIZE-1)))) \
+ITERATOR_INSN2_SHIFT_SLOT(SIZE,vasr##TYPE##v,"Vd32=vasr" #TYPE "(Vu32,Vv32)","Vd32."#TYPE"=vasr(Vu32."#TYPE",Vv32."#TYPE")", "Vector arithmetic shift right " DESC,    VdV.TYPE[i]     = fBIDIR_ASHIFTR(VuV.TYPE[i], fSXTN((LOGSIZE+1),SIZE,VvV.TYPE[i]),CASTTYPE)) \
+ITERATOR_INSN2_SHIFT_SLOT(SIZE,vasl##TYPE##v,"Vd32=vasl" #TYPE "(Vu32,Vv32)","Vd32."#TYPE"=vasl(Vu32."#TYPE",Vv32."#TYPE")", "Vector arithmetic shift left  " DESC,    VdV.TYPE[i]     = fBIDIR_ASHIFTL(VuV.TYPE[i],  fSXTN((LOGSIZE+1),SIZE,VvV.TYPE[i]),CASTTYPE)) \
+ITERATOR_INSN2_SHIFT_SLOT(SIZE,vlsr##TYPE##v,"Vd32=vlsr" #TYPE "(Vu32,Vv32)","Vd32."#TYPE"=vlsr(Vu32."#TYPE",Vv32."#TYPE")", "Vector logical shift right "    DESC,    VdV.u##TYPE[i]  = fBIDIR_LSHIFTR(VuV.u##TYPE[i], fSXTN((LOGSIZE+1),SIZE,VvV.TYPE[i]),CASTTYPE)) \
+
+V_SHIFT(w, "word",   32,5,4_4)
+V_SHIFT(h, "halfword", 16,4,2_2)
+
+ITERATOR_INSN_SHIFT_SLOT(8,vlsrb,"Vd32.ub=vlsr(Vu32.ub,Rt32)","vec log shift right bytes", VdV.b[i] = VuV.ub[i] >> (RtV & 0x7))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vrotr,"Vd32=vrotr(Vu32,Vv32)","Vd32.uw=vrotr(Vu32.uw,Vv32.uw)","Vector word rotate right", VdV.uw[i] = ((VuV.uw[i] >> (VvV.uw[i] & 0x1f)) | (VuV.uw[i] << (32 - (VvV.uw[i] & 0x1f)))))
+
+/*********************************************************************
+ * MMVECTOR SHIFT AND PERMUTE
+ * ******************************************************************/
+
+ITERATOR_INSN2_PERMUTE_SLOT_DOUBLE_VEC(32,vasr_into,"Vxx32=vasrinto(Vu32,Vv32)","Vxx32.w=vasrinto(Vu32.w,Vv32.w)","ASR vector 1 elements and overlay dropping bits to MSB of vector 2 elements",
+    fHIDE(int64_t ) shift = (fSE32_64(VuV.w[i]) << 32);
+    fHIDE(int64_t ) mask  = (((fSE32_64(VxxV.v[0].w[i])) << 32) | fZE32_64(VxxV.v[0].w[i]));
+    fHIDE(int64_t) lomask = (((fSE32_64(1)) << 32) - 1);
+    fHIDE(int ) count = -(0x40 & VvV.w[i]) + (VvV.w[i] & 0x3f);
+    fHIDE(int64_t ) result = (count == -0x40) ? 0 : (((count < 0) ? ((shift << -(count)) | (mask & (lomask << -(count)))) : ((shift >> count) | (mask & (lomask >> count)))));
+    VxxV.v[1].w[i] = ((result >> 32) & 0xffffffff);
+    VxxV.v[0].w[i] = (result & 0xffffffff))
+
+#define NEW_NARROWING_SHIFT 1
+
+#if NEW_NARROWING_SHIFT
+#define NARROWING_SHIFT(ITERSIZE,TAG,DSTM,DSTTYPE,SRCTYPE,SYNOPTS,SATFUNC,RNDFUNC,SHAMTMASK) \
+ITERATOR_INSN_SHIFT_SLOT(ITERSIZE,TAG, \
+"Vd32." #DSTTYPE "=vasr(Vu32." #SRCTYPE ",Vv32." #SRCTYPE ",Rt8)" #SYNOPTS, \
+"Vector shift right and shuffle", \
+    fHIDE(int )shamt = RtV & SHAMTMASK; \
+    DSTM(0,VdV.SRCTYPE[i],SATFUNC(RNDFUNC(VvV.SRCTYPE[i],shamt) >> shamt)); \
+    DSTM(1,VdV.SRCTYPE[i],SATFUNC(RNDFUNC(VuV.SRCTYPE[i],shamt) >> shamt)))
+
+
+
+
+
+/* WORD TO HALF */
+
+NARROWING_SHIFT(32,vasrwh,fSETHALF,h,w,,fECHO,fVNOROUND,0xF)
+NARROWING_SHIFT(32,vasrwhsat,fSETHALF,h,w,:sat,fVSATH,fVNOROUND,0xF)
+NARROWING_SHIFT(32,vasrwhrndsat,fSETHALF,h,w,:rnd:sat,fVSATH,fVROUND,0xF)
+NARROWING_SHIFT(32,vasrwuhrndsat,fSETHALF,uh,w,:rnd:sat,fVSATUH,fVROUND,0xF)
+NARROWING_SHIFT(32,vasrwuhsat,fSETHALF,uh,w,:sat,fVSATUH,fVNOROUND,0xF)
+NARROWING_SHIFT(32,vasruwuhrndsat,fSETHALF,uh,uw,:rnd:sat,fVSATUH,fVROUND,0xF)
+
+NARROWING_SHIFT_NOV1(32,vasruwuhsat,fSETHALF,uh,uw,:sat,fVSATUH,fVNOROUND,0xF)
+NARROWING_SHIFT(16,vasrhubsat,fSETBYTE,ub,h,:sat,fVSATUB,fVNOROUND,0x7)
+NARROWING_SHIFT(16,vasrhubrndsat,fSETBYTE,ub,h,:rnd:sat,fVSATUB,fVROUND,0x7)
+NARROWING_SHIFT(16,vasrhbsat,fSETBYTE,b,h,:sat,fVSATB,fVNOROUND,0x7)
+NARROWING_SHIFT(16,vasrhbrndsat,fSETBYTE,b,h,:rnd:sat,fVSATB,fVROUND,0x7)
+
+NARROWING_SHIFT_NOV1(16,vasruhubsat,fSETBYTE,ub,uh,:sat,fVSATUB,fVNOROUND,0x7)
+NARROWING_SHIFT_NOV1(16,vasruhubrndsat,fSETBYTE,ub,uh,:rnd:sat,fVSATUB,fVROUND,0x7)
+
+#else
+ITERATOR_INSN2_SHIFT_SLOT(32,vasrwh,"Vd32=vasrwh(Vu32,Vv32,Rt8)","Vd32.h=vasr(Vu32.w,Vv32.w,Rt8)",
+"Vector arithmetic shift right words, shuffle even halfwords",
+    fSETHALF(0,VdV.w[i], (VvV.w[i] >> (RtV & 0xF)));
+    fSETHALF(1,VdV.w[i], (VuV.w[i] >> (RtV & 0xF))))
+
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vasrwhsat,"Vd32=vasrwh(Vu32,Vv32,Rt8):sat","Vd32.h=vasr(Vu32.w,Vv32.w,Rt8):sat",
+"Vector arithmetic shift right words, shuffle even halfwords",
+    fSETHALF(0,VdV.w[i], fVSATH(VvV.w[i] >> (RtV & 0xF)));
+    fSETHALF(1,VdV.w[i], fVSATH(VuV.w[i] >> (RtV & 0xF))))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vasrwhrndsat,"Vd32=vasrwh(Vu32,Vv32,Rt8):rnd:sat","Vd32.h=vasr(Vu32.w,Vv32.w,Rt8):rnd:sat",
+"Vector arithmetic shift right words, shuffle even halfwords",
+    fHIDE(int ) shamt = RtV & 0xF;
+    fSETHALF(0,VdV.w[i], fVSATH(  (VvV.w[i] + fBIDIR_ASHIFTL(1,(shamt-1),4_8) ) >> shamt));
+    fSETHALF(1,VdV.w[i], fVSATH(  (VuV.w[i] + fBIDIR_ASHIFTL(1,(shamt-1),4_8) ) >> shamt)))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vasrwuhrndsat,"Vd32=vasrwuh(Vu32,Vv32,Rt8):rnd:sat","Vd32.uh=vasr(Vu32.w,Vv32.w,Rt8):rnd:sat",
+"Vector arithmetic shift right words, shuffle even halfwords",
+    fHIDE(int ) shamt = RtV & 0xF;
+    fSETHALF(0,VdV.w[i], fVSATUH(  (VvV.w[i] + fBIDIR_ASHIFTL(1,(shamt-1),4_8) ) >> shamt));
+    fSETHALF(1,VdV.w[i], fVSATUH(  (VuV.w[i] + fBIDIR_ASHIFTL(1,(shamt-1),4_8) ) >> shamt)))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vasrwuhsat,"Vd32=vasrwuh(Vu32,Vv32,Rt8):sat","Vd32.uh=vasr(Vu32.w,Vv32.w,Rt8):sat",
+"Vector arithmetic shift right words, shuffle even halfwords",
+    fSETHALF(0, VdV.uw[i], fVSATUH(VvV.w[i] >> (RtV & 0xF)));
+    fSETHALF(1, VdV.uw[i], fVSATUH(VuV.w[i] >> (RtV & 0xF))))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vasruwuhrndsat,"Vd32=vasruwuh(Vu32,Vv32,Rt8):rnd:sat","Vd32.uh=vasr(Vu32.uw,Vv32.uw,Rt8):rnd:sat",
+"Vector arithmetic shift right words, shuffle even halfwords",
+    fHIDE(int ) shamt = RtV & 0xF;
+    fSETHALF(0,VdV.w[i], fVSATUH(  (VvV.uw[i] + fBIDIR_ASHIFTL(1,(shamt-1),4_8) ) >> shamt));
+    fSETHALF(1,VdV.w[i], fVSATUH(  (VuV.uw[i] + fBIDIR_ASHIFTL(1,(shamt-1),4_8) ) >> shamt)))
+#endif
+
+
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vroundwh,"Vd32=vroundwh(Vu32,Vv32):sat","Vd32.h=vround(Vu32.w,Vv32.w):sat",
+"Vector round words to halves, shuffle resultant halfwords",
+    fSETHALF(0, VdV.uw[i], fVSATH((VvV.w[i] + fCONSTLL(0x8000)) >> 16));
+    fSETHALF(1, VdV.uw[i], fVSATH((VuV.w[i] + fCONSTLL(0x8000)) >> 16)))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vroundwuh,"Vd32=vroundwuh(Vu32,Vv32):sat","Vd32.uh=vround(Vu32.w,Vv32.w):sat",
+"Vector round words to halves, shuffle resultant halfwords",
+    fSETHALF(0, VdV.uw[i], fVSATUH((VvV.w[i] + fCONSTLL(0x8000)) >> 16));
+    fSETHALF(1, VdV.uw[i], fVSATUH((VuV.w[i] + fCONSTLL(0x8000)) >> 16)))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vrounduwuh,"Vd32=vrounduwuh(Vu32,Vv32):sat","Vd32.uh=vround(Vu32.uw,Vv32.uw):sat",
+"Vector round words to halves, shuffle resultant halfwords",
+    fSETHALF(0, VdV.uw[i], fVSATUH((VvV.uw[i] + fCONSTLL(0x8000)) >> 16));
+    fSETHALF(1, VdV.uw[i], fVSATUH((VuV.uw[i] + fCONSTLL(0x8000)) >> 16)))
+
+
+
+
+
+/* HALF TO BYTE */
+
+ITERATOR_INSN2_SHIFT_SLOT(16,vroundhb,"Vd32=vroundhb(Vu32,Vv32):sat","Vd32.b=vround(Vu32.h,Vv32.h):sat",
+"Vector round halfwords to bytes, shuffle resultant bytes",
+    fSETBYTE(0, VdV.uh[i], fVSATB((VvV.h[i] + 0x80) >> 8));
+    fSETBYTE(1, VdV.uh[i], fVSATB((VuV.h[i] + 0x80) >> 8)))
+
+ITERATOR_INSN2_SHIFT_SLOT(16,vroundhub,"Vd32=vroundhub(Vu32,Vv32):sat","Vd32.ub=vround(Vu32.h,Vv32.h):sat",
+"Vector round halfwords to bytes, shuffle resultant bytes",
+    fSETBYTE(0, VdV.uh[i], fVSATUB((VvV.h[i] + 0x80) >> 8));
+    fSETBYTE(1, VdV.uh[i], fVSATUB((VuV.h[i] + 0x80) >> 8)))
+
+ITERATOR_INSN2_SHIFT_SLOT(16,vrounduhub,"Vd32=vrounduhub(Vu32,Vv32):sat","Vd32.ub=vround(Vu32.uh,Vv32.uh):sat",
+"Vector round halfwords to bytes, shuffle resultant bytes",
+    fSETBYTE(0, VdV.uh[i], fVSATUB((VvV.uh[i] + 0x80) >> 8));
+    fSETBYTE(1, VdV.uh[i], fVSATUB((VuV.uh[i] + 0x80) >> 8)))
+
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vaslw_acc,"Vx32+=vaslw(Vu32,Rt32)","Vx32.w+=vasl(Vu32.w,Rt32)",
+"Vector shift add word",
+    VxV.w[i]  +=  (VuV.w[i] << (RtV & (32-1))))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vasrw_acc,"Vx32+=vasrw(Vu32,Rt32)","Vx32.w+=vasr(Vu32.w,Rt32)",
+"Vector shift add word",
+    VxV.w[i]  +=  (VuV.w[i] >> (RtV & (32-1))))
+
+ITERATOR_INSN2_SHIFT_SLOT_NOV1(16,vaslh_acc,"Vx32+=vaslh(Vu32,Rt32)","Vx32.h+=vasl(Vu32.h,Rt32)",
+"Vector shift add halfword",
+    VxV.h[i]  +=  (VuV.h[i] << (RtV & (16-1))))
+
+ITERATOR_INSN2_SHIFT_SLOT_NOV1(16,vasrh_acc,"Vx32+=vasrh(Vu32,Rt32)","Vx32.h+=vasr(Vu32.h,Rt32)",
+"Vector shift add halfword",
+    VxV.h[i]  +=  (VuV.h[i] >> (RtV & (16-1))))
+
+/**************************************************************************
+*
+* MMVECTOR ELEMENT-WISE ARITHMETIC
+*
+**************************************************************************/
+
+/**************************************************************************
+* MACROS GO IN MACROS.DEF NOT HERE!!!
+**************************************************************************/
+
+
+#define MMVEC_ABSDIFF(TYPE,TYPE2,DESCR, WIDTH, DEST,SRC)\
+ITERATOR_INSN2_MPY_SLOT(WIDTH, vabsdiff##TYPE,                   "Vd32=vabsdiff"TYPE2"(Vu32,Vv32)" ,"Vd32."#DEST"=vabsdiff(Vu32."#SRC",Vv32."#SRC")" ,     "Vector Absolute of Difference "DESCR,   VdV.DEST[i] = (VuV.SRC[i] > VvV.SRC[i]) ? (VuV.SRC[i] - VvV.SRC[i]) : (VvV.SRC[i] - VuV.SRC[i]))
+
+#define MMVEC_ADDU_SAT(TYPE,TYPE2,DESCR, WIDTH, DEST,SRC)\
+ITERATOR_INSN2_ANY_SLOT(WIDTH, vadd##TYPE##sat,                  "Vd32=vadd"TYPE2"(Vu32,Vv32):sat" ,    "Vd32."#DEST"=vadd(Vu32."#SRC",Vv32."#SRC"):sat",    "Vector Add & Saturate "DESCR,            VdV.DEST[i] = fVUADDSAT(WIDTH,  VuV.SRC[i], VvV.SRC[i]))\
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(WIDTH, vadd##TYPE##sat_dv,    "Vdd32=vadd"TYPE2"(Vuu32,Vvv32):sat",  "Vdd32."#DEST"=vadd(Vuu32."#SRC",Vvv32."#SRC"):sat", "Double Vector Add & Saturate "DESCR,    VddV.v[0].DEST[i] = fVUADDSAT(WIDTH, VuuV.v[0].SRC[i],VvvV.v[0].SRC[i]); VddV.v[1].DEST[i] = fVUADDSAT(WIDTH, VuuV.v[1].SRC[i],VvvV.v[1].SRC[i]))\
+ITERATOR_INSN2_ANY_SLOT(WIDTH, vsub##TYPE##sat,                  "Vd32=vsub"TYPE2"(Vu32,Vv32):sat",     "Vd32."#DEST"=vsub(Vu32."#SRC",Vv32."#SRC"):sat",    "Vector Sub & Saturate "DESCR,            VdV.DEST[i] = fVUSUBSAT(WIDTH,  VuV.SRC[i], VvV.SRC[i]))\
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(WIDTH, vsub##TYPE##sat_dv,    "Vdd32=vsub"TYPE2"(Vuu32,Vvv32):sat",  "Vdd32."#DEST"=vsub(Vuu32."#SRC",Vvv32."#SRC"):sat", "Double Vector Sub & Saturate "DESCR,    VddV.v[0].DEST[i] = fVUSUBSAT(WIDTH, VuuV.v[0].SRC[i],VvvV.v[0].SRC[i]); VddV.v[1].DEST[i] = fVUSUBSAT(WIDTH, VuuV.v[1].SRC[i],VvvV.v[1].SRC[i]))\
+
+#define MMVEC_ADDS_SAT(TYPE,TYPE2,DESCR, WIDTH,DEST,SRC)\
+ITERATOR_INSN2_ANY_SLOT(WIDTH, vadd##TYPE##sat,                  "Vd32=vadd"TYPE2"(Vu32,Vv32):sat" ,    "Vd32."#DEST"=vadd(Vu32."#SRC",Vv32."#SRC"):sat",    "Vector Add & Saturate "DESCR,            VdV.DEST[i] = fVSADDSAT(WIDTH,  VuV.SRC[i],  VvV.SRC[i]))\
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(WIDTH, vadd##TYPE##sat_dv,    "Vdd32=vadd"TYPE2"(Vuu32,Vvv32):sat",  "Vdd32."#DEST"=vadd(Vuu32."#SRC",Vvv32."#SRC"):sat", "Double Vector Add & Saturate "DESCR,    VddV.v[0].DEST[i] = fVSADDSAT(WIDTH, VuuV.v[0].SRC[i], VvvV.v[0].SRC[i]); VddV.v[1].DEST[i] = fVSADDSAT(WIDTH, VuuV.v[1].SRC[i], VvvV.v[1].SRC[i]))\
+ITERATOR_INSN2_ANY_SLOT(WIDTH, vsub##TYPE##sat,                  "Vd32=vsub"TYPE2"(Vu32,Vv32):sat",     "Vd32."#DEST"=vsub(Vu32."#SRC",Vv32."#SRC"):sat",    "Vector Sub & Saturate "DESCR,            VdV.DEST[i] = fVSSUBSAT(WIDTH,  VuV.SRC[i],  VvV.SRC[i]))\
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(WIDTH, vsub##TYPE##sat_dv,    "Vdd32=vsub"TYPE2"(Vuu32,Vvv32):sat",  "Vdd32."#DEST"=vsub(Vuu32."#SRC",Vvv32."#SRC"):sat", "Double Vector Sub & Saturate "DESCR,    VddV.v[0].DEST[i] = fVSSUBSAT(WIDTH, VuuV.v[0].SRC[i], VvvV.v[0].SRC[i]); VddV.v[1].DEST[i] = fVSSUBSAT(WIDTH, VuuV.v[1].SRC[i], VvvV.v[1].SRC[i]))\
+
+#define MMVEC_AVGU(TYPE,TYPE2,DESCR, WIDTH, DEST,SRC)\
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vavg##TYPE,                        "Vd32=vavg"TYPE2"(Vu32,Vv32)",         "Vd32."#DEST"=vavg(Vu32."#SRC",Vv32."#SRC")",        "Vector Average "DESCR,                                      VdV.DEST[i] = fVAVGU(   WIDTH,  VuV.SRC[i], VvV.SRC[i])) \
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vavg##TYPE##rnd,                   "Vd32=vavg"TYPE2"(Vu32,Vv32):rnd",     "Vd32."#DEST"=vavg(Vu32."#SRC",Vv32."#SRC"):rnd",    "Vector Average & Round "DESCR,                              VdV.DEST[i] = fVAVGURND(WIDTH,  VuV.SRC[i], VvV.SRC[i]))
+
+
+
+#define MMVEC_AVGS(TYPE,TYPE2,DESCR, WIDTH, DEST,SRC)\
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vavg##TYPE,                        "Vd32=vavg"TYPE2"(Vu32,Vv32)",          "Vd32."#DEST"=vavg(Vu32."#SRC",Vv32."#SRC")",          "Vector Average "DESCR,                                      VdV.DEST[i]  = fVAVGS(       WIDTH,  VuV.SRC[i], VvV.SRC[i])) \
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vavg##TYPE##rnd,                   "Vd32=vavg"TYPE2"(Vu32,Vv32):rnd",      "Vd32."#DEST"=vavg(Vu32."#SRC",Vv32."#SRC"):rnd",      "Vector Average & Round "DESCR,                              VdV.DEST[i]  = fVAVGSRND(    WIDTH,  VuV.SRC[i], VvV.SRC[i])) \
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vnavg##TYPE,                       "Vd32=vnavg"TYPE2"(Vu32,Vv32)",         "Vd32."#DEST"=vnavg(Vu32."#SRC",Vv32."#SRC")",         "Vector Negative Average "DESCR,                             VdV.DEST[i]  = fVNAVGS(      WIDTH,  VuV.SRC[i], VvV.SRC[i]))
+
+
+
+
+
+
+
+#define MMVEC_ADDWRAP(TYPE,TYPE2, DESCR, WIDTH , DEST,SRC)\
+ITERATOR_INSN2_ANY_SLOT(WIDTH, vadd##TYPE,                  "Vd32=vadd"TYPE2"(Vu32,Vv32)" ,     "Vd32."#DEST"=vadd(Vu32."#SRC",Vv32."#SRC")",    "Vector Add "DESCR,          VdV.DEST[i] =  VuV.SRC[i] +  VvV.SRC[i])\
+ITERATOR_INSN2_ANY_SLOT(WIDTH, vsub##TYPE,                  "Vd32=vsub"TYPE2"(Vu32,Vv32)" ,     "Vd32."#DEST"=vsub(Vu32."#SRC",Vv32."#SRC")",    "Vector Sub "DESCR,          VdV.DEST[i] =  VuV.SRC[i] -  VvV.SRC[i])\
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(WIDTH, vadd##TYPE##_dv,  "Vdd32=vadd"TYPE2"(Vuu32,Vvv32)" ,  "Vdd32."#DEST"=vadd(Vuu32."#SRC",Vvv32."#SRC")", "Double Vector Add "DESCR,   VddV.v[0].DEST[i] = VuuV.v[0].SRC[i] + VvvV.v[0].SRC[i]; VddV.v[1].DEST[i] = VuuV.v[1].SRC[i] + VvvV.v[1].SRC[i])\
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(WIDTH, vsub##TYPE##_dv,  "Vdd32=vsub"TYPE2"(Vuu32,Vvv32)" ,  "Vdd32."#DEST"=vsub(Vuu32."#SRC",Vvv32."#SRC")", "Double Vector Sub "DESCR,   VddV.v[0].DEST[i] = VuuV.v[0].SRC[i] - VvvV.v[0].SRC[i]; VddV.v[1].DEST[i] = VuuV.v[1].SRC[i] - VvvV.v[1].SRC[i]) \
+
+
+
+
+
+/* Wrapping Adds */
+MMVEC_ADDWRAP(b,    "b",    "Byte",         8,   b, b)
+MMVEC_ADDWRAP(h,    "h",    "Halfword",     16,  h, h)
+MMVEC_ADDWRAP(w,    "w",    "Word",         32,   w,    w)
+
+/* Saturating Adds */
+MMVEC_ADDU_SAT(ub, "ub",    "Unsigned Byte",        8,   ub,    ub)
+MMVEC_ADDU_SAT(uh, "uh",    "Unsigned Halfword",    16,  uh,    uh)
+MMVEC_ADDU_SAT(uw, "uw",    "Unsigned Word",        32,  uw,    uw)
+MMVEC_ADDS_SAT(b,  "b",     "Byte",                 8,   b,     b)
+MMVEC_ADDS_SAT(h,  "h",     "Halfword",             16,  h,     h)
+MMVEC_ADDS_SAT(w,  "w",     "Word",                 32,  w,     w)
+
+
+/* Averaging Instructions */
+MMVEC_AVGU(ub,"ub",     "Unsigned Byte",     8,   ub,   ub)
+MMVEC_AVGU(uh,"uh",     "Unsigned Halfword", 16,  uh,   uh)
+MMVEC_AVGU_NOV1(uw,"uw",     "Unsigned Word",     32,  uw,   uw)
+MMVEC_AVGS_NOV1(b,   "b",    "Byte",               8,   b,   b)
+MMVEC_AVGS(h,   "h",    "Halfword",          16,   h,   h)
+MMVEC_AVGS(w,   "w",    "Word",              32,   w,   w)
+
+
+/* Absolute Difference */
+MMVEC_ABSDIFF(ub,"ub",  "Unsigned Byte",        8,   ub,    ub)
+MMVEC_ABSDIFF(uh,"uh",  "Unsigned Halfword",    16,  uh,    uh)
+MMVEC_ABSDIFF(h,"h",        "Halfword",             16,  uh,    h)
+MMVEC_ABSDIFF(w,"w",        "Word",                 32,  uw,    w)
+
+ITERATOR_INSN2_ANY_SLOT(8,vnavgub, "Vd32=vnavgub(Vu32,Vv32)", "Vd32.b=vnavg(Vu32.ub,Vv32.ub)",
+"Vector Negative Average Unsigned Byte", VdV.b[i]   = fVNAVGU(8, VuV.ub[i], VvV.ub[i]))
+
+ITERATOR_INSN_ANY_SLOT(32,vaddcarrysat,"Vd32.w=vadd(Vu32.w,Vv32.w,Qs4):carry:sat","add w/carry and saturate",
+VdV.w[i] = fVSATW(VuV.w[i]+VvV.w[i]+fGETQBIT(QsV,i*4)))
+
+ITERATOR_INSN_ANY_SLOT(32,vaddcarry,"Vd32.w=vadd(Vu32.w,Vv32.w,Qx4):carry","add w/carry",
+VdV.w[i] = VuV.w[i]+VvV.w[i]+fGETQBIT(QxV,i*4);
+fSETQBITS(QxV,4,0xF,4*i,-fCARRY_FROM_ADD32(VuV.w[i],VvV.w[i],fGETQBIT(QxV,i*4))))
+
+ITERATOR_INSN_ANY_SLOT(32,vsubcarry,"Vd32.w=vsub(Vu32.w,Vv32.w,Qx4):carry","subtract w/carry",
+VdV.w[i] = VuV.w[i]+~VvV.w[i]+fGETQBIT(QxV,i*4);
+fSETQBITS(QxV,4,0xF,4*i,-fCARRY_FROM_ADD32(VuV.w[i],~VvV.w[i],fGETQBIT(QxV,i*4))))
+
+ITERATOR_INSN_ANY_SLOT(32,vaddcarryo,"Vd32.w,Qe4=vadd(Vu32.w,Vv32.w):carry","add w/carry out-only",
+VdV.w[i] = VuV.w[i]+VvV.w[i];
+fSETQBITS(QeV,4,0xF,4*i,-fCARRY_FROM_ADD32(VuV.w[i],VvV.w[i],0)))
+
+ITERATOR_INSN_ANY_SLOT(32,vsubcarryo,"Vd32.w,Qe4=vsub(Vu32.w,Vv32.w):carry","subtract w/carry out-only",
+VdV.w[i] = VuV.w[i]+~VvV.w[i]+1;
+fSETQBITS(QeV,4,0xF,4*i,-fCARRY_FROM_ADD32(VuV.w[i],~VvV.w[i],1)))
+
+
+ITERATOR_INSN_ANY_SLOT(32,vsatdw,"Vd32.w=vsatdw(Vu32.w,Vv32.w)","Saturate from 64-bits (higher 32-bits come from first vector) to 32-bits",VdV.w[i] = fVSATDW(VuV.w[i],VvV.w[i]))
+
+
+#define MMVEC_ADDSAT_MIX(TAGEND,SATF,WIDTH,DEST,SRC1,SRC2)\
+ITERATOR_INSN_ANY_SLOT(WIDTH, vadd##TAGEND,"Vd32."#DEST"=vadd(Vu32."#SRC1",Vv32."#SRC2"):sat",    "Vector Add mixed", VdV.DEST[i] =  SATF(VuV.SRC1[i] +  VvV.SRC2[i]))\
+ITERATOR_INSN_ANY_SLOT(WIDTH, vsub##TAGEND,"Vd32."#DEST"=vsub(Vu32."#SRC1",Vv32."#SRC2"):sat",    "Vector Sub mixed", VdV.DEST[i] =  SATF(VuV.SRC1[i] -  VvV.SRC2[i]))\
+
+MMVEC_ADDSAT_MIX(ububb_sat,fVSATUB,8,ub,ub,b)
+
+/****************************
+*   WIDENING
+****************************/
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vaddubh,"Vdd32=vaddub(Vu32,Vv32)","Vdd32.h=vadd(Vu32.ub,Vv32.ub)",
+"Vector addition with widen into two vectors",
+    VddV.v[0].h[i] = fZE8_16(fGETUBYTE(0, VuV.uh[i])) + fZE8_16(fGETUBYTE(0, VvV.uh[i]));
+    VddV.v[1].h[i] = fZE8_16(fGETUBYTE(1, VuV.uh[i])) + fZE8_16(fGETUBYTE(1, VvV.uh[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vsububh,"Vdd32=vsubub(Vu32,Vv32)","Vdd32.h=vsub(Vu32.ub,Vv32.ub)",
+"Vector subtraction with widen into two vectors",
+    VddV.v[0].h[i] = fZE8_16(fGETUBYTE(0, VuV.uh[i])) - fZE8_16(fGETUBYTE(0, VvV.uh[i]));
+    VddV.v[1].h[i] = fZE8_16(fGETUBYTE(1, VuV.uh[i])) - fZE8_16(fGETUBYTE(1, VvV.uh[i])))
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vaddhw,"Vdd32=vaddh(Vu32,Vv32)","Vdd32.w=vadd(Vu32.h,Vv32.h)",
+"Vector addition with widen into two vectors",
+    VddV.v[0].w[i] = fGETHALF(0, VuV.w[i]) + fGETHALF(0, VvV.w[i]);
+    VddV.v[1].w[i] = fGETHALF(1, VuV.w[i]) + fGETHALF(1, VvV.w[i]))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vsubhw,"Vdd32=vsubh(Vu32,Vv32)","Vdd32.w=vsub(Vu32.h,Vv32.h)",
+"Vector subtraction with widen into two vectors",
+    VddV.v[0].w[i] = fGETHALF(0, VuV.w[i]) - fGETHALF(0, VvV.w[i]);
+    VddV.v[1].w[i] = fGETHALF(1, VuV.w[i]) - fGETHALF(1, VvV.w[i]))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vadduhw,"Vdd32=vadduh(Vu32,Vv32)","Vdd32.w=vadd(Vu32.uh,Vv32.uh)",
+"Vector addition with widen into two vectors",
+    VddV.v[0].w[i] = fZE16_32(fGETUHALF(0, VuV.uw[i])) + fZE16_32(fGETUHALF(0, VvV.uw[i]));
+    VddV.v[1].w[i] = fZE16_32(fGETUHALF(1, VuV.uw[i])) + fZE16_32(fGETUHALF(1, VvV.uw[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vsubuhw,"Vdd32=vsubuh(Vu32,Vv32)","Vdd32.w=vsub(Vu32.uh,Vv32.uh)",
+"Vector subtraction with widen into two vectors",
+    VddV.v[0].w[i] = fZE16_32(fGETUHALF(0, VuV.uw[i])) - fZE16_32(fGETUHALF(0, VvV.uw[i]));
+    VddV.v[1].w[i] = fZE16_32(fGETUHALF(1, VuV.uw[i])) - fZE16_32(fGETUHALF(1, VvV.uw[i])))
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vaddhw_acc,"Vxx32+=vaddh(Vu32,Vv32)","Vxx32.w+=vadd(Vu32.h,Vv32.h)",
+"Vector addition with widen into two vectors",
+    VxxV.v[0].w[i] += fGETHALF(0, VuV.w[i]) + fGETHALF(0, VvV.w[i]);
+    VxxV.v[1].w[i] += fGETHALF(1, VuV.w[i]) + fGETHALF(1, VvV.w[i]))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vadduhw_acc,"Vxx32+=vadduh(Vu32,Vv32)","Vxx32.w+=vadd(Vu32.uh,Vv32.uh)",
+"Vector addition with widen into two vectors",
+    VxxV.v[0].w[i] += fGETUHALF(0, VuV.w[i]) + fGETUHALF(0, VvV.w[i]);
+    VxxV.v[1].w[i] += fGETUHALF(1, VuV.w[i]) + fGETUHALF(1, VvV.w[i]))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vaddubh_acc,"Vxx32+=vaddub(Vu32,Vv32)","Vxx32.h+=vadd(Vu32.ub,Vv32.ub)",
+"Vector addition with widen into two vectors",
+    VxxV.v[0].h[i] += fGETUBYTE(0, VuV.h[i]) + fGETUBYTE(0, VvV.h[i]);
+    VxxV.v[1].h[i] += fGETUBYTE(1, VuV.h[i]) + fGETUBYTE(1, VvV.h[i]))
+
+
+/****************************
+*   Conditional
+****************************/
+
+#define CONDADDSUB(WIDTH,TAGEND,LHSYN,RHSYN,DESCR,LHBEH,RHBEH) \
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vadd##TAGEND##q,"if (Qv4."#TAGEND") "LHSYN"+="RHSYN,"if (Qv4) "LHSYN"+="RHSYN,DESCR,LHBEH=fCONDMASK##WIDTH(QvV,i,LHBEH+RHBEH,LHBEH)) \
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vsub##TAGEND##q,"if (Qv4."#TAGEND") "LHSYN"-="RHSYN,"if (Qv4) "LHSYN"-="RHSYN,DESCR,LHBEH=fCONDMASK##WIDTH(QvV,i,LHBEH-RHBEH,LHBEH)) \
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vadd##TAGEND##nq,"if (!Qv4."#TAGEND") "LHSYN"+="RHSYN,"if (!Qv4) "LHSYN"+="RHSYN,DESCR,LHBEH=fCONDMASK##WIDTH(QvV,i,LHBEH,LHBEH+RHBEH)) \
+ITERATOR_INSN2_ANY_SLOT(WIDTH,vsub##TAGEND##nq,"if (!Qv4."#TAGEND") "LHSYN"-="RHSYN,"if (!Qv4) "LHSYN"-="RHSYN,DESCR,LHBEH=fCONDMASK##WIDTH(QvV,i,LHBEH,LHBEH-RHBEH)) \
+
+CONDADDSUB(8,b,"Vx32.b","Vu32.b","Conditional add/sub Byte",VxV.ub[i],VuV.ub[i])
+CONDADDSUB(16,h,"Vx32.h","Vu32.h","Conditional add/sub Half",VxV.h[i],VuV.h[i])
+CONDADDSUB(32,w,"Vx32.w","Vu32.w","Conditional add/sub Word",VxV.w[i],VuV.w[i])
+
+/*****************************************************
+ ABSOLUTE VALUES
+*****************************************************/
+// V65
+ITERATOR_INSN2_ANY_SLOT_NOV1(8,vabsb,        "Vd32=vabsb(Vu32)",     "Vd32.b=vabs(Vu32.b)",     "Vector absolute value of bytes",    VdV.b[i]  =  fABS(VuV.b[i]))
+ITERATOR_INSN2_ANY_SLOT_NOV1(8,vabsb_sat,    "Vd32=vabsb(Vu32):sat", "Vd32.b=vabs(Vu32.b):sat", "Vector absolute value of bytes",    VdV.b[i]  =  fVSATB(fABS(fSE8_16(VuV.b[i]))))
+
+
+ITERATOR_INSN2_ANY_SLOT(16,vabsh,        "Vd32=vabsh(Vu32)",     "Vd32.h=vabs(Vu32.h)",     "Vector absolute value of halfwords",    VdV.h[i]  =  fABS(VuV.h[i]))
+ITERATOR_INSN2_ANY_SLOT(16,vabsh_sat,    "Vd32=vabsh(Vu32):sat", "Vd32.h=vabs(Vu32.h):sat", "Vector absolute value of halfwords",    VdV.h[i]  =  fVSATH(fABS(fSE16_32(VuV.h[i]))))
+ITERATOR_INSN2_ANY_SLOT(32,vabsw,        "Vd32=vabsw(Vu32)",     "Vd32.w=vabs(Vu32.w)",     "Vector absolute value of words",        VdV.w[i]  =  fABS(VuV.w[i]))
+ITERATOR_INSN2_ANY_SLOT(32,vabsw_sat,    "Vd32=vabsw(Vu32):sat", "Vd32.w=vabs(Vu32.w):sat", "Vector absolute value of words",        VdV.w[i]  =  fVSATW(fABS(fSE32_64(VuV.w[i]))))
+
+
+/**************************************************************************
+ * MMVECTOR MULTIPLICATIONS
+ * ************************************************************************/
+
+
+/* Byte by Byte */
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpybv,"Vdd32=vmpyb(Vu32,Vv32)","Vdd32.h=vmpy(Vu32.b,Vv32.b)",
+"Vector by Vector Byte Multiply",
+    VddV.v[0].h[i] =  fMPY8SS(fGETBYTE(0, VuV.h[i]), fGETBYTE(0, VvV.h[i]));
+    VddV.v[1].h[i] =  fMPY8SS(fGETBYTE(1, VuV.h[i]), fGETBYTE(1, VvV.h[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpybv_acc,"Vxx32+=vmpyb(Vu32,Vv32)","Vxx32.h+=vmpy(Vu32.b,Vv32.b)",
+"Vector by Vector Byte Multiply",
+    VxxV.v[0].h[i] +=  fMPY8SS(fGETBYTE(0, VuV.h[i]), fGETBYTE(0, VvV.h[i]));
+    VxxV.v[1].h[i] +=  fMPY8SS(fGETBYTE(1, VuV.h[i]), fGETBYTE(1, VvV.h[i])))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpyubv,"Vdd32=vmpyub(Vu32,Vv32)","Vdd32.uh=vmpy(Vu32.ub,Vv32.ub)",
+"Vector by Vector Unsigned Byte Multiply",
+    VddV.v[0].uh[i] =  fMPY8UU(fGETUBYTE(0, VuV.uh[i]), fGETUBYTE(0, VvV.uh[i]) );
+    VddV.v[1].uh[i] =  fMPY8UU(fGETUBYTE(1, VuV.uh[i]), fGETUBYTE(1, VvV.uh[i]) ))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpyubv_acc,"Vxx32+=vmpyub(Vu32,Vv32)","Vxx32.uh+=vmpy(Vu32.ub,Vv32.ub)",
+"Vector by Vector Unsigned Byte Multiply",
+    VxxV.v[0].uh[i] +=  fMPY8UU(fGETUBYTE(0, VuV.uh[i]), fGETUBYTE(0, VvV.uh[i]) );
+    VxxV.v[1].uh[i] +=  fMPY8UU(fGETUBYTE(1, VuV.uh[i]), fGETUBYTE(1, VvV.uh[i]) ))
+
+
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpybusv,"Vdd32=vmpybus(Vu32,Vv32)","Vdd32.h=vmpy(Vu32.ub,Vv32.b)",
+"Vector by Vector Unsigned Byte by Signed Byte Multiply",
+    VddV.v[0].h[i]  = fMPY8US(fGETUBYTE(0, VuV.uh[i]), fGETBYTE(0, VvV.h[i]));
+    VddV.v[1].h[i]  = fMPY8US(fGETUBYTE(1, VuV.uh[i]), fGETBYTE(1, VvV.h[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpybusv_acc,"Vxx32+=vmpybus(Vu32,Vv32)","Vxx32.h+=vmpy(Vu32.ub,Vv32.b)",
+"Vector by Vector Unsigned Byte by Signed Byte Multiply",
+    VxxV.v[0].h[i]  += fMPY8US(fGETUBYTE(0, VuV.uh[i]), fGETBYTE(0, VvV.h[i]));
+    VxxV.v[1].h[i]  += fMPY8US(fGETUBYTE(1, VuV.uh[i]), fGETBYTE(1, VvV.h[i])))
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpabusv,"Vdd32=vmpabus(Vuu32,Vvv32)","Vdd32.h=vmpa(Vuu32.ub,Vvv32.b)",
+"Vertical Byte Multiply",
+    VddV.v[0].h[i] = fMPY8US(fGETUBYTE(0, VuuV.v[0].uh[i]), fGETBYTE(0, VvvV.v[0].uh[i])) + fMPY8US(fGETUBYTE(0, VuuV.v[1].uh[i]), fGETBYTE(0, VvvV.v[1].uh[i]));
+    VddV.v[1].h[i] = fMPY8US(fGETUBYTE(1, VuuV.v[0].uh[i]), fGETBYTE(1, VvvV.v[0].uh[i])) + fMPY8US(fGETUBYTE(1, VuuV.v[1].uh[i]), fGETBYTE(1, VvvV.v[1].uh[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpabuuv,"Vdd32=vmpabuu(Vuu32,Vvv32)","Vdd32.h=vmpa(Vuu32.ub,Vvv32.ub)",
+"Vertical Byte Multiply",
+    VddV.v[0].h[i] = fMPY8UU(fGETUBYTE(0, VuuV.v[0].uh[i]), fGETUBYTE(0, VvvV.v[0].uh[i])) + fMPY8UU(fGETUBYTE(0, VuuV.v[1].uh[i]), fGETUBYTE(0, VvvV.v[1].uh[i]));
+    VddV.v[1].h[i] = fMPY8UU(fGETUBYTE(1, VuuV.v[0].uh[i]), fGETUBYTE(1, VvvV.v[0].uh[i])) + fMPY8UU(fGETUBYTE(1, VuuV.v[1].uh[i]), fGETUBYTE(1, VvvV.v[1].uh[i])))
+
+
+
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyhv,"Vdd32=vmpyh(Vu32,Vv32)","Vdd32.w=vmpy(Vu32.h,Vv32.h)",
+"Vector by Vector Halfword Multiply",
+    VddV.v[0].w[i] = fMPY16SS(fGETHALF(0, VuV.w[i]), fGETHALF(0, VvV.w[i]));
+    VddV.v[1].w[i] = fMPY16SS(fGETHALF(1, VuV.w[i]), fGETHALF(1, VvV.w[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyhv_acc,"Vxx32+=vmpyh(Vu32,Vv32)","Vxx32.w+=vmpy(Vu32.h,Vv32.h)",
+"Vector by Vector Halfword Multiply",
+    VxxV.v[0].w[i] += fMPY16SS(fGETHALF(0, VuV.w[i]), fGETHALF(0, VvV.w[i]));
+    VxxV.v[1].w[i] += fMPY16SS(fGETHALF(1, VuV.w[i]), fGETHALF(1, VvV.w[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyuhv,"Vdd32=vmpyuh(Vu32,Vv32)","Vdd32.uw=vmpy(Vu32.uh,Vv32.uh)",
+"Vector by Vector Unsigned Halfword Multiply",
+    VddV.v[0].uw[i] = fMPY16UU(fGETUHALF(0, VuV.uw[i]), fGETUHALF(0, VvV.uw[i]));
+    VddV.v[1].uw[i] = fMPY16UU(fGETUHALF(1, VuV.uw[i]), fGETUHALF(1, VvV.uw[i])))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyuhv_acc,"Vxx32+=vmpyuh(Vu32,Vv32)","Vxx32.uw+=vmpy(Vu32.uh,Vv32.uh)",
+"Vector by Vector Unsigned Halfword Multiply",
+    VxxV.v[0].uw[i] += fMPY16UU(fGETUHALF(0, VuV.uw[i]), fGETUHALF(0, VvV.uw[i]));
+    VxxV.v[1].uw[i] += fMPY16UU(fGETUHALF(1, VuV.uw[i]), fGETUHALF(1, VvV.uw[i])))
+
+
+
+/* Vector by Vector */
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpyhvsrs,"Vd32=vmpyh(Vu32,Vv32):<<1:rnd:sat","Vd32.h=vmpy(Vu32.h,Vv32.h):<<1:rnd:sat",
+"Vector halfword multiply with round, shift, and sat16",
+    VdV.h[i] = fVSATH(fGETHALF(1,fVSAT(fROUND((fMPY16SS(VuV.h[i],VvV.h[i]    )<<1))))))
+
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyhus, "Vdd32=vmpyhus(Vu32,Vv32)","Vdd32.w=vmpy(Vu32.h,Vv32.uh)",
+"Vector by Vector Halfword Multiply",
+    VddV.v[0].w[i] = fMPY16SU(fGETHALF(0, VuV.w[i]), fGETUHALF(0, VvV.uw[i]));
+    VddV.v[1].w[i] = fMPY16SU(fGETHALF(1, VuV.w[i]), fGETUHALF(1, VvV.uw[i])))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyhus_acc, "Vxx32+=vmpyhus(Vu32,Vv32)","Vxx32.w+=vmpy(Vu32.h,Vv32.uh)",
+"Vector by Vector Halfword Multiply",
+    VxxV.v[0].w[i] += fMPY16SU(fGETHALF(0, VuV.w[i]), fGETUHALF(0, VvV.uw[i]));
+    VxxV.v[1].w[i] += fMPY16SU(fGETHALF(1, VuV.w[i]), fGETUHALF(1, VvV.uw[i])))
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpyih,"Vd32=vmpyih(Vu32,Vv32)","Vd32.h=vmpyi(Vu32.h,Vv32.h)",
+"Vector by Vector Halfword Multiply",
+    VdV.h[i] = fMPY16SS(VuV.h[i], VvV.h[i]))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpyih_acc,"Vx32+=vmpyih(Vu32,Vv32)","Vx32.h+=vmpyi(Vu32.h,Vv32.h)",
+"Vector by Vector Halfword Multiply",
+    VxV.h[i] += fMPY16SS(VuV.h[i], VvV.h[i]))
+
+
+
+/* 32x32 high half / frac */
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyewuh,"Vd32=vmpyewuh(Vu32,Vv32)","Vd32.w=vmpye(Vu32.w,Vv32.uh)",
+"Vector by Vector Halfword Multiply",
+VdV.w[i] = fMPY3216SU(VuV.w[i], fGETUHALF(0, VvV.w[i])) >> 16)
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyowh,"Vd32=vmpyowh(Vu32,Vv32):<<1:sat","Vd32.w=vmpyo(Vu32.w,Vv32.h):<<1:sat",
+"Vector by Vector Halfword Multiply",
+VdV.w[i] = fVSATW((((fMPY3216SS(VuV.w[i], fGETHALF(1, VvV.w[i])) >> 14) + 0) >> 1)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyowh_rnd,"Vd32=vmpyowh(Vu32,Vv32):<<1:rnd:sat","Vd32.w=vmpyo(Vu32.w,Vv32.h):<<1:rnd:sat",
+"Vector by Vector Halfword Multiply",
+VdV.w[i] = fVSATW((((fMPY3216SS(VuV.w[i], fGETHALF(1, VvV.w[i])) >> 14) + 1) >> 1)))
+
+ITERATOR_INSN_MPY_SLOT_DOUBLE_VEC(32,vmpyewuh_64,"Vdd32=vmpye(Vu32.w,Vv32.uh)",
+"Word times Halfword Multiply, 64-bit result",
+	fHIDE(size8s_t prod;)
+	prod = fMPY32SU(VuV.w[i],fGETUHALF(0,VvV.w[i]));
+	VddV.v[1].w[i] = prod >> 16;
+	VddV.v[0].w[i] = prod << 16)
+
+ITERATOR_INSN_MPY_SLOT_DOUBLE_VEC(32,vmpyowh_64_acc,"Vxx32+=vmpyo(Vu32.w,Vv32.h)",
+"Word times Halfword Multiply, 64-bit result",
+	fHIDE(size8s_t prod;)
+	prod = fMPY32SS(VuV.w[i],fGETHALF(1,VvV.w[i]))  + fSE32_64(VxxV.v[1].w[i]);
+	VxxV.v[1].w[i] = prod >> 16;
+	fSETHALF(0, VxxV.v[0].w[i], VxxV.v[0].w[i] >> 16);
+	fSETHALF(1, VxxV.v[0].w[i], prod & 0x0000ffff))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyowh_sacc,"Vx32+=vmpyowh(Vu32,Vv32):<<1:sat:shift","Vx32.w+=vmpyo(Vu32.w,Vv32.h):<<1:sat:shift",
+"Vector by Vector Halfword Multiply",
+IV1DEAD() VxV.w[i] = fVSATW(((((VxV.w[i] + fMPY3216SS(VuV.w[i], fGETHALF(1, VvV.w[i]))) >> 14) + 0) >> 1)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyowh_rnd_sacc,"Vx32+=vmpyowh(Vu32,Vv32):<<1:rnd:sat:shift","Vx32.w+=vmpyo(Vu32.w,Vv32.h):<<1:rnd:sat:shift",
+"Vector by Vector Halfword Multiply",
+IV1DEAD() VxV.w[i] = fVSATW(((((VxV.w[i] + fMPY3216SS(VuV.w[i], fGETHALF(1, VvV.w[i]))) >> 14) + 1) >> 1)))
+
+/* For 32x32 integer / low half */
+
+ITERATOR_INSN_MPY_SLOT(32,vmpyieoh,"Vd32.w=vmpyieo(Vu32.h,Vv32.h)","Odd/Even multiply for 32x32 low half",
+	VdV.w[i] = (fGETHALF(0,VuV.w[i])*fGETHALF(1,VvV.w[i])) << 16)
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyiewuh,"Vd32=vmpyiewuh(Vu32,Vv32)","Vd32.w=vmpyie(Vu32.w,Vv32.uh)",
+"Vector by Vector Word by Halfword Multiply",
+IV1DEAD()    VdV.w[i] = fMPY3216SU(VuV.w[i], fGETUHALF(0, VvV.w[i])) )
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyiowh,"Vd32=vmpyiowh(Vu32,Vv32)","Vd32.w=vmpyio(Vu32.w,Vv32.h)",
+"Vector by Vector Word by Halfword Multiply",
+IV1DEAD()    VdV.w[i] = fMPY3216SS(VuV.w[i], fGETHALF(1, VvV.w[i])) )
+
+/* Add back these... */
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyiewh_acc,"Vx32+=vmpyiewh(Vu32,Vv32)","Vx32.w+=vmpyie(Vu32.w,Vv32.h)",
+"Vector by Vector Word by Halfword Multiply",
+VxV.w[i] = VxV.w[i] + fMPY3216SS(VuV.w[i], fGETHALF(0, VvV.w[i])) )
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyiewuh_acc,"Vx32+=vmpyiewuh(Vu32,Vv32)","Vx32.w+=vmpyie(Vu32.w,Vv32.uh)",
+"Vector by Vector Word by Halfword Multiply",
+VxV.w[i] = VxV.w[i] + fMPY3216SU(VuV.w[i], fGETUHALF(0, VvV.w[i])) )
+
+
+
+
+
+
+
+/* Vector by Scalar */
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpyub,"Vdd32=vmpyub(Vu32,Rt32)","Vdd32.uh=vmpy(Vu32.ub,Rt32.ub)",
+"Vector by Scalar Unsigned Byte Multiply",
+    VddV.v[0].uh[i]  = fMPY8UU(fGETUBYTE(0, VuV.uh[i]), fGETUBYTE((2*i+0)%4, RtV));
+    VddV.v[1].uh[i]  = fMPY8UU(fGETUBYTE(1, VuV.uh[i]), fGETUBYTE((2*i+1)%4, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpyub_acc,"Vxx32+=vmpyub(Vu32,Rt32)","Vxx32.uh+=vmpy(Vu32.ub,Rt32.ub)",
+"Vector by Scalar Unsigned Byte Multiply",
+    VxxV.v[0].uh[i] += fMPY8UU(fGETUBYTE(0, VuV.uh[i]), fGETUBYTE((2*i+0)%4, RtV));
+    VxxV.v[1].uh[i] += fMPY8UU(fGETUBYTE(1, VuV.uh[i]), fGETUBYTE((2*i+1)%4, RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpybus,"Vdd32=vmpybus(Vu32,Rt32)","Vdd32.h=vmpy(Vu32.ub,Rt32.b)",
+"Vector Unsigned Byte by Scalar Signed Byte Multiply",
+    VddV.v[0].h[i]  = fMPY8US(fGETUBYTE(0, VuV.uh[i]), fGETBYTE((2*i+0)%4, RtV));
+    VddV.v[1].h[i]  = fMPY8US(fGETUBYTE(1, VuV.uh[i]), fGETBYTE((2*i+1)%4, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpybus_acc,"Vxx32+=vmpybus(Vu32,Rt32)","Vxx32.h+=vmpy(Vu32.ub,Rt32.b)",
+"Vector Unsigned Byte by Scalar Signed Byte Multiply",
+    VxxV.v[0].h[i] += fMPY8US(fGETUBYTE(0, VuV.uh[i]), fGETBYTE((2*i+0)%4, RtV));
+    VxxV.v[1].h[i] += fMPY8US(fGETUBYTE(1, VuV.uh[i]), fGETBYTE((2*i+1)%4, RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpabus,"Vdd32=vmpabus(Vuu32,Rt32)","Vdd32.h=vmpa(Vuu32.ub,Rt32.b)",
+"Vertical Byte Multiply",
+    VddV.v[0].h[i] = fMPY8US(fGETUBYTE(0, VuuV.v[0].uh[i]), fGETBYTE(0, RtV)) + fMPY16SS(fGETUBYTE(0, VuuV.v[1].uh[i]), fGETBYTE(1, RtV));
+    VddV.v[1].h[i] = fMPY8US(fGETUBYTE(1, VuuV.v[0].uh[i]), fGETBYTE(2, RtV)) + fMPY16SS(fGETUBYTE(1, VuuV.v[1].uh[i]), fGETBYTE(3, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(16,vmpabus_acc,"Vxx32+=vmpabus(Vuu32,Rt32)","Vxx32.h+=vmpa(Vuu32.ub,Rt32.b)",
+"Vertical Byte Multiply",
+    VxxV.v[0].h[i] += fMPY8US(fGETUBYTE(0, VuuV.v[0].uh[i]), fGETBYTE(0, RtV)) + fMPY16SS(fGETUBYTE(0, VuuV.v[1].uh[i]), fGETBYTE(1, RtV));
+    VxxV.v[1].h[i] += fMPY8US(fGETUBYTE(1, VuuV.v[0].uh[i]), fGETBYTE(2, RtV)) + fMPY16SS(fGETUBYTE(1, VuuV.v[1].uh[i]), fGETBYTE(3, RtV)))
+
+// V65
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC_NOV1(16,vmpabuu,"Vdd32=vmpabuu(Vuu32,Rt32)","Vdd32.h=vmpa(Vuu32.ub,Rt32.ub)",
+"Vertical Byte Multiply",
+    VddV.v[0].uh[i] = fMPY8UU(fGETUBYTE(0, VuuV.v[0].uh[i]), fGETUBYTE(0, RtV)) + fMPY8UU(fGETUBYTE(0, VuuV.v[1].uh[i]), fGETUBYTE(1, RtV));
+    VddV.v[1].uh[i] = fMPY8UU(fGETUBYTE(1, VuuV.v[0].uh[i]), fGETUBYTE(2, RtV)) + fMPY8UU(fGETUBYTE(1, VuuV.v[1].uh[i]), fGETUBYTE(3, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC_NOV1(16,vmpabuu_acc,"Vxx32+=vmpabuu(Vuu32,Rt32)","Vxx32.h+=vmpa(Vuu32.ub,Rt32.ub)",
+"Vertical Byte Multiply",
+    VxxV.v[0].uh[i] += fMPY8UU(fGETUBYTE(0, VuuV.v[0].uh[i]), fGETUBYTE(0, RtV)) + fMPY8UU(fGETUBYTE(0, VuuV.v[1].uh[i]), fGETUBYTE(1, RtV));
+    VxxV.v[1].uh[i] += fMPY8UU(fGETUBYTE(1, VuuV.v[0].uh[i]), fGETUBYTE(2, RtV)) + fMPY8UU(fGETUBYTE(1, VuuV.v[1].uh[i]), fGETUBYTE(3, RtV)))
+
+
+
+
+/* Half by Byte */
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpahb,"Vdd32=vmpahb(Vuu32,Rt32)","Vdd32.w=vmpa(Vuu32.h,Rt32.b)",
+"Vertical Byte Multiply",
+    VddV.v[0].w[i] = fMPY16SS(fGETHALF(0, VuuV.v[0].w[i]), fSE8_16(fGETBYTE(0, RtV))) + fMPY16SS(fGETHALF(0, VuuV.v[1].w[i]), fSE8_16(fGETBYTE(1, RtV)));
+    VddV.v[1].w[i] = fMPY16SS(fGETHALF(1, VuuV.v[0].w[i]), fSE8_16(fGETBYTE(2, RtV))) + fMPY16SS(fGETHALF(1, VuuV.v[1].w[i]), fSE8_16(fGETBYTE(3, RtV))))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpahb_acc,"Vxx32+=vmpahb(Vuu32,Rt32)","Vxx32.w+=vmpa(Vuu32.h,Rt32.b)",
+"Vertical Byte Multiply",
+    VxxV.v[0].w[i] += fMPY16SS(fGETHALF(0, VuuV.v[0].w[i]), fSE8_16(fGETBYTE(0, RtV))) + fMPY16SS(fGETHALF(0, VuuV.v[1].w[i]), fSE8_16(fGETBYTE(1, RtV)));
+    VxxV.v[1].w[i] += fMPY16SS(fGETHALF(1, VuuV.v[0].w[i]), fSE8_16(fGETBYTE(2, RtV))) + fMPY16SS(fGETHALF(1, VuuV.v[1].w[i]), fSE8_16(fGETBYTE(3, RtV))))
+
+/* Half by Byte */
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpauhb,"Vdd32=vmpauhb(Vuu32,Rt32)","Vdd32.w=vmpa(Vuu32.uh,Rt32.b)",
+"Vertical Byte Multiply",
+    VddV.v[0].w[i] = fMPY16US(fGETUHALF(0, VuuV.v[0].w[i]), fSE8_16(fGETBYTE(0, RtV))) + fMPY16US(fGETUHALF(0, VuuV.v[1].w[i]), fSE8_16(fGETBYTE(1, RtV)));
+    VddV.v[1].w[i] = fMPY16US(fGETUHALF(1, VuuV.v[0].w[i]), fSE8_16(fGETBYTE(2, RtV))) + fMPY16US(fGETUHALF(1, VuuV.v[1].w[i]), fSE8_16(fGETBYTE(3, RtV))))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpauhb_acc,"Vxx32+=vmpauhb(Vuu32,Rt32)","Vxx32.w+=vmpa(Vuu32.uh,Rt32.b)",
+"Vertical Byte Multiply",
+    VxxV.v[0].w[i] += fMPY16US(fGETUHALF(0, VuuV.v[0].w[i]), fSE8_16(fGETBYTE(0, RtV))) + fMPY16US(fGETUHALF(0, VuuV.v[1].w[i]), fSE8_16(fGETBYTE(1, RtV)));
+    VxxV.v[1].w[i] += fMPY16US(fGETUHALF(1, VuuV.v[0].w[i]), fSE8_16(fGETBYTE(2, RtV))) + fMPY16US(fGETUHALF(1, VuuV.v[1].w[i]), fSE8_16(fGETBYTE(3, RtV))))
+
+
+
+
+
+
+
+/* Half by Half */
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyh,"Vdd32=vmpyh(Vu32,Rt32)","Vdd32.w=vmpy(Vu32.h,Rt32.h)",
+"Vector by Scalar Halfword Multiply",
+    VddV.v[0].w[i] =  fMPY16SS(fGETHALF(0, VuV.w[i]), fGETHALF(0, RtV));
+    VddV.v[1].w[i] =  fMPY16SS(fGETHALF(1, VuV.w[i]), fGETHALF(1, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC_NOV1(32,vmpyh_acc,"Vxx32+=vmpyh(Vu32,Rt32)","Vxx32.w+=vmpy(Vu32.h,Rt32.h)",
+"Vector halfword by scalar halfword multiply with accumulate",
+    VxxV.v[0].w[i] =  fCAST8s(VxxV.v[0].w[i]) + fMPY16SS(fGETHALF(0, VuV.w[i]), fGETHALF(0, RtV));
+    VxxV.v[1].w[i] =  fCAST8s(VxxV.v[1].w[i]) + fMPY16SS(fGETHALF(1, VuV.w[i]), fGETHALF(1, RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyhsat_acc,"Vxx32+=vmpyh(Vu32,Rt32):sat","Vxx32.w+=vmpy(Vu32.h,Rt32.h):sat",
+"Vector even halfwords with scalar lower halfword multiply with shift and sat32",
+    VxxV.v[0].w[i] =  fVSATW(fCAST8s(VxxV.v[0].w[i]) + fMPY16SS(fGETHALF(0, VuV.w[i]), fGETHALF(0, RtV)));
+    VxxV.v[1].w[i] =  fVSATW(fCAST8s(VxxV.v[1].w[i]) + fMPY16SS(fGETHALF(1, VuV.w[i]), fGETHALF(1, RtV))))
+
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyhss,"Vd32=vmpyh(Vu32,Rt32):<<1:sat","Vd32.h=vmpy(Vu32.h,Rt32.h):<<1:sat",
+"Vector halfword by halfword multiply, shift by 1, and take upper 16 msb",
+          fSETHALF(0,VdV.w[i],fVSATH(fGETHALF(1,fVSAT((fMPY16SS(fGETHALF(0,VuV.w[i]),fGETHALF(0,RtV))<<1)))));
+          fSETHALF(1,VdV.w[i],fVSATH(fGETHALF(1,fVSAT((fMPY16SS(fGETHALF(1,VuV.w[i]),fGETHALF(1,RtV))<<1)))));
+)
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyhsrs,"Vd32=vmpyh(Vu32,Rt32):<<1:rnd:sat","Vd32.h=vmpy(Vu32.h,Rt32.h):<<1:rnd:sat",
+"Vector halfword with scalar halfword multiply with round, shift, and sat16",
+       fSETHALF(0,VdV.w[i],fVSATH(fGETHALF(1,fVSAT(fROUND((fMPY16SS(fGETHALF(0,VuV.w[i]),fGETHALF(0,RtV))<<1))))));
+       fSETHALF(1,VdV.w[i],fVSATH(fGETHALF(1,fVSAT(fROUND((fMPY16SS(fGETHALF(1,VuV.w[i]),fGETHALF(1,RtV))<<1))))));
+)
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyuh,"Vdd32=vmpyuh(Vu32,Rt32)","Vdd32.uw=vmpy(Vu32.uh,Rt32.uh)",
+"Vector even halfword unsigned multiply by scalar",
+    VddV.v[0].uw[i] = fMPY16UU(fGETUHALF(0, VuV.uw[i]),fGETUHALF(0,RtV));
+    VddV.v[1].uw[i] = fMPY16UU(fGETUHALF(1, VuV.uw[i]),fGETUHALF(1,RtV)))
+
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyuh_acc,"Vxx32+=vmpyuh(Vu32,Rt32)","Vxx32.uw+=vmpy(Vu32.uh,Rt32.uh)",
+"Vector even halfword unsigned multiply by scalar",
+    VxxV.v[0].uw[i] += fMPY16UU(fGETUHALF(0, VuV.uw[i]),fGETUHALF(0,RtV));
+    VxxV.v[1].uw[i] += fMPY16UU(fGETUHALF(1, VuV.uw[i]),fGETUHALF(1,RtV)))
+
+
+
+
+/********************************************
+*  HALF BY BYTE
+********************************************/
+ITERATOR_INSN2_MPY_SLOT(16,vmpyihb,"Vd32=vmpyihb(Vu32,Rt32)","Vd32.h=vmpyi(Vu32.h,Rt32.b)",
+"Vector halfword by byte multiply, keep lower result",
+VdV.h[i]  = fMPY16SS(VuV.h[i], fGETBYTE(i % 4, RtV) ))
+
+ITERATOR_INSN2_MPY_SLOT(16,vmpyihb_acc,"Vx32+=vmpyihb(Vu32,Rt32)","Vx32.h+=vmpyi(Vu32.h,Rt32.b)",
+"Vector halfword by byte multiply, keep lower result",
+VxV.h[i] += fMPY16SS(VuV.h[i], fGETBYTE(i % 4, RtV) ))
+
+
+/********************************************
+*  WORD BY BYTE
+********************************************/
+ITERATOR_INSN2_MPY_SLOT(32,vmpyiwb,"Vd32=vmpyiwb(Vu32,Rt32)","Vd32.w=vmpyi(Vu32.w,Rt32.b)",
+"Vector word by byte multiply, keep lower result",
+VdV.w[i]  = fMPY32SS(VuV.w[i], fGETBYTE(i % 4, RtV) ))
+
+ITERATOR_INSN2_MPY_SLOT(32,vmpyiwb_acc,"Vx32+=vmpyiwb(Vu32,Rt32)","Vx32.w+=vmpyi(Vu32.w,Rt32.b)",
+"Vector word by byte multiply, keep lower result",
+VxV.w[i] += fMPY32SS(VuV.w[i], fGETBYTE(i % 4, RtV) ))
+
+ITERATOR_INSN2_MPY_SLOT(32,vmpyiwub,"Vd32=vmpyiwub(Vu32,Rt32)","Vd32.w=vmpyi(Vu32.w,Rt32.ub)",
+"Vector word by byte multiply, keep lower result",
+VdV.w[i]  = fMPY32SS(VuV.w[i], fGETUBYTE(i % 4, RtV) ))
+
+ITERATOR_INSN2_MPY_SLOT(32,vmpyiwub_acc,"Vx32+=vmpyiwub(Vu32,Rt32)","Vx32.w+=vmpyi(Vu32.w,Rt32.ub)",
+"Vector word by byte multiply, keep lower result",
+VxV.w[i] += fMPY32SS(VuV.w[i], fGETUBYTE(i % 4, RtV) ))
+
+
+/********************************************
+*  WORD BY HALF
+********************************************/
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyiwh,"Vd32=vmpyiwh(Vu32,Rt32)","Vd32.w=vmpyi(Vu32.w,Rt32.h)",
+"Vector word by halfword multiply, keep lower result",
+VdV.w[i]  = fMPY32SS(VuV.w[i], fGETHALF(i % 2, RtV)))
+
+ITERATOR_INSN2_MPY_SLOT_DOUBLE_VEC(32,vmpyiwh_acc,"Vx32+=vmpyiwh(Vu32,Rt32)","Vx32.w+=vmpyi(Vu32.w,Rt32.h)",
+"Vector word by halfword multiply, keep lower result",
+VxV.w[i] += fMPY32SS(VuV.w[i], fGETHALF(i % 2, RtV)))
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+/**************************************************************************
+ * MMVECTOR LOGICAL OPERATIONS
+ * ************************************************************************/
+ITERATOR_INSN_ANY_SLOT(16,vand,"Vd32=vand(Vu32,Vv32)", "Vector Logical And", VdV.uh[i] = VuV.uh[i] & VvV.h[i])
+ITERATOR_INSN_ANY_SLOT(16,vor, "Vd32=vor(Vu32,Vv32)",  "Vector Logical Or", VdV.uh[i] = VuV.uh[i] | VvV.h[i])
+ITERATOR_INSN_ANY_SLOT(16,vxor,"Vd32=vxor(Vu32,Vv32)", "Vector Logical XOR",    VdV.uh[i] = VuV.uh[i] ^ VvV.h[i])
+ITERATOR_INSN_ANY_SLOT(16,vnot,"Vd32=vnot(Vu32)",     "Vector Logical NOT", VdV.uh[i] = ~VuV.uh[i])
+
+
+
+
+
+ITERATOR_INSN2_MPY_SLOT_LATE(8, vandqrt,
+"Vd32.ub=vand(Qu4.ub,Rt32.ub)", "Vd32=vand(Qu4,Rt32)", "Insert Predicate into Vector",
+    VdV.ub[i] = fGETQBIT(QuV,i) ? fGETUBYTE(i % 4, RtV) : 0)
+
+ITERATOR_INSN2_MPY_SLOT_LATE(8, vandqrt_acc,
+"Vx32.ub|=vand(Qu4.ub,Rt32.ub)", "Vx32|=vand(Qu4,Rt32)",  "Insert Predicate into Vector",
+    VxV.ub[i] |= (fGETQBIT(QuV,i)) ? fGETUBYTE(i % 4, RtV) : 0)
+
+ITERATOR_INSN2_MPY_SLOT_LATE(8, vandnqrt,
+"Vd32.ub=vand(!Qu4.ub,Rt32.ub)", "Vd32=vand(!Qu4,Rt32)", "Insert Predicate into Vector",
+    VdV.ub[i] = !fGETQBIT(QuV,i) ? fGETUBYTE(i % 4, RtV) : 0)
+
+ITERATOR_INSN2_MPY_SLOT_LATE(8, vandnqrt_acc,
+"Vx32.ub|=vand(!Qu4.ub,Rt32.ub)", "Vx32|=vand(!Qu4,Rt32)",  "Insert Predicate into Vector",
+    VxV.ub[i] |= !(fGETQBIT(QuV,i)) ? fGETUBYTE(i % 4, RtV) : 0)
+
+
+ITERATOR_INSN2_MPY_SLOT_LATE(8, vandvrt,
+"Qd4.ub=vand(Vu32.ub,Rt32.ub)", "Qd4=vand(Vu32,Rt32)", "Insert into Predicate",
+    fSETQBIT(QdV,i,((VuV.ub[i] & fGETUBYTE(i % 4, RtV)) != 0) ? 1 : 0))
+
+ITERATOR_INSN2_MPY_SLOT_LATE(8, vandvrt_acc,
+"Qx4.ub|=vand(Vu32.ub,Rt32.ub)", "Qx4|=vand(Vu32,Rt32)", "Insert into Predicate ",
+    fSETQBIT(QxV,i,fGETQBIT(QxV,i)|(((VuV.ub[i] & fGETUBYTE(i % 4, RtV)) != 0) ? 1 : 0)))
+
+ITERATOR_INSN_ANY_SLOT(8,vandvqv,"Vd32=vand(Qv4,Vu32)","Mask off bytes",
+VdV.b[i] = fGETQBIT(QvV,i) ? VuV.b[i] : 0)
+ITERATOR_INSN_ANY_SLOT(8,vandvnqv,"Vd32=vand(!Qv4,Vu32)","Mask off bytes",
+VdV.b[i] = !fGETQBIT(QvV,i) ? VuV.b[i] : 0)
+
+
+ /***************************************************
+ * Compare Vector with Vector
+ ***************************************************/
+#define VCMP(DEST, ASRC, ASRCOP, CMP, N, SRC, MASK, WIDTH)        \
+{ \
+    for(fHIDE(int) i = 0; i < fVBYTES(); i += WIDTH) { \
+        fSETQBITS(DEST,WIDTH,MASK,i,ASRC ASRCOP ((VuV.SRC[i/WIDTH] CMP VvV.SRC[i/WIDTH]) ? MASK : 0)); \
+    } \
+}
+
+
+#define MMVEC_CMPGT(TYPE,TYPE2,TYPE3,DESCR,N,MASK,WIDTH,SRC) \
+EXTINSN(V6_vgt##TYPE,       "Qd4=vcmp.gt(Vu32." TYPE2 ",Vv32." TYPE2 ")",  ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA), DESCR" greater than", \
+	VCMP(QdV, , , >, N, SRC, MASK, WIDTH)) \
+EXTINSN(V6_vgt##TYPE##_and, "Qx4&=vcmp.gt(Vu32." TYPE2 ",Vv32." TYPE2 ")", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA), DESCR" greater than with predicate-and", \
+	VCMP(QxV, fGETQBITS(QxV,WIDTH,MASK,i), &, >, N, SRC, MASK, WIDTH)) \
+EXTINSN(V6_vgt##TYPE##_or,  "Qx4|=vcmp.gt(Vu32." TYPE2 ",Vv32." TYPE2 ")", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA), DESCR" greater than with predicate-or", \
+	VCMP(QxV, fGETQBITS(QxV,WIDTH,MASK,i), |, >, N, SRC, MASK, WIDTH)) \
+EXTINSN(V6_vgt##TYPE##_xor, "Qx4^=vcmp.gt(Vu32." TYPE2 ",Vv32." TYPE2 ")", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA), DESCR" greater than with predicate-xor", \
+	VCMP(QxV, fGETQBITS(QxV,WIDTH,MASK,i), ^, >, N, SRC, MASK, WIDTH))
+
+#define MMVEC_CMP(TYPE,TYPE2,TYPE3,DESCR,N,MASK, WIDTH, SRC)\
+MMVEC_CMPGT(TYPE,TYPE2,TYPE3,DESCR,N,MASK,WIDTH,SRC) \
+EXTINSN(V6_veq##TYPE,       "Qd4=vcmp.eq(Vu32." TYPE2 ",Vv32." TYPE2 ")",  ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA), DESCR" equal to", \
+	VCMP(QdV, , , ==, N, SRC, MASK, WIDTH)) \
+EXTINSN(V6_veq##TYPE##_and, "Qx4&=vcmp.eq(Vu32." TYPE2 ",Vv32." TYPE2 ")", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA), DESCR" equal to with predicate-and", \
+	VCMP(QxV, fGETQBITS(QxV,WIDTH,MASK,i), &, ==, N, SRC, MASK, WIDTH)) \
+EXTINSN(V6_veq##TYPE##_or,  "Qx4|=vcmp.eq(Vu32." TYPE2 ",Vv32." TYPE2 ")", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA), DESCR" equal to with predicate-or", \
+	VCMP(QxV, fGETQBITS(QxV,WIDTH,MASK,i), |, ==, N, SRC, MASK, WIDTH)) \
+EXTINSN(V6_veq##TYPE##_xor, "Qx4^=vcmp.eq(Vu32." TYPE2 ",Vv32." TYPE2 ")", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA), DESCR" equal to with predicate-xor", \
+	VCMP(QxV, fGETQBITS(QxV,WIDTH,MASK,i), ^, ==, N, SRC, MASK, WIDTH))
+
+
+MMVEC_CMP(w,"w","","Vector Word Compare", fVELEM(32), 0xF, 4, w)
+MMVEC_CMP(h,"h","","Vector Half Compare", fVELEM(16), 0x3, 2, h)
+MMVEC_CMP(b,"b","","Vector Byte Compare", fVELEM(8),  0x1, 1, b)
+MMVEC_CMPGT(uw,"uw","","Vector Unsigned Word Compare", fVELEM(32), 0xF, 4,uw)
+MMVEC_CMPGT(uh,"uh","","Vector Unsigned Half Compare", fVELEM(16), 0x3, 2,uh)
+MMVEC_CMPGT(ub,"ub","","Vector Unsigned Byte Compare", fVELEM(8),  0x1, 1,ub)
+
+/***************************************************
+* Predicate Operations
+***************************************************/
+
+EXTINSN(V6_pred_scalar2, "Qd4=vsetq(Rt32)",         ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),   "Set Vector Predicate ",
+{
+    fHIDE(int i;)
+    for(i = 0; i < fVBYTES(); i++) fSETQBIT(QdV,i,(i < (RtV & (fVBYTES()-1))) ? 1 : 0);
+})
+
+EXTINSN(V6_pred_scalar2v2, "Qd4=vsetq2(Rt32)",         ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),   "Set Vector Predicate ",
+{
+    fHIDE(int i;)
+    for(i = 0; i < fVBYTES(); i++) fSETQBIT(QdV,i,(i <= ((RtV-1) & (fVBYTES()-1))) ? 1 : 0);
+})
+
+
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8, shuffeqw, "Qd4.h=vshuffe(Qs4.w,Qt4.w)","Shrink Predicate", fSETQBIT(QdV,i, (i & 2) ? fGETQBIT(QsV,i-2) : fGETQBIT(QtV,i) ) )
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8, shuffeqh, "Qd4.b=vshuffe(Qs4.h,Qt4.h)","Shrink Predicate", fSETQBIT(QdV,i, (i & 1) ? fGETQBIT(QsV,i-1) : fGETQBIT(QtV,i) ) )
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8, pred_or, "Qd4=or(Qs4,Qt4)","Vector Predicate Or", fSETQBIT(QdV,i,fGETQBIT(QsV,i) || fGETQBIT(QtV,i) ) )
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8, pred_and, "Qd4=and(Qs4,Qt4)","Vector Predicate And", fSETQBIT(QdV,i,fGETQBIT(QsV,i) && fGETQBIT(QtV,i) ) )
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8, pred_xor, "Qd4=xor(Qs4,Qt4)","Vector Predicate Xor", fSETQBIT(QdV,i,fGETQBIT(QsV,i) ^ fGETQBIT(QtV,i) ) )
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8, pred_or_n, "Qd4=or(Qs4,!Qt4)","Vector Predicate Or with not", fSETQBIT(QdV,i,fGETQBIT(QsV,i) || !fGETQBIT(QtV,i) ) )
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8, pred_and_n, "Qd4=and(Qs4,!Qt4)","Vector Predicate And with not", fSETQBIT(QdV,i,fGETQBIT(QsV,i) && !fGETQBIT(QtV,i) ) )
+ITERATOR_INSN_ANY_SLOT(8, pred_not, "Qd4=not(Qs4)","Vector Predicate Not", fSETQBIT(QdV,i,!fGETQBIT(QsV,i) ) )
+
+
+
+EXTINSN(V6_vcmov,  "if (Ps4) Vd32=Vu32",  ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA),   "Conditional Mov",
+{
+    if (fLSBOLD(PsV)) {
+        fHIDE(int i;)
+        fVFOREACH(8, i) {
+            VdV.ub[i] = VuV.ub[i];
+        }
+    } else {CANCEL;}
+})
+
+EXTINSN(V6_vncmov,  "if (!Ps4) Vd32=Vu32",  ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA),   "Conditional Mov",
+{
+    if (fLSBOLDNOT(PsV)) {
+        fHIDE(int i;)
+        fVFOREACH(8, i) {
+            VdV.ub[i] = VuV.ub[i];
+        }
+    } else {CANCEL;}
+})
+
+EXTINSN(V6_vccombine,  "if (Ps4) Vdd32=vcombine(Vu32,Vv32)", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA_DV),   "Conditional Combine",
+{
+    if (fLSBOLD(PsV)) {
+        fHIDE(int i;)
+        fVFOREACH(8, i) {
+            VddV.v[0].ub[i] = VvV.ub[i];
+            VddV.v[1].ub[i] = VuV.ub[i];
+        }
+    } else {CANCEL;}
+})
+
+EXTINSN(V6_vnccombine,  "if (!Ps4) Vdd32=vcombine(Vu32,Vv32)", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA_DV),   "Conditional Combine",
+{
+    if (fLSBOLDNOT(PsV)) {
+        fHIDE(int i;)
+        fVFOREACH(8, i) {
+            VddV.v[0].ub[i] = VvV.ub[i];
+            VddV.v[1].ub[i] = VuV.ub[i];
+        }
+    } else {CANCEL;}
+})
+
+
+
+ITERATOR_INSN_ANY_SLOT(8,vmux,"Vd32=vmux(Qt4,Vu32,Vv32)",
+"Vector Select Element 8-bit",
+    VdV.ub[i] = fGETQBIT(QtV,i) ? VuV.ub[i] : VvV.ub[i])
+
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8,vswap,"Vdd32=vswap(Qt4,Vu32,Vv32)",
+"Vector Swap Element 8-bit",
+    VddV.v[0].ub[i] =  fGETQBIT(QtV,i) ? VuV.ub[i] : VvV.ub[i];
+	VddV.v[1].ub[i] = !fGETQBIT(QtV,i) ? VuV.ub[i] : VvV.ub[i])
+
+
+/***************************************************************************
+*
+*   MMVECTOR SORTING
+*
+****************************************************************************/
+
+#define MMVEC_SORT(TYPE,TYPE2,DESCR,ELEMENTSIZE,SRC)\
+ITERATOR_INSN2_ANY_SLOT(ELEMENTSIZE,vmax##TYPE, "Vd32=vmax" TYPE2 "(Vu32,Vv32)", "Vd32."#SRC"=vmax(Vu32."#SRC",Vv32."#SRC")", "Vector " DESCR " max", VdV.SRC[i] = (VuV.SRC[i] > VvV.SRC[i]) ? VuV.SRC[i] :  VvV.SRC[i])  \
+ITERATOR_INSN2_ANY_SLOT(ELEMENTSIZE,vmin##TYPE, "Vd32=vmin" TYPE2 "(Vu32,Vv32)", "Vd32."#SRC"=vmin(Vu32."#SRC",Vv32."#SRC")", "Vector " DESCR " min", VdV.SRC[i] = (VuV.SRC[i] < VvV.SRC[i]) ? VuV.SRC[i] :  VvV.SRC[i])
+
+MMVEC_SORT(b,"b", "signed byte",    8,  b)
+MMVEC_SORT(ub,"ub", "unsigned byte",    8,  ub)
+MMVEC_SORT(uh,"uh", "unsigned halfword",16, uh)
+MMVEC_SORT(h,   "h",    "halfword",         16, h)
+MMVEC_SORT(w,   "w",    "word",             32, w)
+
+
+
+
+
+
+
+
+
+/*************************************************************
+* SHUFFLES
+****************************************************************/
+
+ITERATOR_INSN2_ANY_SLOT(16,vsathub,"Vd32=vsathub(Vu32,Vv32)","Vd32.ub=vsat(Vu32.h,Vv32.h)",
+"Saturate and pack 32 halfwords to 32 unsigned bytes, and interleave them",
+    fSETBYTE(0, VdV.uh[i], fVSATUB(VvV.h[i]));
+    fSETBYTE(1, VdV.uh[i], fVSATUB(VuV.h[i])))
+
+ITERATOR_INSN2_ANY_SLOT(32,vsatwh,"Vd32=vsatwh(Vu32,Vv32)","Vd32.h=vsat(Vu32.w,Vv32.w)",
+"Saturate and pack 16 words to 16 halfwords, and interleave them",
+    fSETHALF(0, VdV.w[i], fVSATH(VvV.w[i]));
+    fSETHALF(1, VdV.w[i], fVSATH(VuV.w[i])))
+
+ITERATOR_INSN2_ANY_SLOT(32,vsatuwuh,"Vd32=vsatuwuh(Vu32,Vv32)","Vd32.uh=vsat(Vu32.uw,Vv32.uw)",
+"Saturate and pack 16 unsigned words to 16 unsigned halfwords, and interleave them",
+    fSETHALF(0, VdV.w[i], fVSATUH(VvV.uw[i]));
+    fSETHALF(1, VdV.w[i], fVSATUH(VuV.uw[i])))
+
+ITERATOR_INSN2_ANY_SLOT(16,vshuffeb,"Vd32=vshuffeb(Vu32,Vv32)","Vd32.b=vshuffe(Vu32.b,Vv32.b)",
+"Shuffle even bytes within a lane",
+    fSETBYTE(0, VdV.uh[i], fGETUBYTE(0, VvV.uh[i]));
+    fSETBYTE(1, VdV.uh[i], fGETUBYTE(0, VuV.uh[i])))
+
+ITERATOR_INSN2_ANY_SLOT(16,vshuffob,"Vd32=vshuffob(Vu32,Vv32)","Vd32.b=vshuffo(Vu32.b,Vv32.b)",
+"Shuffle odd bytes within a lane",
+    fSETBYTE(0, VdV.uh[i], fGETUBYTE(1, VvV.uh[i]));
+    fSETBYTE(1, VdV.uh[i], fGETUBYTE(1, VuV.uh[i])))
+
+ITERATOR_INSN2_ANY_SLOT(32,vshufeh,"Vd32=vshuffeh(Vu32,Vv32)","Vd32.h=vshuffe(Vu32.h,Vv32.h)",
+"Shuffle even halfwords within a lane",
+    fSETHALF(0, VdV.uw[i], fGETUHALF(0, VvV.uw[i]));
+    fSETHALF(1, VdV.uw[i], fGETUHALF(0, VuV.uw[i])))
+
+ITERATOR_INSN2_ANY_SLOT(32,vshufoh,"Vd32=vshuffoh(Vu32,Vv32)","Vd32.h=vshuffo(Vu32.h,Vv32.h)",
+"Shuffle odd halfwords within a lane",
+    fSETHALF(0, VdV.uw[i], fGETUHALF(1, VvV.uw[i]));
+    fSETHALF(1, VdV.uw[i], fGETUHALF(1, VuV.uw[i])))
+
+
+
+
+/**************************************************************************
+* Double Vector Shuffles
+**************************************************************************/
+
+EXTINSN(V6_vshuff, "vshuff(Vy32,Vx32,Rt32)",
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP_VS),
+"2x2->2x2 transpose, for multiple data sizes, in place",
+{
+    fHIDE(int offset;)
+    for (offset=1; offset<fVBYTES(); offset<<=1) {
+        if ( RtV & offset) {
+            fHIDE(int k;)
+            fVFOREACH(8, k) {
+                if (!( k & offset)) {
+                    fSWAPB(VyV.ub[k], VxV.ub[k+offset]);
+                }
+            }
+        }
+    }
+})
+
+EXTINSN(V6_vshuffvdd, "Vdd32=vshuff(Vu32,Vv32,Rt8)",
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP_VS),
+"2x2->2x2 transpose for multiple data sizes",
+{
+    fHIDE(int offset;)
+    VddV.v[0] = VvV;
+    VddV.v[1] = VuV;
+    for (offset=1; offset<fVBYTES(); offset<<=1) {
+        if ( RtV & offset) {
+            fHIDE(int k;)
+            fVFOREACH(8, k) {
+                if (!( k & offset)) {
+                    fSWAPB(VddV.v[1].ub[k], VddV.v[0].ub[k+offset]);
+                }
+            }
+        }
+    }
+})
+
+EXTINSN(V6_vdeal, "vdeal(Vy32,Vx32,Rt32)",
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP_VS),
+"Vector-vector deal (deinterleave), for multiple data sizes, in place",
+{
+    fHIDE(int offset;)
+    for (offset=fVBYTES()>>1; offset>0; offset>>=1) {
+        if ( RtV & offset) {
+            fHIDE(int k;)
+            fVFOREACH(8, k) {
+                if (!( k & offset)) {
+                    fSWAPB(VyV.ub[k], VxV.ub[k+offset]);
+                }
+            }
+        }
+    }
+})
+
+EXTINSN(V6_vdealvdd, "Vdd32=vdeal(Vu32,Vv32,Rt8)",
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP_VS),
+"Vector-vector deal (deinterleave), for multiple data sizes",
+{
+    fHIDE(int offset;)
+    VddV.v[0] = VvV;
+    VddV.v[1] = VuV;
+    for (offset=fVBYTES()>>1; offset>0; offset>>=1) {
+        if ( RtV & offset) {
+            fHIDE(int k;)
+            fVFOREACH(8, k) {
+                if (!( k & offset)) {
+                    fSWAPB(VddV.v[1].ub[k], VddV.v[0].ub[k+offset]);
+                }
+            }
+        }
+    }
+})
+
+/**************************************************************************/
+
+
+
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(32,vshufoeh,"Vdd32=vshuffoeh(Vu32,Vv32)","Vdd32.h=vshuffoe(Vu32.h,Vv32.h)",
+"Vector Shuffle half words",
+    fSETHALF(0, VddV.v[0].uw[i], fGETUHALF(0, VvV.uw[i]));
+    fSETHALF(1, VddV.v[0].uw[i], fGETUHALF(0, VuV.uw[i]));
+    fSETHALF(0, VddV.v[1].uw[i], fGETUHALF(1, VvV.uw[i]));
+    fSETHALF(1, VddV.v[1].uw[i], fGETUHALF(1, VuV.uw[i])))
+
+ITERATOR_INSN2_ANY_SLOT_DOUBLE_VEC(16,vshufoeb,"Vdd32=vshuffoeb(Vu32,Vv32)","Vdd32.b=vshuffoe(Vu32.b,Vv32.b)",
+"Vector Shuffle bytes",
+    fSETBYTE(0, VddV.v[0].uh[i], fGETUBYTE(0, VvV.uh[i]));
+    fSETBYTE(1, VddV.v[0].uh[i], fGETUBYTE(0, VuV.uh[i]));
+    fSETBYTE(0, VddV.v[1].uh[i], fGETUBYTE(1, VvV.uh[i]));
+    fSETBYTE(1, VddV.v[1].uh[i], fGETUBYTE(1, VuV.uh[i])))
+
+
+/***************************************************************
+* Deal
+***************************************************************/
+
+ITERATOR_INSN2_PERMUTE_SLOT(32, vdealh, "Vd32=vdealh(Vu32)", "Vd32.h=vdeal(Vu32.h)",
+"Deal Halfwords",
+    VdV.uh[i  ] = fGETUHALF(0, VuV.uw[i]);
+    VdV.uh[i+fVELEM(32)] = fGETUHALF(1, VuV.uw[i]))
+
+ITERATOR_INSN2_PERMUTE_SLOT(16, vdealb, "Vd32=vdealb(Vu32)", "Vd32.b=vdeal(Vu32.b)",
+"Deal Bytes",
+    VdV.ub[i   ] = fGETUBYTE(0, VuV.uh[i]);
+    VdV.ub[i+fVELEM(16)] = fGETUBYTE(1, VuV.uh[i]))
+
+ITERATOR_INSN2_PERMUTE_SLOT(32, vdealb4w,  "Vd32=vdealb4w(Vu32,Vv32)", "Vd32.b=vdeale(Vu32.b,Vv32.b)",
+"Deal Two Vectors Bytes",
+    VdV.ub[0+i ] = fGETUBYTE(0, VvV.uw[i]);
+    VdV.ub[fVELEM(32)+i ] = fGETUBYTE(2, VvV.uw[i]);
+    VdV.ub[2*fVELEM(32)+i] = fGETUBYTE(0, VuV.uw[i]);
+    VdV.ub[3*fVELEM(32)+i] = fGETUBYTE(2, VuV.uw[i]))
+
+/***************************************************************
+* shuffle
+***************************************************************/
+
+ITERATOR_INSN2_PERMUTE_SLOT(32, vshuffh, "Vd32=vshuffh(Vu32)", "Vd32.h=vshuff(Vu32.h)",
+"Shuffle Halfwords",
+    fSETHALF(0, VdV.uw[i], VuV.uh[i]);
+    fSETHALF(1, VdV.uw[i], VuV.uh[i+fVELEM(32)]))
+
+ITERATOR_INSN2_PERMUTE_SLOT(16, vshuffb, "Vd32=vshuffb(Vu32)", "Vd32.b=vshuff(Vu32.b)",
+"Shuffle Bytes",
+    fSETBYTE(0, VdV.uh[i], VuV.ub[i]);
+    fSETBYTE(1, VdV.uh[i], VuV.ub[i+fVELEM(16)]))
+
+
+
+
+
+/***********************************************************
+* INSERT AND EXTRACT
+*********************************************************/
+EXTINSN(V6_extractw, "Rd32=vextract(Vu32,Rs32)",
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VA,A_MEMLIKE,A_RESTRICT_SLOT0ONLY),
+"Extract an element from a vector to scalar",
+fHIDE(warn("RdN=%d VuN=%d RsN=%d RsV=0x%08x widx=%d",RdN,VuN,RsN,RsV,((RsV & (fVBYTES()-1)) >> 2));)
+RdV = VuV.uw[ (RsV & (fVBYTES()-1)) >> 2];
+fHIDE(warn("RdV=0x%08x",RdV);))
+
+EXTINSN(V6_vinsertwr, "Vx32.w=vinsert(Rt32)",
+ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VX),
+"Insert Word Scalar into Vector",
+VxV.uw[0] = RtV;)
+
+
+
+
+ITERATOR_INSN_MPY_SLOT_LATE(32,lvsplatw, "Vd32=vsplat(Rt32)", "Replicates scalar across words in vector", VdV.uw[i] = RtV)
+
+ITERATOR_INSN_MPY_SLOT_LATE(16,lvsplath, "Vd32.h=vsplat(Rt32)", "Replicates scalar across halves in vector", VdV.uh[i] = RtV)
+
+ITERATOR_INSN_MPY_SLOT_LATE(8,lvsplatb, "Vd32.b=vsplat(Rt32)", "Replicates scalar across bytes in vector", VdV.ub[i] = RtV)
+
+
+ITERATOR_INSN_ANY_SLOT(32,vassign,"Vd32=Vu32","Copy a vector",VdV.w[i]=VuV.w[i])
+
+
+ITERATOR_INSN_ANY_SLOT_DOUBLE_VEC(8,vcombine,"Vdd32=vcombine(Vu32,Vv32)",
+"Vector assign, Any two to Vector Pair",
+    VddV.v[0].ub[i] = VvV.ub[i];
+    VddV.v[1].ub[i] = VuV.ub[i])
+
+
+
+///////////////////////////////////////////////////////////////////////////
+
+
+/*********************************************************
+* GENERAL PERMUTE NETWORKS
+*********************************************************/
+
+
+EXTINSN(V6_vdelta, "Vd32=vdelta(Vu32,Vv32)",    ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),
+"Reverse Benes Butterfly network ",
+{
+    fHIDE(int offset;)
+    fHIDE(int k;)
+    fHIDE(mmvector_t tmp;)
+    tmp = VuV;
+    for (offset=fVBYTES(); (offset>>=1)>0; ) {
+        for (k = 0; k<fVBYTES(); k++) {
+            VdV.ub[k] = (VvV.ub[k]&offset) ? tmp.ub[k^offset] : tmp.ub[k];
+        }
+        for (k = 0; k<fVBYTES(); k++) {
+            tmp.ub[k] = VdV.ub[k];
+        }
+    }
+})
+
+
+EXTINSN(V6_vrdelta, "Vd32=vrdelta(Vu32,Vv32)",  ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VP),
+"Forward Benes Butterfly network ",
+{
+	fHIDE(int offset;)
+    fHIDE(int k;)
+    fHIDE(mmvector_t tmp;)
+    tmp = VuV;
+    for (offset=1; offset<fVBYTES(); offset<<=1){
+        for (k = 0; k<fVBYTES(); k++) {
+            VdV.ub[k] = (VvV.ub[k]&offset) ? tmp.ub[k^offset] : tmp.ub[k];
+        }
+        for (k = 0; k<fVBYTES(); k++) {
+            tmp.ub[k] = VdV.ub[k];
+        }
+    }
+})
+
+
+
+
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vcl0w,"Vd32=vcl0w(Vu32)","Vd32.uw=vcl0(Vu32.uw)",         "Count Leading Zeros in Word",     VdV.uw[i]=fCL1_4(~VuV.uw[i]))
+ITERATOR_INSN2_SHIFT_SLOT(16,vcl0h,"Vd32=vcl0h(Vu32)","Vd32.uh=vcl0(Vu32.uh)",         "Count Leading Zeros in Halfword",    VdV.uh[i]=fCL1_2(~VuV.uh[i]))
+
+ITERATOR_INSN2_SHIFT_SLOT(32,vnormamtw,"Vd32=vnormamtw(Vu32)","Vd32.w=vnormamt(Vu32.w)","Norm Amount Word",
+VdV.w[i]=fMAX(fCL1_4(~VuV.w[i]),fCL1_4(VuV.w[i]))-1; fHIDE(IV1DEAD();))
+ITERATOR_INSN2_SHIFT_SLOT(16,vnormamth,"Vd32=vnormamth(Vu32)","Vd32.h=vnormamt(Vu32.h)","Norm Amount Halfword",
+VdV.h[i]=fMAX(fCL1_2(~VuV.h[i]),fCL1_2(VuV.h[i]))-1; fHIDE(IV1DEAD();))
+
+ITERATOR_INSN_SHIFT_SLOT_VV_LATE(32,vaddclbw,"Vd32.w=vadd(vclb(Vu32.w),Vv32.w)",
+"Count leading bits and add",
+VdV.w[i] = fMAX(fCL1_4(~VuV.w[i]),fCL1_4(VuV.w[i])) + VvV.w[i])
+
+ITERATOR_INSN_SHIFT_SLOT_VV_LATE(16,vaddclbh,"Vd32.h=vadd(vclb(Vu32.h),Vv32.h)",
+"Count leading bits and add",
+VdV.h[i] = fMAX(fCL1_2(~VuV.h[i]),fCL1_2(VuV.h[i])) + VvV.h[i])
+
+
+ITERATOR_INSN2_SHIFT_SLOT(16,vpopcounth,"Vd32=vpopcounth(Vu32)","Vd32.h=vpopcount(Vu32.h)",   "Count Ones in Halfword",  VdV.uh[i]=fCOUNTONES_2(VuV.uh[i]))
+
+
+#define fHIST(INPUTVEC) \
+	fUARCH_NOTE_PUMP_4X(); \
+	fHIDE(int lane;) \
+	fHIDE(mmvector_t tmp;) \
+	fVFOREACH(128, lane) { \
+		for (fHIDE(int )i=0; i<128/8; ++i) { \
+			unsigned char value = INPUTVEC.ub[(128/8)*lane+i]; \
+			unsigned char regno = value>>3; \
+			unsigned char element = value & 7; \
+			READ_EXT_VREG(regno,tmp,0); \
+			tmp.uh[(128/16)*lane+(element)]++; \
+			WRITE_EXT_VREG(regno,tmp,EXT_NEW); \
+		} \
+	}
+
+#define fHISTQ(INPUTVEC,QVAL) \
+	fUARCH_NOTE_PUMP_4X(); \
+	fHIDE(int lane;) \
+	fHIDE(mmvector_t tmp;) \
+	fVFOREACH(128, lane) { \
+		for (fHIDE(int )i=0; i<128/8; ++i) { \
+			unsigned char value = INPUTVEC.ub[(128/8)*lane+i]; \
+			unsigned char regno = value>>3; \
+			unsigned char element = value & 7; \
+			READ_EXT_VREG(regno,tmp,0); \
+			if (fGETQBIT(QVAL,128/8*lane+i)) tmp.uh[(128/16)*lane+(element)]++; \
+			WRITE_EXT_VREG(regno,tmp,EXT_NEW); \
+		} \
+	}
+
+
+
+EXTINSN(V6_vhist, "vhist",ATTRIBS(A_EXTENSION,A_CVI,A_CVI_4SLOT), "vhist instruction",{ fHIDE(mmvector_t inputVec;) inputVec=fTMPVDATA(); fHIST(inputVec); })
+EXTINSN(V6_vhistq, "vhist(Qv4)",ATTRIBS(A_EXTENSION,A_CVI,A_CVI_4SLOT), "vhist instruction",{ fHIDE(mmvector_t inputVec;) inputVec=fTMPVDATA(); fHISTQ(inputVec,QvV); })
+
+#undef fHIST
+#undef fHISTQ
+
+
+/* **** WEIGHTED HISTOGRAM **** */
+
+
+#if 1
+#define WHIST(EL,MASK,BSHIFT,COND,SATF) \
+	fHIDE(unsigned int) bucket = fGETUBYTE(0,input.h[i]); \
+	fHIDE(unsigned int) weight = fGETUBYTE(1,input.h[i]); \
+	fHIDE(unsigned int) vindex = (bucket >> 3) & 0x1F; \
+	fHIDE(unsigned int) elindex = ((i>>BSHIFT) & (~MASK)) | ((bucket>>BSHIFT) & MASK); \
+	fHIDE(mmvector_t tmp;) \
+	READ_EXT_VREG(vindex,tmp,0); \
+	COND tmp.EL[elindex] = SATF(tmp.EL[elindex] + weight); \
+	WRITE_EXT_VREG(vindex,tmp,EXT_NEW); \
+	fUARCH_NOTE_PUMP_2X();
+
+ITERATOR_INSN_VHISTLIKE(16,vwhist256,"vwhist256","vector weighted histogram halfword counters", WHIST(uh,7,0,,))
+ITERATOR_INSN_VHISTLIKE(16,vwhist256q,"vwhist256(Qv4)","vector weighted histogram halfword counters", WHIST(uh,7,0,if (fGETQBIT(QvV,2*i)),))
+ITERATOR_INSN_VHISTLIKE(16,vwhist256_sat,"vwhist256:sat","vector weighted histogram halfword counters", WHIST(uh,7,0,,fVSATUH))
+ITERATOR_INSN_VHISTLIKE(16,vwhist256q_sat,"vwhist256(Qv4):sat","vector weighted histogram halfword counters", WHIST(uh,7,0,if (fGETQBIT(QvV,2*i)),fVSATUH))
+ITERATOR_INSN_VHISTLIKE(16,vwhist128,"vwhist128","vector weighted histogram word counters", WHIST(uw,3,1,,))
+ITERATOR_INSN_VHISTLIKE(16,vwhist128q,"vwhist128(Qv4)","vector weighted histogram word counters", WHIST(uw,3,1,if (fGETQBIT(QvV,2*i)),))
+ITERATOR_INSN_VHISTLIKE(16,vwhist128m,"vwhist128(#u1)","vector weighted histogram word counters", WHIST(uw,3,1,if ((bucket & 1) == uiV),))
+ITERATOR_INSN_VHISTLIKE(16,vwhist128qm,"vwhist128(Qv4,#u1)","vector weighted histogram word counters", WHIST(uw,3,1,if (((bucket & 1) == uiV) && fGETQBIT(QvV,2*i)),))
+
+
+#endif
+
+
+
+/* ******   lookup table instructions                          ***********  */
+
+/* Use low bits from idx to choose next-bigger elements from vector, then use LSB from idx to choose odd or even element */
+
+ITERATOR_INSN_PERMUTE_SLOT(8,vlutvvb,"Vd32.b=vlut32(Vu32.b,Vv32.b,Rt8)","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int matchval;) fHIDE(int oddhalf;)
+matchval = RtV & 0x7;
+oddhalf = (RtV >> (fVECLOGSIZE()-6)) & 0x1;
+idx = VuV.ub[i];
+VdV.b[i] = ((idx & 0xE0) == (matchval << 5)) ? fGETBYTE(oddhalf,VvV.h[idx % fVELEM(16)]) : 0)
+
+
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(8,vlutvvb_oracc,"Vx32.b|=vlut32(Vu32.b,Vv32.b,Rt8)","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int matchval;) fHIDE(int oddhalf;)
+matchval = RtV & 0x7;
+oddhalf = (RtV >> (fVECLOGSIZE()-6)) & 0x1;
+idx = VuV.ub[i];
+VxV.b[i] |= ((idx & 0xE0) == (matchval << 5)) ? fGETBYTE(oddhalf,VvV.h[idx % fVELEM(16)]) : 0)
+
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(16,vlutvwh,"Vdd32.h=vlut16(Vu32.b,Vv32.h,Rt8)","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int matchval;) fHIDE(int oddhalf;)
+matchval = RtV & 0xF;
+oddhalf = (RtV >> (fVECLOGSIZE()-6)) & 0x1;
+idx = fGETUBYTE(0,VuV.uh[i]);
+VddV.v[0].h[i] = ((idx & 0xF0) == (matchval << 4)) ? fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]) : 0;
+idx = fGETUBYTE(1,VuV.uh[i]);
+VddV.v[1].h[i] = ((idx & 0xF0) == (matchval << 4)) ? fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]) : 0)
+
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(16,vlutvwh_oracc,"Vxx32.h|=vlut16(Vu32.b,Vv32.h,Rt8)","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int matchval;) fHIDE(int oddhalf;)
+matchval = fGETUBYTE(0,RtV) & 0xF;
+oddhalf = (RtV >> (fVECLOGSIZE()-6)) & 0x1;
+idx = fGETUBYTE(0,VuV.uh[i]);
+VxxV.v[0].h[i] |= ((idx & 0xF0) == (matchval << 4)) ? fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]) : 0;
+idx = fGETUBYTE(1,VuV.uh[i]);
+VxxV.v[1].h[i] |= ((idx & 0xF0) == (matchval << 4)) ? fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]) : 0)
+
+ITERATOR_INSN_PERMUTE_SLOT(8,vlutvvbi,"Vd32.b=vlut32(Vu32.b,Vv32.b,#u3)","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int matchval;) fHIDE(int oddhalf;)
+matchval = uiV & 0x7;
+oddhalf = (uiV >> (fVECLOGSIZE()-6)) & 0x1;
+idx = VuV.ub[i];
+VdV.b[i] = ((idx & 0xE0) == (matchval << 5)) ? fGETBYTE(oddhalf,VvV.h[idx % fVELEM(16)]) : 0)
+
+
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(8,vlutvvb_oracci,"Vx32.b|=vlut32(Vu32.b,Vv32.b,#u3)","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int matchval;) fHIDE(int oddhalf;)
+matchval = uiV & 0x7;
+oddhalf = (uiV >> (fVECLOGSIZE()-6)) & 0x1;
+idx = VuV.ub[i];
+VxV.b[i] |= ((idx & 0xE0) == (matchval << 5)) ? fGETBYTE(oddhalf,VvV.h[idx % fVELEM(16)]) : 0)
+
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(16,vlutvwhi,"Vdd32.h=vlut16(Vu32.b,Vv32.h,#u3)","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int matchval;) fHIDE(int oddhalf;)
+matchval = uiV & 0xF;
+oddhalf = (uiV >> (fVECLOGSIZE()-6)) & 0x1;
+idx = fGETUBYTE(0,VuV.uh[i]);
+VddV.v[0].h[i] = ((idx & 0xF0) == (matchval << 4)) ? fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]) : 0;
+idx = fGETUBYTE(1,VuV.uh[i]);
+VddV.v[1].h[i] = ((idx & 0xF0) == (matchval << 4)) ? fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]) : 0)
+
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(16,vlutvwh_oracci,"Vxx32.h|=vlut16(Vu32.b,Vv32.h,#u3)","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int matchval;) fHIDE(int oddhalf;)
+matchval = uiV & 0xF;
+oddhalf = (uiV >> (fVECLOGSIZE()-6)) & 0x1;
+idx = fGETUBYTE(0,VuV.uh[i]);
+VxxV.v[0].h[i] |= ((idx & 0xF0) == (matchval << 4)) ? fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]) : 0;
+idx = fGETUBYTE(1,VuV.uh[i]);
+VxxV.v[1].h[i] |= ((idx & 0xF0) == (matchval << 4)) ? fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]) : 0)
+
+ITERATOR_INSN_PERMUTE_SLOT(8,vlutvvb_nm,"Vd32.b=vlut32(Vu32.b,Vv32.b,Rt8):nomatch","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int oddhalf;) fHIDE(int matchval;)
+    matchval = RtV & 0x7;
+    oddhalf = (RtV >> (fVECLOGSIZE()-6)) & 0x1;
+    idx = VuV.ub[i];
+    idx = (idx&0x1F) | (matchval<<5);
+    VdV.b[i] = fGETBYTE(oddhalf,VvV.h[idx % fVELEM(16)]))
+
+ITERATOR_INSN_PERMUTE_SLOT_DOUBLE_VEC(16,vlutvwh_nm,"Vdd32.h=vlut16(Vu32.b,Vv32.h,Rt8):nomatch","vector-vector table lookup",
+fHIDE(unsigned int idx;) fHIDE(int oddhalf;) fHIDE(int matchval;)
+    matchval = RtV & 0xF;
+    oddhalf = (RtV >> (fVECLOGSIZE()-6)) & 0x1;
+    idx = fGETUBYTE(0,VuV.uh[i]);
+    idx = (idx&0x0F) | (matchval<<4);
+    VddV.v[0].h[i] = fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]);
+    idx = fGETUBYTE(1,VuV.uh[i]);
+    idx = (idx&0x0F) | (matchval<<4);
+    VddV.v[1].h[i] = fGETHALF(oddhalf,VvV.w[idx % fVELEM(32)]))
+
+
+
+
+/******************************************************************************
+NON LINEAR - V65
+ ******************************************************************************/
+
+ITERATOR_INSN_SLOT2_DOUBLE_VEC(16,vmpahhsat,"Vx32.h=vmpa(Vx32.h,Vu32.h,Rtt32.h):sat","piecewise linear approximation",
+    VxV.h[i]= fVSATH( ( ( fMPY16SS(VxV.h[i],VuV.h[i])<<1) + (fGETHALF(( (VuV.h[i]>>14)&0x3), RttV )<<15))>>16))
+
+
+ITERATOR_INSN_SLOT2_DOUBLE_VEC(16,vmpauhuhsat,"Vx32.h=vmpa(Vx32.h,Vu32.uh,Rtt32.uh):sat","piecewise linear approximation",
+    VxV.h[i]= fVSATH( (  fMPY16SU(VxV.h[i],VuV.uh[i]) + (fGETUHALF(((VuV.uh[i]>>14)&0x3), RttV )<<15))>>16))
+
+ITERATOR_INSN_SLOT2_DOUBLE_VEC(16,vmpsuhuhsat,"Vx32.h=vmps(Vx32.h,Vu32.uh,Rtt32.uh):sat","piecewise linear approximation",
+    VxV.h[i]= fVSATH( (  fMPY16SU(VxV.h[i],VuV.uh[i]) - (fGETUHALF(((VuV.uh[i]>>14)&0x3), RttV )<<15))>>16))
+
+
+ITERATOR_INSN_SLOT2_DOUBLE_VEC(16,vlut4,"Vd32.h=vlut4(Vu32.uh,Rtt32.h)","4 entry lookup table",
+    VdV.h[i]= fGETHALF(  ((VuV.h[i]>>14)&0x3), RttV ))
+
+
+
+/******************************************************************************
+V65
+ ******************************************************************************/
+
+ITERATOR_INSN_MPY_SLOT_NOV1(32,vmpyuhe,"Vd32.uw=vmpye(Vu32.uh,Rt32.uh)",
+"Vector even halfword unsigned multiply by scalar",
+    VdV.uw[i] = fMPY16UU(fGETUHALF(0, VuV.uw[i]),fGETUHALF(0,RtV)))
+
+
+ITERATOR_INSN_MPY_SLOT_NOV1(32,vmpyuhe_acc,"Vx32.uw+=vmpye(Vu32.uh,Rt32.uh)",
+"Vector even halfword unsigned multiply by scalar",
+    VxV.uw[i] += fMPY16UU(fGETUHALF(0, VuV.uw[i]),fGETUHALF(0,RtV)))
+
+
+
+
+EXTINSN(V6_vgathermw,  "vtmp.w=vgather(Rt32,Mu2,Vv32.w).w", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_GATHER,A_CVI_VA,A_CVI_VM,A_CVI_TMP_DST,A_MEMLIKE), "Gather Words",
+{
+    fHIDE(int i;)
+	fHIDE(int element_size = 4;)
+    fHIDE(fGATHER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+        EA = RtV+VvV.uw[i];
+        fVLOG_VTCM_GATHER_WORD(EA, VvV.uw[i], i,MuV);
+    }
+    fGATHER_FINISH()
+})
+EXTINSN(V6_vgathermh,  "vtmp.h=vgather(Rt32,Mu2,Vv32.h).h", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_GATHER,A_CVI_VA,A_CVI_VM,A_CVI_TMP_DST,A_MEMLIKE), "Gather halfwords",
+{
+    fHIDE(int i;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fGATHER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(16, i) {
+        EA = RtV+VvV.uh[i];
+        fVLOG_VTCM_GATHER_HALFWORD(EA, VvV.uh[i], i,MuV);
+    }
+    fGATHER_FINISH()
+})
+
+
+
+EXTINSN(V6_vgathermhw,  "vtmp.h=vgather(Rt32,Mu2,Vvv32.w).h", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_GATHER,A_CVI_VA_DV,A_CVI_VM,A_CVI_TMP_DST,A_MEMLIKE), "Gather halfwords",
+{
+    fHIDE(int i;)
+    fHIDE(int j;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fGATHER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+       for(j = 0; j < 2; j++) {
+            EA = RtV+VvvV.v[j].uw[i];
+            fVLOG_VTCM_GATHER_HALFWORD_DV(EA, VvvV.v[j].uw[i], (2*i+j),i,j,MuV);
+        }
+    }
+     fGATHER_FINISH()
+})
+
+
+EXTINSN(V6_vgathermwq,  "if (Qs4) vtmp.w=vgather(Rt32,Mu2,Vv32.w).w", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_GATHER,A_CVI_VA,A_CVI_VM,A_CVI_TMP_DST,A_MEMLIKE), "Gather Words",
+{
+    fHIDE(int i;)
+	fHIDE(int element_size = 4;)
+    fHIDE(fGATHER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+        EA = RtV+VvV.uw[i];
+        fVLOG_VTCM_GATHER_WORDQ(EA, VvV.uw[i], i,QsV,MuV);
+    }
+    fGATHER_FINISH()
+})
+EXTINSN(V6_vgathermhq,  "if (Qs4) vtmp.h=vgather(Rt32,Mu2,Vv32.h).h", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_GATHER,A_CVI_VA,A_CVI_VM,A_CVI_TMP_DST,A_MEMLIKE), "Gather halfwords",
+{
+    fHIDE(int i;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fGATHER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(16, i) {
+        EA = RtV+VvV.uh[i];
+        fVLOG_VTCM_GATHER_HALFWORDQ(EA, VvV.uh[i], i,QsV,MuV);
+    }
+    fGATHER_FINISH()
+})
+
+
+
+EXTINSN(V6_vgathermhwq,  "if (Qs4) vtmp.h=vgather(Rt32,Mu2,Vvv32.w).h", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_GATHER,A_CVI_VA_DV,A_CVI_VM,A_CVI_TMP_DST,A_MEMLIKE), "Gather halfwords",
+{
+    fHIDE(int i;)
+    fHIDE(int j;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fGATHER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+       for(j = 0; j < 2; j++) {
+            EA = RtV+VvvV.v[j].uw[i];
+            fVLOG_VTCM_GATHER_HALFWORDQ_DV(EA, VvvV.v[j].uw[i], (2*i+j),i,j,QsV,MuV);
+       }
+    }
+    fGATHER_FINISH()
+})
+
+
+
+EXTINSN(V6_vscattermw , "vscatter(Rt32,Mu2,Vv32.w).w=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA,A_CVI_VM,A_MEMLIKE), "Scatter Words",
+{
+    fHIDE(int i;)
+	fHIDE(int element_size = 4;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+        EA = RtV+VvV.uw[i];
+        fVLOG_VTCM_WORD(EA, VvV.uw[i], VwV,i,MuV);
+    }
+    fSCATTER_FINISH(0)
+})
+
+
+
+EXTINSN(V6_vscattermh , "vscatter(Rt32,Mu2,Vv32.h).h=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA,A_CVI_VM,A_MEMLIKE), "Scatter halfWords",
+{
+    fHIDE(int i;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(16, i) {
+        EA = RtV+VvV.uh[i];
+        fVLOG_VTCM_HALFWORD(EA,VvV.uh[i],VwV,i,MuV);
+    }
+    fSCATTER_FINISH(0)
+})
+
+
+EXTINSN(V6_vscattermw_add,  "vscatter(Rt32,Mu2,Vv32.w).w+=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA,A_CVI_VM,A_MEMLIKE), "Scatter Words-Add",
+{
+    fHIDE(int i;)
+    fHIDE(int ALIGNMENT=4;)
+	fHIDE(int element_size = 4;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+        EA = (RtV+fVALIGN(VvV.uw[i],ALIGNMENT));
+        fVLOG_VTCM_WORD_INCREMENT(EA,VvV.uw[i],VwV,i,ALIGNMENT,MuV);
+    }
+    fHIDE(fLOG_SCATTER_OP(4);)
+    fSCATTER_FINISH(1)
+})
+
+EXTINSN(V6_vscattermh_add,  "vscatter(Rt32,Mu2,Vv32.h).h+=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA,A_CVI_VM,A_MEMLIKE), "Scatter halfword-Add",
+{
+    fHIDE(int i;)
+    fHIDE(int ALIGNMENT=2;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(16, i) {
+        EA = (RtV+fVALIGN(VvV.uh[i],ALIGNMENT));
+        fVLOG_VTCM_HALFWORD_INCREMENT(EA,VvV.uh[i],VwV,i,ALIGNMENT,MuV);
+    }
+    fHIDE(fLOG_SCATTER_OP(2);)
+    fSCATTER_FINISH(1)
+})
+
+
+EXTINSN(V6_vscattermwq,  "if (Qs4) vscatter(Rt32,Mu2,Vv32.w).w=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA,A_CVI_VM,A_MEMLIKE), "Scatter Words conditional",
+{
+    fHIDE(int i;)
+	fHIDE(int element_size = 4;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+        EA = RtV+VvV.uw[i];
+        fVLOG_VTCM_WORDQ(EA,VvV.uw[i], VwV,i,QsV,MuV);
+    }
+    fSCATTER_FINISH(0)
+})
+
+EXTINSN(V6_vscattermhq,  "if (Qs4) vscatter(Rt32,Mu2,Vv32.h).h=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA,A_CVI_VM,A_MEMLIKE), "Scatter HalfWords conditional",
+{
+    fHIDE(int i;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(16, i) {
+        EA = RtV+VvV.uh[i];
+        fVLOG_VTCM_HALFWORDQ(EA,VvV.uh[i],VwV,i,QsV,MuV);
+    }
+    fSCATTER_FINISH(0)
+})
+
+
+
+
+EXTINSN(V6_vscattermhw , "vscatter(Rt32,Mu2,Vvv32.w).h=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA_DV,A_CVI_VM,A_MEMLIKE), "Scatter halfwords",
+{
+    fHIDE(int i;)
+    fHIDE(int j;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+        for(j = 0; j < 2; j++) {
+            EA = RtV+VvvV.v[j].uw[i];
+            fVLOG_VTCM_HALFWORD_DV(EA,VvvV.v[j].uw[i],VwV,(2*i+j),i,j,MuV);
+        }
+    }
+    fSCATTER_FINISH(0)
+})
+
+
+
+EXTINSN(V6_vscattermhwq,  "if (Qs4) vscatter(Rt32,Mu2,Vvv32.w).h=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA_DV,A_CVI_VM,A_MEMLIKE), "Scatter halfwords conditional",
+{
+    fHIDE(int i;)
+    fHIDE(int j;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+        for(j = 0; j < 2; j++) {
+            EA = RtV+VvvV.v[j].uw[i];
+            fVLOG_VTCM_HALFWORDQ_DV(EA,VvvV.v[j].uw[i],VwV,(2*i+j),QsV,i,j,MuV);
+        }
+    }
+    fSCATTER_FINISH(0)
+})
+
+EXTINSN(V6_vscattermhw_add,  "vscatter(Rt32,Mu2,Vvv32.w).h+=Vw32", ATTRIBS(A_EXTENSION,A_CVI,A_CVI_SCATTER,A_CVI_VA_DV,A_CVI_VM,A_MEMLIKE), "Scatter halfwords-add",
+{
+    fHIDE(int i;)
+    fHIDE(int j;)
+    fHIDE(int ALIGNMENT=2;)
+	fHIDE(int element_size = 2;)
+    fHIDE(fSCATTER_INIT( RtV, MuV, element_size);)
+    fVLASTBYTE(MuV, element_size);
+    fVALIGN(RtV, element_size);
+    fVFOREACH(32, i) {
+        for(j = 0; j < 2; j++) {
+             EA = RtV + fVALIGN(VvvV.v[j].uw[i],ALIGNMENT);
+             fVLOG_VTCM_HALFWORD_INCREMENT_DV(EA,VvvV.v[j].uw[i],VwV,(2*i+j),i,j,ALIGNMENT,MuV);
+        }
+    }
+    fHIDE(fLOG_SCATTER_OP(2);)
+    fSCATTER_FINISH(1)
+})
+
+EXTINSN(V6_vprefixqb,"Vd32.b=prefixsum(Qv4)",   ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VS),  "parallel prefix sum of Q into byte",
+{
+    fHIDE(int i;)
+    fHIDE(size1u_t acc = 0;)
+    fVFOREACH(8, i) {
+        acc += fGETQBIT(QvV,i);
+        VdV.ub[i] = acc;
+    }
+    } )
+EXTINSN(V6_vprefixqh,"Vd32.h=prefixsum(Qv4)",   ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VS),  "parallel prefix sum of Q into halfwords",
+{
+    fHIDE(int i;)
+    fHIDE(size2u_t acc = 0;)
+    fVFOREACH(16, i) {
+        acc += fGETQBIT(QvV,i*2+0);
+        acc += fGETQBIT(QvV,i*2+1);
+        VdV.uh[i] = acc;
+    }
+    } )
+EXTINSN(V6_vprefixqw,"Vd32.w=prefixsum(Qv4)",   ATTRIBS(A_EXTENSION,A_CVI,A_CVI_VS),  "parallel prefix sum of Q into words",
+{
+    fHIDE(int i;)
+    fHIDE(size4u_t acc = 0;)
+    fVFOREACH(32, i) {
+        acc += fGETQBIT(QvV,i*4+0);
+        acc += fGETQBIT(QvV,i*4+1);
+        acc += fGETQBIT(QvV,i*4+2);
+        acc += fGETQBIT(QvV,i*4+3);
+        VdV.uw[i] = acc;
+    }
+    } )
+
+
+
+
+
+/******************************************************************************
+ DEBUG Vector/Register Printing
+ ******************************************************************************/
+
+#define PRINT_VU(TYPE, TYPE2, COUNT)\
+    int i;  \
+    size4u_t vec_len = fVBYTES();\
+    fprintf(stdout,"V%2d: ",VuN);  \
+    for (i=0;i<vec_len>>COUNT;i++) {         \
+        fprintf(stdout,TYPE2 " ", VuV.TYPE[i]); \
+    };  \
+    fprintf(stdout,"\\n");  \
+	fflush(stdout);\
+
+#undef ATTR_VMEM
+#undef ATTR_VMEMU
+#undef ATTR_VMEM_NT
+
+#endif /* NO_MMVEC */
+
+#ifdef __SELF_DEF_EXTINSN
+#undef EXTINSN
+#undef __SELF_DEF_EXTINSN
+#endif
-- 
2.7.4

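Reader's note on the vrdelta semantics imported above: the staged
butterfly can be sketched in plain C. This is an illustrative sketch
only, not code from the patch. It assumes a hypothetical 8-byte vector
(SKETCH_VBYTES) instead of the 128-byte HVX vector, and the name
sketch_vrdelta is made up for this example. Each stage selects, per
byte, either the byte in place or its partner at index k ^ offset,
depending on the matching bit of the control byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical small vector length; HVX vectors are 128 bytes. */
#define SKETCH_VBYTES 8

/*
 * Staged butterfly with offsets 1, 2, 4, ... (the loop order used by
 * the vrdelta semantics). ctrl[k] bit 'offset' selects whether output
 * byte k takes its partner at k ^ offset from the previous stage.
 */
static void sketch_vrdelta(uint8_t *dst, const uint8_t *src,
                           const uint8_t *ctrl)
{
    uint8_t tmp[SKETCH_VBYTES];

    /* The k ^ offset pairing requires a power-of-two length. */
    assert((SKETCH_VBYTES & (SKETCH_VBYTES - 1)) == 0);

    memcpy(tmp, src, SKETCH_VBYTES);
    memcpy(dst, src, SKETCH_VBYTES);
    for (int offset = 1; offset < SKETCH_VBYTES; offset <<= 1) {
        for (int k = 0; k < SKETCH_VBYTES; k++) {
            dst[k] = (ctrl[k] & offset) ? tmp[k ^ offset] : tmp[k];
        }
        /* The result of this stage feeds the next one. */
        memcpy(tmp, dst, SKETCH_VBYTES);
    }
}
```

With every control byte set to 0 the input passes through unchanged;
with every control byte set to 7 all three stages swap, so byte k comes
from index k ^ 7 and the 8-byte vector is reversed. vdelta uses the same
stage body but walks the offsets from largest to smallest.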


* [PATCH v4 25/30] Hexagon HVX (target/hexagon) instruction decoding
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (23 preceding siblings ...)
  2021-10-12 10:11 ` [PATCH v4 24/30] Hexagon HVX (target/hexagon) import semantics Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-12 10:11 ` [PATCH v4 26/30] Hexagon HVX (target/hexagon) import instruction encodings Taylor Simpson
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Add new file to target/hexagon/meson.build

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
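Reader's note (not part of the commit): the N-field handling in
check_new_value() can be sketched as a standalone function. This is a
simplified illustration under stated assumptions: SketchInsn and
find_producer are hypothetical names, and the is_cvi flag stands in for
GET_ATTRIB(opcode, A_CVI). The sketch returns -1 where the patch
asserts on a badly encoded field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an instruction and its A_CVI attribute. */
typedef struct {
    bool is_cvi;
} SketchInsn;

/*
 * Walk backwards from the consumer at use_idx, counting only CVI
 * instructions, and return the index of the producer selected by the
 * N-field. Returns -1 if the field points outside the packet.
 */
static int find_producer(const SketchInsn *insns, int use_idx, int n_field)
{
    int def_off = n_field >> 1;    /* drop the odd/even register bit */

    assert(use_idx >= 0);
    for (int j = use_idx - 1; j >= 0 && def_off > 0; j--) {
        if (!insns[j].is_cvi) {
            continue;              /* non-CVI instructions are skipped */
        }
        if (--def_off == 0) {
            return j;              /* found the producing instruction */
        }
    }
    return -1;
}
```

For a packet laid out as { CVI, non-CVI, CVI, store } with the store at
index 3, an N-field of 2 (offset 1) selects index 2, and an N-field of
5 (offset 2, odd register) selects index 0 because the non-CVI
instruction at index 1 is not counted.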
 target/hexagon/mmvec/decode_ext_mmvec.h |  24 ++++
 target/hexagon/decode.c                 |  24 +++-
 target/hexagon/mmvec/decode_ext_mmvec.c | 236 ++++++++++++++++++++++++++++++++
 target/hexagon/meson.build              |   1 +
 4 files changed, 283 insertions(+), 2 deletions(-)
 create mode 100644 target/hexagon/mmvec/decode_ext_mmvec.h
 create mode 100644 target/hexagon/mmvec/decode_ext_mmvec.c

diff --git a/target/hexagon/mmvec/decode_ext_mmvec.h b/target/hexagon/mmvec/decode_ext_mmvec.h
new file mode 100644
index 0000000..3664b68
--- /dev/null
+++ b/target/hexagon/mmvec/decode_ext_mmvec.h
@@ -0,0 +1,24 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HEXAGON_DECODE_EXT_MMVEC_H
+#define HEXAGON_DECODE_EXT_MMVEC_H
+
+void mmvec_ext_decode_checks(Packet *pkt, bool disas_only);
+SlotMask mmvec_ext_decode_find_iclass_slots(int opcode);
+
+#endif
diff --git a/target/hexagon/decode.c b/target/hexagon/decode.c
index d424245..653bfd7 100644
--- a/target/hexagon/decode.c
+++ b/target/hexagon/decode.c
@@ -22,6 +22,7 @@
 #include "decode.h"
 #include "insn.h"
 #include "printinsn.h"
+#include "mmvec/decode_ext_mmvec.h"
 
 #define fZXTN(N, M, VAL) ((VAL) & ((1LL << (N)) - 1))
 
@@ -566,8 +567,12 @@ static void decode_remove_extenders(Packet *packet)
 
 static SlotMask get_valid_slots(const Packet *pkt, unsigned int slot)
 {
-    return find_iclass_slots(pkt->insn[slot].opcode,
-                             pkt->insn[slot].iclass);
+    if (GET_ATTRIB(pkt->insn[slot].opcode, A_EXTENSION)) {
+        return mmvec_ext_decode_find_iclass_slots(pkt->insn[slot].opcode);
+    } else {
+        return find_iclass_slots(pkt->insn[slot].opcode,
+                                 pkt->insn[slot].iclass);
+    }
 }
 
 #define DECODE_NEW_TABLE(TAG, SIZE, WHATNOT)     /* NOTHING */
@@ -728,6 +733,11 @@ decode_insns_tablewalk(Insn *insn, const DectreeTable *table,
         }
         decode_op(insn, opc, encoding);
         return 1;
+    } else if (table->table[i].type == DECTREE_EXTSPACE) {
+        /*
+         * For now, HVX is the only coprocessor
+         */
+        return decode_insns_tablewalk(insn, ext_trees[EXT_IDX_mmvec], encoding);
     } else {
         return 0;
     }
@@ -874,6 +884,7 @@ int decode_packet(int max_words, const uint32_t *words, Packet *pkt,
     int words_read = 0;
     bool end_of_packet = false;
     int new_insns = 0;
+    int i;
     uint32_t encoding32;
 
     /* Initialize */
@@ -901,6 +912,11 @@ int decode_packet(int max_words, const uint32_t *words, Packet *pkt,
         return 0;
     }
     pkt->encod_pkt_size_in_bytes = words_read * 4;
+    pkt->pkt_has_hvx = false;
+    for (i = 0; i < num_insns; i++) {
+        pkt->pkt_has_hvx |=
+            GET_ATTRIB(pkt->insn[i].opcode, A_CVI);
+    }
 
     /*
      * Check for :endloop in the parse bits
@@ -931,6 +947,10 @@ int decode_packet(int max_words, const uint32_t *words, Packet *pkt,
     decode_set_slot_number(pkt);
     decode_fill_newvalue_regno(pkt);
 
+    if (pkt->pkt_has_hvx) {
+        mmvec_ext_decode_checks(pkt, disas_only);
+    }
+
     if (!disas_only) {
         decode_shuffle_for_execution(pkt);
         decode_split_cmpjump(pkt);
diff --git a/target/hexagon/mmvec/decode_ext_mmvec.c b/target/hexagon/mmvec/decode_ext_mmvec.c
new file mode 100644
index 0000000..061a65a
--- /dev/null
+++ b/target/hexagon/mmvec/decode_ext_mmvec.c
@@ -0,0 +1,236 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "decode.h"
+#include "opcodes.h"
+#include "insn.h"
+#include "iclass.h"
+#include "mmvec/mmvec.h"
+#include "mmvec/decode_ext_mmvec.h"
+
+static void
+check_new_value(Packet *pkt)
+{
+    /* .new value for an MMVector store */
+    int i, j;
+    const char *reginfo;
+    const char *destletters;
+    const char *dststr = NULL;
+    uint16_t def_opcode;
+    char letter;
+    int def_regnum;
+
+    for (i = 1; i < pkt->num_insns; i++) {
+        uint16_t use_opcode = pkt->insn[i].opcode;
+        if (GET_ATTRIB(use_opcode, A_DOTNEWVALUE) &&
+            GET_ATTRIB(use_opcode, A_CVI) &&
+            GET_ATTRIB(use_opcode, A_STORE)) {
+            int use_regidx = strchr(opcode_reginfo[use_opcode], 's') -
+                opcode_reginfo[use_opcode];
+            /*
+             * The N-field encodes the offset back to the instruction that
+             * produces the value.
+             * Shift off the LSB, which indicates odd/even register.
+             */
+            int def_off = ((pkt->insn[i].regno[use_regidx]) >> 1);
+            int def_oreg = pkt->insn[i].regno[use_regidx] & 1;
+            int def_idx = -1;
+            for (j = i - 1; (j >= 0) && (def_off >= 0); j--) {
+                if (!GET_ATTRIB(pkt->insn[j].opcode, A_CVI)) {
+                    continue;
+                }
+                def_off--;
+                if (def_off == 0) {
+                    def_idx = j;
+                    break;
+                }
+            }
+            /*
+             * Check for a badly encoded N-field which points to an instruction
+             * out-of-range
+             */
+            g_assert(!((def_off != 0) || (def_idx < 0) ||
+                       (def_idx > (pkt->num_insns - 1))));
+
+            /* def_idx is the index of the producer */
+            def_opcode = pkt->insn[def_idx].opcode;
+            reginfo = opcode_reginfo[def_opcode];
+            destletters = "dexy";
+            for (j = 0; (letter = destletters[j]) != 0; j++) {
+                dststr = strchr(reginfo, letter);
+                if (dststr != NULL) {
+                    break;
+                }
+            }
+            if ((dststr == NULL) && GET_ATTRIB(def_opcode, A_CVI_GATHER)) {
+                def_regnum = 0;
+                pkt->insn[i].regno[use_regidx] = def_oreg;
+                pkt->insn[i].new_value_producer_slot = pkt->insn[def_idx].slot;
+            } else {
+                if (dststr == NULL) {
+                    /* still not there, we have a bad packet */
+                    g_assert_not_reached();
+                }
+                def_regnum = pkt->insn[def_idx].regno[dststr - reginfo];
+                /* Now patch up the consumer with the register number */
+                pkt->insn[i].regno[use_regidx] = def_regnum ^ def_oreg;
+                /* special case for (Vx,Vy) */
+                dststr = strchr(reginfo, 'y');
+                if (def_oreg && strchr(reginfo, 'x') && dststr) {
+                    def_regnum = pkt->insn[def_idx].regno[dststr - reginfo];
+                    pkt->insn[i].regno[use_regidx] = def_regnum;
+                }
+                /*
+                 * We need to remember who produces this value to later
+                 * check if it was dynamically cancelled
+                 */
+                pkt->insn[i].new_value_producer_slot = pkt->insn[def_idx].slot;
+            }
+        }
+    }
+}
+
+/*
+ * We don't want to reorder slot1/slot0 with respect to each other.
+ * So when shuffling, we don't move the .cur / .tmp vmem earlier.
+ * Instead, we move the producing instruction later.
+ * But the producing instruction might feed a .new store,
+ * so we may need to move that store even later.
+ */
+
+static void
+decode_mmvec_move_cvi_to_end(Packet *pkt, int max)
+{
+    int i;
+    for (i = 0; i < max; i++) {
+        if (GET_ATTRIB(pkt->insn[i].opcode, A_CVI)) {
+            int last_inst = pkt->num_insns - 1;
+            uint16_t last_opcode = pkt->insn[last_inst].opcode;
+
+            /*
+             * If the last instruction is an endloop, move to the one before it
+             * Keep endloop as the last thing always
+             */
+            if ((last_opcode == J2_endloop0) ||
+                (last_opcode == J2_endloop1) ||
+                (last_opcode == J2_endloop01)) {
+                last_inst--;
+            }
+
+            decode_send_insn_to(pkt, i, last_inst);
+            max--;
+            i--;    /* Retry this index now that packet has rotated */
+        }
+    }
+}
+
+static void
+decode_shuffle_for_execution_vops(Packet *pkt)
+{
+    /*
+     * Sort for .new
+     */
+    int i;
+    for (i = 0; i < pkt->num_insns; i++) {
+        uint16_t opcode = pkt->insn[i].opcode;
+        if (GET_ATTRIB(opcode, A_LOAD) &&
+            (GET_ATTRIB(opcode, A_CVI_NEW) ||
+             GET_ATTRIB(opcode, A_CVI_TMP))) {
+            /*
+             * Find prior consuming vector instructions
+             * Move to end of packet
+             */
+            decode_mmvec_move_cvi_to_end(pkt, i);
+            break;
+        }
+    }
+
+    /* Move HVX new value stores to the end of the packet */
+    for (i = 0; i < pkt->num_insns - 1; i++) {
+        uint16_t opcode = pkt->insn[i].opcode;
+        if (GET_ATTRIB(opcode, A_STORE) &&
+            GET_ATTRIB(opcode, A_CVI_NEW) &&
+            !GET_ATTRIB(opcode, A_CVI_SCATTER_RELEASE)) {
+            int last_inst = pkt->num_insns - 1;
+            uint16_t last_opcode = pkt->insn[last_inst].opcode;
+
+            /*
+             * If the last instruction is an endloop, move to the one before it
+             * Keep endloop as the last thing always
+             */
+            if ((last_opcode == J2_endloop0) ||
+                (last_opcode == J2_endloop1) ||
+                (last_opcode == J2_endloop01)) {
+                last_inst--;
+            }
+
+            decode_send_insn_to(pkt, i, last_inst);
+            break;
+        }
+    }
+}
+
+static void
+check_for_vhist(Packet *pkt)
+{
+    pkt->vhist_insn = NULL;
+    for (int i = 0; i < pkt->num_insns; i++) {
+        Insn *insn = &pkt->insn[i];
+        int opcode = insn->opcode;
+        if (GET_ATTRIB(opcode, A_CVI) && GET_ATTRIB(opcode, A_CVI_4SLOT)) {
+            pkt->vhist_insn = insn;
+            return;
+        }
+    }
+}
+
+/*
+ * Public Functions
+ */
+
+SlotMask mmvec_ext_decode_find_iclass_slots(int opcode)
+{
+    if (GET_ATTRIB(opcode, A_CVI_VM)) {
+        /* HVX memory instruction */
+        if (GET_ATTRIB(opcode, A_RESTRICT_SLOT0ONLY)) {
+            return SLOTS_0;
+        } else if (GET_ATTRIB(opcode, A_RESTRICT_SLOT1ONLY)) {
+            return SLOTS_1;
+        }
+        return SLOTS_01;
+    } else if (GET_ATTRIB(opcode, A_RESTRICT_SLOT2ONLY)) {
+        return SLOTS_2;
+    } else if (GET_ATTRIB(opcode, A_CVI_VX)) {
+        /* HVX multiply instruction */
+        return SLOTS_23;
+    } else if (GET_ATTRIB(opcode, A_CVI_VS_VX)) {
+        /* HVX permute/shift instruction */
+        return SLOTS_23;
+    } else {
+        return SLOTS_0123;
+    }
+}
+
+void mmvec_ext_decode_checks(Packet *pkt, bool disas_only)
+{
+    check_new_value(pkt);
+    if (!disas_only) {
+        decode_shuffle_for_execution_vops(pkt);
+    }
+    check_for_vhist(pkt);
+}
diff --git a/target/hexagon/meson.build b/target/hexagon/meson.build
index a35eb28..b612431 100644
--- a/target/hexagon/meson.build
+++ b/target/hexagon/meson.build
@@ -175,6 +175,7 @@ hexagon_ss.add(files(
     'printinsn.c',
     'arch.c',
     'fma_emu.c',
+    'mmvec/decode_ext_mmvec.c',
     'mmvec/system_ext_mmvec.c',
 ))
 
-- 
2.7.4
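
For readers following check_new_value() in the patch above, the backward
walk that resolves the N-field can be sketched in isolation. This is a
simplified illustration, not the QEMU code: SketchInsn, find_producer and
patched_regno are hypothetical names, and the is_cvi flag stands in for
GET_ATTRIB(opcode, A_CVI). Where the patch asserts on a badly encoded
N-field, the sketch returns -1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for Insn: only what the walk needs. */
typedef struct {
    bool is_cvi;    /* stands in for GET_ATTRIB(opcode, A_CVI) */
} SketchInsn;

/*
 * Return the index of the instruction producing the value consumed by
 * insn[consumer], or -1 if the N-field does not name a valid producer.
 * The LSB of n_field is the odd/even selector; the remaining bits count
 * how many HVX instructions to step back over.
 */
static int find_producer(const SketchInsn *insn, int consumer, int n_field)
{
    assert(consumer >= 0);
    int def_off = n_field >> 1;

    for (int j = consumer - 1; j >= 0; j--) {
        if (!insn[j].is_cvi) {
            continue;           /* non-HVX instructions are not counted */
        }
        if (--def_off == 0) {
            return j;
        }
    }
    return -1;                  /* offset of 0, or offset past the packet */
}

/* Patch the consumer's register number with the odd/even bit. */
static int patched_regno(int def_regnum, int n_field)
{
    return def_regnum ^ (n_field & 1);
}
```

With a packet whose instructions 0, 2 and 3 are HVX and instruction 1 is
not, an N-field of 2 seen at index 3 resolves to index 2, and an N-field
of 4 resolves to index 0 because the scalar instruction at index 1 is
skipped.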



* [PATCH v4 26/30] Hexagon HVX (target/hexagon) import instruction encodings
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (24 preceding siblings ...)
  2021-10-12 10:11 ` [PATCH v4 25/30] Hexagon HVX (target/hexagon) instruction decoding Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-29 19:08   ` Richard Henderson
  2021-10-12 10:11 ` [PATCH v4 27/30] Hexagon HVX (tests/tcg/hexagon) vector_add_int test Taylor Simpson
                   ` (3 subsequent siblings)
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
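
As a reading aid for the encoding strings imported below, here is a
minimal sketch of how one such string can be interpreted: '0' and '1' are
fixed bits, letters mark operand fields, and 'P' and '-' are treated as
unconstrained. It is not how the generated decode tree in QEMU works.
EncPattern, enc_parse, enc_matches and enc_field are hypothetical names,
and the 4-bit class prefix "0001" used in the example below is an
assumption standing in for ICLASS_CJ, which is defined outside this patch.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t mask;      /* 1 where the pattern fixes a bit */
    uint32_t match;     /* required value of the fixed bits */
} EncPattern;

/* Build mask/match from a 32-position pattern; spaces are ignored. */
static EncPattern enc_parse(const char *pattern)
{
    EncPattern p = { 0, 0 };
    int bits = 0;

    for (const char *c = pattern; *c != '\0'; c++) {
        if (*c == ' ') {
            continue;
        }
        p.mask <<= 1;
        p.match <<= 1;
        if (*c == '0' || *c == '1') {
            p.mask |= 1;
            p.match |= (uint32_t)(*c - '0');
        }
        bits++;
    }
    assert(bits == 32);     /* every pattern must describe a full word */
    return p;
}

static int enc_matches(EncPattern p, uint32_t word)
{
    return (word & p.mask) == p.match;
}

/* Collect the bits at positions marked with `letter`, MSB first. */
static uint32_t enc_field(const char *pattern, char letter, uint32_t word)
{
    uint32_t value = 0;
    int pos = 31;

    for (const char *c = pattern; *c != '\0'; c++) {
        if (*c == ' ') {
            continue;
        }
        if (*c == letter) {
            value = (value << 1) | ((word >> pos) & 1);
        }
        pos--;
    }
    return value;
}
```

Under the assumed prefix, the V6_vaddw string becomes
"0001 1 100 010 vvvvv PP 0 uuuuu 000 ddddd". A word built with v=3, u=2,
d=1 and both parse bits set then matches it, and enc_field() recovers the
three register numbers.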
 target/hexagon/decode.c                      |   4 +
 target/hexagon/imported/allextenc.def        |  20 +
 target/hexagon/imported/encode.def           |   1 +
 target/hexagon/imported/mmvec/encode_ext.def | 794 +++++++++++++++++++++++++++
 4 files changed, 819 insertions(+)
 create mode 100644 target/hexagon/imported/allextenc.def
 create mode 100644 target/hexagon/imported/mmvec/encode_ext.def

diff --git a/target/hexagon/decode.c b/target/hexagon/decode.c
index 653bfd7..6f0f27b 100644
--- a/target/hexagon/decode.c
+++ b/target/hexagon/decode.c
@@ -47,6 +47,7 @@ enum {
         /* Name   Num Table */
 DEF_REGMAP(R_16,  16, 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23)
 DEF_REGMAP(R__8,  8,  0, 2, 4, 6, 16, 18, 20, 22)
+DEF_REGMAP(R_8,   8,  0, 1, 2, 3, 4, 5, 6, 7)
 
 #define DECODE_MAPPED_REG(OPNUM, NAME) \
     insn->regno[OPNUM] = DECODE_REGISTER_##NAME[insn->regno[OPNUM]];
@@ -158,6 +159,9 @@ static void decode_ext_init(void)
     for (i = EXT_IDX_noext; i < EXT_IDX_noext_AFTER; i++) {
         ext_trees[i] = &dectree_table_DECODE_EXT_EXT_noext;
     }
+    for (i = EXT_IDX_mmvec; i < EXT_IDX_mmvec_AFTER; i++) {
+        ext_trees[i] = &dectree_table_DECODE_EXT_EXT_mmvec;
+    }
 }
 
 typedef struct {
diff --git a/target/hexagon/imported/allextenc.def b/target/hexagon/imported/allextenc.def
new file mode 100644
index 0000000..39a3e93
--- /dev/null
+++ b/target/hexagon/imported/allextenc.def
@@ -0,0 +1,20 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define EXTNAME mmvec
+#include "mmvec/encode_ext.def"
+#undef EXTNAME
diff --git a/target/hexagon/imported/encode.def b/target/hexagon/imported/encode.def
index b9368d1..e40e7fb 100644
--- a/target/hexagon/imported/encode.def
+++ b/target/hexagon/imported/encode.def
@@ -71,6 +71,7 @@
 
 #include "encode_pp.def"
 #include "encode_subinsn.def"
+#include "allextenc.def"
 
 #ifdef __SELF_DEF_FIELD32
 #undef __SELF_DEF_FIELD32
diff --git a/target/hexagon/imported/mmvec/encode_ext.def b/target/hexagon/imported/mmvec/encode_ext.def
new file mode 100644
index 0000000..6fbbe2c
--- /dev/null
+++ b/target/hexagon/imported/mmvec/encode_ext.def
@@ -0,0 +1,794 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define CONCAT(A,B) A##B
+#define EXTEXTNAME(X) CONCAT(EXT_,X)
+#define DEF_ENC(TAG,STR) DEF_EXT_ENC(TAG,EXTEXTNAME(EXTNAME),STR)
+
+
+#ifndef NO_MMVEC
+DEF_ENC(V6_extractw,  ICLASS_LD" 001 0 000sssss  PP0uuuuu  --1ddddd") /* coproc insn, returns Rd */
+#endif
+
+
+#ifndef NO_MMVEC
+
+
+
+DEF_CLASS32(ICLASS_NCJ" 1--- -------- PP------ --------",COPROC_VMEM)
+DEF_CLASS32(ICLASS_NCJ" 1000 0-0ttttt PPi--iii ---ddddd",BaseOffset_VMEM_Loads)
+DEF_CLASS32(ICLASS_NCJ" 1000 1-0ttttt PPivviii ---ddddd",BaseOffset_if_Pv_VMEM_Loads)
+DEF_CLASS32(ICLASS_NCJ" 1000 0-1ttttt PPi--iii --------",BaseOffset_VMEM_Stores1)
+DEF_CLASS32(ICLASS_NCJ" 1000 1-0ttttt PPi--iii 00------",BaseOffset_VMEM_Stores2)
+DEF_CLASS32(ICLASS_NCJ" 1000 1-1ttttt PPivviii --------",BaseOffset_if_Pv_VMEM_Stores)
+
+DEF_CLASS32(ICLASS_NCJ" 1001 0-0xxxxx PP---iii ---ddddd",PostImm_VMEM_Loads)
+DEF_CLASS32(ICLASS_NCJ" 1001 1-0xxxxx PP-vviii ---ddddd",PostImm_if_Pv_VMEM_Loads)
+DEF_CLASS32(ICLASS_NCJ" 1001 0-1xxxxx PP---iii --------",PostImm_VMEM_Stores1)
+DEF_CLASS32(ICLASS_NCJ" 1001 1-0xxxxx PP---iii 00------",PostImm_VMEM_Stores2)
+DEF_CLASS32(ICLASS_NCJ" 1001 1-1xxxxx PP-vviii --------",PostImm_if_Pv_VMEM_Stores)
+
+DEF_CLASS32(ICLASS_NCJ" 1011 0-0xxxxx PPu----- ---ddddd",PostM_VMEM_Loads)
+DEF_CLASS32(ICLASS_NCJ" 1011 1-0xxxxx PPuvv--- ---ddddd",PostM_if_Pv_VMEM_Loads)
+DEF_CLASS32(ICLASS_NCJ" 1011 0-1xxxxx PPu----- --------",PostM_VMEM_Stores1)
+DEF_CLASS32(ICLASS_NCJ" 1011 1-0xxxxx PPu----- 00------",PostM_VMEM_Stores2)
+DEF_CLASS32(ICLASS_NCJ" 1011 1-1xxxxx PPuvv--- --------",PostM_if_Pv_VMEM_Stores)
+
+DEF_CLASS32(ICLASS_NCJ" 110- 0------- PP------ --------",Z_Load)
+DEF_CLASS32(ICLASS_NCJ" 110- 1------- PP------ --------",Z_Load_if_Pv)
+
+DEF_CLASS32(ICLASS_NCJ" 1111 000ttttt PPu--0-- ---vvvvv",Gather)
+DEF_CLASS32(ICLASS_NCJ" 1111 000ttttt PPu--1-- -ssvvvvv",Gather_if_Qs)
+DEF_CLASS32(ICLASS_NCJ" 1111 001ttttt PPuvvvvv ---wwwww",Scatter)
+DEF_CLASS32(ICLASS_NCJ" 1111 001ttttt PPuvvvvv -----sss",Scatter_New)
+DEF_CLASS32(ICLASS_NCJ" 1111 1--ttttt PPuvvvvv -sswwwww",Scatter_if_Qs)
+
+
+DEF_FIELD32(ICLASS_NCJ" 1--- -!------ PP------ --------",NT,"NonTemporal")
+
+
+
+DEF_FIELDROW_DESC32(                ICLASS_NCJ" 1 000 --- ----- PP i --iii ----- ---","[#0] vmem(Rt+#s4)[:nt]")
+
+#define LDST_ENC(TAG,MAJ3,MID3,RREG,TINY6,MIN3,VREG) DEF_ENC(TAG, ICLASS_NCJ "1" #MAJ3 #MID3 #RREG "PP" #TINY6 #MIN3 #VREG)
+
+#define LDST_BO(TAGPRE,MID3,PRED,MIN3,VREG) LDST_ENC(TAGPRE##_ai, 000,MID3,ttttt,i PRED iii,MIN3,VREG)
+#define LDST_PI(TAGPRE,MID3,PRED,MIN3,VREG) LDST_ENC(TAGPRE##_pi, 001,MID3,xxxxx,- PRED iii,MIN3,VREG)
+#define LDST_PM(TAGPRE,MID3,PRED,MIN3,VREG) LDST_ENC(TAGPRE##_ppu,011,MID3,xxxxx,u PRED ---,MIN3,VREG)
+
+#define LDST_BASICLD(OP,TAGPRE) \
+    OP(TAGPRE,                000,00,000,ddddd) \
+    OP(TAGPRE##_nt,           010,00,000,ddddd) \
+    OP(TAGPRE##_cur,          000,00,001,ddddd) \
+    OP(TAGPRE##_nt_cur,       010,00,001,ddddd) \
+    OP(TAGPRE##_tmp,          000,00,010,ddddd) \
+    OP(TAGPRE##_nt_tmp,       010,00,010,ddddd)
+
+#define LDST_BASICST(OP,TAGPRE) \
+    OP(TAGPRE,           001,--,000,sssss) \
+    OP(TAGPRE##_nt,      011,--,000,sssss) \
+    OP(TAGPRE##_new,     001,--,001,-0sss) \
+    OP(TAGPRE##_srls,    001,--,001,-1---) \
+    OP(TAGPRE##_nt_new,  011,--,001,--sss) \
+
+
+#define LDST_QPREDST(OP,TAGPRE) \
+    OP(TAGPRE##_qpred,    100,vv,000,sssss) \
+    OP(TAGPRE##_nt_qpred, 110,vv,000,sssss) \
+    OP(TAGPRE##_nqpred,   100,vv,001,sssss) \
+    OP(TAGPRE##_nt_nqpred,110,vv,001,sssss) \
+
+#define LDST_CONDLD(OP,TAGPRE) \
+    OP(TAGPRE##_pred,         100,vv,010,ddddd) \
+    OP(TAGPRE##_nt_pred,      110,vv,010,ddddd) \
+    OP(TAGPRE##_npred,        100,vv,011,ddddd) \
+    OP(TAGPRE##_nt_npred,     110,vv,011,ddddd) \
+    OP(TAGPRE##_cur_pred,     100,vv,100,ddddd) \
+    OP(TAGPRE##_nt_cur_pred,  110,vv,100,ddddd) \
+    OP(TAGPRE##_cur_npred,    100,vv,101,ddddd) \
+    OP(TAGPRE##_nt_cur_npred, 110,vv,101,ddddd) \
+    OP(TAGPRE##_tmp_pred,     100,vv,110,ddddd) \
+    OP(TAGPRE##_nt_tmp_pred,  110,vv,110,ddddd) \
+    OP(TAGPRE##_tmp_npred,    100,vv,111,ddddd) \
+    OP(TAGPRE##_nt_tmp_npred, 110,vv,111,ddddd) \
+
+#define LDST_PREDST(OP,TAGPRE,NT,MIN2) \
+    OP(TAGPRE##_pred,      1 NT 1,vv,MIN2 0,sssss) \
+    OP(TAGPRE##_npred,     1 NT 1,vv,MIN2 1,sssss)
+
+#define LDST_PREDSTNEW(OP,TAGPRE,NT,MIN2) \
+    OP(TAGPRE##_pred,      1 NT 1,vv,MIN2 0,NT 0 sss) \
+    OP(TAGPRE##_npred,     1 NT 1,vv,MIN2 1,NT 1 sss)
+
+// 0.0,vv,0--,sssss: pred st
+#define LDST_BASICPREDST(OP,TAGPRE) \
+    LDST_PREDST(OP,TAGPRE,             0,00) \
+    LDST_PREDST(OP,TAGPRE##_nt,        1,00) \
+    LDST_PREDSTNEW(OP,TAGPRE##_new,    0,01) \
+    LDST_PREDSTNEW(OP,TAGPRE##_nt_new, 1,01)
+
+
+
+LDST_BASICLD(LDST_BO,V6_vL32b)
+LDST_CONDLD(LDST_BO,V6_vL32b)
+LDST_BASICLD(LDST_PI,V6_vL32b)
+LDST_CONDLD(LDST_PI,V6_vL32b)
+LDST_BASICLD(LDST_PM,V6_vL32b)
+LDST_CONDLD(LDST_PM,V6_vL32b)
+
+// Loads
+
+LDST_BO(V6_vL32Ub,000,00,111,ddddd)
+//Stores
+LDST_BASICST(LDST_BO,V6_vS32b)
+
+
+LDST_BO(V6_vS32Ub,001,--,111,sssss)
+
+
+
+
+// Byte Enabled Stores
+LDST_QPREDST(LDST_BO,V6_vS32b)
+
+// Scalar Predicated Stores
+LDST_BASICPREDST(LDST_BO,V6_vS32b)
+
+
+LDST_PREDST(LDST_BO,V6_vS32Ub,0,11)
+
+
+
+
+DEF_FIELDROW_DESC32(                ICLASS_NCJ" 1 001 --- ----- PP - ----- ddddd ---","[#1] vmem(Rx++#s3)[:nt]")
+
+// Loads
+LDST_PI(V6_vL32Ub,000,00,111,ddddd)
+
+//Stores
+LDST_BASICST(LDST_PI,V6_vS32b)
+
+
+
+LDST_PI(V6_vS32Ub,001,--,111,sssss)
+
+
+// Byte Enabled Stores
+LDST_QPREDST(LDST_PI,V6_vS32b)
+
+
+// Scalar Predicated Stores
+LDST_BASICPREDST(LDST_PI,V6_vS32b)
+
+
+LDST_PREDST(LDST_PI,V6_vS32Ub,0,11)
+
+
+
+DEF_FIELDROW_DESC32(            ICLASS_NCJ" 1 011 --- ----- PP - ----- ----- ---","[#3] vmem(Rx++#M)[:nt]")
+
+// Loads
+LDST_PM(V6_vL32Ub,000,00,111,ddddd)
+
+//Stores
+LDST_BASICST(LDST_PM,V6_vS32b)
+
+
+
+LDST_PM(V6_vS32Ub,001,--,111,sssss)
+
+// Byte Enabled Stores
+LDST_QPREDST(LDST_PM,V6_vS32b)
+
+// Scalar Predicated Stores
+LDST_BASICPREDST(LDST_PM,V6_vS32b)
+
+
+LDST_PREDST(LDST_PM,V6_vS32Ub,0,11)
+
+
+
+DEF_ENC(V6_vaddcarrysat,    ICLASS_CJ" 1 101 100 vvvvv PP 1 uuuuu 0ss ddddd") //
+DEF_ENC(V6_vaddcarryo,        ICLASS_CJ" 1 101 101 vvvvv PP 1 uuuuu 0ee ddddd") //
+DEF_ENC(V6_vsubcarryo,        ICLASS_CJ" 1 101 101 vvvvv PP 1 uuuuu 1ee ddddd") //
+DEF_ENC(V6_vsatdw,          ICLASS_CJ" 1 101 100 vvvvv PP 1 uuuuu 111 ddddd") //
+
+DEF_FIELDROW_DESC32(           ICLASS_NCJ" 1 111 --- ----- PP - ----- ----- ---","[#6] vgather,vscatter")
+DEF_ENC(V6_vgathermw,         ICLASS_NCJ" 1 111 000 ttttt PP u --000 --- vvvvv")    // vtmp.w=vmem(Rt32,Mu2,Vv32.w).w
+DEF_ENC(V6_vgathermh,         ICLASS_NCJ" 1 111 000 ttttt PP u --001 --- vvvvv")    // vtmp.h=vmem(Rt32,Mu2,Vv32.h).h
+DEF_ENC(V6_vgathermhw,         ICLASS_NCJ" 1 111 000 ttttt PP u --010 --- vvvvv")    // vtmp.h=vmem(Rt32,Mu2,Vvv32.w).h
+
+
+DEF_ENC(V6_vgathermwq,         ICLASS_NCJ" 1 111 000 ttttt PP u --100 -ss vvvvv")    // if (Qs4) vtmp.w=vmem(Rt32,Mu2,Vv32.w).w
+DEF_ENC(V6_vgathermhq,         ICLASS_NCJ" 1 111 000 ttttt PP u --101 -ss vvvvv")    // if (Qs4) vtmp.h=vmem(Rt32,Mu2,Vv32.h).h
+DEF_ENC(V6_vgathermhwq,     ICLASS_NCJ" 1 111 000 ttttt PP u --110 -ss vvvvv")    // if (Qs4) vtmp.h=vmem(Rt32,Mu2,Vvv32.w).h
+
+
+
+DEF_ENC(V6_vscattermw,         ICLASS_NCJ" 1 111 001 ttttt PP u vvvvv 000 wwwww")    // vmem(Rt32,Mu2,Vv32.w)=Vw32.w
+DEF_ENC(V6_vscattermh,         ICLASS_NCJ" 1 111 001 ttttt PP u vvvvv 001 wwwww")    // vmem(Rt32,Mu2,Vv32.h)=Vw32.h
+DEF_ENC(V6_vscattermhw,     ICLASS_NCJ" 1 111 001 ttttt PP u vvvvv 010 wwwww")    // vmem(Rt32,Mu2,Vv32.h)=Vw32.h
+
+DEF_ENC(V6_vscattermw_add,     ICLASS_NCJ" 1 111 001 ttttt PP u vvvvv 100 wwwww")    // vmem(Rt32,Mu2,Vv32.w) += Vw32.w
+DEF_ENC(V6_vscattermh_add,     ICLASS_NCJ" 1 111 001 ttttt PP u vvvvv 101 wwwww")    // vmem(Rt32,Mu2,Vv32.h) += Vw32.h
+DEF_ENC(V6_vscattermhw_add, ICLASS_NCJ" 1 111 001 ttttt PP u vvvvv 110 wwwww")    // vmem(Rt32,Mu2,Vv32.h) += Vw32.h
+
+
+DEF_ENC(V6_vscattermwq,     ICLASS_NCJ" 1 111 100 ttttt PP u vvvvv 0ss wwwww")    // if (Qs4) vmem(Rt32,Mu2,Vv32.w)=Vw32.w
+DEF_ENC(V6_vscattermhq,     ICLASS_NCJ" 1 111 100 ttttt PP u vvvvv 1ss wwwww")    // if (Qs4) vmem(Rt32,Mu2,Vv32.h)=Vw32.h
+DEF_ENC(V6_vscattermhwq,     ICLASS_NCJ" 1 111 101 ttttt PP u vvvvv 0ss wwwww")    // if (Qs4) vmem(Rt32,Mu2,Vv32.h)=Vw32.h
+
+
+
+
+
+DEF_CLASS32(ICLASS_CJ" 1--- -------- PP------ --------",COPROC_VX)
+
+
+
+/***************************************************************
+*
+*  Group #0, Uses Q6 Rt8: new in v61
+*
+****************************************************************/
+
+DEF_FIELDROW_DESC32(            ICLASS_CJ" 1 000 --- ----- PP - ----- ----- ---","[#1] Vd32=(Vu32, Vv32, Rt8)")
+DEF_ENC(V6_vasrhbsat,             ICLASS_CJ" 1 000 vvv vvttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vasruwuhrndsat,         ICLASS_CJ" 1 000 vvv vvttt PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vasrwuhrndsat,         ICLASS_CJ" 1 000 vvv vvttt PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vlutvvb_nm,             ICLASS_CJ" 1 000 vvv vvttt PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vlutvwh_nm,             ICLASS_CJ" 1 000 vvv vvttt PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vasruhubrndsat,         ICLASS_CJ" 1 000 vvv vvttt PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vasruwuhsat,         ICLASS_CJ" 1 000 vvv vvttt PP 1 uuuuu 100 ddddd") //
+DEF_ENC(V6_vasruhubsat,            ICLASS_CJ" 1 000 vvv vvttt PP 1 uuuuu 101 ddddd") //
+
+/***************************************************************
+*
+*  Group #1, Uses Q6 Rt32
+*
+****************************************************************/
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 001 --- ----- PP - ----- ----- ---","[#1] Vd32=(Vu32, Rt32)")
+DEF_ENC(V6_vtmpyb,             ICLASS_CJ" 1 001 000 ttttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vtmpybus,         ICLASS_CJ" 1 001 000 ttttt PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vdmpyhb,         ICLASS_CJ" 1 001 000 ttttt PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vrmpyub,         ICLASS_CJ" 1 001 000 ttttt PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vrmpybus,         ICLASS_CJ" 1 001 000 ttttt PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vdsaduh,         ICLASS_CJ" 1 001 000 ttttt PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vdmpybus,         ICLASS_CJ" 1 001 000 ttttt PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vdmpybus_dv,     ICLASS_CJ" 1 001 000 ttttt PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vdmpyhsusat,     ICLASS_CJ" 1 001 001 ttttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vdmpyhsuisat,     ICLASS_CJ" 1 001 001 ttttt PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vdmpyhsat,         ICLASS_CJ" 1 001 001 ttttt PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vdmpyhisat,         ICLASS_CJ" 1 001 001 ttttt PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vdmpyhb_dv,         ICLASS_CJ" 1 001 001 ttttt PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vmpybus,         ICLASS_CJ" 1 001 001 ttttt PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vmpabus,         ICLASS_CJ" 1 001 001 ttttt PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vmpahb,             ICLASS_CJ" 1 001 001 ttttt PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vmpyh,             ICLASS_CJ" 1 001 010 ttttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vmpyhss,         ICLASS_CJ" 1 001 010 ttttt PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vmpyhsrs,         ICLASS_CJ" 1 001 010 ttttt PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vmpyuh,             ICLASS_CJ" 1 001 010 ttttt PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vrmpybusi,         ICLASS_CJ" 1 001 010 ttttt PP 0 uuuuu 10i ddddd") //
+DEF_ENC(V6_vrsadubi,         ICLASS_CJ" 1 001 010 ttttt PP 0 uuuuu 11i ddddd") //
+
+DEF_ENC(V6_vmpyihb,         ICLASS_CJ" 1 001 011 ttttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vror,             ICLASS_CJ" 1 001 011 ttttt PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vmpyuhe,         ICLASS_CJ" 1 001 011 ttttt PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vmpabuu,         ICLASS_CJ" 1 001 011 ttttt PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vlut4,            ICLASS_CJ" 1 001 011 ttttt PP 0 uuuuu 100 ddddd") //
+
+
+DEF_ENC(V6_vasrw,             ICLASS_CJ" 1 001 011 ttttt PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vasrh,             ICLASS_CJ" 1 001 011 ttttt PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vaslw,             ICLASS_CJ" 1 001 011 ttttt PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vaslh,             ICLASS_CJ" 1 001 100 ttttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vlsrw,             ICLASS_CJ" 1 001 100 ttttt PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vlsrh,             ICLASS_CJ" 1 001 100 ttttt PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vlsrb,            ICLASS_CJ" 1 001 100 ttttt PP 0 uuuuu 011 ddddd") //
+
+DEF_ENC(V6_vmpauhb,            ICLASS_CJ" 1 001 100 ttttt PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vmpyiwub,         ICLASS_CJ" 1 001 100 ttttt PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vmpyiwh,         ICLASS_CJ" 1 001 100 ttttt PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vmpyiwb,         ICLASS_CJ" 1 001 101 ttttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_lvsplatw,         ICLASS_CJ" 1 001 101 ttttt PP 0 ----0 001 ddddd") //
+
+
+
+DEF_ENC(V6_pred_scalar2,     ICLASS_CJ" 1 001 101 ttttt PP 0 ----- 010 -01dd") //
+DEF_ENC(V6_vandvrt,         ICLASS_CJ" 1 001 101 ttttt PP 0 uuuuu 010 -10dd") //
+DEF_ENC(V6_pred_scalar2v2,     ICLASS_CJ" 1 001 101 ttttt PP 0 ----- 010 -11dd") //
+
+DEF_ENC(V6_vtmpyhb,         ICLASS_CJ" 1 001 101 ttttt PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vandqrt,         ICLASS_CJ" 1 001 101 ttttt PP 0 --0uu 101 ddddd") //
+DEF_ENC(V6_vandnqrt,         ICLASS_CJ" 1 001 101 ttttt PP 0 --1uu 101 ddddd") //
+
+DEF_ENC(V6_vrmpyubi,         ICLASS_CJ" 1 001 101 ttttt PP 0 uuuuu 11i ddddd") //
+
+DEF_ENC(V6_vmpyub,             ICLASS_CJ" 1 001 110 ttttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_lvsplath,         ICLASS_CJ" 1 001 110 ttttt PP 0 ----- 001 ddddd") //
+DEF_ENC(V6_lvsplatb,         ICLASS_CJ" 1 001 110 ttttt PP 0 ----- 010 ddddd") //
+
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 001 --- ----- PP - ----- ----- ---","[#1] Vx32=(Vu32, Rt32)")
+DEF_ENC(V6_vtmpyb_acc,         ICLASS_CJ" 1 001 000 ttttt PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vtmpybus_acc,     ICLASS_CJ" 1 001 000 ttttt PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vtmpyhb_acc,     ICLASS_CJ" 1 001 000 ttttt PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vdmpyhb_acc,     ICLASS_CJ" 1 001 000 ttttt PP 1 uuuuu 011 xxxxx") //
+DEF_ENC(V6_vrmpyub_acc,     ICLASS_CJ" 1 001 000 ttttt PP 1 uuuuu 100 xxxxx") //
+DEF_ENC(V6_vrmpybus_acc,     ICLASS_CJ" 1 001 000 ttttt PP 1 uuuuu 101 xxxxx") //
+DEF_ENC(V6_vdmpybus_acc,     ICLASS_CJ" 1 001 000 ttttt PP 1 uuuuu 110 xxxxx") //
+DEF_ENC(V6_vdmpybus_dv_acc, ICLASS_CJ" 1 001 000 ttttt PP 1 uuuuu 111 xxxxx") //
+
+DEF_ENC(V6_vdmpyhsusat_acc, ICLASS_CJ" 1 001 001 ttttt PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vdmpyhsuisat_acc,ICLASS_CJ" 1 001 001 ttttt PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vdmpyhisat_acc,     ICLASS_CJ" 1 001 001 ttttt PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vdmpyhsat_acc,     ICLASS_CJ" 1 001 001 ttttt PP 1 uuuuu 011 xxxxx") //
+DEF_ENC(V6_vdmpyhb_dv_acc,     ICLASS_CJ" 1 001 001 ttttt PP 1 uuuuu 100 xxxxx") //
+DEF_ENC(V6_vmpybus_acc,     ICLASS_CJ" 1 001 001 ttttt PP 1 uuuuu 101 xxxxx") //
+DEF_ENC(V6_vmpabus_acc,     ICLASS_CJ" 1 001 001 ttttt PP 1 uuuuu 110 xxxxx") //
+DEF_ENC(V6_vmpahb_acc,         ICLASS_CJ" 1 001 001 ttttt PP 1 uuuuu 111 xxxxx") //
+
+DEF_ENC(V6_vmpyhsat_acc,     ICLASS_CJ" 1 001 010 ttttt PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vmpyuh_acc,         ICLASS_CJ" 1 001 010 ttttt PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vmpyiwb_acc,     ICLASS_CJ" 1 001 010 ttttt PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vmpyiwh_acc,     ICLASS_CJ" 1 001 010 ttttt PP 1 uuuuu 011 xxxxx") //
+DEF_ENC(V6_vrmpybusi_acc,     ICLASS_CJ" 1 001 010 ttttt PP 1 uuuuu 10i xxxxx") //
+DEF_ENC(V6_vrsadubi_acc,     ICLASS_CJ" 1 001 010 ttttt PP 1 uuuuu 11i xxxxx") //
+
+DEF_ENC(V6_vdsaduh_acc,     ICLASS_CJ" 1 001 011 ttttt PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vmpyihb_acc,     ICLASS_CJ" 1 001 011 ttttt PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vaslw_acc,         ICLASS_CJ" 1 001 011 ttttt PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vandqrt_acc,     ICLASS_CJ" 1 001 011 ttttt PP 1 --0uu 011 xxxxx") //
+DEF_ENC(V6_vandnqrt_acc,     ICLASS_CJ" 1 001 011 ttttt PP 1 --1uu 011 xxxxx") //
+DEF_ENC(V6_vandvrt_acc,     ICLASS_CJ" 1 001 011 ttttt PP 1 uuuuu 100 ---xx") //
+DEF_ENC(V6_vasrw_acc,         ICLASS_CJ" 1 001 011 ttttt PP 1 uuuuu 101 xxxxx") //
+DEF_ENC(V6_vrmpyubi_acc,     ICLASS_CJ" 1 001 011 ttttt PP 1 uuuuu 11i xxxxx") //
+
+DEF_ENC(V6_vmpyub_acc,         ICLASS_CJ" 1 001 100 ttttt PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vmpyiwub_acc,    ICLASS_CJ" 1 001 100 ttttt PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vmpauhb_acc,        ICLASS_CJ" 1 001 100 ttttt PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vmpyuhe_acc,        ICLASS_CJ" 1 001 100 ttttt PP 1 uuuuu 011 xxxxx")
+DEF_ENC(V6_vmpahhsat,        ICLASS_CJ" 1 001 100 ttttt PP 1 uuuuu 100 xxxxx") //
+DEF_ENC(V6_vmpauhuhsat,        ICLASS_CJ" 1 001 100 ttttt PP 1 uuuuu 101 xxxxx") //
+DEF_ENC(V6_vmpsuhuhsat,        ICLASS_CJ" 1 001 100 ttttt PP 1 uuuuu 110 xxxxx") //
+DEF_ENC(V6_vasrh_acc,         ICLASS_CJ" 1 001 100 ttttt PP 1 uuuuu 111 xxxxx") //
+
+
+
+
+DEF_ENC(V6_vinsertwr,        ICLASS_CJ" 1 001 101 ttttt PP 1 ----- 001 xxxxx")
+
+DEF_ENC(V6_vmpabuu_acc,        ICLASS_CJ" 1 001 101 ttttt PP 1 uuuuu 100 xxxxx") //
+DEF_ENC(V6_vaslh_acc,        ICLASS_CJ" 1 001 101 ttttt PP 1 uuuuu 101 xxxxx") //
+DEF_ENC(V6_vmpyh_acc,        ICLASS_CJ" 1 001 101 ttttt PP 1 uuuuu 110 xxxxx") //
+
+
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 001 --- ----- PP - ----- ----- ---","[#1] (Vx32, Vy32, Rt32)")
+DEF_ENC(V6_vshuff,             ICLASS_CJ" 1 001 111 ttttt PP 1 yyyyy 001 xxxxx") //
+DEF_ENC(V6_vdeal,             ICLASS_CJ" 1 001 111 ttttt PP 1 yyyyy 010 xxxxx") //
+
+DEF_FIELDROW_DESC32(    ICLASS_CJ" 1 010 --- ----- PP - ----- ----- ---","[#2] if (Ps) Vd=Vu")
+DEF_ENC(V6_vcmov,         ICLASS_CJ" 1 010 000 ----- PP - uuuuu -ss ddddd")
+DEF_ENC(V6_vncmov,         ICLASS_CJ" 1 010 001 ----- PP - uuuuu -ss ddddd")
+DEF_ENC(V6_vnccombine,     ICLASS_CJ" 1 010 010 vvvvv PP - uuuuu -ss ddddd")
+DEF_ENC(V6_vccombine,     ICLASS_CJ" 1 010 011 vvvvv PP - uuuuu -ss ddddd")
+
+DEF_ENC(V6_vrotr,       ICLASS_CJ" 1 010 100 vvvvv PP 1 uuuuu 111 ddddd")
+DEF_ENC(V6_vasr_into,   ICLASS_CJ" 1 010 101 vvvvv PP 1 uuuuu 111 xxxxx")
+
+/***************************************************************
+*
+*  Group #3, Uses Q6 Rt8
+*
+****************************************************************/
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 011 --- ----- PP - ----- ----- ---","[#3] Vd32=(Vu32, Vv32, Rt8)")
+DEF_ENC(V6_valignb,         ICLASS_CJ" 1 011 vvv vvttt PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vlalignb,         ICLASS_CJ" 1 011 vvv vvttt PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vasrwh,         ICLASS_CJ" 1 011 vvv vvttt PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vasrwhsat,         ICLASS_CJ" 1 011 vvv vvttt PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vasrwhrndsat,     ICLASS_CJ" 1 011 vvv vvttt PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vasrwuhsat,         ICLASS_CJ" 1 011 vvv vvttt PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vasrhubsat,         ICLASS_CJ" 1 011 vvv vvttt PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vasrhubrndsat,     ICLASS_CJ" 1 011 vvv vvttt PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vasrhbrndsat,     ICLASS_CJ" 1 011 vvv vvttt PP 1 uuuuu 000 ddddd") //
+DEF_ENC(V6_vlutvvb,            ICLASS_CJ" 1 011 vvv vvttt PP 1 uuuuu 001 ddddd")
+DEF_ENC(V6_vshuffvdd,         ICLASS_CJ" 1 011 vvv vvttt PP 1 uuuuu 011 ddddd") //
+DEF_ENC(V6_vdealvdd,         ICLASS_CJ" 1 011 vvv vvttt PP 1 uuuuu 100 ddddd") //
+DEF_ENC(V6_vlutvvb_oracc,    ICLASS_CJ" 1 011 vvv vvttt PP 1 uuuuu 101 xxxxx")
+DEF_ENC(V6_vlutvwh,            ICLASS_CJ" 1 011 vvv vvttt PP 1 uuuuu 110 ddddd")
+DEF_ENC(V6_vlutvwh_oracc,    ICLASS_CJ" 1 011 vvv vvttt PP 1 uuuuu 111 xxxxx")
+
+
+
+/***************************************************************
+*
+*  Group #4, No Q6 regs
+*
+****************************************************************/
+
+DEF_FIELDROW_DESC32(    ICLASS_CJ" 1 100 --- ----- PP 0 ----- ----- ---","[#4] Vd32=(Vu32, Vv32)")
+DEF_ENC(V6_vrmpyubv,     ICLASS_CJ" 1 100 000 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vrmpybv,     ICLASS_CJ" 1 100 000 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vrmpybusv,     ICLASS_CJ" 1 100 000 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vdmpyhvsat,     ICLASS_CJ" 1 100 000 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vmpybv,         ICLASS_CJ" 1 100 000 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vmpyubv,     ICLASS_CJ" 1 100 000 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vmpybusv,     ICLASS_CJ" 1 100 000 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vmpyhv,         ICLASS_CJ" 1 100 000 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vmpyuhv,     ICLASS_CJ" 1 100 001 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vmpyhvsrs,     ICLASS_CJ" 1 100 001 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vmpyhus,     ICLASS_CJ" 1 100 001 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vmpabusv,     ICLASS_CJ" 1 100 001 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vmpyih,         ICLASS_CJ" 1 100 001 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vand,         ICLASS_CJ" 1 100 001 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vor,         ICLASS_CJ" 1 100 001 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vxor,         ICLASS_CJ" 1 100 001 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vaddw,         ICLASS_CJ" 1 100 010 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vaddubsat,     ICLASS_CJ" 1 100 010 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vadduhsat,     ICLASS_CJ" 1 100 010 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vaddhsat,     ICLASS_CJ" 1 100 010 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vaddwsat,     ICLASS_CJ" 1 100 010 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vsubb,         ICLASS_CJ" 1 100 010 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vsubh,         ICLASS_CJ" 1 100 010 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vsubw,         ICLASS_CJ" 1 100 010 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vsububsat,     ICLASS_CJ" 1 100 011 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vsubuhsat,     ICLASS_CJ" 1 100 011 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vsubhsat,     ICLASS_CJ" 1 100 011 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vsubwsat,     ICLASS_CJ" 1 100 011 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vaddb_dv,     ICLASS_CJ" 1 100 011 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vaddh_dv,     ICLASS_CJ" 1 100 011 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vaddw_dv,     ICLASS_CJ" 1 100 011 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vaddubsat_dv,ICLASS_CJ" 1 100 011 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vadduhsat_dv,ICLASS_CJ" 1 100 100 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vaddhsat_dv, ICLASS_CJ" 1 100 100 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vaddwsat_dv, ICLASS_CJ" 1 100 100 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vsubb_dv,     ICLASS_CJ" 1 100 100 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vsubh_dv,     ICLASS_CJ" 1 100 100 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vsubw_dv,     ICLASS_CJ" 1 100 100 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vsububsat_dv,ICLASS_CJ" 1 100 100 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vsubuhsat_dv,ICLASS_CJ" 1 100 100 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vsubhsat_dv,    ICLASS_CJ" 1 100 101 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vsubwsat_dv, ICLASS_CJ" 1 100 101 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vaddubh,     ICLASS_CJ" 1 100 101 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vadduhw,     ICLASS_CJ" 1 100 101 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vaddhw,         ICLASS_CJ" 1 100 101 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vsububh,     ICLASS_CJ" 1 100 101 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vsubuhw,        ICLASS_CJ" 1 100 101 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vsubhw,        ICLASS_CJ" 1 100 101 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vabsdiffub,    ICLASS_CJ" 1 100 110 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vabsdiffh,     ICLASS_CJ" 1 100 110 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vabsdiffuh,     ICLASS_CJ" 1 100 110 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vabsdiffw,     ICLASS_CJ" 1 100 110 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vavgub,         ICLASS_CJ" 1 100 110 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vavguh,         ICLASS_CJ" 1 100 110 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vavgh,        ICLASS_CJ" 1 100 110 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vavgw,        ICLASS_CJ" 1 100 110 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vnavgub,        ICLASS_CJ" 1 100 111 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vnavgh,         ICLASS_CJ" 1 100 111 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vnavgw,         ICLASS_CJ" 1 100 111 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vavgubrnd,     ICLASS_CJ" 1 100 111 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vavguhrnd,     ICLASS_CJ" 1 100 111 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vavghrnd,     ICLASS_CJ" 1 100 111 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vavgwrnd,    ICLASS_CJ" 1 100 111 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vmpabuuv,    ICLASS_CJ" 1 100 111 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 100 --- ----- PP 1 ----- ----- ---","[#4] Vx32=(Vu32, Vv32)")
+DEF_ENC(V6_vrmpyubv_acc,      ICLASS_CJ" 1 100 000 vvvvv PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vrmpybv_acc,       ICLASS_CJ" 1 100 000 vvvvv PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vrmpybusv_acc,    ICLASS_CJ" 1 100 000 vvvvv PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vdmpyhvsat_acc,    ICLASS_CJ" 1 100 000 vvvvv PP 1 uuuuu 011 xxxxx") //
+DEF_ENC(V6_vmpybv_acc,         ICLASS_CJ" 1 100 000 vvvvv PP 1 uuuuu 100 xxxxx") //
+DEF_ENC(V6_vmpyubv_acc,     ICLASS_CJ" 1 100 000 vvvvv PP 1 uuuuu 101 xxxxx") //
+DEF_ENC(V6_vmpybusv_acc,    ICLASS_CJ" 1 100 000 vvvvv PP 1 uuuuu 110 xxxxx") //
+DEF_ENC(V6_vmpyhv_acc,        ICLASS_CJ" 1 100 000 vvvvv PP 1 uuuuu 111 xxxxx") //
+
+DEF_ENC(V6_vmpyuhv_acc,        ICLASS_CJ" 1 100 001 vvvvv PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vmpyhus_acc,     ICLASS_CJ" 1 100 001 vvvvv PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vaddhw_acc,        ICLASS_CJ" 1 100 001 vvvvv PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vmpyowh_64_acc,    ICLASS_CJ" 1 100 001 vvvvv PP 1 uuuuu 011 xxxxx")
+DEF_ENC(V6_vmpyih_acc,         ICLASS_CJ" 1 100 001 vvvvv PP 1 uuuuu 100 xxxxx") //
+DEF_ENC(V6_vmpyiewuh_acc,    ICLASS_CJ" 1 100 001 vvvvv PP 1 uuuuu 101 xxxxx") //
+DEF_ENC(V6_vmpyowh_sacc,    ICLASS_CJ" 1 100 001 vvvvv PP 1 uuuuu 110 xxxxx") //
+DEF_ENC(V6_vmpyowh_rnd_sacc,ICLASS_CJ" 1 100 001 vvvvv PP 1 uuuuu 111 xxxxx") //
+
+DEF_ENC(V6_vmpyiewh_acc,      ICLASS_CJ" 1 100 010 vvvvv PP 1 uuuuu 000 xxxxx") //
+
+DEF_ENC(V6_vadduhw_acc,          ICLASS_CJ" 1 100 010 vvvvv PP 1 uuuuu 100 xxxxx") //
+DEF_ENC(V6_vaddubh_acc,          ICLASS_CJ" 1 100 010 vvvvv PP 1 uuuuu 101 xxxxx") //
+
+DEF_FIELDROW_DESC32(    ICLASS_CJ" 1 100 100 ----- PP 1 ----- ----- ---","[#4] Qx4=(Vu32, Vv32)")
+// Grouped by element size (lsbs), comparison (next-lsbs) and accumulate operation and/or/xor (next-lsbs)
+DEF_ENC(V6_veqb_and,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 000 000xx") //
+DEF_ENC(V6_veqh_and,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 000 001xx") //
+DEF_ENC(V6_veqw_and,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 000 010xx") //
+
+DEF_ENC(V6_vgtb_and,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 000 100xx") //
+DEF_ENC(V6_vgth_and,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 000 101xx") //
+DEF_ENC(V6_vgtw_and,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 000 110xx") //
+
+DEF_ENC(V6_vgtub_and,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 001 000xx") //
+DEF_ENC(V6_vgtuh_and,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 001 001xx") //
+DEF_ENC(V6_vgtuw_and,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 001 010xx") //
+
+DEF_ENC(V6_veqb_or,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 010 000xx") //
+DEF_ENC(V6_veqh_or,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 010 001xx") //
+DEF_ENC(V6_veqw_or,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 010 010xx") //
+
+DEF_ENC(V6_vgtb_or,        ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 010 100xx") //
+DEF_ENC(V6_vgth_or,        ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 010 101xx") //
+DEF_ENC(V6_vgtw_or,        ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 010 110xx") //
+
+DEF_ENC(V6_vgtub_or,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 011 000xx") //
+DEF_ENC(V6_vgtuh_or,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 011 001xx") //
+DEF_ENC(V6_vgtuw_or,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 011 010xx") //
+
+DEF_ENC(V6_veqb_xor,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 100 000xx") //
+DEF_ENC(V6_veqh_xor,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 100 001xx") //
+DEF_ENC(V6_veqw_xor,     ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 100 010xx") //
+
+DEF_ENC(V6_vgtb_xor,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 100 100xx") //
+DEF_ENC(V6_vgth_xor,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 100 101xx") //
+DEF_ENC(V6_vgtw_xor,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 100 110xx") //
+
+DEF_ENC(V6_vgtub_xor,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 101 000xx") //
+DEF_ENC(V6_vgtuh_xor,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 101 001xx") //
+DEF_ENC(V6_vgtuw_xor,    ICLASS_CJ" 1 100 100 vvvvv PP 1 uuuuu 101 010xx") //
+
+DEF_FIELDROW_DESC32(    ICLASS_CJ" 1 100 101 ----- PP 1 ----- ----- ---","[#4] Qx4,Vd32=(Vu32, Vv32)")
+DEF_ENC(V6_vaddcarry,    ICLASS_CJ" 1 100 101 vvvvv PP 1 uuuuu 0xx ddddd") //
+DEF_ENC(V6_vsubcarry,    ICLASS_CJ" 1 100 101 vvvvv PP 1 uuuuu 1xx ddddd") //
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 100 11- ----- PP 1 ----- ----- ---","[#4] Vx32|=(Vu32, Vv32,#)")
+DEF_ENC(V6_vlutvvb_oracci,    ICLASS_CJ" 1 100 110 vvvvv PP 1 uuuuu iii xxxxx") //
+DEF_ENC(V6_vlutvwh_oracci,    ICLASS_CJ" 1 100 111 vvvvv PP 1 uuuuu iii xxxxx") //
+
+
+
+/***************************************************************
+*
+*  Group #5, Reserved/Deprecated. Uses Q6 Rx. Stupid FFT.
+*
+****************************************************************/
+
+
+
+
+/***************************************************************
+*
+*  Group #6, No Q6 regs
+*
+****************************************************************/
+
+DEF_FIELDROW_DESC32(    ICLASS_CJ" 1 110 --0 ----- PP 0 ----- ----- ---","[#6] Vd32=Vu32")
+DEF_ENC(V6_vabsh,         ICLASS_CJ" 1 110 --0 ---00 PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vabsh_sat,     ICLASS_CJ" 1 110 --0 ---00 PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vabsw,         ICLASS_CJ" 1 110 --0 ---00 PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vabsw_sat,     ICLASS_CJ" 1 110 --0 ---00 PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vnot,         ICLASS_CJ" 1 110 --0 ---00 PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vdealh,         ICLASS_CJ" 1 110 --0 ---00 PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vdealb,         ICLASS_CJ" 1 110 --0 ---00 PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vunpackub,     ICLASS_CJ" 1 110 --0 ---01 PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vunpackuh,     ICLASS_CJ" 1 110 --0 ---01 PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vunpackb,     ICLASS_CJ" 1 110 --0 ---01 PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vunpackh,     ICLASS_CJ" 1 110 --0 ---01 PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vabsb,         ICLASS_CJ" 1 110 --0 ---01 PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vabsb_sat,     ICLASS_CJ" 1 110 --0 ---01 PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vshuffh,     ICLASS_CJ" 1 110 --0 ---01 PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vshuffb,     ICLASS_CJ" 1 110 --0 ---10 PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vzb,         ICLASS_CJ" 1 110 --0 ---10 PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vzh,         ICLASS_CJ" 1 110 --0 ---10 PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vsb,         ICLASS_CJ" 1 110 --0 ---10 PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vsh,         ICLASS_CJ" 1 110 --0 ---10 PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vcl0w,         ICLASS_CJ" 1 110 --0 ---10 PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vpopcounth,     ICLASS_CJ" 1 110 --0 ---10 PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vcl0h,         ICLASS_CJ" 1 110 --0 ---10 PP 0 uuuuu 111 ddddd") //
+
+
+DEF_FIELDROW_DESC32(    ICLASS_CJ" 1 110 --0 ---11 PP 0 ----- ----- ---","[#6] Qd4=Qt4, Qs4")
+DEF_ENC(V6_pred_and,     ICLASS_CJ" 1 110 tt0 ---11 PP 0 ---ss 000 000dd") //
+DEF_ENC(V6_pred_or,     ICLASS_CJ" 1 110 tt0 ---11 PP 0 ---ss 000 001dd") //
+DEF_ENC(V6_pred_not,     ICLASS_CJ" 1 110 --0 ---11 PP 0 ---ss 000 010dd") //
+DEF_ENC(V6_pred_xor,     ICLASS_CJ" 1 110 tt0 ---11 PP 0 ---ss 000 011dd") //
+DEF_ENC(V6_pred_or_n,     ICLASS_CJ" 1 110 tt0 ---11 PP 0 ---ss 000 100dd") //
+DEF_ENC(V6_pred_and_n,     ICLASS_CJ" 1 110 tt0 ---11 PP 0 ---ss 000 101dd") //
+DEF_ENC(V6_shuffeqh,     ICLASS_CJ" 1 110 tt0 ---11 PP 0 ---ss 000 110dd") //
+DEF_ENC(V6_shuffeqw,     ICLASS_CJ" 1 110 tt0 ---11 PP 0 ---ss 000 111dd") //
+
+DEF_ENC(V6_vnormamtw,        ICLASS_CJ" 1 110 --0 ---11 PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vnormamth,        ICLASS_CJ" 1 110 --0 ---11 PP 0 uuuuu 101 ddddd") //
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 110 --1 ----- PP 0 ----- ----- ---","[#6] Vd32=Vu32,Vv32")
+DEF_ENC(V6_vlutvvbi,        ICLASS_CJ" 1 110 001 vvvvv PP 0 uuuuu iii ddddd")
+DEF_ENC(V6_vlutvwhi,        ICLASS_CJ" 1 110 011 vvvvv PP 0 uuuuu iii ddddd")
+
+DEF_ENC(V6_vaddbsat_dv,        ICLASS_CJ" 1 110 101 vvvvv PP 0 uuuuu 000 ddddd")
+DEF_ENC(V6_vsubbsat_dv,        ICLASS_CJ" 1 110 101 vvvvv PP 0 uuuuu 001 ddddd")
+DEF_ENC(V6_vadduwsat_dv,    ICLASS_CJ" 1 110 101 vvvvv PP 0 uuuuu 010 ddddd")
+DEF_ENC(V6_vsubuwsat_dv,    ICLASS_CJ" 1 110 101 vvvvv PP 0 uuuuu 011 ddddd")
+DEF_ENC(V6_vaddububb_sat,    ICLASS_CJ" 1 110 101 vvvvv PP 0 uuuuu 100 ddddd")
+DEF_ENC(V6_vsubububb_sat,    ICLASS_CJ" 1 110 101 vvvvv PP 0 uuuuu 101 ddddd")
+DEF_ENC(V6_vmpyewuh_64,        ICLASS_CJ" 1 110 101 vvvvv PP 0 uuuuu 110 ddddd")
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 110 --0 ----- PP 1 ----- ----- ---","Vx32=Vu32")
+DEF_ENC(V6_vunpackob,         ICLASS_CJ" 1 110 --0 ---00 PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vunpackoh,         ICLASS_CJ" 1 110 --0 ---00 PP 1 uuuuu 001 xxxxx") //
+//DEF_ENC(V6_vunpackow,     ICLASS_CJ" 1 110 --0 ---00 PP 1 uuuuu 010 xxxxx") //
+
+DEF_ENC(V6_vhist,            ICLASS_CJ" 1 110 --0 ---00 PP 1 -000- 100 -----")
+DEF_ENC(V6_vwhist256,        ICLASS_CJ" 1 110 --0 ---00 PP 1 -0010 100 -----")
+DEF_ENC(V6_vwhist256_sat,    ICLASS_CJ" 1 110 --0 ---00 PP 1 -0011 100 -----")
+DEF_ENC(V6_vwhist128,        ICLASS_CJ" 1 110 --0 ---00 PP 1 -010- 100 -----")
+DEF_ENC(V6_vwhist128m,        ICLASS_CJ" 1 110 --0 ---00 PP 1 -011i 100 -----")
+
+DEF_FIELDROW_DESC32(        ICLASS_CJ" 1 110 --0 ----- PP 1 ----- ----- ---","if (Qv4) Vx32=Vu32")
+DEF_ENC(V6_vaddbq,             ICLASS_CJ" 1 110 vv0 ---01 PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vaddhq,             ICLASS_CJ" 1 110 vv0 ---01 PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vaddwq,             ICLASS_CJ" 1 110 vv0 ---01 PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vaddbnq,         ICLASS_CJ" 1 110 vv0 ---01 PP 1 uuuuu 011 xxxxx") //
+DEF_ENC(V6_vaddhnq,         ICLASS_CJ" 1 110 vv0 ---01 PP 1 uuuuu 100 xxxxx") //
+DEF_ENC(V6_vaddwnq,         ICLASS_CJ" 1 110 vv0 ---01 PP 1 uuuuu 101 xxxxx") //
+DEF_ENC(V6_vsubbq,             ICLASS_CJ" 1 110 vv0 ---01 PP 1 uuuuu 110 xxxxx") //
+DEF_ENC(V6_vsubhq,             ICLASS_CJ" 1 110 vv0 ---01 PP 1 uuuuu 111 xxxxx") //
+
+DEF_ENC(V6_vsubwq,             ICLASS_CJ" 1 110 vv0 ---10 PP 1 uuuuu 000 xxxxx") //
+DEF_ENC(V6_vsubbnq,         ICLASS_CJ" 1 110 vv0 ---10 PP 1 uuuuu 001 xxxxx") //
+DEF_ENC(V6_vsubhnq,         ICLASS_CJ" 1 110 vv0 ---10 PP 1 uuuuu 010 xxxxx") //
+DEF_ENC(V6_vsubwnq,         ICLASS_CJ" 1 110 vv0 ---10 PP 1 uuuuu 011 xxxxx") //
+
+DEF_ENC(V6_vhistq,            ICLASS_CJ" 1 110 vv0 ---10 PP 1 --00- 100 -----")
+DEF_ENC(V6_vwhist256q,        ICLASS_CJ" 1 110 vv0 ---10 PP 1 --010 100 -----")
+DEF_ENC(V6_vwhist256q_sat,    ICLASS_CJ" 1 110 vv0 ---10 PP 1 --011 100 -----")
+DEF_ENC(V6_vwhist128q,        ICLASS_CJ" 1 110 vv0 ---10 PP 1 --10- 100 -----")
+DEF_ENC(V6_vwhist128qm,        ICLASS_CJ" 1 110 vv0 ---10 PP 1 --11i 100 -----")
+
+
+DEF_ENC(V6_vandvqv,            ICLASS_CJ" 1 110 vv0 ---11 PP 1 uuuuu 000 ddddd")
+DEF_ENC(V6_vandvnqv,        ICLASS_CJ" 1 110 vv0 ---11 PP 1 uuuuu 001 ddddd")
+
+
+DEF_ENC(V6_vprefixqb,       ICLASS_CJ" 1 110 vv0 ---11 PP 1 --000 010 ddddd") //
+DEF_ENC(V6_vprefixqh,       ICLASS_CJ" 1 110 vv0 ---11 PP 1 --001 010 ddddd") //
+DEF_ENC(V6_vprefixqw,       ICLASS_CJ" 1 110 vv0 ---11 PP 1 --010 010 ddddd") //
+
+
+
+
+DEF_ENC(V6_vassign,            ICLASS_CJ" 1 110 --0 ---11 PP 1 uuuuu 111 ddddd")
+
+DEF_ENC(V6_valignbi,         ICLASS_CJ" 1 110 001 vvvvv PP 1 uuuuu iii ddddd")
+DEF_ENC(V6_vlalignbi,         ICLASS_CJ" 1 110 011 vvvvv PP 1 uuuuu iii ddddd")
+DEF_ENC(V6_vswap,             ICLASS_CJ" 1 110 101 vvvvv PP 1 uuuuu -tt ddddd") //
+DEF_ENC(V6_vmux,             ICLASS_CJ" 1 110 111 vvvvv PP 1 uuuuu -tt ddddd") //
+
+
+
+/***************************************************************
+*
+*  Group #7, No Q6 regs
+*
+****************************************************************/
+
+DEF_FIELDROW_DESC32(    ICLASS_CJ" 1 111 --- ----- PP 0 ----- ----- ---","[#7] Vd32=(Vu32, Vv32)")
+DEF_ENC(V6_vaddbsat,    ICLASS_CJ" 1 111 000 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vminub,         ICLASS_CJ" 1 111 000 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vminuh,         ICLASS_CJ" 1 111 000 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vminh,         ICLASS_CJ" 1 111 000 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vminw,         ICLASS_CJ" 1 111 000 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vmaxub,         ICLASS_CJ" 1 111 000 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vmaxuh,         ICLASS_CJ" 1 111 000 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vmaxh,         ICLASS_CJ" 1 111 000 vvvvv PP 0 uuuuu 111 ddddd") //
+
+
+DEF_ENC(V6_vaddclbh,    ICLASS_CJ" 1 111 000 vvvvv PP 1 uuuuu 000 ddddd") //
+DEF_ENC(V6_vaddclbw,    ICLASS_CJ" 1 111 000 vvvvv PP 1 uuuuu 001 ddddd") //
+
+DEF_ENC(V6_vavguw,        ICLASS_CJ" 1 111 000 vvvvv PP 1 uuuuu 010 ddddd") //
+DEF_ENC(V6_vavguwrnd,    ICLASS_CJ" 1 111 000 vvvvv PP 1 uuuuu 011 ddddd") //
+DEF_ENC(V6_vavgb,        ICLASS_CJ" 1 111 000 vvvvv PP 1 uuuuu 100 ddddd") //
+DEF_ENC(V6_vavgbrnd,    ICLASS_CJ" 1 111 000 vvvvv PP 1 uuuuu 101 ddddd") //
+DEF_ENC(V6_vnavgb,        ICLASS_CJ" 1 111 000 vvvvv PP 1 uuuuu 110 ddddd") //
+
+
+DEF_ENC(V6_vmaxw,         ICLASS_CJ" 1 111 001 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vdelta,         ICLASS_CJ" 1 111 001 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vsubbsat,    ICLASS_CJ" 1 111 001 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vrdelta,     ICLASS_CJ" 1 111 001 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vminb,         ICLASS_CJ" 1 111 001 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vmaxb,         ICLASS_CJ" 1 111 001 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vsatuwuh,    ICLASS_CJ" 1 111 001 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vdealb4w,     ICLASS_CJ" 1 111 001 vvvvv PP 0 uuuuu 111 ddddd") //
+
+
+DEF_ENC(V6_vmpyowh_rnd,     ICLASS_CJ" 1 111 010 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vshuffeb,      ICLASS_CJ" 1 111 010 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vshuffob,      ICLASS_CJ" 1 111 010 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vshufeh,      ICLASS_CJ" 1 111 010 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vshufoh,      ICLASS_CJ" 1 111 010 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vshufoeh,      ICLASS_CJ" 1 111 010 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vshufoeb,      ICLASS_CJ" 1 111 010 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vcombine,     ICLASS_CJ" 1 111 010 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vmpyieoh,     ICLASS_CJ" 1 111 011 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vadduwsat,     ICLASS_CJ" 1 111 011 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vsathub,     ICLASS_CJ" 1 111 011 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vsatwh,         ICLASS_CJ" 1 111 011 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vroundwh,    ICLASS_CJ" 1 111 011 vvvvv PP 0 uuuuu 100 ddddd")
+DEF_ENC(V6_vroundwuh,    ICLASS_CJ" 1 111 011 vvvvv PP 0 uuuuu 101 ddddd")
+DEF_ENC(V6_vroundhb,    ICLASS_CJ" 1 111 011 vvvvv PP 0 uuuuu 110 ddddd")
+DEF_ENC(V6_vroundhub,    ICLASS_CJ" 1 111 011 vvvvv PP 0 uuuuu 111 ddddd")
+
+DEF_FIELDROW_DESC32(    ICLASS_CJ" 1 111 100 ----- PP - ----- ----- ---","[#7] Qd4=(Vu32, Vv32)")
+DEF_ENC(V6_veqb,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 000 000dd") //
+DEF_ENC(V6_veqh,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 000 001dd") //
+DEF_ENC(V6_veqw,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 000 010dd") //
+
+DEF_ENC(V6_vgtb,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 000 100dd") //
+DEF_ENC(V6_vgth,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 000 101dd") //
+DEF_ENC(V6_vgtw,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 000 110dd") //
+
+DEF_ENC(V6_vgtub,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 001 000dd") //
+DEF_ENC(V6_vgtuh,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 001 001dd") //
+DEF_ENC(V6_vgtuw,         ICLASS_CJ" 1 111 100 vvvvv PP 0 uuuuu 001 010dd") //
+
+
+DEF_ENC(V6_vasrwv,         ICLASS_CJ" 1 111 101 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vlsrwv,         ICLASS_CJ" 1 111 101 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vlsrhv,         ICLASS_CJ" 1 111 101 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vasrhv,         ICLASS_CJ" 1 111 101 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vaslwv,         ICLASS_CJ" 1 111 101 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vaslhv,         ICLASS_CJ" 1 111 101 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vaddb,         ICLASS_CJ" 1 111 101 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vaddh,         ICLASS_CJ" 1 111 101 vvvvv PP 0 uuuuu 111 ddddd") //
+
+
+DEF_ENC(V6_vmpyiewuh,     ICLASS_CJ" 1 111 110 vvvvv PP 0 uuuuu 000 ddddd")
+DEF_ENC(V6_vmpyiowh,    ICLASS_CJ" 1 111 110 vvvvv PP 0 uuuuu 001 ddddd")
+DEF_ENC(V6_vpackeb,     ICLASS_CJ" 1 111 110 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vpackeh,     ICLASS_CJ" 1 111 110 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vsubuwsat,     ICLASS_CJ" 1 111 110 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vpackhub_sat,ICLASS_CJ" 1 111 110 vvvvv PP 0 uuuuu 101 ddddd") //
+DEF_ENC(V6_vpackhb_sat, ICLASS_CJ" 1 111 110 vvvvv PP 0 uuuuu 110 ddddd") //
+DEF_ENC(V6_vpackwuh_sat,ICLASS_CJ" 1 111 110 vvvvv PP 0 uuuuu 111 ddddd") //
+
+DEF_ENC(V6_vpackwh_sat, ICLASS_CJ" 1 111 111 vvvvv PP 0 uuuuu 000 ddddd") //
+DEF_ENC(V6_vpackob,     ICLASS_CJ" 1 111 111 vvvvv PP 0 uuuuu 001 ddddd") //
+DEF_ENC(V6_vpackoh,     ICLASS_CJ" 1 111 111 vvvvv PP 0 uuuuu 010 ddddd") //
+DEF_ENC(V6_vrounduhub,     ICLASS_CJ" 1 111 111 vvvvv PP 0 uuuuu 011 ddddd") //
+DEF_ENC(V6_vrounduwuh,     ICLASS_CJ" 1 111 111 vvvvv PP 0 uuuuu 100 ddddd") //
+DEF_ENC(V6_vmpyewuh,    ICLASS_CJ" 1 111 111 vvvvv PP 0 uuuuu 101 ddddd")
+DEF_ENC(V6_vmpyowh,        ICLASS_CJ" 1 111 111 vvvvv PP 0 uuuuu 111 ddddd")
+
+
+#endif /* NO MMVEC */
-- 
2.7.4



* [PATCH v4 27/30] Hexagon HVX (tests/tcg/hexagon) vector_add_int test
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (25 preceding siblings ...)
  2021-10-12 10:11 ` [PATCH v4 26/30] Hexagon HVX (target/hexagon) import instruction encodings Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-29 19:10   ` Richard Henderson
  2021-10-12 10:11 ` [PATCH v4 28/30] Hexagon HVX (tests/tcg/hexagon) hvx_misc test Taylor Simpson
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
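A note on why the test uses 400 elements plus a sentinel at gA[400].
Assuming the compiler vectorizes the loop with the 128-byte HVX vectors
described in the cover letter, and assuming a 4-byte int, 400 ints do not
fill a whole number of vectors, so the tail has to be handled without
touching gA[400].  The helper names below (full_vectors, tail_bytes) are
hypothetical and only illustrate the arithmetic; they are not part of the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed HVX vector length in bytes (128, per the cover letter). */
#define VEC_BYTES 128

/* Number of whole vectors covered by n_elems elements of elem_size bytes. */
static size_t full_vectors(size_t n_elems, size_t elem_size)
{
    return (n_elems * elem_size) / VEC_BYTES;
}

/* Bytes left over after the whole vectors; non-zero means a partial tail. */
static size_t tail_bytes(size_t n_elems, size_t elem_size)
{
    return (n_elems * elem_size) % VEC_BYTES;
}
```

For 400 4-byte ints this gives 12 full vectors and a 64-byte (16-element)
tail, which appears to be the case the "Overran the buffer" check guards.
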
 tests/tcg/hexagon/vector_add_int.c | 61 ++++++++++++++++++++++++++++++++++++++
 tests/tcg/hexagon/Makefile.target  |  3 ++
 2 files changed, 64 insertions(+)
 create mode 100644 tests/tcg/hexagon/vector_add_int.c

diff --git a/tests/tcg/hexagon/vector_add_int.c b/tests/tcg/hexagon/vector_add_int.c
new file mode 100644
index 0000000..d6010ea
--- /dev/null
+++ b/tests/tcg/hexagon/vector_add_int.c
@@ -0,0 +1,61 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <stdio.h>
+
+int gA[401];
+int gB[401];
+int gC[401];
+
+void vector_add_int()
+{
+  int i;
+  for (i = 0; i < 400; i++) {
+    gA[i] = gB[i] + gC[i];
+  }
+}
+
+int main()
+{
+  int error = 0;
+  int i;
+  for (i = 0; i < 400; i++) {
+    gB[i] = i * 2;
+    gC[i] = i * 3;
+  }
+  gA[400] = 17;
+  vector_add_int();
+  for (i = 0; i < 400; i++) {
+    if (gA[i] != i * 5) {
+        error++;
+        printf("ERROR: gB[%d] = %d\t", i, gB[i]);
+        printf("gC[%d] = %d\t", i, gC[i]);
+        printf("gA[%d] = %d\n", i, gA[i]);
+    }
+  }
+  if (gA[400] != 17) {
+    error++;
+    printf("ERROR: Overran the buffer\n");
+  }
+  if (!error) {
+    printf("PASS\n");
+    return 0;
+  } else {
+    printf("FAIL\n");
+    return 1;
+  }
+}
diff --git a/tests/tcg/hexagon/Makefile.target b/tests/tcg/hexagon/Makefile.target
index c1e1650..b010edc 100644
--- a/tests/tcg/hexagon/Makefile.target
+++ b/tests/tcg/hexagon/Makefile.target
@@ -38,7 +38,10 @@ HEX_TESTS += circ
 HEX_TESTS += brev
 HEX_TESTS += load_unpack
 HEX_TESTS += load_align
+HEX_TESTS += vector_add_int
 HEX_TESTS += atomics
 HEX_TESTS += fpstuff
 
 TESTS += $(HEX_TESTS)
+
+vector_add_int: CFLAGS += -mhvx -fvectorize
-- 
2.7.4



* [PATCH v4 28/30] Hexagon HVX (tests/tcg/hexagon) hvx_misc test
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (26 preceding siblings ...)
  2021-10-12 10:11 ` [PATCH v4 27/30] Hexagon HVX (tests/tcg/hexagon) vector_add_int test Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-29 19:11   ` Richard Henderson
  2021-10-12 10:11 ` [PATCH v4 29/30] Hexagon HVX (tests/tcg/hexagon) scatter_gather test Taylor Simpson
  2021-10-12 10:11 ` [PATCH v4 30/30] Hexagon HVX (tests/tcg/hexagon) histogram test Taylor Simpson
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Tests for
    packet semantics
    vector loads (aligned and unaligned)
    vector stores (aligned and unaligned)
    vector masked stores
    vector new value store
    maximum HVX temps in a packet
    vector operations

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
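As a reading aid for the aligned and unaligned load tests, here is a small
host-side sketch of the behavior they expect: an aligned vmem access drops
the low address bits, while vmemu uses the address as given.  The function
names (model_vmem_load, model_vmemu_load) and the flat byte-array memory are
assumptions made for illustration; this is not how QEMU implements the
instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same value as MAX_VEC_SIZE_BYTES in hvx_misc.c. */
#define VEC_BYTES 128

/*
 * Hypothetical model of an aligned vector load: "addr" is an offset into
 * "mem", and the low bits are cleared before the access.
 */
static void model_vmem_load(uint8_t *dst, const uint8_t *mem, size_t addr)
{
    size_t aligned = addr & ~(size_t)(VEC_BYTES - 1);
    memcpy(dst, mem + aligned, VEC_BYTES);
}

/*
 * Hypothetical model of an unaligned vector load: the offset is used
 * unchanged, so the access may span two aligned vectors.
 */
static void model_vmemu_load(uint8_t *dst, const uint8_t *mem, size_t addr)
{
    memcpy(dst, mem + addr, VEC_BYTES);
}
```

With a buffer whose bytes equal their index, model_vmem_load at offset 13
returns bytes 0..127 (what test_load_aligned expects from buffer0[0]), and
model_vmemu_load at offset 12 returns bytes 12..139 (what
test_load_unaligned builds with memcpy).
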
 tests/tcg/hexagon/hvx_misc.c      | 469 ++++++++++++++++++++++++++++++++++++++
 tests/tcg/hexagon/Makefile.target |   2 +
 2 files changed, 471 insertions(+)
 create mode 100644 tests/tcg/hexagon/hvx_misc.c

diff --git a/tests/tcg/hexagon/hvx_misc.c b/tests/tcg/hexagon/hvx_misc.c
new file mode 100644
index 0000000..312bb98
--- /dev/null
+++ b/tests/tcg/hexagon/hvx_misc.c
@@ -0,0 +1,469 @@
+/*
+ *  Copyright(c) 2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <string.h>
+
+int err;
+
+static void __check(int line, int i, int j, uint64_t result, uint64_t expect)
+{
+    if (result != expect) {
+        printf("ERROR at line %d: [%d][%d] 0x%016llx != 0x%016llx\n",
+               line, i, j, result, expect);
+        err++;
+    }
+}
+
+#define check(RES, EXP) __check(__LINE__, RES, EXP)
+
+#define MAX_VEC_SIZE_BYTES         128
+
+typedef union {
+    uint64_t ud[MAX_VEC_SIZE_BYTES / 8];
+    int64_t   d[MAX_VEC_SIZE_BYTES / 8];
+    uint32_t uw[MAX_VEC_SIZE_BYTES / 4];
+    int32_t   w[MAX_VEC_SIZE_BYTES / 4];
+    uint16_t uh[MAX_VEC_SIZE_BYTES / 2];
+    int16_t   h[MAX_VEC_SIZE_BYTES / 2];
+    uint8_t  ub[MAX_VEC_SIZE_BYTES / 1];
+    int8_t    b[MAX_VEC_SIZE_BYTES / 1];
+} MMVector;
+
+#define BUFSIZE      16
+#define OUTSIZE      16
+#define MASKMOD      3
+
+MMVector buffer0[BUFSIZE] __attribute__((aligned(MAX_VEC_SIZE_BYTES)));
+MMVector buffer1[BUFSIZE] __attribute__((aligned(MAX_VEC_SIZE_BYTES)));
+MMVector mask[BUFSIZE] __attribute__((aligned(MAX_VEC_SIZE_BYTES)));
+MMVector output[OUTSIZE] __attribute__((aligned(MAX_VEC_SIZE_BYTES)));
+MMVector expect[OUTSIZE] __attribute__((aligned(MAX_VEC_SIZE_BYTES)));
+
+#define CHECK_OUTPUT_FUNC(FIELD, FIELDSZ) \
+static void check_output_##FIELD(int line, size_t num_vectors) \
+{ \
+    for (int i = 0; i < num_vectors; i++) { \
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES / FIELDSZ; j++) { \
+            __check(line, i, j, output[i].FIELD[j], expect[i].FIELD[j]); \
+        } \
+    } \
+}
+
+CHECK_OUTPUT_FUNC(d,  8)
+CHECK_OUTPUT_FUNC(w,  4)
+CHECK_OUTPUT_FUNC(h,  2)
+CHECK_OUTPUT_FUNC(b,  1)
+
+static void init_buffers(void)
+{
+    int counter0 = 0;
+    int counter1 = 17;
+    for (int i = 0; i < BUFSIZE; i++) {
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES; j++) {
+            buffer0[i].b[j] = counter0++;
+            buffer1[i].b[j] = counter1++;
+        }
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES / 4; j++) {
+            mask[i].w[j] = (i + j % MASKMOD == 0) ? 0 : 1;
+        }
+    }
+}
+
+static void test_load_tmp(void)
+{
+    void *p0 = buffer0;
+    void *p1 = buffer1;
+    void *pout = output;
+
+    for (int i = 0; i < BUFSIZE; i++) {
+        /*
+         * Load into v12 as .tmp, then use it in the next packet
+         * Should get the new value within the same packet and
+         * the old value in the next packet
+         */
+        asm("v3 = vmem(%0 + #0)\n\t"
+            "r1 = #1\n\t"
+            "v12 = vsplat(r1)\n\t"
+            "{\n\t"
+            "    v12.tmp = vmem(%1 + #0)\n\t"
+            "    v4.w = vadd(v12.w, v3.w)\n\t"
+            "}\n\t"
+            "v4.w = vadd(v4.w, v12.w)\n\t"
+            "vmem(%2 + #0) = v4\n\t"
+            : : "r"(p0), "r"(p1), "r"(pout)
+            : "r1", "v12", "v3", "v4", "v6", "memory");
+        p0 += sizeof(MMVector);
+        p1 += sizeof(MMVector);
+        pout += sizeof(MMVector);
+
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES / 4; j++) {
+            expect[i].w[j] = buffer0[i].w[j] + buffer1[i].w[j] + 1;
+        }
+    }
+
+    check_output_w(__LINE__, BUFSIZE);
+}
+
+static void test_load_cur(void)
+{
+    void *p0 = buffer0;
+    void *pout = output;
+
+    for (int i = 0; i < BUFSIZE; i++) {
+        asm("{\n\t"
+            "    v2.cur = vmem(%0 + #0)\n\t"
+            "    vmem(%1 + #0) = v2\n\t"
+            "}\n\t"
+            : : "r"(p0), "r"(pout) : "v2", "memory");
+        p0 += sizeof(MMVector);
+        pout += sizeof(MMVector);
+
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES / 4; j++) {
+            expect[i].uw[j] = buffer0[i].uw[j];
+        }
+    }
+
+    check_output_w(__LINE__, BUFSIZE);
+}
+
+static void test_load_aligned(void)
+{
+    /* Aligned loads ignore the low bits of the address */
+    void *p0 = buffer0;
+    void *pout = output;
+    const size_t offset = 13;
+
+    p0 += offset;    /* Create an unaligned address */
+    asm("v2 = vmem(%0 + #0)\n\t"
+        "vmem(%1 + #0) = v2\n\t"
+        : : "r"(p0), "r"(pout) : "v2", "memory");
+
+    expect[0] = buffer0[0];
+
+    check_output_w(__LINE__, 1);
+}
+
+static void test_load_unaligned(void)
+{
+    void *p0 = buffer0;
+    void *pout = output;
+    const size_t offset = 12;
+
+    p0 += offset;    /* Create an unaligned address */
+    asm("v2 = vmemu(%0 + #0)\n\t"
+        "vmem(%1 + #0) = v2\n\t"
+        : : "r"(p0), "r"(pout) : "v2", "memory");
+
+    memcpy(expect, &buffer0[0].ub[offset], sizeof(MMVector));
+
+    check_output_w(__LINE__, 1);
+}
+
+static void test_store_aligned(void)
+{
+    /* Aligned stores ignore the low bits of the address */
+    void *p0 = buffer0;
+    void *pout = output;
+    const size_t offset = 13;
+
+    pout += offset;    /* Create an unaligned address */
+    asm("v2 = vmem(%0 + #0)\n\t"
+        "vmem(%1 + #0) = v2\n\t"
+        : : "r"(p0), "r"(pout) : "v2", "memory");
+
+    expect[0] = buffer0[0];
+
+    check_output_w(__LINE__, 1);
+}
+
+static void test_store_unaligned(void)
+{
+    void *p0 = buffer0;
+    void *pout = output;
+    const size_t offset = 12;
+
+    pout += offset;    /* Create an unaligned address */
+    asm("v2 = vmem(%0 + #0)\n\t"
+        "vmemu(%1 + #0) = v2\n\t"
+        : : "r"(p0), "r"(pout) : "v2", "memory");
+
+    memcpy(expect, buffer0, 2 * sizeof(MMVector));
+    memcpy(&expect[0].ub[offset], buffer0, sizeof(MMVector));
+
+    check_output_w(__LINE__, 2);
+}
+
+static void test_masked_store(bool invert)
+{
+    void *p0 = buffer0;
+    void *pmask = mask;
+    void *pout = output;
+
+    memset(expect, 0xff, sizeof(expect));
+    memset(output, 0xff, sizeof(expect));
+
+    for (int i = 0; i < BUFSIZE; i++) {
+        if (invert) {
+            asm("r4 = #0\n\t"
+                "v4 = vsplat(r4)\n\t"
+                "v5 = vmem(%0 + #0)\n\t"
+                "q0 = vcmp.eq(v4.w, v5.w)\n\t"
+                "v5 = vmem(%1)\n\t"
+                "if (!q0) vmem(%2) = v5\n\t"             /* Inverted test */
+                : : "r"(pmask), "r"(p0), "r"(pout)
+                : "r4", "v4", "v5", "q0", "memory");
+        } else {
+            asm("r4 = #0\n\t"
+                "v4 = vsplat(r4)\n\t"
+                "v5 = vmem(%0 + #0)\n\t"
+                "q0 = vcmp.eq(v4.w, v5.w)\n\t"
+                "v5 = vmem(%1)\n\t"
+                "if (q0) vmem(%2) = v5\n\t"             /* Non-inverted test */
+                : : "r"(pmask), "r"(p0), "r"(pout)
+                : "r4", "v4", "v5", "q0", "memory");
+        }
+        p0 += sizeof(MMVector);
+        pmask += sizeof(MMVector);
+        pout += sizeof(MMVector);
+
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES / 4; j++) {
+            if (invert) {
+                if (i + j % MASKMOD != 0) {
+                    expect[i].w[j] = buffer0[i].w[j];
+                }
+            } else {
+                if (i + j % MASKMOD == 0) {
+                    expect[i].w[j] = buffer0[i].w[j];
+                }
+            }
+        }
+    }
+
+    check_output_w(__LINE__, BUFSIZE);
+}
+
+static void test_new_value_store(void)
+{
+    void *p0 = buffer0;
+    void *pout = output;
+
+    asm("{\n\t"
+        "    v2 = vmem(%0 + #0)\n\t"
+        "    vmem(%1 + #0) = v2.new\n\t"
+        "}\n\t"
+        : : "r"(p0), "r"(pout) : "v2", "memory");
+
+    expect[0] = buffer0[0];
+
+    check_output_w(__LINE__, 1);
+}
+
+static void test_max_temps(void)
+{
+    void *p0 = buffer0;
+    void *pout = output;
+
+    asm("v0 = vmem(%0 + #0)\n\t"
+        "v1 = vmem(%0 + #1)\n\t"
+        "v2 = vmem(%0 + #2)\n\t"
+        "v3 = vmem(%0 + #3)\n\t"
+        "v4 = vmem(%0 + #4)\n\t"
+        "{\n\t"
+        "    v1:0.w = vadd(v3:2.w, v1:0.w)\n\t"
+        "    v2.b = vshuffe(v3.b, v2.b)\n\t"
+        "    v3.w = vadd(v1.w, v4.w)\n\t"
+        "    v4.tmp = vmem(%0 + #5)\n\t"
+        "}\n\t"
+        "vmem(%1 + #0) = v0\n\t"
+        "vmem(%1 + #1) = v1\n\t"
+        "vmem(%1 + #2) = v2\n\t"
+        "vmem(%1 + #3) = v3\n\t"
+        "vmem(%1 + #4) = v4\n\t"
+        : : "r"(p0), "r"(pout) : "memory");
+
+    /* The first two vectors come from the vadd-pair instruction */
+    for (int i = 0; i < MAX_VEC_SIZE_BYTES / 4; i++) {
+        expect[0].w[i] = buffer0[0].w[i] + buffer0[2].w[i];
+        expect[1].w[i] = buffer0[1].w[i] + buffer0[3].w[i];
+    }
+    /* The third vector comes from the vshuffe instruction */
+    for (int i = 0; i < MAX_VEC_SIZE_BYTES / 2; i++) {
+        expect[2].uh[i] = (buffer0[2].uh[i] & 0xff) |
+                          (buffer0[3].uh[i] & 0xff) << 8;
+    }
+    /* The fourth vector comes from the vadd-single instruction */
+    for (int i = 0; i < MAX_VEC_SIZE_BYTES / 4; i++) {
+        expect[3].w[i] = buffer0[1].w[i] + buffer0[5].w[i];
+    }
+    /*
+     * The fifth vector comes from the load to v4
+     * make sure the .tmp is dropped
+     */
+    expect[4] = buffer0[4];
+
+    check_output_b(__LINE__, 5);
+}
+
+#define VEC_OP1(ASM, EL, IN, OUT) \
+    asm("v2 = vmem(%0 + #0)\n\t" \
+        "v2" #EL " = " #ASM "(v2" #EL ")\n\t" \
+        "vmem(%1 + #0) = v2\n\t" \
+        : : "r"(IN), "r"(OUT) : "v2", "memory")
+
+#define VEC_OP2(ASM, EL, IN0, IN1, OUT) \
+    asm("v2 = vmem(%0 + #0)\n\t" \
+        "v3 = vmem(%1 + #0)\n\t" \
+        "v2" #EL " = " #ASM "(v2" #EL ", v3" #EL ")\n\t" \
+        "vmem(%2 + #0) = v2\n\t" \
+        : : "r"(IN0), "r"(IN1), "r"(OUT) : "v2", "v3", "memory")
+
+#define TEST_VEC_OP1(NAME, ASM, EL, FIELD, FIELDSZ, OP) \
+static void test_##NAME(void) \
+{ \
+    void *pin = buffer0; \
+    void *pout = output; \
+    for (int i = 0; i < BUFSIZE; i++) { \
+        VEC_OP1(ASM, EL, pin, pout); \
+        pin += sizeof(MMVector); \
+        pout += sizeof(MMVector); \
+    } \
+    for (int i = 0; i < BUFSIZE; i++) { \
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES / FIELDSZ; j++) { \
+            expect[i].FIELD[j] = OP buffer0[i].FIELD[j]; \
+        } \
+    } \
+    check_output_##FIELD(__LINE__, BUFSIZE); \
+}
+
+#define TEST_VEC_OP2(NAME, ASM, EL, FIELD, FIELDSZ, OP) \
+static void test_##NAME(void) \
+{ \
+    void *p0 = buffer0; \
+    void *p1 = buffer1; \
+    void *pout = output; \
+    for (int i = 0; i < BUFSIZE; i++) { \
+        VEC_OP2(ASM, EL, p0, p1, pout); \
+        p0 += sizeof(MMVector); \
+        p1 += sizeof(MMVector); \
+        pout += sizeof(MMVector); \
+    } \
+    for (int i = 0; i < BUFSIZE; i++) { \
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES / FIELDSZ; j++) { \
+            expect[i].FIELD[j] = buffer0[i].FIELD[j] OP buffer1[i].FIELD[j]; \
+        } \
+    } \
+    check_output_##FIELD(__LINE__, BUFSIZE); \
+}
+
+#define THRESHOLD        31
+
+#define PRED_OP2(ASM, IN0, IN1, OUT, INV) \
+    asm("r4 = #%3\n\t" \
+        "v1.b = vsplat(r4)\n\t" \
+        "v2 = vmem(%0 + #0)\n\t" \
+        "q0 = vcmp.gt(v2.b, v1.b)\n\t" \
+        "v3 = vmem(%1 + #0)\n\t" \
+        "q1 = vcmp.gt(v3.b, v1.b)\n\t" \
+        "q2 = " #ASM "(q0, " INV "q1)\n\t" \
+        "r4 = #0xff\n\t" \
+        "v1.b = vsplat(r4)\n\t" \
+        "if (q2) vmem(%2 + #0) = v1\n\t" \
+        : : "r"(IN0), "r"(IN1), "r"(OUT), "i"(THRESHOLD) \
+        : "r4", "v1", "v2", "v3", "q0", "q1", "q2", "memory")
+
+#define TEST_PRED_OP2(NAME, ASM, OP, INV) \
+static void test_##NAME(bool invert) \
+{ \
+    void *p0 = buffer0; \
+    void *p1 = buffer1; \
+    void *pout = output; \
+    memset(output, 0, sizeof(expect)); \
+    for (int i = 0; i < BUFSIZE; i++) { \
+        PRED_OP2(ASM, p0, p1, pout, INV); \
+        p0 += sizeof(MMVector); \
+        p1 += sizeof(MMVector); \
+        pout += sizeof(MMVector); \
+    } \
+    for (int i = 0; i < BUFSIZE; i++) { \
+        for (int j = 0; j < MAX_VEC_SIZE_BYTES; j++) { \
+            bool p0 = (buffer0[i].b[j] > THRESHOLD); \
+            bool p1 = (buffer1[i].b[j] > THRESHOLD); \
+            if (invert) { \
+                expect[i].b[j] = (p0 OP !p1) ? 0xff : 0x00; \
+            } else { \
+                expect[i].b[j] = (p0 OP p1) ? 0xff : 0x00; \
+            } \
+        } \
+    } \
+    check_output_b(__LINE__, BUFSIZE); \
+}
+
+TEST_VEC_OP2(vadd_w, vadd, .w, w, 4, +)
+TEST_VEC_OP2(vadd_h, vadd, .h, h, 2, +)
+TEST_VEC_OP2(vadd_b, vadd, .b, b, 1, +)
+TEST_VEC_OP2(vsub_w, vsub, .w, w, 4, -)
+TEST_VEC_OP2(vsub_h, vsub, .h, h, 2, -)
+TEST_VEC_OP2(vsub_b, vsub, .b, b, 1, -)
+TEST_VEC_OP2(vxor, vxor, , d, 8, ^)
+TEST_VEC_OP2(vand, vand, , d, 8, &)
+TEST_VEC_OP2(vor, vor, , d, 8, |)
+TEST_VEC_OP1(vnot, vnot, , d, 8, ~)
+
+TEST_PRED_OP2(pred_or, or, |, "")
+TEST_PRED_OP2(pred_or_n, or, |, "!")
+TEST_PRED_OP2(pred_and, and, &, "")
+TEST_PRED_OP2(pred_and_n, and, &, "!")
+TEST_PRED_OP2(pred_xor, xor, ^, "")
+
+int main()
+{
+    init_buffers();
+
+    test_load_tmp();
+    test_load_cur();
+    test_load_aligned();
+    test_load_unaligned();
+    test_store_aligned();
+    test_store_unaligned();
+    test_masked_store(false);
+    test_masked_store(true);
+    test_new_value_store();
+    test_max_temps();
+
+    test_vadd_w();
+    test_vadd_h();
+    test_vadd_b();
+    test_vsub_w();
+    test_vsub_h();
+    test_vsub_b();
+    test_vxor();
+    test_vand();
+    test_vor();
+    test_vnot();
+
+    test_pred_or(false);
+    test_pred_or_n(true);
+    test_pred_and(false);
+    test_pred_and_n(true);
+    test_pred_xor(false);
+
+    puts(err ? "FAIL" : "PASS");
+    return err ? 1 : 0;
+}
diff --git a/tests/tcg/hexagon/Makefile.target b/tests/tcg/hexagon/Makefile.target
index b010edc..62916a5 100644
--- a/tests/tcg/hexagon/Makefile.target
+++ b/tests/tcg/hexagon/Makefile.target
@@ -41,7 +41,9 @@ HEX_TESTS += load_align
 HEX_TESTS += vector_add_int
 HEX_TESTS += atomics
 HEX_TESTS += fpstuff
+HEX_TESTS += hvx_misc
 
 TESTS += $(HEX_TESTS)
 
 vector_add_int: CFLAGS += -mhvx -fvectorize
+hvx_misc: CFLAGS += -mhvx
-- 
2.7.4



* [PATCH v4 29/30] Hexagon HVX (tests/tcg/hexagon) scatter_gather test
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (27 preceding siblings ...)
  2021-10-12 10:11 ` [PATCH v4 28/30] Hexagon HVX (tests/tcg/hexagon) hvx_misc test Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-29 19:13   ` Richard Henderson
  2021-10-12 10:11 ` [PATCH v4 30/30] Hexagon HVX (tests/tcg/hexagon) histogram test Taylor Simpson
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
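Illustrative note, not part of the patch: the scatter/gather tests below
place MATRIX_SIZE elements on the diagonal of a MATRIX_SIZE x MATRIX_SIZE
matrix by using byte offsets of i * (2 * MATRIX_SIZE + 2) for 16-bit
elements.  The following is a minimal scalar sketch of that addressing,
assuming plain C arrays in place of VTCM.  The names model_scatter_16,
model_gather_16 and diagonal_round_trip are hypothetical and do not appear
in the patch.

```c
#include <stddef.h>
#include <string.h>

#define MATRIX_SIZE 64
#define SCATTER_BUFFER_SIZE (MATRIX_SIZE * MATRIX_SIZE)
#define FILL_CHAR '.'

/* Scalar model of an unmasked scatter: 16-bit elements, byte offsets */
static void model_scatter_16(unsigned short *region,
                             const unsigned short *byte_offsets,
                             const unsigned short *values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        region[byte_offsets[i] / 2] = values[i];
    }
}

/* Scalar model of the matching gather */
static void model_gather_16(unsigned short *dst,
                            const unsigned short *region,
                            const unsigned short *byte_offsets, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = region[byte_offsets[i] / 2];
    }
}

/*
 * Scatter onto the matrix diagonal, gather back, and return the number
 * of mismatches (0 means the round trip preserved every value).
 */
int diagonal_round_trip(void)
{
    static unsigned short region[SCATTER_BUFFER_SIZE];
    unsigned short offsets[MATRIX_SIZE];
    unsigned short values[MATRIX_SIZE];
    unsigned short gathered[MATRIX_SIZE];
    int mismatches = 0;

    memset(region, FILL_CHAR, sizeof(region));
    for (int i = 0; i < MATRIX_SIZE; i++) {
        /* one row (2 * MATRIX_SIZE bytes) plus one element (2 bytes) */
        offsets[i] = i * (2 * MATRIX_SIZE + 2);
        values[i] = (unsigned short)(0x4100 + i);   /* arbitrary test data */
    }

    model_scatter_16(region, offsets, values, MATRIX_SIZE);
    model_gather_16(gathered, region, offsets, MATRIX_SIZE);

    for (int i = 0; i < MATRIX_SIZE; i++) {
        if (gathered[i] != values[i]) {
            mismatches++;
        }
        /* byte offset / 2 is element (row i, column i) */
        if (region[i * MATRIX_SIZE + i] != values[i]) {
            mismatches++;
        }
    }
    /* an off-diagonal element still holds the fill pattern */
    if (region[1] != (unsigned short)((FILL_CHAR << 8) | FILL_CHAR)) {
        mismatches++;
    }
    return mismatches;
}
```

Calling diagonal_round_trip() from a small main() should return 0.  The
largest offset is 63 * 130 = 8190 bytes, so it fits in an unsigned short
and stays inside the 4096-element buffer.
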
 tests/tcg/hexagon/scatter_gather.c | 1011 ++++++++++++++++++++++++++++++++++++
 tests/tcg/hexagon/Makefile.target  |    2 +
 2 files changed, 1013 insertions(+)
 create mode 100644 tests/tcg/hexagon/scatter_gather.c

diff --git a/tests/tcg/hexagon/scatter_gather.c b/tests/tcg/hexagon/scatter_gather.c
new file mode 100644
index 0000000..b93eb18
--- /dev/null
+++ b/tests/tcg/hexagon/scatter_gather.c
@@ -0,0 +1,1011 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * This example tests the HVX scatter/gather instructions
+ *
+ * See section 5.13 of the V68 HVX Programmer's Reference
+ *
+ * There are 3 main classes of operations
+ *     _16                 16-bit elements and 16-bit offsets
+ *     _32                 32-bit elements and 32-bit offsets
+ *     _16_32              16-bit elements and 32-bit offsets
+ *
+ * There are also masked and accumulate versions
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+typedef long HVX_Vector       __attribute__((__vector_size__(128)))
+                              __attribute__((aligned(128)));
+typedef long HVX_VectorPair   __attribute__((__vector_size__(256)))
+                              __attribute__((aligned(128)));
+typedef long HVX_VectorPred   __attribute__((__vector_size__(128)))
+                              __attribute__((aligned(128)));
+
+#define VSCATTER_16(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermh_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_MASKED(MASK, BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermhq_128B(MASK, (int)BASE, RGN, OFF, VALS)
+#define VSCATTER_32(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermw_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_32_MASKED(MASK, BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermwq_128B(MASK, (int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_32(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermhw_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_32_MASKED(MASK, BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermhwq_128B(MASK, (int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_ACC(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermh_add_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_32_ACC(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermw_add_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_32_ACC(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermhw_add_128B((int)BASE, RGN, OFF, VALS)
+
+#define VGATHER_16(DSTADDR, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermh_128B(DSTADDR, (int)BASE, RGN, OFF)
+#define VGATHER_16_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermhq_128B(DSTADDR, MASK, (int)BASE, RGN, OFF)
+#define VGATHER_32(DSTADDR, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermw_128B(DSTADDR, (int)BASE, RGN, OFF)
+#define VGATHER_32_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermwq_128B(DSTADDR, MASK, (int)BASE, RGN, OFF)
+#define VGATHER_16_32(DSTADDR, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermhw_128B(DSTADDR, (int)BASE, RGN, OFF)
+#define VGATHER_16_32_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermhwq_128B(DSTADDR, MASK, (int)BASE, RGN, OFF)
+
+#define VSHUFF_H(V) \
+    __builtin_HEXAGON_V6_vshuffh_128B(V)
+#define VSPLAT_H(X) \
+    __builtin_HEXAGON_V6_lvsplath_128B(X)
+#define VAND_VAL(PRED, VAL) \
+    __builtin_HEXAGON_V6_vandvrt_128B(PRED, VAL)
+#define VDEAL_H(V) \
+    __builtin_HEXAGON_V6_vdealh_128B(V)
+
+int err;
+
+/* define the number of rows/cols in a square matrix */
+#define MATRIX_SIZE 64
+
+/* define the size of the scatter buffer */
+#define SCATTER_BUFFER_SIZE (MATRIX_SIZE * MATRIX_SIZE)
+
+/* fake vtcm - put buffers together and force alignment */
+static struct {
+    unsigned short vscatter16[SCATTER_BUFFER_SIZE];
+    unsigned short vgather16[MATRIX_SIZE];
+    unsigned int   vscatter32[SCATTER_BUFFER_SIZE];
+    unsigned int   vgather32[MATRIX_SIZE];
+    unsigned short vscatter16_32[SCATTER_BUFFER_SIZE];
+    unsigned short vgather16_32[MATRIX_SIZE];
+} vtcm __attribute__((aligned(0x10000)));
+
+/* declare the arrays of reference values */
+unsigned short vscatter16_ref[SCATTER_BUFFER_SIZE];
+unsigned short vgather16_ref[MATRIX_SIZE];
+unsigned int   vscatter32_ref[SCATTER_BUFFER_SIZE];
+unsigned int   vgather32_ref[MATRIX_SIZE];
+unsigned short vscatter16_32_ref[SCATTER_BUFFER_SIZE];
+unsigned short vgather16_32_ref[MATRIX_SIZE];
+
+/* declare the arrays of offsets */
+unsigned short half_offsets[MATRIX_SIZE];
+unsigned int   word_offsets[MATRIX_SIZE];
+
+/* declare the arrays of values */
+unsigned short half_values[MATRIX_SIZE];
+unsigned short half_values_acc[MATRIX_SIZE];
+unsigned short half_values_masked[MATRIX_SIZE];
+unsigned int   word_values[MATRIX_SIZE];
+unsigned int   word_values_acc[MATRIX_SIZE];
+unsigned int   word_values_masked[MATRIX_SIZE];
+
+/* declare the arrays of predicates */
+unsigned short half_predicates[MATRIX_SIZE];
+unsigned int   word_predicates[MATRIX_SIZE];
+
+/* make this big enough for all the intrinsics */
+const size_t region_len = sizeof(vtcm);
+
+/* optionally add sync instructions */
+#define SYNC_VECTOR 1
+
+static void sync_scatter(void *addr)
+{
+#if SYNC_VECTOR
+    /*
+     * Do the scatter release followed by a dummy load to complete the
+     * synchronization.  Normally the dummy load would be deferred as
+     * long as possible to minimize stalls.
+     */
+    asm volatile("vmem(%0 + #0):scatter_release\n" : : "r"(addr));
+    /* use volatile to force the load */
+    volatile HVX_Vector vDummy = *(HVX_Vector *)addr; vDummy = vDummy;
+#endif
+}
+
+static void sync_gather(void *addr)
+{
+#if SYNC_VECTOR
+    /* use volatile to force the load */
+    volatile HVX_Vector vDummy = *(HVX_Vector *)addr; vDummy = vDummy;
+#endif
+}
+
+/* optionally print the results */
+#define PRINT_DATA 0
+
+#define FILL_CHAR       '.'
+
+/* fill vtcm scratch with FILL_CHAR */
+void prefill_vtcm_scratch(void)
+{
+    memset(&vtcm, FILL_CHAR, sizeof(vtcm));
+}
+
+/* create byte offsets to be a diagonal of the matrix with 16 bit elements */
+void create_offsets_values_preds_16(void)
+{
+    unsigned short half_element = 0;
+    unsigned short half_element_masked = 0;
+    char letter = 'A';
+    char letter_masked = '@';
+
+    for (int i = 0; i < MATRIX_SIZE; i++) {
+        half_offsets[i] = i * (2 * MATRIX_SIZE + 2);
+
+        half_element = 0;
+        half_element_masked = 0;
+        for (int j = 0; j < 2; j++) {
+            half_element |= letter << j * 8;
+            half_element_masked |= letter_masked << j * 8;
+        }
+
+        half_values[i] = half_element;
+        half_values_acc[i] = ((i % 10) << 8) + (i % 10);
+        half_values_masked[i] = half_element_masked;
+
+        letter++;
+        /* reset to 'A' */
+        if (letter == 'M') {
+            letter = 'A';
+        }
+
+        half_predicates[i] = (i % 3 == 0 || i % 5 == 0) ? ~0 : 0;
+    }
+}
+
+/* create byte offsets to be a diagonal of the matrix with 32 bit elements */
+void create_offsets_values_preds_32(void)
+{
+    unsigned int word_element = 0;
+    unsigned int word_element_masked = 0;
+    char letter = 'A';
+    char letter_masked = '&';
+
+    for (int i = 0; i < MATRIX_SIZE; i++) {
+        word_offsets[i] = i * (4 * MATRIX_SIZE + 4);
+
+        word_element = 0;
+        word_element_masked = 0;
+        for (int j = 0; j < 4; j++) {
+            word_element |= letter << j * 8;
+            word_element_masked |= letter_masked << j * 8;
+        }
+
+        word_values[i] = word_element;
+        word_values_acc[i] = ((i % 10) << 8) + (i % 10);
+        word_values_masked[i] = word_element_masked;
+
+        letter++;
+        /* reset to 'A' */
+        if (letter == 'M') {
+            letter = 'A';
+        }
+
+        word_predicates[i] = (i % 4 == 0 || i % 7 == 0) ? ~0 : 0;
+    }
+}
+
+/*
+ * create byte offsets to be a diagonal of the matrix with 16 bit elements
+ * and 32 bit offsets
+ */
+void create_offsets_values_preds_16_32(void)
+{
+    unsigned short half_element = 0;
+    unsigned short half_element_masked = 0;
+    char letter = 'D';
+    char letter_masked = '$';
+
+    for (int i = 0; i < MATRIX_SIZE; i++) {
+        word_offsets[i] = i * (2 * MATRIX_SIZE + 2);
+
+        half_element = 0;
+        half_element_masked = 0;
+        for (int j = 0; j < 2; j++) {
+            half_element |= letter << j * 8;
+            half_element_masked |= letter_masked << j * 8;
+        }
+
+        half_values[i] = half_element;
+        half_values_acc[i] = ((i % 10) << 8) + (i % 10);
+        half_values_masked[i] = half_element_masked;
+
+        letter++;
+        /* reset to 'D' */
+        if (letter == 'P') {
+            letter = 'D';
+        }
+
+        half_predicates[i] = (i % 2 == 0 || i % 13 == 0) ? ~0 : 0;
+    }
+}
+
+/* scatter the 16 bit elements using intrinsics */
+void vector_scatter_16(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
+    HVX_Vector values = *(HVX_Vector *)half_values;
+
+    VSCATTER_16(&vtcm.vscatter16, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16);
+}
+
+/* scatter-accumulate the 16 bit elements using intrinsics */
+void vector_scatter_16_acc(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
+    HVX_Vector values = *(HVX_Vector *)half_values_acc;
+
+    VSCATTER_16_ACC(&vtcm.vscatter16, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16);
+}
+
+/* masked scatter the 16 bit elements using intrinsics */
+void vector_scatter_16_masked(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
+    HVX_Vector values = *(HVX_Vector *)half_values_masked;
+    HVX_Vector pred_reg = *(HVX_Vector *)half_predicates;
+    HVX_VectorPred preds = VAND_VAL(pred_reg, ~0);
+
+    VSCATTER_16_MASKED(preds, &vtcm.vscatter16, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16);
+}
+
+/* scatter the 32 bit elements using intrinsics */
+void vector_scatter_32(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector valueslo = *(HVX_Vector *)word_values;
+    HVX_Vector valueshi = *(HVX_Vector *)&word_values[MATRIX_SIZE / 2];
+
+    VSCATTER_32(&vtcm.vscatter32, region_len, offsetslo, valueslo);
+    VSCATTER_32(&vtcm.vscatter32, region_len, offsetshi, valueshi);
+
+    sync_scatter(vtcm.vscatter32);
+}
+
+/* scatter-acc the 32 bit elements using intrinsics */
+void vector_scatter_32_acc(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector valueslo = *(HVX_Vector *)word_values_acc;
+    HVX_Vector valueshi = *(HVX_Vector *)&word_values_acc[MATRIX_SIZE / 2];
+
+    VSCATTER_32_ACC(&vtcm.vscatter32, region_len, offsetslo, valueslo);
+    VSCATTER_32_ACC(&vtcm.vscatter32, region_len, offsetshi, valueshi);
+
+    sync_scatter(vtcm.vscatter32);
+}
+
+/* masked scatter the 32 bit elements using intrinsics */
+void vector_scatter_32_masked(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector valueslo = *(HVX_Vector *)word_values_masked;
+    HVX_Vector valueshi = *(HVX_Vector *)&word_values_masked[MATRIX_SIZE / 2];
+    HVX_Vector pred_reglo = *(HVX_Vector *)word_predicates;
+    HVX_Vector pred_reghi = *(HVX_Vector *)&word_predicates[MATRIX_SIZE / 2];
+    HVX_VectorPred predslo = VAND_VAL(pred_reglo, ~0);
+    HVX_VectorPred predshi = VAND_VAL(pred_reghi, ~0);
+
+    VSCATTER_32_MASKED(predslo, &vtcm.vscatter32, region_len, offsetslo,
+                       valueslo);
+    VSCATTER_32_MASKED(predshi, &vtcm.vscatter32, region_len, offsetshi,
+                       valueshi);
+
+    sync_scatter(vtcm.vscatter32);
+}
+
+/* scatter the 16 bit elements with 32 bit offsets using intrinsics */
+void vector_scatter_16_32(void)
+{
+    HVX_VectorPair offsets;
+    HVX_Vector values;
+
+    /* get the word offsets in a vector pair */
+    offsets = *(HVX_VectorPair *)word_offsets;
+
+    /* these values need to be shuffled for the scatter */
+    values = *(HVX_Vector *)half_values;
+    values = VSHUFF_H(values);
+
+    VSCATTER_16_32(&vtcm.vscatter16_32, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16_32);
+}
+
+/* scatter-acc the 16 bit elements with 32 bit offsets using intrinsics */
+void vector_scatter_16_32_acc(void)
+{
+    HVX_VectorPair offsets;
+    HVX_Vector values;
+
+    /* get the word offsets in a vector pair */
+    offsets = *(HVX_VectorPair *)word_offsets;
+
+    /* these values need to be shuffled for the scatter */
+    values = *(HVX_Vector *)half_values_acc;
+    values = VSHUFF_H(values);
+
+    VSCATTER_16_32_ACC(&vtcm.vscatter16_32, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16_32);
+}
+
+/* masked scatter the 16 bit elements with 32 bit offsets using intrinsics */
+void vector_scatter_16_32_masked(void)
+{
+    HVX_VectorPair offsets;
+    HVX_Vector values;
+    HVX_Vector pred_reg;
+
+    /* get the word offsets in a vector pair */
+    offsets = *(HVX_VectorPair *)word_offsets;
+
+    /* these values need to be shuffled for the scatter */
+    values = *(HVX_Vector *)half_values_masked;
+    values = VSHUFF_H(values);
+
+    pred_reg = *(HVX_Vector *)half_predicates;
+    pred_reg = VSHUFF_H(pred_reg);
+    HVX_VectorPred preds = VAND_VAL(pred_reg, ~0);
+
+    VSCATTER_16_32_MASKED(preds, &vtcm.vscatter16_32, region_len, offsets,
+                          values);
+
+    sync_scatter(vtcm.vscatter16_32);
+}
+
+/* gather the elements from the scatter16 buffer */
+void vector_gather_16(void)
+{
+    HVX_Vector *vgather = (HVX_Vector *)&vtcm.vgather16;
+    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
+
+    VGATHER_16(vgather, &vtcm.vscatter16, region_len, offsets);
+
+    sync_gather(vgather);
+}
+
+static unsigned short gather_16_masked_init(void)
+{
+    char letter = '?';
+    return letter | (letter << 8);
+}
+
+void vector_gather_16_masked(void)
+{
+    HVX_Vector *vgather = (HVX_Vector *)&vtcm.vgather16;
+    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
+    HVX_Vector pred_reg = *(HVX_Vector *)half_predicates;
+    HVX_VectorPred preds = VAND_VAL(pred_reg, ~0);
+
+    *vgather = VSPLAT_H(gather_16_masked_init());
+    VGATHER_16_MASKED(vgather, preds, &vtcm.vscatter16, region_len, offsets);
+
+    sync_gather(vgather);
+}
+
+/* gather the elements from the scatter32 buffer */
+void vector_gather_32(void)
+{
+    HVX_Vector *vgatherlo = (HVX_Vector *)&vtcm.vgather32;
+    HVX_Vector *vgatherhi =
+        (HVX_Vector *)((int)&vtcm.vgather32 + (MATRIX_SIZE * 2));
+    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+
+    VGATHER_32(vgatherlo, &vtcm.vscatter32, region_len, offsetslo);
+    VGATHER_32(vgatherhi, &vtcm.vscatter32, region_len, offsetshi);
+
+    sync_gather(vgatherhi);
+}
+
+static unsigned int gather_32_masked_init(void)
+{
+    char letter = '?';
+    return letter | (letter << 8) | (letter << 16) | (letter << 24);
+}
+
+void vector_gather_32_masked(void)
+{
+    HVX_Vector *vgatherlo = (HVX_Vector *)&vtcm.vgather32;
+    HVX_Vector *vgatherhi =
+        (HVX_Vector *)((int)&vtcm.vgather32 + (MATRIX_SIZE * 2));
+    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector pred_reglo = *(HVX_Vector *)word_predicates;
+    HVX_VectorPred predslo = VAND_VAL(pred_reglo, ~0);
+    HVX_Vector pred_reghi = *(HVX_Vector *)&word_predicates[MATRIX_SIZE / 2];
+    HVX_VectorPred predshi = VAND_VAL(pred_reghi, ~0);
+
+    *vgatherlo = VSPLAT_H(gather_32_masked_init());
+    *vgatherhi = VSPLAT_H(gather_32_masked_init());
+    VGATHER_32_MASKED(vgatherlo, predslo, &vtcm.vscatter32, region_len,
+                      offsetslo);
+    VGATHER_32_MASKED(vgatherhi, predshi, &vtcm.vscatter32, region_len,
+                      offsetshi);
+
+    sync_gather(vgatherlo);
+    sync_gather(vgatherhi);
+}
+
+/* gather the elements from the scatter16_32 buffer */
+void vector_gather_16_32(void)
+{
+    HVX_Vector *vgather;
+    HVX_VectorPair offsets;
+    HVX_Vector values;
+
+    /* get the vtcm address to gather into */
+    vgather = (HVX_Vector *)&vtcm.vgather16_32;
+
+    /* get the word offsets in a vector pair */
+    offsets = *(HVX_VectorPair *)word_offsets;
+
+    VGATHER_16_32(vgather, &vtcm.vscatter16_32, region_len, offsets);
+
+    /* deal the elements to get the order back */
+    values = *(HVX_Vector *)vgather;
+    values = VDEAL_H(values);
+
+    /* write it back to vtcm address */
+    *(HVX_Vector *)vgather = values;
+}
+
+void vector_gather_16_32_masked(void)
+{
+    HVX_Vector *vgather;
+    HVX_VectorPair offsets;
+    HVX_Vector pred_reg;
+    HVX_VectorPred preds;
+    HVX_Vector values;
+
+    /* get the vtcm address to gather into */
+    vgather = (HVX_Vector *)&vtcm.vgather16_32;
+
+    /* get the word offsets in a vector pair */
+    offsets = *(HVX_VectorPair *)word_offsets;
+    pred_reg = *(HVX_Vector *)half_predicates;
+    pred_reg = VSHUFF_H(pred_reg);
+    preds = VAND_VAL(pred_reg, ~0);
+
+    *vgather = VSPLAT_H(gather_16_masked_init());
+    VGATHER_16_32_MASKED(vgather, preds, &vtcm.vscatter16_32, region_len,
+                         offsets);
+
+    /* deal the elements to get the order back */
+    values = *(HVX_Vector *)vgather;
+    values = VDEAL_H(values);
+
+    /* write it back to vtcm address */
+    *(HVX_Vector *)vgather = values;
+}
+
+static void check_buffer(const char *name, void *c, void *r, size_t size)
+{
+    char *check = (char *)c;
+    char *ref = (char *)r;
+    for (int i = 0; i < size; i++) {
+        if (check[i] != ref[i]) {
+            printf("ERROR %s [%d]: 0x%x (%c) != 0x%x (%c)\n", name, i,
+                   check[i], check[i], ref[i], ref[i]);
+            err++;
+        }
+    }
+}
+
+/*
+ * These scalar functions are the C equivalents of the vector functions that
+ * use HVX
+ */
+
+/* scatter the 16 bit elements using C */
+void scalar_scatter_16(unsigned short *vscatter16)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vscatter16[half_offsets[i] / 2] = half_values[i];
+    }
+}
+
+void check_scatter_16()
+{
+    memset(vscatter16_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16(vscatter16_ref);
+    check_buffer(__func__, vtcm.vscatter16, vscatter16_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* scatter-accumulate the 16 bit elements using C */
+void scalar_scatter_16_acc(unsigned short *vscatter16)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vscatter16[half_offsets[i] / 2] += half_values_acc[i];
+    }
+}
+
+void check_scatter_16_acc()
+{
+    memset(vscatter16_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16(vscatter16_ref);
+    scalar_scatter_16_acc(vscatter16_ref);
+    check_buffer(__func__, vtcm.vscatter16, vscatter16_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* masked scatter the 16 bit elements using C */
+void scalar_scatter_16_masked(unsigned short *vscatter16)
+{
+    for (int i = 0; i < MATRIX_SIZE; i++) {
+        if (half_predicates[i]) {
+            vscatter16[half_offsets[i] / 2] = half_values_masked[i];
+        }
+    }
+
+}
+
+void check_scatter_16_masked()
+{
+    memset(vscatter16_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16(vscatter16_ref);
+    scalar_scatter_16_acc(vscatter16_ref);
+    scalar_scatter_16_masked(vscatter16_ref);
+    check_buffer(__func__, vtcm.vscatter16, vscatter16_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* scatter the 32 bit elements using C */
+void scalar_scatter_32(unsigned int *vscatter32)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vscatter32[word_offsets[i] / 4] = word_values[i];
+    }
+}
+
+void check_scatter_32()
+{
+    memset(vscatter32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+    scalar_scatter_32(vscatter32_ref);
+    check_buffer(__func__, vtcm.vscatter32, vscatter32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+}
+
+/* scatter-accumulate the 32 bit elements using C */
+void scalar_scatter_32_acc(unsigned int *vscatter32)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vscatter32[word_offsets[i] / 4] += word_values_acc[i];
+    }
+}
+
+void check_scatter_32_acc()
+{
+    memset(vscatter32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+    scalar_scatter_32(vscatter32_ref);
+    scalar_scatter_32_acc(vscatter32_ref);
+    check_buffer(__func__, vtcm.vscatter32, vscatter32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+}
+
+/* masked scatter the 32 bit elements using C */
+void scalar_scatter_32_masked(unsigned int *vscatter32)
+{
+    for (int i = 0; i < MATRIX_SIZE; i++) {
+        if (word_predicates[i]) {
+            vscatter32[word_offsets[i] / 4] = word_values_masked[i];
+        }
+    }
+}
+
+void check_scatter_32_masked()
+{
+    memset(vscatter32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+    scalar_scatter_32(vscatter32_ref);
+    scalar_scatter_32_acc(vscatter32_ref);
+    scalar_scatter_32_masked(vscatter32_ref);
+    check_buffer(__func__, vtcm.vscatter32, vscatter32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+}
+
+/* scatter the 16 bit elements with 32 bit offsets using C */
+void scalar_scatter_16_32(unsigned short *vscatter16_32)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vscatter16_32[word_offsets[i] / 2] = half_values[i];
+    }
+}
+
+void check_scatter_16_32()
+{
+    memset(vscatter16_32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16_32(vscatter16_32_ref);
+    check_buffer(__func__, vtcm.vscatter16_32, vscatter16_32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* scatter-accumulate the 16 bit elements with 32 bit offsets using C */
+void scalar_scatter_16_32_acc(unsigned short *vscatter16_32)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vscatter16_32[word_offsets[i] / 2] += half_values_acc[i];
+    }
+}
+
+void check_scatter_16_32_acc()
+{
+    memset(vscatter16_32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16_32(vscatter16_32_ref);
+    scalar_scatter_16_32_acc(vscatter16_32_ref);
+    check_buffer(__func__, vtcm.vscatter16_32, vscatter16_32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+void scalar_scatter_16_32_masked(unsigned short *vscatter16_32)
+{
+    for (int i = 0; i < MATRIX_SIZE; i++) {
+        if (half_predicates[i]) {
+            vscatter16_32[word_offsets[i] / 2] = half_values_masked[i];
+        }
+    }
+}
+
+void check_scatter_16_32_masked()
+{
+    memset(vscatter16_32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16_32(vscatter16_32_ref);
+    scalar_scatter_16_32_acc(vscatter16_32_ref);
+    scalar_scatter_16_32_masked(vscatter16_32_ref);
+    check_buffer(__func__, vtcm.vscatter16_32, vscatter16_32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* gather the elements from the scatter buffer using C */
+void scalar_gather_16(unsigned short *vgather16)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vgather16[i] = vtcm.vscatter16[half_offsets[i] / 2];
+    }
+}
+
+void check_gather_16()
+{
+    memset(vgather16_ref, 0, MATRIX_SIZE * sizeof(unsigned short));
+    scalar_gather_16(vgather16_ref);
+    check_buffer(__func__, vtcm.vgather16, vgather16_ref,
+                 MATRIX_SIZE * sizeof(unsigned short));
+}
+
+void scalar_gather_16_masked(unsigned short *vgather16)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        if (half_predicates[i]) {
+            vgather16[i] = vtcm.vscatter16[half_offsets[i] / 2];
+        }
+    }
+}
+
+void check_gather_16_masked()
+{
+    memset(vgather16_ref, gather_16_masked_init(),
+           MATRIX_SIZE * sizeof(unsigned short));
+    scalar_gather_16_masked(vgather16_ref);
+    check_buffer(__func__, vtcm.vgather16, vgather16_ref,
+                 MATRIX_SIZE * sizeof(unsigned short));
+}
+
+/* gather the elements from the scatter buffer using C */
+void scalar_gather_32(unsigned int *vgather32)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vgather32[i] = vtcm.vscatter32[word_offsets[i] / 4];
+    }
+}
+
+void check_gather_32(void)
+{
+    memset(vgather32_ref, 0, MATRIX_SIZE * sizeof(unsigned int));
+    scalar_gather_32(vgather32_ref);
+    check_buffer(__func__, vtcm.vgather32, vgather32_ref,
+                 MATRIX_SIZE * sizeof(unsigned int));
+}
+
+void scalar_gather_32_masked(unsigned int *vgather32)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        if (word_predicates[i]) {
+            vgather32[i] = vtcm.vscatter32[word_offsets[i] / 4];
+        }
+    }
+}
+
+
+void check_gather_32_masked(void)
+{
+    memset(vgather32_ref, gather_32_masked_init(),
+           MATRIX_SIZE * sizeof(unsigned int));
+    scalar_gather_32_masked(vgather32_ref);
+    check_buffer(__func__, vtcm.vgather32,
+                 vgather32_ref, MATRIX_SIZE * sizeof(unsigned int));
+}
+
+/* gather the elements from the scatter buffer using C */
+void scalar_gather_16_32(unsigned short *vgather16_32)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        vgather16_32[i] = vtcm.vscatter16_32[word_offsets[i] / 2];
+    }
+}
+
+void check_gather_16_32(void)
+{
+    memset(vgather16_32_ref, 0, MATRIX_SIZE * sizeof(unsigned short));
+    scalar_gather_16_32(vgather16_32_ref);
+    check_buffer(__func__, vtcm.vgather16_32, vgather16_32_ref,
+                 MATRIX_SIZE * sizeof(unsigned short));
+}
+
+void scalar_gather_16_32_masked(unsigned short *vgather16_32)
+{
+    for (int i = 0; i < MATRIX_SIZE; ++i) {
+        if (half_predicates[i]) {
+            vgather16_32[i] = vtcm.vscatter16_32[word_offsets[i] / 2];
+        }
+    }
+
+}
+
+void check_gather_16_32_masked(void)
+{
+    memset(vgather16_32_ref, gather_16_masked_init(),
+           MATRIX_SIZE * sizeof(unsigned short));
+    scalar_gather_16_32_masked(vgather16_32_ref);
+    check_buffer(__func__, vtcm.vgather16_32, vgather16_32_ref,
+                 MATRIX_SIZE * sizeof(unsigned short));
+}
+
+/* print scatter16 buffer */
+void print_scatter16_buffer(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 16 bit scatter buffer");
+
+        for (int i = 0; i < SCATTER_BUFFER_SIZE; i++) {
+            if ((i % MATRIX_SIZE) == 0) {
+                printf("\n");
+            }
+            for (int j = 0; j < 2; j++) {
+                printf("%c", (char)((vtcm.vscatter16[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the gather 16 buffer */
+void print_gather_result_16(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 16 bit gather result\n");
+
+        for (int i = 0; i < MATRIX_SIZE; i++) {
+            for (int j = 0; j < 2; j++) {
+                printf("%c", (char)((vtcm.vgather16[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the scatter32 buffer */
+void print_scatter32_buffer(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 32 bit scatter buffer");
+
+        for (int i = 0; i < SCATTER_BUFFER_SIZE; i++) {
+            if ((i % MATRIX_SIZE) == 0) {
+                printf("\n");
+            }
+            for (int j = 0; j < 4; j++) {
+                printf("%c", (char)((vtcm.vscatter32[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the gather 32 buffer */
+void print_gather_result_32(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 32 bit gather result\n");
+
+        for (int i = 0; i < MATRIX_SIZE; i++) {
+            for (int j = 0; j < 4; j++) {
+                printf("%c", (char)((vtcm.vgather32[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the scatter16_32 buffer */
+void print_scatter16_32_buffer(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 16_32 bit scatter buffer");
+
+        for (int i = 0; i < SCATTER_BUFFER_SIZE; i++) {
+            if ((i % MATRIX_SIZE) == 0) {
+                printf("\n");
+            }
+            for (int j = 0; j < 2; j++) {
+                printf("%c",
+                      (unsigned char)((vtcm.vscatter16_32[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the gather 16_32 buffer */
+void print_gather_result_16_32(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 16_32 bit gather result\n");
+
+        for (int i = 0; i < MATRIX_SIZE; i++) {
+            for (int j = 0; j < 2; j++) {
+                printf("%c",
+                       (unsigned char)((vtcm.vgather16_32[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+int main()
+{
+    prefill_vtcm_scratch();
+
+    /* 16 bit elements with 16 bit offsets */
+    create_offsets_values_preds_16();
+
+    vector_scatter_16();
+    print_scatter16_buffer();
+    check_scatter_16();
+
+    vector_gather_16();
+    print_gather_result_16();
+    check_gather_16();
+
+    vector_gather_16_masked();
+    print_gather_result_16();
+    check_gather_16_masked();
+
+    vector_scatter_16_acc();
+    print_scatter16_buffer();
+    check_scatter_16_acc();
+
+    vector_scatter_16_masked();
+    print_scatter16_buffer();
+    check_scatter_16_masked();
+
+    /* 32 bit elements with 32 bit offsets */
+    create_offsets_values_preds_32();
+
+    vector_scatter_32();
+    print_scatter32_buffer();
+    check_scatter_32();
+
+    vector_gather_32();
+    print_gather_result_32();
+    check_gather_32();
+
+    vector_gather_32_masked();
+    print_gather_result_32();
+    check_gather_32_masked();
+
+    vector_scatter_32_acc();
+    print_scatter32_buffer();
+    check_scatter_32_acc();
+
+    vector_scatter_32_masked();
+    print_scatter32_buffer();
+    check_scatter_32_masked();
+
+    /* 16 bit elements with 32 bit offsets */
+    create_offsets_values_preds_16_32();
+
+    vector_scatter_16_32();
+    print_scatter16_32_buffer();
+    check_scatter_16_32();
+
+    vector_gather_16_32();
+    print_gather_result_16_32();
+    check_gather_16_32();
+
+    vector_gather_16_32_masked();
+    print_gather_result_16_32();
+    check_gather_16_32_masked();
+
+    vector_scatter_16_32_acc();
+    print_scatter16_32_buffer();
+    check_scatter_16_32_acc();
+
+    vector_scatter_16_32_masked();
+    print_scatter16_32_buffer();
+    check_scatter_16_32_masked();
+
+    puts(err ? "FAIL" : "PASS");
+    return err;
+}
diff --git a/tests/tcg/hexagon/Makefile.target b/tests/tcg/hexagon/Makefile.target
index 62916a5..c4ccc99 100644
--- a/tests/tcg/hexagon/Makefile.target
+++ b/tests/tcg/hexagon/Makefile.target
@@ -39,11 +39,13 @@ HEX_TESTS += brev
 HEX_TESTS += load_unpack
 HEX_TESTS += load_align
 HEX_TESTS += vector_add_int
+HEX_TESTS += scatter_gather
 HEX_TESTS += atomics
 HEX_TESTS += fpstuff
 HEX_TESTS += hvx_misc
 
 TESTS += $(HEX_TESTS)
 
+scatter_gather: CFLAGS += -mhvx
 vector_add_int: CFLAGS += -mhvx -fvectorize
 hvx_misc: CFLAGS += -mhvx
-- 
2.7.4



* [PATCH v4 30/30] Hexagon HVX (tests/tcg/hexagon) histogram test
  2021-10-12 10:10 [PATCH v4 00/30] Hexagon HVX (target/hexagon) patch series Taylor Simpson
                   ` (28 preceding siblings ...)
  2021-10-12 10:11 ` [PATCH v4 29/30] Hexagon HVX (tests/tcg/hexagon) scatter_gather test Taylor Simpson
@ 2021-10-12 10:11 ` Taylor Simpson
  2021-10-29 19:15   ` Richard Henderson
  29 siblings, 1 reply; 45+ messages in thread
From: Taylor Simpson @ 2021-10-12 10:11 UTC (permalink / raw)
  To: qemu-devel; +Cc: ale, bcain, tsimpson, richard.henderson, f4bug

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 tests/tcg/hexagon/hvx_histogram_input.h | 717 ++++++++++++++++++++++++++++++++
 tests/tcg/hexagon/hvx_histogram_row.h   |  24 ++
 tests/tcg/hexagon/hvx_histogram.c       |  88 ++++
 tests/tcg/hexagon/Makefile.target       |   5 +
 tests/tcg/hexagon/hvx_histogram_row.S   | 294 +++++++++++++
 5 files changed, 1128 insertions(+)
 create mode 100644 tests/tcg/hexagon/hvx_histogram_input.h
 create mode 100644 tests/tcg/hexagon/hvx_histogram_row.h
 create mode 100644 tests/tcg/hexagon/hvx_histogram.c
 create mode 100644 tests/tcg/hexagon/hvx_histogram_row.S

diff --git a/tests/tcg/hexagon/hvx_histogram_input.h b/tests/tcg/hexagon/hvx_histogram_input.h
new file mode 100644
index 0000000..2f91092
--- /dev/null
+++ b/tests/tcg/hexagon/hvx_histogram_input.h
@@ -0,0 +1,717 @@
+/*
+ *  Copyright(c) 2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+    { 0x26, 0x32, 0x2e, 0x2e, 0x2d, 0x2c, 0x2d, 0x2d,
+      0x2c, 0x2e, 0x31, 0x33, 0x36, 0x39, 0x3b, 0x3f,
+      0x42, 0x46, 0x4a, 0x4c, 0x51, 0x53, 0x53, 0x54,
+      0x56, 0x57, 0x58, 0x57, 0x56, 0x52, 0x51, 0x4f,
+      0x4c, 0x49, 0x47, 0x42, 0x3e, 0x3b, 0x38, 0x35,
+      0x33, 0x30, 0x2e, 0x2c, 0x2b, 0x2a, 0x2a, 0x28,
+      0x28, 0x27, 0x27, 0x28, 0x29, 0x2a, 0x2c, 0x2e,
+      0x2f, 0x33, 0x36, 0x38, 0x3c, 0x3d, 0x40, 0x42,
+      0x43, 0x42, 0x43, 0x44, 0x43, 0x41, 0x40, 0x3b,
+      0x3b, 0x3a, 0x38, 0x35, 0x32, 0x2f, 0x2c, 0x29,
+      0x27, 0x26, 0x23, 0x21, 0x1e, 0x1c, 0x1a, 0x19,
+      0x17, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x10,
+      0x0f, 0x0e, 0x0f, 0x0f, 0x0e, 0x0d, 0x0d, 0x0d,
+      0x0c, 0x0d, 0x0e, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c,
+      0x0c, 0x0c, 0x0d, 0x0c, 0x0f, 0x0e, 0x0f, 0x0f,
+      0x0f, 0x10, 0x11, 0x12, 0x14, 0x16, 0x17, 0x19,
+      0x1c, 0x1d, 0x21, 0x25, 0x27, 0x29, 0x2b, 0x2f,
+      0x31, 0x33, 0x36, 0x38, 0x39, 0x3a, 0x3b, 0x3c,
+      0x3c, 0x3d, 0x3e, 0x3e, 0x3c, 0x3b, 0x3a, 0x39,
+      0x39, 0x3a, 0x3a, 0x3a, 0x3a, 0x3c, 0x3e, 0x43,
+      0x47, 0x4a, 0x4d, 0x51, 0x51, 0x54, 0x56, 0x56,
+      0x57, 0x56, 0x53, 0x4f, 0x4b, 0x47, 0x43, 0x41,
+      0x3e, 0x3c, 0x3a, 0x37, 0x36, 0x33, 0x32, 0x34,
+      0x34, 0x34, 0x34, 0x35, 0x36, 0x39, 0x3d, 0x3d,
+      0x3f, 0x40, 0x40, 0x40, 0x40, 0x3e, 0x40, 0x40,
+      0x42, 0x44, 0x47, 0x48, 0x4b, 0x4e, 0x56, 0x5c,
+      0x62, 0x68, 0x6f, 0x73, 0x76, 0x79, 0x7a, 0x7c,
+      0x7e, 0x7c, 0x78, 0x72, 0x6e, 0x69, 0x65, 0x60,
+      0x5b, 0x56, 0x52, 0x4d, 0x4a, 0x48, 0x47, 0x46,
+      0x44, 0x43, 0x42, 0x41, 0x41, 0x41, 0x40, 0x40,
+      0x3f, 0x3e, 0x3d, 0x3c, 0x3b, 0x3b, 0x38, 0x37,
+      0x36, 0x35, 0x36, 0x35, 0x36, 0x37, 0x38, 0x3c,
+      0x3d, 0x3f, 0x42, 0x44, 0x46, 0x48, 0x4b, 0x4c,
+      0x4e, 0x4e, 0x4d, 0x4c, 0x4a, 0x48, 0x49, 0x49,
+      0x4b, 0x4d, 0x4e, },
+    { 0x23, 0x2d, 0x29, 0x29, 0x28, 0x28, 0x29, 0x29,
+      0x28, 0x2b, 0x2d, 0x2f, 0x32, 0x34, 0x36, 0x3a,
+      0x3d, 0x41, 0x44, 0x47, 0x4a, 0x4c, 0x4e, 0x4e,
+      0x50, 0x51, 0x51, 0x51, 0x4f, 0x4c, 0x4b, 0x48,
+      0x46, 0x44, 0x40, 0x3d, 0x39, 0x36, 0x34, 0x30,
+      0x2f, 0x2d, 0x2a, 0x29, 0x28, 0x27, 0x26, 0x25,
+      0x25, 0x24, 0x24, 0x24, 0x26, 0x28, 0x28, 0x2a,
+      0x2b, 0x2e, 0x32, 0x34, 0x37, 0x39, 0x3b, 0x3c,
+      0x3d, 0x3d, 0x3e, 0x3e, 0x3e, 0x3c, 0x3b, 0x38,
+      0x37, 0x35, 0x33, 0x30, 0x2e, 0x2b, 0x27, 0x25,
+      0x24, 0x21, 0x20, 0x1d, 0x1b, 0x1a, 0x18, 0x16,
+      0x15, 0x14, 0x13, 0x12, 0x10, 0x11, 0x10, 0x0e,
+      0x0e, 0x0d, 0x0d, 0x0d, 0x0d, 0x0c, 0x0c, 0x0b,
+      0x0b, 0x0b, 0x0c, 0x0b, 0x0b, 0x09, 0x0a, 0x0b,
+      0x0b, 0x0a, 0x0a, 0x0c, 0x0c, 0x0c, 0x0d, 0x0e,
+      0x0e, 0x0f, 0x0f, 0x11, 0x12, 0x15, 0x15, 0x17,
+      0x1a, 0x1c, 0x1f, 0x22, 0x25, 0x26, 0x29, 0x2a,
+      0x2d, 0x30, 0x33, 0x34, 0x35, 0x35, 0x37, 0x37,
+      0x39, 0x3a, 0x39, 0x38, 0x37, 0x36, 0x36, 0x37,
+      0x35, 0x36, 0x35, 0x35, 0x36, 0x37, 0x3a, 0x3e,
+      0x40, 0x43, 0x48, 0x49, 0x4b, 0x4c, 0x4d, 0x4e,
+      0x4f, 0x4f, 0x4c, 0x48, 0x45, 0x41, 0x3e, 0x3b,
+      0x3a, 0x37, 0x36, 0x33, 0x32, 0x31, 0x30, 0x31,
+      0x32, 0x31, 0x31, 0x31, 0x31, 0x34, 0x37, 0x38,
+      0x3a, 0x3b, 0x3b, 0x3b, 0x3c, 0x3b, 0x3d, 0x3e,
+      0x3f, 0x40, 0x43, 0x44, 0x47, 0x4b, 0x4f, 0x56,
+      0x5a, 0x60, 0x66, 0x69, 0x6a, 0x6e, 0x71, 0x72,
+      0x73, 0x72, 0x6d, 0x69, 0x66, 0x60, 0x5c, 0x59,
+      0x54, 0x50, 0x4d, 0x48, 0x46, 0x44, 0x44, 0x43,
+      0x42, 0x41, 0x41, 0x40, 0x3f, 0x3f, 0x3e, 0x3d,
+      0x3d, 0x3d, 0x3c, 0x3a, 0x39, 0x38, 0x35, 0x35,
+      0x34, 0x34, 0x35, 0x34, 0x35, 0x36, 0x39, 0x3c,
+      0x3d, 0x3e, 0x41, 0x43, 0x44, 0x46, 0x48, 0x49,
+      0x4a, 0x49, 0x48, 0x47, 0x45, 0x43, 0x43, 0x44,
+      0x45, 0x47, 0x48, },
+    { 0x23, 0x2d, 0x2a, 0x2a, 0x29, 0x29, 0x2a, 0x2a,
+      0x29, 0x2c, 0x2d, 0x2f, 0x32, 0x34, 0x36, 0x3a,
+      0x3d, 0x40, 0x44, 0x48, 0x4a, 0x4c, 0x4e, 0x4e,
+      0x50, 0x51, 0x51, 0x51, 0x4f, 0x4c, 0x4b, 0x48,
+      0x46, 0x44, 0x40, 0x3d, 0x39, 0x36, 0x34, 0x30,
+      0x2f, 0x2d, 0x2a, 0x29, 0x28, 0x27, 0x26, 0x25,
+      0x25, 0x24, 0x24, 0x25, 0x26, 0x28, 0x29, 0x2a,
+      0x2b, 0x2e, 0x31, 0x34, 0x37, 0x39, 0x3b, 0x3c,
+      0x3d, 0x3e, 0x3e, 0x3d, 0x3e, 0x3c, 0x3c, 0x3a,
+      0x37, 0x35, 0x33, 0x30, 0x2f, 0x2b, 0x28, 0x26,
+      0x24, 0x21, 0x20, 0x1e, 0x1c, 0x1b, 0x18, 0x17,
+      0x16, 0x14, 0x13, 0x12, 0x10, 0x10, 0x0f, 0x0e,
+      0x0f, 0x0e, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0c,
+      0x0b, 0x0b, 0x0c, 0x0c, 0x0c, 0x0b, 0x0b, 0x0c,
+      0x0c, 0x0b, 0x0b, 0x0c, 0x0d, 0x0c, 0x0e, 0x0e,
+      0x0e, 0x0f, 0x11, 0x11, 0x13, 0x14, 0x16, 0x18,
+      0x1a, 0x1d, 0x1f, 0x22, 0x25, 0x26, 0x29, 0x2b,
+      0x2d, 0x31, 0x33, 0x34, 0x36, 0x37, 0x38, 0x38,
+      0x39, 0x3a, 0x39, 0x38, 0x37, 0x36, 0x37, 0x37,
+      0x35, 0x36, 0x35, 0x36, 0x35, 0x38, 0x3a, 0x3e,
+      0x40, 0x41, 0x45, 0x47, 0x49, 0x4a, 0x4c, 0x4d,
+      0x4e, 0x4d, 0x4a, 0x47, 0x44, 0x40, 0x3d, 0x3b,
+      0x39, 0x37, 0x34, 0x34, 0x32, 0x31, 0x31, 0x33,
+      0x32, 0x31, 0x32, 0x33, 0x32, 0x36, 0x38, 0x39,
+      0x3b, 0x3c, 0x3c, 0x3c, 0x3d, 0x3d, 0x3e, 0x3e,
+      0x41, 0x42, 0x43, 0x45, 0x48, 0x4c, 0x50, 0x56,
+      0x5b, 0x5f, 0x62, 0x67, 0x69, 0x6c, 0x6e, 0x6e,
+      0x70, 0x6f, 0x6b, 0x67, 0x63, 0x5e, 0x5b, 0x58,
+      0x54, 0x51, 0x4e, 0x4a, 0x48, 0x46, 0x46, 0x46,
+      0x45, 0x46, 0x44, 0x43, 0x44, 0x43, 0x42, 0x42,
+      0x41, 0x40, 0x3f, 0x3e, 0x3c, 0x3b, 0x3a, 0x39,
+      0x39, 0x39, 0x38, 0x37, 0x37, 0x3a, 0x3e, 0x40,
+      0x42, 0x43, 0x47, 0x47, 0x48, 0x4a, 0x4b, 0x4c,
+      0x4c, 0x4b, 0x4a, 0x48, 0x46, 0x44, 0x43, 0x45,
+      0x45, 0x46, 0x47, },
+    { 0x21, 0x2b, 0x28, 0x28, 0x28, 0x28, 0x29, 0x29,
+      0x28, 0x2a, 0x2d, 0x30, 0x32, 0x34, 0x37, 0x3a,
+      0x3c, 0x40, 0x44, 0x48, 0x4a, 0x4c, 0x4e, 0x4e,
+      0x50, 0x51, 0x52, 0x51, 0x4f, 0x4b, 0x4b, 0x48,
+      0x45, 0x43, 0x3f, 0x3c, 0x39, 0x36, 0x33, 0x30,
+      0x2f, 0x2d, 0x2b, 0x2a, 0x28, 0x27, 0x26, 0x25,
+      0x24, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2c, 0x2d, 0x31, 0x34, 0x37, 0x39, 0x3b, 0x3c,
+      0x3d, 0x3e, 0x3e, 0x3e, 0x3e, 0x3d, 0x3c, 0x3a,
+      0x37, 0x35, 0x33, 0x30, 0x2f, 0x2b, 0x28, 0x26,
+      0x25, 0x21, 0x20, 0x1e, 0x1c, 0x19, 0x19, 0x18,
+      0x17, 0x15, 0x15, 0x12, 0x11, 0x11, 0x11, 0x0f,
+      0x0e, 0x0e, 0x0e, 0x0e, 0x0d, 0x0d, 0x0d, 0x0c,
+      0x0c, 0x0c, 0x0b, 0x0b, 0x0b, 0x0b, 0x0b, 0x0b,
+      0x0c, 0x0c, 0x0c, 0x0c, 0x0e, 0x0e, 0x0f, 0x0f,
+      0x0f, 0x10, 0x11, 0x13, 0x13, 0x15, 0x16, 0x18,
+      0x1a, 0x1c, 0x1f, 0x22, 0x25, 0x28, 0x29, 0x2d,
+      0x2f, 0x32, 0x34, 0x35, 0x36, 0x37, 0x38, 0x38,
+      0x39, 0x3a, 0x39, 0x39, 0x37, 0x36, 0x37, 0x36,
+      0x35, 0x35, 0x37, 0x35, 0x36, 0x37, 0x3a, 0x3d,
+      0x3e, 0x41, 0x43, 0x46, 0x46, 0x47, 0x48, 0x49,
+      0x4a, 0x49, 0x47, 0x45, 0x42, 0x3f, 0x3d, 0x3b,
+      0x3a, 0x38, 0x36, 0x34, 0x32, 0x32, 0x32, 0x32,
+      0x32, 0x31, 0x33, 0x32, 0x34, 0x37, 0x38, 0x38,
+      0x3a, 0x3b, 0x3d, 0x3d, 0x3d, 0x3e, 0x3f, 0x41,
+      0x42, 0x44, 0x44, 0x46, 0x49, 0x4d, 0x50, 0x54,
+      0x58, 0x5c, 0x61, 0x63, 0x65, 0x69, 0x6a, 0x6c,
+      0x6d, 0x6c, 0x68, 0x64, 0x61, 0x5c, 0x59, 0x57,
+      0x53, 0x51, 0x4f, 0x4c, 0x4a, 0x48, 0x48, 0x49,
+      0x49, 0x48, 0x48, 0x48, 0x47, 0x47, 0x46, 0x46,
+      0x45, 0x44, 0x42, 0x41, 0x3f, 0x3e, 0x3c, 0x3c,
+      0x3c, 0x3d, 0x3c, 0x3c, 0x3c, 0x3e, 0x41, 0x43,
+      0x46, 0x48, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4e,
+      0x4e, 0x4d, 0x4b, 0x49, 0x47, 0x44, 0x44, 0x45,
+      0x45, 0x45, 0x46, },
+    { 0x22, 0x2b, 0x27, 0x27, 0x27, 0x27, 0x28, 0x28,
+      0x28, 0x2a, 0x2c, 0x2f, 0x30, 0x34, 0x37, 0x3b,
+      0x3d, 0x41, 0x45, 0x48, 0x4a, 0x4c, 0x4e, 0x4e,
+      0x50, 0x51, 0x52, 0x51, 0x4f, 0x4b, 0x4b, 0x47,
+      0x45, 0x43, 0x3f, 0x3c, 0x39, 0x36, 0x33, 0x30,
+      0x2f, 0x2d, 0x2b, 0x2a, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2c, 0x2e, 0x31, 0x34, 0x37, 0x39, 0x3a, 0x3b,
+      0x3d, 0x3e, 0x3e, 0x3f, 0x3f, 0x3d, 0x3c, 0x3a,
+      0x38, 0x36, 0x34, 0x31, 0x2e, 0x2c, 0x29, 0x26,
+      0x25, 0x22, 0x20, 0x1e, 0x1c, 0x1a, 0x19, 0x18,
+      0x16, 0x15, 0x14, 0x12, 0x10, 0x11, 0x11, 0x0f,
+      0x0e, 0x0e, 0x0e, 0x0e, 0x0d, 0x0c, 0x0d, 0x0c,
+      0x0c, 0x0c, 0x0b, 0x0b, 0x0b, 0x0b, 0x0b, 0x0b,
+      0x0c, 0x0c, 0x0c, 0x0d, 0x0d, 0x0e, 0x0f, 0x0f,
+      0x0f, 0x10, 0x11, 0x13, 0x13, 0x15, 0x15, 0x18,
+      0x19, 0x1d, 0x1f, 0x21, 0x24, 0x27, 0x2a, 0x2c,
+      0x30, 0x33, 0x35, 0x36, 0x37, 0x38, 0x39, 0x39,
+      0x3a, 0x3a, 0x39, 0x39, 0x37, 0x36, 0x37, 0x36,
+      0x36, 0x36, 0x36, 0x36, 0x36, 0x37, 0x39, 0x3a,
+      0x3d, 0x3e, 0x41, 0x43, 0x43, 0x45, 0x46, 0x46,
+      0x47, 0x46, 0x44, 0x42, 0x40, 0x3d, 0x3a, 0x39,
+      0x37, 0x36, 0x35, 0x34, 0x33, 0x32, 0x32, 0x32,
+      0x32, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38,
+      0x39, 0x3c, 0x3c, 0x3e, 0x3e, 0x3e, 0x41, 0x43,
+      0x44, 0x45, 0x46, 0x48, 0x49, 0x4c, 0x51, 0x54,
+      0x56, 0x5a, 0x5f, 0x61, 0x63, 0x65, 0x67, 0x69,
+      0x6a, 0x69, 0x67, 0x61, 0x5f, 0x5b, 0x58, 0x56,
+      0x54, 0x51, 0x50, 0x4e, 0x4c, 0x4a, 0x4b, 0x4c,
+      0x4c, 0x4b, 0x4b, 0x4b, 0x4b, 0x49, 0x4a, 0x49,
+      0x49, 0x48, 0x46, 0x44, 0x42, 0x41, 0x40, 0x3f,
+      0x3f, 0x40, 0x40, 0x40, 0x40, 0x42, 0x46, 0x49,
+      0x4b, 0x4c, 0x4f, 0x4f, 0x50, 0x52, 0x51, 0x51,
+      0x50, 0x4f, 0x4c, 0x4a, 0x48, 0x46, 0x45, 0x44,
+      0x44, 0x45, 0x46, },
+    { 0x21, 0x2a, 0x27, 0x27, 0x27, 0x27, 0x27, 0x27,
+      0x27, 0x29, 0x2d, 0x2f, 0x31, 0x34, 0x37, 0x3b,
+      0x3e, 0x41, 0x45, 0x48, 0x4a, 0x4c, 0x4e, 0x4e,
+      0x50, 0x51, 0x52, 0x51, 0x4f, 0x4b, 0x4b, 0x48,
+      0x45, 0x43, 0x3f, 0x3c, 0x39, 0x36, 0x33, 0x2f,
+      0x2f, 0x2d, 0x2a, 0x2a, 0x27, 0x26, 0x25, 0x24,
+      0x22, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2c, 0x2f, 0x31, 0x34, 0x37, 0x39, 0x3a, 0x3c,
+      0x3d, 0x3e, 0x3f, 0x40, 0x3f, 0x3d, 0x3d, 0x3a,
+      0x38, 0x36, 0x34, 0x31, 0x2e, 0x2c, 0x29, 0x26,
+      0x25, 0x22, 0x21, 0x1f, 0x1d, 0x1b, 0x19, 0x18,
+      0x16, 0x14, 0x14, 0x13, 0x11, 0x11, 0x11, 0x0f,
+      0x0f, 0x0f, 0x0e, 0x0e, 0x0d, 0x0d, 0x0d, 0x0d,
+      0x0d, 0x0d, 0x0c, 0x0b, 0x0b, 0x0b, 0x0b, 0x0c,
+      0x0c, 0x0d, 0x0d, 0x0d, 0x0e, 0x0e, 0x0f, 0x0f,
+      0x0f, 0x10, 0x13, 0x13, 0x14, 0x15, 0x17, 0x19,
+      0x1a, 0x1d, 0x1f, 0x22, 0x25, 0x27, 0x2a, 0x2e,
+      0x31, 0x33, 0x35, 0x38, 0x39, 0x3a, 0x3b, 0x3b,
+      0x3c, 0x3c, 0x3b, 0x3a, 0x39, 0x38, 0x38, 0x37,
+      0x36, 0x36, 0x37, 0x36, 0x37, 0x38, 0x38, 0x3a,
+      0x3b, 0x3e, 0x40, 0x40, 0x41, 0x42, 0x43, 0x42,
+      0x43, 0x42, 0x40, 0x40, 0x3f, 0x3c, 0x3b, 0x39,
+      0x38, 0x37, 0x36, 0x35, 0x34, 0x33, 0x32, 0x33,
+      0x32, 0x32, 0x34, 0x35, 0x35, 0x36, 0x39, 0x39,
+      0x3a, 0x3c, 0x3c, 0x3f, 0x40, 0x41, 0x43, 0x45,
+      0x45, 0x47, 0x48, 0x4a, 0x4b, 0x4d, 0x50, 0x53,
+      0x56, 0x59, 0x5c, 0x5f, 0x60, 0x65, 0x64, 0x66,
+      0x68, 0x66, 0x64, 0x61, 0x5e, 0x5a, 0x59, 0x56,
+      0x54, 0x52, 0x51, 0x50, 0x4e, 0x4c, 0x4d, 0x4f,
+      0x4f, 0x4f, 0x50, 0x50, 0x4f, 0x4f, 0x4e, 0x4d,
+      0x4c, 0x4b, 0x49, 0x47, 0x45, 0x44, 0x43, 0x43,
+      0x42, 0x43, 0x44, 0x44, 0x46, 0x47, 0x49, 0x4d,
+      0x4f, 0x51, 0x53, 0x54, 0x53, 0x54, 0x54, 0x53,
+      0x53, 0x51, 0x4e, 0x4b, 0x4a, 0x47, 0x45, 0x44,
+      0x44, 0x45, 0x46, },
+    { 0x20, 0x28, 0x26, 0x26, 0x25, 0x24, 0x27, 0x27,
+      0x27, 0x29, 0x2c, 0x2e, 0x31, 0x34, 0x37, 0x3b,
+      0x3e, 0x41, 0x45, 0x48, 0x4a, 0x4c, 0x4e, 0x4e,
+      0x50, 0x51, 0x52, 0x51, 0x4f, 0x4b, 0x4a, 0x49,
+      0x45, 0x43, 0x3f, 0x3c, 0x3a, 0x36, 0x33, 0x30,
+      0x2f, 0x2d, 0x2a, 0x28, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2c, 0x2e, 0x31, 0x34, 0x37, 0x39, 0x3b, 0x3c,
+      0x3d, 0x3e, 0x3f, 0x40, 0x3e, 0x3d, 0x3d, 0x3a,
+      0x38, 0x36, 0x34, 0x31, 0x2f, 0x2c, 0x29, 0x27,
+      0x25, 0x21, 0x21, 0x1f, 0x1c, 0x1d, 0x19, 0x18,
+      0x16, 0x15, 0x15, 0x13, 0x12, 0x11, 0x11, 0x0f,
+      0x0f, 0x0e, 0x0f, 0x0f, 0x0e, 0x0d, 0x0d, 0x0d,
+      0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c,
+      0x0d, 0x0d, 0x0d, 0x0e, 0x0e, 0x0e, 0x0f, 0x10,
+      0x10, 0x10, 0x12, 0x13, 0x15, 0x16, 0x18, 0x1a,
+      0x1c, 0x1d, 0x20, 0x22, 0x25, 0x27, 0x2a, 0x2e,
+      0x30, 0x34, 0x38, 0x39, 0x3a, 0x3b, 0x3b, 0x3b,
+      0x3c, 0x3d, 0x3c, 0x3b, 0x3a, 0x39, 0x38, 0x37,
+      0x36, 0x36, 0x38, 0x37, 0x37, 0x37, 0x38, 0x3a,
+      0x3b, 0x3c, 0x3d, 0x3e, 0x3f, 0x40, 0x40, 0x40,
+      0x42, 0x40, 0x3f, 0x3e, 0x3d, 0x3b, 0x3a, 0x39,
+      0x37, 0x36, 0x36, 0x35, 0x34, 0x34, 0x33, 0x33,
+      0x33, 0x34, 0x35, 0x35, 0x35, 0x36, 0x38, 0x39,
+      0x3a, 0x3b, 0x3d, 0x3f, 0x42, 0x43, 0x45, 0x45,
+      0x46, 0x48, 0x49, 0x4b, 0x4b, 0x4d, 0x50, 0x53,
+      0x56, 0x57, 0x5a, 0x5c, 0x5e, 0x61, 0x63, 0x65,
+      0x66, 0x64, 0x62, 0x5f, 0x5c, 0x59, 0x58, 0x56,
+      0x55, 0x54, 0x52, 0x51, 0x50, 0x51, 0x51, 0x52,
+      0x52, 0x52, 0x52, 0x52, 0x51, 0x51, 0x51, 0x50,
+      0x4f, 0x4e, 0x4c, 0x4a, 0x47, 0x46, 0x45, 0x45,
+      0x45, 0x46, 0x46, 0x46, 0x4a, 0x4c, 0x4d, 0x52,
+      0x54, 0x56, 0x58, 0x58, 0x56, 0x57, 0x57, 0x56,
+      0x55, 0x53, 0x50, 0x4d, 0x49, 0x45, 0x44, 0x44,
+      0x43, 0x44, 0x45, },
+    { 0x1f, 0x27, 0x24, 0x23, 0x25, 0x24, 0x25, 0x26,
+      0x26, 0x28, 0x2b, 0x2e, 0x31, 0x34, 0x37, 0x3a,
+      0x3d, 0x41, 0x45, 0x48, 0x4b, 0x4d, 0x4f, 0x4e,
+      0x50, 0x51, 0x52, 0x50, 0x4f, 0x4b, 0x4a, 0x49,
+      0x45, 0x43, 0x3f, 0x3c, 0x3a, 0x36, 0x33, 0x30,
+      0x2f, 0x2d, 0x29, 0x28, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x25, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2c, 0x2f, 0x32, 0x34, 0x37, 0x39, 0x3b, 0x3c,
+      0x3e, 0x3f, 0x3f, 0x40, 0x3e, 0x3d, 0x3c, 0x3a,
+      0x38, 0x36, 0x34, 0x31, 0x30, 0x2c, 0x29, 0x28,
+      0x25, 0x23, 0x22, 0x1f, 0x1c, 0x1c, 0x18, 0x18,
+      0x16, 0x14, 0x14, 0x13, 0x11, 0x11, 0x11, 0x0f,
+      0x0f, 0x0e, 0x0f, 0x0f, 0x0e, 0x0d, 0x0d, 0x0d,
+      0x0c, 0x0c, 0x0b, 0x0c, 0x0c, 0x0c, 0x0c, 0x0c,
+      0x0d, 0x0e, 0x0e, 0x0f, 0x0d, 0x0f, 0x10, 0x10,
+      0x10, 0x11, 0x13, 0x14, 0x15, 0x16, 0x19, 0x1a,
+      0x1c, 0x1f, 0x20, 0x23, 0x26, 0x28, 0x2a, 0x2e,
+      0x31, 0x35, 0x38, 0x39, 0x3a, 0x3c, 0x3d, 0x3d,
+      0x3e, 0x3e, 0x3d, 0x3c, 0x3a, 0x3a, 0x39, 0x39,
+      0x38, 0x37, 0x38, 0x38, 0x37, 0x38, 0x39, 0x3a,
+      0x3c, 0x3c, 0x3d, 0x3e, 0x3f, 0x3f, 0x40, 0x3f,
+      0x41, 0x40, 0x3e, 0x3e, 0x3d, 0x3b, 0x3b, 0x39,
+      0x37, 0x37, 0x35, 0x36, 0x34, 0x34, 0x34, 0x35,
+      0x35, 0x34, 0x34, 0x35, 0x35, 0x37, 0x38, 0x39,
+      0x3a, 0x3c, 0x3f, 0x3f, 0x43, 0x43, 0x45, 0x47,
+      0x48, 0x48, 0x4a, 0x4b, 0x4e, 0x4d, 0x51, 0x53,
+      0x56, 0x58, 0x59, 0x5b, 0x5d, 0x60, 0x62, 0x63,
+      0x64, 0x63, 0x61, 0x5e, 0x5c, 0x5a, 0x57, 0x56,
+      0x55, 0x54, 0x53, 0x52, 0x51, 0x51, 0x52, 0x52,
+      0x54, 0x54, 0x55, 0x55, 0x55, 0x54, 0x54, 0x53,
+      0x52, 0x50, 0x4e, 0x4d, 0x4b, 0x4a, 0x48, 0x48,
+      0x48, 0x48, 0x4a, 0x4b, 0x4d, 0x4f, 0x52, 0x55,
+      0x58, 0x5a, 0x5b, 0x5b, 0x5b, 0x5b, 0x5a, 0x59,
+      0x58, 0x55, 0x51, 0x4e, 0x4a, 0x46, 0x45, 0x44,
+      0x44, 0x44, 0x44, },
+    { 0x1e, 0x26, 0x23, 0x23, 0x25, 0x24, 0x25, 0x26,
+      0x26, 0x28, 0x2b, 0x2e, 0x31, 0x34, 0x37, 0x3a,
+      0x3e, 0x42, 0x45, 0x48, 0x4b, 0x4d, 0x4f, 0x4f,
+      0x50, 0x51, 0x52, 0x50, 0x4f, 0x4b, 0x4a, 0x48,
+      0x46, 0x44, 0x3f, 0x3b, 0x39, 0x36, 0x33, 0x30,
+      0x2f, 0x2d, 0x2a, 0x28, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2c, 0x2f, 0x32, 0x34, 0x37, 0x39, 0x3b, 0x3d,
+      0x3e, 0x3f, 0x41, 0x41, 0x40, 0x3e, 0x3d, 0x3b,
+      0x38, 0x37, 0x34, 0x32, 0x30, 0x2c, 0x2a, 0x27,
+      0x26, 0x23, 0x22, 0x20, 0x1d, 0x1b, 0x1a, 0x19,
+      0x17, 0x15, 0x15, 0x13, 0x12, 0x12, 0x11, 0x0f,
+      0x11, 0x0f, 0x0e, 0x0e, 0x0d, 0x0d, 0x0d, 0x0c,
+      0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d,
+      0x0e, 0x0e, 0x0e, 0x0f, 0x10, 0x10, 0x11, 0x11,
+      0x11, 0x13, 0x16, 0x15, 0x15, 0x18, 0x1a, 0x1b,
+      0x1d, 0x20, 0x22, 0x24, 0x27, 0x29, 0x2c, 0x30,
+      0x33, 0x37, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3e,
+      0x40, 0x40, 0x40, 0x3f, 0x3e, 0x3d, 0x3c, 0x3a,
+      0x3a, 0x3a, 0x3a, 0x3a, 0x3a, 0x3a, 0x3b, 0x3d,
+      0x3d, 0x3f, 0x40, 0x40, 0x3f, 0x41, 0x41, 0x41,
+      0x41, 0x41, 0x40, 0x40, 0x3f, 0x3e, 0x3c, 0x3b,
+      0x3a, 0x39, 0x37, 0x36, 0x36, 0x35, 0x35, 0x36,
+      0x36, 0x35, 0x35, 0x36, 0x36, 0x38, 0x39, 0x39,
+      0x3b, 0x3c, 0x3e, 0x40, 0x41, 0x43, 0x45, 0x47,
+      0x48, 0x48, 0x4b, 0x4c, 0x4d, 0x4f, 0x51, 0x53,
+      0x56, 0x56, 0x59, 0x5b, 0x5d, 0x5f, 0x61, 0x62,
+      0x63, 0x63, 0x61, 0x5e, 0x5c, 0x5a, 0x59, 0x57,
+      0x56, 0x54, 0x54, 0x53, 0x52, 0x53, 0x53, 0x55,
+      0x56, 0x56, 0x57, 0x57, 0x57, 0x57, 0x56, 0x56,
+      0x55, 0x53, 0x51, 0x4f, 0x4d, 0x4b, 0x49, 0x4b,
+      0x4b, 0x4c, 0x4d, 0x4e, 0x51, 0x53, 0x55, 0x58,
+      0x5b, 0x5c, 0x60, 0x60, 0x5f, 0x5e, 0x5d, 0x5c,
+      0x5a, 0x57, 0x53, 0x4f, 0x4b, 0x46, 0x45, 0x44,
+      0x44, 0x44, 0x44, },
+    { 0x1d, 0x25, 0x22, 0x22, 0x23, 0x23, 0x24, 0x25,
+      0x25, 0x28, 0x2b, 0x2e, 0x31, 0x34, 0x37, 0x3a,
+      0x3e, 0x42, 0x45, 0x48, 0x4b, 0x4d, 0x4f, 0x4f,
+      0x50, 0x51, 0x52, 0x50, 0x4f, 0x4b, 0x4a, 0x47,
+      0x45, 0x43, 0x3f, 0x3c, 0x38, 0x35, 0x33, 0x30,
+      0x2f, 0x2d, 0x2a, 0x28, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2b, 0x2f, 0x32, 0x34, 0x37, 0x39, 0x3c, 0x3d,
+      0x3e, 0x3f, 0x40, 0x41, 0x40, 0x3e, 0x3d, 0x3b,
+      0x39, 0x36, 0x34, 0x32, 0x30, 0x2d, 0x2a, 0x26,
+      0x26, 0x24, 0x22, 0x1f, 0x1d, 0x1c, 0x1a, 0x19,
+      0x18, 0x16, 0x15, 0x14, 0x12, 0x12, 0x12, 0x10,
+      0x10, 0x0f, 0x0e, 0x10, 0x0e, 0x0e, 0x0d, 0x0c,
+      0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0e, 0x0d, 0x0e,
+      0x0f, 0x0f, 0x0f, 0x10, 0x11, 0x11, 0x11, 0x12,
+      0x13, 0x14, 0x16, 0x16, 0x18, 0x1a, 0x1b, 0x1c,
+      0x1e, 0x21, 0x23, 0x25, 0x28, 0x2a, 0x2e, 0x32,
+      0x34, 0x38, 0x3a, 0x3c, 0x3d, 0x3f, 0x40, 0x42,
+      0x43, 0x43, 0x43, 0x42, 0x40, 0x3e, 0x3e, 0x3c,
+      0x3b, 0x3b, 0x3c, 0x3a, 0x3b, 0x3b, 0x3e, 0x3e,
+      0x40, 0x3f, 0x41, 0x41, 0x41, 0x42, 0x42, 0x43,
+      0x42, 0x41, 0x41, 0x41, 0x40, 0x3e, 0x3d, 0x3c,
+      0x3b, 0x3a, 0x39, 0x37, 0x36, 0x35, 0x36, 0x37,
+      0x35, 0x36, 0x36, 0x37, 0x38, 0x39, 0x3a, 0x3b,
+      0x3b, 0x3d, 0x3e, 0x40, 0x41, 0x41, 0x44, 0x46,
+      0x48, 0x48, 0x4a, 0x4c, 0x4d, 0x4f, 0x51, 0x53,
+      0x55, 0x57, 0x59, 0x5a, 0x5b, 0x5e, 0x5f, 0x61,
+      0x62, 0x61, 0x60, 0x5e, 0x5c, 0x5a, 0x59, 0x58,
+      0x56, 0x55, 0x54, 0x53, 0x53, 0x54, 0x54, 0x55,
+      0x57, 0x57, 0x58, 0x59, 0x5a, 0x58, 0x59, 0x58,
+      0x57, 0x55, 0x53, 0x52, 0x4f, 0x4e, 0x4d, 0x4d,
+      0x4d, 0x4f, 0x51, 0x50, 0x54, 0x56, 0x59, 0x5c,
+      0x5f, 0x61, 0x64, 0x64, 0x63, 0x61, 0x5e, 0x5e,
+      0x5c, 0x59, 0x54, 0x50, 0x4c, 0x46, 0x45, 0x44,
+      0x44, 0x44, 0x44, },
+    { 0x1c, 0x24, 0x21, 0x21, 0x21, 0x22, 0x23, 0x23,
+      0x25, 0x27, 0x2a, 0x2e, 0x31, 0x33, 0x37, 0x3b,
+      0x3e, 0x42, 0x45, 0x48, 0x4b, 0x4c, 0x50, 0x4f,
+      0x50, 0x51, 0x52, 0x50, 0x4e, 0x4b, 0x4a, 0x49,
+      0x45, 0x42, 0x3f, 0x3c, 0x38, 0x35, 0x33, 0x30,
+      0x2f, 0x2d, 0x2a, 0x28, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2b, 0x2f, 0x32, 0x34, 0x38, 0x39, 0x3c, 0x3d,
+      0x3e, 0x3e, 0x40, 0x41, 0x40, 0x3e, 0x3c, 0x3a,
+      0x39, 0x37, 0x35, 0x33, 0x30, 0x2d, 0x2b, 0x28,
+      0x26, 0x23, 0x23, 0x20, 0x1e, 0x1b, 0x19, 0x19,
+      0x17, 0x16, 0x15, 0x14, 0x12, 0x12, 0x11, 0x10,
+      0x0f, 0x0e, 0x0e, 0x10, 0x0e, 0x0d, 0x0c, 0x0c,
+      0x0c, 0x0d, 0x0d, 0x0d, 0x0d, 0x0e, 0x0d, 0x0e,
+      0x0f, 0x0f, 0x0f, 0x10, 0x11, 0x11, 0x12, 0x14,
+      0x14, 0x14, 0x16, 0x18, 0x19, 0x1b, 0x1c, 0x1e,
+      0x20, 0x23, 0x26, 0x27, 0x29, 0x2c, 0x2f, 0x33,
+      0x36, 0x38, 0x3b, 0x3e, 0x3e, 0x42, 0x43, 0x46,
+      0x46, 0x46, 0x46, 0x44, 0x42, 0x41, 0x3f, 0x3e,
+      0x3d, 0x3d, 0x3e, 0x3d, 0x3d, 0x3e, 0x3e, 0x40,
+      0x40, 0x40, 0x43, 0x43, 0x42, 0x43, 0x45, 0x43,
+      0x43, 0x43, 0x42, 0x42, 0x41, 0x40, 0x40, 0x3e,
+      0x3c, 0x3a, 0x3a, 0x38, 0x36, 0x36, 0x36, 0x36,
+      0x37, 0x37, 0x36, 0x38, 0x38, 0x39, 0x3b, 0x3b,
+      0x3e, 0x3e, 0x3e, 0x40, 0x41, 0x43, 0x45, 0x46,
+      0x46, 0x49, 0x4c, 0x4c, 0x4d, 0x4f, 0x51, 0x54,
+      0x56, 0x57, 0x58, 0x5a, 0x5c, 0x5e, 0x60, 0x60,
+      0x61, 0x61, 0x60, 0x5f, 0x5c, 0x5a, 0x59, 0x58,
+      0x57, 0x57, 0x55, 0x54, 0x53, 0x55, 0x55, 0x58,
+      0x58, 0x59, 0x5a, 0x5a, 0x5a, 0x5b, 0x5b, 0x5b,
+      0x5a, 0x59, 0x56, 0x54, 0x53, 0x4e, 0x4e, 0x50,
+      0x50, 0x51, 0x52, 0x52, 0x57, 0x59, 0x5d, 0x60,
+      0x63, 0x63, 0x66, 0x66, 0x66, 0x64, 0x63, 0x61,
+      0x60, 0x5b, 0x55, 0x51, 0x4d, 0x48, 0x45, 0x44,
+      0x43, 0x43, 0x43, },
+    { 0x1b, 0x23, 0x20, 0x21, 0x22, 0x22, 0x23, 0x24,
+      0x26, 0x27, 0x2a, 0x2e, 0x31, 0x33, 0x37, 0x3b,
+      0x3d, 0x42, 0x46, 0x49, 0x4a, 0x4c, 0x4f, 0x4f,
+      0x50, 0x50, 0x52, 0x50, 0x4e, 0x4b, 0x4b, 0x49,
+      0x45, 0x42, 0x3e, 0x3c, 0x38, 0x35, 0x33, 0x30,
+      0x2f, 0x2d, 0x2a, 0x28, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2c, 0x2f, 0x32, 0x35, 0x38, 0x3a, 0x3c, 0x3d,
+      0x3e, 0x3e, 0x40, 0x41, 0x40, 0x3f, 0x3d, 0x3b,
+      0x3a, 0x38, 0x36, 0x33, 0x30, 0x2d, 0x2b, 0x29,
+      0x27, 0x24, 0x24, 0x21, 0x1e, 0x1c, 0x1b, 0x1a,
+      0x18, 0x17, 0x16, 0x15, 0x13, 0x12, 0x10, 0x0f,
+      0x10, 0x0f, 0x0e, 0x0f, 0x0e, 0x0d, 0x0d, 0x0d,
+      0x0d, 0x0d, 0x0e, 0x0e, 0x0e, 0x0f, 0x0e, 0x0f,
+      0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15,
+      0x15, 0x16, 0x17, 0x1a, 0x1b, 0x1d, 0x1e, 0x20,
+      0x21, 0x25, 0x27, 0x29, 0x2b, 0x2d, 0x31, 0x35,
+      0x37, 0x39, 0x3c, 0x3f, 0x40, 0x43, 0x46, 0x47,
+      0x4a, 0x49, 0x48, 0x46, 0x45, 0x43, 0x42, 0x41,
+      0x3f, 0x40, 0x3f, 0x3f, 0x40, 0x3f, 0x41, 0x43,
+      0x43, 0x43, 0x44, 0x45, 0x45, 0x45, 0x45, 0x45,
+      0x45, 0x45, 0x44, 0x43, 0x43, 0x42, 0x42, 0x40,
+      0x3e, 0x3d, 0x3c, 0x39, 0x38, 0x38, 0x38, 0x38,
+      0x38, 0x36, 0x38, 0x39, 0x39, 0x3a, 0x3c, 0x3d,
+      0x3e, 0x3e, 0x3f, 0x41, 0x42, 0x42, 0x43, 0x45,
+      0x46, 0x49, 0x4b, 0x4d, 0x4f, 0x50, 0x53, 0x54,
+      0x57, 0x58, 0x5a, 0x5c, 0x5b, 0x5e, 0x60, 0x61,
+      0x60, 0x60, 0x5f, 0x5f, 0x5d, 0x5b, 0x5b, 0x59,
+      0x58, 0x57, 0x56, 0x55, 0x55, 0x55, 0x57, 0x59,
+      0x5b, 0x5b, 0x5d, 0x5c, 0x5c, 0x5e, 0x5e, 0x5e,
+      0x5d, 0x5b, 0x59, 0x56, 0x54, 0x51, 0x51, 0x51,
+      0x52, 0x55, 0x56, 0x56, 0x5a, 0x5d, 0x5f, 0x63,
+      0x66, 0x68, 0x6b, 0x6b, 0x68, 0x67, 0x66, 0x64,
+      0x61, 0x5d, 0x57, 0x52, 0x4f, 0x49, 0x46, 0x45,
+      0x43, 0x43, 0x43, },
+    { 0x1a, 0x22, 0x1f, 0x20, 0x21, 0x22, 0x23, 0x24,
+      0x26, 0x27, 0x2a, 0x2d, 0x31, 0x33, 0x37, 0x3b,
+      0x3d, 0x41, 0x46, 0x49, 0x4a, 0x4d, 0x4f, 0x4f,
+      0x50, 0x51, 0x52, 0x50, 0x4e, 0x4b, 0x4b, 0x48,
+      0x44, 0x42, 0x3e, 0x3c, 0x39, 0x35, 0x33, 0x30,
+      0x2f, 0x2d, 0x2a, 0x28, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x27, 0x29, 0x2a,
+      0x2d, 0x2f, 0x32, 0x35, 0x39, 0x3a, 0x3c, 0x3d,
+      0x3e, 0x3f, 0x40, 0x41, 0x40, 0x3f, 0x3e, 0x3c,
+      0x3a, 0x38, 0x36, 0x33, 0x31, 0x2d, 0x2c, 0x29,
+      0x27, 0x26, 0x24, 0x21, 0x1f, 0x1d, 0x1c, 0x1a,
+      0x19, 0x18, 0x16, 0x15, 0x14, 0x13, 0x12, 0x10,
+      0x11, 0x10, 0x0f, 0x0f, 0x0f, 0x0e, 0x0e, 0x0e,
+      0x0f, 0x0f, 0x0e, 0x0e, 0x0e, 0x0f, 0x0f, 0x10,
+      0x11, 0x12, 0x12, 0x13, 0x15, 0x15, 0x16, 0x16,
+      0x17, 0x18, 0x1a, 0x1b, 0x1c, 0x1e, 0x1f, 0x21,
+      0x22, 0x25, 0x27, 0x2a, 0x2c, 0x2e, 0x33, 0x36,
+      0x39, 0x3a, 0x3d, 0x40, 0x41, 0x45, 0x47, 0x4a,
+      0x4c, 0x4d, 0x4c, 0x4a, 0x48, 0x45, 0x44, 0x41,
+      0x42, 0x42, 0x42, 0x42, 0x42, 0x43, 0x43, 0x44,
+      0x45, 0x47, 0x47, 0x48, 0x47, 0x48, 0x47, 0x47,
+      0x48, 0x48, 0x46, 0x46, 0x46, 0x43, 0x43, 0x41,
+      0x3f, 0x3e, 0x3b, 0x39, 0x38, 0x37, 0x37, 0x37,
+      0x38, 0x38, 0x37, 0x39, 0x39, 0x3a, 0x3c, 0x3e,
+      0x3e, 0x3f, 0x3f, 0x3f, 0x42, 0x43, 0x43, 0x45,
+      0x47, 0x48, 0x4b, 0x4c, 0x4e, 0x50, 0x51, 0x54,
+      0x56, 0x58, 0x5a, 0x5c, 0x5c, 0x5f, 0x5f, 0x5f,
+      0x61, 0x60, 0x5f, 0x5f, 0x5e, 0x5b, 0x5c, 0x5b,
+      0x59, 0x59, 0x57, 0x56, 0x55, 0x56, 0x57, 0x59,
+      0x5a, 0x5b, 0x5c, 0x5c, 0x5d, 0x5e, 0x5e, 0x5d,
+      0x5e, 0x5c, 0x5a, 0x57, 0x55, 0x52, 0x51, 0x52,
+      0x53, 0x55, 0x57, 0x58, 0x5c, 0x5e, 0x61, 0x65,
+      0x69, 0x6b, 0x6c, 0x6b, 0x6a, 0x69, 0x67, 0x64,
+      0x61, 0x5d, 0x59, 0x53, 0x4d, 0x48, 0x46, 0x45,
+      0x44, 0x44, 0x43, },
+    { 0x1a, 0x21, 0x1e, 0x1f, 0x20, 0x21, 0x23, 0x24,
+      0x25, 0x28, 0x2a, 0x2e, 0x31, 0x33, 0x37, 0x3b,
+      0x3e, 0x41, 0x46, 0x49, 0x4b, 0x4d, 0x4f, 0x4e,
+      0x50, 0x51, 0x51, 0x50, 0x4e, 0x4b, 0x4a, 0x48,
+      0x44, 0x42, 0x3e, 0x3c, 0x39, 0x35, 0x32, 0x30,
+      0x2f, 0x2d, 0x29, 0x27, 0x27, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x26, 0x27, 0x29, 0x2a,
+      0x2c, 0x2f, 0x32, 0x35, 0x38, 0x3b, 0x3c, 0x3e,
+      0x3f, 0x3f, 0x40, 0x41, 0x40, 0x3f, 0x3e, 0x3c,
+      0x3a, 0x39, 0x36, 0x34, 0x31, 0x2d, 0x2c, 0x29,
+      0x27, 0x26, 0x24, 0x21, 0x1f, 0x1d, 0x1c, 0x1a,
+      0x19, 0x17, 0x16, 0x15, 0x14, 0x13, 0x12, 0x10,
+      0x11, 0x10, 0x0f, 0x0f, 0x0f, 0x0e, 0x0e, 0x0e,
+      0x0e, 0x0e, 0x0e, 0x0e, 0x0e, 0x0f, 0x0f, 0x10,
+      0x11, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x19,
+      0x19, 0x1a, 0x1c, 0x1d, 0x1e, 0x20, 0x22, 0x24,
+      0x25, 0x27, 0x29, 0x2c, 0x2e, 0x31, 0x35, 0x38,
+      0x3a, 0x3d, 0x41, 0x42, 0x45, 0x48, 0x4c, 0x4e,
+      0x4f, 0x4f, 0x4f, 0x4d, 0x4b, 0x49, 0x47, 0x47,
+      0x46, 0x45, 0x45, 0x45, 0x44, 0x44, 0x46, 0x47,
+      0x48, 0x49, 0x4b, 0x4b, 0x4a, 0x4b, 0x4b, 0x4a,
+      0x4b, 0x4a, 0x49, 0x49, 0x48, 0x46, 0x46, 0x44,
+      0x42, 0x41, 0x3d, 0x3b, 0x3a, 0x38, 0x38, 0x38,
+      0x37, 0x37, 0x39, 0x38, 0x3a, 0x3a, 0x3c, 0x3c,
+      0x3e, 0x40, 0x40, 0x41, 0x43, 0x43, 0x45, 0x46,
+      0x48, 0x49, 0x4b, 0x4e, 0x4f, 0x50, 0x53, 0x55,
+      0x57, 0x59, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, 0x60,
+      0x60, 0x60, 0x5f, 0x5f, 0x5e, 0x5c, 0x5b, 0x5a,
+      0x59, 0x58, 0x57, 0x57, 0x56, 0x56, 0x57, 0x58,
+      0x59, 0x5a, 0x5b, 0x5c, 0x5c, 0x5d, 0x5e, 0x5d,
+      0x5c, 0x5b, 0x58, 0x57, 0x54, 0x52, 0x52, 0x53,
+      0x54, 0x57, 0x58, 0x58, 0x5b, 0x5e, 0x62, 0x65,
+      0x69, 0x6b, 0x6d, 0x6c, 0x6a, 0x69, 0x67, 0x64,
+      0x62, 0x5e, 0x59, 0x54, 0x4d, 0x48, 0x47, 0x46,
+      0x45, 0x45, 0x44, },
+    { 0x1a, 0x21, 0x1e, 0x1f, 0x20, 0x21, 0x23, 0x24,
+      0x25, 0x28, 0x2a, 0x2e, 0x31, 0x34, 0x37, 0x3b,
+      0x3e, 0x42, 0x47, 0x49, 0x4b, 0x4d, 0x4f, 0x4f,
+      0x50, 0x51, 0x51, 0x50, 0x50, 0x4c, 0x4a, 0x47,
+      0x44, 0x42, 0x3e, 0x3c, 0x39, 0x35, 0x32, 0x31,
+      0x2f, 0x2d, 0x29, 0x27, 0x26, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x25, 0x25, 0x26, 0x27, 0x29, 0x2b,
+      0x2c, 0x2f, 0x33, 0x35, 0x38, 0x3a, 0x3c, 0x3e,
+      0x40, 0x40, 0x41, 0x42, 0x41, 0x3f, 0x3f, 0x3d,
+      0x3b, 0x39, 0x36, 0x33, 0x32, 0x2e, 0x2d, 0x2a,
+      0x27, 0x26, 0x25, 0x22, 0x1f, 0x1d, 0x1c, 0x1b,
+      0x19, 0x17, 0x17, 0x16, 0x15, 0x14, 0x12, 0x11,
+      0x11, 0x11, 0x10, 0x10, 0x0f, 0x0f, 0x0f, 0x0f,
+      0x0f, 0x0f, 0x10, 0x11, 0x10, 0x11, 0x11, 0x12,
+      0x11, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1b,
+      0x1c, 0x1c, 0x1e, 0x20, 0x21, 0x22, 0x23, 0x25,
+      0x27, 0x2a, 0x2c, 0x2f, 0x31, 0x35, 0x38, 0x3b,
+      0x3d, 0x40, 0x44, 0x47, 0x49, 0x4c, 0x4f, 0x51,
+      0x53, 0x53, 0x53, 0x51, 0x50, 0x4e, 0x4c, 0x4b,
+      0x4a, 0x49, 0x49, 0x49, 0x49, 0x4a, 0x4a, 0x4d,
+      0x4e, 0x4e, 0x4f, 0x50, 0x4f, 0x50, 0x51, 0x50,
+      0x50, 0x4e, 0x4d, 0x4c, 0x4b, 0x48, 0x48, 0x47,
+      0x44, 0x42, 0x3f, 0x3d, 0x3b, 0x3a, 0x39, 0x39,
+      0x39, 0x38, 0x39, 0x3b, 0x3a, 0x3c, 0x3e, 0x3d,
+      0x40, 0x40, 0x40, 0x42, 0x42, 0x42, 0x45, 0x46,
+      0x47, 0x49, 0x4c, 0x4e, 0x50, 0x50, 0x53, 0x56,
+      0x58, 0x59, 0x5d, 0x5d, 0x5e, 0x60, 0x61, 0x61,
+      0x62, 0x61, 0x60, 0x60, 0x5e, 0x5d, 0x5d, 0x5b,
+      0x57, 0x58, 0x56, 0x55, 0x55, 0x56, 0x56, 0x59,
+      0x59, 0x58, 0x5a, 0x5a, 0x5a, 0x5c, 0x5c, 0x5c,
+      0x5b, 0x5b, 0x58, 0x57, 0x54, 0x53, 0x52, 0x53,
+      0x54, 0x57, 0x58, 0x59, 0x5c, 0x5f, 0x63, 0x67,
+      0x6b, 0x6d, 0x6e, 0x6e, 0x6b, 0x6a, 0x68, 0x64,
+      0x62, 0x5e, 0x58, 0x53, 0x4f, 0x49, 0x47, 0x46,
+      0x45, 0x45, 0x44, },
+    { 0x19, 0x20, 0x1e, 0x1e, 0x1f, 0x20, 0x22, 0x23,
+      0x25, 0x27, 0x2a, 0x2e, 0x31, 0x34, 0x37, 0x3a,
+      0x3e, 0x41, 0x46, 0x49, 0x4a, 0x4d, 0x4f, 0x4e,
+      0x50, 0x51, 0x51, 0x4f, 0x4f, 0x4d, 0x49, 0x47,
+      0x44, 0x42, 0x3e, 0x3c, 0x39, 0x36, 0x32, 0x31,
+      0x2f, 0x2d, 0x29, 0x27, 0x26, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x25, 0x25, 0x26, 0x28, 0x29, 0x2b,
+      0x2c, 0x2f, 0x33, 0x35, 0x38, 0x3a, 0x3c, 0x3e,
+      0x3f, 0x3f, 0x41, 0x42, 0x41, 0x3f, 0x3f, 0x3d,
+      0x3c, 0x39, 0x36, 0x33, 0x32, 0x2e, 0x2d, 0x2a,
+      0x27, 0x26, 0x25, 0x22, 0x1f, 0x1e, 0x1d, 0x1b,
+      0x1a, 0x17, 0x17, 0x17, 0x14, 0x14, 0x12, 0x11,
+      0x11, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10,
+      0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x13, 0x14,
+      0x14, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1c, 0x1e,
+      0x1e, 0x1f, 0x22, 0x23, 0x23, 0x24, 0x25, 0x27,
+      0x2a, 0x2d, 0x2f, 0x31, 0x35, 0x38, 0x3a, 0x3e,
+      0x41, 0x44, 0x48, 0x4b, 0x4d, 0x51, 0x53, 0x55,
+      0x57, 0x57, 0x56, 0x55, 0x54, 0x52, 0x52, 0x50,
+      0x4e, 0x50, 0x4e, 0x4d, 0x4d, 0x4d, 0x4f, 0x51,
+      0x51, 0x52, 0x54, 0x55, 0x55, 0x55, 0x57, 0x55,
+      0x54, 0x53, 0x52, 0x4e, 0x4d, 0x4b, 0x4a, 0x49,
+      0x46, 0x44, 0x41, 0x3f, 0x3d, 0x3b, 0x3a, 0x3a,
+      0x39, 0x39, 0x39, 0x39, 0x3a, 0x3b, 0x3d, 0x3e,
+      0x3f, 0x40, 0x41, 0x42, 0x44, 0x44, 0x45, 0x47,
+      0x49, 0x49, 0x4a, 0x4d, 0x50, 0x51, 0x53, 0x57,
+      0x5a, 0x5b, 0x5e, 0x5f, 0x60, 0x61, 0x62, 0x62,
+      0x63, 0x62, 0x60, 0x60, 0x5e, 0x5c, 0x5c, 0x59,
+      0x58, 0x56, 0x55, 0x55, 0x55, 0x55, 0x55, 0x54,
+      0x56, 0x56, 0x57, 0x58, 0x58, 0x59, 0x5a, 0x59,
+      0x58, 0x57, 0x56, 0x55, 0x54, 0x52, 0x53, 0x53,
+      0x53, 0x56, 0x57, 0x59, 0x5b, 0x5e, 0x62, 0x66,
+      0x6a, 0x6c, 0x6d, 0x6e, 0x6b, 0x69, 0x67, 0x64,
+      0x61, 0x5d, 0x58, 0x54, 0x50, 0x4a, 0x47, 0x46,
+      0x45, 0x45, 0x44, },
+    { 0x1a, 0x21, 0x1e, 0x1f, 0x1f, 0x20, 0x22, 0x23,
+      0x25, 0x27, 0x2b, 0x2e, 0x31, 0x34, 0x37, 0x3b,
+      0x3d, 0x42, 0x45, 0x49, 0x4a, 0x4d, 0x4e, 0x4e,
+      0x51, 0x52, 0x50, 0x4f, 0x4f, 0x4c, 0x49, 0x48,
+      0x45, 0x42, 0x3e, 0x3b, 0x39, 0x36, 0x32, 0x32,
+      0x2f, 0x2c, 0x2a, 0x28, 0x26, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x25, 0x28, 0x29, 0x2b,
+      0x2d, 0x2f, 0x33, 0x35, 0x38, 0x3a, 0x3c, 0x3e,
+      0x3f, 0x3f, 0x41, 0x42, 0x41, 0x3f, 0x3e, 0x3c,
+      0x3c, 0x3a, 0x37, 0x33, 0x32, 0x2f, 0x2d, 0x2b,
+      0x28, 0x26, 0x25, 0x22, 0x20, 0x1e, 0x1d, 0x1b,
+      0x1a, 0x17, 0x17, 0x16, 0x14, 0x14, 0x12, 0x11,
+      0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10,
+      0x10, 0x11, 0x12, 0x12, 0x12, 0x13, 0x14, 0x14,
+      0x16, 0x18, 0x19, 0x1a, 0x1b, 0x1d, 0x1e, 0x1f,
+      0x21, 0x22, 0x23, 0x25, 0x26, 0x26, 0x28, 0x2a,
+      0x2c, 0x2e, 0x32, 0x34, 0x39, 0x39, 0x3d, 0x41,
+      0x45, 0x47, 0x4c, 0x4e, 0x51, 0x54, 0x56, 0x58,
+      0x5b, 0x5c, 0x5a, 0x59, 0x58, 0x56, 0x55, 0x53,
+      0x53, 0x52, 0x52, 0x51, 0x52, 0x52, 0x53, 0x55,
+      0x57, 0x58, 0x5a, 0x5a, 0x59, 0x5b, 0x59, 0x59,
+      0x58, 0x57, 0x55, 0x53, 0x51, 0x4e, 0x4c, 0x4a,
+      0x48, 0x46, 0x43, 0x40, 0x3e, 0x3c, 0x3b, 0x3b,
+      0x38, 0x39, 0x38, 0x39, 0x3a, 0x3d, 0x3d, 0x3e,
+      0x3f, 0x40, 0x41, 0x43, 0x44, 0x45, 0x46, 0x48,
+      0x4a, 0x4b, 0x4d, 0x4e, 0x50, 0x52, 0x54, 0x56,
+      0x59, 0x5c, 0x5e, 0x5f, 0x60, 0x62, 0x62, 0x63,
+      0x63, 0x63, 0x61, 0x5f, 0x5e, 0x5d, 0x5c, 0x5b,
+      0x59, 0x56, 0x56, 0x55, 0x54, 0x53, 0x53, 0x54,
+      0x55, 0x54, 0x55, 0x55, 0x55, 0x57, 0x58, 0x57,
+      0x57, 0x56, 0x55, 0x54, 0x54, 0x52, 0x52, 0x53,
+      0x54, 0x55, 0x57, 0x58, 0x5b, 0x5e, 0x62, 0x65,
+      0x69, 0x6b, 0x6d, 0x6e, 0x6a, 0x69, 0x67, 0x63,
+      0x61, 0x5d, 0x58, 0x54, 0x4f, 0x4b, 0x48, 0x47,
+      0x46, 0x45, 0x45, },
+    { 0x1a, 0x21, 0x1e, 0x1f, 0x1f, 0x20, 0x22, 0x23,
+      0x25, 0x27, 0x2b, 0x2d, 0x31, 0x34, 0x37, 0x3b,
+      0x3d, 0x42, 0x45, 0x48, 0x4c, 0x4e, 0x4e, 0x4f,
+      0x51, 0x52, 0x50, 0x50, 0x4f, 0x4c, 0x4a, 0x48,
+      0x45, 0x42, 0x3f, 0x3b, 0x39, 0x36, 0x32, 0x31,
+      0x2f, 0x2c, 0x2a, 0x28, 0x26, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x28, 0x29, 0x2b,
+      0x2d, 0x30, 0x33, 0x36, 0x39, 0x3b, 0x3d, 0x3f,
+      0x3f, 0x40, 0x42, 0x43, 0x42, 0x40, 0x3e, 0x3c,
+      0x3c, 0x3a, 0x37, 0x34, 0x32, 0x2f, 0x2d, 0x2c,
+      0x2a, 0x27, 0x26, 0x23, 0x20, 0x1e, 0x1d, 0x1c,
+      0x1a, 0x18, 0x18, 0x17, 0x15, 0x16, 0x14, 0x12,
+      0x12, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x12,
+      0x12, 0x12, 0x13, 0x14, 0x14, 0x14, 0x15, 0x16,
+      0x17, 0x19, 0x1b, 0x1c, 0x1e, 0x20, 0x20, 0x22,
+      0x24, 0x25, 0x26, 0x27, 0x28, 0x2a, 0x2c, 0x2c,
+      0x2f, 0x32, 0x35, 0x37, 0x3b, 0x3c, 0x41, 0x45,
+      0x48, 0x4c, 0x50, 0x52, 0x54, 0x57, 0x5a, 0x5c,
+      0x5f, 0x5f, 0x5f, 0x5d, 0x5c, 0x5b, 0x5a, 0x58,
+      0x57, 0x57, 0x57, 0x56, 0x56, 0x57, 0x57, 0x5a,
+      0x5c, 0x5e, 0x5f, 0x61, 0x5f, 0x5f, 0x5f, 0x5e,
+      0x5d, 0x5c, 0x5a, 0x57, 0x55, 0x52, 0x4f, 0x4e,
+      0x4a, 0x47, 0x46, 0x42, 0x41, 0x3e, 0x3d, 0x3c,
+      0x3b, 0x3a, 0x39, 0x39, 0x3b, 0x3c, 0x3d, 0x3f,
+      0x40, 0x42, 0x42, 0x44, 0x45, 0x46, 0x49, 0x49,
+      0x4b, 0x4c, 0x4e, 0x4f, 0x51, 0x54, 0x57, 0x58,
+      0x5b, 0x5d, 0x61, 0x61, 0x61, 0x63, 0x65, 0x65,
+      0x64, 0x64, 0x62, 0x61, 0x60, 0x5e, 0x5d, 0x5c,
+      0x59, 0x58, 0x56, 0x54, 0x53, 0x53, 0x53, 0x54,
+      0x54, 0x53, 0x53, 0x54, 0x54, 0x54, 0x55, 0x55,
+      0x56, 0x55, 0x54, 0x53, 0x53, 0x52, 0x52, 0x53,
+      0x55, 0x56, 0x57, 0x58, 0x5b, 0x5e, 0x62, 0x66,
+      0x69, 0x6b, 0x6d, 0x6d, 0x6b, 0x69, 0x67, 0x64,
+      0x61, 0x5d, 0x58, 0x55, 0x50, 0x4b, 0x48, 0x47,
+      0x46, 0x46, 0x46, },
+    { 0x1a, 0x20, 0x1e, 0x1f, 0x1f, 0x21, 0x22, 0x23,
+      0x25, 0x27, 0x2b, 0x2d, 0x31, 0x34, 0x37, 0x3b,
+      0x3d, 0x42, 0x45, 0x48, 0x4c, 0x4e, 0x4f, 0x4f,
+      0x51, 0x52, 0x51, 0x50, 0x4e, 0x4b, 0x4a, 0x48,
+      0x45, 0x42, 0x3f, 0x3b, 0x38, 0x36, 0x32, 0x31,
+      0x2f, 0x2c, 0x2a, 0x28, 0x26, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x27, 0x28, 0x29, 0x2b,
+      0x2e, 0x30, 0x33, 0x36, 0x39, 0x3b, 0x3d, 0x3f,
+      0x3f, 0x40, 0x41, 0x42, 0x41, 0x40, 0x3e, 0x3c,
+      0x3c, 0x3a, 0x37, 0x34, 0x33, 0x30, 0x2e, 0x2b,
+      0x29, 0x26, 0x24, 0x24, 0x20, 0x1f, 0x1d, 0x1d,
+      0x1a, 0x19, 0x17, 0x16, 0x16, 0x16, 0x16, 0x14,
+      0x13, 0x12, 0x13, 0x13, 0x13, 0x12, 0x12, 0x13,
+      0x13, 0x14, 0x15, 0x15, 0x14, 0x15, 0x16, 0x18,
+      0x19, 0x1b, 0x1c, 0x1e, 0x20, 0x21, 0x22, 0x24,
+      0x27, 0x28, 0x29, 0x2a, 0x2c, 0x2c, 0x2d, 0x2f,
+      0x32, 0x35, 0x37, 0x3a, 0x3c, 0x3e, 0x44, 0x48,
+      0x4c, 0x50, 0x54, 0x56, 0x58, 0x5b, 0x5e, 0x60,
+      0x61, 0x63, 0x62, 0x61, 0x60, 0x5f, 0x5e, 0x5e,
+      0x5c, 0x5c, 0x5b, 0x5a, 0x5a, 0x5b, 0x5c, 0x5e,
+      0x60, 0x63, 0x64, 0x65, 0x63, 0x62, 0x63, 0x63,
+      0x61, 0x60, 0x5e, 0x5b, 0x58, 0x55, 0x51, 0x4f,
+      0x4c, 0x4a, 0x47, 0x44, 0x42, 0x41, 0x3e, 0x3c,
+      0x3b, 0x3a, 0x3a, 0x3b, 0x3b, 0x3c, 0x3e, 0x3f,
+      0x40, 0x42, 0x43, 0x45, 0x46, 0x47, 0x49, 0x4a,
+      0x4c, 0x4c, 0x4f, 0x51, 0x52, 0x55, 0x58, 0x5b,
+      0x5c, 0x5f, 0x61, 0x62, 0x63, 0x64, 0x64, 0x65,
+      0x66, 0x65, 0x63, 0x62, 0x5f, 0x5e, 0x5e, 0x5c,
+      0x5b, 0x58, 0x56, 0x55, 0x54, 0x53, 0x52, 0x53,
+      0x52, 0x52, 0x52, 0x52, 0x52, 0x53, 0x55, 0x55,
+      0x55, 0x53, 0x53, 0x53, 0x52, 0x51, 0x52, 0x52,
+      0x55, 0x55, 0x58, 0x58, 0x5b, 0x5d, 0x61, 0x65,
+      0x68, 0x6a, 0x6c, 0x6b, 0x69, 0x68, 0x67, 0x64,
+      0x61, 0x5e, 0x58, 0x54, 0x4f, 0x4b, 0x49, 0x48,
+      0x47, 0x46, 0x45, },
+    { 0x19, 0x20, 0x1d, 0x1f, 0x1f, 0x20, 0x23, 0x23,
+      0x25, 0x27, 0x2b, 0x2d, 0x31, 0x34, 0x37, 0x3b,
+      0x3d, 0x42, 0x45, 0x48, 0x4c, 0x4e, 0x4f, 0x4f,
+      0x51, 0x52, 0x51, 0x50, 0x4e, 0x4b, 0x4a, 0x48,
+      0x44, 0x42, 0x3f, 0x3a, 0x38, 0x36, 0x32, 0x30,
+      0x2f, 0x2c, 0x2a, 0x28, 0x26, 0x26, 0x25, 0x24,
+      0x23, 0x24, 0x24, 0x25, 0x26, 0x28, 0x29, 0x2b,
+      0x2e, 0x30, 0x34, 0x36, 0x39, 0x3b, 0x3d, 0x3f,
+      0x3f, 0x40, 0x41, 0x42, 0x41, 0x40, 0x3e, 0x3c,
+      0x3c, 0x3a, 0x37, 0x34, 0x33, 0x30, 0x2e, 0x2b,
+      0x29, 0x27, 0x25, 0x24, 0x21, 0x1f, 0x1e, 0x1c,
+      0x1b, 0x19, 0x17, 0x16, 0x16, 0x16, 0x16, 0x14,
+      0x13, 0x12, 0x13, 0x13, 0x13, 0x13, 0x13, 0x13,
+      0x13, 0x14, 0x15, 0x14, 0x14, 0x14, 0x17, 0x19,
+      0x1a, 0x1c, 0x1e, 0x20, 0x21, 0x23, 0x24, 0x26,
+      0x29, 0x29, 0x2b, 0x2c, 0x2d, 0x2e, 0x30, 0x31,
+      0x34, 0x38, 0x3b, 0x3c, 0x3f, 0x42, 0x47, 0x4c,
+      0x50, 0x54, 0x57, 0x5b, 0x5c, 0x5e, 0x62, 0x63,
+      0x66, 0x66, 0x66, 0x65, 0x64, 0x63, 0x61, 0x62,
+      0x60, 0x60, 0x5f, 0x5e, 0x5e, 0x5f, 0x60, 0x62,
+      0x65, 0x67, 0x69, 0x6a, 0x69, 0x68, 0x69, 0x67,
+      0x66, 0x64, 0x62, 0x5f, 0x5c, 0x58, 0x54, 0x51,
+      0x4e, 0x4b, 0x49, 0x45, 0x43, 0x41, 0x40, 0x3e,
+      0x3c, 0x3a, 0x3b, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f,
+      0x41, 0x42, 0x44, 0x46, 0x46, 0x48, 0x49, 0x4b,
+      0x4d, 0x50, 0x51, 0x53, 0x55, 0x57, 0x58, 0x5c,
+      0x5f, 0x60, 0x63, 0x64, 0x64, 0x65, 0x66, 0x66,
+      0x66, 0x65, 0x65, 0x63, 0x61, 0x5f, 0x5e, 0x5c,
+      0x5a, 0x58, 0x56, 0x55, 0x54, 0x53, 0x52, 0x52,
+      0x53, 0x52, 0x52, 0x52, 0x52, 0x53, 0x53, 0x53,
+      0x54, 0x53, 0x53, 0x52, 0x53, 0x51, 0x53, 0x53,
+      0x55, 0x57, 0x58, 0x59, 0x5b, 0x5d, 0x62, 0x64,
+      0x68, 0x6a, 0x6c, 0x6b, 0x69, 0x68, 0x67, 0x64,
+      0x61, 0x5d, 0x57, 0x54, 0x50, 0x4a, 0x48, 0x47,
+      0x46, 0x45, 0x45, },
diff --git a/tests/tcg/hexagon/hvx_histogram_row.h b/tests/tcg/hexagon/hvx_histogram_row.h
new file mode 100644
index 0000000..6a4531a
--- /dev/null
+++ b/tests/tcg/hexagon/hvx_histogram_row.h
@@ -0,0 +1,24 @@
+/*
+ *  Copyright(c) 2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HVX_HISTOGRAM_ROW_H
+#define HVX_HISTOGRAM_ROW_H
+
+void hvx_histogram_row(uint8_t *src, int stride, int width, int height,
+                       int *hist);
+
+#endif
diff --git a/tests/tcg/hexagon/hvx_histogram.c b/tests/tcg/hexagon/hvx_histogram.c
new file mode 100644
index 0000000..43377a9
--- /dev/null
+++ b/tests/tcg/hexagon/hvx_histogram.c
@@ -0,0 +1,88 @@
+/*
+ *  Copyright(c) 2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include "hvx_histogram_row.h"
+
+const int vector_len = 128;
+const int width = 275;
+const int height = 20;
+const int stride = (width + vector_len - 1) & -vector_len;
+
+int err;
+
+static uint8_t input[height][stride] __attribute__((aligned(128))) = {
+#include "hvx_histogram_input.h"
+};
+
+static int result[256] __attribute__((aligned(128)));
+static int expect[256] __attribute__((aligned(128)));
+
+static void check(void)
+{
+    for (int i = 0; i < 256; i++) {
+        int res = result[i];
+        int exp = expect[i];
+        if (res != exp) {
+            printf("ERROR at %3d: 0x%04x != 0x%04x\n",
+                   i, res, exp);
+            err++;
+        }
+    }
+}
+
+static void ref_histogram(uint8_t *src, int stride, int width, int height,
+                          int *hist)
+{
+    for (int i = 0; i < 256; i++) {
+        hist[i] = 0;
+    }
+
+    for (int i = 0; i < height; i++) {
+        for (int j = 0; j < width; j++) {
+            hist[src[i * stride + j]]++;
+        }
+    }
+}
+
+static void hvx_histogram(uint8_t *src, int stride, int width, int height,
+                          int *hist)
+{
+    int n = 8192 / width;
+
+    for (int i = 0; i < 256; i++) {
+        hist[i] = 0;
+    }
+
+    for (int i = 0; i < height; i += n) {
+        int k = height - i > n ? n : height - i;
+        hvx_histogram_row(src, stride, width, k, hist);
+        src += n * stride;
+    }
+}
+
+int main(void)
+{
+    ref_histogram(&input[0][0], stride, width, height, expect);
+    hvx_histogram(&input[0][0], stride, width, height, result);
+    check();
+
+    puts(err ? "FAIL" : "PASS");
+    return err ? 1 : 0;
+}
diff --git a/tests/tcg/hexagon/Makefile.target b/tests/tcg/hexagon/Makefile.target
index c4ccc99..00c9a78 100644
--- a/tests/tcg/hexagon/Makefile.target
+++ b/tests/tcg/hexagon/Makefile.target
@@ -43,9 +43,14 @@ HEX_TESTS += scatter_gather
 HEX_TESTS += atomics
 HEX_TESTS += fpstuff
 HEX_TESTS += hvx_misc
+HEX_TESTS += hvx_histogram
 
 TESTS += $(HEX_TESTS)
 
 scatter_gather: CFLAGS += -mhvx
 vector_add_int: CFLAGS += -mhvx -fvectorize
 hvx_misc: CFLAGS += -mhvx
+hvx_histogram: CFLAGS += -mhvx -Wno-gnu-folding-constant
+
+hvx_histogram: hvx_histogram.c hvx_histogram_row.S
+	$(CC) $(CFLAGS) $(CROSS_CC_GUEST_CFLAGS) $^ -o $@
diff --git a/tests/tcg/hexagon/hvx_histogram_row.S b/tests/tcg/hexagon/hvx_histogram_row.S
new file mode 100644
index 0000000..5e42c33
--- /dev/null
+++ b/tests/tcg/hexagon/hvx_histogram_row.S
@@ -0,0 +1,294 @@
+/*
+ *  Copyright(c) 2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+
+/*
+ * void hvx_histogram_row(uint8_t *src,     => r0
+ *                        int stride,       => r1
+ *                        int width,        => r2
+ *                        int height,       => r3
+ *                        int *hist         => r4)
+ */
+    .text
+    .p2align 2
+    .global hvx_histogram_row
+    .type hvx_histogram_row, @function
+hvx_histogram_row:
+    { r2 = lsr(r2, #7)          /* width / VLEN */
+      r5 = and(r2, #127)        /* width % VLEN */
+      v1 = #0
+      v0 = #0
+    }
+    /*
+     * Step 1: Clear the whole vector register file
+     */
+    { v3:2 = v1:0
+      v5:4 = v1:0
+      p0 = cmp.gt(r2, #0)       /* P0 = (width / VLEN > 0) */
+      p1 = cmp.eq(r5, #0)       /* P1 = (width % VLEN == 0) */
+    }
+    { q0 = vsetq(r5)
+      v7:6 = v1:0
+    }
+    { v9:8   = v1:0
+      v11:10 = v1:0
+    }
+    { v13:12 = v1:0
+      v15:14 = v1:0
+    }
+    { v17:16 = v1:0
+      v19:18 = v1:0
+    }
+    { v21:20 = v1:0
+      v23:22 = v1:0
+    }
+    { v25:24 = v1:0
+      v27:26 = v1:0
+    }
+    { v29:28 = v1:0
+      v31:30 = v1:0
+      r10 = add(r0, r1)           /* R10 = &src[1 * stride] */
+      loop1(.outerloop, r3)
+    }
+
+    /*
+     * Step 2: vhist
+     */
+    .falign
+.outerloop:
+    { if (!p0) jump .loopend
+      loop0(.innerloop, r2)
+    }
+
+    .falign
+.innerloop:
+    { v12.tmp = vmem(r0++#1)
+      vhist
+    }:endloop0
+
+    .falign
+.loopend:
+    if (p1) jump .skip       /* if (width % VLEN == 0) done with current row */
+    { v13.tmp = vmem(r0 + #0)
+      vhist(q0)
+    }
+
+    .falign
+.skip:
+    { r0 = r10                    /* R0  = &src[(i + 1) * stride] */
+      r10 = add(r10, r1)          /* R10 = &src[(i + 2) * stride] */
+    }:endloop1
+
+
+    /*
+     * Step 3: Sum up the data
+     */
+    { v0.h = vshuff(v0.h)
+      r10 = ##0x00010001
+    }
+    v1.h = vshuff(v1.h)
+    { v2.h = vshuff(v2.h)
+      v0.w = vdmpy(v0.h, r10.h):sat
+    }
+    { v3.h = vshuff(v3.h)
+      v1.w = vdmpy(v1.h, r10.h):sat
+    }
+    { v4.h = vshuff(v4.h)
+      v2.w = vdmpy(v2.h, r10.h):sat
+    }
+    { v5.h = vshuff(v5.h)
+      v3.w = vdmpy(v3.h, r10.h):sat
+    }
+    { v6.h = vshuff(v6.h)
+      v4.w = vdmpy(v4.h, r10.h):sat
+    }
+    { v7.h = vshuff(v7.h)
+      v5.w = vdmpy(v5.h, r10.h):sat
+    }
+    { v8.h = vshuff(v8.h)
+      v6.w = vdmpy(v6.h, r10.h):sat
+    }
+    { v9.h = vshuff(v9.h)
+      v7.w = vdmpy(v7.h, r10.h):sat
+    }
+    { v10.h = vshuff(v10.h)
+      v8.w = vdmpy(v8.h, r10.h):sat
+    }
+    { v11.h = vshuff(v11.h)
+      v9.w = vdmpy(v9.h, r10.h):sat
+    }
+    { v12.h = vshuff(v12.h)
+      v10.w = vdmpy(v10.h, r10.h):sat
+    }
+    { v13.h = vshuff(v13.h)
+      v11.w = vdmpy(v11.h, r10.h):sat
+    }
+    { v14.h = vshuff(v14.h)
+      v12.w = vdmpy(v12.h, r10.h):sat
+    }
+    { v15.h = vshuff(v15.h)
+      v13.w = vdmpy(v13.h, r10.h):sat
+    }
+    { v16.h = vshuff(v16.h)
+      v14.w = vdmpy(v14.h, r10.h):sat
+    }
+    { v17.h = vshuff(v17.h)
+      v15.w = vdmpy(v15.h, r10.h):sat
+    }
+    { v18.h = vshuff(v18.h)
+      v16.w = vdmpy(v16.h, r10.h):sat
+    }
+    { v19.h = vshuff(v19.h)
+      v17.w = vdmpy(v17.h, r10.h):sat
+    }
+    { v20.h = vshuff(v20.h)
+      v18.w = vdmpy(v18.h, r10.h):sat
+    }
+    { v21.h = vshuff(v21.h)
+      v19.w = vdmpy(v19.h, r10.h):sat
+    }
+    { v22.h = vshuff(v22.h)
+      v20.w = vdmpy(v20.h, r10.h):sat
+    }
+    { v23.h = vshuff(v23.h)
+      v21.w = vdmpy(v21.h, r10.h):sat
+    }
+    { v24.h = vshuff(v24.h)
+      v22.w = vdmpy(v22.h, r10.h):sat
+    }
+    { v25.h = vshuff(v25.h)
+      v23.w = vdmpy(v23.h, r10.h):sat
+    }
+    { v26.h = vshuff(v26.h)
+      v24.w = vdmpy(v24.h, r10.h):sat
+    }
+    { v27.h = vshuff(v27.h)
+      v25.w = vdmpy(v25.h, r10.h):sat
+    }
+    { v28.h = vshuff(v28.h)
+      v26.w = vdmpy(v26.h, r10.h):sat
+    }
+    { v29.h = vshuff(v29.h)
+      v27.w = vdmpy(v27.h, r10.h):sat
+    }
+    { v30.h = vshuff(v30.h)
+      v28.w = vdmpy(v28.h, r10.h):sat
+    }
+    { v31.h = vshuff(v31.h)
+      v29.w = vdmpy(v29.h, r10.h):sat
+      r28 = #32
+    }
+    { vshuff(v1, v0, r28)
+      v30.w = vdmpy(v30.h, r10.h):sat
+    }
+    { vshuff(v3, v2, r28)
+      v31.w = vdmpy(v31.h, r10.h):sat
+    }
+    { vshuff(v5, v4, r28)
+      v0.w = vadd(v1.w, v0.w)
+      v2.w = vadd(v3.w, v2.w)
+    }
+    { vshuff(v7, v6, r28)
+      r7 = #64
+    }
+    { vshuff(v9, v8, r28)
+      v4.w = vadd(v5.w, v4.w)
+      v6.w = vadd(v7.w, v6.w)
+    }
+    vshuff(v11, v10, r28)
+    { vshuff(v13, v12, r28)
+      v8.w = vadd(v9.w, v8.w)
+      v10.w = vadd(v11.w, v10.w)
+    }
+    vshuff(v15, v14, r28)
+    { vshuff(v17, v16, r28)
+      v12.w = vadd(v13.w, v12.w)
+      v14.w = vadd(v15.w, v14.w)
+    }
+    vshuff(v19, v18, r28)
+    { vshuff(v21, v20, r28)
+      v16.w = vadd(v17.w, v16.w)
+      v18.w = vadd(v19.w, v18.w)
+    }
+    vshuff(v23, v22, r28)
+    { vshuff(v25, v24, r28)
+      v20.w = vadd(v21.w, v20.w)
+      v22.w = vadd(v23.w, v22.w)
+    }
+    vshuff(v27, v26, r28)
+    { vshuff(v29, v28, r28)
+      v24.w = vadd(v25.w, v24.w)
+      v26.w = vadd(v27.w, v26.w)
+    }
+    vshuff(v31, v30, r28)
+    { v28.w = vadd(v29.w, v28.w)
+      vshuff(v2, v0, r7)
+    }
+    { v30.w = vadd(v31.w, v30.w)
+      vshuff(v6, v4, r7)
+      v0.w  = vadd(v0.w, v2.w)
+    }
+    { vshuff(v10, v8, r7)
+      v1.tmp = vmem(r4 + #0)      /* update hist[0-31] */
+      v0.w  = vadd(v0.w, v1.w)
+      vmem(r4++#1) = v0.new
+    }
+    { vshuff(v14, v12, r7)
+      v4.w  = vadd(v4.w, v6.w)
+      v8.w  = vadd(v8.w, v10.w)
+    }
+    { vshuff(v18, v16, r7)
+      v1.tmp = vmem(r4 + #0)      /* update hist[32-63] */
+      v4.w  = vadd(v4.w, v1.w)
+      vmem(r4++#1) = v4.new
+    }
+    { vshuff(v22, v20, r7)
+      v12.w = vadd(v12.w, v14.w)
+      v16.w = vadd(v16.w, v18.w)
+    }
+    { vshuff(v26, v24, r7)
+      v1.tmp = vmem(r4 + #0)      /* update hist[64-95] */
+      v8.w  = vadd(v8.w, v1.w)
+      vmem(r4++#1) = v8.new
+    }
+    { vshuff(v30, v28, r7)
+      v1.tmp = vmem(r4 + #0)      /* update hist[96-127] */
+      v12.w  = vadd(v12.w, v1.w)
+      vmem(r4++#1) = v12.new
+    }
+
+    { v20.w = vadd(v20.w, v22.w)
+      v1.tmp = vmem(r4 + #0)      /* update hist[128-159] */
+      v16.w  = vadd(v16.w, v1.w)
+      vmem(r4++#1) = v16.new
+    }
+    { v24.w = vadd(v24.w, v26.w)
+      v1.tmp = vmem(r4 + #0)      /* update hist[160-191] */
+      v20.w  = vadd(v20.w, v1.w)
+      vmem(r4++#1) = v20.new
+    }
+    { v28.w = vadd(v28.w, v30.w)
+      v1.tmp = vmem(r4 + #0)      /* update hist[192-223] */
+      v24.w  = vadd(v24.w, v1.w)
+      vmem(r4++#1) = v24.new
+    }
+    { v1.tmp = vmem(r4 + #0)      /* update hist[224-255] */
+      v28.w  = vadd(v28.w, v1.w)
+      vmem(r4++#1) = v28.new
+    }
+    jumpr r31
+    .size hvx_histogram_row, .-hvx_histogram_row
-- 
2.7.4
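
A note on the arithmetic in this test: hvx_histogram.c pads each row to a
multiple of the 128-byte vector length, and hvx_histogram_row.S splits the
width into whole vectors plus a masked tail. The following is a minimal C
sketch of that arithmetic, not code from the patch. The helper names are
hypothetical; only the constants (128, 8192) and the expressions are taken
from the files above.

```c
/* Hypothetical helpers mirroring expressions in the hvx_histogram test. */
enum { VECTOR_LEN = 128 };      /* HVX vector length in bytes */
enum { ROW_BUDGET = 8192 };     /* constant used by hvx_histogram() */

/* stride = (width + vector_len - 1) & -vector_len
 * Rounds width up to a multiple of VECTOR_LEN (a power of two). */
static inline int round_up_to_vector(int width)
{
    return (width + VECTOR_LEN - 1) & -VECTOR_LEN;
}

/* Whole vectors per row: r2 = lsr(r2, #7) in the assembly. */
static inline int full_vectors(int width)
{
    return width >> 7;
}

/* Leftover bytes per row: r5 = and(r2, #127) in the assembly. */
static inline int tail_bytes(int width)
{
    return width & (VECTOR_LEN - 1);
}

/* Rows handed to one hvx_histogram_row() call: n = 8192 / width. */
static inline int rows_per_call(int width)
{
    return ROW_BUDGET / width;
}

/* Rows for the call starting at row i: k = min(n, height - i). */
static inline int rows_this_call(int height, int i, int n)
{
    return height - i > n ? n : height - i;
}
```

With width = 275 and height = 20 as in the test, the stride is 384, each row
has 2 whole vectors plus a 19-byte tail (the tail is the part covered by
vhist(q0)), and 8192 / 275 = 29, so the whole input goes through a single
hvx_histogram_row call with k = 20.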



* Re: [PATCH v4 13/30] Hexagon HVX (target/hexagon) helper overrides infrastructure
  2021-10-12 10:10 ` [PATCH v4 13/30] Hexagon HVX (target/hexagon) helper overrides infrastructure Taylor Simpson
@ 2021-10-29 16:48   ` Philippe Mathieu-Daudé
  2021-10-29 19:00   ` Richard Henderson
  1 sibling, 0 replies; 45+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-29 16:48 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, richard.henderson

On 10/12/21 12:10, Taylor Simpson wrote:
> Build the infrastructure to create overrides for HVX instructions.
> We create a new empty file (gen_tcg_hvx.h) that will be populated
> in subsequent patches.
> 
> Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
> ---
>  target/hexagon/gen_tcg_hvx.h        | 21 +++++++++++++++++++++
>  target/hexagon/genptr.c             |  1 +
>  target/hexagon/gen_helper_funcs.py  |  3 ++-
>  target/hexagon/gen_helper_protos.py |  3 ++-
>  target/hexagon/gen_tcg_funcs.py     |  3 ++-
>  target/hexagon/meson.build          | 13 +++++++------
>  6 files changed, 35 insertions(+), 9 deletions(-)
>  create mode 100644 target/hexagon/gen_tcg_hvx.h

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>



* Re: [PATCH v4 10/30] Hexagon HVX (target/hexagon) instruction utility functions
  2021-10-12 10:10 ` [PATCH v4 10/30] Hexagon HVX (target/hexagon) instruction utility functions Taylor Simpson
@ 2021-10-29 18:53   ` Richard Henderson
  2021-10-29 23:37     ` Taylor Simpson
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 18:53 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:10 AM, Taylor Simpson wrote:
> +void mem_vector_scatter_init(CPUHexagonState *env, int slot,
> +                             target_ulong base_vaddr,
> +                             int length, int element_size)
> +{
> +    int i;
> +
> +    for (i = 0; i < sizeof(MMVector); i++) {
> +        env->vtcm_log.data.ub[i] = 0;
> +    }
> +    bitmap_zero(env->vtcm_log.mask, MAX_VEC_SIZE_BYTES);
> +
> +    env->vtcm_pending = true;
> +    env->vtcm_log.op = false;
> +    env->vtcm_log.op_size = 0;
> +    env->vtcm_log.size = sizeof(MMVector);

Init really wants size != 0 here?  Because it's not that way for gather...

Otherwise it looks like you want

     memset(&env->vtcm_log, 0, sizeof(env->vtcm_log));


> +void mem_vector_gather_init(CPUHexagonState *env,
> +                            target_ulong base_vaddr,
> +                            int length, int element_size)
> +{
> +    int i;
> +
> +    for (i = 0; i < sizeof(MMVector); i++) {
> +        env->vtcm_log.data.ub[i] = 0;
> +        env->vtcm_log.va[i] = 0;
> +        env->tmp_VRegs[0].ub[i] = 0;
> +    }
> +    bitmap_zero(env->vtcm_log.mask, MAX_VEC_SIZE_BYTES / 8);
> +    env->vtcm_log.op = false;
> +    env->vtcm_log.op_size = 0;
> +}

Likewise memset of vtcm_log, with a second memset for tmp_VRegs[0].
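
A minimal sketch of the two-memset form suggested above. The types here are simplified stand-ins, not the real QEMU definitions: MockEnv, the field layout of VTCMStoreLog, the value of MAX_VEC_SIZE_BYTES, and the helper names gather_init_memset and all_zero are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed value; the source only says HVX vectors are 128 bytes wide. */
#define MAX_VEC_SIZE_BYTES 128

/* Hypothetical stand-in for the vector register type. */
typedef union {
    uint8_t ub[MAX_VEC_SIZE_BYTES];
} MMVector;

/* Hypothetical stand-in for the scatter/gather log; layout is assumed. */
typedef struct {
    MMVector data;
    /* One mask bit per vector byte. */
    unsigned long mask[MAX_VEC_SIZE_BYTES / (8 * sizeof(unsigned long))];
    uint32_t va[MAX_VEC_SIZE_BYTES];
    bool op;
    int op_size;
} VTCMStoreLog;

/* Hypothetical stand-in for the parts of CPUHexagonState used here. */
typedef struct {
    VTCMStoreLog vtcm_log;
    MMVector tmp_VRegs[1];
} MockEnv;

/* Zero the whole log and the first temp register with two memset calls. */
static void gather_init_memset(MockEnv *env)
{
    memset(&env->vtcm_log, 0, sizeof(env->vtcm_log));
    memset(&env->tmp_VRegs[0], 0, sizeof(env->tmp_VRegs[0]));
}

/* Return true if every byte of the object at p is zero. */
static bool all_zero(const void *p, size_t n)
{
    const unsigned char *b = p;

    for (size_t i = 0; i < n; i++) {
        if (b[i] != 0) {
            return false;
        }
    }
    return true;
}
```

Compared with the per-element loop plus bitmap_zero, this form does not depend on each field's size being spelled out correctly, since sizeof covers the whole structure.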


r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 11/30] Hexagon HVX (target/hexagon) helper functions
  2021-10-12 10:10 ` [PATCH v4 11/30] Hexagon HVX (target/hexagon) helper functions Taylor Simpson
@ 2021-10-29 18:58   ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 18:58 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:10 AM, Taylor Simpson wrote:
> Probe and commit vector stores (masked and scatter/gather)
> Log vector register writes
> Add the execution counters to the debug log
> Histogram instructions
> 
> Signed-off-by: Taylor Simpson<tsimpson@quicinc.com>
> ---
>   target/hexagon/helper.h    |  16 +++
>   target/hexagon/op_helper.c | 282 ++++++++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 296 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 12/30] Hexagon HVX (target/hexagon) TCG generation
  2021-10-12 10:10 ` [PATCH v4 12/30] Hexagon HVX (target/hexagon) TCG generation Taylor Simpson
@ 2021-10-29 18:59   ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 18:59 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:10 AM, Taylor Simpson wrote:
> Signed-off-by: Taylor Simpson<tsimpson@quicinc.com>
> ---
>   target/hexagon/translate.h |  61 ++++++++++++
>   target/hexagon/genptr.c    |  15 +++
>   target/hexagon/translate.c | 243 ++++++++++++++++++++++++++++++++++++++++++++-
>   3 files changed, 315 insertions(+), 4 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 13/30] Hexagon HVX (target/hexagon) helper overrides infrastructure
  2021-10-12 10:10 ` [PATCH v4 13/30] Hexagon HVX (target/hexagon) helper overrides infrastructure Taylor Simpson
  2021-10-29 16:48   ` Philippe Mathieu-Daudé
@ 2021-10-29 19:00   ` Richard Henderson
  1 sibling, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 19:00 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:10 AM, Taylor Simpson wrote:
> Build the infrastructure to create overrides for HVX instructions.
> We create a new empty file (gen_tcg_hvx.h) that will be populated
> in subsequent patches.
> 
> Signed-off-by: Taylor Simpson<tsimpson@quicinc.com>
> ---
>   target/hexagon/gen_tcg_hvx.h        | 21 +++++++++++++++++++++
>   target/hexagon/genptr.c             |  1 +
>   target/hexagon/gen_helper_funcs.py  |  3 ++-
>   target/hexagon/gen_helper_protos.py |  3 ++-
>   target/hexagon/gen_tcg_funcs.py     |  3 ++-
>   target/hexagon/meson.build          | 13 +++++++------
>   6 files changed, 35 insertions(+), 9 deletions(-)
>   create mode 100644 target/hexagon/gen_tcg_hvx.h

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 14/30] Hexagon HVX (target/hexagon) helper overrides for histogram instructions
  2021-10-12 10:10 ` [PATCH v4 14/30] Hexagon HVX (target/hexagon) helper overrides for histogram instructions Taylor Simpson
@ 2021-10-29 19:04   ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 19:04 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:10 AM, Taylor Simpson wrote:
> Signed-off-by: Taylor Simpson<tsimpson@quicinc.com>
> ---
>   target/hexagon/gen_tcg_hvx.h | 106 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 106 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 19/30] Hexagon HVX (target/hexagon) helper overrides - vector logical ops
  2021-10-12 10:10 ` [PATCH v4 19/30] Hexagon HVX (target/hexagon) helper overrides - vector logical ops Taylor Simpson
@ 2021-10-29 19:06   ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 19:06 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:10 AM, Taylor Simpson wrote:
> Signed-off-by: Taylor Simpson<tsimpson@quicinc.com>
> ---
>   target/hexagon/gen_tcg_hvx.h | 42 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 42 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 26/30] Hexagon HVX (target/hexagon) import instruction encodings
  2021-10-12 10:11 ` [PATCH v4 26/30] Hexagon HVX (target/hexagon) import instruction encodings Taylor Simpson
@ 2021-10-29 19:08   ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 19:08 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:11 AM, Taylor Simpson wrote:
> Signed-off-by: Taylor Simpson<tsimpson@quicinc.com>
> ---
>   target/hexagon/decode.c                      |   4 +
>   target/hexagon/imported/allextenc.def        |  20 +
>   target/hexagon/imported/encode.def           |   1 +
>   target/hexagon/imported/mmvec/encode_ext.def | 794 +++++++++++++++++++++++++++
>   4 files changed, 819 insertions(+)
>   create mode 100644 target/hexagon/imported/allextenc.def
>   create mode 100644 target/hexagon/imported/mmvec/encode_ext.def

Acked-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 27/30] Hexagon HVX (tests/tcg/hexagon) vector_add_int test
  2021-10-12 10:11 ` [PATCH v4 27/30] Hexagon HVX (tests/tcg/hexagon) vector_add_int test Taylor Simpson
@ 2021-10-29 19:10   ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 19:10 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:11 AM, Taylor Simpson wrote:
> Signe-off-by: Taylor Simpson<tsimpson@quicinc.com>

"Signed"

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 28/30] Hexagon HVX (tests/tcg/hexagon) hvx_misc test
  2021-10-12 10:11 ` [PATCH v4 28/30] Hexagon HVX (tests/tcg/hexagon) hvx_misc test Taylor Simpson
@ 2021-10-29 19:11   ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 19:11 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:11 AM, Taylor Simpson wrote:
> Tests for
>      packet semantics
>      vector loads (aligned and unaligned)
>      vector stores (aligned and unaligned)
>      vector masked stores
>      vector new value store
>      maximum HVX temps in a packet
>      vector operations
> 
> Signed-off-by: Taylor Simpson<tsimpson@quicinc.com>
> ---
>   tests/tcg/hexagon/hvx_misc.c      | 469 ++++++++++++++++++++++++++++++++++++++
>   tests/tcg/hexagon/Makefile.target |   2 +
>   2 files changed, 471 insertions(+)
>   create mode 100644 tests/tcg/hexagon/hvx_misc.c

Acked-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 29/30] Hexagon HVX (tests/tcg/hexagon) scatter_gather test
  2021-10-12 10:11 ` [PATCH v4 29/30] Hexagon HVX (tests/tcg/hexagon) scatter_gather test Taylor Simpson
@ 2021-10-29 19:13   ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 19:13 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:11 AM, Taylor Simpson wrote:
> Signed-off-by: Taylor Simpson<tsimpson@quicinc.com>
> ---
>   tests/tcg/hexagon/scatter_gather.c | 1011 ++++++++++++++++++++++++++++++++++++
>   tests/tcg/hexagon/Makefile.target  |    2 +
>   2 files changed, 1013 insertions(+)
>   create mode 100644 tests/tcg/hexagon/scatter_gather.c

Acked-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 30/30] Hexagon HVX (tests/tcg/hexagon) histogram test
  2021-10-12 10:11 ` [PATCH v4 30/30] Hexagon HVX (tests/tcg/hexagon) histogram test Taylor Simpson
@ 2021-10-29 19:15   ` Richard Henderson
  2021-10-29 19:18     ` Taylor Simpson
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2021-10-29 19:15 UTC (permalink / raw)
  To: Taylor Simpson, qemu-devel; +Cc: ale, bcain, f4bug

On 10/12/21 3:11 AM, Taylor Simpson wrote:
> Signe-off-by: Taylor Simpson<tsimpson@quicinc.com>

Signed.

Second instance that I've noticed; you might grep for it across your patches, just to be sure.

Acked-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v4 30/30] Hexagon HVX (tests/tcg/hexagon) histogram test
  2021-10-29 19:15   ` Richard Henderson
@ 2021-10-29 19:18     ` Taylor Simpson
  0 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-29 19:18 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: ale, Brian Cain, f4bug



> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Friday, October 29, 2021 2:15 PM
> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
> Cc: f4bug@amsat.org; ale@rev.ng; Brian Cain <bcain@quicinc.com>
> Subject: Re: [PATCH v4 30/30] Hexagon HVX (tests/tcg/hexagon) histogram
> test
> 
> On 10/12/21 3:11 AM, Taylor Simpson wrote:
> > Signe-off-by: Taylor Simpson<tsimpson@quicinc.com>
> 
> Signed.
> 
> Second instance that I've noticed; you might grep for it across your patches,
> just to be sure.

I will double-check.  Weird this would happen because I always use git commit -s.

Taylor


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v4 10/30] Hexagon HVX (target/hexagon) instruction utility functions
  2021-10-29 18:53   ` Richard Henderson
@ 2021-10-29 23:37     ` Taylor Simpson
  0 siblings, 0 replies; 45+ messages in thread
From: Taylor Simpson @ 2021-10-29 23:37 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: ale, Brian Cain, f4bug


> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Friday, October 29, 2021 1:53 PM
> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
> Cc: f4bug@amsat.org; ale@rev.ng; Brian Cain <bcain@quicinc.com>
> Subject: Re: [PATCH v4 10/30] Hexagon HVX (target/hexagon) instruction
> utility functions
> 
> On 10/12/21 3:10 AM, Taylor Simpson wrote:
> > +void mem_vector_scatter_init(CPUHexagonState *env, int slot,
> > +                             target_ulong base_vaddr,
> > +                             int length, int element_size)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < sizeof(MMVector); i++) {
> > +        env->vtcm_log.data.ub[i] = 0;
> > +    }
> > +    bitmap_zero(env->vtcm_log.mask, MAX_VEC_SIZE_BYTES);
> > +
> > +    env->vtcm_pending = true;
> > +    env->vtcm_log.op = false;
> > +    env->vtcm_log.op_size = 0;
> > +    env->vtcm_log.size = sizeof(MMVector);
> 
> Init really wants size != 0 here?  Because it's not that way for gather...

The vtcm_log.size is only used during packet commit when there is a scatter.  It's not used for gather.

Since it's always sizeof(MMVector), I will remove it and replace all the uses with the constant value.

Also, op and op_size are only used for scatter, so I'll remove the initialization from mem_vector_gather_init.

> 
> Otherwise it looks like you want
> 
>      memset(&env->vtcm_log, 0, sizeof(env->vtcm_log));

Actually, this initialization is not needed because the values will be overwritten by the instruction.  So, I'll remove it.

> 
> Likewise memset of vtcm_log, with a second memset for tmp_VRegs[0].

Ditto.


Thanks a ton for the reviews!  This is the last feedback to be addressed on this series.  I'll send v5 out tonight for your review.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 45+ messages in thread
