* [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support
@ 2018-08-09  8:15 Jan Beulich
  2018-08-09  8:23 ` [PATCH 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
                   ` (12 more replies)
  0 siblings, 13 replies; 465+ messages in thread
From: Jan Beulich @ 2018-08-09  8:15 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

1: fix FMA scalar operand sizes
2: extend MASKMOV{Q,DQU} tests
3: support AVX512 opmask insns
4: clean up AVX2 insn use in test harness
5: correct EVEX decoding
6: generalize vector length handling for AVX512/EVEX

While I also have ready a patch emulating the basic AVX512 moves,
its prereq to widen respective fields in HVM code to allow for 64-byte
operand sizes has a dependency on the "x86/HVM: implement memory
read caching" series, which looks to be stalled, and I prefer to avoid
posting further patches with dependencies on stalled series.

Some support beyond moves is also mostly ready, but it's not quite
enough yet to enable testing that code in the harness, and I'd rather
not post code that I haven't pushed through the harness yet. The
partial testing I've been doing means, however, that in particular the
last two patches here have been tested already, even if no such
testing would be possible for others until further patches actually get
posted.

Jan




* [PATCH 1/6] x86emul: fix FMA scalar operand sizes
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
@ 2018-08-09  8:23 ` Jan Beulich
  2018-08-09  8:24 ` [PATCH 2/6] x86emul: extend MASKMOV{Q,DQU} tests Jan Beulich
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-08-09  8:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

FMA insns, unlike the earlier AVX additions, don't use the low
opcode bit to distinguish between single and double vector elements.
While the difference is benign for packed flavors, the scalar ones
need to use VEX.W here. Oddly enough the table entries didn't even use
simd_scalar_fp, but uniformly used simd_packed_fp (implying the
distinction was made by [VEX-encoded] opcode prefix).
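
For exposition only (not part of the patch; the helper names are made
up), the two size derivations differ like this:

    #include <stdbool.h>
    #include <stdint.h>

    /* AVX and earlier scalars, e.g. roundss (0f3a /0x0a) vs. roundsd
     * (0f3a /0x0b): element width follows the low opcode bit. */
    static unsigned int scalar_bytes_by_opcode(uint8_t opcode)
    {
        return 4 << (opcode & 1);   /* 4 or 8 */
    }

    /* FMA scalars, e.g. vfmadd132ss and vfmadd132sd, which share
     * opcode 0f38 /0x99: element width follows VEX.W instead. */
    static unsigned int scalar_bytes_by_vexw(bool vex_w)
    {
        return 4 << vex_w;          /* 4 or 8 */
    }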

Split simd_scalar_fp into simd_scalar_opc and simd_scalar_vexw, and
correct the scalar FMA table entries to use the latter.

Also correct the scalar insn comments (they only ever use XMM registers
as operands).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -224,7 +224,13 @@ enum simd_opsize {
      * - 32 bits with low opcode bit clear (scalar single)
      * - 64 bits with low opcode bit set (scalar double)
      */
-    simd_scalar_fp,
+    simd_scalar_opc,
+
+    /*
+     * Scalar floating point:
+     * - 32/64 bits depending on VEX.W
+     */
+    simd_scalar_vexw,
 
     /*
      * 128 bits of integer or floating point data, with no further
@@ -407,7 +413,7 @@ static const struct ext0f38_table {
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x18 ... 0x19] = { .simd_size = simd_scalar_fp, .two_op = 1 },
+    [0x18 ... 0x19] = { .simd_size = simd_scalar_opc, .two_op = 1 },
     [0x1a] = { .simd_size = simd_128, .two_op = 1 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
@@ -427,9 +433,30 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_other },
     [0x8e] = { .simd_size = simd_other, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
-    [0x96 ... 0x9f] = { .simd_size = simd_packed_fp },
-    [0xa6 ... 0xaf] = { .simd_size = simd_packed_fp },
-    [0xb6 ... 0xbf] = { .simd_size = simd_packed_fp },
+    [0x96 ... 0x98] = { .simd_size = simd_packed_fp },
+    [0x99] = { .simd_size = simd_scalar_vexw },
+    [0x9a] = { .simd_size = simd_packed_fp },
+    [0x9b] = { .simd_size = simd_scalar_vexw },
+    [0x9c] = { .simd_size = simd_packed_fp },
+    [0x9d] = { .simd_size = simd_scalar_vexw },
+    [0x9e] = { .simd_size = simd_packed_fp },
+    [0x9f] = { .simd_size = simd_scalar_vexw },
+    [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp },
+    [0xa9] = { .simd_size = simd_scalar_vexw },
+    [0xaa] = { .simd_size = simd_packed_fp },
+    [0xab] = { .simd_size = simd_scalar_vexw },
+    [0xac] = { .simd_size = simd_packed_fp },
+    [0xad] = { .simd_size = simd_scalar_vexw },
+    [0xae] = { .simd_size = simd_packed_fp },
+    [0xaf] = { .simd_size = simd_scalar_vexw },
+    [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp },
+    [0xb9] = { .simd_size = simd_scalar_vexw },
+    [0xba] = { .simd_size = simd_packed_fp },
+    [0xbb] = { .simd_size = simd_scalar_vexw },
+    [0xbc] = { .simd_size = simd_packed_fp },
+    [0xbd] = { .simd_size = simd_scalar_vexw },
+    [0xbe] = { .simd_size = simd_packed_fp },
+    [0xbf] = { .simd_size = simd_scalar_vexw },
     [0xc8 ... 0xcd] = { .simd_size = simd_other },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
@@ -454,7 +481,7 @@ static const struct ext0f3a_table {
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1 },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
-    [0x0a ... 0x0b] = { .simd_size = simd_scalar_fp },
+    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
     [0x14 ... 0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1 },
@@ -476,13 +503,13 @@ static const struct ext0f3a_table {
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
-    [0x6a ... 0x6b] = { .simd_size = simd_scalar_fp, .four_op = 1 },
+    [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
-    [0x6e ... 0x6f] = { .simd_size = simd_scalar_fp, .four_op = 1 },
+    [0x6e ... 0x6f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_packed_fp, .four_op = 1 },
-    [0x7a ... 0x7b] = { .simd_size = simd_scalar_fp, .four_op = 1 },
+    [0x7a ... 0x7b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x7c ... 0x7d] = { .simd_size = simd_packed_fp, .four_op = 1 },
-    [0x7e ... 0x7f] = { .simd_size = simd_scalar_fp, .four_op = 1 },
+    [0x7e ... 0x7f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0xcc] = { .simd_size = simd_other },
     [0xdf] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xf0] = {},
@@ -518,7 +545,7 @@ static const struct ext8f09_table {
 } ext8f09_table[256] = {
     [0x01 ... 0x02] = { .two_op = 1 },
     [0x80 ... 0x81] = { .simd_size = simd_packed_fp, .two_op = 1 },
-    [0x82 ... 0x83] = { .simd_size = simd_scalar_fp, .two_op = 1 },
+    [0x82 ... 0x83] = { .simd_size = simd_scalar_opc, .two_op = 1 },
     [0x90 ... 0x9b] = { .simd_size = simd_packed_int },
     [0xc1 ... 0xc3] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xc6 ... 0xc7] = { .simd_size = simd_packed_int, .two_op = 1 },
@@ -3132,10 +3159,14 @@ x86_decode(
         }
         break;
 
-    case simd_scalar_fp:
+    case simd_scalar_opc:
         op_bytes = 4 << (ctxt->opcode & 1);
         break;
 
+    case simd_scalar_vexw:
+        op_bytes = 4 << vex.w;
+        break;
+
     case simd_128:
         op_bytes = 16;
         break;
@@ -7747,33 +7778,33 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm */
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 





* [PATCH 2/6] x86emul: extend MASKMOV{Q,DQU} tests
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
  2018-08-09  8:23 ` [PATCH 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
@ 2018-08-09  8:24 ` Jan Beulich
  2018-08-09  8:24 ` [PATCH 3/6] x86emul: support AVX512 opmask insns Jan Beulich
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-08-09  8:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

While deriving the first AVX512 pieces from existing code I got the
(ultimately wrong) impression that the emulation of these insns was
broken. Besides testing that the instructions act as no-ops when the
controlling mask bits are all zero, add tests to also check that the
data merging actually works.
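
As a reference point, a minimal C model of the merging behavior under
test (illustrative only, not part of the patch; MASKMOVDQU acts the
same way on 16 bytes):

    #include <stdint.h>

    /* MASKMOVQ: byte i of src is stored only when the top bit of mask
     * byte i is set; all other destination bytes stay untouched. */
    static void maskmovq_model(uint8_t *dst, const uint8_t src[8],
                               const uint8_t mask[8])
    {
        for ( unsigned int i = 0; i < 8; i++ )
            if ( mask[i] & 0x80 )
                dst[i] = src[i];
    }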

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -2626,7 +2626,7 @@ int main(int argc, char **argv)
         printf("skipped\n");
 #endif
 
-    printf("%-40s", "Testing maskmovq (zero mask)...");
+    printf("%-40s", "Testing maskmovq %mm4,%mm4...");
     if ( stack_exec && cpu_has_sse )
     {
         decl_insn(maskmovq);
@@ -2639,12 +2639,25 @@ int main(int argc, char **argv)
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(maskmovq) )
             goto fail;
+
+        asm volatile ( "pcmpeqb %mm3, %mm3\n\t"
+                       "punpcklbw %mm3, %mm4\n" );
+        memset(res, 0x55, 24);
+
+        set_insn(maskmovq);
+        regs.edi = (unsigned long)(res + 2);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(maskmovq) ||
+             memcmp(res, res + 4, 8) ||
+             res[2] != 0xff55ff55 || res[3] != 0xff55ff55 )
+            goto fail;
+
         printf("okay\n");
     }
     else
         printf("skipped\n");
 
-    printf("%-40s", "Testing maskmovdqu (zero mask)...");
+    printf("%-40s", "Testing maskmovdqu %xmm3,%xmm3...");
     if ( stack_exec && cpu_has_sse2 )
     {
         decl_insn(maskmovdqu);
@@ -2653,9 +2666,24 @@ int main(int argc, char **argv)
                        put_insn(maskmovdqu, "maskmovdqu %xmm3, %xmm3") );
 
         set_insn(maskmovdqu);
+        regs.edi = 0;
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(maskmovdqu) )
             goto fail;
+
+        asm volatile ( "pcmpeqb %xmm4, %xmm4\n\t"
+                       "punpcklbw %xmm4, %xmm3\n" );
+        memset(res, 0x55, 48);
+
+        set_insn(maskmovdqu);
+        regs.edi = (unsigned long)(res + 4);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(maskmovdqu) ||
+             memcmp(res, res + 8, 16) ||
+             res[4] != 0xff55ff55 || res[5] != 0xff55ff55 ||
+             res[6] != 0xff55ff55 || res[7] != 0xff55ff55 )
+            goto fail;
+
         printf("okay\n");
     }
     else






* [PATCH 3/6] x86emul: support AVX512 opmask insns
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
  2018-08-09  8:23 ` [PATCH 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
  2018-08-09  8:24 ` [PATCH 2/6] x86emul: extend MASKMOV{Q,DQU} tests Jan Beulich
@ 2018-08-09  8:24 ` Jan Beulich
  2018-08-09  8:25 ` [PATCH 4/6] x86emul: clean up AVX2 insn use in test harness Jan Beulich
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-08-09  8:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

These are all VEX encoded, so the EVEX decoding logic remains unused
at this point.

The new testcase is deliberately coded in assembly, as a C one would
have become almost unreadable due to the overwhelming number of
__builtin_...() invocations that would be needed. After all, the
compiler has no underlying type (yet) that could be operated on
without builtins, other than the vector types used for "normal" SIMD
insns.
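
For comparison, a hypothetical C rendition of just the testcase's
first identity check, using the GCC/Clang AVX512F mask intrinsics on a
16-bit mask (every single operation needs such an intrinsic, hence the
preference for assembly):

    #include <immintrin.h>

    /* Checks (a|b) & ~(a&b) == a^b; returns zero on success. */
    static __mmask16 opmask_check(__mmask16 a, __mmask16 b)
    {
        __mmask16 lhs = _mm512_kandn(_mm512_kand(a, b),
                                     _mm512_kor(a, b));

        return _mm512_kxor(lhs, _mm512_kxor(a, b));
    }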

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,6 +16,8 @@ FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
 
+OPMASK := avx512f avx512dq avx512bw
+
 blowfish-cflags := ""
 blowfish-cflags-x86_32 := "-mno-accumulate-outgoing-args -Dstatic="
 
@@ -51,6 +53,10 @@ xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
 
+avx512f-opmask-vecs := 2
+avx512dq-opmask-vecs := 1
+avx512bw-opmask-vecs := 4 8
+
 # For AVX and later, have the compiler avoid XMM0 to widen coverage of
 # the VEX.vvvv checks in the emulator.  For 3DNow!, however, force SSE
 # use for floating point operations, to avoid mixing MMX and FPU register
@@ -80,9 +86,13 @@ $(1)-cflags := \
 	   $(foreach flt,$($(1)-flts), \
 	     "-D_$(vec)x$(idx)f$(flt) -m$(1:-sg=) $(call non-sse,$(1)) -Os -DVEC_MAX=$(vec) -DIDX_SIZE=$(idx) -DFLOAT_SIZE=$(flt)")))
 endef
+define opmask-defs
+$(1)-opmask-cflags := $(foreach vec,$($(1)-opmask-vecs), "-D_$(vec) -m$(1) -Os -DSIZE=$(vec)")
+endef
 
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
+$(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
 $(addsuffix .h,$(TESTCASES)): %.h: %.c testcase.mk Makefile
 	rm -f $@.new $*.bin
@@ -100,6 +110,22 @@ $(addsuffix .h,$(TESTCASES)): %.h: %.c t
 	)
 	mv $@.new $@
 
+$(addsuffix -opmask.h,$(OPMASK)): %.h: opmask.S testcase.mk Makefile
+	rm -f $@.new $*.bin
+	$(foreach arch,$(filter-out $(XEN_COMPILE_ARCH),x86_32) $(XEN_COMPILE_ARCH), \
+	    for cflags in $($*-cflags) $($*-cflags-$(arch)); do \
+		$(MAKE) -f testcase.mk TESTCASE=$* XEN_TARGET_ARCH=$(arch) $*-cflags="$$cflags" all; \
+		prefix=$(shell echo $(subst -,_,$*) | sed -e 's,^\([0-9]\),_\1,'); \
+		flavor=$$(echo $${cflags} | sed -e 's, .*,,' -e 'y,-=,__,') ; \
+		(echo 'static const unsigned int __attribute__((section(".test, \"ax\", @progbits #")))' \
+		      "$${prefix}_$(arch)$${flavor}[] = {"; \
+		 od -v -t x $*.bin | sed -e 's/^[0-9]* /0x/' -e 's/ /, 0x/g' -e 's/$$/,/'; \
+		 echo "};") >>$@.new; \
+		rm -f $*.bin; \
+	    done; \
+	)
+	mv $@.new $@
+
 $(addsuffix .c,$(SIMD)):
 	ln -sf simd.c $@
 
@@ -145,4 +171,4 @@ x86-emulate.o test_x86_emulator.o wrappe
 x86-emulate.o: x86_emulate/x86_emulate.c
 x86-emulate.o: HOSTCFLAGS += -D__XEN_TOOLS__
 
-test_x86_emulator.o: $(addsuffix .h,$(TESTCASES))
+test_x86_emulator.o: $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
--- /dev/null
+++ b/tools/tests/x86_emulator/opmask.S
@@ -0,0 +1,144 @@
+#ifdef __i386__
+# define R(x) e##x
+# define DATA(x) x
+#else
+# if SIZE == 8
+#  define R(x) r##x
+# else
+#  define R(x) e##x
+# endif
+# define DATA(x) x(%rip)
+#endif
+
+#if SIZE == 1
+# define _(x) x##b
+#elif SIZE == 2
+# define _(x) x##w
+# define WIDEN(x) x##bw
+#elif SIZE == 4
+# define _(x) x##d
+# define WIDEN(x) x##wd
+#elif SIZE == 8
+# define _(x) x##q
+# define WIDEN(x) x##dq
+#endif
+
+    .macro check res1:req, res2:req, line:req
+    _(kmov)       %\res1, DATA(out)
+#if SIZE < 8 || !defined(__i386__)
+    _(kmov)       %\res2, %R(dx)
+    cmp           DATA(out), %R(dx)
+#else
+    sub           $8, %esp
+    kmovq         %\res2, (%esp)
+    pop           %ecx
+    pop           %edx
+    cmp           DATA(out), %ecx
+    jne           0f
+    cmp           DATA(out+4), %edx
+0:
+#endif
+    je            1f
+    mov           $\line, %eax
+    ret
+1:
+    .endm
+
+    .text
+    .globl _start
+_start:
+    _(kmov)       DATA(in1), %k1
+#if SIZE < 8 || !defined(__i386__)
+    mov           DATA(in2), %R(ax)
+    _(kmov)       %R(ax), %k2
+#else
+    _(kmov)       DATA(in2), %k2
+#endif
+
+    _(kor)        %k1, %k2, %k3
+    _(kand)       %k1, %k2, %k4
+    _(kandn)      %k3, %k4, %k5
+    _(kxor)       %k1, %k2, %k6
+    check         k5, k6, __LINE__
+
+    _(knot)       %k6, %k3
+    _(kxnor)      %k1, %k2, %k4
+    check         k3, k4, __LINE__
+
+    _(kshiftl)    $1, %k1, %k3
+    _(kshiftl)    $2, %k3, %k4
+    _(kshiftl)    $3, %k1, %k5
+    check         k4, k5, __LINE__
+
+    _(kshiftr)    $1, %k1, %k3
+    _(kshiftr)    $2, %k3, %k4
+    _(kshiftr)    $3, %k1, %k5
+    check         k4, k5, __LINE__
+
+    _(kortest)    %k6, %k6
+    jnbe          1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxor)       %k0, %k0, %k3
+    _(kortest)    %k3, %k3
+    jz            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxnor)      %k0, %k0, %k3
+    _(kortest)    %k3, %k3
+    jc            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+#if SIZE > 1
+
+    _(kshiftr)    $SIZE*4, %k3, %k4
+    WIDEN(kunpck) %k4, %k4, %k5
+    check         k3, k5, __LINE__
+
+#endif
+
+#if SIZE != 2 || defined(__AVX512DQ__)
+
+    _(kadd)       %k1, %k1, %k3
+    _(kshiftl)    $1, %k1, %k4
+    check         k3, k4, __LINE__
+
+    _(ktest)      %k2, %k1
+    jnbe          1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxor)       %k0, %k0, %k3
+    _(ktest)      %k0, %k3
+    jz            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxnor)      %k0, %k0, %k4
+    _(ktest)      %k0, %k4
+    jc            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+#endif
+
+    xor           %eax, %eax
+    ret
+
+    .section .rodata, "a", @progbits
+    .balign 8
+in1: .byte 0b10110011, 0b10001111, 0b00001111, 0b10000011, 0b11110000, 0b00111111, 0b10000000, 0b11111111
+in2: .byte 0b11111111, 0b00000001, 0b11111100, 0b00001111, 0b11000001, 0b11110000, 0b11110001, 0b11001101
+
+    .data
+    .balign 8
+out: .quad 0
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -18,6 +18,9 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx2.h"
 #include "avx2-sg.h"
 #include "xop.h"
+#include "avx512f-opmask.h"
+#include "avx512dq-opmask.h"
+#include "avx512bw-opmask.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -78,6 +81,24 @@ static bool simd_check_xop(void)
     return cpu_has_xop;
 }
 
+static bool simd_check_avx512f(void)
+{
+    return cpu_has_avx512f;
+}
+#define simd_check_avx512f_opmask simd_check_avx512f
+
+static bool simd_check_avx512dq(void)
+{
+    return cpu_has_avx512dq;
+}
+#define simd_check_avx512dq_opmask simd_check_avx512dq
+
+static bool simd_check_avx512bw(void)
+{
+    return cpu_has_avx512bw;
+}
+#define simd_check_avx512bw_opmask simd_check_avx512bw
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -223,6 +244,10 @@ static const struct {
     SIMD(XOP i16x16,              xop,      32i2),
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
+    SIMD(OPMASK/w,     avx512f_opmask,         2),
+    SIMD(OPMASK/b,    avx512dq_opmask,         1),
+    SIMD(OPMASK/d,    avx512bw_opmask,         4),
+    SIMD(OPMASK/q,    avx512bw_opmask,         8),
 #undef SIMD_
 #undef SIMD
 };
@@ -3469,8 +3494,8 @@ int main(int argc, char **argv)
             rc = x86_emulate(&ctxt, &emulops);
             if ( rc != X86EMUL_OKAY )
             {
-                printf("failed at %%eip == %08lx (opcode %08x)\n",
-                       (unsigned long)regs.eip, ctxt.opcode);
+                printf("failed (%d) at %%eip == %08lx (opcode %08x)\n",
+                       rc, (unsigned long)regs.eip, ctxt.opcode);
                 return 1;
             }
         }
--- a/tools/tests/x86_emulator/testcase.mk
+++ b/tools/tests/x86_emulator/testcase.mk
@@ -14,3 +14,9 @@ all: $(TESTCASE).bin
 	$(LD) $(LDFLAGS_DIRECT) -N -Ttext 0x100000 -o $*.tmp $*.o
 	$(OBJCOPY) -O binary $*.tmp $@
 	rm -f $*.tmp
+
+%-opmask.bin: opmask.S
+	$(CC) $(filter-out -M% .%,$(CFLAGS)) -c $< -o $*.o
+	$(LD) $(LDFLAGS_DIRECT) -N -Ttext 0x100000 -o $*.tmp $*.o
+	$(OBJCOPY) -O binary $*.tmp $@
+	rm -f $*.tmp
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -209,6 +209,9 @@ int emul_test_get_fpu(
     case X86EMUL_FPU_ymm:
         if ( cpu_has_avx )
             break;
+    case X86EMUL_FPU_opmask:
+        if ( cpu_has_avx512f )
+            break;
     default:
         return X86EMUL_UNHANDLEABLE;
     }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -236,6 +236,36 @@ static inline uint64_t xgetbv(uint32_t x
     (res.c & (1U << 21)) != 0; \
 })
 
+#define cpu_has_avx512f ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 16)) != 0; \
+})
+
+#define cpu_has_avx512dq ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 17)) != 0; \
+})
+
+#define cpu_has_avx512bw ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 30)) != 0; \
+})
+
 int emul_test_cpuid(
     uint32_t leaf,
     uint32_t subleaf,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -491,6 +491,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
+    [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
@@ -1187,6 +1188,11 @@ static int _get_fpu(
             return X86EMUL_UNHANDLEABLE;
         break;
 
+    case X86EMUL_FPU_opmask:
+        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
+            return X86EMUL_UNHANDLEABLE;
+        break;
+
     default:
         break;
     }
@@ -1762,12 +1768,15 @@ static bool vcpu_has(
 #define vcpu_has_bmi2()        vcpu_has(         7, EBX,  8, ctxt, ops)
 #define vcpu_has_rtm()         vcpu_has(         7, EBX, 11, ctxt, ops)
 #define vcpu_has_mpx()         vcpu_has(         7, EBX, 14, ctxt, ops)
+#define vcpu_has_avx512f()     vcpu_has(         7, EBX, 16, ctxt, ops)
+#define vcpu_has_avx512dq()    vcpu_has(         7, EBX, 17, ctxt, ops)
 #define vcpu_has_rdseed()      vcpu_has(         7, EBX, 18, ctxt, ops)
 #define vcpu_has_adx()         vcpu_has(         7, EBX, 19, ctxt, ops)
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
+#define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -2396,6 +2405,18 @@ x86_decode_twobyte(
         }
         break;
 
+    case X86EMUL_OPC_VEX(0, 0x90):    /* kmov{w,q} */
+    case X86EMUL_OPC_VEX_66(0, 0x90): /* kmov{b,d} */
+        state->desc = DstReg | SrcMem | Mov;
+        state->simd_size = simd_other;
+        break;
+
+    case X86EMUL_OPC_VEX(0, 0x91):    /* kmov{w,q} */
+    case X86EMUL_OPC_VEX_66(0, 0x91): /* kmov{b,d} */
+        state->desc = DstMem | SrcReg | Mov;
+        state->simd_size = simd_other;
+        break;
+
     case 0xae:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
@@ -6002,6 +6023,60 @@ x86_emulate(
             dst.val = src.val;
         break;
 
+    case X86EMUL_OPC_VEX(0x0f, 0x4a):    /* kadd{w,q} k,k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x41):    /* kand{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x41): /* kand{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x42):    /* kandn{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x42): /* kandn{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x45):    /* kor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x45): /* kor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x46):    /* kxnor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x46): /* kxnor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x47):    /* kxor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x47): /* kxor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x4a): /* kadd{b,d} k,k,k */
+        generate_exception_if(!vex.l, EXC_UD);
+    opmask_basic:
+        if ( vex.w )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( vex.pfx )
+            host_and_vcpu_must_have(avx512dq);
+    opmask_common:
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(!vex.r || (mode_64bit() && !(vex.reg & 8)) ||
+                              ea.type != OP_REG, EXC_UD);
+
+        vex.reg |= 8;
+        d &= ~TwoOp;
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        insn_bytes = PFX_BYTES + 2;
+
+        state->simd_size = simd_other;
+        op_bytes = 1; /* Any non-zero value will do. */
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x44):    /* knot{w,q} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x44): /* knot{b,d} k,k */
+        generate_exception_if(vex.l || vex.reg != 0xf, EXC_UD);
+        goto opmask_basic;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x4b):    /* kunpck{w,d}{d,q} k,k,k */
+        generate_exception_if(!vex.l, EXC_UD);
+        host_and_vcpu_must_have(avx512bw);
+        goto opmask_common;
+
+    case X86EMUL_OPC_VEX_66(0x0f, 0x4b): /* kunpckbw k,k,k */
+        generate_exception_if(!vex.l || vex.w, EXC_UD);
+        goto opmask_common;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x50):     /* movmskp{s,d} xmm,reg */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x50): /* vmovmskp{s,d} {x,y}mm,reg */
     CASE_SIMD_PACKED_INT(0x0f, 0xd7):      /* pmovmskb {,x}mm,reg */
@@ -6552,6 +6627,154 @@ x86_emulate(
         dst.val = test_cc(b, _regs.eflags);
         break;
 
+    case X86EMUL_OPC_VEX(0x0f, 0x91):    /* kmov{w,q} k,mem */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x91): /* kmov{b,d} k,mem */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x90):    /* kmov{w,q} k/mem,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x90): /* kmov{b,d} k/mem,k */
+        generate_exception_if(vex.l || !vex.r, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.w )
+        {
+            host_and_vcpu_must_have(avx512bw);
+            op_bytes = 4 << !vex.pfx;
+        }
+        else if ( vex.pfx )
+        {
+            host_and_vcpu_must_have(avx512dq);
+            op_bytes = 1;
+        }
+        else
+            op_bytes = 2;
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            vex.b = 1;
+            opc[1] &= 0x38;
+        }
+        insn_bytes = PFX_BYTES + 2;
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x92):    /* kmovw r32,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x92): /* kmovb r32,k */
+    case X86EMUL_OPC_VEX_F2(0x0f, 0x92): /* kmov{d,q} reg,k */
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.pfx == vex_f2 )
+            host_and_vcpu_must_have(avx512bw);
+        else
+        {
+            generate_exception_if(vex.w, EXC_UD);
+            if ( vex.pfx )
+                host_and_vcpu_must_have(avx512dq);
+        }
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        vex.b = 1;
+        if ( !mode_64bit() )
+            vex.w = 0;
+        opc[1] = modrm & 0xf8;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        ea.reg = decode_gpr(&_regs, modrm_rm);
+        invoke_stub("", "", "=m" (dummy) : "a" (*ea.reg));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        dst.type = OP_NONE;
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x93):    /* kmovw k,r32 */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x93): /* kmovb k,r32 */
+    case X86EMUL_OPC_VEX_F2(0x0f, 0x93): /* kmov{d,q} k,reg */
+        generate_exception_if(vex.l || vex.reg != 0xf || ea.type != OP_REG,
+                              EXC_UD);
+        dst = ea;
+        dst.reg = decode_gpr(&_regs, modrm_reg);
+
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.pfx == vex_f2 )
+        {
+            host_and_vcpu_must_have(avx512bw);
+            dst.bytes = 4 << (mode_64bit() && vex.w);
+        }
+        else
+        {
+            generate_exception_if(vex.w, EXC_UD);
+            dst.bytes = 4;
+            if ( vex.pfx )
+                host_and_vcpu_must_have(avx512dq);
+        }
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert GPR destination to %rAX. */
+        vex.r = 1;
+        if ( !mode_64bit() )
+            vex.w = 0;
+        opc[1] = modrm & 0xc7;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x99):    /* ktest{w,q} k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x98):    /* kortest{w,q} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x98): /* kortest{b,d} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x99): /* ktest{b,d} k,k */
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.w )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( vex.pfx )
+            host_and_vcpu_must_have(avx512dq);
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),
+                    _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),
+                    [eflags] "+g" (_regs.eflags),
+                    "=a" (dst.val), [tmp] "=&r" (dummy)
+                    : [mask] "i" (EFLAGS_MASK));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        dst.type = OP_NONE;
+        break;
+
     case X86EMUL_OPC(0x0f, 0xa2): /* cpuid */
         msr_val = 0;
         fail_if(ops->cpuid == NULL);
@@ -8170,6 +8393,23 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+    opmask_shift_imm:
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_opmask);
+        op_bytes = 1; /* Any non-zero value will do. */
+        goto simd_0f_imm8;
+
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x31): /* kshiftr{d,q} $imm8,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x33): /* kshiftl{d,q} $imm8,k,k */
+        host_and_vcpu_must_have(avx512bw);
+        goto opmask_shift_imm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x44):     /* pclmulqdq $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,xmm/m128,xmm,xmm */
         host_and_vcpu_must_have(pclmulqdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -170,6 +170,7 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_mmx, /* MMX instruction set (%mm0-%mm7) */
     X86EMUL_FPU_xmm, /* SSE instruction set (%xmm0-%xmm7/15) */
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
+    X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -99,9 +99,12 @@
 #define cpu_has_rtm             boot_cpu_has(X86_FEATURE_RTM)
 #define cpu_has_fpu_sel         (!boot_cpu_has(X86_FEATURE_NO_FPU_SEL))
 #define cpu_has_mpx             boot_cpu_has(X86_FEATURE_MPX)
+#define cpu_has_avx512f         boot_cpu_has(X86_FEATURE_AVX512F)
+#define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
+#define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)





* [PATCH 4/6] x86emul: clean up AVX2 insn use in test harness
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (2 preceding siblings ...)
  2018-08-09  8:24 ` [PATCH 3/6] x86emul: support AVX512 opmask insns Jan Beulich
@ 2018-08-09  8:25 ` Jan Beulich
  2018-08-09  8:25 ` [PATCH 5/6] x86emul: correct EVEX decoding Jan Beulich
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-08-09  8:25 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

Drop the pretty pointless conditionals from code testing AVX insns and
properly use AVX2 mnemonics in code testing AVX2 insns (the test
harness already requires a sufficiently new compiler/assembler).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -2071,11 +2071,6 @@ int main(int argc, char **argv)
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(vmovdqu_from_mem) )
             goto fail;
-#if 0 /* Don't use AVX2 instructions for now */
-        asm ( "vpcmpeqb %%ymm2, %%ymm2, %%ymm2\n\t"
-              "vpcmpeqb %%ymm4, %%ymm2, %%ymm0\n\t"
-              "vpmovmskb %%ymm0, %0" : "=r" (rc) );
-#else
         asm ( "vextractf128 $1, %%ymm4, %%xmm3\n\t"
               "vpcmpeqb %%xmm2, %%xmm2, %%xmm2\n\t"
               "vpcmpeqb %%xmm4, %%xmm2, %%xmm0\n\t"
@@ -2083,7 +2078,6 @@ int main(int argc, char **argv)
               "vpmovmskb %%xmm0, %0\n\t"
               "vpmovmskb %%xmm1, %1" : "=r" (rc), "=r" (i) );
         rc |= i << 16;
-#endif
         if ( rc != 0xffffffff )
             goto fail;
         printf("okay\n");
@@ -2755,11 +2749,6 @@ int main(int argc, char **argv)
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(vlddqu) )
             goto fail;
-#if 0 /* Don't use AVX2 instructions for now */
-        asm ( "vpcmpeqb %%ymm2, %%ymm2, %%ymm2\n\t"
-              "vpcmpeqb %%ymm4, %%ymm2, %%ymm0\n\t"
-              "vpmovmskb %%ymm0, %0" : "=r" (rc) );
-#else
         asm ( "vextractf128 $1, %%ymm4, %%xmm3\n\t"
               "vpcmpeqb %%xmm2, %%xmm2, %%xmm2\n\t"
               "vpcmpeqb %%xmm4, %%xmm2, %%xmm0\n\t"
@@ -2767,7 +2756,6 @@ int main(int argc, char **argv)
               "vpmovmskb %%xmm0, %0\n\t"
               "vpmovmskb %%xmm1, %1" : "=r" (rc), "=r" (i) );
         rc |= i << 16;
-#endif
         if ( ~rc )
             goto fail;
         printf("okay\n");
@@ -2806,15 +2794,9 @@ int main(int argc, char **argv)
     {
         decl_insn(vmovntdqa);
 
-#if 0 /* Don't use AVX2 instructions for now */
         asm volatile ( "vpxor %%ymm4, %%ymm4, %%ymm4\n"
                        put_insn(vmovntdqa, "vmovntdqa (%0), %%ymm4")
                        :: "c" (NULL) );
-#else
-        asm volatile ( "vpxor %xmm4, %xmm4, %xmm4\n"
-                       put_insn(vmovntdqa,
-                                ".byte 0xc4, 0xe2, 0x7d, 0x2a, 0x21") );
-#endif
 
         set_insn(vmovntdqa);
         memset(res, 0x55, 96);
@@ -2823,19 +2805,9 @@ int main(int argc, char **argv)
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(vmovntdqa) )
             goto fail;
-#if 0 /* Don't use AVX2 instructions for now */
         asm ( "vpcmpeqb %%ymm2, %%ymm2, %%ymm2\n\t"
               "vpcmpeqb %%ymm4, %%ymm2, %%ymm0\n\t"
               "vpmovmskb %%ymm0, %0" : "=r" (rc) );
-#else
-        asm ( "vextractf128 $1, %%ymm4, %%xmm3\n\t"
-              "vpcmpeqb %%xmm2, %%xmm2, %%xmm2\n\t"
-              "vpcmpeqb %%xmm4, %%xmm2, %%xmm0\n\t"
-              "vpcmpeqb %%xmm3, %%xmm2, %%xmm1\n\t"
-              "vpmovmskb %%xmm0, %0\n\t"
-              "vpmovmskb %%xmm1, %1" : "=r" (rc), "=r" (i) );
-        rc |= i << 16;
-#endif
         if ( ~rc )
             goto fail;
         printf("okay\n");
@@ -3161,12 +3133,7 @@ int main(int argc, char **argv)
 
         asm volatile ( "vpxor %%xmm1, %%xmm1, %%xmm1\n\t"
                        "vpinsrd $0b00, %1, %%xmm1, %%xmm2\n\t"
-#if 0 /* Don't use AVX2 instructions for now */
                        put_insn(vpmaskmovd, "vpmaskmovd %%xmm1, %%xmm2, (%0)")
-#else
-                       put_insn(vpmaskmovd,
-                                ".byte 0xc4, 0xe2, 0x69, 0x8e, 0x0a")
-#endif
                        :: "d" (NULL), "r" (~0) );
 
         memset(res + MMAP_SZ / sizeof(*res) - 8, 0xdb, 32);
@@ -3200,14 +3167,8 @@ int main(int argc, char **argv)
 
         asm volatile ( "vpxor %%xmm1, %%xmm1, %%xmm1\n\t"
                        "vpcmpeqd %%xmm0, %%xmm0, %%xmm0\n\t"
-#if 0 /* Don't use AVX2 instructions for now */
                        "vpblendd $0b0011, %%xmm0, %%xmm1, %%xmm2\n\t"
                        put_insn(vpmaskmovq, "vpmaskmovq %%xmm1, %%xmm2, (%0)")
-#else
-                       ".byte 0xc4, 0xe3, 0x71, 0x02, 0xd0, 0b0011\n\t"
-                       put_insn(vpmaskmovq,
-                                ".byte 0xc4, 0xe2, 0xe9, 0x8e, 0x0a")
-#endif
                        :: "d" (NULL) );
 
         memset(res + MMAP_SZ / sizeof(*res) - 8, 0xdb, 32);
@@ -3221,11 +3182,7 @@ int main(int argc, char **argv)
                     res + MMAP_SZ / sizeof(*res) - 4, 8) )
             goto fail;
 
-#if 0 /* Don't use AVX2 instructions for now */
         asm volatile ( "vpermq $0b00000001, %ymm2, %ymm2" );
-#else
-        asm volatile ( ".byte 0xc4, 0xe3, 0xfd, 0x00, 0xd2, 0b00000001" );
-#endif
         memset(res, 0xdb, 32);
         set_insn(vpmaskmovq);
         regs.edx = (unsigned long)(res - 2);





* [PATCH 5/6] x86emul: correct EVEX decoding
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (3 preceding siblings ...)
  2018-08-09  8:25 ` [PATCH 4/6] x86emul: clean up AVX2 insn use in test harness Jan Beulich
@ 2018-08-09  8:25 ` Jan Beulich
  2018-08-09  8:26 ` [PATCH 6/6] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-08-09  8:25 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

Fix an inverted pair of checks, drop an incorrect instance of #UD
raising for non-64-bit mode, and add further generic checks.

Note: Contrary to what SDM Vol 2 rev 067 states, EVEX.V' is _not_
      ignored outside of 64-bit mode when the field does not encode a
      register. Just like EVEX.VVVV is required to be 0b1111 in that
      case, EVEX.V' is required to be 1 there.

Also rename the bcst field to br, as #UD generation for individual insns
will need to consider both of its possible meanings.
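
For orientation, a shorthand reading (mine, not authoritative) of the
third EVEX payload byte, which the hunk below touches, low bits first:

    /* EVEX P2 (union evex, .raw[2]):
     *   opmsk(3) = aaa  opmask register selector
     *   RX(1)    = V'   high bit of the register specifier extension
     *   br(1)    = b    broadcast or embedded rounding, depending on
     *                   context (hence the bcst -> br rename)
     *   lr(2)    = L'L  vector length
     *   z(1)     = z    zeroing- vs. merging-masking
     */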

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -650,7 +650,7 @@ union evex {
         uint8_t w:1;
         uint8_t opmsk:3;
         uint8_t RX:1;
-        uint8_t bcst:1;
+        uint8_t br:1;
         uint8_t lr:2;
         uint8_t z:1;
     };
@@ -2760,13 +2760,11 @@ x86_decode(
                         evex.raw[1] = vex.raw[1];
                         evex.raw[2] = insn_fetch_type(uint8_t);
 
-                        generate_exception_if(evex.mbs || !evex.mbz, EXC_UD);
+                        generate_exception_if(!evex.mbs || evex.mbz, EXC_UD);
+                        generate_exception_if(!evex.opmsk && evex.z, EXC_UD);
 
                         if ( !mode_64bit() )
-                        {
-                            generate_exception_if(!evex.RX, EXC_UD);
                             evex.R = 1;
-                        }
 
                         vex.opcx = evex.opcx;
                         break;
@@ -3404,6 +3402,7 @@ x86_emulate(
         d = (d & ~DstMask) | DstMem;
         /* Becomes a normal DstMem operation from here on. */
     case DstMem:
+        generate_exception_if(ea.type == OP_MEM && evex.z, EXC_UD);
         if ( state->simd_size )
         {
             generate_exception_if(lock_prefix, EXC_UD);






* [PATCH 6/6] x86emul: generalize vector length handling for AVX512/EVEX
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (4 preceding siblings ...)
  2018-08-09  8:25 ` [PATCH 5/6] x86emul: correct EVEX decoding Jan Beulich
@ 2018-08-09  8:26 ` Jan Beulich
  2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-08-09  8:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

To allow for some code sharing where possible, copy VEX.L into EVEX.LR
even for VEX (or XOP) encoded insns. Make operand size determination
use this right away, at the same time adding consistency checks for the
EVEX scalar insn cases (the non-scalar ones aren't uniform enough for
the checking to be done in a central place like this).

Note that the broadcast case is not handled here, but will be taken care
of elsewhere (in just a single place rather than at least two).
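
Sketch only: with that copy in place, a single expression covers the
operand sizes of packed vector insns:

    /* lr == 0 -> 16 bytes (xmm), lr == 1 -> 32 bytes (ymm),
     * lr == 2 -> 64 bytes (zmm, reachable via EVEX encodings only). */
    static unsigned int vector_bytes(unsigned int evex_lr)
    {
        return 16 << evex_lr;
    }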

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -191,14 +191,14 @@ enum simd_opsize {
      * Ordinary packed integers:
      * - 64 bits without prefix 66 (MMX)
      * - 128 bits with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      */
     simd_packed_int,
 
     /*
      * Ordinary packed/scalar floating point:
      * - 128 bits without prefix or with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      * - 32 bits with prefix F3 (scalar single)
      * - 64 bits with prefix F2 (scalar double)
      */
@@ -207,14 +207,14 @@ enum simd_opsize {
     /*
      * Packed floating point:
      * - 128 bits without prefix or with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      */
     simd_packed_fp,
 
     /*
      * Single precision packed/scalar floating point:
      * - 128 bits without prefix (SSEn)
-     * - 128/256 bits depending on VEX.L, no prefix (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      * - 32 bits with prefix F3 (scalar)
      */
     simd_single_fp,
@@ -228,7 +228,7 @@ enum simd_opsize {
 
     /*
      * Scalar floating point:
-     * - 32/64 bits depending on VEX.W
+     * - 32/64 bits depending on VEX.W/EVEX.W
      */
     simd_scalar_vexw,
 
@@ -2818,6 +2818,9 @@ x86_decode(
 
                 opcode |= b | MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
 
+                if ( !evex.mbs )
+                    evex.lr = vex.l;
+
                 if ( !(d & ModRM) )
                     break;
 
@@ -3148,7 +3151,7 @@ x86_decode(
             }
             /* fall through */
         case vex_66:
-            op_bytes = 16 << vex.l;
+            op_bytes = 16 << evex.lr;
             break;
         default:
             op_bytes = 0;
@@ -3172,13 +3175,23 @@ x86_decode(
     case simd_any_fp:
         switch ( vex.pfx )
         {
-        default:     op_bytes = 16 << vex.l; break;
-        case vex_f3: op_bytes = 4;           break;
-        case vex_f2: op_bytes = 8;           break;
+        default:
+            op_bytes = 16 << evex.lr;
+            break;
+        case vex_f3:
+            generate_exception_if(evex.mbs && evex.w, EXC_UD);
+            op_bytes = 4;
+            break;
+        case vex_f2:
+            generate_exception_if(evex.mbs && !evex.w, EXC_UD);
+            op_bytes = 8;
+            break;
         }
         break;
 
     case simd_scalar_opc:
+        generate_exception_if(evex.mbs && (evex.w != (ctxt->opcode & 1)),
+                              EXC_UD);
         op_bytes = 4 << (ctxt->opcode & 1);
         break;
 






* [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (5 preceding siblings ...)
  2018-08-09  8:26 ` [PATCH 6/6] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
@ 2018-08-29 14:20 ` Jan Beulich
  2018-08-29 14:23   ` [PATCH v2 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
                     ` (5 more replies)
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (5 subsequent siblings)
  12 siblings, 6 replies; 465+ messages in thread
From: Jan Beulich @ 2018-08-29 14:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

1: fix FMA scalar operand sizes
2: extend MASKMOV{Q,DQU} tests
3: support AVX512 opmask insns
4: clean up AVX2 insn use in test harness
5: correct EVEX decoding
6: generalize vector length handling for AVX512/EVEX

While I also have ready a patch emulating the basic AVX512 moves,
its prereq to widen respective fields in HVM code to allow for 64-byte
operand sizes has a dependency on the "x86/HVM: implement memory
read caching" series, which looks to be stalled, and I prefer to avoid
posting further patches with dependencies on stalled series.

Some support beyond moves is also mostly ready, but it's not quite
enough yet to enable testing that code in the harness, and I'd rather
not post code that I haven't pushed through the harness yet. The
partial testing I've been doing means, however, that in particular the
last two patches here have been tested already, even if no such
testing would be possible for others until further patches actually get
posted.

Jan





* [PATCH v2 1/6] x86emul: fix FMA scalar operand sizes
  2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
@ 2018-08-29 14:23   ` Jan Beulich
  2018-09-03 16:43     ` Andrew Cooper
  2018-08-29 14:23   ` [PATCH v2 2/6] x86emul: extend MASKMOV{Q,DQU} tests Jan Beulich
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-08-29 14:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

FMA insns, unlike the earlier AVX additions, don't use the low
opcode bit to distinguish between single and double vector elements.
While the difference is benign for packed flavors, the scalar ones
need to use VEX.W here. Oddly enough the table entries didn't even use
simd_scalar_fp, but uniformly used simd_packed_fp (implying the
distinction was made by [VEX-encoded] opcode prefix).

Split simd_scalar_fp into simd_scalar_opc and simd_scalar_vexw, and
correct the scalar FMA table entries to use the latter.

Also correct the scalar insn comments (they only ever use XMM registers
as operands).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -224,7 +224,13 @@ enum simd_opsize {
      * - 32 bits with low opcode bit clear (scalar single)
      * - 64 bits with low opcode bit set (scalar double)
      */
-    simd_scalar_fp,
+    simd_scalar_opc,
+
+    /*
+     * Scalar floating point:
+     * - 32/64 bits depending on VEX.W
+     */
+    simd_scalar_vexw,
 
     /*
      * 128 bits of integer or floating point data, with no further
@@ -407,7 +413,7 @@ static const struct ext0f38_table {
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x18 ... 0x19] = { .simd_size = simd_scalar_fp, .two_op = 1 },
+    [0x18 ... 0x19] = { .simd_size = simd_scalar_opc, .two_op = 1 },
     [0x1a] = { .simd_size = simd_128, .two_op = 1 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
@@ -427,9 +433,30 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_other },
     [0x8e] = { .simd_size = simd_other, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
-    [0x96 ... 0x9f] = { .simd_size = simd_packed_fp },
-    [0xa6 ... 0xaf] = { .simd_size = simd_packed_fp },
-    [0xb6 ... 0xbf] = { .simd_size = simd_packed_fp },
+    [0x96 ... 0x98] = { .simd_size = simd_packed_fp },
+    [0x99] = { .simd_size = simd_scalar_vexw },
+    [0x9a] = { .simd_size = simd_packed_fp },
+    [0x9b] = { .simd_size = simd_scalar_vexw },
+    [0x9c] = { .simd_size = simd_packed_fp },
+    [0x9d] = { .simd_size = simd_scalar_vexw },
+    [0x9e] = { .simd_size = simd_packed_fp },
+    [0x9f] = { .simd_size = simd_scalar_vexw },
+    [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp },
+    [0xa9] = { .simd_size = simd_scalar_vexw },
+    [0xaa] = { .simd_size = simd_packed_fp },
+    [0xab] = { .simd_size = simd_scalar_vexw },
+    [0xac] = { .simd_size = simd_packed_fp },
+    [0xad] = { .simd_size = simd_scalar_vexw },
+    [0xae] = { .simd_size = simd_packed_fp },
+    [0xaf] = { .simd_size = simd_scalar_vexw },
+    [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp },
+    [0xb9] = { .simd_size = simd_scalar_vexw },
+    [0xba] = { .simd_size = simd_packed_fp },
+    [0xbb] = { .simd_size = simd_scalar_vexw },
+    [0xbc] = { .simd_size = simd_packed_fp },
+    [0xbd] = { .simd_size = simd_scalar_vexw },
+    [0xbe] = { .simd_size = simd_packed_fp },
+    [0xbf] = { .simd_size = simd_scalar_vexw },
     [0xc8 ... 0xcd] = { .simd_size = simd_other },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
@@ -454,7 +481,7 @@ static const struct ext0f3a_table {
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1 },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
-    [0x0a ... 0x0b] = { .simd_size = simd_scalar_fp },
+    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
     [0x14 ... 0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1 },
@@ -476,13 +503,13 @@ static const struct ext0f3a_table {
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
-    [0x6a ... 0x6b] = { .simd_size = simd_scalar_fp, .four_op = 1 },
+    [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
-    [0x6e ... 0x6f] = { .simd_size = simd_scalar_fp, .four_op = 1 },
+    [0x6e ... 0x6f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_packed_fp, .four_op = 1 },
-    [0x7a ... 0x7b] = { .simd_size = simd_scalar_fp, .four_op = 1 },
+    [0x7a ... 0x7b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x7c ... 0x7d] = { .simd_size = simd_packed_fp, .four_op = 1 },
-    [0x7e ... 0x7f] = { .simd_size = simd_scalar_fp, .four_op = 1 },
+    [0x7e ... 0x7f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0xcc] = { .simd_size = simd_other },
     [0xdf] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xf0] = {},
@@ -518,7 +545,7 @@ static const struct ext8f09_table {
 } ext8f09_table[256] = {
     [0x01 ... 0x02] = { .two_op = 1 },
     [0x80 ... 0x81] = { .simd_size = simd_packed_fp, .two_op = 1 },
-    [0x82 ... 0x83] = { .simd_size = simd_scalar_fp, .two_op = 1 },
+    [0x82 ... 0x83] = { .simd_size = simd_scalar_opc, .two_op = 1 },
     [0x90 ... 0x9b] = { .simd_size = simd_packed_int },
     [0xc1 ... 0xc3] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xc6 ... 0xc7] = { .simd_size = simd_packed_int, .two_op = 1 },
@@ -3132,10 +3159,14 @@ x86_decode(
         }
         break;
 
-    case simd_scalar_fp:
+    case simd_scalar_opc:
         op_bytes = 4 << (ctxt->opcode & 1);
         break;
 
+    case simd_scalar_vexw:
+        op_bytes = 4 << vex.w;
+        break;
+
     case simd_128:
         op_bytes = 16;
         break;
@@ -7747,33 +7778,33 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm */
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v2 2/6] x86emul: extend MASKMOV{Q,DQU} tests
  2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
  2018-08-29 14:23   ` [PATCH v2 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
@ 2018-08-29 14:23   ` Jan Beulich
  2018-09-03 16:44     ` [PATCH v2 2/6] x86emul: extend MASKMOV{Q, DQU} tests Andrew Cooper
  2018-08-29 14:24   ` [PATCH v2 3/6] x86emul: support AVX512 opmask insns Jan Beulich
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-08-29 14:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

While deriving the first AVX512 pieces from existing code I got the
(ultimately wrong) impression that the emulation of these insns was
broken. Besides testing that the instructions act as no-ops when the
controlling mask bits are all zero, add tests to also check that the
data merging actually works.
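
For reference, a minimal C model of the merging semantics under test,
as described by the SDM for MASKMOVQ (MASKMOVDQU behaves the same over
16 bytes):

#include <stdint.h>

/* src[i] is stored at dst[i] iff the MSB of mask[i] is set; all other
 * destination bytes are left untouched. */
static void maskmovq_model(uint8_t *dst, const uint8_t src[8],
                           const uint8_t mask[8])
{
    for ( unsigned int i = 0; i < 8; ++i )
        if ( mask[i] & 0x80 )
            dst[i] = src[i];
}

The punpcklbw below sets every other mask byte to 0xff, so over
0x55-filled memory the written window has to read back as 0xff55ff55,
which is what the new checks verify.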

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -2626,7 +2626,7 @@ int main(int argc, char **argv)
         printf("skipped\n");
 #endif
 
-    printf("%-40s", "Testing maskmovq (zero mask)...");
+    printf("%-40s", "Testing maskmovq %mm4,%mm4...");
     if ( stack_exec && cpu_has_sse )
     {
         decl_insn(maskmovq);
@@ -2639,12 +2639,25 @@ int main(int argc, char **argv)
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(maskmovq) )
             goto fail;
+
+        asm volatile ( "pcmpeqb %mm3, %mm3\n\t"
+                       "punpcklbw %mm3, %mm4\n" );
+        memset(res, 0x55, 24);
+
+        set_insn(maskmovq);
+        regs.edi = (unsigned long)(res + 2);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(maskmovq) ||
+             memcmp(res, res + 4, 8) ||
+             res[2] != 0xff55ff55 || res[3] != 0xff55ff55 )
+            goto fail;
+
         printf("okay\n");
     }
     else
         printf("skipped\n");
 
-    printf("%-40s", "Testing maskmovdqu (zero mask)...");
+    printf("%-40s", "Testing maskmovdqu %xmm3,%xmm3...");
     if ( stack_exec && cpu_has_sse2 )
     {
         decl_insn(maskmovdqu);
@@ -2653,9 +2666,24 @@ int main(int argc, char **argv)
                        put_insn(maskmovdqu, "maskmovdqu %xmm3, %xmm3") );
 
         set_insn(maskmovdqu);
+        regs.edi = 0;
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(maskmovdqu) )
             goto fail;
+
+        asm volatile ( "pcmpeqb %xmm4, %xmm4\n\t"
+                       "punpcklbw %xmm4, %xmm3\n" );
+        memset(res, 0x55, 48);
+
+        set_insn(maskmovdqu);
+        regs.edi = (unsigned long)(res + 4);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(maskmovdqu) ||
+             memcmp(res, res + 8, 16) ||
+             res[4] != 0xff55ff55 || res[5] != 0xff55ff55 ||
+             res[6] != 0xff55ff55 || res[7] != 0xff55ff55 )
+            goto fail;
+
         printf("okay\n");
     }
     else






^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v2 3/6] x86emul: support AVX512 opmask insns
  2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
  2018-08-29 14:23   ` [PATCH v2 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
  2018-08-29 14:23   ` [PATCH v2 2/6] x86emul: extend MASKMOV{Q,DQU} tests Jan Beulich
@ 2018-08-29 14:24   ` Jan Beulich
  2018-09-03 17:57     ` Andrew Cooper
  2018-08-29 14:24   ` [PATCH v2 4/6] x86emul: clean up AVX2 insn use in test harness Jan Beulich
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-08-29 14:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

These are all VEX encoded, so the EVEX decoding logic continues to
remain unused at this point.

The new testcase is deliberately coded in assembly, as a C one would
have become almost unreadable due to the overwhelming amount of
__builtin_...() that would need to be used. After all the compiler has
no underlying type (yet) that could be operated on without builtins,
other than the vector types used for "normal" SIMD insns.
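
For a taste of the alternative, here is just the first of the checks
below (k5 == k6, i.e. (a|b) & ~(a&b) == a^b) rewritten against the
AVX512F mask intrinsics from <immintrin.h> (a sketch, to be built with
-mavx512f); it covers only the 16-bit flavor, and is already harder to
follow than the assembly:

#include <immintrin.h>

static int check_w(__mmask16 a, __mmask16 b)
{
    __mmask16 lhs = _mm512_kandn(_mm512_kand(a, b), _mm512_kor(a, b));
    __mmask16 rhs = _mm512_kxor(a, b);

    /* kortestz yields 1 iff the OR of both operands is all zeroes. */
    return _mm512_kortestz(lhs ^ rhs, lhs ^ rhs);
}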

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,6 +16,8 @@ FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
 
+OPMASK := avx512f avx512dq avx512bw
+
 blowfish-cflags := ""
 blowfish-cflags-x86_32 := "-mno-accumulate-outgoing-args -Dstatic="
 
@@ -51,6 +53,10 @@ xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
 
+avx512f-opmask-vecs := 2
+avx512dq-opmask-vecs := 1
+avx512bw-opmask-vecs := 4 8
+
 # For AVX and later, have the compiler avoid XMM0 to widen coverage of
 # the VEX.vvvv checks in the emulator.  For 3DNow!, however, force SSE
 # use for floating point operations, to avoid mixing MMX and FPU register
@@ -80,9 +86,13 @@ $(1)-cflags := \
 	   $(foreach flt,$($(1)-flts), \
 	     "-D_$(vec)x$(idx)f$(flt) -m$(1:-sg=) $(call non-sse,$(1)) -Os -DVEC_MAX=$(vec) -DIDX_SIZE=$(idx) -DFLOAT_SIZE=$(flt)")))
 endef
+define opmask-defs
+$(1)-opmask-cflags := $(foreach vec,$($(1)-opmask-vecs), "-D_$(vec) -m$(1) -Os -DSIZE=$(vec)")
+endef
 
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
+$(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
 $(addsuffix .h,$(TESTCASES)): %.h: %.c testcase.mk Makefile
 	rm -f $@.new $*.bin
@@ -100,6 +110,22 @@ $(addsuffix .h,$(TESTCASES)): %.h: %.c t
 	)
 	mv $@.new $@
 
+$(addsuffix -opmask.h,$(OPMASK)): %.h: opmask.S testcase.mk Makefile
+	rm -f $@.new $*.bin
+	$(foreach arch,$(filter-out $(XEN_COMPILE_ARCH),x86_32) $(XEN_COMPILE_ARCH), \
+	    for cflags in $($*-cflags) $($*-cflags-$(arch)); do \
+		$(MAKE) -f testcase.mk TESTCASE=$* XEN_TARGET_ARCH=$(arch) $*-cflags="$$cflags" all; \
+		prefix=$(shell echo $(subst -,_,$*) | sed -e 's,^\([0-9]\),_\1,'); \
+		flavor=$$(echo $${cflags} | sed -e 's, .*,,' -e 'y,-=,__,') ; \
+		(echo 'static const unsigned int __attribute__((section(".test, \"ax\", @progbits #")))' \
+		      "$${prefix}_$(arch)$${flavor}[] = {"; \
+		 od -v -t x $*.bin | sed -e 's/^[0-9]* /0x/' -e 's/ /, 0x/g' -e 's/$$/,/'; \
+		 echo "};") >>$@.new; \
+		rm -f $*.bin; \
+	    done; \
+	)
+	mv $@.new $@
+
 $(addsuffix .c,$(SIMD)):
 	ln -sf simd.c $@
 
@@ -145,4 +171,4 @@ x86-emulate.o test_x86_emulator.o wrappe
 x86-emulate.o: x86_emulate/x86_emulate.c
 x86-emulate.o: HOSTCFLAGS += -D__XEN_TOOLS__
 
-test_x86_emulator.o: $(addsuffix .h,$(TESTCASES))
+test_x86_emulator.o: $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
--- /dev/null
+++ b/tools/tests/x86_emulator/opmask.S
@@ -0,0 +1,144 @@
+#ifdef __i386__
+# define R(x) e##x
+# define DATA(x) x
+#else
+# if SIZE == 8
+#  define R(x) r##x
+# else
+#  define R(x) e##x
+# endif
+# define DATA(x) x(%rip)
+#endif
+
+#if SIZE == 1
+# define _(x) x##b
+#elif SIZE == 2
+# define _(x) x##w
+# define WIDEN(x) x##bw
+#elif SIZE == 4
+# define _(x) x##d
+# define WIDEN(x) x##wd
+#elif SIZE == 8
+# define _(x) x##q
+# define WIDEN(x) x##dq
+#endif
+
+    .macro check res1:req, res2:req, line:req
+    _(kmov)       %\res1, DATA(out)
+#if SIZE < 8 || !defined(__i386__)
+    _(kmov)       %\res2, %R(dx)
+    cmp           DATA(out), %R(dx)
+#else
+    sub           $8, %esp
+    kmovq         %\res2, (%esp)
+    pop           %ecx
+    pop           %edx
+    cmp           DATA(out), %ecx
+    jne           0f
+    cmp           DATA(out+4), %edx
+0:
+#endif
+    je            1f
+    mov           $\line, %eax
+    ret
+1:
+    .endm
+
+    .text
+    .globl _start
+_start:
+    _(kmov)       DATA(in1), %k1
+#if SIZE < 8 || !defined(__i386__)
+    mov           DATA(in2), %R(ax)
+    _(kmov)       %R(ax), %k2
+#else
+    _(kmov)       DATA(in2), %k2
+#endif
+
+    _(kor)        %k1, %k2, %k3
+    _(kand)       %k1, %k2, %k4
+    _(kandn)      %k3, %k4, %k5
+    _(kxor)       %k1, %k2, %k6
+    check         k5, k6, __LINE__
+
+    _(knot)       %k6, %k3
+    _(kxnor)      %k1, %k2, %k4
+    check         k3, k4, __LINE__
+
+    _(kshiftl)    $1, %k1, %k3
+    _(kshiftl)    $2, %k3, %k4
+    _(kshiftl)    $3, %k1, %k5
+    check         k4, k5, __LINE__
+
+    _(kshiftr)    $1, %k1, %k3
+    _(kshiftr)    $2, %k3, %k4
+    _(kshiftr)    $3, %k1, %k5
+    check         k4, k5, __LINE__
+
+    _(kortest)    %k6, %k6
+    jnbe          1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxor)       %k0, %k0, %k3
+    _(kortest)    %k3, %k3
+    jz            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxnor)      %k0, %k0, %k3
+    _(kortest)    %k3, %k3
+    jc            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+#if SIZE > 1
+
+    _(kshiftr)    $SIZE*4, %k3, %k4
+    WIDEN(kunpck) %k4, %k4, %k5
+    check         k3, k5, __LINE__
+
+#endif
+
+#if SIZE != 2 || defined(__AVX512DQ__)
+
+    _(kadd)       %k1, %k1, %k3
+    _(kshiftl)    $1, %k1, %k4
+    check         k3, k4, __LINE__
+
+    _(ktest)      %k2, %k1
+    jnbe          1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxor)       %k0, %k0, %k3
+    _(ktest)      %k0, %k3
+    jz            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxnor)      %k0, %k0, %k4
+    _(ktest)      %k0, %k4
+    jc            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+#endif
+
+    xor           %eax, %eax
+    ret
+
+    .section .rodata, "a", @progbits
+    .balign 8
+in1: .byte 0b10110011, 0b10001111, 0b00001111, 0b10000011, 0b11110000, 0b00111111, 0b10000000, 0b11111111
+in2: .byte 0b11111111, 0b00000001, 0b11111100, 0b00001111, 0b11000001, 0b11110000, 0b11110001, 0b11001101
+
+    .data
+    .balign 8
+out: .quad 0
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -18,6 +18,9 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx2.h"
 #include "avx2-sg.h"
 #include "xop.h"
+#include "avx512f-opmask.h"
+#include "avx512dq-opmask.h"
+#include "avx512bw-opmask.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -78,6 +81,24 @@ static bool simd_check_xop(void)
     return cpu_has_xop;
 }
 
+static bool simd_check_avx512f(void)
+{
+    return cpu_has_avx512f;
+}
+#define simd_check_avx512f_opmask simd_check_avx512f
+
+static bool simd_check_avx512dq(void)
+{
+    return cpu_has_avx512dq;
+}
+#define simd_check_avx512dq_opmask simd_check_avx512dq
+
+static bool simd_check_avx512bw(void)
+{
+    return cpu_has_avx512bw;
+}
+#define simd_check_avx512bw_opmask simd_check_avx512bw
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -223,6 +244,10 @@ static const struct {
     SIMD(XOP i16x16,              xop,      32i2),
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
+    SIMD(OPMASK/w,     avx512f_opmask,         2),
+    SIMD(OPMASK/b,    avx512dq_opmask,         1),
+    SIMD(OPMASK/d,    avx512bw_opmask,         4),
+    SIMD(OPMASK/q,    avx512bw_opmask,         8),
 #undef SIMD_
 #undef SIMD
 };
@@ -3469,8 +3494,8 @@ int main(int argc, char **argv)
             rc = x86_emulate(&ctxt, &emulops);
             if ( rc != X86EMUL_OKAY )
             {
-                printf("failed at %%eip == %08lx (opcode %08x)\n",
-                       (unsigned long)regs.eip, ctxt.opcode);
+                printf("failed (%d) at %%eip == %08lx (opcode %08x)\n",
+                       rc, (unsigned long)regs.eip, ctxt.opcode);
                 return 1;
             }
         }
--- a/tools/tests/x86_emulator/testcase.mk
+++ b/tools/tests/x86_emulator/testcase.mk
@@ -14,3 +14,9 @@ all: $(TESTCASE).bin
 	$(LD) $(LDFLAGS_DIRECT) -N -Ttext 0x100000 -o $*.tmp $*.o
 	$(OBJCOPY) -O binary $*.tmp $@
 	rm -f $*.tmp
+
+%-opmask.bin: opmask.S
+	$(CC) $(filter-out -M% .%,$(CFLAGS)) -c $< -o $*.o
+	$(LD) $(LDFLAGS_DIRECT) -N -Ttext 0x100000 -o $*.tmp $*.o
+	$(OBJCOPY) -O binary $*.tmp $@
+	rm -f $*.tmp
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -209,6 +209,9 @@ int emul_test_get_fpu(
     case X86EMUL_FPU_ymm:
         if ( cpu_has_avx )
             break;
+    case X86EMUL_FPU_opmask:
+        if ( cpu_has_avx512f )
+            break;
     default:
         return X86EMUL_UNHANDLEABLE;
     }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -236,6 +236,36 @@ static inline uint64_t xgetbv(uint32_t x
     (res.c & (1U << 21)) != 0; \
 })
 
+#define cpu_has_avx512f ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 16)) != 0; \
+})
+
+#define cpu_has_avx512dq ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 17)) != 0; \
+})
+
+#define cpu_has_avx512bw ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 30)) != 0; \
+})
+
 int emul_test_cpuid(
     uint32_t leaf,
     uint32_t subleaf,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -491,6 +491,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
+    [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
@@ -1187,6 +1188,11 @@ static int _get_fpu(
             return X86EMUL_UNHANDLEABLE;
         break;
 
+    case X86EMUL_FPU_opmask:
+        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
+            return X86EMUL_UNHANDLEABLE;
+        break;
+
     default:
         break;
     }
@@ -1762,12 +1768,15 @@ static bool vcpu_has(
 #define vcpu_has_bmi2()        vcpu_has(         7, EBX,  8, ctxt, ops)
 #define vcpu_has_rtm()         vcpu_has(         7, EBX, 11, ctxt, ops)
 #define vcpu_has_mpx()         vcpu_has(         7, EBX, 14, ctxt, ops)
+#define vcpu_has_avx512f()     vcpu_has(         7, EBX, 16, ctxt, ops)
+#define vcpu_has_avx512dq()    vcpu_has(         7, EBX, 17, ctxt, ops)
 #define vcpu_has_rdseed()      vcpu_has(         7, EBX, 18, ctxt, ops)
 #define vcpu_has_adx()         vcpu_has(         7, EBX, 19, ctxt, ops)
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
+#define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -2396,6 +2405,18 @@ x86_decode_twobyte(
         }
         break;
 
+    case X86EMUL_OPC_VEX(0, 0x90):    /* kmov{w,q} */
+    case X86EMUL_OPC_VEX_66(0, 0x90): /* kmov{b,d} */
+        state->desc = DstReg | SrcMem | Mov;
+        state->simd_size = simd_other;
+        break;
+
+    case X86EMUL_OPC_VEX(0, 0x91):    /* kmov{w,q} */
+    case X86EMUL_OPC_VEX_66(0, 0x91): /* kmov{b,d} */
+        state->desc = DstMem | SrcReg | Mov;
+        state->simd_size = simd_other;
+        break;
+
     case 0xae:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
@@ -6002,6 +6023,60 @@ x86_emulate(
             dst.val = src.val;
         break;
 
+    case X86EMUL_OPC_VEX(0x0f, 0x4a):    /* kadd{w,q} k,k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x41):    /* kand{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x41): /* kand{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x42):    /* kandn{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x42): /* kandn{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x45):    /* kor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x45): /* kor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x46):    /* kxnor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x46): /* kxnor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x47):    /* kxor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x47): /* kxor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x4a): /* kadd{b,d} k,k,k */
+        generate_exception_if(!vex.l, EXC_UD);
+    opmask_basic:
+        if ( vex.w )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( vex.pfx )
+            host_and_vcpu_must_have(avx512dq);
+    opmask_common:
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(!vex.r || (mode_64bit() && !(vex.reg & 8)) ||
+                              ea.type != OP_REG, EXC_UD);
+
+        vex.reg |= 8;
+        d &= ~TwoOp;
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        insn_bytes = PFX_BYTES + 2;
+
+        state->simd_size = simd_other;
+        op_bytes = 1; /* Any non-zero value will do. */
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x44):    /* knot{w,q} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x44): /* knot{b,d} k,k */
+        generate_exception_if(vex.l || vex.reg != 0xf, EXC_UD);
+        goto opmask_basic;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x4b):    /* kunpck{w,d}{d,q} k,k,k */
+        generate_exception_if(!vex.l, EXC_UD);
+        host_and_vcpu_must_have(avx512bw);
+        goto opmask_common;
+
+    case X86EMUL_OPC_VEX_66(0x0f, 0x4b): /* kunpckbw k,k,k */
+        generate_exception_if(!vex.l || vex.w, EXC_UD);
+        goto opmask_common;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x50):     /* movmskp{s,d} xmm,reg */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x50): /* vmovmskp{s,d} {x,y}mm,reg */
     CASE_SIMD_PACKED_INT(0x0f, 0xd7):      /* pmovmskb {,x}mm,reg */
@@ -6552,6 +6627,154 @@ x86_emulate(
         dst.val = test_cc(b, _regs.eflags);
         break;
 
+    case X86EMUL_OPC_VEX(0x0f, 0x91):    /* kmov{w,q} k,mem */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x91): /* kmov{b,d} k,mem */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x90):    /* kmov{w,q} k/mem,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x90): /* kmov{b,d} k/mem,k */
+        generate_exception_if(vex.l || !vex.r, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.w )
+        {
+            host_and_vcpu_must_have(avx512bw);
+            op_bytes = 4 << !vex.pfx;
+        }
+        else if ( vex.pfx )
+        {
+            host_and_vcpu_must_have(avx512dq);
+            op_bytes = 1;
+        }
+        else
+            op_bytes = 2;
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            vex.b = 1;
+            opc[1] &= 0x38;
+        }
+        insn_bytes = PFX_BYTES + 2;
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x92):    /* kmovw r32,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x92): /* kmovb r32,k */
+    case X86EMUL_OPC_VEX_F2(0x0f, 0x92): /* kmov{d,q} reg,k */
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.pfx == vex_f2 )
+            host_and_vcpu_must_have(avx512bw);
+        else
+        {
+            generate_exception_if(vex.w, EXC_UD);
+            if ( vex.pfx )
+                host_and_vcpu_must_have(avx512dq);
+        }
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        vex.b = 1;
+        if ( !mode_64bit() )
+            vex.w = 0;
+        opc[1] = modrm & 0xf8;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        ea.reg = decode_gpr(&_regs, modrm_rm);
+        invoke_stub("", "", "=m" (dummy) : "a" (*ea.reg));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        dst.type = OP_NONE;
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x93):    /* kmovw k,r32 */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x93): /* kmovb k,r32 */
+    case X86EMUL_OPC_VEX_F2(0x0f, 0x93): /* kmov{d,q} k,reg */
+        generate_exception_if(vex.l || vex.reg != 0xf || ea.type != OP_REG,
+                              EXC_UD);
+        dst = ea;
+        dst.reg = decode_gpr(&_regs, modrm_reg);
+
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.pfx == vex_f2 )
+        {
+            host_and_vcpu_must_have(avx512bw);
+            dst.bytes = 4 << (mode_64bit() && vex.w);
+        }
+        else
+        {
+            generate_exception_if(vex.w, EXC_UD);
+            dst.bytes = 4;
+            if ( vex.pfx )
+                host_and_vcpu_must_have(avx512dq);
+        }
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert GPR destination to %rAX. */
+        vex.r = 1;
+        if ( !mode_64bit() )
+            vex.w = 0;
+        opc[1] = modrm & 0xc7;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x99):    /* ktest{w,q} k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x98):    /* kortest{w,q} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x98): /* kortest{b,d} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x99): /* ktest{b,d} k,k */
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.w )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( vex.pfx )
+            host_and_vcpu_must_have(avx512dq);
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),
+                    _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),
+                    [eflags] "+g" (_regs.eflags),
+                    "=a" (dst.val), [tmp] "=&r" (dummy)
+                    : [mask] "i" (EFLAGS_MASK));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        dst.type = OP_NONE;
+        break;
+
     case X86EMUL_OPC(0x0f, 0xa2): /* cpuid */
         msr_val = 0;
         fail_if(ops->cpuid == NULL);
@@ -8170,6 +8393,23 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+    opmask_shift_imm:
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_opmask);
+        op_bytes = 1; /* Any non-zero value will do. */
+        goto simd_0f_imm8;
+
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x31): /* kshiftr{d,q} $imm8,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x33): /* kshiftl{d,q} $imm8,k,k */
+        host_and_vcpu_must_have(avx512bw);
+        goto opmask_shift_imm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x44):     /* pclmulqdq $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,xmm/m128,xmm,xmm */
         host_and_vcpu_must_have(pclmulqdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -170,6 +170,7 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_mmx, /* MMX instruction set (%mm0-%mm7) */
     X86EMUL_FPU_xmm, /* SSE instruction set (%xmm0-%xmm7/15) */
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
+    X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -99,9 +99,12 @@
 #define cpu_has_rtm             boot_cpu_has(X86_FEATURE_RTM)
 #define cpu_has_fpu_sel         (!boot_cpu_has(X86_FEATURE_NO_FPU_SEL))
 #define cpu_has_mpx             boot_cpu_has(X86_FEATURE_MPX)
+#define cpu_has_avx512f         boot_cpu_has(X86_FEATURE_AVX512F)
+#define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
+#define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v2 4/6] x86emul: clean up AVX2 insn use in test harness
  2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (2 preceding siblings ...)
  2018-08-29 14:24   ` [PATCH v2 3/6] x86emul: support AVX512 opmask insns Jan Beulich
@ 2018-08-29 14:24   ` Jan Beulich
  2018-09-03 18:04     ` Andrew Cooper
  2018-08-29 14:25   ` [PATCH v2 5/6] x86emul: correct EVEX decoding Jan Beulich
  2018-08-29 14:25   ` [PATCH v2 6/6] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
  5 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-08-29 14:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

Drop the pretty pointless conditionals from code testing AVX insns and
properly use AVX2 mnemonics in code testing AVX2 insns (the test harness
already requires a sufficiently new compiler/assembler).
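
As a sanity check that nothing is lost in the conversion, one of the
dropped .byte sequences decodes (by hand, so double-check against the
SDM) like this:

/* c4 e2 -> three-byte VEX, map 0f38; 7d -> W=0 vvvv=1111 L=1 pp=66,
 * i.e. VEX.256.66.0F38.W0; 2a -> vmovntdqa; modrm 21 -> ymm4, (%rcx). */
static const unsigned char vmovntdqa_bytes[] = {
    0xc4, 0xe2, 0x7d, 0x2a, 0x21    /* vmovntdqa (%rcx), %ymm4 */
};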

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -2071,11 +2071,6 @@ int main(int argc, char **argv)
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(vmovdqu_from_mem) )
             goto fail;
-#if 0 /* Don't use AVX2 instructions for now */
-        asm ( "vpcmpeqb %%ymm2, %%ymm2, %%ymm2\n\t"
-              "vpcmpeqb %%ymm4, %%ymm2, %%ymm0\n\t"
-              "vpmovmskb %%ymm0, %0" : "=r" (rc) );
-#else
         asm ( "vextractf128 $1, %%ymm4, %%xmm3\n\t"
               "vpcmpeqb %%xmm2, %%xmm2, %%xmm2\n\t"
               "vpcmpeqb %%xmm4, %%xmm2, %%xmm0\n\t"
@@ -2083,7 +2078,6 @@ int main(int argc, char **argv)
               "vpmovmskb %%xmm0, %0\n\t"
               "vpmovmskb %%xmm1, %1" : "=r" (rc), "=r" (i) );
         rc |= i << 16;
-#endif
         if ( rc != 0xffffffff )
             goto fail;
         printf("okay\n");
@@ -2755,11 +2749,6 @@ int main(int argc, char **argv)
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(vlddqu) )
             goto fail;
-#if 0 /* Don't use AVX2 instructions for now */
-        asm ( "vpcmpeqb %%ymm2, %%ymm2, %%ymm2\n\t"
-              "vpcmpeqb %%ymm4, %%ymm2, %%ymm0\n\t"
-              "vpmovmskb %%ymm0, %0" : "=r" (rc) );
-#else
         asm ( "vextractf128 $1, %%ymm4, %%xmm3\n\t"
               "vpcmpeqb %%xmm2, %%xmm2, %%xmm2\n\t"
               "vpcmpeqb %%xmm4, %%xmm2, %%xmm0\n\t"
@@ -2767,7 +2756,6 @@ int main(int argc, char **argv)
               "vpmovmskb %%xmm0, %0\n\t"
               "vpmovmskb %%xmm1, %1" : "=r" (rc), "=r" (i) );
         rc |= i << 16;
-#endif
         if ( ~rc )
             goto fail;
         printf("okay\n");
@@ -2806,15 +2794,9 @@ int main(int argc, char **argv)
     {
         decl_insn(vmovntdqa);
 
-#if 0 /* Don't use AVX2 instructions for now */
         asm volatile ( "vpxor %%ymm4, %%ymm4, %%ymm4\n"
                        put_insn(vmovntdqa, "vmovntdqa (%0), %%ymm4")
                        :: "c" (NULL) );
-#else
-        asm volatile ( "vpxor %xmm4, %xmm4, %xmm4\n"
-                       put_insn(vmovntdqa,
-                                ".byte 0xc4, 0xe2, 0x7d, 0x2a, 0x21") );
-#endif
 
         set_insn(vmovntdqa);
         memset(res, 0x55, 96);
@@ -2823,19 +2805,9 @@ int main(int argc, char **argv)
         rc = x86_emulate(&ctxt, &emulops);
         if ( rc != X86EMUL_OKAY || !check_eip(vmovntdqa) )
             goto fail;
-#if 0 /* Don't use AVX2 instructions for now */
         asm ( "vpcmpeqb %%ymm2, %%ymm2, %%ymm2\n\t"
               "vpcmpeqb %%ymm4, %%ymm2, %%ymm0\n\t"
               "vpmovmskb %%ymm0, %0" : "=r" (rc) );
-#else
-        asm ( "vextractf128 $1, %%ymm4, %%xmm3\n\t"
-              "vpcmpeqb %%xmm2, %%xmm2, %%xmm2\n\t"
-              "vpcmpeqb %%xmm4, %%xmm2, %%xmm0\n\t"
-              "vpcmpeqb %%xmm3, %%xmm2, %%xmm1\n\t"
-              "vpmovmskb %%xmm0, %0\n\t"
-              "vpmovmskb %%xmm1, %1" : "=r" (rc), "=r" (i) );
-        rc |= i << 16;
-#endif
         if ( ~rc )
             goto fail;
         printf("okay\n");
@@ -3161,12 +3133,7 @@ int main(int argc, char **argv)
 
         asm volatile ( "vpxor %%xmm1, %%xmm1, %%xmm1\n\t"
                        "vpinsrd $0b00, %1, %%xmm1, %%xmm2\n\t"
-#if 0 /* Don't use AVX2 instructions for now */
                        put_insn(vpmaskmovd, "vpmaskmovd %%xmm1, %%xmm2, (%0)")
-#else
-                       put_insn(vpmaskmovd,
-                                ".byte 0xc4, 0xe2, 0x69, 0x8e, 0x0a")
-#endif
                        :: "d" (NULL), "r" (~0) );
 
         memset(res + MMAP_SZ / sizeof(*res) - 8, 0xdb, 32);
@@ -3200,14 +3167,8 @@ int main(int argc, char **argv)
 
         asm volatile ( "vpxor %%xmm1, %%xmm1, %%xmm1\n\t"
                        "vpcmpeqd %%xmm0, %%xmm0, %%xmm0\n\t"
-#if 0 /* Don't use AVX2 instructions for now */
                        "vpblendd $0b0011, %%xmm0, %%xmm1, %%xmm2\n\t"
                        put_insn(vpmaskmovq, "vpmaskmovq %%xmm1, %%xmm2, (%0)")
-#else
-                       ".byte 0xc4, 0xe3, 0x71, 0x02, 0xd0, 0b0011\n\t"
-                       put_insn(vpmaskmovq,
-                                ".byte 0xc4, 0xe2, 0xe9, 0x8e, 0x0a")
-#endif
                        :: "d" (NULL) );
 
         memset(res + MMAP_SZ / sizeof(*res) - 8, 0xdb, 32);
@@ -3221,11 +3182,7 @@ int main(int argc, char **argv)
                     res + MMAP_SZ / sizeof(*res) - 4, 8) )
             goto fail;
 
-#if 0 /* Don't use AVX2 instructions for now */
         asm volatile ( "vpermq $0b00000001, %ymm2, %ymm2" );
-#else
-        asm volatile ( ".byte 0xc4, 0xe3, 0xfd, 0x00, 0xd2, 0b00000001" );
-#endif
         memset(res, 0xdb, 32);
         set_insn(vpmaskmovq);
         regs.edx = (unsigned long)(res - 2);





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v2 5/6] x86emul: correct EVEX decoding
  2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (3 preceding siblings ...)
  2018-08-29 14:24   ` [PATCH v2 4/6] x86emul: clean up AVX2 insn use in test harness Jan Beulich
@ 2018-08-29 14:25   ` Jan Beulich
  2018-09-04 10:48     ` Andrew Cooper
  2018-08-29 14:25   ` [PATCH v2 6/6] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
  5 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-08-29 14:25 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

Fix an inverted pair of checks, drop an incorrect instance of #UD
raising for non-64-bit mode, and add further generic checks.

Note: Contrary to what SDM Vol 2 rev 067 states, EVEX.V' is _not_ ignored
      outside of 64-bit mode when the field does not encode a register.
      Just like EVEX.VVVV is required to be 0b1111 in that case, EVEX.V'
      is required to be 1 there.

Also rename the bcst field to br, as #UD generation for individual insns
will need to consider both of its possible meanings.
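
For ease of review, here is the EVEX payload layout per the SDM,
written LSB-first to mirror the emulator's bitfield use (a reference
sketch with the manual's field names, relying on GCC packing uint8_t
bitfields the way the emulator's own union evex already does):

#include <stdint.h>

union evex_sdm {                   /* 0x62, then P0 P1 P2 */
    uint8_t raw[3];
    struct {
        uint8_t mm:2;              /* P0: opcode map */
        uint8_t zero:2;            /*     must be zero ("mbz") */
        uint8_t Rp:1;              /*     R' (inverted) */
        uint8_t B:1, X:1, R:1;     /*     inverted, as in VEX */
        uint8_t pp:2;              /* P1: embedded prefix */
        uint8_t one:1;             /*     must be one ("mbs") */
        uint8_t vvvv:4;            /*     inverted register specifier */
        uint8_t W:1;
        uint8_t aaa:3;             /* P2: opmask register */
        uint8_t Vp:1;              /*     V' (inverted) */
        uint8_t b:1;               /*     broadcast / rounding control */
        uint8_t LL:2;              /*     L'L, vector length */
        uint8_t z:1;               /*     zeroing-masking */
    };
};

The inverted pair of checks fixed here are exactly these fixed bits
(P1's must-be-one vs P0's must-be-zero), and P2's bit 4 is the field
being renamed: broadcast for memory operands, embedded rounding control
for register ones.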

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -650,7 +650,7 @@ union evex {
         uint8_t w:1;
         uint8_t opmsk:3;
         uint8_t RX:1;
-        uint8_t bcst:1;
+        uint8_t br:1;
         uint8_t lr:2;
         uint8_t z:1;
     };
@@ -2760,13 +2760,11 @@ x86_decode(
                         evex.raw[1] = vex.raw[1];
                         evex.raw[2] = insn_fetch_type(uint8_t);
 
-                        generate_exception_if(evex.mbs || !evex.mbz, EXC_UD);
+                        generate_exception_if(!evex.mbs || evex.mbz, EXC_UD);
+                        generate_exception_if(!evex.opmsk && evex.z, EXC_UD);
 
                         if ( !mode_64bit() )
-                        {
-                            generate_exception_if(!evex.RX, EXC_UD);
                             evex.R = 1;
-                        }
 
                         vex.opcx = evex.opcx;
                         break;
@@ -3404,6 +3402,7 @@ x86_emulate(
         d = (d & ~DstMask) | DstMem;
         /* Becomes a normal DstMem operation from here on. */
     case DstMem:
+        generate_exception_if(ea.type == OP_MEM && evex.z, EXC_UD);
         if ( state->simd_size )
         {
             generate_exception_if(lock_prefix, EXC_UD);






^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v2 6/6] x86emul: generalize vector length handling for AVX512/EVEX
  2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (4 preceding siblings ...)
  2018-08-29 14:25   ` [PATCH v2 5/6] x86emul: correct EVEX decoding Jan Beulich
@ 2018-08-29 14:25   ` Jan Beulich
  2018-09-04 11:02     ` Andrew Cooper
  5 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-08-29 14:25 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

To allow for some code sharing where possible, copy VEX.L into EVEX.LR
even for VEX (or XOP) encoded insns. Make operand size determination
use this right away, at the same time adding consistency checks for the
EVEX scalar insn cases (the non-scalar ones aren't uniform enough for
the checking to be done in a central place like this).

Note that the broadcast case is not handled here, but will be taken care
of elsewhere (in just a single place rather than at least two).
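
The scalar consistency checks follow from how the EVEX scalar forms are
specified, e.g. vaddss being EVEX.LIG.F3.0F.W0 58 while vaddsd is
EVEX.LIG.F2.0F.W1 58, i.e. embedded prefix and EVEX.W have to agree. A
standalone sketch of the predicate (enumerator values as in
x86_emulate.c):

#include <stdbool.h>

enum vex_pfx { vex_none, vex_66, vex_f3, vex_f2 };

/* Only meaningful for the scalar FP forms, i.e. vex_f3 or vex_f2. */
static bool evex_scalar_w_ok(enum vex_pfx pfx, bool w)
{
    return pfx == vex_f3 ? !w : w;  /* F3 requires W=0, F2 requires W=1 */
}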

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Don't raise #UD in simd_scalar_opc case when EVEX.W != low-opcode-bit.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -191,14 +191,14 @@ enum simd_opsize {
      * Ordinary packed integers:
      * - 64 bits without prefix 66 (MMX)
      * - 128 bits with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      */
     simd_packed_int,
 
     /*
      * Ordinary packed/scalar floating point:
      * - 128 bits without prefix or with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      * - 32 bits with prefix F3 (scalar single)
      * - 64 bits with prefix F2 (scalar double)
      */
@@ -207,14 +207,14 @@ enum simd_opsize {
     /*
      * Packed floating point:
      * - 128 bits without prefix or with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      */
     simd_packed_fp,
 
     /*
      * Single precision packed/scalar floating point:
      * - 128 bits without prefix (SSEn)
-     * - 128/256 bits depending on VEX.L, no prefix (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      * - 32 bits with prefix F3 (scalar)
      */
     simd_single_fp,
@@ -228,7 +228,7 @@ enum simd_opsize {
 
     /*
      * Scalar floating point:
-     * - 32/64 bits depending on VEX.W
+     * - 32/64 bits depending on VEX.W/EVEX.W
      */
     simd_scalar_vexw,
 
@@ -2818,6 +2818,9 @@ x86_decode(
 
                 opcode |= b | MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
 
+                if ( !evex.mbs )
+                    evex.lr = vex.l;
+
                 if ( !(d & ModRM) )
                     break;
 
@@ -3148,7 +3151,7 @@ x86_decode(
             }
             /* fall through */
         case vex_66:
-            op_bytes = 16 << vex.l;
+            op_bytes = 16 << evex.lr;
             break;
         default:
             op_bytes = 0;
@@ -3172,9 +3175,17 @@ x86_decode(
     case simd_any_fp:
         switch ( vex.pfx )
         {
-        default:     op_bytes = 16 << vex.l; break;
-        case vex_f3: op_bytes = 4;           break;
-        case vex_f2: op_bytes = 8;           break;
+        default:
+            op_bytes = 16 << evex.lr;
+            break;
+        case vex_f3:
+            generate_exception_if(evex.mbs && evex.w, EXC_UD);
+            op_bytes = 4;
+            break;
+        case vex_f2:
+            generate_exception_if(evex.mbs && !evex.w, EXC_UD);
+            op_bytes = 8;
+            break;
         }
         break;
 






^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v2 1/6] x86emul: fix FMA scalar operand sizes
  2018-08-29 14:23   ` [PATCH v2 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
@ 2018-09-03 16:43     ` Andrew Cooper
  2018-09-04  7:52       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-09-03 16:43 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 29/08/18 15:23, Jan Beulich wrote:
> FMA insns, other than the earlier AVX additions, don't use the low
> opcode bit to distinguish between single and double vector elements.

I think I've worked out why "other than the" is so weird to read as a
native speaker here.  I think you mean "unlike the" in this context.

> While the difference is benign for packed flavors, the scalar ones
> need to use VEX.W here. Oddly enough the table entries didn't even use
> simd_scalar_fp, but uniformly used simd_packed_fp (implying the
> distinction was by [VEX-encoded] opcode prefix).

Was this a bug in the FMA patch then?

>
> Split simd_scalar_fp into simd_scalar_opc and simd_scalar_vexw, and
> correct 

Missing the rest of the sentence?  (v1 was similar)

>
> Also correct the scalar insn comments (they only ever use XMM registers
> as operands).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

As for the code content, Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v2 2/6] x86emul: extend MASKMOV{Q, DQU} tests
  2018-08-29 14:23   ` [PATCH v2 2/6] x86emul: extend MASKMOV{Q,DQU} tests Jan Beulich
@ 2018-09-03 16:44     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-09-03 16:44 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 29/08/18 15:23, Jan Beulich wrote:
> While deriving the first AVX512 pieces from existing code I got the
> (ultimately wrong) impression that the emulation of these insns was
> broken. Besides testing that the instructions act as no-ops when the
> controlling mask bits are all zero, add tests to also check that the
> data merging actually works.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v2 3/6] x86emul: support AVX512 opmask insns
  2018-08-29 14:24   ` [PATCH v2 3/6] x86emul: support AVX512 opmask insns Jan Beulich
@ 2018-09-03 17:57     ` Andrew Cooper
  2018-09-04  7:58       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-09-03 17:57 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 29/08/18 15:24, Jan Beulich wrote:
> These are all VEX encoded, so the EVEX decoding logic continues to
> remain unused at this point.
>
> The new testcase is deliberately coded in assembly, as a C one would
> have become almost unreadable due to the overwhelming amount of
> __builtin_...() that would need to be used. After all the compiler has
> no underlying type (yet) that could be operated on without builtins,
> other than the vector types used for "normal" SIMD insns.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -6002,6 +6023,60 @@ x86_emulate(
>              dst.val = src.val;
>          break;
>  
> +    case X86EMUL_OPC_VEX(0x0f, 0x4a):    /* kadd{w,q} k,k,k */
> +        if ( !vex.w )
> +            host_and_vcpu_must_have(avx512dq);

Why is this kadd handled differently?  As far as I can tell from the
manual, its encoding looks to be consistent.

I'm afraid that I'm going to have to stare at the manual a bit more
before I can review the rest of this patch.

~Andrew


^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v2 4/6] x86emul: clean up AVX2 insn use in test harness
  2018-08-29 14:24   ` [PATCH v2 4/6] x86emul: clean up AVX2 insn use in test harness Jan Beulich
@ 2018-09-03 18:04     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-09-03 18:04 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 29/08/18 15:24, Jan Beulich wrote:
> Drop the pretty pointless conditionals from code testing AVX insns and
> properly use AVX2 mnemonics in code testing AVX2 insns (the test harness
> already requires a sufficiently new compiler/assembler).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v2 1/6] x86emul: fix FMA scalar operand sizes
  2018-09-03 16:43     ` Andrew Cooper
@ 2018-09-04  7:52       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-04  7:52 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 03.09.18 at 18:43, <andrew.cooper3@citrix.com> wrote:
> On 29/08/18 15:23, Jan Beulich wrote:
>> FMA insns, other than the earlier AVX additions, don't use the low
>> opcode bit to distinguish between single and double vector elements.
> 
> I think I've worked out why "other than the" is so weird to read as a
> native speaker here.  I think you mean "unlike the" in this context.

Changed; I'll try to remember this.

>> While the difference is benign for packed flavors, the scalar ones
>> need to use VEX.W here. Oddly enough the table entries didn't even use
>> simd_scalar_fp, but uniformly used simd_packed_fp (implying the
>> distinction was by [VEX-encoded] opcode prefix).
> 
> Was this a bug in the FMA patch then?

Yes.

>> Split simd_scalar_fp into simd_scalar_opc and simd_scalar_vexw, and
>> correct 
> 
> Missing the rest of the sentence?  (v1 was similar)

Oops: "...and correct FMA scalar table entries to use the latter."

>> Also correct the scalar insn comments (they only ever use XMM registers
>> as operands).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> As for the code content, Reviewed-by: Andrew Cooper
> <andrew.cooper3@citrix.com>

Thanks.

Jan




^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v2 3/6] x86emul: support AVX512 opmask insns
  2018-09-03 17:57     ` Andrew Cooper
@ 2018-09-04  7:58       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-04  7:58 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 03.09.18 at 19:57, <andrew.cooper3@citrix.com> wrote:
> On 29/08/18 15:24, Jan Beulich wrote:
>> These are all VEX encoded, so the EVEX decoding logic continues to
>> remain unused at this point.
>>
>> The new testcase is deliberately coded in assembly, as a C one would
>> have become almost unreadable due to the overwhelming amount of
>> __builtin_...() that would need to be used. After all the compiler has
>> no underlying type (yet) that could be operated on without builtins,
>> other than the vector types used for "normal" SIMD insns.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -6002,6 +6023,60 @@ x86_emulate(
>>              dst.val = src.val;
>>          break;
>>  
>> +    case X86EMUL_OPC_VEX(0x0f, 0x4a):    /* kadd{w,q} k,k,k */
>> +        if ( !vex.w )
>> +            host_and_vcpu_must_have(avx512dq);
> 
> Why is this kadd handled differently?  As far as I can tell from the
> manual, its encoding looks to be consistent.

It's not the encoding, but the AVX512DQ property of kaddw
which other k<op>w insns don't have (those all are AVX512F).

Jan




^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v2 5/6] x86emul: correct EVEX decoding
  2018-08-29 14:25   ` [PATCH v2 5/6] x86emul: correct EVEX decoding Jan Beulich
@ 2018-09-04 10:48     ` Andrew Cooper
  2018-09-04 12:48       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-09-04 10:48 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 29/08/18 15:25, Jan Beulich wrote:
> Fix an inverted pair of checks, drop an incorrect instance of #UD
> raising for non-64-bit mode, and add further generic checks.
>
> Note: Contrary to what SDM Vol 2 rev 067 states, EVEX.V' is _not_ ignored
>       outside of 64-bit mode when the field does not encode a register.
>       Just like EVEX.VVVV is required to be 0b1111 in that case, EVEX.V'
>       is required to be 1 there.
>
> Also rename the bcst field to br, as #UD generation for individual insns
> will need to consider both of its possible meanings.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -650,7 +650,7 @@ union evex {
>          uint8_t w:1;
>          uint8_t opmsk:3;
>          uint8_t RX:1;
> -        uint8_t bcst:1;
> +        uint8_t br:1;
>          uint8_t lr:2;
>          uint8_t z:1;

I'm afraid that some of the choices of field naming in here make the
code impossible to follow, due to their differences from the manual. 
Particularly, the tail end of the structure would be easier to follow if
it were:

diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c
b/xen/arch/x86/x86_emulate/x86_emulate.c
index f38c73b..bc0d39b 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -648,10 +648,10 @@ union evex {
         uint8_t mbs:1;
         uint8_t reg:4;
         uint8_t w:1;
-        uint8_t opmsk:3;
-        uint8_t RX:1;
-        uint8_t br:1;
-        uint8_t lr:2;
+        uint8_t aaa:3;
+        uint8_t V:1;
+        uint8_t b:1;
+        uint8_t l:2;
         uint8_t z:1;
     };
 };

The manual refers to EVEX.RX as the combination of the R and X fields in
the first byte, whereas the field currently named RX in Xen is V' in the
manual.  It is unfortunate that Intel chose L'L for the vector notation,
but l on its own is clearer than lr.
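
For reference, the third EVEX payload byte decomposes as follows, SDM
name first, the name used in Xen (as of this patch) second (my
annotation, not part of the patch under review):

    bits 0-2  aaa / opmsk - opmask register selector
    bit  3    V'  / RX    - high bit of the vvvv register extension
    bit  4    b   / br    - broadcast, or rounding/SAE enable
    bits 5-6  L'L / lr    - vector length, or rounding mode
    bit  7    z   / z     - zeroing- vs merging-masking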

>      };
> @@ -2760,13 +2760,11 @@ x86_decode(
>                          evex.raw[1] = vex.raw[1];
>                          evex.raw[2] = insn_fetch_type(uint8_t);
>  
> -                        generate_exception_if(evex.mbs || !evex.mbz, EXC_UD);
> +                        generate_exception_if(!evex.mbs || evex.mbz, EXC_UD);
> +                        generate_exception_if(!evex.opmsk && evex.z, EXC_UD);

Where does this check derive from?  I presume you've calculated it from
Table 2-40 in the manual, but I don't see anything there which suggests
the restriction applies universally.  Every check in that table is
specific to certain classes of instruction.

>  
>                          if ( !mode_64bit() )
> -                        {
> -                            generate_exception_if(!evex.RX, EXC_UD);
>                              evex.R = 1;
> -                        }
>  
>                          vex.opcx = evex.opcx;
>                          break;
> @@ -3404,6 +3402,7 @@ x86_emulate(
>          d = (d & ~DstMask) | DstMem;
>          /* Becomes a normal DstMem operation from here on. */
>      case DstMem:
> +        generate_exception_if(ea.type == OP_MEM && evex.z, EXC_UD);

I can't find any statement that all DstMem insns prohibit zero-masking.

There is a statement saying that the subset of DstMem instructions which
require an encoded k register may not use zero-masking.

~Andrew


* Re: [PATCH v2 6/6] x86emul: generalize vector length handling for AVX512/EVEX
  2018-08-29 14:25   ` [PATCH v2 6/6] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
@ 2018-09-04 11:02     ` Andrew Cooper
  2018-09-04 12:50       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-09-04 11:02 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 29/08/18 15:25, Jan Beulich wrote:
> @@ -2818,6 +2818,9 @@ x86_decode(
>  
>                  opcode |= b | MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
>  
> +                if ( !evex.mbs )

This use of mbs is very confusing to read.  How about:

#define evex_encoded evex.mbs

which at least gives a semantic name to how you are using the mbs bit?

~Andrew



* Re: [PATCH v2 5/6] x86emul: correct EVEX decoding
  2018-09-04 10:48     ` Andrew Cooper
@ 2018-09-04 12:48       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-04 12:48 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 04.09.18 at 12:48, <andrew.cooper3@citrix.com> wrote:
> On 29/08/18 15:25, Jan Beulich wrote:
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -650,7 +650,7 @@ union evex {
>>          uint8_t w:1;
>>          uint8_t opmsk:3;
>>          uint8_t RX:1;
>> -        uint8_t bcst:1;
>> +        uint8_t br:1;
>>          uint8_t lr:2;
>>          uint8_t z:1;
> 
> I'm afraid that some of the choices of field naming in here make the
> code impossible to follow, due to their differences from the manual.
> In particular, the tail end of the structure would be easier to follow
> if it were:
> 
> diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c
> b/xen/arch/x86/x86_emulate/x86_emulate.c
> index f38c73b..bc0d39b 100644
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -648,10 +648,10 @@ union evex {
>          uint8_t mbs:1;
>          uint8_t reg:4;
>          uint8_t w:1;
> -        uint8_t opmsk:3;
> -        uint8_t RX:1;
> -        uint8_t br:1;
> -        uint8_t lr:2;
> +        uint8_t aaa:3;
> +        uint8_t V:1;
> +        uint8_t b:1;
> +        uint8_t l:2;
>          uint8_t z:1;
>      };
>  };
> 
> The manual refers to EVEX.RX as the combination of the R and X fields in
> the first byte, whereas the field currently named RX in Xen is V' in the
> manual.  It is unfortunate that Intel chose L'L for the vector notation,
> but l on its own is clearer than lr.

The naming isn't very good, yes, but what you suggest won't work:
What you name b and l are dual-purpose fields, hence I really want
to retain their names (standing for "broadcast or rounding" and
"length or rounding" respectively). Furthermore, you'll notice there
is a field named b already. And l alone would further risk being
mixed up with VEX.L.
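
To illustrate the dual use (a sketch going by my reading of the SDM,
not code from the patch; "broadcast" and "rc" are made-up locals):

    if ( ea.type == OP_MEM )
        broadcast = evex.br;      /* b: replicate the memory element */
    else if ( evex.br )
        rc = evex.lr;             /* reg-reg: L'L becomes the rounding
                                     mode, the length is then implied */
    else
        op_bytes = 16 << evex.lr; /* plain vector length */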

As much as I avoided introducing a field named vvvv, I also don't
view it as sensible to introduce a field named aaa. If Intel considers
these reasonable names in their manuals - so be it. They aren't
reasonable at all, imo, in code.

V as a name would make sense only together with vvvv, but the
bit again has dual use, and in its secondary use V is misleading
rather than helpful.

Besides all of this, I now have almost 20 more patches on top of this
one. I really don't see myself renaming all those field references. The
patch here has been available for long enough to give such a
comment / make such suggestions, even more so as I'm only changing
a single field here anyway.

The best I can offer is to attach comments to the fields pointing out
their SDM names. But even that I would view as a slightly odd request
_here_, as both the EVEX and VEX unions have been around in our code
for quite some time.

>> @@ -2760,13 +2760,11 @@ x86_decode(
>>                          evex.raw[1] = vex.raw[1];
>>                          evex.raw[2] = insn_fetch_type(uint8_t);
>>  
>> -                        generate_exception_if(evex.mbs || !evex.mbz, EXC_UD);
>> +                        generate_exception_if(!evex.mbs || evex.mbz, EXC_UD);
>> +                        generate_exception_if(!evex.opmsk && evex.z, EXC_UD);
> 
> Where does this check derive from?  I presume you've calculated it from
> Table 2-40 in the manual, but I don't see anything there which suggests
> the restriction applies universally.  Every check in that table is
> specific to certain classes of instruction.

The check doesn't derive from any tables, but from the fact that
"zeroing-masking" makes no sense without "masking", and from the
observed hardware behavior.
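
Spelled out (my own summary of the four combinations):

    EVEX.aaa != 0, EVEX.z == 0 : merging-masking
    EVEX.aaa != 0, EVEX.z == 1 : zeroing-masking
    EVEX.aaa == 0, EVEX.z == 0 : no masking
    EVEX.aaa == 0, EVEX.z == 1 : meaningless, hence the new #UD check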

>> @@ -3404,6 +3402,7 @@ x86_emulate(
>>          d = (d & ~DstMask) | DstMem;
>>          /* Becomes a normal DstMem operation from here on. */
>>      case DstMem:
>> +        generate_exception_if(ea.type == OP_MEM && evex.z, EXC_UD);
> 
> I can't find any statement that all DstMem insns prohibit zero-masking.
> 
> There is a statement saying that the subset of DstMem instructions which
> require an encoded k register may not use zero-masking.

Not sure what you mean by "encoded k register". Various EVEX and
ModRM fields can encode a k register.

No current instruction allows zeroing-masking on a memory destination
(nor on a k-register destination, if that's what you mean, but that's
not the same as DstMem), and I can't currently foresee this changing
without a CPUID bit telling us, given that even the most obvious
candidates (VMOVAP{S,D} and VMOVDQA{32,64}) don't allow it. Hence I
prefer the check to be in a central place instead of having it repeated
in a number of places.

Note that "generic checks" in the description is not meant to imply that
these exclusively follow what Intel lists in their "generic" tables. I've done
quite a bit of prereq work and classification over the last year, and this
is one of the patterns that resulted from that work without the SDM
explicitly stating it.

Jan



* Re: [PATCH v2 6/6] x86emul: generalize vector length handling for AVX512/EVEX
  2018-09-04 11:02     ` Andrew Cooper
@ 2018-09-04 12:50       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-04 12:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 04.09.18 at 13:02, <andrew.cooper3@citrix.com> wrote:
> On 29/08/18 15:25, Jan Beulich wrote:
>> @@ -2818,6 +2818,9 @@ x86_decode(
>>  
>>                  opcode |= b | MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
>>  
>> +                if ( !evex.mbs )
> 
> This use of mbs is very confusing to read.  How about:
> 
> #define evex_encoded evex.mbs
> 
> which at least gives a semantic name to how you are using the mbs bit?

As you seem to think it helps, I can do that. To me the present form was
obvious enough.

Jan




* [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (6 preceding siblings ...)
  2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
@ 2018-09-18 11:46 ` Jan Beulich
  2018-09-18 11:53   ` [PATCH v3 01/34] x86emul: support AVX512 opmask insns Jan Beulich
                     ` (33 more replies)
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (4 subsequent siblings)
  12 siblings, 34 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:46 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

01: support AVX512 opmask insns
02: x86/HVM: grow MMIO cache data size to 64 bytes
03: correct EVEX decoding
04: generalize vector length handling for AVX512/EVEX
05: support basic AVX512 moves
06: test for correct EVEX Disp8 scaling
07: use AVX512 logic for emulating V{,P}MASKMOV*
08: support AVX512F legacy-equivalent arithmetic FP insns
09: support AVX512DQ logic FP insns
10: support AVX512F "normal" FP compare insns
11: support AVX512F misc legacy-equivalent FP insns
12: support AVX512F fused-multiply-add insns
13: support AVX512F legacy-equivalent logic insns
14: support AVX512{F,DQ} FP broadcast insns
15: support AVX512F v{,u}comis{d,s} insns
16: test: introduce eq()
17: support AVX512{F,BW} packed integer compare insns
18: support AVX512{F,BW} packed integer arithmetic insns
19: use simd_128 also for legacy vector shift insns
20: support AVX512{F,BW} shift/rotate insns
21: support AVX512{F,BW,DQ} extract insns
22: support AVX512{F,BW,DQ} insert insns
23: basic AVX512F testing
24: support AVX512{F,BW,DQ} integer broadcast insns
25: basic AVX512VL testing
26: support AVX512{F,BW} zero- and sign-extending moves
27: support AVX512{F,BW} down conversion moves
28: support AVX512{F,BW} integer unpack insns
29: support AVX512{F,BW,_VBMI} full permute insns
30: support AVX512{F,BW} integer shuffle insns
31: support AVX512{BW,DQ} mask move insns
32: basic AVX512BW testing
33: basic AVX512DQ testing
34: also allow running the 32-bit harness on a 64-bit distro

The main goal of this series is to support enough of the instructions
that basic AVX512F, AVX512BW, AVX512DQ, and AVX512VL tests can
be run (this set is relevant as a basis in particular because together
these mostly [entirely?] cover the legacy-equivalent AVX512 insns).
Later additions may then simply enable more of the (conditional) tests
in simd*.c (or do so by other means).

Jan




* [PATCH v3 01/34] x86emul: support AVX512 opmask insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
@ 2018-09-18 11:53   ` Jan Beulich
  2018-10-25 18:32     ` Andrew Cooper
  2018-09-18 11:53   ` [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes Jan Beulich
                     ` (32 subsequent siblings)
  33 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

These are all VEX encoded, so the EVEX decoding logic remains unused
at this point.

The new testcase is deliberately coded in assembly, as a C one would
have become almost unreadable due to the overwhelming number of
__builtin_...() invocations that would need to be used. After all, the
compiler has no underlying type (yet) that could be operated on without
builtins, other than the vector types used for "normal" SIMD insns.
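
For comparison, even a single opmask operation in C needs a dedicated
builtin. A sketch of what each test line would have looked like
(builtin name as in GCC's avx512fintrin.h, compiled with -mavx512f):

    typedef unsigned short mask16; /* no first-class opmask type in C */

    mask16 kand16(mask16 a, mask16 b)
    {
        /* kandw - can't be written as a plain "a & b" on opmask regs */
        return __builtin_ia32_kandhi(a, b);
    }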

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Use distinct temporary file names in testcase.mk. Additions to clean
    target.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,6 +16,8 @@ FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
 
+OPMASK := avx512f avx512dq avx512bw
+
 blowfish-cflags := ""
 blowfish-cflags-x86_32 := "-mno-accumulate-outgoing-args -Dstatic="
 
@@ -51,6 +53,10 @@ xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
 
+avx512f-opmask-vecs := 2
+avx512dq-opmask-vecs := 1
+avx512bw-opmask-vecs := 4 8
+
 # For AVX and later, have the compiler avoid XMM0 to widen coverage of
 # the VEX.vvvv checks in the emulator.  For 3DNow!, however, force SSE
 # use for floating point operations, to avoid mixing MMX and FPU register
@@ -80,9 +86,13 @@ $(1)-cflags := \
 	   $(foreach flt,$($(1)-flts), \
 	     "-D_$(vec)x$(idx)f$(flt) -m$(1:-sg=) $(call non-sse,$(1)) -Os -DVEC_MAX=$(vec) -DIDX_SIZE=$(idx) -DFLOAT_SIZE=$(flt)")))
 endef
+define opmask-defs
+$(1)-opmask-cflags := $(foreach vec,$($(1)-opmask-vecs), "-D_$(vec) -m$(1) -Os -DSIZE=$(vec)")
+endef
 
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
+$(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
 $(addsuffix .h,$(TESTCASES)): %.h: %.c testcase.mk Makefile
 	rm -f $@.new $*.bin
@@ -100,6 +110,22 @@ $(addsuffix .h,$(TESTCASES)): %.h: %.c t
 	)
 	mv $@.new $@
 
+$(addsuffix -opmask.h,$(OPMASK)): %.h: opmask.S testcase.mk Makefile
+	rm -f $@.new $*.bin
+	$(foreach arch,$(filter-out $(XEN_COMPILE_ARCH),x86_32) $(XEN_COMPILE_ARCH), \
+	    for cflags in $($*-cflags) $($*-cflags-$(arch)); do \
+		$(MAKE) -f testcase.mk TESTCASE=$* XEN_TARGET_ARCH=$(arch) $*-cflags="$$cflags" all; \
+		prefix=$(shell echo $(subst -,_,$*) | sed -e 's,^\([0-9]\),_\1,'); \
+		flavor=$$(echo $${cflags} | sed -e 's, .*,,' -e 'y,-=,__,') ; \
+		(echo 'static const unsigned int __attribute__((section(".test, \"ax\", @progbits #")))' \
+		      "$${prefix}_$(arch)$${flavor}[] = {"; \
+		 od -v -t x $*.bin | sed -e 's/^[0-9]* /0x/' -e 's/ /, 0x/g' -e 's/$$/,/'; \
+		 echo "};") >>$@.new; \
+		rm -f $*.bin; \
+	    done; \
+	)
+	mv $@.new $@
+
 $(addsuffix .c,$(SIMD)):
 	ln -sf simd.c $@
 
@@ -118,7 +144,8 @@ $(TARGET): x86-emulate.o test_x86_emulat
 
 .PHONY: clean
 clean:
-	rm -rf $(TARGET) *.o *~ core $(addsuffix .h,$(TESTCASES)) *.bin x86_emulate
+	rm -rf $(TARGET) *.o *~ core *.bin x86_emulate
+	rm -rf $(TARGET) $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
 
 .PHONY: distclean
 distclean: clean
@@ -145,4 +172,4 @@ x86-emulate.o test_x86_emulator.o wrappe
 x86-emulate.o: x86_emulate/x86_emulate.c
 x86-emulate.o: HOSTCFLAGS += -D__XEN_TOOLS__
 
-test_x86_emulator.o: $(addsuffix .h,$(TESTCASES))
+test_x86_emulator.o: $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
--- /dev/null
+++ b/tools/tests/x86_emulator/opmask.S
@@ -0,0 +1,144 @@
+#ifdef __i386__
+# define R(x) e##x
+# define DATA(x) x
+#else
+# if SIZE == 8
+#  define R(x) r##x
+# else
+#  define R(x) e##x
+# endif
+# define DATA(x) x(%rip)
+#endif
+
+#if SIZE == 1
+# define _(x) x##b
+#elif SIZE == 2
+# define _(x) x##w
+# define WIDEN(x) x##bw
+#elif SIZE == 4
+# define _(x) x##d
+# define WIDEN(x) x##wd
+#elif SIZE == 8
+# define _(x) x##q
+# define WIDEN(x) x##dq
+#endif
+
+    .macro check res1:req, res2:req, line:req
+    _(kmov)       %\res1, DATA(out)
+#if SIZE < 8 || !defined(__i386__)
+    _(kmov)       %\res2, %R(dx)
+    cmp           DATA(out), %R(dx)
+#else
+    sub           $8, %esp
+    kmovq         %\res2, (%esp)
+    pop           %ecx
+    pop           %edx
+    cmp           DATA(out), %ecx
+    jne           0f
+    cmp           DATA(out+4), %edx
+0:
+#endif
+    je            1f
+    mov           $\line, %eax
+    ret
+1:
+    .endm
+
+    .text
+    .globl _start
+_start:
+    _(kmov)       DATA(in1), %k1
+#if SIZE < 8 || !defined(__i386__)
+    mov           DATA(in2), %R(ax)
+    _(kmov)       %R(ax), %k2
+#else
+    _(kmov)       DATA(in2), %k2
+#endif
+
+    _(kor)        %k1, %k2, %k3
+    _(kand)       %k1, %k2, %k4
+    _(kandn)      %k3, %k4, %k5
+    _(kxor)       %k1, %k2, %k6
+    check         k5, k6, __LINE__
+
+    _(knot)       %k6, %k3
+    _(kxnor)      %k1, %k2, %k4
+    check         k3, k4, __LINE__
+
+    _(kshiftl)    $1, %k1, %k3
+    _(kshiftl)    $2, %k3, %k4
+    _(kshiftl)    $3, %k1, %k5
+    check         k4, k5, __LINE__
+
+    _(kshiftr)    $1, %k1, %k3
+    _(kshiftr)    $2, %k3, %k4
+    _(kshiftr)    $3, %k1, %k5
+    check         k4, k5, __LINE__
+
+    _(kortest)    %k6, %k6
+    jnbe          1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxor)       %k0, %k0, %k3
+    _(kortest)    %k3, %k3
+    jz            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxnor)      %k0, %k0, %k3
+    _(kortest)    %k3, %k3
+    jc            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+#if SIZE > 1
+
+    _(kshiftr)    $SIZE*4, %k3, %k4
+    WIDEN(kunpck) %k4, %k4, %k5
+    check         k3, k5, __LINE__
+
+#endif
+
+#if SIZE != 2 || defined(__AVX512DQ__)
+
+    _(kadd)       %k1, %k1, %k3
+    _(kshiftl)    $1, %k1, %k4
+    check         k3, k4, __LINE__
+
+    _(ktest)      %k2, %k1
+    jnbe          1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxor)       %k0, %k0, %k3
+    _(ktest)      %k0, %k3
+    jz            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxnor)      %k0, %k0, %k4
+    _(ktest)      %k0, %k4
+    jc            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+#endif
+
+    xor           %eax, %eax
+    ret
+
+    .section .rodata, "a", @progbits
+    .balign 8
+in1: .byte 0b10110011, 0b10001111, 0b00001111, 0b10000011, 0b11110000, 0b00111111, 0b10000000, 0b11111111
+in2: .byte 0b11111111, 0b00000001, 0b11111100, 0b00001111, 0b11000001, 0b11110000, 0b11110001, 0b11001101
+
+    .data
+    .balign 8
+out: .quad 0
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -18,6 +18,9 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx2.h"
 #include "avx2-sg.h"
 #include "xop.h"
+#include "avx512f-opmask.h"
+#include "avx512dq-opmask.h"
+#include "avx512bw-opmask.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -78,6 +81,24 @@ static bool simd_check_xop(void)
     return cpu_has_xop;
 }
 
+static bool simd_check_avx512f(void)
+{
+    return cpu_has_avx512f;
+}
+#define simd_check_avx512f_opmask simd_check_avx512f
+
+static bool simd_check_avx512dq(void)
+{
+    return cpu_has_avx512dq;
+}
+#define simd_check_avx512dq_opmask simd_check_avx512dq
+
+static bool simd_check_avx512bw(void)
+{
+    return cpu_has_avx512bw;
+}
+#define simd_check_avx512bw_opmask simd_check_avx512bw
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -223,6 +244,10 @@ static const struct {
     SIMD(XOP i16x16,              xop,      32i2),
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
+    SIMD(OPMASK/w,     avx512f_opmask,         2),
+    SIMD(OPMASK/b,    avx512dq_opmask,         1),
+    SIMD(OPMASK/d,    avx512bw_opmask,         4),
+    SIMD(OPMASK/q,    avx512bw_opmask,         8),
 #undef SIMD_
 #undef SIMD
 };
@@ -3426,8 +3451,8 @@ int main(int argc, char **argv)
             rc = x86_emulate(&ctxt, &emulops);
             if ( rc != X86EMUL_OKAY )
             {
-                printf("failed at %%eip == %08lx (opcode %08x)\n",
-                       (unsigned long)regs.eip, ctxt.opcode);
+                printf("failed (%d) at %%eip == %08lx (opcode %08x)\n",
+                       rc, (unsigned long)regs.eip, ctxt.opcode);
                 return 1;
             }
         }
--- a/tools/tests/x86_emulator/testcase.mk
+++ b/tools/tests/x86_emulator/testcase.mk
@@ -14,3 +14,9 @@ all: $(TESTCASE).bin
 	$(LD) $(LDFLAGS_DIRECT) -N -Ttext 0x100000 -o $*.tmp $*.o
 	$(OBJCOPY) -O binary $*.tmp $@
 	rm -f $*.tmp
+
+%-opmask.bin: opmask.S
+	$(CC) $(filter-out -M% .%,$(CFLAGS)) -c $< -o $(basename $@).o
+	$(LD) $(LDFLAGS_DIRECT) -N -Ttext 0x100000 -o $(basename $@).tmp $(basename $@).o
+	$(OBJCOPY) -O binary $(basename $@).tmp $@
+	rm -f $(basename $@).tmp
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -209,6 +209,9 @@ int emul_test_get_fpu(
     case X86EMUL_FPU_ymm:
         if ( cpu_has_avx )
             break;
+    case X86EMUL_FPU_opmask:
+        if ( cpu_has_avx512f )
+            break;
     default:
         return X86EMUL_UNHANDLEABLE;
     }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -236,6 +236,36 @@ static inline uint64_t xgetbv(uint32_t x
     (res.c & (1U << 21)) != 0; \
 })
 
+#define cpu_has_avx512f ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 16)) != 0; \
+})
+
+#define cpu_has_avx512dq ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 17)) != 0; \
+})
+
+#define cpu_has_avx512bw ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 30)) != 0; \
+})
+
 int emul_test_cpuid(
     uint32_t leaf,
     uint32_t subleaf,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -491,6 +491,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
+    [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
@@ -1187,6 +1188,11 @@ static int _get_fpu(
             return X86EMUL_UNHANDLEABLE;
         break;
 
+    case X86EMUL_FPU_opmask:
+        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
+            return X86EMUL_UNHANDLEABLE;
+        break;
+
     default:
         break;
     }
@@ -1762,12 +1768,15 @@ static bool vcpu_has(
 #define vcpu_has_bmi2()        vcpu_has(         7, EBX,  8, ctxt, ops)
 #define vcpu_has_rtm()         vcpu_has(         7, EBX, 11, ctxt, ops)
 #define vcpu_has_mpx()         vcpu_has(         7, EBX, 14, ctxt, ops)
+#define vcpu_has_avx512f()     vcpu_has(         7, EBX, 16, ctxt, ops)
+#define vcpu_has_avx512dq()    vcpu_has(         7, EBX, 17, ctxt, ops)
 #define vcpu_has_rdseed()      vcpu_has(         7, EBX, 18, ctxt, ops)
 #define vcpu_has_adx()         vcpu_has(         7, EBX, 19, ctxt, ops)
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
+#define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -2396,6 +2405,18 @@ x86_decode_twobyte(
         }
         break;
 
+    case X86EMUL_OPC_VEX(0, 0x90):    /* kmov{w,q} */
+    case X86EMUL_OPC_VEX_66(0, 0x90): /* kmov{b,d} */
+        state->desc = DstReg | SrcMem | Mov;
+        state->simd_size = simd_other;
+        break;
+
+    case X86EMUL_OPC_VEX(0, 0x91):    /* kmov{w,q} */
+    case X86EMUL_OPC_VEX_66(0, 0x91): /* kmov{b,d} */
+        state->desc = DstMem | SrcReg | Mov;
+        state->simd_size = simd_other;
+        break;
+
     case 0xae:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
@@ -6002,6 +6023,60 @@ x86_emulate(
             dst.val = src.val;
         break;
 
+    case X86EMUL_OPC_VEX(0x0f, 0x4a):    /* kadd{w,q} k,k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x41):    /* kand{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x41): /* kand{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x42):    /* kandn{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x42): /* kandn{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x45):    /* kor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x45): /* kor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x46):    /* kxnor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x46): /* kxnor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x47):    /* kxor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x47): /* kxor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x4a): /* kadd{b,d} k,k,k */
+        generate_exception_if(!vex.l, EXC_UD);
+    opmask_basic:
+        if ( vex.w )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( vex.pfx )
+            host_and_vcpu_must_have(avx512dq);
+    opmask_common:
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(!vex.r || (mode_64bit() && !(vex.reg & 8)) ||
+                              ea.type != OP_REG, EXC_UD);
+
+        vex.reg |= 8;
+        d &= ~TwoOp;
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        insn_bytes = PFX_BYTES + 2;
+
+        state->simd_size = simd_other;
+        op_bytes = 1; /* Any non-zero value will do. */
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x44):    /* knot{w,q} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x44): /* knot{b,d} k,k */
+        generate_exception_if(vex.l || vex.reg != 0xf, EXC_UD);
+        goto opmask_basic;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x4b):    /* kunpck{w,d}{d,q} k,k,k */
+        generate_exception_if(!vex.l, EXC_UD);
+        host_and_vcpu_must_have(avx512bw);
+        goto opmask_common;
+
+    case X86EMUL_OPC_VEX_66(0x0f, 0x4b): /* kunpckbw k,k,k */
+        generate_exception_if(!vex.l || vex.w, EXC_UD);
+        goto opmask_common;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x50):     /* movmskp{s,d} xmm,reg */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x50): /* vmovmskp{s,d} {x,y}mm,reg */
     CASE_SIMD_PACKED_INT(0x0f, 0xd7):      /* pmovmskb {,x}mm,reg */
@@ -6552,6 +6627,154 @@ x86_emulate(
         dst.val = test_cc(b, _regs.eflags);
         break;
 
+    case X86EMUL_OPC_VEX(0x0f, 0x91):    /* kmov{w,q} k,mem */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x91): /* kmov{b,d} k,mem */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x90):    /* kmov{w,q} k/mem,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x90): /* kmov{b,d} k/mem,k */
+        generate_exception_if(vex.l || !vex.r, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.w )
+        {
+            host_and_vcpu_must_have(avx512bw);
+            op_bytes = 4 << !vex.pfx;
+        }
+        else if ( vex.pfx )
+        {
+            host_and_vcpu_must_have(avx512dq);
+            op_bytes = 1;
+        }
+        else
+            op_bytes = 2;
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            vex.b = 1;
+            opc[1] &= 0x38;
+        }
+        insn_bytes = PFX_BYTES + 2;
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x92):    /* kmovw r32,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x92): /* kmovb r32,k */
+    case X86EMUL_OPC_VEX_F2(0x0f, 0x92): /* kmov{d,q} reg,k */
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.pfx == vex_f2 )
+            host_and_vcpu_must_have(avx512bw);
+        else
+        {
+            generate_exception_if(vex.w, EXC_UD);
+            if ( vex.pfx )
+                host_and_vcpu_must_have(avx512dq);
+        }
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        vex.b = 1;
+        if ( !mode_64bit() )
+            vex.w = 0;
+        opc[1] = modrm & 0xf8;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        ea.reg = decode_gpr(&_regs, modrm_rm);
+        invoke_stub("", "", "=m" (dummy) : "a" (*ea.reg));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        dst.type = OP_NONE;
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x93):    /* kmovw k,r32 */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x93): /* kmovb k,r32 */
+    case X86EMUL_OPC_VEX_F2(0x0f, 0x93): /* kmov{d,q} k,reg */
+        generate_exception_if(vex.l || vex.reg != 0xf || ea.type != OP_REG,
+                              EXC_UD);
+        dst = ea;
+        dst.reg = decode_gpr(&_regs, modrm_reg);
+
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.pfx == vex_f2 )
+        {
+            host_and_vcpu_must_have(avx512bw);
+            dst.bytes = 4 << (mode_64bit() && vex.w);
+        }
+        else
+        {
+            generate_exception_if(vex.w, EXC_UD);
+            dst.bytes = 4;
+            if ( vex.pfx )
+                host_and_vcpu_must_have(avx512dq);
+        }
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert GPR destination to %rAX. */
+        vex.r = 1;
+        if ( !mode_64bit() )
+            vex.w = 0;
+        opc[1] = modrm & 0xc7;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x99):    /* ktest{w,q} k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x98):    /* kortest{w,q} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x98): /* kortest{b,d} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x99): /* ktest{b,d} k,k */
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.w )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( vex.pfx )
+            host_and_vcpu_must_have(avx512dq);
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),
+                    _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),
+                    [eflags] "+g" (_regs.eflags),
+                    "=a" (dst.val), [tmp] "=&r" (dummy)
+                    : [mask] "i" (EFLAGS_MASK));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        dst.type = OP_NONE;
+        break;
+
     case X86EMUL_OPC(0x0f, 0xa2): /* cpuid */
         msr_val = 0;
         fail_if(ops->cpuid == NULL);
@@ -8170,6 +8393,23 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+    opmask_shift_imm:
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_opmask);
+        op_bytes = 1; /* Any non-zero value will do. */
+        goto simd_0f_imm8;
+
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x31): /* kshiftr{d,q} $imm8,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x33): /* kshiftl{d,q} $imm8,k,k */
+        host_and_vcpu_must_have(avx512bw);
+        goto opmask_shift_imm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x44):     /* pclmulqdq $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,xmm/m128,xmm,xmm */
         host_and_vcpu_must_have(pclmulqdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -170,6 +170,7 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_mmx, /* MMX instruction set (%mm0-%mm7) */
     X86EMUL_FPU_xmm, /* SSE instruction set (%xmm0-%xmm7/15) */
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
+    X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -99,9 +99,12 @@
 #define cpu_has_rtm             boot_cpu_has(X86_FEATURE_RTM)
 #define cpu_has_fpu_sel         (!boot_cpu_has(X86_FEATURE_NO_FPU_SEL))
 #define cpu_has_mpx             boot_cpu_has(X86_FEATURE_MPX)
+#define cpu_has_avx512f         boot_cpu_has(X86_FEATURE_AVX512F)
+#define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
+#define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)





* [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
  2018-09-18 11:53   ` [PATCH v3 01/34] x86emul: support AVX512 opmask insns Jan Beulich
@ 2018-09-18 11:53   ` Jan Beulich
  2018-09-18 16:05     ` Paul Durrant
  2018-10-25 18:36     ` Andrew Cooper
  2018-09-18 11:55   ` [PATCH v3 03/34] x86emul: correct EVEX decoding Jan Beulich
                     ` (31 subsequent siblings)
  33 siblings, 2 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant, Wei Liu

This is needed before enabling any AVX512 insns in the emulator, as a
single ZMM access covers a full 64 bytes. Change the way alignment is
enforced at the same time.

Add a check that the buffer won't actually overflow, and while at it
also convert the page-boundary check from BUG_ON() to the same, more
graceful, failure mode.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -866,7 +866,18 @@ static int hvmemul_phys_mmio_access(
     int rc = X86EMUL_OKAY;
 
     /* Accesses must fall within a page. */
-    BUG_ON((gpa & ~PAGE_MASK) + size > PAGE_SIZE);
+    if ( (gpa & ~PAGE_MASK) + size > PAGE_SIZE )
+    {
+        ASSERT_UNREACHABLE();
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    /* Accesses must not overflow the cache's buffer. */
+    if ( size > sizeof(cache->buffer) )
+    {
+        ASSERT_UNREACHABLE();
+        return X86EMUL_UNHANDLEABLE;
+    }
 
     /*
      * hvmemul_do_io() cannot handle non-power-of-2 accesses or
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -42,15 +42,14 @@ struct hvm_vcpu_asid {
 };
 
 /*
- * We may read or write up to m256 as a number of device-model
+ * We may read or write up to m512 as a number of device-model
  * transactions.
  */
 struct hvm_mmio_cache {
     unsigned long gla;
     unsigned int size;
     uint8_t dir;
-    uint8_t pad[3]; /* make buffer[] long-aligned */
-    uint8_t buffer[32];
+    uint8_t buffer[64] __aligned(sizeof(long));
 };
 
 struct hvm_vcpu_io {






* [PATCH v3 03/34] x86emul: correct EVEX decoding
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
  2018-09-18 11:53   ` [PATCH v3 01/34] x86emul: support AVX512 opmask insns Jan Beulich
  2018-09-18 11:53   ` [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes Jan Beulich
@ 2018-09-18 11:55   ` Jan Beulich
  2018-09-18 11:55   ` [PATCH v3 04/34] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
                     ` (30 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Fix an inverted pair of checks, drop an incorrect instance of #UD
raising for non-64-bit mode, and add further generic checks.

Note: Contrary to what SDM Vol 2 rev 067 states, EVEX.V' is _not_ ignored
      outside of 64-bit mode when the field does not encode a register.
      Just like EVEX.VVVV is required to be 0b1111 in that case, EVEX.V'
      is required to be 1 there.

Also rename the bcst field to br, as #UD generation for individual insns
will need to consider both of its possible meanings.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -650,7 +650,7 @@ union evex {
         uint8_t w:1;
         uint8_t opmsk:3;
         uint8_t RX:1;
-        uint8_t bcst:1;
+        uint8_t br:1;
         uint8_t lr:2;
         uint8_t z:1;
     };
@@ -2760,13 +2760,11 @@ x86_decode(
                         evex.raw[1] = vex.raw[1];
                         evex.raw[2] = insn_fetch_type(uint8_t);
 
-                        generate_exception_if(evex.mbs || !evex.mbz, EXC_UD);
+                        generate_exception_if(!evex.mbs || evex.mbz, EXC_UD);
+                        generate_exception_if(!evex.opmsk && evex.z, EXC_UD);
 
                         if ( !mode_64bit() )
-                        {
-                            generate_exception_if(!evex.RX, EXC_UD);
                             evex.R = 1;
-                        }
 
                         vex.opcx = evex.opcx;
                         break;
@@ -3404,6 +3402,7 @@ x86_emulate(
         d = (d & ~DstMask) | DstMem;
         /* Becomes a normal DstMem operation from here on. */
     case DstMem:
+        generate_exception_if(ea.type == OP_MEM && evex.z, EXC_UD);
         if ( state->simd_size )
         {
             generate_exception_if(lock_prefix, EXC_UD);






* [PATCH v3 04/34] x86emul: generalize vector length handling for AVX512/EVEX
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (2 preceding siblings ...)
  2018-09-18 11:55   ` [PATCH v3 03/34] x86emul: correct EVEX decoding Jan Beulich
@ 2018-09-18 11:55   ` Jan Beulich
  2018-09-18 11:56   ` [PATCH v3 05/34] x86emul: support basic AVX512 moves Jan Beulich
                     ` (29 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

To allow for some code sharing where possible, copy VEX.L into EVEX.LR
even for VEX (or XOP) encoded insns. Make operand size determination
use this right away, at the same time adding consistency checks for the
EVEX scalar insn cases (the non-scalar ones aren't uniform enough for
the checking to be done in a central place like this).
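
As a concrete illustration (numbers mine), the payoff is that a single
expression now covers all three vector widths:

    op_bytes = 16 << evex.lr;    /* 0 -> 16 (xmm), 1 -> 32 (ymm),
                                    2 -> 64 (zmm) */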

Note that the broadcast case is not handled here, but will be taken care
of elsewhere (in just a single place rather than at least two).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Introduce evex_encoded() to replace open-coded evex.mbs checks.
v2: Don't raise #UD in simd_scalar_opc case when EVEX.W != low-opcode-bit.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -191,14 +191,14 @@ enum simd_opsize {
      * Ordinary packed integers:
      * - 64 bits without prefix 66 (MMX)
      * - 128 bits with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      */
     simd_packed_int,
 
     /*
      * Ordinary packed/scalar floating point:
      * - 128 bits without prefix or with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      * - 32 bits with prefix F3 (scalar single)
      * - 64 bits with prefix F2 (scalar double)
      */
@@ -207,14 +207,14 @@ enum simd_opsize {
     /*
      * Packed floating point:
      * - 128 bits without prefix or with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      */
     simd_packed_fp,
 
     /*
      * Single precision packed/scalar floating point:
      * - 128 bits without prefix (SSEn)
-     * - 128/256 bits depending on VEX.L, no prefix (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      * - 32 bits with prefix F3 (scalar)
      */
     simd_single_fp,
@@ -228,7 +228,7 @@ enum simd_opsize {
 
     /*
      * Scalar floating point:
-     * - 32/64 bits depending on VEX.W
+     * - 32/64 bits depending on VEX.W/EVEX.W
      */
     simd_scalar_vexw,
 
@@ -2249,6 +2249,7 @@ int x86emul_unhandleable_rw(
 #define lock_prefix (state->lock_prefix)
 #define vex (state->vex)
 #define evex (state->evex)
+#define evex_encoded() (evex.mbs)
 #define ea (state->ea)
 
 static int
@@ -2818,6 +2819,9 @@ x86_decode(
 
                 opcode |= b | MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
 
+                if ( !evex_encoded() )
+                    evex.lr = vex.l;
+
                 if ( !(d & ModRM) )
                     break;
 
@@ -3148,7 +3152,7 @@ x86_decode(
             }
             /* fall through */
         case vex_66:
-            op_bytes = 16 << vex.l;
+            op_bytes = 16 << evex.lr;
             break;
         default:
             op_bytes = 0;
@@ -3172,9 +3176,17 @@ x86_decode(
     case simd_any_fp:
         switch ( vex.pfx )
         {
-        default:     op_bytes = 16 << vex.l; break;
-        case vex_f3: op_bytes = 4;           break;
-        case vex_f2: op_bytes = 8;           break;
+        default:
+            op_bytes = 16 << evex.lr;
+            break;
+        case vex_f3:
+            generate_exception_if(evex_encoded() && evex.w, EXC_UD);
+            op_bytes = 4;
+            break;
+        case vex_f2:
+            generate_exception_if(evex_encoded() && !evex.w, EXC_UD);
+            op_bytes = 8;
+            break;
         }
         break;
 






* [PATCH v3 05/34] x86emul: support basic AVX512 moves
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (3 preceding siblings ...)
  2018-09-18 11:55   ` [PATCH v3 04/34] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
@ 2018-09-18 11:56   ` Jan Beulich
  2018-09-18 11:57   ` [PATCH v3 06/34] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
                     ` (28 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note: SDM Vol 2 rev 067 is not really consistent about EVEX.L'L for LIG
      insns - the only place where this is made explicit is a table in
      the section titled "Vector Length Orthogonality": while values 0,
      1, and 2 are tolerated, a value of 3 uniformly leads to #UD.
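
Presumably (my reading of the consequence, not a hunk from this patch)
this allows the decoder to reject the one invalid value uniformly:

    generate_exception_if(evex.lr == 3, EXC_UD); /* no insn takes L'L=3 */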

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Restrict k-reg reading to insns with a memory operand. Shrink scope
    of "disp8scale".
v2: Move "full" into a narrower scope.

--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -1985,6 +1985,53 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovq %xmm1,32(%edx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_mem);
+
+        asm volatile ( "pcmpgtb %%xmm1, %%xmm1\n"
+                       put_insn(evex_vmovq_to_mem, "%{evex%} vmovq %%xmm1, 32(%0)")
+                       :: "d" (NULL) );
+
+        memset(res, 0xdb, 64);
+        set_insn(evex_vmovq_to_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_mem) ||
+             *((uint64_t *)res + 4) ||
+             memcmp(res, res + 10, 24) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovq 32(%edx),%xmm0...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_from_mem);
+
+        asm volatile ( "pcmpeqb %%xmm0, %%xmm0\n"
+                       put_insn(evex_vmovq_from_mem, "%{evex%} vmovq 32(%0), %%xmm0")
+                       :: "d" (NULL) );
+
+        set_insn(evex_vmovq_from_mem);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_from_mem) )
+            goto fail;
+        asm ( "vmovq %1, %%xmm1\n\t"
+              "vpcmpeqq %%zmm0, %%zmm1, %%k0\n"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movdqu %xmm2,(%ecx)...");
     if ( stack_exec && cpu_has_sse2 )
     {
@@ -2085,6 +2132,118 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vmovdqu32 %zmm2,(%ecx){%k1}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovdqu32_to_mem);
+
+        memset(res, 0x55, 128);
+
+        asm volatile ( "vpcmpeqd %%ymm2, %%ymm2, %%ymm2\n\t"
+                       "kmovw %1,%%k1\n"
+                       put_insn(vmovdqu32_to_mem,
+                                "vmovdqu32 %%zmm2, (%0)%{%%k1%}")
+                       :: "c" (NULL), "rm" (res[0]) );
+        set_insn(vmovdqu32_to_mem);
+
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || memcmp(res + 16, res + 24, 32) ||
+             !check_eip(vmovdqu32_to_mem) )
+            goto fail;
+
+        res[16] = ~0; res[18] = ~0; res[20] = ~0; res[22] = ~0;
+        res[24] =  0; res[26] =  0; res[28] =  0; res[30] =  0;
+        if ( memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu32 64(%edx),%zmm2{%k2}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovdqu32_from_mem);
+
+        asm volatile ( "knotw %%k1, %%k2\n"
+                       put_insn(vmovdqu32_from_mem,
+                                "vmovdqu32 64(%0), %%zmm2%{%%k2%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovdqu32_from_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovdqu32_from_mem) )
+            goto fail;
+        asm ( "vpcmpeqd %1, %%zmm2, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[0]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu16 %zmm3,(%ecx){%k1}...");
+    if ( stack_exec && cpu_has_avx512bw )
+    {
+        decl_insn(vmovdqu16_to_mem);
+
+        memset(res, 0x55, 128);
+
+        asm volatile ( "vpcmpeqw %%ymm3, %%ymm3, %%ymm3\n\t"
+                       "kmovd %1,%%k1\n"
+                       put_insn(vmovdqu16_to_mem,
+                                "vmovdqu16 %%zmm3, (%0)%{%%k1%}")
+                       :: "c" (NULL), "rm" (res[0]) );
+        set_insn(vmovdqu16_to_mem);
+
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || memcmp(res + 16, res + 24, 32) ||
+             !check_eip(vmovdqu16_to_mem) )
+            goto fail;
+
+        for ( i = 16; i < 24; ++i )
+            res[i] |= 0x0000ffff;
+        for ( ; i < 32; ++i )
+            res[i] &= 0xffff0000;
+        if ( memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu16 64(%edx),%zmm3{%k2}...");
+    if ( stack_exec && cpu_has_avx512bw )
+    {
+        decl_insn(vmovdqu16_from_mem);
+
+        asm volatile ( "knotd %%k1, %%k2\n"
+                       put_insn(vmovdqu16_from_mem,
+                                "vmovdqu16 64(%0), %%zmm3%{%%k2%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovdqu16_from_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovdqu16_from_mem) )
+            goto fail;
+        asm ( "vpcmpeqw %1, %%zmm3, %%k0\n\t"
+              "kmovd %%k0, %0" : "=r" (rc) : "m" (res[0]) );
+        if ( rc != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movsd %xmm5,(%ecx)...");
     memset(res, 0x77, 64);
     memset(res + 10, 0x66, 8);
@@ -2186,6 +2345,71 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vmovsd %xmm5,16(%ecx){%k3}...");
+    memset(res, 0x88, 128);
+    memset(res + 20, 0x77, 8);
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovsd_masked_to_mem);
+
+        asm volatile ( "vbroadcastsd %0, %%ymm5\n\t"
+                       "kxorw %%k3, %%k3, %%k3\n"
+                       put_insn(vmovsd_masked_to_mem,
+                                "vmovsd %%xmm5, 16(%1)%{%%k3%}")
+                       :: "m" (res[20]), "c" (NULL) );
+
+        set_insn(vmovsd_masked_to_mem);
+        regs.ecx = 0;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(vmovsd_masked_to_mem) )
+            goto fail;
+
+        asm volatile ( "kmovw %0, %%k3\n" :: "m" (res[20]) );
+
+        set_insn(vmovsd_masked_to_mem);
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(vmovsd_masked_to_mem) ||
+             memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+    {
+        printf("skipped\n");
+        memset(res + 4, 0x77, 8);
+    }
+
+    printf("%-40s", "Testing vmovaps (%edx),%zmm7{%k3}{z}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovaps_masked_from_mem);
+
+        asm volatile ( "vpcmpeqd %%xmm7, %%xmm7, %%xmm7\n\t"
+                       "vbroadcastss %%xmm7, %%zmm7\n"
+                       put_insn(vmovaps_masked_from_mem,
+                                "vmovaps (%0), %%zmm7%{%%k3%}%{z%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovaps_masked_from_mem);
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovaps_masked_from_mem) )
+            goto fail;
+        asm ( "vcmpeqps %1, %%zmm7, %%k0\n\t"
+              "vxorps %%xmm0, %%xmm0, %%xmm0\n\t"
+              "vcmpeqps %%zmm0, %%zmm7, %%k1\n\t"
+              "kxorw %%k1, %%k0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[16]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movd %mm3,32(%ecx)...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2341,6 +2565,55 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovd %xmm3,32(%ecx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_to_mem);
+
+        asm volatile ( "pcmpeqb %%xmm3, %%xmm3\n"
+                       put_insn(evex_vmovd_to_mem,
+                                "%{evex%} vmovd %%xmm3, 32(%0)")
+                       :: "c" (NULL) );
+
+        memset(res, 0xbd, 64);
+        set_insn(evex_vmovd_to_mem);
+        regs.ecx = (unsigned long)res;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovd_to_mem) ||
+             res[8] + 1 ||
+             memcmp(res, res + 9, 28) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovd 32(%ecx),%xmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_from_mem);
+
+        asm volatile ( "pcmpeqb %%xmm4, %%xmm4\n"
+                       put_insn(evex_vmovd_from_mem,
+                                "%{evex%} vmovd 32(%0), %%xmm4")
+                       :: "c" (NULL) );
+
+        set_insn(evex_vmovd_from_mem);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovd_from_mem) )
+            goto fail;
+        asm ( "vmovd %1, %%xmm0\n\t"
+              "vpcmpeqd %%zmm4, %%zmm0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movd %mm3,%ebx...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2507,6 +2780,57 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovd %xmm2,%ebx...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_to_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpeqb %%xmm2, %%xmm2\n"
+                       put_insn(evex_vmovd_to_reg,
+                                "%{evex%} vmovd %%xmm2, %%ebx")
+                       :: );
+
+        set_insn(evex_vmovd_to_reg);
+#ifdef __x86_64__
+        regs.rbx = 0xbdbdbdbdbdbdbdbdUL;
+#else
+        regs.ebx = 0xbdbdbdbdUL;
+#endif
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(evex_vmovd_to_reg) ||
+             regs.ebx != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovd %ebx,%xmm1...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_from_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpgtb %%xmm1, %%xmm1\n"
+                       put_insn(evex_vmovd_from_reg,
+                                "%{evex%} vmovd %%ebx, %%xmm1")
+                       :: );
+
+        set_insn(evex_vmovd_from_reg);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(evex_vmovd_from_reg) )
+            goto fail;
+        asm ( "vmovd %1, %%xmm0\n\t"
+              "vpcmpeqd %%zmm1, %%zmm0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #ifdef __x86_64__
     printf("%-40s", "Testing movq %mm3,32(%ecx)...");
     if ( stack_exec && cpu_has_mmx )
@@ -2584,6 +2908,36 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovq %xmm11,32(%ecx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_mem2);
+
+        asm volatile ( "pcmpeqb %%xmm11, %%xmm11\n"
+#if 0 /* This may not work, as the assembler might pick opcode D6. */
+                       put_insn(evex_vmovq_to_mem2,
+                                "{evex} vmovq %%xmm11, 32(%0)")
+#else
+                       put_insn(evex_vmovq_to_mem2,
+                                ".byte 0x62, 0xf1, 0xfd, 0x08, 0x7e, 0x49, 0x04")
+#endif
+                       :: "c" (NULL) );
+
+        memset(res, 0xbd, 64);
+        set_insn(evex_vmovq_to_mem2);
+        regs.ecx = (unsigned long)res;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
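+        /* Qword at byte offset 32 becomes ~0; other bytes keep the 0xbd fill. */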
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_mem2) ||
+             *((long *)res + 4) + 1 ||
+             memcmp(res, res + 10, 24) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movq %mm3,%rbx...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2643,6 +2997,28 @@ int main(int argc, char **argv)
     }
     else
         printf("skipped\n");
+
+    printf("%-40s", "Testing vmovq %xmm22,%rbx...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpeqq %%xmm2, %%xmm2\n\t"
+                       "vmovq %%xmm2, %%xmm22\n"
+                       put_insn(evex_vmovq_to_reg, "vmovq %%xmm22, %%rbx")
+                       :: );
+
+        set_insn(evex_vmovq_to_reg);
+        regs.rbx = 0xbdbdbdbdbdbdbdbdUL;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_reg) ||
+             regs.rbx + 1 )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
 #endif
 
     printf("%-40s", "Testing maskmovq %mm4,%mm4...");
@@ -2812,6 +3188,32 @@ int main(int argc, char **argv)
             goto fail;
         printf("okay\n");
     }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovntdqa 64(%ecx),%zmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovntdqa);
+
+        asm volatile ( "vpxor %%xmm4, %%xmm4, %%xmm4\n"
+                       put_insn(evex_vmovntdqa, "vmovntdqa 64(%0), %%zmm4")
+                       :: "c" (NULL) );
+
+        set_insn(evex_vmovntdqa);
+        memset(res, 0x55, 192);
+        memset(res + 16, 0xff, 64);
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovntdqa) )
+            goto fail;
+        asm ( "vpbroadcastd %1, %%zmm2\n\t"
+              "vpcmpeqd %%zmm4, %%zmm2, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "0" (~0) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
     else
         printf("skipped\n");
 
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -210,6 +210,7 @@ int emul_test_get_fpu(
         if ( cpu_has_avx )
             break;
     case X86EMUL_FPU_opmask:
+    case X86EMUL_FPU_zmm:
         if ( cpu_has_avx512f )
             break;
     default:
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -266,6 +266,16 @@ static inline uint64_t xgetbv(uint32_t x
     (res.b & (1U << 30)) != 0; \
 })
 
+#define cpu_has_avx512vl ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 31)) != 0; \
+})
+
 int emul_test_cpuid(
     uint32_t leaf,
     uint32_t subleaf,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -243,9 +243,25 @@ enum simd_opsize {
 };
 typedef uint8_t simd_opsize_t;
 
+enum disp8scale {
+    /* Values 0 ... 4 are explicit sizes. */
+    d8s_bw = 5,
+    d8s_dq,
+    /*
+     * All further values must strictly be last and in the order
+     * given so that arithmetic on the values works.
+     */
+    d8s_vl,
+    d8s_vl_by_2,
+    d8s_vl_by_4,
+    d8s_vl_by_8,
+};
+typedef uint8_t disp8scale_t;
+
 static const struct twobyte_table {
     opcode_desc_t desc;
-    simd_opsize_t size;
+    simd_opsize_t size:4;
+    disp8scale_t d8s:4;
 } twobyte_table[256] = {
     [0x00] = { ModRM },
     [0x01] = { ImplicitOps|ModRM },
@@ -260,8 +276,8 @@ static const struct twobyte_table {
     [0x0d] = { ImplicitOps|ModRM },
     [0x0e] = { ImplicitOps },
     [0x0f] = { ModRM|SrcImmByte },
-    [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp },
-    [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp },
+    [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
+    [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
     [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
@@ -270,10 +286,10 @@ static const struct twobyte_table {
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
     [0x20 ... 0x21] = { DstMem|SrcImplicit|ModRM },
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
-    [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp },
-    [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp },
+    [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
+    [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp },
+    [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp },
     [0x30 ... 0x35] = { ImplicitOps },
@@ -292,8 +308,8 @@ static const struct twobyte_table {
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
-    [0x6e] = { DstImplicit|SrcMem|ModRM|Mov },
-    [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int },
+    [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq },
+    [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
@@ -301,8 +317,8 @@ static const struct twobyte_table {
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x7e] = { DstMem|SrcImplicit|ModRM|Mov },
-    [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int },
+    [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq },
+    [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x80 ... 0x8f] = { DstImplicit|SrcImm },
     [0x90 ... 0x9f] = { ByteOp|DstMem|SrcNone|ModRM|Mov },
     [0xa0 ... 0xa1] = { ImplicitOps|Mov },
@@ -344,14 +360,14 @@ static const struct twobyte_table {
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
-    [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int },
+    [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -406,6 +422,7 @@ static const struct ext0f38_table {
     uint8_t to_mem:1;
     uint8_t two_op:1;
     uint8_t vsib:1;
+    disp8scale_t d8s:4;
 } ext0f38_table[256] = {
     [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
@@ -418,7 +435,7 @@ static const struct ext0f38_table {
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
-    [0x2a] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_other },
     [0x2e ... 0x2f] = { .simd_size = simd_other, .to_mem = 1 },
@@ -656,6 +673,22 @@ union evex {
     };
 };
 
+#define EVEX_PFX_BYTES 4
+#define init_evex(stub) ({ \
+    uint8_t *buf_ = get_stub(stub); \
+    buf_[0] = 0x62; \
+    buf_ + EVEX_PFX_BYTES; \
+})
+
+#define copy_EVEX(ptr, evex) ({ \
+    if ( !mode_64bit() ) \
+        (evex).reg |= 8; \
+    (ptr)[1 - EVEX_PFX_BYTES] = (evex).raw[0]; \
+    (ptr)[2 - EVEX_PFX_BYTES] = (evex).raw[1]; \
+    (ptr)[3 - EVEX_PFX_BYTES] = (evex).raw[2]; \
+    container_of((ptr) + 1 - EVEX_PFX_BYTES, typeof(evex), raw[0]); \
+})
+
 #define rep_prefix()   (vex.pfx >= vex_f3)
 #define repe_prefix()  (vex.pfx == vex_f3)
 #define repne_prefix() (vex.pfx == vex_f2)
@@ -768,6 +801,7 @@ typedef union {
     uint64_t mmx;
     uint64_t __attribute__ ((aligned(16))) xmm[2];
     uint64_t __attribute__ ((aligned(32))) ymm[4];
+    uint64_t __attribute__ ((aligned(64))) zmm[8];
 } mmval_t;
 
 /*
@@ -1183,6 +1217,11 @@ static int _get_fpu(
 
     switch ( type )
     {
+    case X86EMUL_FPU_zmm:
+        if ( !(xcr0 & X86_XCR0_ZMM) || !(xcr0 & X86_XCR0_HI_ZMM) ||
+             !(xcr0 & X86_XCR0_OPMASK) )
+            return X86EMUL_UNHANDLEABLE;
+        /* fall through */
     case X86EMUL_FPU_ymm:
         if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_YMM) )
             return X86EMUL_UNHANDLEABLE;
@@ -1777,6 +1816,7 @@ static bool vcpu_has(
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
+#define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -2150,6 +2190,62 @@ static unsigned long *decode_vex_gpr(
     return decode_gpr(regs, ~vex_reg & (mode_64bit() ? 0xf : 7));
 }
 
+static unsigned int decode_disp8scale(enum disp8scale scale,
+                                      const struct x86_emulate_state *state)
+{
+    switch ( scale )
+    {
+    case d8s_bw:
+        return state->evex.w;
+
+    default:
+        if ( scale < d8s_vl )
+            return scale;
+        if ( state->evex.br )
+        {
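+            /* Broadcast: a single 4- or 8-byte element is accessed. */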
+    case d8s_dq:
+            return 2 + state->evex.w;
+        }
+        break;
+    }
+
+    switch ( state->simd_size )
+    {
+    case simd_any_fp:
+    case simd_single_fp:
+        if ( !(state->evex.pfx & VEX_PREFIX_SCALAR_MASK) )
+            break;
+        /* fall through */
+    case simd_scalar_opc:
+    case simd_scalar_vexw:
+        return 2 + state->evex.w;
+
+    case simd_128:
+        /* These should have an explicit size specified. */
+        ASSERT_UNREACHABLE();
+        return 4;
+
+    default:
+        break;
+    }
+
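+    /* d8s_vl..d8s_vl_by_8: log2(16 << lr) less the _by_N divisor. */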
+    return 4 + state->evex.lr - (scale - d8s_vl);
+}
+
+#define avx512_vlen_check(lig) do { \
+    switch ( evex.lr ) \
+    { \
+    default: \
+        generate_exception(EXC_UD); \
+    case 2: \
+        break; \
+    case 0: case 1: \
+        if ( !(lig) ) \
+            host_and_vcpu_must_have(avx512vl); \
+        break; \
+    } \
+} while ( false )
+
 static bool is_aligned(enum x86_segment seg, unsigned long offs,
                        unsigned int size, struct x86_emulate_ctxt *ctxt,
                        const struct x86_emulate_ops *ops)
@@ -2399,6 +2495,7 @@ x86_decode_twobyte(
         if ( vex.pfx == vex_f3 ) /* movq xmm/m64,xmm */
         {
     case X86EMUL_OPC_VEX_F3(0, 0x7e): /* vmovq xmm/m64,xmm */
+    case X86EMUL_OPC_EVEX_F3(0, 0x7e): /* vmovq xmm/m64,xmm */
             state->desc = DstImplicit | SrcMem | TwoOp;
             state->simd_size = simd_other;
             /* Avoid the state->desc clobbering of TwoOp below. */
@@ -2834,6 +2931,8 @@ x86_decode(
 
     if ( d & ModRM )
     {
+        unsigned int disp8scale = 0;
+
         d &= ~ModRM;
 #undef ModRM /* Only its aliases are valid to use from here on. */
         modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3);
@@ -2876,6 +2975,9 @@ x86_decode(
             break;
 
         case ext_0f:
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(twobyte_table[b].d8s, state);
+
             switch ( b )
             {
             case 0x20: /* mov cr,reg */
@@ -2900,6 +3002,8 @@ x86_decode(
             if ( ext0f38_table[b].vsib )
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
             break;
 
         case ext_8f09:
@@ -2968,7 +3072,7 @@ x86_decode(
                     ea.mem.off = insn_fetch_type(int16_t);
                 break;
             case 1:
-                ea.mem.off += insn_fetch_type(int8_t);
+                ea.mem.off += insn_fetch_type(int8_t) << disp8scale;
                 break;
             case 2:
                 ea.mem.off += insn_fetch_type(int16_t);
@@ -3027,7 +3131,7 @@ x86_decode(
                 pc_rel = mode_64bit();
                 break;
             case 1:
-                ea.mem.off += insn_fetch_type(int8_t);
+                ea.mem.off += insn_fetch_type(int8_t) << disp8scale;
                 break;
             case 2:
                 ea.mem.off += insn_fetch_type(int32_t);
@@ -3228,10 +3332,11 @@ x86_emulate(
     struct x86_emulate_state state;
     int rc;
     uint8_t b, d, *opc = NULL;
-    unsigned int first_byte = 0, insn_bytes = 0;
+    unsigned int first_byte = 0, elem_bytes, insn_bytes = 0;
+    uint64_t op_mask = ~0ULL;
     bool singlestep = (_regs.eflags & X86_EFLAGS_TF) &&
 	    !is_branch_step(ctxt, ops);
-    bool sfence = false;
+    bool sfence = false, fault_suppression = false;
     struct operand src = { .reg = PTR_POISON };
     struct operand dst = { .reg = PTR_POISON };
     unsigned long cr4;
@@ -3272,6 +3377,7 @@ x86_emulate(
     b = ctxt->opcode;
     d = state.desc;
 #define state (&state)
+    elem_bytes = 4 << evex.w;
 
     generate_exception_if(state->not_64bit && mode_64bit(), EXC_UD);
 
@@ -3346,6 +3452,28 @@ x86_emulate(
         break;
     }
 
+    /* With a memory operand, fetch the mask register in use (if any). */
+    if ( ea.type == OP_MEM && evex.opmsk )
+    {
+        uint8_t *stb = get_stub(stub);
+
+        /* KMOV{W,Q} %k<n>, (%rax) */
+        stb[0] = 0xc4;
+        stb[1] = 0xe1;
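+        /* VEX.W set: KMOVQ (64-bit mask, AVX512BW); clear: KMOVW. */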
+        stb[2] = cpu_has_avx512bw ? 0xf8 : 0x78;
+        stb[3] = 0x91;
+        stb[4] = evex.opmsk << 3;
+        insn_bytes = 5;
+        stb[5] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+
+        insn_bytes = 0;
+        put_stub(stub);
+
+        fault_suppression = true;
+    }
+
     /* Decode (but don't fetch) the destination operand: register or memory. */
     switch ( d & DstMask )
     {
@@ -5708,6 +5836,41 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 2;
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2b): /* vmovntp{s,d} [xyz]mm,mem */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk, EXC_UD);
+        sfence = true;
+        fault_suppression = false;
+        /* fall through */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x10): /* vmovup{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x10): /* vmovs{s,d} mem,xmm{k} */
+                                            /* vmovs{s,d} xmm,xmm,xmm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x11): /* vmovup{s,d} [xyz]mm,[xyz]mm/mem{k} */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x11): /* vmovs{s,d} xmm,mem{k} */
+                                            /* vmovs{s,d} xmm,xmm,xmm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x28): /* vmovap{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x29): /* vmovap{s,d} [xyz]mm,[xyz]mm/mem{k} */
+        /* vmovs{s,d} to/from memory have only two operands. */
+        if ( (b & ~1) == 0x10 && ea.type == OP_MEM )
+            d |= TwoOp;
+        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+    simd_zmm:
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            evex.b = 1;
+            opc[1] &= 0x38;
+        }
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        break;
+
     case X86EMUL_OPC_66(0x0f, 0x12):       /* movlpd m64,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x12):   /* vmovlpd m64,xmm,xmm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0x13):     /* movlp{s,d} xmm,m64 */
@@ -6348,6 +6511,41 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7e): /* vmov{d,q} xmm,r/m */
+        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert memory/GPR operand to (%rAX). */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0x38;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
+        dst.val = src.val;
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7e): /* vmovq xmm/m64,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
+        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.br,
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        d |= TwoOp;
+        op_bytes = 8;
+        goto simd_zmm;
+
     case X86EMUL_OPC_66(0x0f, 0xe7):     /* movntdq xmm,m128 */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe7): /* vmovntdq {x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
@@ -6368,6 +6566,30 @@ x86_emulate(
             goto simd_0f_avx;
         goto simd_0f_sse2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe7): /* vmovntdq [xyz]mm,mem */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk || evex.w,
+                              EXC_UD);
+        sfence = true;
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6f): /* vmovdqa{32,64} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x6f): /* vmovdqu{32,64} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7f): /* vmovdqa{32,64} [xyz]mm,[xyz]mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7f): /* vmovdqu{32,64} [xyz]mm,[xyz]mm/mem{k} */
+    vmovdqa:
+        generate_exception_if(evex.br, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x6f): /* vmovdqu{8,16} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x7f): /* vmovdqu{8,16} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512bw);
+        elem_bytes = 1 << evex.w;
+        goto vmovdqa;
+
     case X86EMUL_OPC_VEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
         generate_exception_if(vex.l, EXC_UD);
         d |= TwoOp;
@@ -7734,6 +7956,15 @@ x86_emulate(
         }
         goto movdqa;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2a): /* vmovntdqa mem,[xyz]mm */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk || evex.w,
+                              EXC_UD);
+        /* Ignore the non-temporal hint for now, using vmovdqa32 instead. */
+        asm volatile ( "mfence" ::: "memory" );
+        b = 0x6f;
+        evex.opcx = vex_0f;
+        goto vmovdqa;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2c): /* vmaskmovps mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2d): /* vmaskmovpd mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2e): /* vmaskmovps {x,y}mm,{x,y}mm,mem */
@@ -8787,17 +9018,27 @@ x86_emulate(
     else if ( state->simd_size )
     {
         generate_exception_if(!op_bytes, EXC_UD);
-        generate_exception_if(vex.opcx && (d & TwoOp) && vex.reg != 0xf,
+        generate_exception_if((vex.opcx && (d & TwoOp) &&
+                               (vex.reg != 0xf || (evex_encoded() && !evex.RX))),
                               EXC_UD);
 
         if ( !opc )
             BUG();
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+        {
+            opc[insn_bytes - EVEX_PFX_BYTES] = 0xc3;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            opc[insn_bytes - PFX_BYTES] = 0xc3;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
 
         if ( ea.type == OP_MEM )
         {
             uint32_t mxcsr = 0;
+            uint64_t full = 0;
 
             if ( op_bytes < 16 ||
                  (vex.opcx
@@ -8819,6 +9060,44 @@ x86_emulate(
                                   !is_aligned(ea.mem.seg, ea.mem.off, op_bytes,
                                               ctxt, ops),
                                   EXC_GP, 0);
+
+            if ( evex.br )
+            {
+                ASSERT((d & DstMask) != DstMem);
+                op_bytes = elem_bytes;
+            }
+            if ( evex.opmsk )
+            {
+                ASSERT(!(op_bytes % elem_bytes));
+                full = ~0ULL >> (64 - op_bytes / elem_bytes);
+                op_mask &= full;
+            }
+            if ( fault_suppression )
+            {
+                if ( !op_mask )
+                    goto simd_no_mem;
+                if ( !evex.br )
+                {
+                    first_byte = __builtin_ctzll(op_mask);
+                    op_mask >>= first_byte;
+                    full >>= first_byte;
+                    first_byte *= elem_bytes;
+                    op_bytes = (64 - __builtin_clzll(op_mask)) * elem_bytes;
+                }
+            }
+            /*
+             * Independent of fault suppression we may need to read (parts
+             * of) the memory operand, so that merging can be done without
+             * splitting the write below into multiple ones. Note that the
+             * EVEX.z check isn't strictly needed, as there currently are no
+             * insns allowing zeroing (rather than merging) on memory writes
+             * (we raise #UD during DstMem processing far above in that
+             * case); conceptually, though, the read would then be
+             * unnecessary.
+             */
+            if ( evex.opmsk && !evex.z && (d & DstMask) == DstMem &&
+                 op_mask != full )
+                d = (d & ~SrcMask) | SrcMem;
+
             switch ( d & SrcMask )
             {
             case SrcMem:
@@ -8865,7 +9144,10 @@ x86_emulate(
             }
         }
         else
+        {
+        simd_no_mem:
             dst.type = OP_NONE;
+        }
 
         /* {,v}maskmov{q,dqu}, as an exception, uses rDI. */
         if ( likely((ctxt->opcode & ~(X86EMUL_OPC_PFX_MASK |
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -171,6 +171,7 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_xmm, /* SSE instruction set (%xmm0-%xmm7/15) */
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
     X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
+    X86EMUL_FPU_zmm, /* AVX512 instruction set (%zmm0-%zmm7/31) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -105,6 +105,7 @@
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
+#define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)


* [PATCH v3 06/34] x86emul: test for correct EVEX Disp8 scaling
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (4 preceding siblings ...)
  2018-09-18 11:56   ` [PATCH v3 05/34] x86emul: support basic AVX512 moves Jan Beulich
@ 2018-09-18 11:57   ` Jan Beulich
  2018-09-18 11:57   ` [PATCH v3 07/34] x86emul: use AVX512 logic for emulating V{,P}MASKMOV* Jan Beulich
                     ` (27 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:57 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Besides the already existing tests (which are going to be extended once
respective ISA extension support is complete), let's also ensure for
every individual insn that its Disp8 scaling (and memory access width)
is correct.
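
To make the property being checked concrete: an EVEX-encoded insn's
8-bit displacement is implicitly scaled by the width N of the memory
access it performs. A minimal sketch of the arithmetic (illustrative
names, not the actual harness code; assumes a full-vector-length
access):

  #include <stdint.h>

  /* evex_lr is 0/1/2 for 128/256/512-bit vectors, i.e. N = 16 << evex_lr. */
  static int32_t disp8_offset(int8_t disp8, unsigned int evex_lr)
  {
      /* Sign-extend first, then scale (the emulator shifts instead). */
      return (int32_t)disp8 * (16 << evex_lr);
  }

For a 512-bit vmovups, say, a Disp8 of 1 hence addresses byte offset 64,
which is exactly the access the harness expects to observe.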

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -139,7 +139,7 @@ $(addsuffix .h,$(SIMD) $(FMA) $(SG)): si
 
 xop.h: simd-fma.c
 
-$(TARGET): x86-emulate.o test_x86_emulator.o wrappers.o
+$(TARGET): x86-emulate.o test_x86_emulator.o evex-disp8.o wrappers.o
 	$(HOSTCC) $(HOSTCFLAGS) -o $@ $^
 
 .PHONY: clean
@@ -166,7 +166,7 @@ x86.h := $(addprefix $(XEN_ROOT)/tools/i
                      x86-vendors.h x86-defns.h msr-index.h)
 x86_emulate.h := x86-emulate.h x86_emulate/x86_emulate.h $(x86.h)
 
-x86-emulate.o test_x86_emulator.o wrappers.o: %.o: %.c $(x86_emulate.h)
+x86-emulate.o test_x86_emulator.o evex-disp8.o wrappers.o: %.o: %.c $(x86_emulate.h)
 	$(HOSTCC) $(HOSTCFLAGS) -c -g -o $@ $<
 
 x86-emulate.o: x86_emulate/x86_emulate.c
--- /dev/null
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -0,0 +1,437 @@
+#include <stdarg.h>
+#include <stdio.h>
+
+#include "x86-emulate.h"
+
+struct test {
+    const char *mnemonic;
+    unsigned int opc:8;
+    unsigned int spc:2;
+    unsigned int pfx:2;
+    unsigned int vsz:3;
+    unsigned int esz:4;
+    unsigned int scale:1;
+    unsigned int ext:3;
+};
+
+enum spc {
+    SPC_invalid,
+    SPC_0f,
+    SPC_0f38,
+    SPC_0f3a,
+};
+
+enum pfx {
+    PFX_,
+    PFX_66,
+    PFX_f3,
+    PFX_f2
+};
+
+enum vl {
+    VL_128,
+    VL_256,
+    VL_512,
+};
+
+enum scale {
+    SC_vl,
+    SC_el,
+};
+
+enum vsz {
+    VSZ_vl,
+    VSZ_vl_2, /* VL / 2 */
+    VSZ_vl_4, /* VL / 4 */
+    VSZ_vl_8, /* VL / 8 */
+    /* "no broadcast" implied from here on. */
+    VSZ_el,
+    VSZ_el_2, /* EL * 2 */
+    VSZ_el_4, /* EL * 4 */
+    VSZ_el_8, /* EL * 8 */
+};
+
+enum esz {
+    ESZ_d,
+    ESZ_q,
+    ESZ_dq,
+    ESZ_sd,
+    ESZ_d_nb,
+    ESZ_q_nb,
+    /* "no broadcast" implied from here on. */
+    ESZ_b,
+    ESZ_w,
+    ESZ_bw,
+};
+
+#ifndef __i386__
+# define ESZ_dq64 ESZ_dq
+#else
+# define ESZ_dq64 ESZ_d
+#endif
+
+#define INSNX(m, p, sp, o, e, vs, es, sc) { \
+    .mnemonic = #m, .opc = 0x##o, .spc = SPC_##sp, .pfx = PFX_##p, \
+    .vsz = VSZ_##vs, .esz = ESZ_##es, .scale = SC_##sc, .ext = 0##e \
+}
+#define INSN(m, p, sp, o, vs, es, sc) INSNX(m, p, sp, o, 0, vs, es, sc)
+#define INSN_PFP(m, sp, o) \
+    INSN(m##pd, 66, sp, o, vl, q, vl), \
+    INSN(m##ps,   , sp, o, vl, d, vl)
+#define INSN_PFP_NB(m, sp, o) \
+    INSN(m##pd, 66, sp, o, vl, q_nb, vl), \
+    INSN(m##ps,   , sp, o, vl, d_nb, vl)
+#define INSN_SFP(m, sp, o) \
+    INSN(m##sd, f2, sp, o, el, q, el), \
+    INSN(m##ss, f3, sp, o, el, d, el)
+
+#define INSN_FP(m, sp, o) \
+    INSN_PFP(m, sp, o), \
+    INSN_SFP(m, sp, o)
+
+static const struct test avx512f_all[] = {
+    INSN_SFP(mov,            0f, 10),
+    INSN_SFP(mov,            0f, 11),
+    INSN_PFP_NB(mova,        0f, 28),
+    INSN_PFP_NB(mova,        0f, 29),
+    INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
+    INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
+    INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
+    INSN(movdqa64,     66,   0f, 7f,    vl,   q_nb, vl),
+    INSN(movdqu32,     f3,   0f, 6f,    vl,   d_nb, vl),
+    INSN(movdqu32,     f3,   0f, 7f,    vl,   d_nb, vl),
+    INSN(movdqu64,     f3,   0f, 6f,    vl,   q_nb, vl),
+    INSN(movdqu64,     f3,   0f, 7f,    vl,   q_nb, vl),
+    INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
+    INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
+    INSN_PFP_NB(movnt,       0f, 2b),
+    INSN_PFP_NB(movu,        0f, 10),
+    INSN_PFP_NB(movu,        0f, 11),
+};
+
+static const struct test avx512f_128[] = {
+    INSN(mov,       66,   0f, 6e, el, dq64, el),
+    INSN(mov,       66,   0f, 7e, el, dq64, el),
+    INSN(movq,      f3,   0f, 7e, el,    q, el),
+    INSN(movq,      66,   0f, d6, el,    q, el),
+};
+
+static const struct test avx512bw_all[] = {
+    INSN(movdqu8,     f2,   0f, 6f,    vl,   b, vl),
+    INSN(movdqu8,     f2,   0f, 7f,    vl,   b, vl),
+    INSN(movdqu16,    f2,   0f, 6f,    vl,   w, vl),
+    INSN(movdqu16,    f2,   0f, 7f,    vl,   w, vl),
+};
+
+static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
+static const unsigned char vl_128[] = { VL_128 };
+
+/*
+ * This table indicates the presence of an immediate (byte) for opcode
+ * space 0f major opcodes. It is indexed by the high nibble of the major
+ * opcode byte, with each table element then bit-indexed by the low nibble.
+ */
+static const uint16_t imm0f[16] = {
+    [0x7] = (1 << 0x0) /* vpshuf* */ |
+            (1 << 0x1) /* vps{ll,ra,rl}w */ |
+            (1 << 0x2) /* vps{l,r}ld, vp{rol,ror,sra}{d,q} */ |
+            (1 << 0x3) /* vps{l,r}l{,d}q */,
+    [0xc] = (1 << 0x2) /* vcmp{p,s}{d,s} */ |
+            (1 << 0x4) /* vpinsrw */ |
+            (1 << 0x5) /* vpextrw */ |
+            (1 << 0x6) /* vshufp{d,s} */,
+};
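+/*
+ * Example: vcmpps is major opcode 0xc2, i.e. high nibble 0xc and low
+ * nibble 0x2, so the (1 << 0x2) bit in imm0f[0xc] being set means an
+ * imm8 follows.
+ */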
+
+static struct x86_emulate_ops emulops;
+
+static unsigned int accessed[3 * 64];
+
+static bool record_access(enum x86_segment seg, unsigned long offset,
+                          unsigned int bytes)
+{
+    while ( bytes-- )
+    {
+        if ( offset >= ARRAY_SIZE(accessed) )
+            return false;
+        ++accessed[offset++];
+    }
+
+    return true;
+}
+
+static int read(enum x86_segment seg, unsigned long offset, void *p_data,
+                unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+    if ( !record_access(seg, offset, bytes) )
+        return X86EMUL_UNHANDLEABLE;
+    memset(p_data, 0, bytes);
+    return X86EMUL_OKAY;
+}
+
+static int write(enum x86_segment seg, unsigned long offset, void *p_data,
+                 unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+    if ( !record_access(seg, offset, bytes) )
+        return X86EMUL_UNHANDLEABLE;
+    return X86EMUL_OKAY;
+}
+
+static void test_one(const struct test *test, enum vl vl,
+                     unsigned char *instr, struct x86_emulate_ctxt *ctxt)
+{
+    unsigned int vsz, esz, i;
+    int rc;
+    bool sg = strstr(test->mnemonic, "gather") ||
+              strstr(test->mnemonic, "scatter");
+    bool imm = test->spc == SPC_0f3a ||
+               (test->spc == SPC_0f &&
+                (imm0f[test->opc >> 4] & (1 << (test->opc & 0xf))));
+    union evex {
+        uint8_t raw[3];
+        struct {
+            uint8_t opcx:2;
+            uint8_t mbz:2;
+            uint8_t R:1;
+            uint8_t b:1;
+            uint8_t x:1;
+            uint8_t r:1;
+            uint8_t pfx:2;
+            uint8_t mbs:1;
+            uint8_t reg:4;
+            uint8_t w:1;
+            uint8_t opmsk:3;
+            uint8_t RX:1;
+            uint8_t bcst:1;
+            uint8_t lr:2;
+            uint8_t z:1;
+        };
+    } evex = {
+        .opcx = test->spc, .pfx = test->pfx, .lr = vl,
+        .R = 1, .b = 1, .x = 1, .r = 1, .mbs = 1,
+        .reg = 0xf, .RX = 1, .opmsk = sg,
+    };
+
+    switch ( test->esz )
+    {
+    case ESZ_b:
+        esz = 1;
+        break;
+
+    case ESZ_w:
+        esz = 2;
+        evex.w = 1;
+        break;
+
+    case ESZ_d: case ESZ_d_nb:
+        esz = 4;
+        break;
+
+    case ESZ_q: case ESZ_q_nb:
+        esz = 8;
+        evex.w = 1;
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    switch ( test->vsz )
+    {
+    case VSZ_vl:
+        vsz = 16 << vl;
+        break;
+
+    case VSZ_vl_2:
+        vsz = 8 << vl;
+        break;
+
+    case VSZ_vl_4:
+        vsz = 4 << vl;
+        break;
+
+    case VSZ_vl_8:
+        vsz = 2 << vl;
+        break;
+
+    case VSZ_el:
+        vsz = esz;
+        break;
+
+    case VSZ_el_2:
+        vsz = esz * 2;
+        break;
+
+    case VSZ_el_4:
+        vsz = esz * 4;
+        break;
+
+    case VSZ_el_8:
+        vsz = esz * 8;
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    /*
+     * Note: SIB addressing is used here, such that S/G insns can be handled
+     * without extra conditionals.
+     */
+    instr[0] = 0x62;
+    instr[1] = evex.raw[0];
+    instr[2] = evex.raw[1];
+    instr[3] = evex.raw[2];
+    instr[4] = test->opc;
+    instr[5] = 0x44 | (test->ext << 3); /* ModR/M */
+    instr[6] = 0x12; /* SIB: base rDX, index none / xMM4 */
+    instr[7] = 1; /* Disp8 */
+    instr[8] = 0; /* immediate, if any */
+
+    asm volatile ( "kxnorw %k1, %k1, %k1" );
+    asm volatile ( "vxorps %xmm4, %xmm4, %xmm4" );
+
+    ctxt->regs->eip = (unsigned long)&instr[0];
+    ctxt->regs->edx = 0;
+    memset(accessed, 0, sizeof(accessed));
+
+    rc = x86_emulate(ctxt, &emulops);
+    if ( rc != X86EMUL_OKAY ||
+         (ctxt->regs->eip != (unsigned long)&instr[8 + imm]) )
+        goto fail;
+
+    for ( i = 0; i < vsz; ++i )
+         if ( accessed[i] )
+             goto fail;
+    for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
+         if ( accessed[i] != (sg ? vsz / esz : 1) )
+             goto fail;
+    for ( ; i < ARRAY_SIZE(accessed); ++i )
+         if ( accessed[i] )
+             goto fail;
+
+    /* Also check the broadcast case, if available. */
+    if ( test->vsz >= VSZ_el || test->scale != SC_vl )
+        return;
+
+    switch ( test->esz )
+    {
+    case ESZ_d_nb: case ESZ_q_nb:
+    case ESZ_b: case ESZ_w: case ESZ_bw:
+        return;
+
+    case ESZ_d: case ESZ_q:
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    evex.bcst = 1;
+    instr[3] = evex.raw[2];
+
+    ctxt->regs->eip = (unsigned long)&instr[0];
+    memset(accessed, 0, sizeof(accessed));
+
+    rc = x86_emulate(ctxt, &emulops);
+    if ( rc != X86EMUL_OKAY ||
+         (ctxt->regs->eip != (unsigned long)&instr[8 + imm]) )
+        goto fail;
+
+    for ( i = 0; i < esz; ++i )
+         if ( accessed[i] )
+             goto fail;
+    for ( ; i < esz * 2; ++i )
+         if ( accessed[i] != 1 )
+             goto fail;
+    for ( ; i < ARRAY_SIZE(accessed); ++i )
+         if ( accessed[i] )
+             goto fail;
+
+    return;
+
+ fail:
+    printf("failed (v%s%s %u-bit)\n", test->mnemonic,
+           evex.bcst ? "/bcst" : "", 128 << vl);
+    exit(1);
+}
+
+static void test_pair(const struct test *tmpl, enum vl vl,
+                      enum esz esz1, const char *suffix1,
+                      enum esz esz2, const char *suffix2,
+                      unsigned char *instr, struct x86_emulate_ctxt *ctxt)
+{
+    struct test test = *tmpl;
+    char mnemonic[24];
+
+    test.esz = esz1;
+    snprintf(mnemonic, ARRAY_SIZE(mnemonic), "%s%s", tmpl->mnemonic, suffix1);
+    test.mnemonic = mnemonic;
+    test_one(&test, vl, instr, ctxt);
+
+    test.esz = esz2;
+    snprintf(mnemonic, ARRAY_SIZE(mnemonic), "%s%s", tmpl->mnemonic, suffix2);
+    test.mnemonic = mnemonic;
+    test_one(&test, vl, instr, ctxt);
+}
+
+static void test_group(const struct test tests[], unsigned int nr_test,
+                       const unsigned char vl[], unsigned int nr_vl,
+                       void *instr, struct x86_emulate_ctxt *ctxt)
+{
+    unsigned int i, j;
+
+    for ( i = 0; i < nr_test; ++i )
+    {
+        for ( j = 0; j < nr_vl; ++j )
+        {
+            if ( vl[0] == VL_512 && vl[j] != VL_512 && !cpu_has_avx512vl )
+                continue;
+
+            switch ( tests[i].esz )
+            {
+            default:
+                test_one(&tests[i], vl[j], instr, ctxt);
+                break;
+
+            case ESZ_bw:
+                test_pair(&tests[i], vl[j], ESZ_b, "b", ESZ_w, "w",
+                          instr, ctxt);
+                break;
+
+            case ESZ_dq:
+                test_pair(&tests[i], vl[j], ESZ_d, "d", ESZ_q, "q",
+                          instr, ctxt);
+                break;
+
+            case ESZ_sd:
+                test_pair(&tests[i], vl[j],
+                          ESZ_d, tests[i].vsz < VSZ_el ? "ps" : "ss",
+                          ESZ_q, tests[i].vsz < VSZ_el ? "pd" : "sd",
+                          instr, ctxt);
+                break;
+            }
+        }
+    }
+}
+
+void evex_disp8_test(void *instr, struct x86_emulate_ctxt *ctxt,
+                     const struct x86_emulate_ops *ops)
+{
+    emulops = *ops;
+    emulops.read = read;
+    emulops.write = write;
+
+#define RUN(feat, vl) do { \
+    if ( cpu_has_##feat ) \
+    { \
+        printf("%-40s", "Testing " #feat "/" #vl " disp8 handling..."); \
+        test_group(feat ## _ ## vl, ARRAY_SIZE(feat ## _ ## vl), \
+                   vl_ ## vl, ARRAY_SIZE(vl_ ## vl), instr, ctxt); \
+        printf("okay\n"); \
+    } \
+} while ( false )
+
+    RUN(avx512f, all);
+    RUN(avx512f, 128);
+    RUN(avx512bw, all);
+}
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3795,6 +3795,9 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    if ( stack_exec )
+        evex_disp8_test(instr, &ctxt, &emulops);
+
     for ( j = 0; j < ARRAY_SIZE(blobs); j++ )
     {
         if ( blobs[j].check_cpu && !blobs[j].check_cpu() )
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -71,6 +71,9 @@ WRAP(puts);
 
 #include "x86_emulate/x86_emulate.h"
 
+void evex_disp8_test(void *instr, struct x86_emulate_ctxt *ctxt,
+                     const struct x86_emulate_ops *ops);
+
 static inline uint64_t xgetbv(uint32_t xcr)
 {
     uint32_t lo, hi;


* [PATCH v3 07/34] x86emul: use AVX512 logic for emulating V{,P}MASKMOV*
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (5 preceding siblings ...)
  2018-09-18 11:57   ` [PATCH v3 06/34] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
@ 2018-09-18 11:57   ` Jan Beulich
  2018-09-18 11:58   ` [PATCH v3 08/34] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
                     ` (26 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:57 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

The more generic AVX512 implementation allows quite a bit of insn-
specific code to be dropped/shared.
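
The underlying technique, now shared with the AVX512 path: read the
mask bits via vmovmskp{s,d}, then narrow the memory access to the
smallest byte range covering all set mask bits, preserving correct
faulting behavior. A minimal sketch (illustrative names, not the exact
emulator code; assumes a non-zero mask, since an all-clear mask
suppresses the access altogether):

  #include <stdint.h>

  static void narrow_access(uint64_t op_mask, unsigned int elem_bytes,
                            unsigned int *first_byte, unsigned int *op_bytes)
  {
      unsigned int first = __builtin_ctzll(op_mask); /* lowest set element */

      op_mask >>= first;
      *first_byte = first * elem_bytes;
      /* Span from lowest through highest set bit, in bytes. */
      *op_bytes = (64 - __builtin_clzll(op_mask)) * elem_bytes;
  }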

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -437,8 +437,8 @@ static const struct ext0f38_table {
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
-    [0x2c ... 0x2d] = { .simd_size = simd_other },
-    [0x2e ... 0x2f] = { .simd_size = simd_other, .to_mem = 1 },
+    [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
+    [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int },
     [0x40] = { .simd_size = simd_packed_int },
@@ -447,8 +447,8 @@ static const struct ext0f38_table {
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
-    [0x8c] = { .simd_size = simd_other },
-    [0x8e] = { .simd_size = simd_other, .to_mem = 1 },
+    [0x8c] = { .simd_size = simd_packed_int },
+    [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp },
     [0x99] = { .simd_size = simd_scalar_vexw },
@@ -7974,6 +7974,8 @@ x86_emulate(
 
         generate_exception_if(ea.type != OP_MEM || vex.w, EXC_UD);
         host_and_vcpu_must_have(avx);
+        elem_bytes = 4 << (b & 1);
+    vmaskmov:
         get_fpu(X86EMUL_FPU_ymm);
 
         /*
@@ -7988,7 +7990,7 @@ x86_emulate(
         opc = init_prefixes(stub);
         pvex = copy_VEX(opc, vex);
         pvex->opcx = vex_0f;
-        if ( !(b & 1) )
+        if ( elem_bytes == 4 )
             pvex->pfx = vex_none;
         opc[0] = 0x50; /* vmovmskp{s,d} */
         /* Use %rax as GPR destination and VEX.vvvv as source. */
@@ -8001,21 +8003,9 @@ x86_emulate(
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
         put_stub(stub);
 
-        if ( !ea.val )
-            goto complete_insn;
-
-        op_bytes = 4 << (b & 1);
-        first_byte = __builtin_ctz(ea.val);
-        ea.val >>= first_byte;
-        first_byte *= op_bytes;
-        op_bytes *= 32 - __builtin_clz(ea.val);
-
-        /*
-         * Even for the memory write variant a memory read is needed, unless
-         * all set mask bits are contiguous.
-         */
-        if ( ea.val & (ea.val + 1) )
-            d = (d & ~SrcMask) | SrcMem;
+        evex.opmsk = 1; /* fake */
+        op_mask = ea.val;
+        fault_suppression = true;
 
         opc = init_prefixes(stub);
         opc[0] = b;
@@ -8066,63 +8056,10 @@ x86_emulate(
 
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
-    {
-        typeof(vex) *pvex;
-        unsigned int mask = vex.w ? 0x80808080U : 0x88888888U;
-
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
         host_and_vcpu_must_have(avx2);
-        get_fpu(X86EMUL_FPU_ymm);
-
-        /*
-         * While we can't reasonably provide fully correct behavior here
-         * (in particular, for writes, avoiding the memory read in anticipation
-         * of all elements in the range eventually being written), we can (and
-         * should) still limit the memory access to the smallest possible range
-         * (suppressing it altogether if all mask bits are clear), to provide
-         * correct faulting behavior. Read the mask bits via vmovmskp{s,d}
-         * for that purpose.
-         */
-        opc = init_prefixes(stub);
-        pvex = copy_VEX(opc, vex);
-        pvex->opcx = vex_0f;
-        opc[0] = 0xd7; /* vpmovmskb */
-        /* Use %rax as GPR destination and VEX.vvvv as source. */
-        pvex->r = 1;
-        pvex->b = !mode_64bit() || (vex.reg >> 3);
-        opc[1] = 0xc0 | (~vex.reg & 7);
-        pvex->reg = 0xf;
-        opc[2] = 0xc3;
-
-        invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
-        put_stub(stub);
-
-        /* Convert byte granular result to dword/qword granularity. */
-        ea.val &= mask;
-        if ( !ea.val )
-            goto complete_insn;
-
-        first_byte = __builtin_ctz(ea.val) & ~((4 << vex.w) - 1);
-        ea.val >>= first_byte;
-        op_bytes = 32 - __builtin_clz(ea.val);
-
-        /*
-         * Even for the memory write variant a memory read is needed, unless
-         * all set mask bits are contiguous.
-         */
-        if ( ea.val & (ea.val + ~mask + 1) )
-            d = (d & ~SrcMask) | SrcMem;
-
-        opc = init_prefixes(stub);
-        opc[0] = b;
-        /* Convert memory operand to (%rAX). */
-        rex_prefix &= ~REX_B;
-        vex.b = 1;
-        opc[1] = modrm & 0x38;
-        insn_bytes = PFX_BYTES + 2;
-
-        break;
-    }
+        elem_bytes = 4 << vex.w;
+        goto vmaskmov;
 
     case X86EMUL_OPC_VEX_66(0x0f38, 0x90): /* vpgatherd{d,q} {x,y}mm,mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x91): /* vpgatherq{d,q} {x,y}mm,mem,{x,y}mm */


* [PATCH v3 08/34] x86emul: support AVX512F legacy-equivalent arithmetic FP insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (6 preceding siblings ...)
  2018-09-18 11:57   ` [PATCH v3 07/34] x86emul: use AVX512 logic for emulating V{,P}MASKMOV* Jan Beulich
@ 2018-09-18 11:58   ` Jan Beulich
  2018-09-18 11:59   ` [PATCH v3 09/34] x86emul: support AVX512DQ logic " Jan Beulich
                     ` (25 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:58 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -90,6 +90,10 @@ enum esz {
     INSN_SFP(m, sp, o)
 
 static const struct test avx512f_all[] = {
+    INSN_FP(add,             0f, 58),
+    INSN_FP(div,             0f, 5e),
+    INSN_FP(max,             0f, 5f),
+    INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
     INSN_SFP(mov,            0f, 11),
     INSN_PFP_NB(mova,        0f, 28),
@@ -107,6 +111,9 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movnt,       0f, 2b),
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
+    INSN_FP(mul,             0f, 59),
+    INSN_FP(sqrt,            0f, 51),
+    INSN_FP(sub,             0f, 5c),
 };
 
 static const struct test avx512f_128[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -298,12 +298,12 @@ static const struct twobyte_table {
     [0x3a] = { DstReg|SrcImmByte|ModRM },
     [0x40 ... 0x4f] = { DstReg|SrcMem|ModRM|Mov },
     [0x50] = { DstReg|SrcImplicit|ModRM|Mov },
-    [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp },
+    [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp, d8s_vl },
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
-    [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp },
+    [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp },
+    [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -5853,10 +5853,22 @@ x86_emulate(
         if ( (b & ~1) == 0x10 && ea.type == OP_MEM )
             d |= TwoOp;
         generate_exception_if(evex.br, EXC_UD);
-        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+        /* fall through */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x51):    /* vsqrtp{s,d} [xyz]mm/mem,[xyz]mm{k} */
+                                            /* vsqrts{s,d} xmm/m32,xmm,xmm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x58):    /* vadd{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x59):    /* vmul{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5c):    /* vsub{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5d):    /* vmin{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type == OP_MEM && evex.br &&
+                                (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
-        avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
     simd_zmm:
         get_fpu(X86EMUL_FPU_zmm);
         opc = init_evex(stub);


* [PATCH v3 09/34] x86emul: support AVX512DQ logic FP insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (7 preceding siblings ...)
  2018-09-18 11:58   ` [PATCH v3 08/34] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
@ 2018-09-18 11:59   ` Jan Beulich
  2018-09-18 11:59   ` [PATCH v3 10/34] x86emul: support AVX512F "normal" FP compare insns Jan Beulich
                     ` (24 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:59 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -130,6 +130,13 @@ static const struct test avx512bw_all[]
     INSN(movdqu16,    f2,   0f, 7f,    vl,   w, vl),
 };
 
+static const struct test avx512dq_all[] = {
+    INSN_PFP(and,              0f, 54),
+    INSN_PFP(andn,             0f, 55),
+    INSN_PFP(or,               0f, 56),
+    INSN_PFP(xor,              0f, 57),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 
@@ -441,4 +448,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, all);
     RUN(avx512f, 128);
     RUN(avx512bw, all);
+    RUN(avx512dq, all);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -300,7 +300,7 @@ static const struct twobyte_table {
     [0x50] = { DstReg|SrcImplicit|ModRM|Mov },
     [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp, d8s_vl },
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
-    [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
+    [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
@@ -6319,6 +6319,17 @@ x86_emulate(
         dst.bytes = 4;
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x54): /* vandp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x55): /* vandnp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x56): /* vorp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x57): /* vxorp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type != OP_MEM && evex.br)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512dq);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     CASE_SIMD_ALL_FP(, 0x0f, 0x5a):        /* cvt{p,s}{s,d}2{p,s}{s,d} xmm/mem,xmm */
     CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} xmm/mem,xmm */
                                            /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm */


* [PATCH v3 10/34] x86emul: support AVX512F "normal" FP compare insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (8 preceding siblings ...)
  2018-09-18 11:59   ` [PATCH v3 09/34] x86emul: support AVX512DQ logic " Jan Beulich
@ 2018-09-18 11:59   ` Jan Beulich
  2018-09-18 12:00   ` [PATCH v3 11/34] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
                     ` (23 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 11:59 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also correct the AVX counterpart's comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -91,6 +91,7 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -350,7 +350,7 @@ static const struct twobyte_table {
     [0xbf] = { DstReg|SrcMem16|ModRM|Mov },
     [0xc0] = { ByteOp|DstMem|SrcReg|ModRM },
     [0xc1] = { DstMem|SrcReg|ModRM },
-    [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp },
+    [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
     [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
@@ -7428,7 +7428,7 @@ x86_emulate(
         goto add;
 
     CASE_SIMD_ALL_FP(, 0x0f, 0xc2):        /* cmp{p,s}{s,d} $imm8,xmm/mem,xmm */
-    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
+    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0xc6):     /* shufp{s,d} $imm8,xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
         d = (d & ~SrcMask) | SrcMem;
@@ -7442,6 +7442,30 @@ x86_emulate(
         }
         goto simd_0f_imm8_avx;
 
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0xc2): /* vcmp{p,s}{s,d} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type == OP_MEM && evex.br &&
+                                (evex.pfx & VEX_PREFIX_SCALAR_MASK)) ||
+                               !evex.r || !evex.R || evex.z),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+        d = (d & ~SrcMask) | SrcMem;
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            evex.b = 1;
+            opc[1] &= 0x38;
+        }
+        opc[2] = imm1;
+        insn_bytes = EVEX_PFX_BYTES + 3;
+        break;
+
     case X86EMUL_OPC(0x0f, 0xc3): /* movnti */
         /* Ignore the non-temporal hint for now. */
         vcpu_must_have(sse2);
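
As a usage illustration (architectural behavior, not patch code): unlike
the VEX forms, the EVEX-encoded compares write an opmask rather than a
vector register, e.g.

    /*
     * vcmpps $1, %zmm1, %zmm2, %k1{%k2}   # predicate 1 == "less than"
     *
     * One result bit per element (16 for 512-bit single precision) is
     * set in %k1, merged under %k2.  Zeroing-masking has no meaning
     * for a mask destination, which is why evex.z is part of the
     * EXC_UD check above.
     */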






* [PATCH v3 11/34] x86emul: support AVX512F misc legacy-equivalent FP insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (9 preceding siblings ...)
  2018-09-18 11:59   ` [PATCH v3 10/34] x86emul: support AVX512F "normal" FP compare insns Jan Beulich
@ 2018-09-18 12:00   ` Jan Beulich
  2018-09-18 12:00   ` [PATCH v3 12/34] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
                     ` (22 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also correct an AVX counterpart's comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -113,8 +113,11 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
+    INSN_PFP(unpckh,         0f, 15),
+    INSN_PFP(unpckl,         0f, 14),
 };
 
 static const struct test avx512f_128[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -280,7 +280,7 @@ static const struct twobyte_table {
     [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
-    [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
+    [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
@@ -354,7 +354,7 @@ static const struct twobyte_table {
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
     [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
-    [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp },
+    [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -5922,6 +5922,17 @@ x86_emulate(
         host_and_vcpu_must_have(sse3);
         goto simd_0f_xmm;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x14): /* vunpcklp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        fault_suppression = false;
+    avx512f_no_sae:
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     case X86EMUL_OPC(0x0f, 0x20): /* mov cr,reg */
     case X86EMUL_OPC(0x0f, 0x21): /* mov dr,reg */
     case X86EMUL_OPC(0x0f, 0x22): /* mov reg,cr */
@@ -6601,11 +6612,9 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_F3(0x0f, 0x7f): /* vmovdqu{32,64} [xyz]mm,[xyz]mm/mem{k} */
     vmovdqa:
         generate_exception_if(evex.br, EXC_UD);
-        host_and_vcpu_must_have(avx512f);
-        avx512_vlen_check(false);
         d |= TwoOp;
         op_bytes = 16 << evex.lr;
-        goto simd_zmm;
+        goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_F2(0x0f, 0x6f): /* vmovdqu{8,16} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f, 0x7f): /* vmovdqu{8,16} [xyz]mm,[xyz]mm/mem{k} */
@@ -7430,7 +7439,7 @@ x86_emulate(
     CASE_SIMD_ALL_FP(, 0x0f, 0xc2):        /* cmp{p,s}{s,d} $imm8,xmm/mem,xmm */
     CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0xc6):     /* shufp{s,d} $imm8,xmm/mem,xmm */
-    CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
+    CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         d = (d & ~SrcMask) | SrcMem;
         if ( vex.opcx == vex_none )
         {
@@ -7451,7 +7460,9 @@ x86_emulate(
         host_and_vcpu_must_have(avx512f);
         if ( ea.type == OP_MEM || !evex.br )
             avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
-        d = (d & ~SrcMask) | SrcMem;
+    simd_imm8_zmm:
+        if ( (d & SrcMask) == SrcImmByte )
+            d = (d & ~SrcMask) | SrcMem;
         get_fpu(X86EMUL_FPU_zmm);
         opc = init_evex(stub);
         opc[0] = b;
@@ -7495,6 +7506,15 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         goto simd_0f_to_gpr;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f, 0xc7): /* Grp9 */
     {
         union {
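
As a rough model of what the avx512_vlen_check(false) invocations above
enforce (a simplified sketch, not the emulator's actual macro, and ignoring
embedded-rounding cases):

    static bool vlen_ok(unsigned int evex_lr, bool lig, bool has_avx512vl)
    {
        if ( evex_lr == 3 )        /* EVEX.L'L == 3 is reserved */
            return false;
        /* 512-bit is always fine; 128/256-bit forms need AVX512VL
         * unless the insn ignores vector length (LIG, i.e. scalar). */
        return evex_lr == 2 || lig || has_avx512vl;
    }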





* [PATCH v3 12/34] x86emul: support AVX512F fused-multiply-add insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (10 preceding siblings ...)
  2018-09-18 12:00   ` [PATCH v3 11/34] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
@ 2018-09-18 12:00   ` Jan Beulich
  2018-09-18 12:01   ` [PATCH v3 13/34] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
                     ` (21 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -93,6 +93,36 @@ static const struct test avx512f_all[] =
     INSN_FP(add,             0f, 58),
     INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
+    INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
+    INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
+    INSN(fmadd213,     66, 0f38, a8,    vl,     sd, vl),
+    INSN(fmadd213,     66, 0f38, a9,    el,     sd, el),
+    INSN(fmadd231,     66, 0f38, b8,    vl,     sd, vl),
+    INSN(fmadd231,     66, 0f38, b9,    el,     sd, el),
+    INSN(fmaddsub132,  66, 0f38, 96,    vl,     sd, vl),
+    INSN(fmaddsub213,  66, 0f38, a6,    vl,     sd, vl),
+    INSN(fmaddsub231,  66, 0f38, b6,    vl,     sd, vl),
+    INSN(fmsub132,     66, 0f38, 9a,    vl,     sd, vl),
+    INSN(fmsub132,     66, 0f38, 9b,    el,     sd, el),
+    INSN(fmsub213,     66, 0f38, aa,    vl,     sd, vl),
+    INSN(fmsub213,     66, 0f38, ab,    el,     sd, el),
+    INSN(fmsub231,     66, 0f38, ba,    vl,     sd, vl),
+    INSN(fmsub231,     66, 0f38, bb,    el,     sd, el),
+    INSN(fmsubadd132,  66, 0f38, 97,    vl,     sd, vl),
+    INSN(fmsubadd213,  66, 0f38, a7,    vl,     sd, vl),
+    INSN(fmsubadd231,  66, 0f38, b7,    vl,     sd, vl),
+    INSN(fnmadd132,    66, 0f38, 9c,    vl,     sd, vl),
+    INSN(fnmadd132,    66, 0f38, 9d,    el,     sd, el),
+    INSN(fnmadd213,    66, 0f38, ac,    vl,     sd, vl),
+    INSN(fnmadd213,    66, 0f38, ad,    el,     sd, el),
+    INSN(fnmadd231,    66, 0f38, bc,    vl,     sd, vl),
+    INSN(fnmadd231,    66, 0f38, bd,    el,     sd, el),
+    INSN(fnmsub132,    66, 0f38, 9e,    vl,     sd, vl),
+    INSN(fnmsub132,    66, 0f38, 9f,    el,     sd, el),
+    INSN(fnmsub213,    66, 0f38, ae,    vl,     sd, vl),
+    INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
+    INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
+    INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -450,30 +450,30 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
-    [0x96 ... 0x98] = { .simd_size = simd_packed_fp },
-    [0x99] = { .simd_size = simd_scalar_vexw },
-    [0x9a] = { .simd_size = simd_packed_fp },
-    [0x9b] = { .simd_size = simd_scalar_vexw },
-    [0x9c] = { .simd_size = simd_packed_fp },
-    [0x9d] = { .simd_size = simd_scalar_vexw },
-    [0x9e] = { .simd_size = simd_packed_fp },
-    [0x9f] = { .simd_size = simd_scalar_vexw },
-    [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp },
-    [0xa9] = { .simd_size = simd_scalar_vexw },
-    [0xaa] = { .simd_size = simd_packed_fp },
-    [0xab] = { .simd_size = simd_scalar_vexw },
-    [0xac] = { .simd_size = simd_packed_fp },
-    [0xad] = { .simd_size = simd_scalar_vexw },
-    [0xae] = { .simd_size = simd_packed_fp },
-    [0xaf] = { .simd_size = simd_scalar_vexw },
-    [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp },
-    [0xb9] = { .simd_size = simd_scalar_vexw },
-    [0xba] = { .simd_size = simd_packed_fp },
-    [0xbb] = { .simd_size = simd_scalar_vexw },
-    [0xbc] = { .simd_size = simd_packed_fp },
-    [0xbd] = { .simd_size = simd_scalar_vexw },
-    [0xbe] = { .simd_size = simd_packed_fp },
-    [0xbf] = { .simd_size = simd_scalar_vexw },
+    [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9a] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9b] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9c] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9e] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xa9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xaa] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xab] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xac] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xad] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xae] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xaf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xb9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xba] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xbc] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xc8 ... 0xcd] = { .simd_size = simd_other },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
@@ -8277,6 +8277,49 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM )
+        {
+            generate_exception_if(evex.br, EXC_UD);
+            avx512_vlen_check(true);
+        }
+        goto simd_zmm;
+
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xc9):     /* sha1msg1 xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xca):     /* sha1msg2 xmm/m128,xmm */
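
To illustrate the .d8s = d8s_dq entries above (a sketch, not patch code):
the scalar FMA forms compress an 8-bit memory displacement by their element
size, which EVEX.W selects, while the d8s_vl packed forms scale by the full
vector length instead:

    /* Hypothetical helper: disp8 * (4 << EVEX.W), i.e. 4 bytes for the
     * s{s} flavors and 8 for the s{d} ones. */
    static int32_t scalar_fma_disp(int8_t disp8, unsigned int evex_w)
    {
        return (int32_t)disp8 * (4 << evex_w);
    }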





* [PATCH v3 13/34] x86emul: support AVX512F legacy-equivalent logic insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (11 preceding siblings ...)
  2018-09-18 12:00   ` [PATCH v3 12/34] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
@ 2018-09-18 12:01   ` Jan Beulich
  2018-09-18 12:02   ` [PATCH v3 14/34] x86emul: support AVX512{F,DQ} FP broadcast insns Jan Beulich
                     ` (20 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:01 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also implement vpternlog{d,q}, which the compiler uses extensively, in
order to facilitate enabling tests in the harness as soon as possible.
Furthermore the twobyte_table[] entries of a few more insns get their
.d8s fields set right away, so as not to split the groups only to
re-combine them later.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -143,6 +143,11 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(pand,         66,   0f, db,    vl,     dq, vl),
+    INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
+    INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -362,13 +362,13 @@ static const struct twobyte_table {
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
-    [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
@@ -491,6 +491,7 @@ static const struct ext0f3a_table {
     uint8_t to_mem:1;
     uint8_t two_op:1;
     uint8_t four_op:1;
+    disp8scale_t d8s:4;
 } ext0f3a_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x01] = { .simd_size = simd_packed_fp, .two_op = 1 },
@@ -508,6 +509,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
+    [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
@@ -3006,20 +3008,33 @@ x86_decode(
                 disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
             break;
 
+        case ext_0f3a:
+            /*
+             * Cannot update d here yet, as the immediate operand still
+             * needs fetching.
+             */
+            state->simd_size = ext0f3a_table[b].simd_size;
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(ext0f3a_table[b].d8s, state);
+            break;
+
         case ext_8f09:
             if ( ext8f09_table[b].two_op )
                 d |= TwoOp;
             state->simd_size = ext8f09_table[b].simd_size;
             break;
 
-        case ext_0f3a:
         case ext_8f08:
+        case ext_8f0a:
             /*
              * Cannot update d here yet, as the immediate operand still
              * needs fetching.
              */
-        default:
             break;
+
+        default:
+            ASSERT_UNREACHABLE();
+            return X86EMUL_UNIMPLEMENTED;
         }
 
         if ( modrm_mod == 3 )
@@ -3198,7 +3213,6 @@ x86_decode(
         else if ( ext0f3a_table[b].four_op && !mode_64bit() && vex.opcx )
             imm1 &= 0x7f;
         state->desc = d;
-        state->simd_size = ext0f3a_table[b].simd_size;
         rc = x86_decode_0f3a(state, ctxt, ops);
         break;
 
@@ -5927,6 +5941,11 @@ x86_emulate(
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
         fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -7510,6 +7529,8 @@ x86_emulate(
         fault_suppression = false;
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
         avx512_vlen_check(false);
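
As a usage note (architectural, not from the patch): vpternlog{d,q}'s
immediate is a 3-input truth table, which is why compilers emit the insn
so readily; e.g. imm8 0x96 yields three-way XOR:

    /*
     * vpternlogd $0x96, %zmm2, %zmm1, %zmm0  # zmm0 = zmm0 ^ zmm1 ^ zmm2
     *
     * Bit i of the immediate is the result for input combination i of
     * the three source bits; 0x96 == 0b10010110 sets exactly the
     * odd-parity combinations.
     */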





* [PATCH v3 14/34] x86emul: support AVX512{F,DQ} FP broadcast insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (12 preceding siblings ...)
  2018-09-18 12:01   ` [PATCH v3 13/34] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
@ 2018-09-18 12:02   ` Jan Beulich
  2018-09-18 12:03   ` [PATCH v3 15/34] x86emul: support AVX512F v{,u}comis{d,s} insns Jan Beulich
                     ` (19 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:02 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -91,6 +91,7 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
@@ -162,6 +163,15 @@ static const struct test avx512f_128[] =
     INSN(movq,      66,   0f, d6, el,    q, el),
 };
 
+static const struct test avx512f_no128[] = {
+    INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+};
+
+static const struct test avx512f_512[] = {
+    INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+};
+
 static const struct test avx512bw_all[] = {
     INSN(movdqu8,     f2,   0f, 6f,    vl,   b, vl),
     INSN(movdqu8,     f2,   0f, 7f,    vl,   b, vl),
@@ -176,8 +186,19 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_no128[] = {
+    INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
+    INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+};
+
+static const struct test avx512dq_512[] = {
+    INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
+static const unsigned char vl_no128[] = { VL_512, VL_256 };
+static const unsigned char vl_512[] = { VL_512 };
 
 /*
  * This table, indicating the presence of an immediate (byte) for an opcode
@@ -486,6 +507,10 @@ void evex_disp8_test(void *instr, struct
 
     RUN(avx512f, all);
     RUN(avx512f, 128);
+    RUN(avx512f, no128);
+    RUN(avx512f, 512);
     RUN(avx512bw, all);
     RUN(avx512dq, all);
+    RUN(avx512dq, no128);
+    RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -234,10 +234,16 @@ enum simd_opsize {
 
     /*
      * 128 bits of integer or floating point data, with no further
-     * formatting information.
+     * formatting information, or with it encoded by EVEX.W.
      */
     simd_128,
 
+    /*
+     * 256 bits of integer or floating point data, with formatting
+     * encoded by EVEX.W.
+     */
+    simd_256,
+
     /* Operand size encoded in non-standard way. */
     simd_other
 };
@@ -430,8 +436,10 @@ static const struct ext0f38_table {
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x18 ... 0x19] = { .simd_size = simd_scalar_opc, .two_op = 1 },
-    [0x1a] = { .simd_size = simd_128, .two_op = 1 },
+    [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
+    [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
+    [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
@@ -3320,6 +3328,10 @@ x86_decode(
         op_bytes = 16;
         break;
 
+    case simd_256:
+        op_bytes = 32;
+        break;
+
     default:
         op_bytes = 0;
         break;
@@ -7969,6 +7981,42 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+    avx512_broadcast:
+        /*
+         * For the respective code below the main switch() to work we need to
+         * fold op_mask here: A source element gets read whenever any of its
+         * respective destination elements' mask bits is set.
+         */
+        if ( fault_suppression )
+        {
+            n = 1 << ((b & 3) - evex.w);
+            ASSERT(op_bytes == n * elem_bytes);
+            for ( i = n; i < (16 << evex.lr) / elem_bytes; i += n )
+                op_mask |= (op_mask >> i) & ((1 << n) - 1);
+        }
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
+                                            /* vbroadcastf64x4 m256,zmm{k} */
+        generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
+                                            /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
+        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcastf64x2 m128,{y,z}mm{k} */
+        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.br,
+                              EXC_UD);
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
     case X86EMUL_OPC_66(0x0f38, 0x20): /* pmovsxbw xmm/m64,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x21): /* pmovsxbd xmm/m32,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x22): /* pmovsxbq xmm/m16,xmm */
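
To put concrete numbers on the op_mask folding above (a recomputation, not
patch code): vbroadcastf32x4 to a zmm destination has n == 4 source dwords
feeding 16 destination elements, so

    unsigned int mask = 0x1000;     /* only destination dword 12 live */
    unsigned int i;

    for ( i = 4; i < 16; i += 4 )
        mask |= (mask >> i) & 0xf;  /* (1 << n) - 1 == 0xf */
    /* mask now ends in 0x1: source element 0 must be readable, while
     * faults on the other three source dwords may be suppressed. */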





* [PATCH v3 15/34] x86emul: support AVX512F v{,u}comis{d,s} insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (13 preceding siblings ...)
  2018-09-18 12:02   ` [PATCH v3 14/34] x86emul: support AVX512{F,DQ} FP broadcast insns Jan Beulich
@ 2018-09-18 12:03   ` Jan Beulich
  2018-09-18 12:03   ` [PATCH v3 16/34] x86emul/test: introduce eq() Jan Beulich
                     ` (18 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:03 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -93,6 +93,8 @@ static const struct test avx512f_all[] =
     INSN_FP(add,             0f, 58),
     INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
+    INSN(comisd,       66,   0f, 2f,    el,      q, el),
+    INSN(comiss,         ,   0f, 2f,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -152,6 +154,8 @@ static const struct test avx512f_all[] =
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
+    INSN(ucomisd,      66,   0f, 2e,    el,      q, el),
+    INSN(ucomiss,        ,   0f, 2e,    el,      d, el),
     INSN_PFP(unpckh,         0f, 15),
     INSN_PFP(unpckl,         0f, 14),
 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -297,7 +297,7 @@ static const struct twobyte_table {
     [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp },
+    [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp, simd_none, d8s_dq },
     [0x30 ... 0x35] = { ImplicitOps },
     [0x37] = { ImplicitOps },
     [0x38] = { DstReg|SrcMem|ModRM },
@@ -6101,24 +6101,34 @@ x86_emulate(
         }
 
         opc = init_prefixes(stub);
+        op_bytes = 4 << vex.pfx;
+    vcomi:
         opc[0] = b;
         opc[1] = modrm;
         if ( ea.type == OP_MEM )
         {
-            rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, vex.pfx ? 8 : 4,
-                           ctxt);
+            rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, op_bytes, ctxt);
             if ( rc != X86EMUL_OKAY )
                 goto done;
 
             /* Convert memory operand to (%rAX). */
             rex_prefix &= ~REX_B;
             vex.b = 1;
+            evex.b = 1;
             opc[1] &= 0x38;
         }
-        insn_bytes = PFX_BYTES + 2;
+        if ( evex_encoded() )
+        {
+            insn_bytes = EVEX_PFX_BYTES + 2;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 2;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
         opc[2] = 0xc3;
 
-        copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),
                     _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),
                     [eflags] "+g" (_regs.eflags),
@@ -6129,6 +6139,19 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2f): /* vcomis{s,d} xmm/mem,xmm */
+        generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
+                               (ea.type == OP_MEM && evex.br) ||
+                               evex.w != evex.pfx),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        op_bytes = 4 << evex.w;
+        goto vcomi;
+
     case X86EMUL_OPC(0x0f, 0x30): /* wrmsr */
         generate_exception_if(!mode_ring0(), EXC_GP, 0);
         fail_if(ops->write_msr == NULL);
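
For reference (architecture-defined behavior, not patch code), the flag
results v{,u}comis{s,d} produce, which the stub's EFLAGS handling above
passes through:

    /*
     * outcome      ZF PF CF
     * unordered     1  1  1   (any NaN operand)
     * greater       0  0  0
     * less          0  0  1
     * equal         1  0  0
     *
     * OF, SF and AF are cleared in all cases; comis raises #IA for any
     * NaN operand, ucomis only for SNaNs.
     */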






* [PATCH v3 16/34] x86emul/test: introduce eq()
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (14 preceding siblings ...)
  2018-09-18 12:03   ` [PATCH v3 15/34] x86emul: support AVX512F v{,u}comis{d,s} insns Jan Beulich
@ 2018-09-18 12:03   ` Jan Beulich
  2018-09-18 12:04   ` [PATCH v3 17/34] x86emul: support AVX512{F,BW} packed integer compare insns Jan Beulich
                     ` (17 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:03 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

In preparation for sensible to-boolean conversion on AVX512, wrap
another abstraction macro around the present to_bool(<x> == <y>), to
get rid of the open-coded == (which would get in the way of using
built-in functions instead). For future AVX512 use, scalar operands
can't be used anymore: use (vec_t){} when the operand is zero, and
broadcast() (if available) otherwise. When broadcast() is not
available, pre-AVX512 can be assumed, in which case a plain scalar is
still fine.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -46,6 +46,10 @@ static inline bool _to_bool(byte_vec_t b
 # define to_bool(cmp) _to_bool((byte_vec_t)(cmp))
 #endif
 
+#ifndef eq
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
 #if VEC_SIZE == FLOAT_SIZE
 # define to_int(x) ((vec_t){ (int)(x)[0] })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
@@ -605,18 +609,18 @@ int simd_test(void)
     touch(src);
     x = src;
     touch(x);
-    if ( !to_bool(x == src) ) return __LINE__;
+    if ( !eq(x, src) ) return __LINE__;
 
     touch(src);
     y = x + src;
     touch(src);
     touch(y);
-    if ( !to_bool(y == 2 * src) ) return __LINE__;
+    if ( !eq(y, 2 * src) ) return __LINE__;
 
     touch(src);
     z = y -= src;
     touch(z);
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
 #if defined(UINT_SIZE)
 
@@ -628,7 +632,7 @@ int simd_test(void)
     z ^= inv;
     touch(inv);
     touch(x);
-    if ( !to_bool((x & ~y) == z) ) return __LINE__;
+    if ( !eq(x & ~y, z) ) return __LINE__;
 
 #elif ELEM_SIZE > 1 || VEC_SIZE <= 8
 
@@ -639,7 +643,7 @@ int simd_test(void)
     z = src + inv;
     touch(inv);
     z *= (src - inv);
-    if ( !to_bool(x - y == z) ) return __LINE__;
+    if ( !eq(x - y, z) ) return __LINE__;
 
 #endif
 
@@ -648,10 +652,10 @@ int simd_test(void)
     x = src * alt;
     touch(alt);
     y = src / alt;
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
     touch(alt);
     touch(src);
-    if ( !to_bool(x * -alt == -src) ) return __LINE__;
+    if ( !eq(x * -alt, -src) ) return __LINE__;
 
 # if defined(recip) && defined(to_int)
 
@@ -659,16 +663,16 @@ int simd_test(void)
     x = recip(src);
     touch(src);
     touch(x);
-    if ( !to_bool(to_int(recip(x)) == src) ) return __LINE__;
+    if ( !eq(to_int(recip(x)), src) ) return __LINE__;
 
 #  ifdef rsqrt
     x = src * src;
     touch(x);
     y = rsqrt(x);
     touch(y);
-    if ( !to_bool(to_int(recip(y)) == src) ) return __LINE__;
+    if ( !eq(to_int(recip(y)), src) ) return __LINE__;
     touch(src);
-    if ( !to_bool(to_int(y) == to_int(recip(src))) ) return __LINE__;
+    if ( !eq(to_int(y), to_int(recip(src))) ) return __LINE__;
 #  endif
 
 # endif
@@ -676,7 +680,7 @@ int simd_test(void)
 # ifdef sqrt
     x = src * src;
     touch(x);
-    if ( !to_bool(sqrt(x) == src) ) return __LINE__;
+    if ( !eq(sqrt(x), src) ) return __LINE__;
 # endif
 
 # ifdef trunc
@@ -684,20 +688,20 @@ int simd_test(void)
     y = (vec_t){ 1 };
     touch(x);
     z = trunc(x);
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
 # endif
 
 # ifdef frac
     touch(src);
     x = frac(src);
     touch(src);
-    if ( !to_bool(x == 0) ) return __LINE__;
+    if ( !eq(x, (vec_t){}) ) return __LINE__;
 
     x = 1 / (src + 1);
     touch(x);
     y = frac(x);
     touch(x);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 # endif
 
 # if defined(trunc) && defined(frac)
@@ -707,7 +711,7 @@ int simd_test(void)
     touch(x);
     z = frac(x);
     touch(x);
-    if ( !to_bool(x == y + z) ) return __LINE__;
+    if ( !eq(x, y + z) ) return __LINE__;
 # endif
 
 #else
@@ -720,16 +724,16 @@ int simd_test(void)
     y[ELEM_COUNT - 1] = y[0] = j = ELEM_COUNT;
     for ( i = 1; i < ELEM_COUNT / 2; ++i )
         y[ELEM_COUNT - i - 1] = y[i] = y[i - 1] + (j -= 2);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 
 #  ifdef mul_hi
     touch(alt);
     x = mul_hi(src, alt);
     touch(alt);
 #   ifdef INT_SIZE
-    if ( !to_bool(x == (alt < 0)) ) return __LINE__;
+    if ( !eq(x, alt < 0) ) return __LINE__;
 #   else
-    if ( !to_bool(x == (src & alt) + alt) ) return __LINE__;
+    if ( !eq(x, (src & alt) + alt) ) return __LINE__;
 #   endif
 #  endif
 
@@ -745,7 +749,7 @@ int simd_test(void)
         z[i] = res;
         z[i + 1] = res >> (ELEM_SIZE << 3);
     }
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
 #  endif
 
     z = src;
@@ -757,12 +761,12 @@ int simd_test(void)
     touch(z);
     y = z << 2;
     touch(z);
-    if ( !to_bool(x == y + y) ) return __LINE__;
+    if ( !eq(x, y + y) ) return __LINE__;
 
     touch(x);
     z = x >> 2;
     touch(x);
-    if ( !to_bool(y == z + z) ) return __LINE__;
+    if ( !eq(y, z + z) ) return __LINE__;
 
     z = src;
 #  ifdef INT_SIZE
@@ -781,11 +785,11 @@ int simd_test(void)
     touch(j);
     y = z << j;
     touch(j);
-    if ( !to_bool(x == y + y) ) return __LINE__;
+    if ( !eq(x, y + y) ) return __LINE__;
 
     z = x >> j;
     touch(j);
-    if ( !to_bool(y == z + z) ) return __LINE__;
+    if ( !eq(y, z + z) ) return __LINE__;
 
 # endif
 
@@ -809,12 +813,12 @@ int simd_test(void)
     --sh;
     touch(sh);
     y = z << sh;
-    if ( !to_bool(x == y + y) ) return __LINE__;
+    if ( !eq(x, y + y) ) return __LINE__;
 
 #  if (defined(__AVX2__) && ELEM_SIZE >= 4) || defined(__XOP__)
     touch(sh);
     x = y >> sh;
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 #  endif
 
 # endif
@@ -828,7 +832,7 @@ int simd_test(void)
     touch(inv);
     y = max(src, inv);
     touch(inv);
-    if ( !to_bool(x + y == src + inv) ) return __LINE__;
+    if ( !eq(x + y, src + inv) ) return __LINE__;
 # else
     x = src * alt;
     y = inv * alt;
@@ -837,33 +841,33 @@ int simd_test(void)
     touch(y);
     y = min(x, y);
     touch(y);
-    if ( !to_bool((y + z) * alt == src + inv) ) return __LINE__;
+    if ( !eq((y + z) * alt, src + inv) ) return __LINE__;
 # endif
 #endif
 
 #ifdef abs
     x = src * alt;
     touch(x);
-    if ( !to_bool(abs(x) == src) ) return __LINE__;
+    if ( !eq(abs(x), src) ) return __LINE__;
 #endif
 
 #ifdef copysignz
     touch(alt);
-    if ( !to_bool(copysignz((vec_t){} + 1, alt) == alt) ) return __LINE__;
+    if ( !eq(copysignz((vec_t){} + 1, alt), alt) ) return __LINE__;
 #endif
 
 #ifdef swap
     touch(src);
-    if ( !to_bool(swap(src) == inv) ) return __LINE__;
+    if ( !eq(swap(src), inv) ) return __LINE__;
 #endif
 
 #ifdef swap2
     touch(src);
-    if ( !to_bool(swap2(src) == inv) ) return __LINE__;
+    if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
 #if defined(broadcast)
-    if ( !to_bool(broadcast(ELEM_COUNT + 1) == src + inv) ) return __LINE__;
+    if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
 #if defined(interleave_lo) && defined(interleave_hi)
@@ -877,7 +881,11 @@ int simd_test(void)
 # else
     z = (x - y) * alt;
 # endif
-    if ( !to_bool(z == ELEM_COUNT / 2) ) return __LINE__;
+# ifdef broadcast
+    if ( !eq(z, broadcast(ELEM_COUNT / 2)) ) return __LINE__;
+# else
+    if ( !eq(z, ELEM_COUNT / 2) ) return __LINE__;
+# endif
 #endif
 
 #if defined(INT_SIZE) && defined(widen1) && defined(interleave_lo)
@@ -887,7 +895,7 @@ int simd_test(void)
     touch(x);
     z = widen1(x);
     touch(x);
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 
 # ifdef widen2
     y = interleave_lo(alt < 0, alt < 0);
@@ -895,7 +903,7 @@ int simd_test(void)
     touch(x);
     z = widen2(x);
     touch(x);
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 
 #  ifdef widen3
     y = interleave_lo(alt < 0, alt < 0);
@@ -904,7 +912,7 @@ int simd_test(void)
     touch(x);
     z = widen3(x);
     touch(x);
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 #  endif
 # endif
 
@@ -919,21 +927,21 @@ int simd_test(void)
     touch(src);
     x = widen1(src);
     touch(src);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 # endif
 
 # ifdef widen2
     touch(src);
     x = widen2(src);
     touch(src);
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 # endif
 
 # ifdef widen3
     touch(src);
     x = widen3(src);
     touch(src);
-    if ( !to_bool(x == interleave_lo(z, (vec_t){})) ) return __LINE__;
+    if ( !eq(x, interleave_lo(z, (vec_t){})) ) return __LINE__;
 # endif
 
 #endif
@@ -942,14 +950,14 @@ int simd_test(void)
     touch(src);
     x = dup_lo(src);
     touch(src);
-    if ( !to_bool(x - src == (alt - 1) / 2) ) return __LINE__;
+    if ( !eq(x - src, (alt - 1) / 2) ) return __LINE__;
 #endif
 
 #ifdef dup_hi
     touch(src);
     x = dup_hi(src);
     touch(src);
-    if ( !to_bool(x - src == (alt + 1) / 2) ) return __LINE__;
+    if ( !eq(x - src, (alt + 1) / 2) ) return __LINE__;
 #endif
 
     for ( i = 0; i < ELEM_COUNT; ++i )
@@ -961,7 +969,7 @@ int simd_test(void)
 # else
     select(&z, src, inv, alt > 0);
 # endif
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 #endif
 
 #ifdef select2
@@ -970,14 +978,14 @@ int simd_test(void)
 # else
     select2(&z, src, inv, alt > 0);
 # endif
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 #endif
 
 #ifdef mix
     touch(src);
     touch(inv);
     x = mix(src, inv);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 
 # ifdef addsub
     touch(src);
@@ -986,22 +994,22 @@ int simd_test(void)
     touch(src);
     touch(inv);
     y = mix(src - inv, src + inv);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 # endif
 #endif
 
 #ifdef rotr
     x = rotr(src, 1);
     y = (src & (ELEM_COUNT - 1)) + 1;
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 #endif
 
 #ifdef dot_product
     touch(src);
     touch(inv);
     x = dot_product(src, inv);
-    if ( !to_bool(x == (vec_t){ (ELEM_COUNT * (ELEM_COUNT + 1) *
-                                 (ELEM_COUNT + 2)) / 6 }) ) return __LINE__;
+    if ( !eq(x, (vec_t){ (ELEM_COUNT * (ELEM_COUNT + 1) *
+                          (ELEM_COUNT + 2)) / 6 }) ) return __LINE__;
 #endif
 
 #ifdef hadd
@@ -1022,7 +1030,7 @@ int simd_test(void)
     x = hsub(src, inv);
     for ( i = ELEM_COUNT; i >>= 1; )
         x = hadd(x, (vec_t){});
-    if ( !to_bool(x == 0) ) return __LINE__;
+    if ( !eq(x, (vec_t){}) ) return __LINE__;
 # endif
 #endif
 
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -20,6 +20,10 @@ ENTRY(fma_test);
 # endif
 #endif
 
+#ifndef eq
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
 #if VEC_SIZE == 16
 # if FLOAT_SIZE == 4
 #  define addsub(x, y) __builtin_ia32_addsubps(x, y)
@@ -62,38 +66,38 @@ int fma_test(void)
     y = (src - one) * inv;
     touch(src);
     z = inv * src + inv;
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
     touch(src);
     z = -inv * src - inv;
-    if ( !to_bool(-x == z) ) return __LINE__;
+    if ( !eq(-x, z) ) return __LINE__;
 
     touch(src);
     z = inv * src - inv;
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
 
     touch(src);
     z = -inv * src + inv;
-    if ( !to_bool(-y == z) ) return __LINE__;
+    if ( !eq(-y, z) ) return __LINE__;
     touch(src);
 
     x = src + inv;
     y = src - inv;
     touch(inv);
     z = src * one + inv;
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
     touch(inv);
     z = -src * one - inv;
-    if ( !to_bool(-x == z) ) return __LINE__;
+    if ( !eq(-x, z) ) return __LINE__;
 
     touch(inv);
     z = src * one - inv;
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
 
     touch(inv);
     z = -src * one + inv;
-    if ( !to_bool(-y == z) ) return __LINE__;
+    if ( !eq(-y, z) ) return __LINE__;
     touch(inv);
 
 #if defined(addsub) && defined(fmaddsub)
@@ -101,21 +105,21 @@ int fma_test(void)
     y = addsub(src * inv, -one);
     touch(one);
     z = fmaddsub(src, inv, one);
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
     touch(one);
     z = fmaddsub(src, inv, -one);
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
     touch(one);
 
     x = addsub(src * inv, one);
     touch(inv);
     z = fmaddsub(src, inv, one);
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
     touch(inv);
     z = fmaddsub(src, inv, -one);
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
     touch(inv);
 #endif
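
As a sketch of where this is headed (an assumption about later patches,
not part of this one): with AVX512 the natural eq() implementation
compares into an opmask and tests that, e.g. for a vector of 16 packed
dwords something along the lines of

    /* Hypothetical AVX512 override; _mm512_cmpeq_epi32_mask() is the
     * standard AVX512F intrinsic, and the casts assume a suitably
     * sized vec_t. */
    # define eq(x, y) \
        (_mm512_cmpeq_epi32_mask((__m512i)(x), (__m512i)(y)) == 0xffff)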
 





* [PATCH v3 17/34] x86emul: support AVX512{F,BW} packed integer compare insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (15 preceding siblings ...)
  2018-09-18 12:03   ` [PATCH v3 16/34] x86emul/test: introduce eq() Jan Beulich
@ 2018-09-18 12:04   ` Jan Beulich
  2018-09-18 12:05   ` [PATCH v3 18/34] x86emul: support AVX512{F,BW} packed integer arithmetic insns Jan Beulich
                     ` (16 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:04 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Include VPTEST{,N}M{B,D,Q,W}, since these may once again be used by the
compiler for comparisons against all-zero vectors.

The table entries of a few more insns also get their .d8s fields set
right away, again so as not to split the groups only to re-combine them
later.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -148,8 +148,16 @@ static const struct test avx512f_all[] =
     INSN_FP(mul,             0f, 59),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
+    INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
+    INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
+    INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
+    INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
+    INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
+    INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
+    INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -181,6 +189,14 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,   b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,   w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,   w, vl),
+    INSN(pcmp,        66, 0f3a, 3f,    vl,  bw, vl),
+    INSN(pcmpeqb,     66,   0f, 74,    vl,   b, vl),
+    INSN(pcmpeqw,     66,   0f, 75,    vl,   w, vl),
+    INSN(pcmpgtb,     66,   0f, 64,    vl,   b, vl),
+    INSN(pcmpgtw,     66,   0f, 65,    vl,   w, vl),
+    INSN(pcmpu,       66, 0f3a, 3e,    vl,  bw, vl),
+    INSN(ptestm,      66, 0f38, 26,    vl,  bw, vl),
+    INSN(ptestnm,     f3, 0f38, 26,    vl,  bw, vl),
 };
 
 static const struct test avx512dq_all[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -311,14 +311,14 @@ static const struct twobyte_table {
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
-    [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
@@ -442,13 +442,13 @@ static const struct ext0f38_table {
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
-    [0x28 ... 0x29] = { .simd_size = simd_packed_int },
+    [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
-    [0x36 ... 0x3f] = { .simd_size = simd_packed_int },
+    [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int },
@@ -514,6 +514,7 @@ static const struct ext0f3a_table {
     [0x18] = { .simd_size = simd_128 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
+    [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
@@ -521,6 +522,7 @@ static const struct ext0f3a_table {
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
     [0x44] = { .simd_size = simd_packed_int },
@@ -6558,6 +6560,32 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x64): /* vpcmpeqb [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x65): /* vpcmpeqw [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x66): /* vpcmpeqd [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x74): /* vpcmpgtb [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x75): /* vpcmpgtw [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x76): /* vpcmpgtd [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x26): /* vptestm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x27): /* vptestm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x29): /* vpcmpeqq [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x37): /* vpcmpgtq [xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( b & (ext == ext_0f38 ? 1 : 2) )
+        {
+            generate_exception_if(b != 0x27 && evex.w != (b & 1), EXC_UD);
+            goto avx512f_no_sae;
+        }
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << (ext == ext_0f ? b & 1 : evex.w);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x6e):    /* mov{d,q} r/m,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
     CASE_SIMD_PACKED_INT(0x0f, 0x7e):    /* mov{d,q} {,x}mm,r/m */
@@ -7566,6 +7594,7 @@ x86_emulate(
                               EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    avx512f_imm_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
         avx512_vlen_check(false);
@@ -8739,6 +8768,19 @@ x86_emulate(
         break;
     }
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1e): /* vpcmpu{d,q} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1f): /* vpcmp{d,q} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3e): /* vpcmpu{b,w} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3f): /* vpcmp{b,w} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( !(b & 0x20) )
+            goto avx512f_imm_no_sae;
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x20): /* pinsrb $imm8,r32/m8,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x22): /* pinsr{d,q} $imm8,r/m,xmm */
         host_and_vcpu_must_have(sse4_1);





* [PATCH v3 18/34] x86emul: support AVX512{F, BW} packed integer arithmetic insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (16 preceding siblings ...)
  2018-09-18 12:04   ` [PATCH v3 17/34] x86emul: support AVX512{F, BW} packed integer compare insns Jan Beulich
@ 2018-09-18 12:05   ` Jan Beulich
  2018-09-18 12:05   ` [PATCH v3 19/34] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
                     ` (15 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:05 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note: vpadd* / vpsub* et al are put in seemingly the wrong slot of the
big switch(). This is in anticipation of adding vpunpck* to those
groups (see the nearby legacy/VEX-encoded case labels, which support
this grouping).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
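
[Editorial aside, not part of the patch: several of the new byte/word
case blocks derive the element size from the low opcode bit, as in
"elem_bytes = 1 << (b & 1)" below. A minimal standalone sketch of that
rule; the opcode values are the 0f-space byte/word pairs handled here:]

#include <stdio.h>

int main(void)
{
    /* 0xf8/0xf9 = vpsubb/vpsubw, 0xfc/0xfd = vpaddb/vpaddw */
    static const unsigned int opcodes[] = { 0xf8, 0xf9, 0xfc, 0xfd };
    unsigned int i;

    for ( i = 0; i < sizeof(opcodes) / sizeof(opcodes[0]); ++i )
        printf("opcode %#x -> %u-byte elements\n",
               opcodes[i], 1u << (opcodes[i] & 1));

    return 0;
}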

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -146,6 +146,8 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(paddd,        66,   0f, fe,    vl,      d, vl),
+    INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
@@ -154,7 +156,16 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
+    INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
+    INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
+    INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
+    INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
+    INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSN(psubd,        66,   0f, fa,    vl,      d, vl),
+    INSN(psubq,        66,   0f, fb,    vl,      q, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
@@ -189,12 +200,39 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,   b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,   w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,   w, vl),
+    INSN(paddb,       66,   0f, fc,    vl,   b, vl),
+    INSN(paddsb,      66,   0f, ec,    vl,   b, vl),
+    INSN(paddsw,      66,   0f, ed,    vl,   w, vl),
+    INSN(paddusb,     66,   0f, dc,    vl,   b, vl),
+    INSN(paddusw,     66,   0f, dd,    vl,   w, vl),
+    INSN(paddw,       66,   0f, fd,    vl,   w, vl),
+    INSN(pavgb,       66,   0f, e0,    vl,   b, vl),
+    INSN(pavgw,       66,   0f, e3,    vl,   w, vl),
     INSN(pcmp,        66, 0f3a, 3f,    vl,  bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,   b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,   w, vl),
     INSN(pcmpgtb,     66,   0f, 64,    vl,   b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,   w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,  bw, vl),
+    INSN(pmaddwd,     66,   0f, f5,    vl,   w, vl),
+    INSN(pmaxsb,      66, 0f38, 3c,    vl,   b, vl),
+    INSN(pmaxsw,      66,   0f, ee,    vl,   w, vl),
+    INSN(pmaxub,      66,   0f, de,    vl,   b, vl),
+    INSN(pmaxuw,      66, 0f38, 3e,    vl,   w, vl),
+    INSN(pminsb,      66, 0f38, 38,    vl,   b, vl),
+    INSN(pminsw,      66,   0f, ea,    vl,   w, vl),
+    INSN(pminub,      66,   0f, da,    vl,   b, vl),
+    INSN(pminuw,      66, 0f38, 3a,    vl,   w, vl),
+    INSN(pmulhuw,     66,   0f, e4,    vl,   w, vl),
+    INSN(pmulhw,      66,   0f, e5,    vl,   w, vl),
+    INSN(pmullw,      66,   0f, d5,    vl,   w, vl),
+    INSN(psadbw,      66,   0f, f6,    vl,   b, vl),
+    INSN(psubb,       66,   0f, f8,    vl,   b, vl),
+    INSN(psubsb,      66,   0f, e8,    vl,   b, vl),
+    INSN(psubsw,      66,   0f, e9,    vl,   w, vl),
+    INSN(psubusb,     66,   0f, d8,    vl,   b, vl),
+    INSN(psubusw,     66,   0f, d9,    vl,   w, vl),
+    INSN(psubw,       66,   0f, f9,    vl,   w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,  bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,  bw, vl),
 };
@@ -203,6 +241,7 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN_PFP(or,               0f, 56),
+    INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -365,21 +365,21 @@ static const struct twobyte_table {
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
-    [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xff] = { ModRM }
 };
 
@@ -449,7 +449,7 @@ static const struct ext0f38_table {
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x40] = { .simd_size = simd_packed_int },
+    [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int },
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
@@ -5960,6 +5960,10 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x39): /* vpmins{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3b): /* vpminu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3d): /* vpmaxs{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6560,6 +6564,37 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd8): /* vpsubusb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd9): /* vpsubusw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdc): /* vpaddusb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdd): /* vpaddusw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe0): /* vpavgb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe3): /* vpavgw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe5): /* vpmulhw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe8): /* vpsubsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe9): /* vpsubsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xec): /* vpaddsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xed): /* vpaddsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf8): /* vpsubb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << (b & 1);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
         op_bytes = 16 << evex.lr;
@@ -6586,6 +6621,12 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x6e):    /* mov{d,q} r/m,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
     CASE_SIMD_PACKED_INT(0x0f, 0x7e):    /* mov{d,q} {,x}mm,r/m */
@@ -7837,6 +7878,16 @@ x86_emulate(
         vcpu_must_have(mmxext);
         goto simd_0f_mmx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xda): /* vpminub [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xde): /* vpmaxub [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe4): /* vpmulhuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xea): /* vpminsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xee): /* vpmaxsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = b & 0x10 ? 1 : 2;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f, 0xe6):       /* cvttpd2dq xmm/mem,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe6):   /* vcvttpd2dq {x,y}mm/mem,xmm */
     case X86EMUL_OPC_F3(0x0f, 0xe6):       /* cvtdq2pd xmm/mem,xmm */
@@ -8210,6 +8261,20 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_2);
         goto simd_0f38_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x38): /* vpminsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3a): /* vpminuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3c): /* vpmaxsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3e): /* vpmaxuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = b & 2 ?: 1;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x40): /* vpmull{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0xdb):     /* aesimc xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdb): /* vaesimc xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdc):     /* aesenc xmm/m128,xmm,xmm */





* [PATCH v3 19/34] x86emul: use simd_128 also for legacy vector shift insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (17 preceding siblings ...)
  2018-09-18 12:05   ` [PATCH v3 18/34] x86emul: support AVX512{F, BW} packed integer arithmetic insns Jan Beulich
@ 2018-09-18 12:05   ` Jan Beulich
  2018-09-18 12:05   ` [PATCH v3 20/34] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
                     ` (14 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:05 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

This eliminates a separate case block here, and allows us to get away
with fewer new ones when adding AVX512 vector shifts.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
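
[Editorial aside, not part of the patch: the decode change below
encodes the rule that only the MMX forms of these shifts read a 64-bit
memory operand, while any SIMD prefix or a VEX encoding implies 128
bits. A minimal sketch, with vex_encoded and pfx standing in for the
emulator's vex.opcx and vex.pfx fields:]

#include <stdbool.h>
#include <stdio.h>

static unsigned int simd_128_op_bytes(bool vex_encoded, unsigned int pfx)
{
    /* Mirrors "op_bytes = vex.opcx || vex.pfx ? 16 : 8;". */
    return vex_encoded || pfx ? 16 : 8;
}

int main(void)
{
    printf("psrlw mm,mm/m64     -> %u bytes\n", simd_128_op_bytes(false, 0));
    printf("psrlw xmm,xmm/m128  -> %u bytes\n", simd_128_op_bytes(false, 1));
    printf("vpsrlw (VEX)        -> %u bytes\n", simd_128_op_bytes(true, 1));
    return 0;
}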

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -364,19 +364,19 @@ static const struct twobyte_table {
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
     [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -3327,7 +3327,8 @@ x86_decode(
         break;
 
     case simd_128:
-        op_bytes = 16;
+        /* The special cases here are MMX shift insns. */
+        op_bytes = vex.opcx || vex.pfx ? 16 : 8;
         break;
 
     case simd_256:
@@ -6455,6 +6456,12 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0x75): /* vpcmpeqw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0x76):    /* pcmpeqd {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x76): /* vpcmpeqd {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd1):    /* psrlw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd2):    /* psrld {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd3):    /* psrlq {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xd4):     /* paddq xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xd4): /* vpaddq {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0xd5):    /* pmullw {,x}mm/mem,{,x}mm */
@@ -6477,6 +6484,10 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0xdf): /* vpandn {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe0):     /* pavgb xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe0): /* vpavgb {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xe1):    /* psraw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xe2):    /* psrad {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xe2): /* vpsrad xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe3):     /* pavgw xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe3): /* vpavgw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe4):     /* pmulhuw xmm/m128,xmm */
@@ -6499,6 +6510,12 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0xee): /* vpmaxsw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0xef):    /* pxor {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xef): /* vpxor {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf1):    /* psllw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf2):    /* pslld {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf2): /* vpslld xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf3):    /* psllq {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xf4):     /* pmuludq xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xf4): /* vpmuludq {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xf6):     /* psadbw xmm/m128,xmm */
@@ -7831,25 +7848,6 @@ x86_emulate(
         }
         break;
 
-    CASE_SIMD_PACKED_INT(0x0f, 0xd1):    /* psrlw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xd2):    /* psrld {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xd3):    /* psrlq {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xe1):    /* psraw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xe2):    /* psrad {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xe2): /* vpsrad xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf1):    /* psllw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf2):    /* pslld {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf2): /* vpslld xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf3):    /* psllq {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,{x,y}mm,{x,y}mm */
-        op_bytes = vex.pfx ? 16 : 8;
-        goto simd_0f_int;
-
     case X86EMUL_OPC(0x0f, 0xd4):        /* paddq mm/m64,mm */
     case X86EMUL_OPC(0x0f, 0xf4):        /* pmuludq mm/m64,mm */
     case X86EMUL_OPC(0x0f, 0xfb):        /* psubq mm/m64,mm */





* [PATCH v3 20/34] x86emul: support AVX512{F, BW} shift/rotate insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (18 preceding siblings ...)
  2018-09-18 12:05   ` [PATCH v3 19/34] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
@ 2018-09-18 12:05   ` Jan Beulich
  2018-09-18 12:06   ` [PATCH v3 21/34] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
                     ` (13 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:05 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note that simd_packed_fp for opcode space 0f38 major opcodes 14 and 15
is not really correct, but it is sufficient for the purposes here.
Further adjustments may later be needed for the down-converting
unsigned saturating VPMOV* insns, first and foremost for the different
Disp8 scaling those use.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
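
[Editorial aside, not part of the patch: the d8s values added below
feed EVEX's compressed displacement (Disp8*N), where an 8-bit
displacement gets scaled by the natural size of the memory operand. A
rough sketch of that scaling, assuming a plain value N means a factor
of 1 << N and d8s_vl means the full vector length, 16 << EVEX.L'L;
D8S_VL is a made-up marker here, not the emulator's enumerator:]

#include <stdint.h>
#include <stdio.h>

#define D8S_VL 16 /* made-up marker: "scale by the full vector length" */

static int32_t scale_disp8(int8_t disp8, unsigned int d8s, unsigned int lr)
{
    unsigned int shift = d8s == D8S_VL ? 4 + lr : d8s;

    return disp8 * (1 << shift);
}

int main(void)
{
    /* xmm/m128 shift count operand (d8s = 4): disp8 of 1 means offset 16. */
    printf("m128 operand: %d\n", scale_disp8(1, 4, 2));
    /* Full-width zmm operand (EVEX.L'L = 2): disp8 of 1 means offset 64. */
    printf("zmm operand:  %d\n", scale_disp8(1, D8S_VL, 2));
    return 0;
}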

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -164,6 +164,24 @@ static const struct test avx512f_all[] =
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSNX(prol,        66,   0f, 72, 1, vl,     dq, vl),
+    INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
+    INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
+    INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
+    INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
+    INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
+    INSNX(psllq,       66,   0f, 73, 6, vl,      q, vl),
+    INSN(psllv,        66, 0f38, 47,    vl,     dq, vl),
+    INSNX(psra,        66,   0f, 72, 4, vl,     dq, vl),
+    INSN(psrad,        66,   0f, e2,    el_4,    d, vl),
+    INSN(psraq,        66,   0f, e2,    el_2,    q, vl),
+    INSN(psrav,        66, 0f38, 46,    vl,     dq, vl),
+    INSN(psrld,        66,   0f, d2,    el_4,    d, vl),
+    INSNX(psrld,       66,   0f, 72, 2, vl,      d, vl),
+    INSN(psrlq,        66,   0f, d3,    el_2,    q, vl),
+    INSNX(psrlq,       66,   0f, 73, 2, vl,      q, vl),
+    INSN(psrlv,        66, 0f38, 45,    vl,     dq, vl),
     INSN(psubd,        66,   0f, fa,    vl,      d, vl),
     INSN(psubq,        66,   0f, fb,    vl,      q, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
@@ -227,6 +245,17 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,   w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,   w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,   b, vl),
+    INSNX(pslldq,     66,   0f, 73, 7, vl,   b, vl),
+    INSN(psllvw,      66, 0f38, 12,    vl,   w, vl),
+    INSN(psllw,       66,   0f, f1,    el_8, w, vl),
+    INSNX(psllw,      66,   0f, 71, 6, vl,   w, vl),
+    INSN(psravw,      66, 0f38, 11,    vl,   w, vl),
+    INSN(psraw,       66,   0f, e1,    el_8, w, vl),
+    INSNX(psraw,      66,   0f, 71, 4, vl,   w, vl),
+    INSNX(psrldq,     66,   0f, 73, 3, vl,   b, vl),
+    INSN(psrlvw,      66, 0f38, 10,    vl,   w, vl),
+    INSN(psrlw,       66,   0f, d1,    el_8, w, vl),
+    INSNX(psrlw,      66,   0f, 71, 2, vl,   w, vl),
     INSN(psubb,       66,   0f, f8,    vl,   b, vl),
     INSN(psubsb,      66,   0f, e8,    vl,   b, vl),
     INSN(psubsw,      66,   0f, e9,    vl,   w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -317,7 +317,7 @@ static const struct twobyte_table {
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
-    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
+    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
@@ -364,19 +364,19 @@ static const struct twobyte_table {
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
     [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -432,9 +432,9 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
-    [0x10] = { .simd_size = simd_packed_int },
+    [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
-    [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
+    [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
@@ -451,7 +451,7 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x45 ... 0x47] = { .simd_size = simd_packed_int },
+    [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
@@ -5961,10 +5961,15 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x14): /* vprorv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x15): /* vprolv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x39): /* vpmins{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3b): /* vpminu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3d): /* vpmaxs{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6581,6 +6586,9 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
@@ -6606,6 +6614,16 @@ x86_emulate(
         elem_bytes = 1 << (b & 1);
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe2): /* vpsra{d,q} xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.br, EXC_UD);
+        fault_suppression = false;
+        if ( b == 0xe2 )
+            goto avx512f_no_sae;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6886,6 +6904,37 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x71): /* Grp12 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsraw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512bw_shift_imm:
+            fault_suppression = false;
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512bw_imm;
+        }
+        goto unrecognized_insn;
+
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x72): /* Grp13 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpslld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(evex.w, EXC_UD);
+            /* fall through */
+        case 0: /* vpror{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 1: /* vprol{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsra{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512f_shift_imm:
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512f_imm_no_sae;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x73):        /* Grp14 */
         switch ( modrm_reg & 7 )
         {
@@ -6911,6 +6960,19 @@ x86_emulate(
         }
         goto unrecognized_insn;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x73): /* Grp14 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(!evex.w, EXC_UD);
+            goto avx512f_shift_imm;
+        case 3: /* vpsrldq $imm8,[xyz]mm/mem,[xyz]mm */
+        case 7: /* vpslldq $imm8,[xyz]mm/mem,[xyz]mm */
+            goto avx512bw_shift_imm;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x77):        /* emms */
     case X86EMUL_OPC_VEX(0x0f, 0x77):    /* vzero{all,upper} */
         if ( vex.opcx != vex_none )
@@ -8082,6 +8144,14 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
         generate_exception_if(evex.w || evex.br, EXC_UD);
     avx512_broadcast:
@@ -8838,6 +8908,7 @@ x86_emulate(
         generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
         if ( !(b & 0x20) )
             goto avx512f_imm_no_sae;
+    avx512bw_imm:
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.br, EXC_UD);
         elem_bytes = 1 << evex.w;





* [PATCH v3 21/34] x86emul: support AVX512{F, BW, DQ} extract insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (19 preceding siblings ...)
  2018-09-18 12:05   ` [PATCH v3 20/34] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
@ 2018-09-18 12:06   ` Jan Beulich
  2018-09-18 12:07   ` [PATCH v3 22/34] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
                     ` (12 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:06 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -198,6 +198,7 @@ static const struct test avx512f_all[] =
 };
 
 static const struct test avx512f_128[] = {
+    INSN(extractps, 66, 0f3a, 17, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -207,10 +208,14 @@ static const struct test avx512f_128[] =
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+    INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
+    INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
+    INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -266,6 +271,12 @@ static const struct test avx512bw_all[]
     INSN(ptestnm,     f3, 0f38, 26,    vl,  bw, vl),
 };
 
+static const struct test avx512bw_128[] = {
+    INSN(pextrb, 66, 0f3a, 14, el, b, el),
+//       pextrw, 66,   0f, c5,     w
+    INSN(pextrw, 66, 0f3a, 15, el, w, el),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -274,13 +285,21 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_128[] = {
+    INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+};
+
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
+    INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
+    INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
@@ -598,7 +617,9 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, no128);
     RUN(avx512f, 512);
     RUN(avx512bw, all);
+    RUN(avx512bw, 128);
     RUN(avx512dq, all);
+    RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -510,9 +510,13 @@ static const struct ext0f3a_table {
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
-    [0x14 ... 0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1 },
+    [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
+    [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
+    [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
+    [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
     [0x18] = { .simd_size = simd_128 },
-    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
@@ -521,7 +525,8 @@ static const struct ext0f3a_table {
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
-    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
@@ -2655,6 +2660,8 @@ x86_decode_0f3a(
      ... X86EMUL_OPC_66(0, 0x17):     /* pextr*, extractps */
     case X86EMUL_OPC_VEX_66(0, 0x14)
      ... X86EMUL_OPC_VEX_66(0, 0x17): /* vpextr*, vextractps */
+    case X86EMUL_OPC_EVEX_66(0, 0x14)
+     ... X86EMUL_OPC_EVEX_66(0, 0x17): /* vpextr*, vextractps */
     case X86EMUL_OPC_VEX_F2(0, 0xf0): /* rorx */
         break;
 
@@ -8827,9 +8834,9 @@ x86_emulate(
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */
         rex_prefix &= ~REX_B;
-        vex.b = 1;
+        evex.b = vex.b = 1;
         if ( !mode_64bit() )
-            vex.w = 0;
+            evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
         opc[3] = 0xc3;
@@ -8839,7 +8846,10 @@ x86_emulate(
             --opc;
         }
 
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+            copy_EVEX(opc, evex);
+        else
+            copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=m" (dst.val) : "a" (&dst.val));
         put_stub(stub);
 
@@ -8859,6 +8869,52 @@ x86_emulate(
         opc = init_prefixes(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
+        generate_exception_if(ea.type != OP_REG, EXC_UD);
+        /* Convert to alternative encoding: We want to use a memory operand. */
+        evex.opcx = ext_0f3a;
+        b = 0x15;
+        modrm <<= 3;
+        evex.r = evex.b;
+        evex.R = evex.x;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x14): /* vpextrb $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x15): /* vpextrw $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x16): /* vpextr{d,q} $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x17): /* vextractps $imm8,xmm,r/m */
+        generate_exception_if((evex.lr || evex.reg != 0xf || !evex.RX ||
+                               evex.opmsk || evex.br),
+                              EXC_UD);
+        if ( !(b & 2) )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( !(b & 1) )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto pextr;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(evex.lr != 2 || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
     {
         uint32_t mxcsr;





* [PATCH v3 22/34] x86emul: support AVX512{F, BW, DQ} insert insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (20 preceding siblings ...)
  2018-09-18 12:06   ` [PATCH v3 21/34] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
@ 2018-09-18 12:07   ` Jan Beulich
  2018-09-18 12:07   ` [PATCH v3 23/34] x86emul: basic AVX512F testing Jan Beulich
                     ` (11 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also correct the comment on the AVX form of VINSERTPS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -199,6 +199,7 @@ static const struct test avx512f_all[] =
 
 static const struct test avx512f_128[] = {
     INSN(extractps, 66, 0f3a, 17, el,    d, el),
+    INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -210,12 +211,16 @@ static const struct test avx512f_no128[]
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
+    INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
+    INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
+    INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
+    INSN(inserti64x4,    66, 0f3a, 3a, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -275,6 +280,8 @@ static const struct test avx512bw_128[]
     INSN(pextrb, 66, 0f3a, 14, el, b, el),
 //       pextrw, 66,   0f, c5,     w
     INSN(pextrw, 66, 0f3a, 15, el, w, el),
+    INSN(pinsrb, 66, 0f3a, 20, el, b, el),
+    INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
 static const struct test avx512dq_all[] = {
@@ -287,6 +294,7 @@ static const struct test avx512dq_all[]
 
 static const struct test avx512dq_128[] = {
     INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+    INSN(pinsr, 66, 0f3a, 22, el, dq64, el),
 };
 
 static const struct test avx512dq_no128[] = {
@@ -294,12 +302,16 @@ static const struct test avx512dq_no128[
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
+    INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
+    INSN(inserti64x2,    66, 0f3a, 38, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
+    INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
+    INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -358,7 +358,7 @@ static const struct twobyte_table {
     [0xc1] = { DstMem|SrcReg|ModRM },
     [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
-    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
+    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int, 1 },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
     [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
@@ -514,17 +514,19 @@ static const struct ext0f3a_table {
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
     [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
-    [0x18] = { .simd_size = simd_128 },
+    [0x18] = { .simd_size = simd_128, .d8s = 4 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x20] = { .simd_size = simd_none },
-    [0x21] = { .simd_size = simd_other },
-    [0x22] = { .simd_size = simd_none },
+    [0x20] = { .simd_size = simd_none, .d8s = 0 },
+    [0x21] = { .simd_size = simd_other, .d8s = 2 },
+    [0x22] = { .simd_size = simd_none, .d8s = d8s_dq },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
-    [0x38] = { .simd_size = simd_128 },
+    [0x38] = { .simd_size = simd_128, .d8s = 4 },
+    [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -2565,6 +2567,7 @@ x86_decode_twobyte(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
     case X86EMUL_OPC_VEX_66(0, 0xc4): /* vpinsrw */
+    case X86EMUL_OPC_EVEX_66(0, 0xc4): /* vpinsrw */
         state->desc = DstReg | SrcMem16;
         break;
 
@@ -2667,6 +2670,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x20):     /* pinsrb */
     case X86EMUL_OPC_VEX_66(0, 0x20): /* vpinsrb */
+    case X86EMUL_OPC_EVEX_66(0, 0x20): /* vpinsrb */
         state->desc = DstImplicit | SrcMem;
         if ( modrm_mod != 3 )
             state->desc |= ByteOp;
@@ -2674,6 +2678,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x22):     /* pinsr{d,q} */
     case X86EMUL_OPC_VEX_66(0, 0x22): /* vpinsr{d,q} */
+    case X86EMUL_OPC_EVEX_66(0, 0x22): /* vpinsr{d,q} */
         state->desc = DstImplicit | SrcMem;
         break;
 
@@ -7700,6 +7705,23 @@ x86_emulate(
         ea.type = OP_MEM;
         goto simd_0f_int_imm8;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc4):   /* vpinsrw $imm8,r32/m16,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x20): /* vpinsrb $imm8,r32/m8,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x22): /* vpinsr{d,q} $imm8,r/m,xmm,xmm */
+        generate_exception_if(evex.lr || evex.opmsk || evex.br, EXC_UD);
+        if ( b & 2 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        if ( !mode_64bit() )
+            evex.w = 0;
+        memcpy(mmvalp, &src.val, op_bytes);
+        ea.type = OP_MEM;
+        op_bytes = src.bytes;
+        d = SrcMem16; /* Fake for the common SIMD code below. */
+        state->simd_size = simd_other;
+        goto avx512f_imm_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0xc5):      /* pextrw $imm8,{,x}mm,reg */
     case X86EMUL_OPC_VEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
         generate_exception_if(vex.l, EXC_UD);
@@ -8895,8 +8917,12 @@ x86_emulate(
         opc = init_evex(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x18): /* vinsertf32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinsertf64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x38): /* vinserti32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinserti64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
@@ -8905,8 +8931,12 @@ x86_emulate(
         fault_suppression = false;
         goto avx512f_imm_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1a): /* vinsertf32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinsertf64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3a): /* vinserti32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinserti64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
         if ( !evex.w )
@@ -8999,13 +9029,19 @@ x86_emulate(
         op_bytes = 4;
         goto simd_0f3a_common;
 
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
         op_bytes = 4;
         /* fall through */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x41): /* vdppd $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
+        op_bytes = 4;
+        generate_exception_if(evex.lr || evex.w || evex.opmsk || evex.br,
+                              EXC_UD);
+        goto avx512f_imm_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )





* [PATCH v3 23/34] x86emul: basic AVX512F testing
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (21 preceding siblings ...)
  2018-09-18 12:07   ` [PATCH v3 22/34] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
@ 2018-09-18 12:07   ` Jan Beulich
  2018-09-18 12:08   ` [PATCH v3 24/34] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
                     ` (10 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Test a variety of the insns which have been implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
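
[Editorial aside, not part of the patch: the new eq() flavors compare
into an opmask register and then check the resulting bit mask against
ALL_TRUE. A plain-C model of that pattern, with the k-register result
of vcmppd emulated by a scalar loop; ELEM_COUNT 8 models a 64-byte
vector of doubles:]

#include <stdbool.h>
#include <stdio.h>

#define ELEM_COUNT 8 /* 64-byte vector of 8-byte elements */
#define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))

static bool eq(const double *x, const double *y)
{
    unsigned long long mask = 0;
    unsigned int i;

    for ( i = 0; i < ELEM_COUNT; ++i )
        if ( x[i] == y[i] )
            mask |= 1ULL << i; /* vcmppd would set bit i of the k register */

    return mask == ALL_TRUE;
}

int main(void)
{
    double a[ELEM_COUNT] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    double b[ELEM_COUNT] = { 1, 2, 3, 4, 5, 6, 7, 8 };

    printf("equal: %d\n", eq(a, b)); /* 1 */
    b[3] = 0;
    printf("equal: %d\n", eq(a, b)); /* 0 */
    return 0;
}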

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -11,7 +11,7 @@ all: $(TARGET)
 run: $(TARGET)
 	./$(TARGET)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -52,6 +52,9 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
+avx512f-vecs := 64
+avx512f-ints := 4 8
+avx512f-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
@@ -137,7 +140,7 @@ $(addsuffix .c,$(SG)):
 
 $(addsuffix .h,$(SIMD) $(FMA) $(SG)): simd.h
 
-xop.h: simd-fma.c
+xop.h avx512f.h: simd-fma.c
 
 $(TARGET): x86-emulate.o test_x86_emulator.o evex-disp8.o wrappers.o
 	$(HOSTCC) $(HOSTCFLAGS) -o $@ $^
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -2,7 +2,41 @@
 
 ENTRY(simd_test);
 
-#if VEC_SIZE == 8 && defined(__SSE__)
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if VEC_SIZE == 4
+#  define eq(x, y) ({ \
+    float x_ = (x)[0]; \
+    float __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpss $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif VEC_SIZE == 8
+#  define eq(x, y) ({ \
+    double x_ = (x)[0]; \
+    double __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpsd $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif FLOAT_SIZE == 4
+/*
+ * gcc's (up to at least 8.2) __builtin_ia32_cmpps256_mask() has an anomaly in
+ * that its return type is QI rather than UQI, and hence the value would get
+ * sign-extended before comparing to ALL_TRUE. The same oddity does not matter
+ * for __builtin_ia32_cmppd256_mask(), as only 4 bits are significant there.
+ * Hence the extra " & ALL_TRUE".
+ */
+#  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
+# elif FLOAT_SIZE == 8
+#  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif INT_SIZE == 4
+#  define eq(x, y) (B(pcmpeqd, _mask, x, y, -1) == ALL_TRUE)
+# elif INT_SIZE == 8
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+# endif
+#elif VEC_SIZE == 8 && defined(__SSE__)
 # define to_bool(cmp) (__builtin_ia32_pmovmskb(cmp) == 0xff)
 #elif VEC_SIZE == 16
 # if defined(__AVX__) && defined(FLOAT_SIZE)
@@ -93,6 +127,50 @@ static inline bool _to_bool(byte_vec_t b
     touch(x); \
     __builtin_ia32_pfrcpit2(__builtin_ia32_pfrsqit1(__builtin_ia32_pfmul(t_, t_), x), t_); \
 })
+#elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if VEC_SIZE > FLOAT_SIZE
+#  if FLOAT_SIZE == 4
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastss %1, %0" \
+          : "=v" (t_) : "m" (*(float[1]){ x }) ); \
+    t_; \
+})
+#   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#   define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#   if VEC_SIZE == 16
+#    define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
+#    define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
+#    define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#   endif
+#  elif FLOAT_SIZE == 8
+#   if VEC_SIZE >= 32
+#    define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastsd %1, %0" : "=v" (t_) \
+          : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#   else
+#    define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#   endif
+#   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#   define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#   if VEC_SIZE == 16
+#    define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
+#    define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
+#    define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#   endif
+#  endif
+# endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
 # if VEC_SIZE == 32 && defined(__AVX__)
 #  if defined(__AVX2__)
@@ -191,7 +269,30 @@ static inline bool _to_bool(byte_vec_t b
 #  define sqrt(x) scalar_1op(x, "sqrtsd %[in], %[out]")
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE2__)
+#if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
+     defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 4 || UINT_SIZE == 4
+#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+# endif
+# if INT_SIZE == 4
+#  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
+#  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 4
+#  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# endif
+#elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
 #  define interleave_lo(x, y) ((vec_t)__builtin_ia32_punpcklbw128((vqi_t)(x), (vqi_t)(y)))
@@ -587,6 +688,10 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if defined(__AVX512F__) && defined(FLOAT_SIZE)
+# include "simd-fma.c"
+#endif
+
 int simd_test(void)
 {
     unsigned int i, j;
@@ -1034,7 +1139,8 @@ int simd_test(void)
 # endif
 #endif
 
-#if defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)
+#if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
+    (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
 #endif
 
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,9 +70,111 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE == 16
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
+# define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
+#elif VEC_SIZE == 32
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 256 ## s(a)
+#elif VEC_SIZE == 64
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 512 ## s(a)
+# define BR(n, s, a...)  __builtin_ia32_ ## n ## 512 ## s(a, 4)
+#endif
+#ifndef B_
+# define B_ B
+#endif
+#ifndef BR
+# define BR B
+# define BR_ B_
+#endif
+#ifndef BR_
+# define BR_ BR
+#endif
+
+#ifdef __AVX512F__
+
+/*
+ * The original plan was to effect use of EVEX encodings for scalar as well as
+ * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
+ * only of course) XMM16-XMM31 only. All sorts of compiler errors result when
+ * doing this with gcc 8.2. Therefore resort to injecting {evex} prefixes,
+ * which has the benefit of also working for 32-bit. Granted, there is a lot of
+ * escaping to get right here.
+ */
+asm ( ".macro override insn    \n\t"
+      ".macro $\\insn o:vararg \n\t"
+      ".purgem \\insn          \n\t"
+      "{evex} \\insn \\(\\)o   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\(\\))o     \n\t"
+      ".endm                   \n\t"
+      ".endm                   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\)o         \n\t"
+      ".endm                   \n\t"
+      ".endm" );
+
+#define OVR(n) asm ( "override v" #n )
+#define OVR_SFP(n) OVR(n ## sd); OVR(n ## ss)
+
+#ifdef __AVX512VL__
+# ifdef __AVX512BW__
+#  define OVR_BW(n) OVR(p ## n ## b); OVR(p ## n ## w)
+# else
+#  define OVR_BW(n)
+# endif
+# define OVR_DQ(n) OVR(p ## n ## d); OVR(p ## n ## q)
+# define OVR_VFP(n) OVR(n ## pd); OVR(n ## ps)
+#else
+# define OVR_BW(n)
+# define OVR_DQ(n)
+# define OVR_VFP(n)
+#endif
+
+#define OVR_FMA(n, w) OVR_ ## w(n ## 132); OVR_ ## w(n ## 213); \
+                      OVR_ ## w(n ## 231)
+#define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
+#define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
+
+OVR_SFP(broadcast);
+OVR_SFP(comi);
+OVR_FP(add);
+OVR_FP(div);
+OVR(extractps);
+OVR_FMA(fmadd, FP);
+OVR_FMA(fmsub, FP);
+OVR_FMA(fnmadd, FP);
+OVR_FMA(fnmsub, FP);
+OVR(insertps);
+OVR_FP(max);
+OVR_FP(min);
+OVR(movd);
+OVR(movq);
+OVR_SFP(mov);
+OVR_FP(mul);
+OVR_FP(sqrt);
+OVR_FP(sub);
+OVR_SFP(ucomi);
+
+#undef OVR_VFP
+#undef OVR_SFP
+#undef OVR_INT
+#undef OVR_FP
+#undef OVR_FMA
+#undef OVR_DQ
+#undef OVR_BW
+#undef OVR
+
+#endif
+
 /*
  * Suppress value propagation by the compiler, preventing unwanted
  * optimization. This at once makes the compiler use memory operands
  * more often, which for our purposes is the more interesting case.
  */
 #define touch(var) asm volatile ( "" : "+m" (var) )
+
+static inline vec_t undef(void)
+{
+    vec_t v = v;
+    return v;
+}
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -1,10 +1,9 @@
+#if !defined(__XOP__) && !defined(__AVX512F__)
 #include "simd.h"
-
-#ifndef __XOP__
 ENTRY(fma_test);
 #endif
 
-#if VEC_SIZE < 16
+#if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
 #elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
@@ -24,7 +23,13 @@ ENTRY(fma_test);
 # define eq(x, y) to_bool((x) == (y))
 #endif
 
-#if VEC_SIZE == 16
+#if defined(__AVX512F__) && VEC_SIZE > FLOAT_SIZE
+# if FLOAT_SIZE == 4
+#  define fmaddsub(x, y, z) BR(vfmaddsubps, _mask, x, y, z, ~0)
+# elif FLOAT_SIZE == 8
+#  define fmaddsub(x, y, z) BR(vfmaddsubpd, _mask, x, y, z, ~0)
+# endif
+#elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
 #  define addsub(x, y) __builtin_ia32_addsubps(x, y)
 #  if defined(__FMA4__) || defined(__FMA__)
@@ -50,6 +55,10 @@ ENTRY(fma_test);
 # endif
 #endif
 
+#if defined(fmaddsub) && !defined(addsub)
+# define addsub(x, y) fmaddsub(x, broadcast(1), y)
+#endif
+
 int fma_test(void)
 {
     unsigned int i;
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -21,6 +21,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-opmask.h"
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
+#include "avx512f.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -248,6 +249,14 @@ static const struct {
     SIMD(OPMASK/b,    avx512dq_opmask,         1),
     SIMD(OPMASK/d,    avx512bw_opmask,         4),
     SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(AVX512F f32 scalar,  avx512f,        f4),
+    SIMD(AVX512F f32x16,      avx512f,      64f4),
+    SIMD(AVX512F f64 scalar,  avx512f,        f8),
+    SIMD(AVX512F f64x8,       avx512f,      64f8),
+    SIMD(AVX512F s32x16,      avx512f,      64i4),
+    SIMD(AVX512F u32x16,      avx512f,      64u4),
+    SIMD(AVX512F s64x8,       avx512f,      64i8),
+    SIMD(AVX512F u64x8,       avx512f,      64u8),
 #undef SIMD_
 #undef SIMD
 };
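
To make the helper macros introduced in simd.h above a little more
concrete (a sketch, not part of the patch; assumes gcc with -mavx512f):
B() and BR() merely paste together the names of gcc's masked AVX512
builtins, with BR() appending 4 (_MM_FROUND_CUR_DIRECTION) as the
rounding-mode argument.

/* Compile with gcc -mavx512f. */
typedef int __attribute__((vector_size(64))) v16si_t;

#define B(n, s, a...) __builtin_ia32_ ## n ## 512 ## s(a)

static inline _Bool all_equal(v16si_t x, v16si_t y)
{
    /* Expands to __builtin_ia32_pcmpeqd512_mask(x, y, -1). */
    return B(pcmpeqd, _mask, x, y, -1) == 0xffff;
}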





* [PATCH v3 24/34] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (22 preceding siblings ...)
  2018-09-18 12:07   ` [PATCH v3 23/34] x86emul: basic AVX512F testing Jan Beulich
@ 2018-09-18 12:08   ` Jan Beulich
  2018-09-18 12:09   ` [PATCH v3 25/34] x86emul: basic AVX512VL testing Jan Beulich
                     ` (9 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note that the pbroadcastw table entry in evex-disp8.c is slightly
different from what one would expect, because the insn requires EVEX.W
to be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -150,6 +150,9 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+//       pbroadcast,   66, 0f38, 7c,          dq64
+    INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
+    INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
     INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
     INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
@@ -208,6 +211,7 @@ static const struct test avx512f_128[] =
 
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcasti32x4, 66, 0f38, 5a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
@@ -217,6 +221,7 @@ static const struct test avx512f_no128[]
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(broadcasti64x4, 66, 0f38, 5b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
     INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
@@ -236,6 +241,10 @@ static const struct test avx512bw_all[]
     INSN(paddw,       66,   0f, fd,    vl,   w, vl),
     INSN(pavgb,       66,   0f, e0,    vl,   b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,   w, vl),
+    INSN(pbroadcastb, 66, 0f38, 78,    el,   b, el),
+//       pbroadcastb, 66, 0f38, 7a,          b
+    INSN(pbroadcastw, 66, 0f38, 79,    el_2, b, vl),
+//       pbroadcastw, 66, 0f38, 7b,          b
     INSN(pcmp,        66, 0f3a, 3f,    vl,  bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,   b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,   w, vl),
@@ -287,6 +296,7 @@ static const struct test avx512bw_128[]
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
+    INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
@@ -300,6 +310,7 @@ static const struct test avx512dq_128[]
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(broadcasti64x2, 66, 0f38, 5a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
     INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
@@ -308,6 +319,7 @@ static const struct test avx512dq_no128[
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(broadcasti32x8, 66, 0f38, 5b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
     INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -272,9 +272,33 @@ static inline bool _to_bool(byte_vec_t b
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if INT_SIZE == 4 || UINT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastd %1, %0" \
+          : "=v" (t_) : "m" (*(int[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(long long[1]){ x }) ); \
+    t_; \
+})
+#  ifdef __x86_64__
+#   define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastq %1, %0" : "=v" (t_) : "r" ((x) + 0ULL) ); \
+    t_; \
+})
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
@@ -971,10 +995,14 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
-#if defined(broadcast)
+#ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#ifdef broadcast2
+    if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -452,9 +452,13 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
-    [0x5a] = { .simd_size = simd_128, .two_op = 1 },
-    [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
+    [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
+    [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
+    [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x78] = { .simd_size = simd_other, .two_op = 1 },
+    [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
+    [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -2615,6 +2619,11 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
+    case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
+    case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
+        break;
+
     case 0xf0: /* movbe / crc32 */
         state->desc |= repne_prefix() ? ByteOp : Mov;
         if ( rep_prefix() )
@@ -8182,6 +8191,8 @@ x86_emulate(
         goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
+        op_bytes = elem_bytes;
         generate_exception_if(evex.w || evex.br, EXC_UD);
     avx512_broadcast:
         /*
@@ -8200,17 +8211,27 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
                                             /* vbroadcastf64x4 m256,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
+                                            /* vbroadcasti64x4 m256,zmm{k} */
         generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
                                             /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(!evex.lr, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
+                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
+        if ( b == 0x59 )
+            op_bytes = 8;
+        generate_exception_if(evex.br, EXC_UD);
         if ( !evex.w )
             host_and_vcpu_must_have(avx512dq);
         goto avx512_broadcast;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
                                             /* vbroadcastf64x2 m128,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
         generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.br,
                               EXC_UD);
         if ( evex.w )
@@ -8404,6 +8425,45 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+        op_bytes = elem_bytes = 1 << (b & 1);
+        /* See the comment at the avx512_broadcast label. */
+        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
+        generate_exception_if((ea.type != OP_REG || evex.br ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0xf8;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "+m" (src.val) : "a" (src.val));
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
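
The GPR-source broadcast forms added here can also be tried stand-alone
(a sketch, not part of the patch; assumes AVX512F hardware and gcc with
-mavx512f). The %k operand modifier forces the 32-bit register name,
just like in the broadcast2() helper above:

typedef int __attribute__((vector_size(64))) v16si_t;

static inline v16si_t bcast_gpr(int x)
{
    v16si_t r;

    /* EVEX-only encoding: vpbroadcastd r32,zmm. */
    asm ( "vpbroadcastd %k1, %0" : "=v" (r) : "r" (x) );
    return r;
}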





* [PATCH v3 25/34] x86emul: basic AVX512VL testing
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (23 preceding siblings ...)
  2018-09-18 12:08   ` [PATCH v3 24/34] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
@ 2018-09-18 12:09   ` Jan Beulich
  2018-09-18 12:09   ` [PATCH v3 26/34] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
                     ` (8 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:09 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Test the 128- and 256-bit variants of the insns which have been
implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -52,7 +52,7 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
-avx512f-vecs := 64
+avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
 
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -5,13 +5,13 @@ ENTRY(fma_test);
 
 #if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
-#elif VEC_SIZE == 16
+#elif VEC_SIZE == 16 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
 #  define to_bool(cmp) __builtin_ia32_vtestcpd(cmp, (vec_t){} == 0)
 # endif
-#elif VEC_SIZE == 32
+#elif VEC_SIZE == 32 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps256(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -533,7 +533,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define rotr(x, n) ((vec_t)__builtin_ia32_palignr128((vdi_t)(x), (vdi_t)(x), (n) * 64))
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE4_1__)
+#if VEC_SIZE == 16 && defined(__SSE4_1__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)__builtin_ia32_pmaxsb128((vqi_t)(x), (vqi_t)(y)))
 #  define min(x, y) ((vec_t)__builtin_ia32_pminsb128((vqi_t)(x), (vqi_t)(y)))
@@ -587,7 +587,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define mix(x, y) __builtin_ia32_blendpd(x, y, 0b10)
 # endif
 #endif
-#if VEC_SIZE == 32 && defined(__AVX__)
+#if VEC_SIZE == 32 && defined(__AVX__) && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define dot_product(x, y) ({ \
     vec_t t_ = __builtin_ia32_dpps256(x, y, 0b11110001); \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -92,6 +92,15 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+#if VEC_SIZE < 64
+# pragma GCC target ( "avx512vl" )
+#endif
+
+#define REN(insn, old, new)                      \
+    asm ( ".macro v" #insn #old " o:vararg \n\t" \
+          "v" #insn #new " \\o             \n\t" \
+          ".endm" )
+
 /*
  * The original plan was to effect use of EVEX encodings for scalar as well as
  * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
@@ -135,25 +144,88 @@ asm ( ".macro override insn    \n\t"
 #define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
 #define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
 
+OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
+OVR_INT(add);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
+OVR_FMA(fmaddsub, VFP);
 OVR_FMA(fmsub, FP);
+OVR_FMA(fmsubadd, VFP);
 OVR_FMA(fnmadd, FP);
 OVR_FMA(fnmsub, FP);
 OVR(insertps);
 OVR_FP(max);
+OVR_INT(maxs);
+OVR_INT(maxu);
 OVR_FP(min);
+OVR_INT(mins);
+OVR_INT(minu);
 OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
+OVR_VFP(mova);
+OVR_VFP(movnt);
+OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(shuf);
+OVR_INT(sll);
+OVR_DQ(sllv);
 OVR_FP(sqrt);
+OVR_INT(sra);
+OVR_DQ(srav);
+OVR_INT(srl);
+OVR_DQ(srlv);
 OVR_FP(sub);
+OVR_INT(sub);
 OVR_SFP(ucomi);
+OVR_VFP(unpckh);
+OVR_VFP(unpckl);
+
+#ifdef __AVX512VL__
+# if ELEM_SIZE == 8 && defined(__AVX512DQ__)
+REN(extract, f128, f64x2);
+REN(extract, i128, i64x2);
+REN(insert, f128, f64x2);
+REN(insert, i128, i64x2);
+# else
+REN(extract, f128, f32x4);
+REN(extract, i128, i32x4);
+REN(insert, f128, f32x4);
+REN(insert, i128, i32x4);
+# endif
+# if ELEM_SIZE == 8
+REN(movdqa, , 64);
+REN(movdqu, , 64);
+REN(pand, , q);
+REN(pandn, , q);
+REN(por, , q);
+REN(pxor, , q);
+# else
+#  if ELEM_SIZE == 1 && defined(__AVX512BW__)
+REN(movdq, a, u8);
+REN(movdqu, , 8);
+#  elif ELEM_SIZE == 2 && defined(__AVX512BW__)
+REN(movdq, a, u16);
+REN(movdqu, , 16);
+#  else
+REN(movdqa, , 32);
+REN(movdqu, , 32);
+#  endif
+REN(pand, , d);
+REN(pandn, , d);
+REN(por, , d);
+REN(pxor, , d);
+# endif
+OVR(movntdq);
+OVR(movntdqa);
+OVR(pmulld);
+OVR(pmuldq);
+OVR(pmuludq);
+#endif
 
 #undef OVR_VFP
 #undef OVR_SFP
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -88,6 +88,11 @@ static bool simd_check_avx512f(void)
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
 
+static bool simd_check_avx512f_vl(void)
+{
+    return cpu_has_avx512f && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512dq(void)
 {
     return cpu_has_avx512dq;
@@ -142,11 +147,21 @@ static const struct {
       .check_cpu = simd_check_ ## feat,                             \
       .set_regs = simd_set_regs,                                    \
       .check_regs = simd_check_regs }
+#define AVX512VL_(bits, desc, feat, form)                          \
+    { .code = feat ## _x86_ ## bits ## _D ## _ ## form,            \
+      .size = sizeof(feat ## _x86_ ## bits ## _D ## _ ## form),    \
+      .bitness = bits, .name = "AVX512" #desc,                     \
+      .check_cpu = simd_check_ ## feat ## _vl,                     \
+      .set_regs = simd_set_regs,                                   \
+      .check_regs = simd_check_regs }
 #ifdef __x86_64__
 # define SIMD(desc, feat, form) SIMD_(64, desc, feat, form), \
                                 SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(64, desc, feat, form), \
+                                    AVX512VL_(32, desc, feat, form)
 #else
 # define SIMD(desc, feat, form) SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(32, desc, feat, form)
 #endif
     SIMD(3DNow! single,          _3dnow,     8f4),
     SIMD(SSE scalar single,      sse,         f4),
@@ -257,6 +272,20 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    AVX512VL(VL f32x4,        avx512f,      16f4),
+    AVX512VL(VL f64x2,        avx512f,      16f8),
+    AVX512VL(VL f32x8,        avx512f,      32f4),
+    AVX512VL(VL f64x4,        avx512f,      32f8),
+    AVX512VL(VL s32x4,        avx512f,      16i4),
+    AVX512VL(VL u32x4,        avx512f,      16u4),
+    AVX512VL(VL s32x8,        avx512f,      32i4),
+    AVX512VL(VL u32x8,        avx512f,      32u4),
+    AVX512VL(VL s64x2,        avx512f,      16i8),
+    AVX512VL(VL u64x2,        avx512f,      16u8),
+    AVX512VL(VL s64x4,        avx512f,      32i8),
+    AVX512VL(VL u64x4,        avx512f,      32u8),
+#undef AVX512VL_
+#undef AVX512VL
 #undef SIMD_
 #undef SIMD
 };
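
To illustrate what the REN() renaming above generates (a sketch, not
part of the patch): REN(pand, , q), for example, emits an assembler
macro redirecting the legacy mnemonic to its EVEX-only counterpart, so
compiler-generated vpand comes out as vpandq:

/* Roughly what REN(pand, , q) expands to: */
asm ( ".macro vpand o:vararg \n\t"
      "vpandq \\o            \n\t"
      ".endm" );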





* [PATCH v3 26/34] x86emul: support AVX512{F, BW} zero- and sign-extending moves
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (24 preceding siblings ...)
  2018-09-18 12:09   ` [PATCH v3 25/34] x86emul: basic AVX512VL testing Jan Beulich
@ 2018-09-18 12:09   ` Jan Beulich
  2018-09-18 12:09   ` [PATCH v3 27/34] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
                     ` (7 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:09 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note that the testing in simd.c doesn't really follow the ISA extension
pattern: to fit the scheme, extensions from byte- and word-granular
vectors can (currently) sensibly happen only in the AVX512BW case (and
hence the respective abstraction macros will be added there rather than
here).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -163,6 +163,16 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
+    INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
+    INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
+    INSN(pmovzxwq,     66, 0f38, 34,    vl_4,    w, vl),
+    INSN(pmovzxdq,     66, 0f38, 35,    vl_2, d_nb, vl),
     INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
@@ -260,6 +270,8 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,   w, vl),
     INSN(pminub,      66,   0f, da,    vl,   b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,   w, vl),
+    INSN(pmovsxbw,    66, 0f38, 20,    vl_2, b, vl),
+    INSN(pmovzxbw,    66, 0f38, 30,    vl_2, b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,   w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,   w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,   w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -441,13 +441,23 @@ static const struct ext0f38_table {
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
+    [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x23] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x24] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
-    [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
+    [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x32] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x33] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x34] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x35] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
@@ -8297,6 +8307,25 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x25): /* vpmovsxdq {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
+        elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -305,10 +305,12 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxdq, _mask, x, (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 4
 #  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -171,6 +171,16 @@ OVR_VFP(mova);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
+OVR(pmovsxbd);
+OVR(pmovsxbq);
+OVR(pmovsxdq);
+OVR(pmovsxwd);
+OVR(pmovsxwq);
+OVR(pmovzxbd);
+OVR(pmovzxbq);
+OVR(pmovzxdq);
+OVR(pmovzxwd);
+OVR(pmovzxwq);
 OVR_VFP(shuf);
 OVR_INT(sll);
 OVR_DQ(sllv);
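
The widening conversions exercised through widen1() above can also be
written out directly (a sketch, not part of the patch; assumes AVX512F
hardware and gcc with -mavx512f): vpmovsxdq widens eight dwords to
eight qwords with sign extension, vpmovzxdq with zero extension.

typedef int       __attribute__((vector_size(32))) v8si_t;
typedef long long __attribute__((vector_size(64))) v8di_t;

static inline v8di_t widen_signed(v8si_t x)
{
    v8di_t r;

    /* ymm source, zmm destination. */
    asm ( "vpmovsxdq %1, %0" : "=v" (r) : "v" (x) );
    return r;
}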





* [PATCH v3 27/34] x86emul: support AVX512{F, BW} down conversion moves
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (25 preceding siblings ...)
  2018-09-18 12:09   ` [PATCH v3 26/34] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
@ 2018-09-18 12:09   ` Jan Beulich
  2018-09-18 12:10   ` [PATCH v3 28/34] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
                     ` (6 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:09 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note that the vpmov{,s,us}{d,q}w table entries in evex-disp8.c are
slightly different from what one would expect, because they require
EVEX.W to be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -163,11 +163,26 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovdb,       f3, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovdw,       f3, 0f38, 33,    vl_2,    b, vl),
+    INSN(pmovqb,       f3, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovqd,       f3, 0f38, 35,    vl_2, d_nb, vl),
+    INSN(pmovqw,       f3, 0f38, 34,    vl_4,    b, vl),
+    INSN(pmovsdb,      f3, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsdw,      f3, 0f38, 23,    vl_2,    b, vl),
+    INSN(pmovsqb,      f3, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsqd,      f3, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovsqw,      f3, 0f38, 24,    vl_4,    b, vl),
     INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
     INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
     INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
     INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
     INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovusdb,     f3, 0f38, 11,    vl_4,    b, vl),
+    INSN(pmovusdw,     f3, 0f38, 13,    vl_2,    b, vl),
+    INSN(pmovusqb,     f3, 0f38, 12,    vl_8,    b, vl),
+    INSN(pmovusqd,     f3, 0f38, 15,    vl_2, d_nb, vl),
+    INSN(pmovusqw,     f3, 0f38, 14,    vl_4,    b, vl),
     INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
     INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
     INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
@@ -270,7 +285,10 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,   w, vl),
     INSN(pminub,      66,   0f, da,    vl,   b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,   w, vl),
+    INSN(pmovswb,     f3, 0f38, 20,    vl_2, b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2, b, vl),
+    INSN(pmovuswb,    f3, 0f38, 10,    vl_2, b, vl),
+    INSN(pmovwb,      f3, 0f38, 30,    vl_2, b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2, b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,   w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,   w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -271,6 +271,17 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextracti{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextracti32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -285,6 +296,7 @@ static inline bool _to_bool(byte_vec_t b
 })
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -714,6 +726,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if VEC_SIZE >= 16
+
+# if !defined(low_half) && defined(HALF_SIZE)
+static inline half_t low_half(vec_t x)
+{
+#  if HALF_SIZE < VEC_SIZE
+    half_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 2; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1081,6 +1114,21 @@ int simd_test(void)
 
 #endif
 
+#if defined(widen1) && defined(shrink1)
+    {
+        half_t aux1 = low_half(src), aux2;
+
+        touch(aux1);
+        x = widen1(aux1);
+        touch(x);
+        aux2 = shrink1(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 2; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
 #ifdef dup_lo
     touch(src);
     x = dup_lo(src);
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,23 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE >= 16
+
+# if ELEM_COUNT >= 2
+#  if VEC_SIZE > 32
+#   define HALF_SIZE (VEC_SIZE / 2)
+#  else
+#   define HALF_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(HALF_SIZE))) half_t;
+typedef char __attribute__((vector_size(HALF_SIZE))) vqi_half_t;
+typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
+typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
+typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+# endif
+
+#endif
+
 #if VEC_SIZE == 16
 # define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
 # define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3046,7 +3046,22 @@ x86_decode(
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
             if ( evex_encoded() )
-                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+            {
+                /*
+                 * VPMOVUS* are identical to VPMOVS* Disp8-scaling-wise, but
+                 * their attributes don't match those of the vex_66 encoded
+                 * insns with the same base opcodes. Rather than adding new
+                 * columns to the table, handle this here for now.
+                 */
+                if ( evex.pfx != vex_f3 || (b & 0xf8) != 0x10 )
+                    disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+                else
+                {
+                    disp8scale = decode_disp8scale(ext0f38_table[b + 0x10].d8s,
+                                                   state);
+                    state->simd_size = simd_other;
+                }
+            }
             break;
 
         case ext_0f3a:
@@ -8307,10 +8322,14 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10): /* vpmovuswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20): /* vpmovswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30): /* vpmovwb [xyz]mm,{x,y}mm/mem{k} */
         host_and_vcpu_must_have(avx512bw);
-        /* fall through */
+        if ( evex.pfx != vex_f3 )
+        {
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
@@ -8321,7 +8340,28 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
-        generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+            generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        }
+        else
+        {
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x11): /* vpmovusdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x12): /* vpmovusqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x13): /* vpmovusdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x14): /* vpmovusqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* vpmovusqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x21): /* vpmovsdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x22): /* vpmovsqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x23): /* vpmovsdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x24): /* vpmovsqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* vpmovsqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x31): /* vpmovdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x32): /* vpmovqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x33): /* vpmovdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x34): /* vpmovqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* vpmovqd [xyz]mm,{x,y}mm/mem{k} */
+            generate_exception_if(evex.w, EXC_UD);
+            d = DstMem | SrcReg | TwoOp;
+        }
         op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
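
The down-converting direction handled here is the inverse of the
widening moves from the previous patch; stand-alone it looks like this
(a sketch, not part of the patch; assumes AVX512F hardware and gcc with
-mavx512f). Together with vpmovsxdq/vpmovzxdq it forms the round trip
the harness checks via shrink1() and widen1().

typedef long long __attribute__((vector_size(64))) v8di_t;
typedef int       __attribute__((vector_size(32))) v8si_t;

static inline v8si_t narrow(v8di_t x)
{
    v8si_t r;

    /* zmm source, ymm destination; plain (truncating) form. */
    asm ( "vpmovqd %1, %0" : "=v" (r) : "v" (x) );
    return r;
}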





* [PATCH v3 28/34] x86emul: support AVX512{F, BW} integer unpack insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (26 preceding siblings ...)
  2018-09-18 12:09   ` [PATCH v3 27/34] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
@ 2018-09-18 12:10   ` Jan Beulich
  2018-09-18 12:11   ` [PATCH v3 29/34] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
                     ` (5 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:10 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Once again there's one extra twobyte_table[] entry which gets its Disp8
shift value set right away without support getting implemented just
yet, so as to avoid needlessly splitting groups of entries.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -215,6 +215,10 @@ static const struct test avx512f_all[] =
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
+    INSN(punpckhdq,    66,   0f, 6a,    vl,      d, vl),
+    INSN(punpckhqdq,   66,   0f, 6d,    vl,      q, vl),
+    INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
+    INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -313,6 +317,10 @@ static const struct test avx512bw_all[]
     INSN(psubw,       66,   0f, f9,    vl,   w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,  bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,  bw, vl),
+    INSN(punpckhbw,   66,   0f, 68,    vl,   b, vl),
+    INSN(punpckhwd,   66,   0f, 69,    vl,   w, vl),
+    INSN(punpcklbw,   66,   0f, 60,    vl,   b, vl),
+    INSN(punpcklwd,   66,   0f, 61,    vl,   w, vl),
 };
 
 static const struct test avx512bw_128[] = {
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -294,6 +294,10 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
@@ -311,6 +315,10 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -198,6 +198,10 @@ OVR(pmovzxbq);
 OVR(pmovzxdq);
 OVR(pmovzxwd);
 OVR(pmovzxwq);
+OVR(punpckhdq);
+OVR(punpckhqdq);
+OVR(punpckldq);
+OVR(punpcklqdq);
 OVR_VFP(shuf);
 OVR_INT(sll);
 OVR_DQ(sllv);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -310,10 +310,10 @@ static const struct twobyte_table {
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
+    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
@@ -6632,6 +6632,12 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6660,6 +6666,13 @@ x86_emulate(
         elem_bytes = 1 << (b & 1);
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x62): /* vpunpckldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6a): /* vpunpckhdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        fault_suppression = false;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe2): /* vpsra{d,q} xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6702,6 +6715,10 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
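
For reference, the interleave semantics tested above, shown stand-alone
(a sketch, not part of the patch; the 128-bit VEX form used here merely
assumes AVX): vpunpckldq interleaves the low halves of its sources,
dword by dword.

typedef int __attribute__((vector_size(16))) v4si_t;

static inline v4si_t interleave_lo_dq(v4si_t x, v4si_t y)
{
    v4si_t r;

    /* r = { x[0], y[0], x[1], y[1] } */
    asm ( "vpunpckldq %2, %1, %0" : "=x" (r) : "x" (x), "x" (y) );
    return r;
}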





* [PATCH v3 29/34] x86emul: support AVX512{F, BW, _VBMI} full permute insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (27 preceding siblings ...)
  2018-09-18 12:10   ` [PATCH v3 28/34] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
@ 2018-09-18 12:11   ` Jan Beulich
  2018-09-18 12:11   ` [PATCH v3 29/34] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
                     ` (4 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:11 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Take the liberty of also correcting the (public interface) name of the
AVX512_VBMI feature flag, on the assumption that no external consumer
has actually been using that flag so far. Furthermore make AVX512BW
instead of AVX512F its prerequisite, since the byte-granular insns
require full 64-bit mask registers (the upper 48 bits of which can't
be accessed other than through XSAVE/XRSTOR without AVX512BW support).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -159,6 +159,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
+    INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -280,6 +284,8 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,   b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,   w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,  bw, vl),
+    INSN(permi2w,     66, 0f38, 75,    vl,   w, vl),
+    INSN(permt2w,     66, 0f38, 7d,    vl,   w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,   w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,   b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,   w, vl),
@@ -364,6 +370,11 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512_vbmi_all[] = {
+    INSN(permi2b,       66, 0f38, 75, vl, b, vl),
+    INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -684,4 +695,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -144,6 +144,9 @@ static inline bool _to_bool(byte_vec_t b
 #    define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #    define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #    define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#   else
+#    define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
+#    define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #   endif
 #  elif FLOAT_SIZE == 8
 #   if VEC_SIZE >= 32
@@ -168,6 +171,9 @@ static inline bool _to_bool(byte_vec_t b
 #    define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #    define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #    define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#   else
+#    define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
+#    define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
 #   endif
 #  endif
 # endif
@@ -297,6 +303,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -318,6 +327,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
@@ -763,6 +775,7 @@ int simd_test(void)
 {
     unsigned int i, j;
     vec_t x, y, z, src, inv, alt, sh;
+    vint_t interleave_lo, interleave_hi;
 
     for ( i = 0, j = ELEM_SIZE << 3; i < ELEM_COUNT; ++i )
     {
@@ -776,6 +789,9 @@ int simd_test(void)
         if ( !(i & (i + 1)) )
             --j;
         sh[i] = j;
+
+        interleave_lo[i] = ((i & 1) * ELEM_COUNT) | (i >> 1);
+        interleave_hi[i] = interleave_lo[i] + (ELEM_COUNT / 2);
     }
 
     touch(src);
@@ -1069,7 +1085,7 @@ int simd_test(void)
     x = src * alt;
     y = interleave_lo(x, alt < 0);
     touch(x);
-    z = widen1(x);
+    z = widen1(low_half(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1101,7 +1117,7 @@ int simd_test(void)
 
 # ifdef widen1
     touch(src);
-    x = widen1(src);
+    x = widen1(low_half(src));
     touch(src);
     if ( !eq(x, y) ) return __LINE__;
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,16 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if ELEM_SIZE == 1
+typedef vqi_t vint_t;
+#elif ELEM_SIZE == 2
+typedef vhi_t vint_t;
+#elif ELEM_SIZE == 4
+typedef vsi_t vint_t;
+#elif ELEM_SIZE == 8
+typedef vdi_t vint_t;
+#endif
+
 #if VEC_SIZE >= 16
 
 # if ELEM_COUNT >= 2
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -279,6 +279,16 @@ static inline uint64_t xgetbv(uint32_t x
     (res.b & (1U << 31)) != 0; \
 })
 
+#define cpu_has_avx512_vbmi ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.c = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.c & (1U << 1)) != 0; \
+})
+
 int emul_test_cpuid(
     uint32_t leaf,
     uint32_t subleaf,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -466,9 +466,13 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
     [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
+    [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -1850,6 +1854,7 @@ static bool vcpu_has(
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
+#define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -6001,6 +6006,11 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x76): /* vpermi2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x77): /* vpermi2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7e): /* vpermt2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7f): /* vpermt2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8511,6 +8521,16 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512_vbmi);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -107,6 +107,9 @@
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
+/* CPUID level 0x00000007:0.ecx */
+#define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
+
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
 
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -224,7 +224,7 @@ XEN_CPUFEATURE(AVX512VL,      5*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.ecx, word 6 */
 XEN_CPUFEATURE(PREFETCHWT1,   6*32+ 0) /*A  PREFETCHWT1 instruction */
-XEN_CPUFEATURE(AVX512VBMI,    6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -254,12 +254,17 @@ def crunch_numbers(state):
         AVX2: [AVX512F],
 
         # AVX512F is taken to mean hardware support for 512bit registers
-        # (which in practice depends on the EVEX prefix to encode), and the
-        # instructions themselves. All further AVX512 features are built on
-        # top of AVX512F
+        # (which in practice depends on the EVEX prefix to encode) as well
+        # as mask registers, and the instructions themselves. All further
+        # AVX512 features are built on top of AVX512F
         AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
-                  AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
-                  AVX512_4FMAPS, AVX512_VPOPCNTDQ],
+                  AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
+                  AVX512_VPOPCNTDQ],
+
+        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # dependents of AVX512BW (as they require wider than 16-bit mask
+        # registers), despite the SDM not formally making this connection.
+        AVX512BW: [AVX512_VBMI],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors

* [PATCH v3 29/34] x86emul: support AVX512{F, BW} integer shuffle insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (28 preceding siblings ...)
  2018-09-18 12:11   ` [PATCH v3 29/34] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
@ 2018-09-18 12:11   ` Jan Beulich
  2018-09-18 12:12   ` [PATCH v3 30/34] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
                     ` (3 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:11 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also include shuff{32x4,64x2}, as these are very similar to shufi{32x4,64x2}.

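For reference, the 128-bit lane selection common to all four insns can
be modeled as below (a sketch for 512-bit operands only; the type and
function names are invented). Each destination lane is picked from the
first source (low two result lanes) or the second source (high two),
with two imm8 bits per lane selecting the source lane:

    #include <stdint.h>

    typedef struct { uint32_t d[4]; } lane_t; /* one 128-bit lane */

    static void vshufi32x4_model(lane_t dst[4], const lane_t s1[4],
                                 const lane_t s2[4], uint8_t imm)
    {
        unsigned int l;

        for ( l = 0; l < 4; ++l )
            dst[l] = (l < 2 ? s1 : s2)[(imm >> (2 * l)) & 3];
    }

The swap() additions to the test harness below use exactly this (with
both sources being the same register) to reverse lane order, before a
second shuffle reverses the elements within each lane.
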
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -200,6 +200,7 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
     INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
@@ -250,6 +251,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
+    INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
+    INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
+    INSN(shufi64x2,      66, 0f3a, 43, vl,    q, vl),
 };
 
 static const struct test avx512f_512[] = {
@@ -304,6 +309,9 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,   w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,   w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,   b, vl),
+    INSN(pshufb,      66, 0f38, 00,    vl,   b, vl),
+    INSN(pshufhw,     f3,   0f, 70,    vl,   w, vl),
+    INSN(pshuflw,     f2,   0f, 70,    vl,   w, vl),
     INSNX(pslldq,     66,   0f, 73, 7, vl,   b, vl),
     INSN(psllvw,      66, 0f38, 12,    vl,   w, vl),
     INSN(psllw,       66,   0f, f1,    el_8, w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -147,6 +147,10 @@ static inline bool _to_bool(byte_vec_t b
 #   else
 #    define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #    define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
+#    define swap(x) ({ \
+    vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
+})
 #   endif
 #  elif FLOAT_SIZE == 8
 #   if VEC_SIZE >= 32
@@ -174,6 +178,10 @@ static inline bool _to_bool(byte_vec_t b
 #   else
 #    define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #    define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
+#    define swap(x) ({ \
+    vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
+})
 #   endif
 #  endif
 # endif
@@ -303,9 +311,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
+                               VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
+                             0b00011011, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -327,9 +340,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b01001110, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
+                                      VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -119,6 +119,12 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+/* Sadly there are a few exceptions to the general naming rules. */
+#define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
+#define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
+#define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
+#define __builtin_ia32_shuf_i64x2_512_mask __builtin_ia32_shuf_i64x2_mask
+
 #if VEC_SIZE < 64
 # pragma GCC target ( "avx512vl" )
 #endif
@@ -208,6 +214,7 @@ OVR(pmovzxbq);
 OVR(pmovzxdq);
 OVR(pmovzxwd);
 OVR(pmovzxwq);
+OVR(pshufd);
 OVR(punpckhdq);
 OVR(punpckhqdq);
 OVR(punpckldq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -316,7 +316,7 @@ static const struct twobyte_table {
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
+    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other, d8s_vl },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
@@ -430,7 +430,8 @@ static const struct ext0f38_table {
     uint8_t vsib:1;
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
-    [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
@@ -541,6 +542,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq },
+    [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
@@ -550,6 +552,7 @@ static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
+    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6653,6 +6656,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6924,6 +6928,20 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x70): /* vpshufd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x70): /* vpshufhw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x70): /* vpshuflw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx == vex_66 )
+            generate_exception_if(evex.w, EXC_UD);
+        else
+        {
+            host_and_vcpu_must_have(avx512bw);
+            generate_exception_if(evex.br, EXC_UD);
+        }
+        d = (d & ~SrcMask) | SrcMem | TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_imm_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x71):    /* Grp12 */
     case X86EMUL_OPC_VEX_66(0x0f, 0x71):
     CASE_SIMD_PACKED_INT(0x0f, 0x72):    /* Grp13 */
@@ -9093,7 +9111,13 @@ x86_emulate(
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x23): /* vshuff32x4 $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+                                            /* vshuff64x2 $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x43): /* vshufi32x4 $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+                                            /* vshufi64x2 $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
         fault_suppression = false;
         goto avx512f_imm_no_sae;

* [PATCH v3 30/34] x86emul: support AVX512{BW, DQ} mask move insns
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (29 preceding siblings ...)
  2018-09-18 12:11   ` [PATCH v3 29/34] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
@ 2018-09-18 12:12   ` Jan Beulich
  2018-09-18 12:14   ` [PATCH v3 32/34] x86emul: basic AVX512BW testing Jan Beulich
                     ` (2 subsequent siblings)
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:12 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Entries are added to the tables in evex-disp8.c even though these insns
don't allow for memory operands, so that the tables eventually give a
complete picture of the supported EVEX-encoded insns.

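For reference, the two directions can be modeled as below for byte
elements (a sketch, not the emulator's code; the function names are
invented): vpmovm2b broadcasts each mask bit into a whole element,
while vpmovb2m collects each element's most significant bit back into
the mask, making the round trip lossless (which is what the opmask.S
addition verifies):

    #include <stdint.h>

    static void vpmovm2b_model(uint8_t vec[64], uint64_t k)
    {
        unsigned int i;

        for ( i = 0; i < 64; ++i )
            vec[i] = (k >> i) & 1 ? 0xff : 0x00;
    }

    static uint64_t vpmovb2m_model(const uint8_t vec[64])
    {
        uint64_t k = 0;
        unsigned int i;

        for ( i = 0; i < 64; ++i )
            k |= (uint64_t)(vec[i] >> 7) << i;

        return k;
    }

Note also the seemingly inverted SIZE-to-suffix mapping in the new
_v() macros: an N-bit opmask drives N elements of a 512-bit register,
so an 8-bit mask pairs with quadwords and a 64-bit one with bytes.
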
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -300,9 +300,12 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,   w, vl),
     INSN(pminub,      66,   0f, da,    vl,   b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,   w, vl),
+//       pmovb2m,     f3, 0f38, 29,          b
+//       pmovm2,      f3, 0f38, 28,         bw
     INSN(pmovswb,     f3, 0f38, 20,    vl_2, b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2, b, vl),
     INSN(pmovuswb,    f3, 0f38, 10,    vl_2, b, vl),
+//       pmovw2m,     f3, 0f38, 29,          w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2, b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2, b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,   w, vl),
@@ -350,6 +353,9 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
+//       pmovd2m,        f3, 0f38, 39,        d
+//       pmovm2,         f3, 0f38, 38,       dq
+//       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
--- a/tools/tests/x86_emulator/opmask.S
+++ b/tools/tests/x86_emulator/opmask.S
@@ -12,17 +12,23 @@
 
 #if SIZE == 1
 # define _(x) x##b
+# define _v(x, t) _v_(x##q, t)
 #elif SIZE == 2
 # define _(x) x##w
+# define _v(x, t) _v_(x##d, t)
 # define WIDEN(x) x##bw
 #elif SIZE == 4
 # define _(x) x##d
+# define _v(x, t) _v_(x##w, t)
 # define WIDEN(x) x##wd
 #elif SIZE == 8
 # define _(x) x##q
+# define _v(x, t) _v_(x##b, t)
 # define WIDEN(x) x##dq
 #endif
 
+#define _v_(x, t) v##x##t
+
     .macro check res1:req, res2:req, line:req
     _(kmov)       %\res1, DATA(out)
 #if SIZE < 8 || !defined(__i386__)
@@ -131,6 +137,15 @@ _start:
 
 #endif
 
+#if SIZE > 2 ? defined(__AVX512BW__) : defined(__AVX512DQ__)
+
+    _(kmov)       DATA(in1), %k0
+    _v(pmovm2,)   %k0, %zmm7
+    _v(pmov,2m)   %zmm7, %k3
+    check         k0, k3, __LINE__
+
+#endif
+
     xor           %eax, %eax
     ret
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -8411,6 +8411,21 @@ x86_emulate(
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x29): /* vpmov{b,w}2m [xyz]mm,k */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x39): /* vpmov{d,q}2m [xyz]mm,k */
+        generate_exception_if(!evex.r || !evex.R, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x28): /* vpmovm2{b,w} k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x38): /* vpmovm2{d,q} k,[xyz]mm */
+        if ( b & 0x10 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.opmsk || ea.type != OP_REG, EXC_UD);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);

* [PATCH v3 32/34] x86emul: basic AVX512BW testing
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (30 preceding siblings ...)
  2018-09-18 12:12   ` [PATCH v3 30/34] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
@ 2018-09-18 12:14   ` Jan Beulich
  2018-09-18 12:14   ` [PATCH v3 33/34] x86emul: basic AVX512DQ testing Jan Beulich
  2018-09-18 12:14   ` [PATCH v3 34/34] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:14 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Test a variety of the insns which have already been implemented.

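Among the additions are round trips through the new widen/shrink
helpers; in scalar terms each check amounts to the below (a sketch
assuming unsigned byte elements in a 512-bit vector; the helper and
variable names are invented):

    #include <stdint.h>

    /* widen2()/shrink2() round trip: zero-extend each byte of the low
     * quarter to a dword (vpmovzxbd), then truncate back (vpmovdb);
     * the low quarter of the source must survive unchanged. */
    static void roundtrip2(const uint8_t src[16], uint8_t out[16])
    {
        uint32_t wide[16];
        unsigned int i;

        for ( i = 0; i < 16; ++i )
            wide[i] = src[i];           /* widen2() */
        for ( i = 0; i < 16; ++i )
            out[i] = (uint8_t)wide[i];  /* shrink2() truncates */
    }
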
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -11,7 +11,7 @@ all: $(TARGET)
 run: $(TARGET)
 	./$(TARGET)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -55,6 +55,9 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512bw-vecs := $(avx512f-vecs)
+avx512bw-ints := 1 2
+avx512bw-flts :=
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -31,6 +31,10 @@ ENTRY(simd_test);
 #  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
 # elif FLOAT_SIZE == 8
 #  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif INT_SIZE == 1
+#  define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# elif INT_SIZE == 2
+#  define eq(x, y) (B(pcmpeqw, _mask, x, y, -1) == ALL_TRUE)
 # elif INT_SIZE == 4
 #  define eq(x, y) (B(pcmpeqd, _mask, x, y, -1) == ALL_TRUE)
 # elif INT_SIZE == 8
@@ -368,6 +372,87 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # endif
+#elif (INT_SIZE == 1 || UINT_SIZE == 1 || INT_SIZE == 2 || UINT_SIZE == 2) && \
+      defined(__AVX512BW__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 1 || UINT_SIZE == 1
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastb %1, %0" \
+          : "=v" (t_) : "m" (*(char[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastb %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  elif defined(__AVX512VBMI__)
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
+#  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+# elif INT_SIZE == 2 || UINT_SIZE == 2
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastw %1, %0" \
+          : "=v" (t_) : "m" (*(short[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastw %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(pshufhw, _mask, \
+                                      B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
+                                      0b00011011, (vhi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+# endif
+# if INT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovsxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 2
+#  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
+#  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
+#  define widen1(x) ((vec_t)B(pmovsxwd, _mask, x, (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxwq, _mask, x, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 2
+#  define max(x, y) ((vec_t)B(pmaxuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define mul_hi(x, y) ((vec_t)B(pmulhuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxwd, _mask, (vhi_half_t)(x), (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxwq, _mask, (vhi_quarter_t)(x), (vdi_t)undef(), ~0))
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
@@ -559,7 +644,7 @@ static inline bool _to_bool(byte_vec_t b
 #  endif
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSSE3__)
+#if VEC_SIZE == 16 && defined(__SSSE3__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define abs(x) ((vec_t)__builtin_ia32_pabsb128((vqi_t)(x)))
 # elif INT_SIZE == 2
@@ -783,6 +868,40 @@ static inline half_t low_half(vec_t x)
 }
 # endif
 
+# if !defined(low_quarter) && defined(QUARTER_SIZE)
+static inline quarter_t low_quarter(vec_t x)
+{
+#  if QUARTER_SIZE < VEC_SIZE
+    quarter_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+# if !defined(low_eighth) && defined(EIGHTH_SIZE)
+static inline eighth_t low_eighth(vec_t x)
+{
+#  if EIGHTH_SIZE < VEC_SIZE
+    eighth_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
 #endif
 
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
@@ -1111,7 +1230,7 @@ int simd_test(void)
     y = interleave_lo(alt < 0, alt < 0);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen2(x);
+    z = widen2(low_quarter(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1120,7 +1239,7 @@ int simd_test(void)
     y = interleave_lo(y, y);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen3(x);
+    z = widen3(low_eighth(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 #  endif
@@ -1142,14 +1261,14 @@ int simd_test(void)
 
 # ifdef widen2
     touch(src);
-    x = widen2(src);
+    x = widen2(low_quarter(src));
     touch(src);
     if ( !eq(x, z) ) return __LINE__;
 # endif
 
 # ifdef widen3
     touch(src);
-    x = widen3(src);
+    x = widen3(low_eighth(src));
     touch(src);
     if ( !eq(x, interleave_lo(z, (vec_t){})) ) return __LINE__;
 # endif
@@ -1169,6 +1288,36 @@ int simd_test(void)
             if ( aux2[i] != src[i] )
                 return __LINE__;
     }
+#endif
+
+#if defined(widen2) && defined(shrink2)
+    {
+        quarter_t aux1 = low_quarter(src), aux2;
+
+        touch(aux1);
+        x = widen2(aux1);
+        touch(x);
+        aux2 = shrink2(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 4; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
+#if defined(widen3) && defined(shrink3)
+    {
+        eighth_t aux1 = low_eighth(src), aux2;
+
+        touch(aux1);
+        x = widen3(aux1);
+        touch(x);
+        aux2 = shrink3(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 8; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
 #endif
 
 #ifdef dup_lo
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -95,6 +95,32 @@ typedef int __attribute__((vector_size(H
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
 # endif
 
+# if ELEM_COUNT >= 4
+#  if VEC_SIZE > 64
+#   define QUARTER_SIZE (VEC_SIZE / 4)
+#  else
+#   define QUARTER_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(QUARTER_SIZE))) quarter_t;
+typedef char __attribute__((vector_size(QUARTER_SIZE))) vqi_quarter_t;
+typedef short __attribute__((vector_size(QUARTER_SIZE))) vhi_quarter_t;
+typedef int __attribute__((vector_size(QUARTER_SIZE))) vsi_quarter_t;
+typedef long long __attribute__((vector_size(QUARTER_SIZE))) vdi_quarter_t;
+# endif
+
+# if ELEM_COUNT >= 8
+#  if VEC_SIZE > 128
+#   define EIGHTH_SIZE (VEC_SIZE / 8)
+#  else
+#   define EIGHTH_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(EIGHTH_SIZE))) eighth_t;
+typedef char __attribute__((vector_size(EIGHTH_SIZE))) vqi_eighth_t;
+typedef short __attribute__((vector_size(EIGHTH_SIZE))) vhi_eighth_t;
+typedef int __attribute__((vector_size(EIGHTH_SIZE))) vsi_eighth_t;
+typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
+# endif
+
 #endif
 
 #if VEC_SIZE == 16
@@ -182,6 +208,9 @@ OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
 OVR_INT(add);
+OVR_BW(adds);
+OVR_BW(addus);
+OVR_BW(avg);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
@@ -229,6 +258,8 @@ OVR_INT(srl);
 OVR_DQ(srlv);
 OVR_FP(sub);
 OVR_INT(sub);
+OVR_BW(subs);
+OVR_BW(subus);
 OVR_SFP(ucomi);
 OVR_VFP(unpckh);
 OVR_VFP(unpckl);
@@ -275,6 +306,29 @@ OVR(pmuldq);
 OVR(pmuludq);
 #endif
 
+#ifdef __AVX512BW__
+OVR(pextrb);
+OVR(pextrw);
+OVR(pinsrb);
+OVR(pinsrw);
+OVR(pmaddwd);
+OVR(pmovsxbw);
+OVR(pmovzxbw);
+OVR(pmulhuw);
+OVR(pmulhw);
+OVR(pmullw);
+OVR(psadbw);
+OVR(pshufb);
+OVR(pshufhw);
+OVR(pshuflw);
+OVR(punpckhbw);
+OVR(punpckhwd);
+OVR(punpcklbw);
+OVR(punpcklwd);
+OVR(slldq);
+OVR(srldq);
+#endif
+
 #undef OVR_VFP
 #undef OVR_SFP
 #undef OVR_INT
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512bw.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -105,6 +106,11 @@ static bool simd_check_avx512bw(void)
 }
 #define simd_check_avx512bw_opmask simd_check_avx512bw
 
+static bool simd_check_avx512bw_vl(void)
+{
+    return cpu_has_avx512bw && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -284,6 +290,18 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512BW s8x64,     avx512bw,      64i1),
+    SIMD(AVX512BW u8x64,     avx512bw,      64u1),
+    SIMD(AVX512BW s16x32,    avx512bw,      64i2),
+    SIMD(AVX512BW u16x32,    avx512bw,      64u2),
+    AVX512VL(BW+VL s8x16,    avx512bw,      16i1),
+    AVX512VL(BW+VL u8x16,    avx512bw,      16u1),
+    AVX512VL(BW+VL s8x32,    avx512bw,      32i1),
+    AVX512VL(BW+VL u8x32,    avx512bw,      32u1),
+    AVX512VL(BW+VL s16x8,    avx512bw,      16i2),
+    AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
+    AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
+    AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_

* [PATCH v3 33/34] x86emul: basic AVX512DQ testing
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (31 preceding siblings ...)
  2018-09-18 12:14   ` [PATCH v3 32/34] x86emul: basic AVX512BW testing Jan Beulich
@ 2018-09-18 12:14   ` Jan Beulich
  2018-09-18 12:14   ` [PATCH v3 34/34] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:14 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Test a variety of the insns which have already been implemented.

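The new broadcast/insert checks rely on the source vector's lowest
fragment being what gets broadcast, so position 0 never needs an
explicit insertion; e.g. for quartets (a sketch using the test's own
macros, mirroring the simd_test() additions below):

    quarter_t aux = low_quarter(src);

    x = broadcast_quarter(aux);       /* aux in all four positions */
    y = insert_quarter(src, aux, 1);  /* position 0 already holds aux */
    y = insert_quarter(y, aux, 2);
    y = insert_quarter(y, aux, 3);
    /* x and y are now expected to compare equal. */
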
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -11,7 +11,7 @@ all: $(TARGET)
 run: $(TARGET)
 	./$(TARGET)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -58,9 +58,12 @@ avx512f-flts := 4 8
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
+avx512dq-vecs := $(avx512f-vecs)
+avx512dq-ints := $(avx512f-ints)
+avx512dq-flts := $(avx512f-flts)
 
 avx512f-opmask-vecs := 2
-avx512dq-opmask-vecs := 1
+avx512dq-opmask-vecs := 1 2
 avx512bw-opmask-vecs := 4 8
 
 # For AVX and later, have the compiler avoid XMM0 to widen coverage of
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -121,6 +121,34 @@ typedef int __attribute__((vector_size(E
 typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
 # endif
 
+# define DECL_PAIR(w) \
+typedef w ## _t pair_t; \
+typedef vsi_ ## w ## _t vsi_pair_t; \
+typedef vdi_ ## w ## _t vdi_pair_t
+# define DECL_QUARTET(w) \
+typedef w ## _t quartet_t; \
+typedef vsi_ ## w ## _t vsi_quartet_t; \
+typedef vdi_ ## w ## _t vdi_quartet_t
+# define DECL_OCTET(w) \
+typedef w ## _t octet_t; \
+typedef vsi_ ## w ## _t vsi_octet_t; \
+typedef vdi_ ## w ## _t vdi_octet_t
+
+# if ELEM_COUNT == 4
+DECL_PAIR(half);
+# elif ELEM_COUNT == 8
+DECL_PAIR(quarter);
+DECL_QUARTET(half);
+# elif ELEM_COUNT == 16
+DECL_PAIR(eighth);
+DECL_QUARTET(quarter);
+DECL_OCTET(half);
+# endif
+
+# undef DECL_OCTET
+# undef DECL_QUARTET
+# undef DECL_PAIR
+
 #endif
 
 #if VEC_SIZE == 16
@@ -146,6 +174,14 @@ typedef long long __attribute__((vector_
 #ifdef __AVX512F__
 
 /* Sadly there are a few exceptions to the general naming rules. */
+#define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
+#define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+#define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
+#define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
+#define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
+#define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
+#define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
+#define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
 #define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 #define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 #define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -329,6 +365,18 @@ OVR(slldq);
 OVR(srldq);
 #endif
 
+#ifdef __AVX512DQ__
+OVR_VFP(and);
+OVR_VFP(andn);
+OVR_VFP(or);
+OVR(pextrd);
+OVR(pextrq);
+OVR(pinsrd);
+OVR(pinsrq);
+OVR(pmullq);
+OVR_VFP(xor);
+#endif
+
 #undef OVR_VFP
 #undef OVR_SFP
 #undef OVR_INT
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -23,6 +23,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
 #include "avx512bw.h"
+#include "avx512dq.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -100,6 +101,11 @@ static bool simd_check_avx512dq(void)
 }
 #define simd_check_avx512dq_opmask simd_check_avx512dq
 
+static bool simd_check_avx512dq_vl(void)
+{
+    return cpu_has_avx512dq && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -267,9 +273,10 @@ static const struct {
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
     SIMD(OPMASK/w,     avx512f_opmask,         2),
-    SIMD(OPMASK/b,    avx512dq_opmask,         1),
-    SIMD(OPMASK/d,    avx512bw_opmask,         4),
-    SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(OPMASK+DQ/b, avx512dq_opmask,         1),
+    SIMD(OPMASK+DQ/w, avx512dq_opmask,         2),
+    SIMD(OPMASK+BW/d, avx512bw_opmask,         4),
+    SIMD(OPMASK+BW/q, avx512bw_opmask,         8),
     SIMD(AVX512F f32 scalar,  avx512f,        f4),
     SIMD(AVX512F f32x16,      avx512f,      64f4),
     SIMD(AVX512F f64 scalar,  avx512f,        f8),
@@ -302,6 +309,24 @@ static const struct {
     AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
     AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
     AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
+    SIMD(AVX512DQ f32x16,    avx512dq,      64f4),
+    SIMD(AVX512DQ f64x8,     avx512dq,      64f8),
+    SIMD(AVX512DQ s32x16,    avx512dq,      64i4),
+    SIMD(AVX512DQ u32x16,    avx512dq,      64u4),
+    SIMD(AVX512DQ s64x8,     avx512dq,      64i8),
+    SIMD(AVX512DQ u64x8,     avx512dq,      64u8),
+    AVX512VL(DQ+VL f32x4,    avx512dq,      16f4),
+    AVX512VL(DQ+VL f64x2,    avx512dq,      16f8),
+    AVX512VL(DQ+VL f32x8,    avx512dq,      32f4),
+    AVX512VL(DQ+VL f64x4,    avx512dq,      32f8),
+    AVX512VL(DQ+VL s32x4,    avx512dq,      16i4),
+    AVX512VL(DQ+VL u32x4,    avx512dq,      16u4),
+    AVX512VL(DQ+VL s32x8,    avx512dq,      32i4),
+    AVX512VL(DQ+VL u32x8,    avx512dq,      32u4),
+    AVX512VL(DQ+VL s64x2,    avx512dq,      16i8),
+    AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
+    AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
+    AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -134,6 +134,27 @@ static inline bool _to_bool(byte_vec_t b
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if VEC_SIZE > FLOAT_SIZE
+#  if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
+       (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
+       (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#   define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+#  endif
+#  if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
+       (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#   define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+#  endif
 #  if FLOAT_SIZE == 4
 #   define broadcast(x) ({ \
     vec_t t_; \
@@ -141,6 +162,17 @@ static inline bool _to_bool(byte_vec_t b
           : "=v" (t_) : "m" (*(float[1]){ x }) ); \
     t_; \
 })
+#   if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#    define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcastf32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#   endif
+#   if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#    define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
+#    define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
+#   endif
 #   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #   define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
@@ -149,6 +181,13 @@ static inline bool _to_bool(byte_vec_t b
 #    define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #    define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
 #   else
+#    define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
+#    define insert_pair(x, y, p) \
+    B(insertf32x4_, _mask, x, \
+      /* Cast needed below to work around gcc 7.x quirk. */ \
+      (p) & 1 ? (typeof(y))__builtin_ia32_shufps(y, y, 0b01000100) : (y), \
+      (p) >> 1, x, 3 << ((p) * 2))
+#    define insert_quartet(x, y, p) B(insertf32x4_, _mask, x, y, p, undef(), ~0)
 #    define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #    define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #    define swap(x) ({ \
@@ -172,6 +211,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #   endif
+#   if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#    define broadcast_pair(x) B(broadcastf64x2_, _mask, x, undef(), ~0)
+#    define insert_pair(x, y, p) B(insertf64x2_, _mask, x, y, p, undef(), ~0)
+#   endif
+#   if VEC_SIZE == 64
+#    define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
+#    define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
+#   endif
 #   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #   define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
@@ -300,6 +347,16 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 # endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextracti32x4 */ || \
+       (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -312,11 +369,30 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  ifdef __AVX512DQ__
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcasti32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) ((vec_t)B(broadcasti32x8_, _mask, (vsi_octet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_octet(x, y, p) ((vec_t)B(inserti32x8_, _mask, (vsi_t)(x), (vsi_octet_t)(y), p, (vsi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti32x4_, _mask, (vsi_quartet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_pair(x, y, p) \
+    (vec_t)(B(inserti32x4_, _mask, (vsi_t)(x), \
+              /* First cast needed below to work around gcc 7.x quirk. */ \
+              (p) & 1 ? (vsi_pair_t)__builtin_ia32_pshufd((vsi_pair_t)(y), 0b01000100) \
+                      : (vsi_pair_t)(y), \
+              (p) >> 1, (vsi_t)(x), 3 << ((p) * 2)))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti32x4_, _mask, (vsi_t)(x), (vsi_quartet_t)(y), p, (vsi_t)undef(), ~0))
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
@@ -341,6 +417,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ((vec_t)B(broadcasti64x2_, _mask, (vdi_pair_t)(x), (vdi_t)undef(), ~0))
+#   define insert_pair(x, y, p) ((vec_t)B(inserti64x2_, _mask, (vdi_t)(x), (vdi_pair_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti64x4_, , (vdi_quartet_t)(x), (vdi_t)undef(), ~0))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti64x4_, _mask, (vdi_t)(x), (vdi_quartet_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
@@ -892,7 +976,7 @@ static inline eighth_t low_eighth(vec_t
     eighth_t y;
     unsigned int i;
 
-    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+    for ( i = 0; i < ELEM_COUNT / 8; ++i )
         y[i] = x[i];
 
     return y;
@@ -904,6 +988,50 @@ static inline eighth_t low_eighth(vec_t
 
 #endif
 
+#ifdef broadcast_pair
+# if ELEM_COUNT == 4
+#  define broadcast_half broadcast_pair
+# elif ELEM_COUNT == 8
+#  define broadcast_quarter broadcast_pair
+# elif ELEM_COUNT == 16
+#  define broadcast_eighth broadcast_pair
+# endif
+#endif
+
+#ifdef insert_pair
+# if ELEM_COUNT == 4
+#  define insert_half insert_pair
+# elif ELEM_COUNT == 8
+#  define insert_quarter insert_pair
+# elif ELEM_COUNT == 16
+#  define insert_eighth insert_pair
+# endif
+#endif
+
+#ifdef broadcast_quartet
+# if ELEM_COUNT == 8
+#  define broadcast_half broadcast_quartet
+# elif ELEM_COUNT == 16
+#  define broadcast_quarter broadcast_quartet
+# endif
+#endif
+
+#ifdef insert_quartet
+# if ELEM_COUNT == 8
+#  define insert_half insert_quartet
+# elif ELEM_COUNT == 16
+#  define insert_quarter insert_quartet
+# endif
+#endif
+
+#if defined(broadcast_octet) && ELEM_COUNT == 16
+# define broadcast_half broadcast_octet
+#endif
+
+#if defined(insert_octet) && ELEM_COUNT == 16
+# define insert_half insert_octet
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1199,6 +1327,60 @@ int simd_test(void)
     if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#if defined(broadcast_half) && defined(insert_half)
+    {
+        half_t aux = low_half(src);
+
+        touch(aux);
+        x = broadcast_half(aux);
+        touch(aux);
+        y = insert_half(src, aux, 1);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_quarter) && defined(insert_quarter)
+    {
+        quarter_t aux = low_quarter(src);
+
+        touch(aux);
+        x = broadcast_quarter(aux);
+        touch(aux);
+        y = insert_quarter(src, aux, 1);
+        touch(aux);
+        y = insert_quarter(y, aux, 2);
+        touch(aux);
+        y = insert_quarter(y, aux, 3);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_eighth) && defined(insert_eighth) && \
+    /* At least gcc 7.3 "optimizes" away all insert_eighth() calls below. */ \
+    __GNUC__ >= 8
+    {
+        eighth_t aux = low_eighth(src);
+
+        touch(aux);
+        x = broadcast_eighth(aux);
+        touch(aux);
+        y = insert_eighth(src, aux, 1);
+        touch(aux);
+        y = insert_eighth(y, aux, 2);
+        touch(aux);
+        y = insert_eighth(y, aux, 3);
+        touch(aux);
+        y = insert_eighth(y, aux, 4);
+        touch(aux);
+        y = insert_eighth(y, aux, 5);
+        touch(aux);
+        y = insert_eighth(y, aux, 6);
+        touch(aux);
+        y = insert_eighth(y, aux, 7);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);

* [PATCH v3 34/34] x86emul: also allow running the 32-bit harness on a 64-bit distro
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (32 preceding siblings ...)
  2018-09-18 12:14   ` [PATCH v3 33/34] x86emul: basic AVX512DQ testing Jan Beulich
@ 2018-09-18 12:14   ` Jan Beulich
  33 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-18 12:14 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

In order to be able to verify that the 32-bit variant builds and runs,
introduce a suitable target (and the other adjustments this requires).

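With this in place, the 32-bit harness can be exercised from a 64-bit
environment via, e.g. (assuming a multilib gcc which accepts -m32, as
the HOSTCFLAGS override relies on):

    make -C tools/tests/x86_emulator run32

with clean32 correspondingly cleaning up the 32/ subdirectory.
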
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/.gitignore
+++ b/.gitignore
@@ -240,6 +240,7 @@ tools/security/xensec_tool
 tools/tests/depriv/depriv-fd-checker
 tools/tests/x86_emulator/*.bin
 tools/tests/x86_emulator/*.tmp
+tools/tests/x86_emulator/32/x86_emulate
 tools/tests/x86_emulator/3dnow*.[ch]
 tools/tests/x86_emulator/asm
 tools/tests/x86_emulator/avx*.[ch]
--- /dev/null
+++ b/tools/tests/x86_emulator/32/Makefile
@@ -0,0 +1,4 @@
+override XEN_COMPILE_ARCH := x86_32
+XEN_ROOT = $(CURDIR)/../../../..
+VPATH += ..
+include ../Makefile
--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -1,5 +1,7 @@
 
+ifeq ($(XEN_ROOT),)
 XEN_ROOT=$(CURDIR)/../../..
+endif
 include $(XEN_ROOT)/tools/Rules.mk
 
 TARGET := test_x86_emulator
@@ -18,6 +20,12 @@ TESTCASES := blowfish $(SIMD) $(FMA) $(S
 
 OPMASK := avx512f avx512dq avx512bw
 
+ifeq ($(origin XEN_COMPILE_ARCH),override)
+
+HOSTCFLAGS += -m32
+
+else
+
 blowfish-cflags := ""
 blowfish-cflags-x86_32 := "-mno-accumulate-outgoing-args -Dstatic="
 
@@ -148,6 +156,8 @@ $(addsuffix .h,$(SIMD) $(FMA) $(SG)): si
 
 xop.h avx512f.h: simd-fma.c
 
+endif # 32-bit override
+
 $(TARGET): x86-emulate.o test_x86_emulator.o evex-disp8.o wrappers.o
 	$(HOSTCC) $(HOSTCFLAGS) -o $@ $^
 
@@ -162,6 +172,15 @@ distclean: clean
 .PHONY: install uninstall
 install uninstall:
 
+.PHONY: run32 clean32
+ifeq ($(XEN_TARGET_ARCH),x86_64)
+run32 clean32: %32: $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
+	$(MAKE) -C 32 $*
+clean: clean32
+else
+run32 clean32: %32: %
+endif
+
 x86_emulate:
 	[ -L $@ ] || ln -sf $(XEN_ROOT)/xen/arch/x86/$@
 
^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes
  2018-09-18 11:53   ` [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes Jan Beulich
@ 2018-09-18 16:05     ` Paul Durrant
  2018-10-25 18:36     ` Andrew Cooper
  1 sibling, 0 replies; 465+ messages in thread
From: Paul Durrant @ 2018-09-18 16:05 UTC (permalink / raw)
  To: 'Jan Beulich', xen-devel; +Cc: Andrew Cooper, Wei Liu, George Dunlap

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 18 September 2018 12:54
> To: xen-devel <xen-devel@lists.xenproject.org>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>; Wei Liu <wei.liu2@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>
> Subject: [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes
> 
> This is needed before enabling any AVX512 insns in the emulator. Change
> the way alignment is enforced at the same time.
> 
> Add a check that the buffer won't actually overflow, and while at it
> also convert the check for accesses to not cross page boundaries.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> v3: New.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -866,7 +866,18 @@ static int hvmemul_phys_mmio_access(
>      int rc = X86EMUL_OKAY;
> 
>      /* Accesses must fall within a page. */
> -    BUG_ON((gpa & ~PAGE_MASK) + size > PAGE_SIZE);
> +    if ( (gpa & ~PAGE_MASK) + size > PAGE_SIZE )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    /* Accesses must not overflow the cache's buffer. */
> +    if ( size > sizeof(cache->buffer) )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> 
>      /*
>       * hvmemul_do_io() cannot handle non-power-of-2 accesses or
> --- a/xen/include/asm-x86/hvm/vcpu.h
> +++ b/xen/include/asm-x86/hvm/vcpu.h
> @@ -42,15 +42,14 @@ struct hvm_vcpu_asid {
>  };
> 
>  /*
> - * We may read or write up to m256 as a number of device-model
> + * We may read or write up to m512 as a number of device-model
>   * transactions.
>   */
>  struct hvm_mmio_cache {
>      unsigned long gla;
>      unsigned int size;
>      uint8_t dir;
> -    uint8_t pad[3]; /* make buffer[] long-aligned */
> -    uint8_t buffer[32];
> +    uint8_t buffer[64] __aligned(sizeof(long));
>  };
> 
>  struct hvm_vcpu_io {
> 
> 
> 

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (7 preceding siblings ...)
  2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
@ 2018-09-25 13:14 ` Jan Beulich
  2018-09-25 13:25   ` [PATCH v4 01/44] x86emul: support AVX512 opmask insns Jan Beulich
                     ` (43 more replies)
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                   ` (3 subsequent siblings)
  12 siblings, 44 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:14 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

01: support AVX512 opmask insns
02: x86/HVM: grow MMIO cache data size to 64 bytes
03: correct EVEX decoding
04: generalize vector length handling for AVX512/EVEX
05: support basic AVX512 moves
06: test for correct EVEX Disp8 scaling
07: also allow running the 32-bit harness on a 64-bit distro
08: use AVX512 logic for emulating V{,P}MASKMOV*
09: support AVX512F legacy-equivalent arithmetic FP insns
10: support AVX512DQ logic FP insns
11: support AVX512F "normal" FP compare insns
12: support AVX512F misc legacy-equivalent FP insns
13: support AVX512F fused-multiply-add insns
14: support AVX512F legacy-equivalent logic insns
15: support AVX512{F,DQ} FP broadcast insns
16: support AVX512F v{,u}comis{d,s} insns
17: test: introduce eq()
18: support AVX512{F,BW} packed integer compare insns
19: support AVX512{F,BW} packed integer arithmetic insns
20: use simd_128 also for legacy vector shift insns
21: support AVX512{F,BW} shift/rotate insns
22: support AVX512{F,BW,DQ} extract insns
23: support AVX512{F,BW,DQ} insert insns
24: basic AVX512F testing
25: support AVX512{F,BW,DQ} integer broadcast insns
26: basic AVX512VL testing
27: support AVX512{F,BW} zero- and sign-extending moves
28: support AVX512{F,BW} down conversion moves
29: support AVX512{F,BW} integer unpack insns
30: support AVX512{F,BW,_VBMI} full permute insns
31: support AVX512{F,BW} integer shuffle insns
32: support AVX512{BW,DQ} mask move insns
33: basic AVX512BW testing
34: basic AVX512DQ testing
35: support AVX512F move high/low insns
36: support AVX512F move duplicate insns
37: support AVX512{F,BW,VBMI} permute insns
38: support AVX512BW pack insns
39: support AVX512F floating-point conversion insns
40: support AVX512F legacy-equivalent packed int/FP conversion insns
41: support AVX512F legacy-equivalent scalar int/FP conversion insns
42: support AVX512DQ packed quad-int/FP conversion insns
43: support AVX512{F,DQ} uint-to-FP conversion insns
44: support AVX512{F,DQ} FP-to-uint conversion insns

The main goal of this series is to support enough of the instructions
that basic AVX512F, AVX512BW, AVX512DQ, and AVX512VL tests can be run
(this set is relevant as a basis in particular because together it
mostly, if not entirely, covers the legacy-equivalent AVX512 insns).
Later additions will then simply enable further of the (conditional)
tests in simd*.c (or by other means).

Besides the additions to the series there are two main changes to
previously posted patches: While implementing support for the convert
instructions I noticed an issue with insns which are WIG outside of
64-bit mode, but for which EVEX.W is meaningful in 64-bit mode (movd,
pextrd, pinsrd, and the scalar convert insns). Furthermore I now
hope that the test harness will also run on AVX512VL-incapable
hardware (although I'm not myself able to test this, so I can't rule
out that issues remain). For details, including further smaller
adjustments, please see the individual patches.

Jan

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v4 01/44] x86emul: support AVX512 opmask insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
@ 2018-09-25 13:25   ` Jan Beulich
  2018-09-26  6:06     ` Jan Beulich
  2018-09-25 13:26   ` [PATCH v4 02/44] x86/HVM: grow MMIO cache data size to 64 bytes Jan Beulich
                     ` (42 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:25 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

These are all VEX encoded, so the EVEX decoding logic remains unused
at this point.

The new testcase is deliberately coded in assembly, as a C one would
have become almost unreadable due to the overwhelming number of
__builtin_...() invocations that would need to be used: the compiler
has no underlying type (yet) which could be operated on without
builtins, other than the vector types used for "normal" SIMD insns.
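
To illustrate (a sketch only, not part of the patch; builtin names as
in GCC's avx512fintrin.h, assuming -mavx512f and a recent enough
compiler), just the first of the logic checks below would in C come
out as something like:

    typedef unsigned short mask16_t; /* stand-in for __mmask16 */

    /* The asm verifies (a | b) & ~(a & b) == a ^ b on mask regs. */
    static int check_logic16(mask16_t a, mask16_t b)
    {
        mask16_t k_or   = __builtin_ia32_korhi(a, b);
        mask16_t k_and  = __builtin_ia32_kandhi(a, b);
        /* kandn computes ~first & second. */
        mask16_t k_andn = __builtin_ia32_kandnhi(k_and, k_or);
        mask16_t k_xor  = __builtin_ia32_kxorhi(a, b);

        return k_andn == k_xor;
    }

Multiplied across all four widths plus the shift, unpack, and flag
(kortest/ktest) checks, this quickly becomes unwieldy.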

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Use distinct temporary file names in testcase.mk. Additions to clean
    target.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,6 +16,8 @@ FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
 
+OPMASK := avx512f avx512dq avx512bw
+
 blowfish-cflags := ""
 blowfish-cflags-x86_32 := "-mno-accumulate-outgoing-args -Dstatic="
 
@@ -51,6 +53,10 @@ xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
 
+avx512f-opmask-vecs := 2
+avx512dq-opmask-vecs := 1
+avx512bw-opmask-vecs := 4 8
+
 # For AVX and later, have the compiler avoid XMM0 to widen coverage of
 # the VEX.vvvv checks in the emulator.  For 3DNow!, however, force SSE
 # use for floating point operations, to avoid mixing MMX and FPU register
@@ -80,9 +86,13 @@ $(1)-cflags := \
 	   $(foreach flt,$($(1)-flts), \
 	     "-D_$(vec)x$(idx)f$(flt) -m$(1:-sg=) $(call non-sse,$(1)) -Os -DVEC_MAX=$(vec) -DIDX_SIZE=$(idx) -DFLOAT_SIZE=$(flt)")))
 endef
+define opmask-defs
+$(1)-opmask-cflags := $(foreach vec,$($(1)-opmask-vecs), "-D_$(vec) -m$(1) -Os -DSIZE=$(vec)")
+endef
 
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
+$(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
 $(addsuffix .h,$(TESTCASES)): %.h: %.c testcase.mk Makefile
 	rm -f $@.new $*.bin
@@ -100,6 +110,22 @@ $(addsuffix .h,$(TESTCASES)): %.h: %.c t
 	)
 	mv $@.new $@
 
+$(addsuffix -opmask.h,$(OPMASK)): %.h: opmask.S testcase.mk Makefile
+	rm -f $@.new $*.bin
+	$(foreach arch,$(filter-out $(XEN_COMPILE_ARCH),x86_32) $(XEN_COMPILE_ARCH), \
+	    for cflags in $($*-cflags) $($*-cflags-$(arch)); do \
+		$(MAKE) -f testcase.mk TESTCASE=$* XEN_TARGET_ARCH=$(arch) $*-cflags="$$cflags" all; \
+		prefix=$(shell echo $(subst -,_,$*) | sed -e 's,^\([0-9]\),_\1,'); \
+		flavor=$$(echo $${cflags} | sed -e 's, .*,,' -e 'y,-=,__,') ; \
+		(echo 'static const unsigned int __attribute__((section(".test, \"ax\", @progbits #")))' \
+		      "$${prefix}_$(arch)$${flavor}[] = {"; \
+		 od -v -t x $*.bin | sed -e 's/^[0-9]* /0x/' -e 's/ /, 0x/g' -e 's/$$/,/'; \
+		 echo "};") >>$@.new; \
+		rm -f $*.bin; \
+	    done; \
+	)
+	mv $@.new $@
+
 $(addsuffix .c,$(SIMD)):
 	ln -sf simd.c $@
 
@@ -118,7 +144,8 @@ $(TARGET): x86-emulate.o test_x86_emulat
 
 .PHONY: clean
 clean:
-	rm -rf $(TARGET) *.o *~ core $(addsuffix .h,$(TESTCASES)) *.bin x86_emulate
+	rm -rf $(TARGET) *.o *~ core *.bin x86_emulate
+	rm -rf $(TARGET) $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
 
 .PHONY: distclean
 distclean: clean
@@ -145,4 +172,4 @@ x86-emulate.o test_x86_emulator.o wrappe
 x86-emulate.o: x86_emulate/x86_emulate.c
 x86-emulate.o: HOSTCFLAGS += -D__XEN_TOOLS__
 
-test_x86_emulator.o: $(addsuffix .h,$(TESTCASES))
+test_x86_emulator.o: $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
--- /dev/null
+++ b/tools/tests/x86_emulator/opmask.S
@@ -0,0 +1,144 @@
+#ifdef __i386__
+# define R(x) e##x
+# define DATA(x) x
+#else
+# if SIZE == 8
+#  define R(x) r##x
+# else
+#  define R(x) e##x
+# endif
+# define DATA(x) x(%rip)
+#endif
+
+#if SIZE == 1
+# define _(x) x##b
+#elif SIZE == 2
+# define _(x) x##w
+# define WIDEN(x) x##bw
+#elif SIZE == 4
+# define _(x) x##d
+# define WIDEN(x) x##wd
+#elif SIZE == 8
+# define _(x) x##q
+# define WIDEN(x) x##dq
+#endif
+
+    .macro check res1:req, res2:req, line:req
+    _(kmov)       %\res1, DATA(out)
+#if SIZE < 8 || !defined(__i386__)
+    _(kmov)       %\res2, %R(dx)
+    cmp           DATA(out), %R(dx)
+#else
+    sub           $8, %esp
+    kmovq         %\res2, (%esp)
+    pop           %ecx
+    pop           %edx
+    cmp           DATA(out), %ecx
+    jne           0f
+    cmp           DATA(out+4), %edx
+0:
+#endif
+    je            1f
+    mov           $\line, %eax
+    ret
+1:
+    .endm
+
+    .text
+    .globl _start
+_start:
+    _(kmov)       DATA(in1), %k1
+#if SIZE < 8 || !defined(__i386__)
+    mov           DATA(in2), %R(ax)
+    _(kmov)       %R(ax), %k2
+#else
+    _(kmov)       DATA(in2), %k2
+#endif
+
+    _(kor)        %k1, %k2, %k3
+    _(kand)       %k1, %k2, %k4
+    _(kandn)      %k3, %k4, %k5
+    _(kxor)       %k1, %k2, %k6
+    check         k5, k6, __LINE__
+
+    _(knot)       %k6, %k3
+    _(kxnor)      %k1, %k2, %k4
+    check         k3, k4, __LINE__
+
+    _(kshiftl)    $1, %k1, %k3
+    _(kshiftl)    $2, %k3, %k4
+    _(kshiftl)    $3, %k1, %k5
+    check         k4, k5, __LINE__
+
+    _(kshiftr)    $1, %k1, %k3
+    _(kshiftr)    $2, %k3, %k4
+    _(kshiftr)    $3, %k1, %k5
+    check         k4, k5, __LINE__
+
+    _(kortest)    %k6, %k6
+    jnbe          1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxor)       %k0, %k0, %k3
+    _(kortest)    %k3, %k3
+    jz            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxnor)      %k0, %k0, %k3
+    _(kortest)    %k3, %k3
+    jc            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+#if SIZE > 1
+
+    _(kshiftr)    $SIZE*4, %k3, %k4
+    WIDEN(kunpck) %k4, %k4, %k5
+    check         k3, k5, __LINE__
+
+#endif
+
+#if SIZE != 2 || defined(__AVX512DQ__)
+
+    _(kadd)       %k1, %k1, %k3
+    _(kshiftl)    $1, %k1, %k4
+    check         k3, k4, __LINE__
+
+    _(ktest)      %k2, %k1
+    jnbe          1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxor)       %k0, %k0, %k3
+    _(ktest)      %k0, %k3
+    jz            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+    _(kxnor)      %k0, %k0, %k4
+    _(ktest)      %k0, %k4
+    jc            1f
+    mov           $__LINE__, %eax
+    ret
+1:
+
+#endif
+
+    xor           %eax, %eax
+    ret
+
+    .section .rodata, "a", @progbits
+    .balign 8
+in1: .byte 0b10110011, 0b10001111, 0b00001111, 0b10000011, 0b11110000, 0b00111111, 0b10000000, 0b11111111
+in2: .byte 0b11111111, 0b00000001, 0b11111100, 0b00001111, 0b11000001, 0b11110000, 0b11110001, 0b11001101
+
+    .data
+    .balign 8
+out: .quad 0
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -18,6 +18,9 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx2.h"
 #include "avx2-sg.h"
 #include "xop.h"
+#include "avx512f-opmask.h"
+#include "avx512dq-opmask.h"
+#include "avx512bw-opmask.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -78,6 +81,24 @@ static bool simd_check_xop(void)
     return cpu_has_xop;
 }
 
+static bool simd_check_avx512f(void)
+{
+    return cpu_has_avx512f;
+}
+#define simd_check_avx512f_opmask simd_check_avx512f
+
+static bool simd_check_avx512dq(void)
+{
+    return cpu_has_avx512dq;
+}
+#define simd_check_avx512dq_opmask simd_check_avx512dq
+
+static bool simd_check_avx512bw(void)
+{
+    return cpu_has_avx512bw;
+}
+#define simd_check_avx512bw_opmask simd_check_avx512bw
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -223,6 +244,10 @@ static const struct {
     SIMD(XOP i16x16,              xop,      32i2),
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
+    SIMD(OPMASK/w,     avx512f_opmask,         2),
+    SIMD(OPMASK/b,    avx512dq_opmask,         1),
+    SIMD(OPMASK/d,    avx512bw_opmask,         4),
+    SIMD(OPMASK/q,    avx512bw_opmask,         8),
 #undef SIMD_
 #undef SIMD
 };
@@ -3426,8 +3451,8 @@ int main(int argc, char **argv)
             rc = x86_emulate(&ctxt, &emulops);
             if ( rc != X86EMUL_OKAY )
             {
-                printf("failed at %%eip == %08lx (opcode %08x)\n",
-                       (unsigned long)regs.eip, ctxt.opcode);
+                printf("failed (%d) at %%eip == %08lx (opcode %08x)\n",
+                       rc, (unsigned long)regs.eip, ctxt.opcode);
                 return 1;
             }
         }
--- a/tools/tests/x86_emulator/testcase.mk
+++ b/tools/tests/x86_emulator/testcase.mk
@@ -14,3 +14,9 @@ all: $(TESTCASE).bin
 	$(LD) $(LDFLAGS_DIRECT) -N -Ttext 0x100000 -o $*.tmp $*.o
 	$(OBJCOPY) -O binary $*.tmp $@
 	rm -f $*.tmp
+
+%-opmask.bin: opmask.S
+	$(CC) $(filter-out -M% .%,$(CFLAGS)) -c $< -o $(basename $@).o
+	$(LD) $(LDFLAGS_DIRECT) -N -Ttext 0x100000 -o $(basename $@).tmp $(basename $@).o
+	$(OBJCOPY) -O binary $(basename $@).tmp $@
+	rm -f $(basename $@).tmp
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -209,6 +209,9 @@ int emul_test_get_fpu(
     case X86EMUL_FPU_ymm:
         if ( cpu_has_avx )
             break;
+    case X86EMUL_FPU_opmask:
+        if ( cpu_has_avx512f )
+            break;
     default:
         return X86EMUL_UNHANDLEABLE;
     }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -236,6 +236,36 @@ static inline uint64_t xgetbv(uint32_t x
     (res.c & (1U << 21)) != 0; \
 })
 
+#define cpu_has_avx512f ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 16)) != 0; \
+})
+
+#define cpu_has_avx512dq ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 17)) != 0; \
+})
+
+#define cpu_has_avx512bw ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 30)) != 0; \
+})
+
 int emul_test_cpuid(
     uint32_t leaf,
     uint32_t subleaf,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -491,6 +491,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
+    [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
@@ -1187,6 +1188,11 @@ static int _get_fpu(
             return X86EMUL_UNHANDLEABLE;
         break;
 
+    case X86EMUL_FPU_opmask:
+        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
+            return X86EMUL_UNHANDLEABLE;
+        break;
+
     default:
         break;
     }
@@ -1762,12 +1768,15 @@ static bool vcpu_has(
 #define vcpu_has_bmi2()        vcpu_has(         7, EBX,  8, ctxt, ops)
 #define vcpu_has_rtm()         vcpu_has(         7, EBX, 11, ctxt, ops)
 #define vcpu_has_mpx()         vcpu_has(         7, EBX, 14, ctxt, ops)
+#define vcpu_has_avx512f()     vcpu_has(         7, EBX, 16, ctxt, ops)
+#define vcpu_has_avx512dq()    vcpu_has(         7, EBX, 17, ctxt, ops)
 #define vcpu_has_rdseed()      vcpu_has(         7, EBX, 18, ctxt, ops)
 #define vcpu_has_adx()         vcpu_has(         7, EBX, 19, ctxt, ops)
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
+#define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -2396,6 +2405,18 @@ x86_decode_twobyte(
         }
         break;
 
+    case X86EMUL_OPC_VEX(0, 0x90):    /* kmov{w,q} */
+    case X86EMUL_OPC_VEX_66(0, 0x90): /* kmov{b,d} */
+        state->desc = DstReg | SrcMem | Mov;
+        state->simd_size = simd_other;
+        break;
+
+    case X86EMUL_OPC_VEX(0, 0x91):    /* kmov{w,q} */
+    case X86EMUL_OPC_VEX_66(0, 0x91): /* kmov{b,d} */
+        state->desc = DstMem | SrcReg | Mov;
+        state->simd_size = simd_other;
+        break;
+
     case 0xae:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
@@ -6002,6 +6023,60 @@ x86_emulate(
             dst.val = src.val;
         break;
 
+    case X86EMUL_OPC_VEX(0x0f, 0x4a):    /* kadd{w,q} k,k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x41):    /* kand{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x41): /* kand{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x42):    /* kandn{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x42): /* kandn{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x45):    /* kor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x45): /* kor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x46):    /* kxnor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x46): /* kxnor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX(0x0f, 0x47):    /* kxor{w,q} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x47): /* kxor{b,d} k,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x4a): /* kadd{b,d} k,k,k */
+        generate_exception_if(!vex.l, EXC_UD);
+    opmask_basic:
+        if ( vex.w )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( vex.pfx )
+            host_and_vcpu_must_have(avx512dq);
+    opmask_common:
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(!vex.r || (mode_64bit() && !(vex.reg & 8)) ||
+                              ea.type != OP_REG, EXC_UD);
+
+        vex.reg |= 8;
+        d &= ~TwoOp;
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        insn_bytes = PFX_BYTES + 2;
+
+        state->simd_size = simd_other;
+        op_bytes = 1; /* Any non-zero value will do. */
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x44):    /* knot{w,q} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x44): /* knot{b,d} k,k */
+        generate_exception_if(vex.l || vex.reg != 0xf, EXC_UD);
+        goto opmask_basic;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x4b):    /* kunpck{w,d}{d,q} k,k,k */
+        generate_exception_if(!vex.l, EXC_UD);
+        host_and_vcpu_must_have(avx512bw);
+        goto opmask_common;
+
+    case X86EMUL_OPC_VEX_66(0x0f, 0x4b): /* kunpckbw k,k,k */
+        generate_exception_if(!vex.l || vex.w, EXC_UD);
+        goto opmask_common;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x50):     /* movmskp{s,d} xmm,reg */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x50): /* vmovmskp{s,d} {x,y}mm,reg */
     CASE_SIMD_PACKED_INT(0x0f, 0xd7):      /* pmovmskb {,x}mm,reg */
@@ -6552,6 +6627,154 @@ x86_emulate(
         dst.val = test_cc(b, _regs.eflags);
         break;
 
+    case X86EMUL_OPC_VEX(0x0f, 0x91):    /* kmov{w,q} k,mem */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x91): /* kmov{b,d} k,mem */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x90):    /* kmov{w,q} k/mem,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x90): /* kmov{b,d} k/mem,k */
+        generate_exception_if(vex.l || !vex.r, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.w )
+        {
+            host_and_vcpu_must_have(avx512bw);
+            op_bytes = 4 << !vex.pfx;
+        }
+        else if ( vex.pfx )
+        {
+            host_and_vcpu_must_have(avx512dq);
+            op_bytes = 1;
+        }
+        else
+            op_bytes = 2;
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            vex.b = 1;
+            opc[1] &= 0x38;
+        }
+        insn_bytes = PFX_BYTES + 2;
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x92):    /* kmovw r32,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x92): /* kmovb r32,k */
+    case X86EMUL_OPC_VEX_F2(0x0f, 0x92): /* kmov{d,q} reg,k */
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.pfx == vex_f2 )
+            host_and_vcpu_must_have(avx512bw);
+        else
+        {
+            generate_exception_if(vex.w, EXC_UD);
+            if ( vex.pfx )
+                host_and_vcpu_must_have(avx512dq);
+        }
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        vex.b = 1;
+        if ( !mode_64bit() )
+            vex.w = 0;
+        opc[1] = modrm & 0xf8;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        ea.reg = decode_gpr(&_regs, modrm_rm);
+        invoke_stub("", "", "=m" (dummy) : "a" (*ea.reg));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        dst.type = OP_NONE;
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x93):    /* kmovw k,r32 */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x93): /* kmovb k,r32 */
+    case X86EMUL_OPC_VEX_F2(0x0f, 0x93): /* kmov{d,q} k,reg */
+        generate_exception_if(vex.l || vex.reg != 0xf || ea.type != OP_REG,
+                              EXC_UD);
+        dst = ea;
+        dst.reg = decode_gpr(&_regs, modrm_reg);
+
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.pfx == vex_f2 )
+        {
+            host_and_vcpu_must_have(avx512bw);
+            dst.bytes = 4 << (mode_64bit() && vex.w);
+        }
+        else
+        {
+            generate_exception_if(vex.w, EXC_UD);
+            dst.bytes = 4;
+            if ( vex.pfx )
+                host_and_vcpu_must_have(avx512dq);
+        }
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert GPR destination to %rAX. */
+        vex.r = 1;
+        if ( !mode_64bit() )
+            vex.w = 0;
+        opc[1] = modrm & 0xc7;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        break;
+
+    case X86EMUL_OPC_VEX(0x0f, 0x99):    /* ktest{w,q} k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+        /* fall through */
+    case X86EMUL_OPC_VEX(0x0f, 0x98):    /* kortest{w,q} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x98): /* kortest{b,d} k,k */
+    case X86EMUL_OPC_VEX_66(0x0f, 0x99): /* ktest{b,d} k,k */
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( vex.w )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( vex.pfx )
+            host_and_vcpu_must_have(avx512dq);
+
+        get_fpu(X86EMUL_FPU_opmask);
+
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        opc[2] = 0xc3;
+
+        copy_VEX(opc, vex);
+        invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),
+                    _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),
+                    [eflags] "+g" (_regs.eflags),
+                    "=a" (dst.val), [tmp] "=&r" (dummy)
+                    : [mask] "i" (EFLAGS_MASK));
+
+        put_stub(stub);
+
+        ASSERT(!state->simd_size);
+        dst.type = OP_NONE;
+        break;
+
     case X86EMUL_OPC(0x0f, 0xa2): /* cpuid */
         msr_val = 0;
         fail_if(ops->cpuid == NULL);
@@ -8170,6 +8393,23 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
+        if ( !vex.w )
+            host_and_vcpu_must_have(avx512dq);
+    opmask_shift_imm:
+        generate_exception_if(vex.l || !vex.r || vex.reg != 0xf ||
+                              ea.type != OP_REG, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_opmask);
+        op_bytes = 1; /* Any non-zero value will do. */
+        goto simd_0f_imm8;
+
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x31): /* kshiftr{d,q} $imm8,k,k */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x33): /* kshiftl{d,q} $imm8,k,k */
+        host_and_vcpu_must_have(avx512bw);
+        goto opmask_shift_imm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x44):     /* pclmulqdq $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,xmm/m128,xmm,xmm */
         host_and_vcpu_must_have(pclmulqdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -170,6 +170,7 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_mmx, /* MMX instruction set (%mm0-%mm7) */
     X86EMUL_FPU_xmm, /* SSE instruction set (%xmm0-%xmm7/15) */
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
+    X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -99,9 +99,12 @@
 #define cpu_has_rtm             boot_cpu_has(X86_FEATURE_RTM)
 #define cpu_has_fpu_sel         (!boot_cpu_has(X86_FEATURE_NO_FPU_SEL))
 #define cpu_has_mpx             boot_cpu_has(X86_FEATURE_MPX)
+#define cpu_has_avx512f         boot_cpu_has(X86_FEATURE_AVX512F)
+#define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
+#define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v4 02/44] x86/HVM: grow MMIO cache data size to 64 bytes
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
  2018-09-25 13:25   ` [PATCH v4 01/44] x86emul: support AVX512 opmask insns Jan Beulich
@ 2018-09-25 13:26   ` Jan Beulich
  2018-09-25 13:27   ` [PATCH v4 03/44] x86emul: correct EVEX decoding Jan Beulich
                     ` (41 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:26 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

This is needed before enabling any AVX512 insns in the emulator. Change
the way alignment is enforced at the same time.

Add a check that the buffer won't actually overflow, and while at it
also convert the existing check (that accesses don't cross page
boundaries) from BUG_ON() to the same non-fatal form.
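
(As a reminder of where the 64 comes from: a full ZMM access moves
512 bits, i.e. 64 bytes. A hypothetical compile-time check to this
effect - not part of the patch - might use Xen's existing
BUILD_BUG_ON():

    /* buffer[] must be able to hold a full m512 access. */
    BUILD_BUG_ON(sizeof(((struct hvm_mmio_cache *)NULL)->buffer) <
                 512 / 8);
)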

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
---
v3: New.

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -866,7 +866,18 @@ static int hvmemul_phys_mmio_access(
     int rc = X86EMUL_OKAY;
 
     /* Accesses must fall within a page. */
-    BUG_ON((gpa & ~PAGE_MASK) + size > PAGE_SIZE);
+    if ( (gpa & ~PAGE_MASK) + size > PAGE_SIZE )
+    {
+        ASSERT_UNREACHABLE();
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    /* Accesses must not overflow the cache's buffer. */
+    if ( size > sizeof(cache->buffer) )
+    {
+        ASSERT_UNREACHABLE();
+        return X86EMUL_UNHANDLEABLE;
+    }
 
     /*
      * hvmemul_do_io() cannot handle non-power-of-2 accesses or
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -42,15 +42,14 @@ struct hvm_vcpu_asid {
 };
 
 /*
- * We may read or write up to m256 as a number of device-model
+ * We may read or write up to m512 as a number of device-model
  * transactions.
  */
 struct hvm_mmio_cache {
     unsigned long gla;
     unsigned int size;
     uint8_t dir;
-    uint8_t pad[3]; /* make buffer[] long-aligned */
-    uint8_t buffer[32];
+    uint8_t buffer[64] __aligned(sizeof(long));
 };
 
 struct hvm_vcpu_io {

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v4 03/44] x86emul: correct EVEX decoding
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
  2018-09-25 13:25   ` [PATCH v4 01/44] x86emul: support AVX512 opmask insns Jan Beulich
  2018-09-25 13:26   ` [PATCH v4 02/44] x86/HVM: grow MMIO cache data size to 64 bytes Jan Beulich
@ 2018-09-25 13:27   ` Jan Beulich
  2018-10-26 14:33     ` Andrew Cooper
  2018-09-25 13:28   ` [PATCH v4 04/44] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
                     ` (40 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:27 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Fix an inverted pair of checks, drop an incorrect instance of #UD
raising for non-64-bit mode, and add further generic checks.

Note: Contrary to what SDM Vol 2 rev 067 states, EVEX.V' is _not_
      ignored outside of 64-bit mode when the field does not encode a
      register. Just like EVEX.VVVV is required to be 0b1111 in that
      case, EVEX.V' is required to be 1 there.

Also rename the bcst field to br, as #UD generation for individual insns
will need to consider both of its possible meanings.
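
For context (a sketch only, not part of the patch): the bit's two
meanings are disambiguated by the kind of ModRM operand, roughly

    enum evex_b_meaning { EVEX_B_NONE, EVEX_B_BROADCAST, EVEX_B_SAE_RC };

    static enum evex_b_meaning evex_b_meaning(bool br, bool mem_operand)
    {
        if ( !br )
            return EVEX_B_NONE;
        /* Memory operand: embedded broadcast of a single element.
         * Register operand: static rounding control / SAE. */
        return mem_operand ? EVEX_B_BROADCAST : EVEX_B_SAE_RC;
    }

which is why neither "bcst" nor a rounding-specific name would fit.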

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -650,7 +650,7 @@ union evex {
         uint8_t w:1;
         uint8_t opmsk:3;
         uint8_t RX:1;
-        uint8_t bcst:1;
+        uint8_t br:1;
         uint8_t lr:2;
         uint8_t z:1;
     };
@@ -2760,13 +2760,11 @@ x86_decode(
                         evex.raw[1] = vex.raw[1];
                         evex.raw[2] = insn_fetch_type(uint8_t);
 
-                        generate_exception_if(evex.mbs || !evex.mbz, EXC_UD);
+                        generate_exception_if(!evex.mbs || evex.mbz, EXC_UD);
+                        generate_exception_if(!evex.opmsk && evex.z, EXC_UD);
 
                         if ( !mode_64bit() )
-                        {
-                            generate_exception_if(!evex.RX, EXC_UD);
                             evex.R = 1;
-                        }
 
                         vex.opcx = evex.opcx;
                         break;
@@ -3404,6 +3402,7 @@ x86_emulate(
         d = (d & ~DstMask) | DstMem;
         /* Becomes a normal DstMem operation from here on. */
     case DstMem:
+        generate_exception_if(ea.type == OP_MEM && evex.z, EXC_UD);
         if ( state->simd_size )
         {
             generate_exception_if(lock_prefix, EXC_UD);

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v4 04/44] x86emul: generalize vector length handling for AVX512/EVEX
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (2 preceding siblings ...)
  2018-09-25 13:27   ` [PATCH v4 03/44] x86emul: correct EVEX decoding Jan Beulich
@ 2018-09-25 13:28   ` Jan Beulich
  2018-10-26 16:10     ` Andrew Cooper
  2018-09-25 13:28   ` [PATCH v4 05/44] x86emul: support basic AVX512 moves Jan Beulich
                     ` (39 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:28 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

To allow for some code sharing where possible, copy VEX.L into EVEX.LR
even for VEX (or XOP) encoded insns. Make operand size determination
use this right away, at the same time adding consistency checks for the
EVEX scalar insn cases (the non-scalar ones aren't uniform enough for
the checking to be done in a central place like this).

Note that the broadcast case is not handled here, but will be taken care
of elsewhere (in just a single place rather than at least two).
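
To make the size calculation concrete (illustration only; the actual
logic is in x86_decode() below): with VEX.L copied into EVEX.LR, a
single expression covers all three vector lengths:

    /* lr == 0 -> 16 bytes (XMM), 1 -> 32 (YMM), 2 -> 64 (ZMM). */
    static unsigned int simd_op_bytes(unsigned int lr)
    {
        return 16u << lr;
    }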

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Introduce evex_encoded() to replace open-coded evex.mbs checks.
v2: Don't raise #UD in simd_scalar_opc case when EVEX.W != low-opcode-bit.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -191,14 +191,14 @@ enum simd_opsize {
      * Ordinary packed integers:
      * - 64 bits without prefix 66 (MMX)
      * - 128 bits with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      */
     simd_packed_int,
 
     /*
      * Ordinary packed/scalar floating point:
      * - 128 bits without prefix or with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      * - 32 bits with prefix F3 (scalar single)
      * - 64 bits with prefix F2 (scalar doubgle)
      */
@@ -207,14 +207,14 @@ enum simd_opsize {
     /*
      * Packed floating point:
      * - 128 bits without prefix or with prefix 66 (SSEn)
-     * - 128/256 bits depending on VEX.L (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      */
     simd_packed_fp,
 
     /*
      * Single precision packed/scalar floating point:
      * - 128 bits without prefix (SSEn)
-     * - 128/256 bits depending on VEX.L, no prefix (AVX)
+     * - 128/256/512 bits depending on VEX.L/EVEX.LR (AVX+)
      * - 32 bits with prefix F3 (scalar)
      */
     simd_single_fp,
@@ -228,7 +228,7 @@ enum simd_opsize {
 
     /*
      * Scalar floating point:
-     * - 32/64 bits depending on VEX.W
+     * - 32/64 bits depending on VEX.W/EVEX.W
      */
     simd_scalar_vexw,
 
@@ -2249,6 +2249,7 @@ int x86emul_unhandleable_rw(
 #define lock_prefix (state->lock_prefix)
 #define vex (state->vex)
 #define evex (state->evex)
+#define evex_encoded() (evex.mbs)
 #define ea (state->ea)
 
 static int
@@ -2818,6 +2819,9 @@ x86_decode(
 
                 opcode |= b | MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
 
+                if ( !evex_encoded() )
+                    evex.lr = vex.l;
+
                 if ( !(d & ModRM) )
                     break;
 
@@ -3148,7 +3152,7 @@ x86_decode(
             }
             /* fall through */
         case vex_66:
-            op_bytes = 16 << vex.l;
+            op_bytes = 16 << evex.lr;
             break;
         default:
             op_bytes = 0;
@@ -3172,9 +3176,17 @@ x86_decode(
     case simd_any_fp:
         switch ( vex.pfx )
         {
-        default:     op_bytes = 16 << vex.l; break;
-        case vex_f3: op_bytes = 4;           break;
-        case vex_f2: op_bytes = 8;           break;
+        default:
+            op_bytes = 16 << evex.lr;
+            break;
+        case vex_f3:
+            generate_exception_if(evex_encoded() && evex.w, EXC_UD);
+            op_bytes = 4;
+            break;
+        case vex_f2:
+            generate_exception_if(evex_encoded() && !evex.w, EXC_UD);
+            op_bytes = 8;
+            break;
         }
         break;
 

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v4 05/44] x86emul: support basic AVX512 moves
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (3 preceding siblings ...)
  2018-09-25 13:28   ` [PATCH v4 04/44] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
@ 2018-09-25 13:28   ` Jan Beulich
  2018-11-13 17:12     ` Andrew Cooper
  2018-09-25 13:29   ` [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
                     ` (38 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:28 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note: SDM Vol 2 rev 067 is not really consistent about EVEX.L'L for LIG
      insns - the only place where this is made explicit is a table in
      the section titled "Vector Length Orthogonality": while values 0,
      1, and 2 are tolerated, a value of 3 uniformly leads to #UD.
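
(In code this amounts to - a sketch only, using the emulator's
existing generate_exception_if() -

    generate_exception_if(evex.lr == 3, EXC_UD);

for LIG insns, with all other L'L values accepted.)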

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Introduce d8s_dq64 to deal with 32-bit mode VMOVD with EVEX.W set.
    Adjust a comment.
v3: Restrict k-reg reading to insns with memory operand. Shrink scope of
    "disp8scale".
v2: Move "full" into more narrow scope.

--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -1985,6 +1985,53 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovq %xmm1,32(%edx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_mem);
+
+        asm volatile ( "pcmpgtb %%xmm1, %%xmm1\n"
+                       put_insn(evex_vmovq_to_mem, "%{evex%} vmovq %%xmm1, 32(%0)")
+                       :: "d" (NULL) );
+
+        memset(res, 0xdb, 64);
+        set_insn(evex_vmovq_to_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_mem) ||
+             *((uint64_t *)res + 4) ||
+             memcmp(res, res + 10, 24) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovq 32(%edx),%xmm0...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_from_mem);
+
+        asm volatile ( "pcmpeqb %%xmm0, %%xmm0\n"
+                       put_insn(evex_vmovq_from_mem, "%{evex%} vmovq 32(%0), %%xmm0")
+                       :: "d" (NULL) );
+
+        set_insn(evex_vmovq_from_mem);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_from_mem) )
+            goto fail;
+        asm ( "vmovq %1, %%xmm1\n\t"
+              "vpcmpeqq %%zmm0, %%zmm1, %%k0\n"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movdqu %xmm2,(%ecx)...");
     if ( stack_exec && cpu_has_sse2 )
     {
@@ -2085,6 +2132,118 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vmovdqu32 %zmm2,(%ecx){%k1}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovdqu32_to_mem);
+
+        memset(res, 0x55, 128);
+
+        asm volatile ( "vpcmpeqd %%ymm2, %%ymm2, %%ymm2\n\t"
+                       "kmovw %1,%%k1\n"
+                       put_insn(vmovdqu32_to_mem,
+                                "vmovdqu32 %%zmm2, (%0)%{%%k1%}")
+                       :: "c" (NULL), "rm" (res[0]) );
+        set_insn(vmovdqu32_to_mem);
+
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || memcmp(res + 16, res + 24, 32) ||
+             !check_eip(vmovdqu32_to_mem) )
+            goto fail;
+
+        res[16] = ~0; res[18] = ~0; res[20] = ~0; res[22] = ~0;
+        res[24] =  0; res[26] =  0; res[28] =  0; res[30] =  0;
+        if ( memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu32 64(%edx),%zmm2{%k2}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovdqu32_from_mem);
+
+        asm volatile ( "knotw %%k1, %%k2\n"
+                       put_insn(vmovdqu32_from_mem,
+                                "vmovdqu32 64(%0), %%zmm2%{%%k2%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovdqu32_from_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovdqu32_from_mem) )
+            goto fail;
+        asm ( "vpcmpeqd %1, %%zmm2, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[0]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu16 %zmm3,(%ecx){%k1}...");
+    if ( stack_exec && cpu_has_avx512bw )
+    {
+        decl_insn(vmovdqu16_to_mem);
+
+        memset(res, 0x55, 128);
+
+        asm volatile ( "vpcmpeqw %%ymm3, %%ymm3, %%ymm3\n\t"
+                       "kmovd %1,%%k1\n"
+                       put_insn(vmovdqu16_to_mem,
+                                "vmovdqu16 %%zmm3, (%0)%{%%k1%}")
+                       :: "c" (NULL), "rm" (res[0]) );
+        set_insn(vmovdqu16_to_mem);
+
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || memcmp(res + 16, res + 24, 32) ||
+             !check_eip(vmovdqu16_to_mem) )
+            goto fail;
+
+        for ( i = 16; i < 24; ++i )
+            res[i] |= 0x0000ffff;
+        for ( ; i < 32; ++i )
+            res[i] &= 0xffff0000;
+        if ( memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu16 64(%edx),%zmm3{%k2}...");
+    if ( stack_exec && cpu_has_avx512bw )
+    {
+        decl_insn(vmovdqu16_from_mem);
+
+        asm volatile ( "knotd %%k1, %%k2\n"
+                       put_insn(vmovdqu16_from_mem,
+                                "vmovdqu16 64(%0), %%zmm3%{%%k2%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovdqu16_from_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovdqu16_from_mem) )
+            goto fail;
+        asm ( "vpcmpeqw %1, %%zmm3, %%k0\n\t"
+              "kmovd %%k0, %0" : "=r" (rc) : "m" (res[0]) );
+        if ( rc != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movsd %xmm5,(%ecx)...");
     memset(res, 0x77, 64);
     memset(res + 10, 0x66, 8);
@@ -2186,6 +2345,71 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vmovsd %xmm5,16(%ecx){%k3}...");
+    memset(res, 0x88, 128);
+    memset(res + 20, 0x77, 8);
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovsd_masked_to_mem);
+
+        asm volatile ( "vbroadcastsd %0, %%ymm5\n\t"
+                       "kxorw %%k3, %%k3, %%k3\n"
+                       put_insn(vmovsd_masked_to_mem,
+                                "vmovsd %%xmm5, 16(%1)%{%%k3%}")
+                       :: "m" (res[20]), "c" (NULL) );
+
+        set_insn(vmovsd_masked_to_mem);
+        regs.ecx = 0;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(vmovsd_masked_to_mem) )
+            goto fail;
+
+        asm volatile ( "kmovw %0, %%k3\n" :: "m" (res[20]) );
+
+        set_insn(vmovsd_masked_to_mem);
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(vmovsd_masked_to_mem) ||
+             memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+    {
+        printf("skipped\n");
+        memset(res + 4, 0x77, 8);
+    }
+
+    printf("%-40s", "Testing vmovaps (%edx),%zmm7{%k3}{z}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovaps_masked_from_mem);
+
+        asm volatile ( "vpcmpeqd %%xmm7, %%xmm7, %%xmm7\n\t"
+                       "vbroadcastss %%xmm7, %%zmm7\n"
+                       put_insn(vmovaps_masked_from_mem,
+                                "vmovaps (%0), %%zmm7%{%%k3%}%{z%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovaps_masked_from_mem);
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovaps_masked_from_mem) )
+            goto fail;
+        asm ( "vcmpeqps %1, %%zmm7, %%k0\n\t"
+              "vxorps %%xmm0, %%xmm0, %%xmm0\n\t"
+              "vcmpeqps %%zmm0, %%zmm7, %%k1\n\t"
+              "kxorw %%k1, %%k0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[16]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movd %mm3,32(%ecx)...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2341,6 +2565,55 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovd %xmm3,32(%ecx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_to_mem);
+
+        asm volatile ( "pcmpeqb %%xmm3, %%xmm3\n"
+                       put_insn(evex_vmovd_to_mem,
+                                "%{evex%} vmovd %%xmm3, 32(%0)")
+                       :: "c" (NULL) );
+
+        memset(res, 0xbd, 64);
+        set_insn(evex_vmovd_to_mem);
+        regs.ecx = (unsigned long)res;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovd_to_mem) ||
+             res[8] + 1 ||
+             memcmp(res, res + 9, 28) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovd 32(%ecx),%xmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_from_mem);
+
+        asm volatile ( "pcmpeqb %%xmm4, %%xmm4\n"
+                       put_insn(evex_vmovd_from_mem,
+                                "%{evex%} vmovd 32(%0), %%xmm4")
+                       :: "c" (NULL) );
+
+        set_insn(evex_vmovd_from_mem);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovd_from_mem) )
+            goto fail;
+        asm ( "vmovd %1, %%xmm0\n\t"
+              "vpcmpeqd %%zmm4, %%zmm0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movd %mm3,%ebx...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2507,6 +2780,57 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovd %xmm2,%ebx...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_to_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpeqb %%xmm2, %%xmm2\n"
+                       put_insn(evex_vmovd_to_reg,
+                                "%{evex%} vmovd %%xmm2, %%ebx")
+                       :: );
+
+        set_insn(evex_vmovd_to_reg);
+#ifdef __x86_64__
+        regs.rbx = 0xbdbdbdbdbdbdbdbdUL;
+#else
+        regs.ebx = 0xbdbdbdbdUL;
+#endif
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(evex_vmovd_to_reg) ||
+             regs.ebx != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovd %ebx,%xmm1...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_from_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpgtb %%xmm1, %%xmm1\n"
+                       put_insn(evex_vmovd_from_reg,
+                                "%{evex%} vmovd %%ebx, %%xmm1")
+                       :: );
+
+        set_insn(evex_vmovd_from_reg);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(evex_vmovd_from_reg) )
+            goto fail;
+        asm ( "vmovd %1, %%xmm0\n\t"
+              "vpcmpeqd %%zmm1, %%zmm0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #ifdef __x86_64__
     printf("%-40s", "Testing movq %mm3,32(%ecx)...");
     if ( stack_exec && cpu_has_mmx )
@@ -2584,6 +2908,36 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovq %xmm11,32(%ecx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_mem2);
+
+        asm volatile ( "pcmpeqb %%xmm11, %%xmm11\n"
+#if 0 /* This may not work, as the assembler might pick opcode D6. */
+                       put_insn(evex_vmovq_to_mem2,
+                                "{evex} vmovq %%xmm11, 32(%0)")
+#else
+                       put_insn(evex_vmovq_to_mem2,
+                                ".byte 0x62, 0xf1, 0xfd, 0x08, 0x7e, 0x49, 0x04")
+#endif
+                       :: "c" (NULL) );
+
+        memset(res, 0xbd, 64);
+        set_insn(evex_vmovq_to_mem2);
+        regs.ecx = (unsigned long)res;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_mem2) ||
+             *((long *)res + 4) + 1 ||
+             memcmp(res, res + 10, 24) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movq %mm3,%rbx...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2643,6 +2997,28 @@ int main(int argc, char **argv)
     }
     else
         printf("skipped\n");
+
+    printf("%-40s", "Testing vmovq %xmm22,%rbx...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpeqq %%xmm2, %%xmm2\n\t"
+                       "vmovq %%xmm2, %%xmm22\n"
+                       put_insn(evex_vmovq_to_reg, "vmovq %%xmm22, %%rbx")
+                       :: );
+
+        set_insn(evex_vmovq_to_reg);
+        regs.rbx = 0xbdbdbdbdbdbdbdbdUL;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_reg) ||
+             regs.rbx + 1 )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
 #endif
 
     printf("%-40s", "Testing maskmovq %mm4,%mm4...");
@@ -2812,6 +3188,32 @@ int main(int argc, char **argv)
             goto fail;
         printf("okay\n");
     }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovntdqa 64(%ecx),%zmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovntdqa);
+
+        asm volatile ( "vpxor %%xmm4, %%xmm4, %%xmm4\n"
+                       put_insn(evex_vmovntdqa, "vmovntdqa 64(%0), %%zmm4")
+                       :: "c" (NULL) );
+
+        set_insn(evex_vmovntdqa);
+        memset(res, 0x55, 192);
+        memset(res + 16, 0xff, 64);
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovntdqa) )
+            goto fail;
+        asm ( "vpbroadcastd %1, %%zmm2\n\t"
+              "vpcmpeqd %%zmm4, %%zmm2, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "0" (~0) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
     else
         printf("skipped\n");
 
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -210,6 +210,7 @@ int emul_test_get_fpu(
         if ( cpu_has_avx )
             break;
     case X86EMUL_FPU_opmask:
+    case X86EMUL_FPU_zmm:
         if ( cpu_has_avx512f )
             break;
     default:
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -266,6 +266,16 @@ static inline uint64_t xgetbv(uint32_t x
     (res.b & (1U << 30)) != 0; \
 })
 
+#define cpu_has_avx512vl ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.b = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.b & (1U << 31)) != 0; \
+})
+
 int emul_test_cpuid(
     uint32_t leaf,
     uint32_t subleaf,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -243,9 +243,27 @@ enum simd_opsize {
 };
 typedef uint8_t simd_opsize_t;
 
+enum disp8scale {
+    /* Values 0 ... 4 are explicit sizes. */
+    d8s_bw = 5,
+    d8s_dq,
+    /* EVEX.W ignored outside of 64-bit mode. */
+    d8s_dq64,
+    /*
+     * All further values must strictly be last and in the order
+     * given so that arithmetic on the values works.
+     */
+    d8s_vl,
+    d8s_vl_by_2,
+    d8s_vl_by_4,
+    d8s_vl_by_8,
+};
+typedef uint8_t disp8scale_t;
+
 static const struct twobyte_table {
     opcode_desc_t desc;
-    simd_opsize_t size;
+    simd_opsize_t size:4;
+    disp8scale_t d8s:4;
 } twobyte_table[256] = {
     [0x00] = { ModRM },
     [0x01] = { ImplicitOps|ModRM },
@@ -260,8 +278,8 @@ static const struct twobyte_table {
     [0x0d] = { ImplicitOps|ModRM },
     [0x0e] = { ImplicitOps },
     [0x0f] = { ModRM|SrcImmByte },
-    [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp },
-    [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp },
+    [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
+    [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
     [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
@@ -270,10 +288,10 @@ static const struct twobyte_table {
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
     [0x20 ... 0x21] = { DstMem|SrcImplicit|ModRM },
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
-    [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp },
-    [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp },
+    [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
+    [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp },
+    [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp },
     [0x30 ... 0x35] = { ImplicitOps },
@@ -292,8 +310,8 @@ static const struct twobyte_table {
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
-    [0x6e] = { DstImplicit|SrcMem|ModRM|Mov },
-    [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int },
+    [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
+    [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
@@ -301,8 +319,8 @@ static const struct twobyte_table {
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x7e] = { DstMem|SrcImplicit|ModRM|Mov },
-    [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int },
+    [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
+    [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x80 ... 0x8f] = { DstImplicit|SrcImm },
     [0x90 ... 0x9f] = { ByteOp|DstMem|SrcNone|ModRM|Mov },
     [0xa0 ... 0xa1] = { ImplicitOps|Mov },
@@ -344,14 +362,14 @@ static const struct twobyte_table {
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
-    [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int },
+    [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -406,6 +424,7 @@ static const struct ext0f38_table {
     uint8_t to_mem:1;
     uint8_t two_op:1;
     uint8_t vsib:1;
+    disp8scale_t d8s:4;
 } ext0f38_table[256] = {
     [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
@@ -418,7 +437,7 @@ static const struct ext0f38_table {
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
-    [0x2a] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_other },
     [0x2e ... 0x2f] = { .simd_size = simd_other, .to_mem = 1 },
@@ -656,6 +675,22 @@ union evex {
     };
 };
 
+#define EVEX_PFX_BYTES 4
+#define init_evex(stub) ({ \
+    uint8_t *buf_ = get_stub(stub); \
+    buf_[0] = 0x62; \
+    buf_ + EVEX_PFX_BYTES; \
+})
+
+#define copy_EVEX(ptr, evex) ({ \
+    if ( !mode_64bit() ) \
+        (evex).reg |= 8; \
+    (ptr)[1 - EVEX_PFX_BYTES] = (evex).raw[0]; \
+    (ptr)[2 - EVEX_PFX_BYTES] = (evex).raw[1]; \
+    (ptr)[3 - EVEX_PFX_BYTES] = (evex).raw[2]; \
+    container_of((ptr) + 1 - EVEX_PFX_BYTES, typeof(evex), raw[0]); \
+})
+
 #define rep_prefix()   (vex.pfx >= vex_f3)
 #define repe_prefix()  (vex.pfx == vex_f3)
 #define repne_prefix() (vex.pfx == vex_f2)
@@ -768,6 +803,7 @@ typedef union {
     uint64_t mmx;
     uint64_t __attribute__ ((aligned(16))) xmm[2];
     uint64_t __attribute__ ((aligned(32))) ymm[4];
+    uint64_t __attribute__ ((aligned(64))) zmm[8];
 } mmval_t;
 
 /*
@@ -1183,6 +1219,11 @@ static int _get_fpu(
 
     switch ( type )
     {
+    case X86EMUL_FPU_zmm:
+        if ( !(xcr0 & X86_XCR0_ZMM) || !(xcr0 & X86_XCR0_HI_ZMM) ||
+             !(xcr0 & X86_XCR0_OPMASK) )
+            return X86EMUL_UNHANDLEABLE;
+        /* fall through */
     case X86EMUL_FPU_ymm:
         if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_YMM) )
             return X86EMUL_UNHANDLEABLE;
@@ -1777,6 +1818,7 @@ static bool vcpu_has(
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
+#define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -2150,6 +2192,65 @@ static unsigned long *decode_vex_gpr(
     return decode_gpr(regs, ~vex_reg & (mode_64bit() ? 0xf : 7));
 }
 
+static unsigned int decode_disp8scale(enum disp8scale scale,
+                                      const struct x86_emulate_state *state)
+{
+    switch ( scale )
+    {
+    case d8s_bw:
+        return state->evex.w;
+
+    default:
+        if ( scale < d8s_vl )
+            return scale;
+        if ( state->evex.br )
+        {
+    case d8s_dq:
+            return 2 + state->evex.w;
+        }
+        break;
+
+    case d8s_dq64:
+        return 2 + (state->op_bytes == 8);
+    }
+
+    switch ( state->simd_size )
+    {
+    case simd_any_fp:
+    case simd_single_fp:
+        if ( !(state->evex.pfx & VEX_PREFIX_SCALAR_MASK) )
+            break;
+        /* fall through */
+    case simd_scalar_opc:
+    case simd_scalar_vexw:
+        return 2 + state->evex.w;
+
+    case simd_128:
+        /* These should have an explicit size specified. */
+        ASSERT_UNREACHABLE();
+        return 4;
+
+    default:
+        break;
+    }
+
+    return 4 + state->evex.lr - (scale - d8s_vl);
+}
+
+#define avx512_vlen_check(lig) do { \
+    switch ( evex.lr ) \
+    { \
+    default: \
+        generate_exception(EXC_UD); \
+    case 2: \
+        break; \
+    case 0: case 1: \
+        if ( !(lig) ) \
+            host_and_vcpu_must_have(avx512vl); \
+        break; \
+    } \
+} while ( false )
+
 static bool is_aligned(enum x86_segment seg, unsigned long offs,
                        unsigned int size, struct x86_emulate_ctxt *ctxt,
                        const struct x86_emulate_ops *ops)
@@ -2399,6 +2500,7 @@ x86_decode_twobyte(
         if ( vex.pfx == vex_f3 ) /* movq xmm/m64,xmm */
         {
     case X86EMUL_OPC_VEX_F3(0, 0x7e): /* vmovq xmm/m64,xmm */
+    case X86EMUL_OPC_EVEX_F3(0, 0x7e): /* vmovq xmm/m64,xmm */
             state->desc = DstImplicit | SrcMem | TwoOp;
             state->simd_size = simd_other;
             /* Avoid the state->desc clobbering of TwoOp below. */
@@ -2469,7 +2571,7 @@ x86_decode_twobyte(
     }
 
     /*
-     * Scalar forms of most VEX-encoded TwoOp instructions have
+     * Scalar forms of most VEX-/EVEX-encoded TwoOp instructions have
      * three operands.  Those which do really have two operands
      * should have exited earlier.
      */
@@ -2834,6 +2936,8 @@ x86_decode(
 
     if ( d & ModRM )
     {
+        unsigned int disp8scale = 0;
+
         d &= ~ModRM;
 #undef ModRM /* Only its aliases are valid to use from here on. */
         modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3);
@@ -2876,6 +2980,9 @@ x86_decode(
             break;
 
         case ext_0f:
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(twobyte_table[b].d8s, state);
+
             switch ( b )
             {
             case 0x20: /* mov cr,reg */
@@ -2889,6 +2996,11 @@ x86_decode(
                  */
                 modrm_mod = 3;
                 break;
+
+            case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
+                if ( disp8scale == 2 && evex.pfx == vex_f3 )
+                    disp8scale = 3;
+                break;
             }
             break;
 
@@ -2900,6 +3012,8 @@ x86_decode(
             if ( ext0f38_table[b].vsib )
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
             break;
 
         case ext_8f09:
@@ -2968,7 +3082,7 @@ x86_decode(
                     ea.mem.off = insn_fetch_type(int16_t);
                 break;
             case 1:
-                ea.mem.off += insn_fetch_type(int8_t);
+                ea.mem.off += insn_fetch_type(int8_t) << disp8scale;
                 break;
             case 2:
                 ea.mem.off += insn_fetch_type(int16_t);
@@ -3027,7 +3141,7 @@ x86_decode(
                 pc_rel = mode_64bit();
                 break;
             case 1:
-                ea.mem.off += insn_fetch_type(int8_t);
+                ea.mem.off += insn_fetch_type(int8_t) << disp8scale;
                 break;
             case 2:
                 ea.mem.off += insn_fetch_type(int32_t);
@@ -3228,10 +3342,11 @@ x86_emulate(
     struct x86_emulate_state state;
     int rc;
     uint8_t b, d, *opc = NULL;
-    unsigned int first_byte = 0, insn_bytes = 0;
+    unsigned int first_byte = 0, elem_bytes, insn_bytes = 0;
+    uint64_t op_mask = ~0ULL;
     bool singlestep = (_regs.eflags & X86_EFLAGS_TF) &&
 	    !is_branch_step(ctxt, ops);
-    bool sfence = false;
+    bool sfence = false, fault_suppression = false;
     struct operand src = { .reg = PTR_POISON };
     struct operand dst = { .reg = PTR_POISON };
     unsigned long cr4;
@@ -3272,6 +3387,7 @@ x86_emulate(
     b = ctxt->opcode;
     d = state.desc;
 #define state (&state)
+    elem_bytes = 4 << evex.w;
 
     generate_exception_if(state->not_64bit && mode_64bit(), EXC_UD);
 
@@ -3346,6 +3462,28 @@ x86_emulate(
         break;
     }
 
+    /* With a memory operand, fetch the mask register in use (if any). */
+    if ( ea.type == OP_MEM && evex.opmsk )
+    {
+        uint8_t *stb = get_stub(stub);
+
+        /* KMOV{W,Q} %k<n>, (%rax) */
+        stb[0] = 0xc4;
+        stb[1] = 0xe1;
+        stb[2] = cpu_has_avx512bw ? 0xf8 : 0x78;
+        stb[3] = 0x91;
+        stb[4] = evex.opmsk << 3;
+        insn_bytes = 5;
+        stb[5] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+
+        insn_bytes = 0;
+        put_stub(stub);
+
+        fault_suppression = true;
+    }
+
     /* Decode (but don't fetch) the destination operand: register or memory. */
     switch ( d & DstMask )
     {
@@ -5708,6 +5846,41 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 2;
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2b): /* vmovntp{s,d} [xyz]mm,mem */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk, EXC_UD);
+        sfence = true;
+        fault_suppression = false;
+        /* fall through */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x10): /* vmovup{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x10): /* vmovs{s,d} mem,xmm{k} */
+                                            /* vmovs{s,d} xmm,xmm,xmm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x11): /* vmovup{s,d} [xyz]mm,[xyz]mm/mem{k} */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x11): /* vmovs{s,d} xmm,mem{k} */
+                                            /* vmovs{s,d} xmm,xmm,xmm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x28): /* vmovap{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x29): /* vmovap{s,d} [xyz]mm,[xyz]mm/mem{k} */
+        /* vmovs{s,d} to/from memory have only two operands. */
+        if ( (b & ~1) == 0x10 && ea.type == OP_MEM )
+            d |= TwoOp;
+        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+    simd_zmm:
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            evex.b = 1;
+            opc[1] &= 0x38;
+        }
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        break;
+
     case X86EMUL_OPC_66(0x0f, 0x12):       /* movlpd m64,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x12):   /* vmovlpd m64,xmm,xmm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0x13):     /* movlp{s,d} xmm,m64 */
@@ -6348,6 +6521,41 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7e): /* vmov{d,q} xmm,r/m */
+        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert memory/GPR operand to (%rAX). */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0x38;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
+        dst.val = src.val;
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7e): /* vmovq xmm/m64,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
+        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.br,
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        d |= TwoOp;
+        op_bytes = 8;
+        goto simd_zmm;
+
     case X86EMUL_OPC_66(0x0f, 0xe7):     /* movntdq xmm,m128 */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe7): /* vmovntdq {x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
@@ -6368,6 +6576,30 @@ x86_emulate(
             goto simd_0f_avx;
         goto simd_0f_sse2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe7): /* vmovntdq [xyz]mm,mem */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk || evex.w,
+                              EXC_UD);
+        sfence = true;
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6f): /* vmovdqa{32,64} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x6f): /* vmovdqu{32,64} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7f): /* vmovdqa{32,64} [xyz]mm,[xyz]mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7f): /* vmovdqu{32,64} [xyz]mm,[xyz]mm/mem{k} */
+    vmovdqa:
+        generate_exception_if(evex.br, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x6f): /* vmovdqu{8,16} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x7f): /* vmovdqu{8,16} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512bw);
+        elem_bytes = 1 << evex.w;
+        goto vmovdqa;
+
     case X86EMUL_OPC_VEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
         generate_exception_if(vex.l, EXC_UD);
         d |= TwoOp;
@@ -7734,6 +7966,15 @@ x86_emulate(
         }
         goto movdqa;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2a): /* vmovntdqa mem,[xyz]mm */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk || evex.w,
+                              EXC_UD);
+        /* Ignore the non-temporal hint for now, using vmovdqa32 instead. */
+        asm volatile ( "mfence" ::: "memory" );
+        b = 0x6f;
+        evex.opcx = vex_0f;
+        goto vmovdqa;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2c): /* vmaskmovps mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2d): /* vmaskmovpd mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2e): /* vmaskmovps {x,y}mm,{x,y}mm,mem */
@@ -8787,17 +9028,27 @@ x86_emulate(
     else if ( state->simd_size )
     {
         generate_exception_if(!op_bytes, EXC_UD);
-        generate_exception_if(vex.opcx && (d & TwoOp) && vex.reg != 0xf,
+        generate_exception_if((vex.opcx && (d & TwoOp) &&
+                               (vex.reg != 0xf || (evex_encoded() && !evex.RX))),
                               EXC_UD);
 
         if ( !opc )
             BUG();
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+        {
+            opc[insn_bytes - EVEX_PFX_BYTES] = 0xc3;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            opc[insn_bytes - PFX_BYTES] = 0xc3;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
 
         if ( ea.type == OP_MEM )
         {
             uint32_t mxcsr = 0;
+            uint64_t full = 0;
 
             if ( op_bytes < 16 ||
                  (vex.opcx
@@ -8819,6 +9070,44 @@ x86_emulate(
                                   !is_aligned(ea.mem.seg, ea.mem.off, op_bytes,
                                               ctxt, ops),
                                   EXC_GP, 0);
+
+            if ( evex.br )
+            {
+                ASSERT((d & DstMask) != DstMem);
+                op_bytes = elem_bytes;
+            }
+            if ( evex.opmsk )
+            {
+                ASSERT(!(op_bytes % elem_bytes));
+                full = ~0ULL >> (64 - op_bytes / elem_bytes);
+                op_mask &= full;
+            }
+            if ( fault_suppression )
+            {
+                if ( !op_mask )
+                    goto simd_no_mem;
+                if ( !evex.br )
+                {
+                    first_byte = __builtin_ctzll(op_mask);
+                    op_mask >>= first_byte;
+                    full >>= first_byte;
+                    first_byte *= elem_bytes;
+                    op_bytes = (64 - __builtin_clzll(op_mask)) * elem_bytes;
+                }
+            }
+            /*
+             * Independent of fault suppression we may need to read (parts of)
+             * the memory operand for the purpose of merging without splitting
+             * the write below into multiple ones. Note that the EVEX.Z check
+             * here isn't strictly needed, as there are currently no insns
+             * allowing zeroing-merging on memory writes (we raise #UD during
+             * DstMem processing far above in that case), yet conceptually
+             * the read would then be unnecessary.
+             */
+            if ( evex.opmsk && !evex.z && (d & DstMask) == DstMem &&
+                 op_mask != full )
+                d = (d & ~SrcMask) | SrcMem;
+
             switch ( d & SrcMask )
             {
             case SrcMem:
@@ -8865,7 +9154,10 @@ x86_emulate(
             }
         }
         else
+        {
+        simd_no_mem:
             dst.type = OP_NONE;
+        }
 
         /* {,v}maskmov{q,dqu}, as an exception, uses rDI. */
         if ( likely((ctxt->opcode & ~(X86EMUL_OPC_PFX_MASK |
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -171,6 +171,7 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_xmm, /* SSE instruction set (%xmm0-%xmm7/15) */
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
     X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
+    X86EMUL_FPU_zmm, /* AVX512 instruction set (%zmm0-%zmm7/31) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -105,6 +105,7 @@
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
+#define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)





* [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (4 preceding siblings ...)
  2018-09-25 13:28   ` [PATCH v4 05/44] x86emul: support basic AVX512 moves Jan Beulich
@ 2018-09-25 13:29   ` Jan Beulich
  2018-11-12 17:42     ` Andrew Cooper
  2018-09-25 13:29   ` [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
                     ` (37 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:29 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Besides the already existing tests (which are going to be extended once
the respective ISA extension support is complete), let's also ensure for
every individual insn that its Disp8 scaling (and memory access width)
is correct.
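
To make the expectation concrete: with EVEX encodings the 8-bit
displacement is not a plain byte offset but is scaled by the memory
access width ("Disp8*N" compression). The harness therefore encodes a
Disp8 of 1 (instr[7] below) and verifies that exactly the bytes starting
at the scaled offset get accessed. As a rough sketch, assuming the
full-vector ("full" tuple) case, the effective offset works out as in
this hypothetical helper (not part of the patch):

    #include <stdint.h>

    /*
     * Hypothetical illustration only: N here is the full vector width,
     * i.e. 16 << EVEX.L'L, so a Disp8 of 1 addresses byte 64 for a ZMM
     * access. Scalar and partial-width tuples use a smaller N.
     */
    static long disp8_offset(int8_t disp8, unsigned int lr)
    {
        return (long)disp8 * (16 << lr);
    }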

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Introduce ESZ_d_WIG.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -139,7 +139,7 @@ $(addsuffix .h,$(SIMD) $(FMA) $(SG)): si
 
 xop.h: simd-fma.c
 
-$(TARGET): x86-emulate.o test_x86_emulator.o wrappers.o
+$(TARGET): x86-emulate.o test_x86_emulator.o evex-disp8.o wrappers.o
 	$(HOSTCC) $(HOSTCFLAGS) -o $@ $^
 
 .PHONY: clean
@@ -166,7 +166,7 @@ x86.h := $(addprefix $(XEN_ROOT)/tools/i
                      x86-vendors.h x86-defns.h msr-index.h)
 x86_emulate.h := x86-emulate.h x86_emulate/x86_emulate.h $(x86.h)
 
-x86-emulate.o test_x86_emulator.o wrappers.o: %.o: %.c $(x86_emulate.h)
+x86-emulate.o test_x86_emulator.o evex-disp8.o wrappers.o: %.o: %.c $(x86_emulate.h)
 	$(HOSTCC) $(HOSTCFLAGS) -c -g -o $@ $<
 
 x86-emulate.o: x86_emulate/x86_emulate.c
--- /dev/null
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -0,0 +1,452 @@
+#include <stdarg.h>
+#include <stdio.h>
+
+#include "x86-emulate.h"
+
+struct test {
+    const char *mnemonic;
+    unsigned int opc:8;
+    unsigned int spc:2;
+    unsigned int pfx:2;
+    unsigned int vsz:3;
+    unsigned int esz:4;
+    unsigned int scale:1;
+    unsigned int ext:3;
+};
+
+enum spc {
+    SPC_invalid,
+    SPC_0f,
+    SPC_0f38,
+    SPC_0f3a,
+};
+
+enum pfx {
+    PFX_,
+    PFX_66,
+    PFX_f3,
+    PFX_f2
+};
+
+enum vl {
+    VL_128,
+    VL_256,
+    VL_512,
+};
+
+enum scale {
+    SC_vl,
+    SC_el,
+};
+
+enum vsz {
+    VSZ_vl,
+    VSZ_vl_2, /* VL / 2 */
+    VSZ_vl_4, /* VL / 4 */
+    VSZ_vl_8, /* VL / 8 */
+    /* "no broadcast" implied from here on. */
+    VSZ_el,
+    VSZ_el_2, /* EL * 2 */
+    VSZ_el_4, /* EL * 4 */
+    VSZ_el_8, /* EL * 8 */
+};
+
+enum esz {
+    ESZ_d,
+    ESZ_q,
+    ESZ_dq,
+    ESZ_sd,
+    ESZ_d_nb,
+    ESZ_q_nb,
+    /* "no broadcast" implied from here on. */
+#ifdef __i386__
+    ESZ_d_WIG,
+#endif
+    ESZ_b,
+    ESZ_w,
+    ESZ_bw,
+};
+
+#ifndef __i386__
+# define ESZ_dq64 ESZ_dq
+#else
+# define ESZ_dq64 ESZ_d_WIG
+#endif
+
+#define INSNX(m, p, sp, o, e, vs, es, sc) { \
+    .mnemonic = #m, .opc = 0x##o, .spc = SPC_##sp, .pfx = PFX_##p, \
+    .vsz = VSZ_##vs, .esz = ESZ_##es, .scale = SC_##sc, .ext = 0##e \
+}
+#define INSN(m, p, sp, o, vs, es, sc) INSNX(m, p, sp, o, 0, vs, es, sc)
+#define INSN_PFP(m, sp, o) \
+    INSN(m##pd, 66, sp, o, vl, q, vl), \
+    INSN(m##ps,   , sp, o, vl, d, vl)
+#define INSN_PFP_NB(m, sp, o) \
+    INSN(m##pd, 66, sp, o, vl, q_nb, vl), \
+    INSN(m##ps,   , sp, o, vl, d_nb, vl)
+#define INSN_SFP(m, sp, o) \
+    INSN(m##sd, f2, sp, o, el, q, el), \
+    INSN(m##ss, f3, sp, o, el, d, el)
+
+#define INSN_FP(m, sp, o) \
+    INSN_PFP(m, sp, o), \
+    INSN_SFP(m, sp, o)
+
+static const struct test avx512f_all[] = {
+    INSN_SFP(mov,            0f, 10),
+    INSN_SFP(mov,            0f, 11),
+    INSN_PFP_NB(mova,        0f, 28),
+    INSN_PFP_NB(mova,        0f, 29),
+    INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
+    INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
+    INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
+    INSN(movdqa64,     66,   0f, 7f,    vl,   q_nb, vl),
+    INSN(movdqu32,     f3,   0f, 6f,    vl,   d_nb, vl),
+    INSN(movdqu32,     f3,   0f, 7f,    vl,   d_nb, vl),
+    INSN(movdqu64,     f3,   0f, 6f,    vl,   q_nb, vl),
+    INSN(movdqu64,     f3,   0f, 7f,    vl,   q_nb, vl),
+    INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
+    INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
+    INSN_PFP_NB(movnt,       0f, 2b),
+    INSN_PFP_NB(movu,        0f, 10),
+    INSN_PFP_NB(movu,        0f, 11),
+};
+
+static const struct test avx512f_128[] = {
+    INSN(mov,       66,   0f, 6e, el, dq64, el),
+    INSN(mov,       66,   0f, 7e, el, dq64, el),
+    INSN(movq,      f3,   0f, 7e, el,    q, el),
+    INSN(movq,      66,   0f, d6, el,    q, el),
+};
+
+static const struct test avx512bw_all[] = {
+    INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
+    INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
+    INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
+    INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+};
+
+static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
+static const unsigned char vl_128[] = { VL_128 };
+
+/*
+ * This table indicates the presence of an immediate (byte) for an opcode
+ * space 0f major opcode. It is indexed by the high nibble of the major
+ * opcode byte, with each element then bit-indexed by the low nibble.
+ */
+static const uint16_t imm0f[16] = {
+    [0x7] = (1 << 0x0) /* vpshuf* */ |
+            (1 << 0x1) /* vps{ll,ra,rl}w */ |
+            (1 << 0x2) /* vps{l,r}ld, vp{rol,ror,sra}{d,q} */ |
+            (1 << 0x3) /* vps{l,r}l{,d}q */,
+    [0xc] = (1 << 0x2) /* vcmp{p,s}{d,s} */ |
+            (1 << 0x4) /* vpinsrw */ |
+            (1 << 0x5) /* vpextrw */ |
+            (1 << 0x6) /* vshufp{d,s} */,
+};
+
+static struct x86_emulate_ops emulops;
+
+static unsigned int accessed[3 * 64];
+
+static bool record_access(enum x86_segment seg, unsigned long offset,
+                          unsigned int bytes)
+{
+    while ( bytes-- )
+    {
+        if ( offset >= ARRAY_SIZE(accessed) )
+            return false;
+        ++accessed[offset++];
+    }
+
+    return true;
+}
+
+static int read(enum x86_segment seg, unsigned long offset, void *p_data,
+                unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+    if ( !record_access(seg, offset, bytes) )
+        return X86EMUL_UNHANDLEABLE;
+    memset(p_data, 0, bytes);
+    return X86EMUL_OKAY;
+}
+
+static int write(enum x86_segment seg, unsigned long offset, void *p_data,
+                 unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+    if ( !record_access(seg, offset, bytes) )
+        return X86EMUL_UNHANDLEABLE;
+    return X86EMUL_OKAY;
+}
+
+static void test_one(const struct test *test, enum vl vl,
+                     unsigned char *instr, struct x86_emulate_ctxt *ctxt)
+{
+    unsigned int vsz, esz, i;
+    int rc;
+    bool sg = strstr(test->mnemonic, "gather") ||
+              strstr(test->mnemonic, "scatter");
+    bool imm = test->spc == SPC_0f3a ||
+               (test->spc == SPC_0f &&
+                (imm0f[test->opc >> 4] & (1 << (test->opc & 0xf))));
+    union evex {
+        uint8_t raw[3];
+        struct {
+            uint8_t opcx:2;
+            uint8_t mbz:2;
+            uint8_t R:1;
+            uint8_t b:1;
+            uint8_t x:1;
+            uint8_t r:1;
+            uint8_t pfx:2;
+            uint8_t mbs:1;
+            uint8_t reg:4;
+            uint8_t w:1;
+            uint8_t opmsk:3;
+            uint8_t RX:1;
+            uint8_t bcst:1;
+            uint8_t lr:2;
+            uint8_t z:1;
+        };
+    } evex = {
+        .opcx = test->spc, .pfx = test->pfx, .lr = vl,
+        .R = 1, .b = 1, .x = 1, .r = 1, .mbs = 1,
+        .reg = 0xf, .RX = 1, .opmsk = sg,
+    };
+
+    switch ( test->esz )
+    {
+    case ESZ_b:
+        esz = 1;
+        break;
+
+    case ESZ_w:
+        esz = 2;
+        evex.w = 1;
+        break;
+
+#ifdef __i386__
+    case ESZ_d_WIG:
+        evex.w = 1;
+        /* fall through */
+#endif
+    case ESZ_d: case ESZ_d_nb:
+        esz = 4;
+        break;
+
+    case ESZ_q: case ESZ_q_nb:
+        esz = 8;
+        evex.w = 1;
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    switch ( test->vsz )
+    {
+    case VSZ_vl:
+        vsz = 16 << vl;
+        break;
+
+    case VSZ_vl_2:
+        vsz = 8 << vl;
+        break;
+
+    case VSZ_vl_4:
+        vsz = 4 << vl;
+        break;
+
+    case VSZ_vl_8:
+        vsz = 2 << vl;
+        break;
+
+    case VSZ_el:
+        vsz = esz;
+        break;
+
+    case VSZ_el_2:
+        vsz = esz * 2;
+        break;
+
+    case VSZ_el_4:
+        vsz = esz * 4;
+        break;
+
+    case VSZ_el_8:
+        vsz = esz * 8;
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    /*
+     * Note: SIB addressing is used here, such that S/G insns can be handled
+     * without extra conditionals.
+     */
+    instr[0] = 0x62;
+    instr[1] = evex.raw[0];
+    instr[2] = evex.raw[1];
+    instr[3] = evex.raw[2];
+    instr[4] = test->opc;
+    instr[5] = 0x44 | (test->ext << 3); /* ModR/M */
+    instr[6] = 0x12; /* SIB: base rDX, index none / xMM4 */
+    instr[7] = 1; /* Disp8 */
+    instr[8] = 0; /* immediate, if any */
+
+    asm volatile ( "kxnorw %k1, %k1, %k1" );
+    asm volatile ( "vxorps %xmm4, %xmm4, %xmm4" );
+
+    ctxt->regs->eip = (unsigned long)&instr[0];
+    ctxt->regs->edx = 0;
+    memset(accessed, 0, sizeof(accessed));
+
+    rc = x86_emulate(ctxt, &emulops);
+    if ( rc != X86EMUL_OKAY ||
+         (ctxt->regs->eip != (unsigned long)&instr[8 + imm]) )
+        goto fail;
+
+    for ( i = 0; i < vsz; ++i )
+         if ( accessed[i] )
+             goto fail;
+    for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
+         if ( accessed[i] != (sg ? vsz / esz : 1) )
+             goto fail;
+    for ( ; i < ARRAY_SIZE(accessed); ++i )
+         if ( accessed[i] )
+             goto fail;
+
+    /* Also check the broadcast case, if available. */
+    if ( test->vsz >= VSZ_el || test->scale != SC_vl )
+        return;
+
+    switch ( test->esz )
+    {
+    case ESZ_d_nb: case ESZ_q_nb:
+    case ESZ_b: case ESZ_w: case ESZ_bw:
+        return;
+
+    case ESZ_d: case ESZ_q:
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    evex.bcst = 1;
+    instr[3] = evex.raw[2];
+
+    ctxt->regs->eip = (unsigned long)&instr[0];
+    memset(accessed, 0, sizeof(accessed));
+
+    rc = x86_emulate(ctxt, &emulops);
+    if ( rc != X86EMUL_OKAY ||
+         (ctxt->regs->eip != (unsigned long)&instr[8 + imm]) )
+        goto fail;
+
+    for ( i = 0; i < esz; ++i )
+         if ( accessed[i] )
+             goto fail;
+    for ( ; i < esz * 2; ++i )
+         if ( accessed[i] != 1 )
+             goto fail;
+    for ( ; i < ARRAY_SIZE(accessed); ++i )
+         if ( accessed[i] )
+             goto fail;
+
+    return;
+
+ fail:
+    printf("failed (v%s%s %u-bit)\n", test->mnemonic,
+           evex.bcst ? "/bcst" : "", 128 << vl);
+    exit(1);
+}
+
+static void test_pair(const struct test *tmpl, enum vl vl,
+                      enum esz esz1, const char *suffix1,
+                      enum esz esz2, const char *suffix2,
+                      unsigned char *instr, struct x86_emulate_ctxt *ctxt)
+{
+    struct test test = *tmpl;
+    char mnemonic[24];
+
+    test.esz = esz1;
+    snprintf(mnemonic, ARRAY_SIZE(mnemonic), "%s%s", tmpl->mnemonic, suffix1);
+    test.mnemonic = mnemonic;
+    test_one(&test, vl, instr, ctxt);
+
+    test.esz = esz2;
+    snprintf(mnemonic, ARRAY_SIZE(mnemonic), "%s%s", tmpl->mnemonic, suffix2);
+    test.mnemonic = mnemonic;
+    test_one(&test, vl, instr, ctxt);
+}
+
+static void test_group(const struct test tests[], unsigned int nr_test,
+                       const unsigned char vl[], unsigned int nr_vl,
+                       void *instr, struct x86_emulate_ctxt *ctxt)
+{
+    unsigned int i, j;
+
+    for ( i = 0; i < nr_test; ++i )
+    {
+        for ( j = 0; j < nr_vl; ++j )
+        {
+            if ( vl[0] == VL_512 && vl[j] != VL_512 && !cpu_has_avx512vl )
+                continue;
+
+            switch ( tests[i].esz )
+            {
+            default:
+                test_one(&tests[i], vl[j], instr, ctxt);
+                break;
+
+            case ESZ_bw:
+                test_pair(&tests[i], vl[j], ESZ_b, "b", ESZ_w, "w",
+                          instr, ctxt);
+                break;
+
+            case ESZ_dq:
+                test_pair(&tests[i], vl[j], ESZ_d, "d", ESZ_q, "q",
+                          instr, ctxt);
+                break;
+
+#ifdef __i386__
+            case ESZ_d_WIG:
+                test_pair(&tests[i], vl[j], ESZ_d, "/W0",
+                          ESZ_d_WIG, "/W1", instr, ctxt);
+                break;
+#endif
+
+            case ESZ_sd:
+                test_pair(&tests[i], vl[j],
+                          ESZ_d, tests[i].vsz < VSZ_el ? "ps" : "ss",
+                          ESZ_q, tests[i].vsz < VSZ_el ? "pd" : "sd",
+                          instr, ctxt);
+                break;
+            }
+        }
+    }
+}
+
+void evex_disp8_test(void *instr, struct x86_emulate_ctxt *ctxt,
+                     const struct x86_emulate_ops *ops)
+{
+    emulops = *ops;
+    emulops.read = read;
+    emulops.write = write;
+
+#define RUN(feat, vl) do { \
+    if ( cpu_has_##feat ) \
+    { \
+        printf("%-40s", "Testing " #feat "/" #vl " disp8 handling..."); \
+        test_group(feat ## _ ## vl, ARRAY_SIZE(feat ## _ ## vl), \
+                   vl_ ## vl, ARRAY_SIZE(vl_ ## vl), instr, ctxt); \
+        printf("okay\n"); \
+    } \
+} while ( false )
+
+    RUN(avx512f, all);
+    RUN(avx512f, 128);
+    RUN(avx512bw, all);
+}
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3795,6 +3795,9 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    if ( stack_exec )
+        evex_disp8_test(instr, &ctxt, &emulops);
+
     for ( j = 0; j < ARRAY_SIZE(blobs); j++ )
     {
         if ( blobs[j].check_cpu && !blobs[j].check_cpu() )
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -71,6 +71,9 @@ WRAP(puts);
 
 #include "x86_emulate/x86_emulate.h"
 
+void evex_disp8_test(void *instr, struct x86_emulate_ctxt *ctxt,
+                     const struct x86_emulate_ops *ops);
+
 static inline uint64_t xgetbv(uint32_t xcr)
 {
     uint32_t lo, hi;





* [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (5 preceding siblings ...)
  2018-09-25 13:29   ` [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
@ 2018-09-25 13:29   ` Jan Beulich
  2018-11-12 17:50     ` Andrew Cooper
  2018-09-25 13:30   ` [PATCH v4 08/44] x86emul: use AVX512 logic for emulating V{,P}MASKMOV* Jan Beulich
                     ` (36 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:29 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

In order to be able to verify that the 32-bit variant builds and runs,
introduce a respective target (and the other necessary adjustments).
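
With this in place, the 32-bit harness can be built and run on a 64-bit
distro via "make run32" from tools/tests/x86_emulator: the target
recurses into the new 32/ subdirectory, whose Makefile overrides
XEN_COMPILE_ARCH to x86_32 and thus gets -m32 added to HOSTCFLAGS.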

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Moved ahead in series.
v3: New.

--- a/.gitignore
+++ b/.gitignore
@@ -240,6 +240,7 @@ tools/security/xensec_tool
 tools/tests/depriv/depriv-fd-checker
 tools/tests/x86_emulator/*.bin
 tools/tests/x86_emulator/*.tmp
+tools/tests/x86_emulator/32/x86_emulate
 tools/tests/x86_emulator/3dnow*.[ch]
 tools/tests/x86_emulator/asm
 tools/tests/x86_emulator/avx*.[ch]
--- /dev/null
+++ b/tools/tests/x86_emulator/32/Makefile
@@ -0,0 +1,4 @@
+override XEN_COMPILE_ARCH := x86_32
+XEN_ROOT = $(CURDIR)/../../../..
+VPATH += ..
+include ../Makefile
--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -1,5 +1,7 @@
 
+ifeq ($(XEN_ROOT),)
 XEN_ROOT=$(CURDIR)/../../..
+endif
 include $(XEN_ROOT)/tools/Rules.mk
 
 TARGET := test_x86_emulator
@@ -18,6 +20,12 @@ TESTCASES := blowfish $(SIMD) $(FMA) $(S
 
 OPMASK := avx512f avx512dq avx512bw
 
+ifeq ($(origin XEN_COMPILE_ARCH),override)
+
+HOSTCFLAGS += -m32
+
+else
+
 blowfish-cflags := ""
 blowfish-cflags-x86_32 := "-mno-accumulate-outgoing-args -Dstatic="
 
@@ -139,6 +147,8 @@ $(addsuffix .h,$(SIMD) $(FMA) $(SG)): si
 
 xop.h: simd-fma.c
 
+endif # 32-bit override
+
 $(TARGET): x86-emulate.o test_x86_emulator.o evex-disp8.o wrappers.o
 	$(HOSTCC) $(HOSTCFLAGS) -o $@ $^
 
@@ -153,6 +163,15 @@ distclean: clean
 .PHONY: install uninstall
 install uninstall:
 
+.PHONY: run32 clean32
+ifeq ($(XEN_TARGET_ARCH),x86_64)
+run32 clean32: %32: $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
+	$(MAKE) -C 32 $*
+clean: clean32
+else
+run32 clean32: %32: %
+endif
+
 x86_emulate:
 	[ -L $@ ] || ln -sf $(XEN_ROOT)/xen/arch/x86/$@
 






* [PATCH v4 08/44] x86emul: use AVX512 logic for emulating V{,P}MASKMOV*
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (6 preceding siblings ...)
  2018-09-25 13:29   ` [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
@ 2018-09-25 13:30   ` Jan Beulich
  2018-11-13 18:12     ` Andrew Cooper
  2018-09-25 13:31   ` [PATCH v4 09/44] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
                     ` (35 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:30 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

The more generic AVX512 implementation allows quite a bit of insn-
specific code to be dropped/shared.
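
For context: the VEX-encoded masked moves keep their mask in a vector
register (the top bit of each element), which the code reads into a GPR
via vmovmskp{s,d}. That bitmap can then simply be fed into the generic
AVX512 masking path (faking EVEX.opmsk), which narrows the memory access
to just the span covered by set mask bits. A sketch of that narrowing,
mirroring the fault suppression logic added earlier in the series
(non-zero mask assumed, helper name hypothetical):

    #include <stdint.h>

    /*
     * Return the start offset of the bytes covered by set mask bits and
     * store the length of that span in *bytes. Mirrors the
     * fault_suppression handling in x86_emulate(); op_mask must be
     * non-zero.
     */
    static unsigned int mask_span(uint64_t op_mask, unsigned int elem_bytes,
                                  unsigned int *bytes)
    {
        unsigned int first = __builtin_ctzll(op_mask); /* lowest set bit */

        op_mask >>= first;
        *bytes = (64 - __builtin_clzll(op_mask)) * elem_bytes;

        return first * elem_bytes;
    }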

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -439,8 +439,8 @@ static const struct ext0f38_table {
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
-    [0x2c ... 0x2d] = { .simd_size = simd_other },
-    [0x2e ... 0x2f] = { .simd_size = simd_other, .to_mem = 1 },
+    [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
+    [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int },
     [0x40] = { .simd_size = simd_packed_int },
@@ -449,8 +449,8 @@ static const struct ext0f38_table {
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
-    [0x8c] = { .simd_size = simd_other },
-    [0x8e] = { .simd_size = simd_other, .to_mem = 1 },
+    [0x8c] = { .simd_size = simd_packed_int },
+    [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp },
     [0x99] = { .simd_size = simd_scalar_vexw },
@@ -7984,6 +7984,8 @@ x86_emulate(
 
         generate_exception_if(ea.type != OP_MEM || vex.w, EXC_UD);
         host_and_vcpu_must_have(avx);
+        elem_bytes = 4 << (b & 1);
+    vmaskmov:
         get_fpu(X86EMUL_FPU_ymm);
 
         /*
@@ -7998,7 +8000,7 @@ x86_emulate(
         opc = init_prefixes(stub);
         pvex = copy_VEX(opc, vex);
         pvex->opcx = vex_0f;
-        if ( !(b & 1) )
+        if ( elem_bytes == 4 )
             pvex->pfx = vex_none;
         opc[0] = 0x50; /* vmovmskp{s,d} */
         /* Use %rax as GPR destination and VEX.vvvv as source. */
@@ -8011,21 +8013,9 @@ x86_emulate(
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
         put_stub(stub);
 
-        if ( !ea.val )
-            goto complete_insn;
-
-        op_bytes = 4 << (b & 1);
-        first_byte = __builtin_ctz(ea.val);
-        ea.val >>= first_byte;
-        first_byte *= op_bytes;
-        op_bytes *= 32 - __builtin_clz(ea.val);
-
-        /*
-         * Even for the memory write variant a memory read is needed, unless
-         * all set mask bits are contiguous.
-         */
-        if ( ea.val & (ea.val + 1) )
-            d = (d & ~SrcMask) | SrcMem;
+        evex.opmsk = 1; /* fake */
+        op_mask = ea.val;
+        fault_suppression = true;
 
         opc = init_prefixes(stub);
         opc[0] = b;
@@ -8076,63 +8066,10 @@ x86_emulate(
 
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
-    {
-        typeof(vex) *pvex;
-        unsigned int mask = vex.w ? 0x80808080U : 0x88888888U;
-
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
         host_and_vcpu_must_have(avx2);
-        get_fpu(X86EMUL_FPU_ymm);
-
-        /*
-         * While we can't reasonably provide fully correct behavior here
-         * (in particular, for writes, avoiding the memory read in anticipation
-         * of all elements in the range eventually being written), we can (and
-         * should) still limit the memory access to the smallest possible range
-         * (suppressing it altogether if all mask bits are clear), to provide
-         * correct faulting behavior. Read the mask bits via vmovmskp{s,d}
-         * for that purpose.
-         */
-        opc = init_prefixes(stub);
-        pvex = copy_VEX(opc, vex);
-        pvex->opcx = vex_0f;
-        opc[0] = 0xd7; /* vpmovmskb */
-        /* Use %rax as GPR destination and VEX.vvvv as source. */
-        pvex->r = 1;
-        pvex->b = !mode_64bit() || (vex.reg >> 3);
-        opc[1] = 0xc0 | (~vex.reg & 7);
-        pvex->reg = 0xf;
-        opc[2] = 0xc3;
-
-        invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
-        put_stub(stub);
-
-        /* Convert byte granular result to dword/qword granularity. */
-        ea.val &= mask;
-        if ( !ea.val )
-            goto complete_insn;
-
-        first_byte = __builtin_ctz(ea.val) & ~((4 << vex.w) - 1);
-        ea.val >>= first_byte;
-        op_bytes = 32 - __builtin_clz(ea.val);
-
-        /*
-         * Even for the memory write variant a memory read is needed, unless
-         * all set mask bits are contiguous.
-         */
-        if ( ea.val & (ea.val + ~mask + 1) )
-            d = (d & ~SrcMask) | SrcMem;
-
-        opc = init_prefixes(stub);
-        opc[0] = b;
-        /* Convert memory operand to (%rAX). */
-        rex_prefix &= ~REX_B;
-        vex.b = 1;
-        opc[1] = modrm & 0x38;
-        insn_bytes = PFX_BYTES + 2;
-
-        break;
-    }
+        elem_bytes = 4 << vex.w;
+        goto vmaskmov;
 
     case X86EMUL_OPC_VEX_66(0x0f38, 0x90): /* vpgatherd{d,q} {x,y}mm,mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x91): /* vpgatherq{d,q} {x,y}mm,mem,{x,y}mm */





* [PATCH v4 09/44] x86emul: support AVX512F legacy-equivalent arithmetic FP insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (7 preceding siblings ...)
  2018-09-25 13:30   ` [PATCH v4 08/44] x86emul: use AVX512 logic for emulating V{,P}MASKMOV* Jan Beulich
@ 2018-09-25 13:31   ` Jan Beulich
  2018-11-13 18:21     ` Andrew Cooper
  2018-09-25 13:32   ` [PATCH v4 10/44] x86emul: support AVX512DQ logic " Jan Beulich
                     ` (34 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:31 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -93,6 +93,10 @@ enum esz {
     INSN_SFP(m, sp, o)
 
 static const struct test avx512f_all[] = {
+    INSN_FP(add,             0f, 58),
+    INSN_FP(div,             0f, 5e),
+    INSN_FP(max,             0f, 5f),
+    INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
     INSN_SFP(mov,            0f, 11),
     INSN_PFP_NB(mova,        0f, 28),
@@ -110,6 +114,9 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movnt,       0f, 2b),
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
+    INSN_FP(mul,             0f, 59),
+    INSN_FP(sqrt,            0f, 51),
+    INSN_FP(sub,             0f, 5c),
 };
 
 static const struct test avx512f_128[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -300,12 +300,12 @@ static const struct twobyte_table {
     [0x3a] = { DstReg|SrcImmByte|ModRM },
     [0x40 ... 0x4f] = { DstReg|SrcMem|ModRM|Mov },
     [0x50] = { DstReg|SrcImplicit|ModRM|Mov },
-    [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp },
+    [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp, d8s_vl },
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
-    [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp },
+    [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp },
+    [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -5863,10 +5863,22 @@ x86_emulate(
         if ( (b & ~1) == 0x10 && ea.type == OP_MEM )
             d |= TwoOp;
         generate_exception_if(evex.br, EXC_UD);
-        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+        /* fall through */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x51):    /* vsqrtp{s,d} [xyz]mm/mem,[xyz]mm{k} */
+                                            /* vsqrts{s,d} xmm/m32,xmm,xmm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x58):    /* vadd{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x59):    /* vmul{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5c):    /* vsub{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5d):    /* vmin{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type == OP_MEM && evex.br &&
+                                (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
-        avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
     simd_zmm:
         get_fpu(X86EMUL_FPU_zmm);
         opc = init_evex(stub);






* [PATCH v4 10/44] x86emul: support AVX512DQ logic FP insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (8 preceding siblings ...)
  2018-09-25 13:31   ` [PATCH v4 09/44] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
@ 2018-09-25 13:32   ` Jan Beulich
  2018-11-13 18:56     ` Andrew Cooper
  2018-09-25 13:32   ` [PATCH v4 11/44] x86emul: support AVX512F "normal" FP compare insns Jan Beulich
                     ` (33 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:32 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -133,6 +133,13 @@ static const struct test avx512bw_all[]
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
 };
 
+static const struct test avx512dq_all[] = {
+    INSN_PFP(and,              0f, 54),
+    INSN_PFP(andn,             0f, 55),
+    INSN_PFP(or,               0f, 56),
+    INSN_PFP(xor,              0f, 57),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 
@@ -456,4 +463,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, all);
     RUN(avx512f, 128);
     RUN(avx512bw, all);
+    RUN(avx512dq, all);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -302,7 +302,7 @@ static const struct twobyte_table {
     [0x50] = { DstReg|SrcImplicit|ModRM|Mov },
     [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp, d8s_vl },
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
-    [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
+    [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
@@ -6329,6 +6329,17 @@ x86_emulate(
         dst.bytes = 4;
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x54): /* vandp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x55): /* vandnp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x56): /* vorp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x57): /* vxorp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type != OP_MEM && evex.br)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512dq);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     CASE_SIMD_ALL_FP(, 0x0f, 0x5a):        /* cvt{p,s}{s,d}2{p,s}{s,d} xmm/mem,xmm */
     CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} xmm/mem,xmm */
                                            /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm */






* [PATCH v4 11/44] x86emul: support AVX512F "normal" FP compare insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (9 preceding siblings ...)
  2018-09-25 13:32   ` [PATCH v4 10/44] x86emul: support AVX512DQ logic " Jan Beulich
@ 2018-09-25 13:32   ` Jan Beulich
  2018-11-13 19:04     ` Andrew Cooper
  2018-09-25 13:33   ` [PATCH v4 12/44] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
                     ` (32 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:32 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also correct the AVX counterpart's comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -94,6 +94,7 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -352,7 +352,7 @@ static const struct twobyte_table {
     [0xbf] = { DstReg|SrcMem16|ModRM|Mov },
     [0xc0] = { ByteOp|DstMem|SrcReg|ModRM },
     [0xc1] = { DstMem|SrcReg|ModRM },
-    [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp },
+    [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
     [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
@@ -7438,7 +7438,7 @@ x86_emulate(
         goto add;
 
     CASE_SIMD_ALL_FP(, 0x0f, 0xc2):        /* cmp{p,s}{s,d} $imm8,xmm/mem,xmm */
-    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
+    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0xc6):     /* shufp{s,d} $imm8,xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
         d = (d & ~SrcMask) | SrcMem;
@@ -7452,6 +7452,30 @@ x86_emulate(
         }
         goto simd_0f_imm8_avx;
 
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0xc2): /* vcmp{p,s}{s,d} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type == OP_MEM && evex.br &&
+                                (evex.pfx & VEX_PREFIX_SCALAR_MASK)) ||
+                               !evex.r || !evex.R || evex.z),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+        d = (d & ~SrcMask) | SrcMem;
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* Convert memory operand to (%rAX). */
+            evex.b = 1;
+            opc[1] &= 0x38;
+        }
+        opc[2] = imm1;
+        insn_bytes = EVEX_PFX_BYTES + 3;
+        break;
+
     case X86EMUL_OPC(0x0f, 0xc3): /* movnti */
         /* Ignore the non-temporal hint for now. */
         vcpu_must_have(sse2);
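
For reference, the d8s_vl value added to the table entry above feeds the
compressed displacement (disp8*N) handling done via decode_disp8scale():
the stored 8-bit displacement gets scaled by the full vector length. A
simplified sketch of the resulting arithmetic (illustrative only; the
real logic additionally accounts for broadcasts and element sizes):

    #include <stdint.h>

    static int32_t disp8_effective(int8_t disp8, unsigned int evex_lr)
    {
        unsigned int scale = 16u << evex_lr; /* 16/32/64 bytes for x/y/zmm */

        return (int32_t)disp8 * scale;
    }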






* [PATCH v4 12/44] x86emul: support AVX512F misc legacy-equivalent FP insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (10 preceding siblings ...)
  2018-09-25 13:32   ` [PATCH v4 11/44] x86emul: support AVX512F "normal" FP compare insns Jan Beulich
@ 2018-09-25 13:33   ` Jan Beulich
  2018-11-13 19:17     ` Andrew Cooper
  2018-09-25 13:33   ` [PATCH v4 13/44] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
                     ` (31 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:33 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also correct an AVX counterpart's comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -116,8 +116,11 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
+    INSN_PFP(unpckh,         0f, 15),
+    INSN_PFP(unpckl,         0f, 14),
 };
 
 static const struct test avx512f_128[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -282,7 +282,7 @@ static const struct twobyte_table {
     [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
-    [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
+    [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
@@ -356,7 +356,7 @@ static const struct twobyte_table {
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
     [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
-    [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp },
+    [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -5932,6 +5932,17 @@ x86_emulate(
         host_and_vcpu_must_have(sse3);
         goto simd_0f_xmm;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x14): /* vunpcklp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        fault_suppression = false;
+    avx512f_no_sae:
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     case X86EMUL_OPC(0x0f, 0x20): /* mov cr,reg */
     case X86EMUL_OPC(0x0f, 0x21): /* mov dr,reg */
     case X86EMUL_OPC(0x0f, 0x22): /* mov reg,cr */
@@ -6611,11 +6622,9 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_F3(0x0f, 0x7f): /* vmovdqu{32,64} [xyz]mm,[xyz]mm/mem{k} */
     vmovdqa:
         generate_exception_if(evex.br, EXC_UD);
-        host_and_vcpu_must_have(avx512f);
-        avx512_vlen_check(false);
         d |= TwoOp;
         op_bytes = 16 << evex.lr;
-        goto simd_zmm;
+        goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_F2(0x0f, 0x6f): /* vmovdqu{8,16} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f, 0x7f): /* vmovdqu{8,16} [xyz]mm,[xyz]mm/mem{k} */
@@ -7440,7 +7449,7 @@ x86_emulate(
     CASE_SIMD_ALL_FP(, 0x0f, 0xc2):        /* cmp{p,s}{s,d} $imm8,xmm/mem,xmm */
     CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0xc6):     /* shufp{s,d} $imm8,xmm/mem,xmm */
-    CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
+    CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         d = (d & ~SrcMask) | SrcMem;
         if ( vex.opcx == vex_none )
         {
@@ -7461,7 +7470,9 @@ x86_emulate(
         host_and_vcpu_must_have(avx512f);
         if ( ea.type == OP_MEM || !evex.br )
             avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
-        d = (d & ~SrcMask) | SrcMem;
+    simd_imm8_zmm:
+        if ( (d & SrcMask) == SrcImmByte )
+            d = (d & ~SrcMask) | SrcMem;
         get_fpu(X86EMUL_FPU_zmm);
         opc = init_evex(stub);
         opc[0] = b;
@@ -7505,6 +7516,15 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         goto simd_0f_to_gpr;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f, 0xc7): /* Grp9 */
     {
         union {
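
The evex.w consistency checks added above encode the rule that EVEX.W
has to match the element type selected by the SIMD prefix: no prefix
selects the ps forms (requiring W=0), a 66 prefix the pd forms
(requiring W=1). Restated as a helper (illustrative only; the mask
value is assumed, as the checks above suggest):

    #include <stdbool.h>

    #define VEX_PREFIX_DOUBLE_MASK 1 /* assumed: low bit of the prefix */

    static bool evex_w_matches_type(unsigned int evex_w, unsigned int pfx)
    {
        /* vunpck{l,h}ps / vshufps: W=0; vunpck{l,h}pd / vshufpd: W=1. */
        return evex_w == (pfx & VEX_PREFIX_DOUBLE_MASK);
    }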





* [PATCH v4 13/44] x86emul: support AVX512F fused-multiply-add insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (11 preceding siblings ...)
  2018-09-25 13:33   ` [PATCH v4 12/44] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
@ 2018-09-25 13:33   ` Jan Beulich
  2018-11-13 19:28     ` Andrew Cooper
  2018-09-25 13:34   ` [PATCH v4 14/44] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
                     ` (30 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:33 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -96,6 +96,36 @@ static const struct test avx512f_all[] =
     INSN_FP(add,             0f, 58),
     INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
+    INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
+    INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
+    INSN(fmadd213,     66, 0f38, a8,    vl,     sd, vl),
+    INSN(fmadd213,     66, 0f38, a9,    el,     sd, el),
+    INSN(fmadd231,     66, 0f38, b8,    vl,     sd, vl),
+    INSN(fmadd231,     66, 0f38, b9,    el,     sd, el),
+    INSN(fmaddsub132,  66, 0f38, 96,    vl,     sd, vl),
+    INSN(fmaddsub213,  66, 0f38, a6,    vl,     sd, vl),
+    INSN(fmaddsub231,  66, 0f38, b6,    vl,     sd, vl),
+    INSN(fmsub132,     66, 0f38, 9a,    vl,     sd, vl),
+    INSN(fmsub132,     66, 0f38, 9b,    el,     sd, el),
+    INSN(fmsub213,     66, 0f38, aa,    vl,     sd, vl),
+    INSN(fmsub213,     66, 0f38, ab,    el,     sd, el),
+    INSN(fmsub231,     66, 0f38, ba,    vl,     sd, vl),
+    INSN(fmsub231,     66, 0f38, bb,    el,     sd, el),
+    INSN(fmsubadd132,  66, 0f38, 97,    vl,     sd, vl),
+    INSN(fmsubadd213,  66, 0f38, a7,    vl,     sd, vl),
+    INSN(fmsubadd231,  66, 0f38, b7,    vl,     sd, vl),
+    INSN(fnmadd132,    66, 0f38, 9c,    vl,     sd, vl),
+    INSN(fnmadd132,    66, 0f38, 9d,    el,     sd, el),
+    INSN(fnmadd213,    66, 0f38, ac,    vl,     sd, vl),
+    INSN(fnmadd213,    66, 0f38, ad,    el,     sd, el),
+    INSN(fnmadd231,    66, 0f38, bc,    vl,     sd, vl),
+    INSN(fnmadd231,    66, 0f38, bd,    el,     sd, el),
+    INSN(fnmsub132,    66, 0f38, 9e,    vl,     sd, vl),
+    INSN(fnmsub132,    66, 0f38, 9f,    el,     sd, el),
+    INSN(fnmsub213,    66, 0f38, ae,    vl,     sd, vl),
+    INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
+    INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
+    INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -452,30 +452,30 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
-    [0x96 ... 0x98] = { .simd_size = simd_packed_fp },
-    [0x99] = { .simd_size = simd_scalar_vexw },
-    [0x9a] = { .simd_size = simd_packed_fp },
-    [0x9b] = { .simd_size = simd_scalar_vexw },
-    [0x9c] = { .simd_size = simd_packed_fp },
-    [0x9d] = { .simd_size = simd_scalar_vexw },
-    [0x9e] = { .simd_size = simd_packed_fp },
-    [0x9f] = { .simd_size = simd_scalar_vexw },
-    [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp },
-    [0xa9] = { .simd_size = simd_scalar_vexw },
-    [0xaa] = { .simd_size = simd_packed_fp },
-    [0xab] = { .simd_size = simd_scalar_vexw },
-    [0xac] = { .simd_size = simd_packed_fp },
-    [0xad] = { .simd_size = simd_scalar_vexw },
-    [0xae] = { .simd_size = simd_packed_fp },
-    [0xaf] = { .simd_size = simd_scalar_vexw },
-    [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp },
-    [0xb9] = { .simd_size = simd_scalar_vexw },
-    [0xba] = { .simd_size = simd_packed_fp },
-    [0xbb] = { .simd_size = simd_scalar_vexw },
-    [0xbc] = { .simd_size = simd_packed_fp },
-    [0xbd] = { .simd_size = simd_scalar_vexw },
-    [0xbe] = { .simd_size = simd_packed_fp },
-    [0xbf] = { .simd_size = simd_scalar_vexw },
+    [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9a] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9b] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9c] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9e] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xa9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xaa] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xab] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xac] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xad] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xae] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xaf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xb9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xba] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xbc] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xc8 ... 0xcd] = { .simd_size = simd_other },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
@@ -8287,6 +8287,49 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM )
+        {
+            generate_exception_if(evex.br, EXC_UD);
+            avx512_vlen_check(true);
+        }
+        goto simd_zmm;
+
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xc9):     /* sha1msg1 xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xca):     /* sha1msg2 xmm/m128,xmm */
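
Note how the two case blocks above differ in length checking: the
packed forms pass false to avx512_vlen_check(), while the scalar ones
pass true, since scalar (LIG) encodings ignore EVEX.L'L altogether. The
intended acceptance rule, restated standalone (a sketch; helper and
parameter names are made up):

    #include <stdbool.h>

    static bool evex_vlen_acceptable(unsigned int lr, bool lig, bool has_vl)
    {
        if ( lig )              /* scalar insns: length ignored */
            return true;
        switch ( lr )
        {
        case 2:  return true;   /* 512-bit: plain AVX512F suffices */
        case 3:  return false;  /* reserved encoding */
        default: return has_vl; /* 128-/256-bit forms need AVX512VL */
        }
    }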





* [PATCH v4 14/44] x86emul: support AVX512F legacy-equivalent logic insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (12 preceding siblings ...)
  2018-09-25 13:33   ` [PATCH v4 13/44] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
@ 2018-09-25 13:34   ` Jan Beulich
  2018-11-13 19:30     ` Andrew Cooper
  2018-09-25 13:35   ` [PATCH v4 15/44] x86emul: support AVX512{F,DQ} FP broadcast insns Jan Beulich
                     ` (29 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:34 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Plus vpternlog{d,q}, which the compiler uses extensively, in order to
facilitate enabling tests in the harness as soon as possible. The
twobyte_table[] entries for a few more insns also get their .d8s fields
set right away, so as not to split the groups and re-combine them later.
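
For background, vpternlog{d,q} evaluates an arbitrary three-input
boolean function: each result bit is looked up in the imm8 truth table,
indexed by the corresponding bits of the three sources. A scalar model
of the per-element operation (illustration only, not emulator code):

    #include <stdint.h>

    static uint64_t ternlog(uint64_t a, uint64_t b, uint64_t c,
                            uint8_t imm8)
    {
        uint64_t r = 0;
        unsigned int i;

        for ( i = 0; i < 64; ++i )
        {
            unsigned int idx = (((a >> i) & 1) << 2) |
                               (((b >> i) & 1) << 1) |
                               ((c >> i) & 1);

            r |= (uint64_t)((imm8 >> idx) & 1) << i;
        }

        return r;
    }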

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -146,6 +146,11 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(pand,         66,   0f, db,    vl,     dq, vl),
+    INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
+    INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -364,13 +364,13 @@ static const struct twobyte_table {
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
-    [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
@@ -493,6 +493,7 @@ static const struct ext0f3a_table {
     uint8_t to_mem:1;
     uint8_t two_op:1;
     uint8_t four_op:1;
+    disp8scale_t d8s:4;
 } ext0f3a_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x01] = { .simd_size = simd_packed_fp, .two_op = 1 },
@@ -510,6 +511,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
+    [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
@@ -3016,20 +3018,33 @@ x86_decode(
                 disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
             break;
 
+        case ext_0f3a:
+            /*
+             * Cannot update d here yet, as the immediate operand still
+             * needs fetching.
+             */
+            state->simd_size = ext0f3a_table[b].simd_size;
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(ext0f3a_table[b].d8s, state);
+            break;
+
         case ext_8f09:
             if ( ext8f09_table[b].two_op )
                 d |= TwoOp;
             state->simd_size = ext8f09_table[b].simd_size;
             break;
 
-        case ext_0f3a:
         case ext_8f08:
+        case ext_8f0a:
             /*
              * Cannot update d here yet, as the immediate operand still
              * needs fetching.
              */
-        default:
             break;
+
+        default:
+            ASSERT_UNREACHABLE();
+            return X86EMUL_UNIMPLEMENTED;
         }
 
         if ( modrm_mod == 3 )
@@ -3208,7 +3223,6 @@ x86_decode(
         else if ( ext0f3a_table[b].four_op && !mode_64bit() && vex.opcx )
             imm1 &= 0x7f;
         state->desc = d;
-        state->simd_size = ext0f3a_table[b].simd_size;
         rc = x86_decode_0f3a(state, ctxt, ops);
         break;
 
@@ -5937,6 +5951,11 @@ x86_emulate(
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
         fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -7520,6 +7539,8 @@ x86_emulate(
         fault_suppression = false;
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
         avx512_vlen_check(false);





* [PATCH v4 15/44] x86emul: support AVX512{F,DQ} FP broadcast insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (13 preceding siblings ...)
  2018-09-25 13:34   ` [PATCH v4 14/44] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
@ 2018-09-25 13:35   ` Jan Beulich
  2018-11-13 19:37     ` Andrew Cooper
  2018-09-25 13:35   ` [PATCH v4 16/44] x86emul: support AVX512F v{,u}comis{d,s} insns Jan Beulich
                     ` (28 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -94,6 +94,7 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
@@ -165,6 +166,15 @@ static const struct test avx512f_128[] =
     INSN(movq,      66,   0f, d6, el,    q, el),
 };
 
+static const struct test avx512f_no128[] = {
+    INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+};
+
+static const struct test avx512f_512[] = {
+    INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+};
+
 static const struct test avx512bw_all[] = {
     INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
@@ -179,8 +189,19 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_no128[] = {
+    INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
+    INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+};
+
+static const struct test avx512dq_512[] = {
+    INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
+static const unsigned char vl_no128[] = { VL_512, VL_256 };
+static const unsigned char vl_512[] = { VL_512 };
 
 /*
  * This table, indicating the presence of an immediate (byte) for an opcode
@@ -501,6 +522,10 @@ void evex_disp8_test(void *instr, struct
 
     RUN(avx512f, all);
     RUN(avx512f, 128);
+    RUN(avx512f, no128);
+    RUN(avx512f, 512);
     RUN(avx512bw, all);
     RUN(avx512dq, all);
+    RUN(avx512dq, no128);
+    RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -234,10 +234,16 @@ enum simd_opsize {
 
     /*
      * 128 bits of integer or floating point data, with no further
-     * formatting information.
+     * formatting information, or with it encoded by EVEX.W.
      */
     simd_128,
 
+    /*
+     * 256 bits of integer or floating point data, with formatting
+     * encoded by EVEX.W.
+     */
+    simd_256,
+
     /* Operand size encoded in non-standard way. */
     simd_other
 };
@@ -432,8 +438,10 @@ static const struct ext0f38_table {
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x18 ... 0x19] = { .simd_size = simd_scalar_opc, .two_op = 1 },
-    [0x1a] = { .simd_size = simd_128, .two_op = 1 },
+    [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
+    [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
+    [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
@@ -3330,6 +3338,10 @@ x86_decode(
         op_bytes = 16;
         break;
 
+    case simd_256:
+        op_bytes = 32;
+        break;
+
     default:
         op_bytes = 0;
         break;
@@ -7979,6 +7991,42 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+    avx512_broadcast:
+        /*
+         * For the respective code below the main switch() to work we need to
+         * fold op_mask here: A source element gets read whenever any of its
+         * respective destination elements' mask bits is set.
+         */
+        if ( fault_suppression )
+        {
+            n = 1 << ((b & 3) - evex.w);
+            ASSERT(op_bytes == n * elem_bytes);
+            for ( i = n; i < (16 << evex.lr) / elem_bytes; i += n )
+                op_mask |= (op_mask >> i) & ((1 << n) - 1);
+        }
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
+                                            /* vbroadcastf64x4 m256,zmm{k} */
+        generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
+                                            /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
+        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcastf64x2 m128,{y,z}mm{k} */
+        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.br,
+                              EXC_UD);
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
     case X86EMUL_OPC_66(0x0f38, 0x20): /* pmovsxbw xmm/m64,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x21): /* pmovsxbd xmm/m32,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x22): /* pmovsxbq xmm/m16,xmm */
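
The op_mask folding loop introduced above can also be read standalone:
when a source of n elements is replicated across the destination, a
source element needs reading (and hence may fault) whenever any of the
destination positions it lands in has its mask bit set. A sketch
mirroring the patch's loop (hypothetical helper name):

    #include <stdint.h>

    static uint32_t fold_broadcast_mask(uint32_t op_mask, unsigned int n,
                                        unsigned int total_elems)
    {
        unsigned int i;

        /* OR every replica's mask bits back onto the low n bits. */
        for ( i = n; i < total_elems; i += n )
            op_mask |= (op_mask >> i) & ((1u << n) - 1);

        return op_mask;
    }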





* [PATCH v4 16/44] x86emul: support AVX512F v{,u}comis{d,s} insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (14 preceding siblings ...)
  2018-09-25 13:35   ` [PATCH v4 15/44] x86emul: support AVX512{F,DQ} FP broadcast insns Jan Beulich
@ 2018-09-25 13:35   ` Jan Beulich
  2018-11-13 19:39     ` Andrew Cooper
  2018-09-25 13:36   ` [PATCH v4 17/44] x86emul/test: introduce eq() Jan Beulich
                     ` (27 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Add missing avx512_vlen_check().
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -96,6 +96,8 @@ static const struct test avx512f_all[] =
     INSN_FP(add,             0f, 58),
     INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
+    INSN(comisd,       66,   0f, 2f,    el,      q, el),
+    INSN(comiss,         ,   0f, 2f,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -155,6 +157,8 @@ static const struct test avx512f_all[] =
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
+    INSN(ucomisd,      66,   0f, 2e,    el,      q, el),
+    INSN(ucomiss,        ,   0f, 2e,    el,      d, el),
     INSN_PFP(unpckh,         0f, 15),
     INSN_PFP(unpckl,         0f, 14),
 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -299,7 +299,7 @@ static const struct twobyte_table {
     [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp },
+    [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp, simd_none, d8s_dq },
     [0x30 ... 0x35] = { ImplicitOps },
     [0x37] = { ImplicitOps },
     [0x38] = { DstReg|SrcMem|ModRM },
@@ -6111,24 +6111,34 @@ x86_emulate(
         }
 
         opc = init_prefixes(stub);
+        op_bytes = 4 << vex.pfx;
+    vcomi:
         opc[0] = b;
         opc[1] = modrm;
         if ( ea.type == OP_MEM )
         {
-            rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, vex.pfx ? 8 : 4,
-                           ctxt);
+            rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, op_bytes, ctxt);
             if ( rc != X86EMUL_OKAY )
                 goto done;
 
             /* Convert memory operand to (%rAX). */
             rex_prefix &= ~REX_B;
             vex.b = 1;
+            evex.b = 1;
             opc[1] &= 0x38;
         }
-        insn_bytes = PFX_BYTES + 2;
+        if ( evex_encoded() )
+        {
+            insn_bytes = EVEX_PFX_BYTES + 2;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 2;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
         opc[2] = 0xc3;
 
-        copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),
                     _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),
                     [eflags] "+g" (_regs.eflags),
@@ -6139,6 +6149,20 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2f): /* vcomis{s,d} xmm/mem,xmm */
+        generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
+                               (ea.type != OP_REG && evex.br) ||
+                               evex.w != evex.pfx),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        op_bytes = 4 << evex.w;
+        goto vcomi;
+
     case X86EMUL_OPC(0x0f, 0x30): /* wrmsr */
         generate_exception_if(!mode_ring0(), EXC_GP, 0);
         fail_if(ops->write_msr == NULL);
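
The stub invoked above executes the actual (v){,u}comis{s,d} insn and
extracts EFLAGS afterwards; the architectural flag mapping it relies on
can be modeled as follows (illustration only - the emulator never
computes this in C):

    #include <math.h>

    #define X86_EFLAGS_CF 0x001
    #define X86_EFLAGS_PF 0x004
    #define X86_EFLAGS_ZF 0x040

    static unsigned int comis_eflags(double x, double y)
    {
        if ( isnan(x) || isnan(y) )      /* unordered */
            return X86_EFLAGS_ZF | X86_EFLAGS_PF | X86_EFLAGS_CF;
        if ( x < y )
            return X86_EFLAGS_CF;
        if ( x == y )
            return X86_EFLAGS_ZF;
        return 0;                        /* greater */
    }

OF/SF/AF are cleared in all four cases.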






* [PATCH v4 17/44] x86emul/test: introduce eq()
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (15 preceding siblings ...)
  2018-09-25 13:35   ` [PATCH v4 16/44] x86emul: support AVX512F v{,u}comis{d,s} insns Jan Beulich
@ 2018-09-25 13:36   ` Jan Beulich
  2018-10-26 11:31     ` Andrew Cooper
  2018-09-25 13:37   ` [PATCH v4 18/44] x86emul: support AVX512{F,BW} packed integer compare insns Jan Beulich
                     ` (26 subsequent siblings)
  43 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

In preparation for sensible to-boolean conversion on AVX512, wrap
another abstraction function around the present to_bool(<x> == <y>), to
get rid of the open-coded == (which would get in the way of using
built-in functions instead). For future AVX512 use, scalar operands can
then no longer be used: use (vec_t){} when the operand is zero, and
broadcast (if available) otherwise. When broadcast is not available,
assume pre-AVX512, in which case a plain scalar is still fine.
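
To illustrate the need for the wrapper: with AVX512 enabled, a
vector-wide equality test is most naturally expressed via the
mask-producing compare intrinsics rather than the GNU C vector
extension's ==. One possible eq() shape for 512-bit vectors of dwords
(a sketch using the standard AVX512F intrinsic, built with AVX512F
enabled; not the harness's eventual definition):

    #include <immintrin.h>
    #include <stdbool.h>

    static inline bool eq_epi32(__m512i x, __m512i y)
    {
        /* All 16 dword elements have to compare equal. */
        return _mm512_cmpeq_epi32_mask(x, y) == 0xffff;
    }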

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -46,6 +46,10 @@ static inline bool _to_bool(byte_vec_t b
 # define to_bool(cmp) _to_bool((byte_vec_t)(cmp))
 #endif
 
+#ifndef eq
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
 #if VEC_SIZE == FLOAT_SIZE
 # define to_int(x) ((vec_t){ (int)(x)[0] })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
@@ -605,18 +609,18 @@ int simd_test(void)
     touch(src);
     x = src;
     touch(x);
-    if ( !to_bool(x == src) ) return __LINE__;
+    if ( !eq(x, src) ) return __LINE__;
 
     touch(src);
     y = x + src;
     touch(src);
     touch(y);
-    if ( !to_bool(y == 2 * src) ) return __LINE__;
+    if ( !eq(y, 2 * src) ) return __LINE__;
 
     touch(src);
     z = y -= src;
     touch(z);
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
 #if defined(UINT_SIZE)
 
@@ -628,7 +632,7 @@ int simd_test(void)
     z ^= inv;
     touch(inv);
     touch(x);
-    if ( !to_bool((x & ~y) == z) ) return __LINE__;
+    if ( !eq(x & ~y, z) ) return __LINE__;
 
 #elif ELEM_SIZE > 1 || VEC_SIZE <= 8
 
@@ -639,7 +643,7 @@ int simd_test(void)
     z = src + inv;
     touch(inv);
     z *= (src - inv);
-    if ( !to_bool(x - y == z) ) return __LINE__;
+    if ( !eq(x - y, z) ) return __LINE__;
 
 #endif
 
@@ -648,10 +652,10 @@ int simd_test(void)
     x = src * alt;
     touch(alt);
     y = src / alt;
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
     touch(alt);
     touch(src);
-    if ( !to_bool(x * -alt == -src) ) return __LINE__;
+    if ( !eq(x * -alt, -src) ) return __LINE__;
 
 # if defined(recip) && defined(to_int)
 
@@ -659,16 +663,16 @@ int simd_test(void)
     x = recip(src);
     touch(src);
     touch(x);
-    if ( !to_bool(to_int(recip(x)) == src) ) return __LINE__;
+    if ( !eq(to_int(recip(x)), src) ) return __LINE__;
 
 #  ifdef rsqrt
     x = src * src;
     touch(x);
     y = rsqrt(x);
     touch(y);
-    if ( !to_bool(to_int(recip(y)) == src) ) return __LINE__;
+    if ( !eq(to_int(recip(y)), src) ) return __LINE__;
     touch(src);
-    if ( !to_bool(to_int(y) == to_int(recip(src))) ) return __LINE__;
+    if ( !eq(to_int(y), to_int(recip(src))) ) return __LINE__;
 #  endif
 
 # endif
@@ -676,7 +680,7 @@ int simd_test(void)
 # ifdef sqrt
     x = src * src;
     touch(x);
-    if ( !to_bool(sqrt(x) == src) ) return __LINE__;
+    if ( !eq(sqrt(x), src) ) return __LINE__;
 # endif
 
 # ifdef trunc
@@ -684,20 +688,20 @@ int simd_test(void)
     y = (vec_t){ 1 };
     touch(x);
     z = trunc(x);
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
 # endif
 
 # ifdef frac
     touch(src);
     x = frac(src);
     touch(src);
-    if ( !to_bool(x == 0) ) return __LINE__;
+    if ( !eq(x, (vec_t){}) ) return __LINE__;
 
     x = 1 / (src + 1);
     touch(x);
     y = frac(x);
     touch(x);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 # endif
 
 # if defined(trunc) && defined(frac)
@@ -707,7 +711,7 @@ int simd_test(void)
     touch(x);
     z = frac(x);
     touch(x);
-    if ( !to_bool(x == y + z) ) return __LINE__;
+    if ( !eq(x, y + z) ) return __LINE__;
 # endif
 
 #else
@@ -720,16 +724,16 @@ int simd_test(void)
     y[ELEM_COUNT - 1] = y[0] = j = ELEM_COUNT;
     for ( i = 1; i < ELEM_COUNT / 2; ++i )
         y[ELEM_COUNT - i - 1] = y[i] = y[i - 1] + (j -= 2);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 
 #  ifdef mul_hi
     touch(alt);
     x = mul_hi(src, alt);
     touch(alt);
 #   ifdef INT_SIZE
-    if ( !to_bool(x == (alt < 0)) ) return __LINE__;
+    if ( !eq(x, alt < 0) ) return __LINE__;
 #   else
-    if ( !to_bool(x == (src & alt) + alt) ) return __LINE__;
+    if ( !eq(x, (src & alt) + alt) ) return __LINE__;
 #   endif
 #  endif
 
@@ -745,7 +749,7 @@ int simd_test(void)
         z[i] = res;
         z[i + 1] = res >> (ELEM_SIZE << 3);
     }
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
 #  endif
 
     z = src;
@@ -757,12 +761,12 @@ int simd_test(void)
     touch(z);
     y = z << 2;
     touch(z);
-    if ( !to_bool(x == y + y) ) return __LINE__;
+    if ( !eq(x, y + y) ) return __LINE__;
 
     touch(x);
     z = x >> 2;
     touch(x);
-    if ( !to_bool(y == z + z) ) return __LINE__;
+    if ( !eq(y, z + z) ) return __LINE__;
 
     z = src;
 #  ifdef INT_SIZE
@@ -781,11 +785,11 @@ int simd_test(void)
     touch(j);
     y = z << j;
     touch(j);
-    if ( !to_bool(x == y + y) ) return __LINE__;
+    if ( !eq(x, y + y) ) return __LINE__;
 
     z = x >> j;
     touch(j);
-    if ( !to_bool(y == z + z) ) return __LINE__;
+    if ( !eq(y, z + z) ) return __LINE__;
 
 # endif
 
@@ -809,12 +813,12 @@ int simd_test(void)
     --sh;
     touch(sh);
     y = z << sh;
-    if ( !to_bool(x == y + y) ) return __LINE__;
+    if ( !eq(x, y + y) ) return __LINE__;
 
 #  if (defined(__AVX2__) && ELEM_SIZE >= 4) || defined(__XOP__)
     touch(sh);
     x = y >> sh;
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 #  endif
 
 # endif
@@ -828,7 +832,7 @@ int simd_test(void)
     touch(inv);
     y = max(src, inv);
     touch(inv);
-    if ( !to_bool(x + y == src + inv) ) return __LINE__;
+    if ( !eq(x + y, src + inv) ) return __LINE__;
 # else
     x = src * alt;
     y = inv * alt;
@@ -837,33 +841,33 @@ int simd_test(void)
     touch(y);
     y = min(x, y);
     touch(y);
-    if ( !to_bool((y + z) * alt == src + inv) ) return __LINE__;
+    if ( !eq((y + z) * alt, src + inv) ) return __LINE__;
 # endif
 #endif
 
 #ifdef abs
     x = src * alt;
     touch(x);
-    if ( !to_bool(abs(x) == src) ) return __LINE__;
+    if ( !eq(abs(x), src) ) return __LINE__;
 #endif
 
 #ifdef copysignz
     touch(alt);
-    if ( !to_bool(copysignz((vec_t){} + 1, alt) == alt) ) return __LINE__;
+    if ( !eq(copysignz((vec_t){} + 1, alt), alt) ) return __LINE__;
 #endif
 
 #ifdef swap
     touch(src);
-    if ( !to_bool(swap(src) == inv) ) return __LINE__;
+    if ( !eq(swap(src), inv) ) return __LINE__;
 #endif
 
 #ifdef swap2
     touch(src);
-    if ( !to_bool(swap2(src) == inv) ) return __LINE__;
+    if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
 #if defined(broadcast)
-    if ( !to_bool(broadcast(ELEM_COUNT + 1) == src + inv) ) return __LINE__;
+    if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
 #if defined(interleave_lo) && defined(interleave_hi)
@@ -877,7 +881,11 @@ int simd_test(void)
 # else
     z = (x - y) * alt;
 # endif
-    if ( !to_bool(z == ELEM_COUNT / 2) ) return __LINE__;
+# ifdef broadcast
+    if ( !eq(z, broadcast(ELEM_COUNT / 2)) ) return __LINE__;
+# else
+    if ( !eq(z, ELEM_COUNT / 2) ) return __LINE__;
+# endif
 #endif
 
 #if defined(INT_SIZE) && defined(widen1) && defined(interleave_lo)
@@ -887,7 +895,7 @@ int simd_test(void)
     touch(x);
     z = widen1(x);
     touch(x);
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 
 # ifdef widen2
     y = interleave_lo(alt < 0, alt < 0);
@@ -895,7 +903,7 @@ int simd_test(void)
     touch(x);
     z = widen2(x);
     touch(x);
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 
 #  ifdef widen3
     y = interleave_lo(alt < 0, alt < 0);
@@ -904,7 +912,7 @@ int simd_test(void)
     touch(x);
     z = widen3(x);
     touch(x);
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 #  endif
 # endif
 
@@ -919,21 +927,21 @@ int simd_test(void)
     touch(src);
     x = widen1(src);
     touch(src);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 # endif
 
 # ifdef widen2
     touch(src);
     x = widen2(src);
     touch(src);
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 # endif
 
 # ifdef widen3
     touch(src);
     x = widen3(src);
     touch(src);
-    if ( !to_bool(x == interleave_lo(z, (vec_t){})) ) return __LINE__;
+    if ( !eq(x, interleave_lo(z, (vec_t){})) ) return __LINE__;
 # endif
 
 #endif
@@ -942,14 +950,14 @@ int simd_test(void)
     touch(src);
     x = dup_lo(src);
     touch(src);
-    if ( !to_bool(x - src == (alt - 1) / 2) ) return __LINE__;
+    if ( !eq(x - src, (alt - 1) / 2) ) return __LINE__;
 #endif
 
 #ifdef dup_hi
     touch(src);
     x = dup_hi(src);
     touch(src);
-    if ( !to_bool(x - src == (alt + 1) / 2) ) return __LINE__;
+    if ( !eq(x - src, (alt + 1) / 2) ) return __LINE__;
 #endif
 
     for ( i = 0; i < ELEM_COUNT; ++i )
@@ -961,7 +969,7 @@ int simd_test(void)
 # else
     select(&z, src, inv, alt > 0);
 # endif
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 #endif
 
 #ifdef select2
@@ -970,14 +978,14 @@ int simd_test(void)
 # else
     select2(&z, src, inv, alt > 0);
 # endif
-    if ( !to_bool(z == y) ) return __LINE__;
+    if ( !eq(z, y) ) return __LINE__;
 #endif
 
 #ifdef mix
     touch(src);
     touch(inv);
     x = mix(src, inv);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 
 # ifdef addsub
     touch(src);
@@ -986,22 +994,22 @@ int simd_test(void)
     touch(src);
     touch(inv);
     y = mix(src - inv, src + inv);
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 # endif
 #endif
 
 #ifdef rotr
     x = rotr(src, 1);
     y = (src & (ELEM_COUNT - 1)) + 1;
-    if ( !to_bool(x == y) ) return __LINE__;
+    if ( !eq(x, y) ) return __LINE__;
 #endif
 
 #ifdef dot_product
     touch(src);
     touch(inv);
     x = dot_product(src, inv);
-    if ( !to_bool(x == (vec_t){ (ELEM_COUNT * (ELEM_COUNT + 1) *
-                                 (ELEM_COUNT + 2)) / 6 }) ) return __LINE__;
+    if ( !eq(x, (vec_t){ (ELEM_COUNT * (ELEM_COUNT + 1) *
+                          (ELEM_COUNT + 2)) / 6 }) ) return __LINE__;
 #endif
 
 #ifdef hadd
@@ -1022,7 +1030,7 @@ int simd_test(void)
     x = hsub(src, inv);
     for ( i = ELEM_COUNT; i >>= 1; )
         x = hadd(x, (vec_t){});
-    if ( !to_bool(x == 0) ) return __LINE__;
+    if ( !eq(x, (vec_t){}) ) return __LINE__;
 # endif
 #endif
 
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -20,6 +20,10 @@ ENTRY(fma_test);
 # endif
 #endif
 
+#ifndef eq
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
 #if VEC_SIZE == 16
 # if FLOAT_SIZE == 4
 #  define addsub(x, y) __builtin_ia32_addsubps(x, y)
@@ -62,38 +66,38 @@ int fma_test(void)
     y = (src - one) * inv;
     touch(src);
     z = inv * src + inv;
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
     touch(src);
     z = -inv * src - inv;
-    if ( !to_bool(-x == z) ) return __LINE__;
+    if ( !eq(-x, z) ) return __LINE__;
 
     touch(src);
     z = inv * src - inv;
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
 
     touch(src);
     z = -inv * src + inv;
-    if ( !to_bool(-y == z) ) return __LINE__;
+    if ( !eq(-y, z) ) return __LINE__;
     touch(src);
 
     x = src + inv;
     y = src - inv;
     touch(inv);
     z = src * one + inv;
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
     touch(inv);
     z = -src * one - inv;
-    if ( !to_bool(-x == z) ) return __LINE__;
+    if ( !eq(-x, z) ) return __LINE__;
 
     touch(inv);
     z = src * one - inv;
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
 
     touch(inv);
     z = -src * one + inv;
-    if ( !to_bool(-y == z) ) return __LINE__;
+    if ( !eq(-y, z) ) return __LINE__;
     touch(inv);
 
 #if defined(addsub) && defined(fmaddsub)
@@ -101,21 +105,21 @@ int fma_test(void)
     y = addsub(src * inv, -one);
     touch(one);
     z = fmaddsub(src, inv, one);
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
     touch(one);
     z = fmaddsub(src, inv, -one);
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
     touch(one);
 
     x = addsub(src * inv, one);
     touch(inv);
     z = fmaddsub(src, inv, one);
-    if ( !to_bool(x == z) ) return __LINE__;
+    if ( !eq(x, z) ) return __LINE__;
 
     touch(inv);
     z = fmaddsub(src, inv, -one);
-    if ( !to_bool(y == z) ) return __LINE__;
+    if ( !eq(y, z) ) return __LINE__;
     touch(inv);
 #endif
 





* [PATCH v4 18/44] x86emul: support AVX512{F,BW} packed integer compare insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (16 preceding siblings ...)
  2018-09-25 13:36   ` [PATCH v4 17/44] x86emul/test: introduce eq() Jan Beulich
@ 2018-09-25 13:37   ` Jan Beulich
  2018-09-25 13:37   ` [PATCH v4 19/44] x86emul: support AVX512{F,BW} packed integer arithmetic insns Jan Beulich
                     ` (25 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:37 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Include VPTEST{,N}M{B,D,Q,W}, which - once again - the compiler may use
for comparisons against all-zero vectors.

Table entries for a few more insns also get their .d8s fields set right
away, again so as not to split the groups and re-combine them later.
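
For background, VPTESTM sets a destination mask bit whenever the AND of
the corresponding source elements is non-zero (VPTESTNM when it is
zero), which is what makes these usable for all-zero checks. A scalar
model of the dword form on a zmm operand (illustration only):

    #include <stdint.h>

    static uint16_t vptestmd(const uint32_t a[16], const uint32_t b[16])
    {
        uint16_t k = 0;
        unsigned int i;

        for ( i = 0; i < 16; ++i )
            if ( a[i] & b[i] )
                k |= 1u << i;

        return k;
    }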

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -151,8 +151,16 @@ static const struct test avx512f_all[] =
     INSN_FP(mul,             0f, 59),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
+    INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
+    INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
+    INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
+    INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
+    INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
+    INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
+    INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -184,6 +192,14 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
+    INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
+    INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
+    INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
+    INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
+    INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
+    INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
 
 static const struct test avx512dq_all[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -313,14 +313,14 @@ static const struct twobyte_table {
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
-    [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
@@ -444,13 +444,13 @@ static const struct ext0f38_table {
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
-    [0x28 ... 0x29] = { .simd_size = simd_packed_int },
+    [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
-    [0x36 ... 0x3f] = { .simd_size = simd_packed_int },
+    [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int },
@@ -516,6 +516,7 @@ static const struct ext0f3a_table {
     [0x18] = { .simd_size = simd_128 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
+    [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
@@ -523,6 +524,7 @@ static const struct ext0f3a_table {
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
     [0x44] = { .simd_size = simd_packed_int },
@@ -6569,6 +6571,32 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x64): /* vpcmpeqb [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x65): /* vpcmpeqw [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x66): /* vpcmpeqd [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x74): /* vpcmpgtb [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x75): /* vpcmpgtw [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x76): /* vpcmpgtd [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x26): /* vptestm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x27): /* vptestm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x29): /* vpcmpeqq [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x37): /* vpcmpgtq [xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( b & (ext == ext_0f38 ? 1 : 2) )
+        {
+            generate_exception_if(b != 0x27 && evex.w != (b & 1), EXC_UD);
+            goto avx512f_no_sae;
+        }
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << (ext == ext_0f ? b & 1 : evex.w);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x6e):    /* mov{d,q} r/m,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
     CASE_SIMD_PACKED_INT(0x0f, 0x7e):    /* mov{d,q} {,x}mm,r/m */
@@ -7577,6 +7605,7 @@ x86_emulate(
                               EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    avx512f_imm_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
         avx512_vlen_check(false);
@@ -8750,6 +8779,19 @@ x86_emulate(
         break;
     }
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1e): /* vpcmpu{d,q} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1f): /* vpcmp{d,q} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3e): /* vpcmpu{b,w} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3f): /* vpcmp{b,w} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( !(b & 0x20) )
+            goto avx512f_imm_no_sae;
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x20): /* pinsrb $imm8,r32/m8,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x22): /* pinsr{d,q} $imm8,r/m,xmm */
         host_and_vcpu_must_have(sse4_1);

* [PATCH v4 19/44] x86emul: support AVX512{F, BW} packed integer arithmetic insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (17 preceding siblings ...)
  2018-09-25 13:37   ` [PATCH v4 18/44] x86emul: support AVX512{F, BW} packed integer compare insns Jan Beulich
@ 2018-09-25 13:37   ` Jan Beulich
  2018-09-25 13:38   ` [PATCH v4 20/44] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
                     ` (24 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:37 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note: vpadd* / vpsub* et al are put in seemingly wrong slots of the big
switch(). This is in anticipation of e.g. vpunpck* later being added to
those groups (the legacy/VEX-encoded case labels nearby exist to
support this).
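
As an aside on the encoding pattern the large AVX512BW case block added
below exploits: for the byte/word insns grouped there, the low bit of
the major opcode byte distinguishes the byte form from the word form
(e.g. vpaddb is 0xfc, vpaddw is 0xfd), which is why elem_bytes can be
derived as 1 << (b & 1). A minimal standalone sketch of just that rule
(bw_elem_bytes() is a made-up name for illustration, not emulator
code):

#include <stdio.h>

/* 'b' is the major opcode byte, as in the hunk further down. */
static unsigned int bw_elem_bytes(unsigned char b)
{
    return 1u << (b & 1); /* even opcode -> byte, odd -> word */
}

int main(void)
{
    printf("vpaddb (0xfc): %u-byte elements\n", bw_elem_bytes(0xfc));
    printf("vpaddw (0xfd): %u-byte elements\n", bw_elem_bytes(0xfd));
    return 0;
}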

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Move a case block further down.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -149,6 +149,8 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(paddd,        66,   0f, fe,    vl,      d, vl),
+    INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
@@ -157,7 +159,16 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
+    INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
+    INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
+    INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
+    INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
+    INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSN(psubd,        66,   0f, fa,    vl,      d, vl),
+    INSN(psubq,        66,   0f, fb,    vl,      q, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
@@ -192,12 +203,39 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(paddb,       66,   0f, fc,    vl,    b, vl),
+    INSN(paddsb,      66,   0f, ec,    vl,    b, vl),
+    INSN(paddsw,      66,   0f, ed,    vl,    w, vl),
+    INSN(paddusb,     66,   0f, dc,    vl,    b, vl),
+    INSN(paddusw,     66,   0f, dd,    vl,    w, vl),
+    INSN(paddw,       66,   0f, fd,    vl,    w, vl),
+    INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
+    INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
     INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
+    INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
+    INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
+    INSN(pmaxub,      66,   0f, de,    vl,    b, vl),
+    INSN(pmaxuw,      66, 0f38, 3e,    vl,    w, vl),
+    INSN(pminsb,      66, 0f38, 38,    vl,    b, vl),
+    INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
+    INSN(pminub,      66,   0f, da,    vl,    b, vl),
+    INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
+    INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
+    INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
+    INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSN(psubb,       66,   0f, f8,    vl,    b, vl),
+    INSN(psubsb,      66,   0f, e8,    vl,    b, vl),
+    INSN(psubsw,      66,   0f, e9,    vl,    w, vl),
+    INSN(psubusb,     66,   0f, d8,    vl,    b, vl),
+    INSN(psubusw,     66,   0f, d9,    vl,    w, vl),
+    INSN(psubw,       66,   0f, f9,    vl,    w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
@@ -206,6 +244,7 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN_PFP(or,               0f, 56),
+    INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -367,21 +367,21 @@ static const struct twobyte_table {
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
-    [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xff] = { ModRM }
 };
 
@@ -451,7 +451,7 @@ static const struct ext0f38_table {
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x40] = { .simd_size = simd_packed_int },
+    [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int },
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
@@ -5970,6 +5970,10 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x39): /* vpmins{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3b): /* vpminu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3d): /* vpmaxs{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6571,6 +6575,31 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd8): /* vpsubusb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd9): /* vpsubusw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdc): /* vpaddusb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdd): /* vpaddusw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe0): /* vpavgb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe3): /* vpavgw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe5): /* vpmulhw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe8): /* vpsubsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe9): /* vpsubsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xec): /* vpaddsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xed): /* vpaddsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf8): /* vpsubb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << (b & 1);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
         op_bytes = 16 << evex.lr;
@@ -6597,6 +6626,12 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x6e):    /* mov{d,q} r/m,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
     CASE_SIMD_PACKED_INT(0x0f, 0x7e):    /* mov{d,q} {,x}mm,r/m */
@@ -7820,6 +7855,12 @@ x86_emulate(
         op_bytes = vex.pfx ? 16 : 8;
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC(0x0f, 0xd4):        /* paddq mm/m64,mm */
     case X86EMUL_OPC(0x0f, 0xf4):        /* pmuludq mm/m64,mm */
     case X86EMUL_OPC(0x0f, 0xfb):        /* psubq mm/m64,mm */
@@ -7848,6 +7889,16 @@ x86_emulate(
         vcpu_must_have(mmxext);
         goto simd_0f_mmx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xda): /* vpminub [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xde): /* vpmaxub [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe4): /* vpmulhuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xea): /* vpminsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xee): /* vpmaxsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = b & 0x10 ? 1 : 2;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f, 0xe6):       /* cvttpd2dq xmm/mem,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe6):   /* vcvttpd2dq {x,y}mm/mem,xmm */
     case X86EMUL_OPC_F3(0x0f, 0xe6):       /* cvtdq2pd xmm/mem,xmm */
@@ -8221,6 +8272,20 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_2);
         goto simd_0f38_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x38): /* vpminsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3a): /* vpminuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3c): /* vpmaxsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3e): /* vpmaxuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = b & 2 ?: 1;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x40): /* vpmull{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0xdb):     /* aesimc xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdb): /* vaesimc xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdc):     /* aesenc xmm/m128,xmm,xmm */

* [PATCH v4 20/44] x86emul: use simd_128 also for legacy vector shift insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (18 preceding siblings ...)
  2018-09-25 13:37   ` [PATCH v4 19/44] x86emul: support AVX512{F, BW} packed integer arithmetic insns Jan Beulich
@ 2018-09-25 13:38   ` Jan Beulich
  2018-09-25 13:39   ` [PATCH v4 21/44] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
                     ` (23 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

This eliminates a separate case block here, and allows getting away
with fewer new ones when the AVX512 vector shifts get added.
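
For illustration, a small standalone model (not the emulator's actual
decode state) of the sizing rule the x86_decode() change below
introduces for simd_128: only the legacy MMX shift forms, i.e. those
neither VEX-encoded nor carrying a 66/F2/F3 prefix, use 8-byte
operands.

#include <stdbool.h>
#include <stdio.h>

static unsigned int simd128_op_bytes(bool vex_encoded, unsigned int pfx)
{
    return vex_encoded || pfx ? 16 : 8;
}

int main(void)
{
    printf("psrlw mm/m64,mm:    %u\n", simd128_op_bytes(false, 0));
    printf("psrlw xmm/m128,xmm: %u\n", simd128_op_bytes(false, 1));
    printf("vpsrlw:             %u\n", simd128_op_bytes(true, 0));
    return 0;
}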

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -366,19 +366,19 @@ static const struct twobyte_table {
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
     [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -3337,7 +3337,8 @@ x86_decode(
         break;
 
     case simd_128:
-        op_bytes = 16;
+        /* The special case here are MMX shift insns. */
+        op_bytes = vex.opcx || vex.pfx ? 16 : 8;
         break;
 
     case simd_256:
@@ -6466,6 +6467,12 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0x75): /* vpcmpeqw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0x76):    /* pcmpeqd {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x76): /* vpcmpeqd {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd1):    /* psrlw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd2):    /* psrld {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd3):    /* psrlq {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xd4):     /* paddq xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xd4): /* vpaddq {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0xd5):    /* pmullw {,x}mm/mem,{,x}mm */
@@ -6488,6 +6495,10 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0xdf): /* vpandn {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe0):     /* pavgb xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe0): /* vpavgb {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xe1):    /* psraw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xe2):    /* psrad {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xe2): /* vpsrad xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe3):     /* pavgw xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe3): /* vpavgw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe4):     /* pmulhuw xmm/m128,xmm */
@@ -6510,6 +6521,12 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0xee): /* vpmaxsw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0xef):    /* pxor {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xef): /* vpxor {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf1):    /* psllw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf2):    /* pslld {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf2): /* vpslld xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf3):    /* psllq {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xf4):     /* pmuludq xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xf4): /* vpmuludq {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xf6):     /* psadbw xmm/m128,xmm */
@@ -7836,25 +7853,6 @@ x86_emulate(
         }
         break;
 
-    CASE_SIMD_PACKED_INT(0x0f, 0xd1):    /* psrlw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xd2):    /* psrld {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xd3):    /* psrlq {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xe1):    /* psraw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xe2):    /* psrad {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xe2): /* vpsrad xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf1):    /* psllw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf2):    /* pslld {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf2): /* vpslld xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf3):    /* psllq {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,{x,y}mm,{x,y}mm */
-        op_bytes = vex.pfx ? 16 : 8;
-        goto simd_0f_int;
-
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */

* [PATCH v4 21/44] x86emul: support AVX512{F, BW} shift/rotate insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (19 preceding siblings ...)
  2018-09-25 13:38   ` [PATCH v4 20/44] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
@ 2018-09-25 13:39   ` Jan Beulich
  2018-09-25 13:40   ` [PATCH v4 22/44] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
                     ` (22 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:39 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note that simd_packed_fp for opcode space 0f38 major opcodes 14 and 15
is not really correct, but it is sufficient for the purposes here.
Further adjustments may be needed later for the down-converting
unsigned saturating VPMOV* insns, first and foremost for the different
Disp8 scaling they use.
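
For background: "Disp8 scaling" (Disp8*N) means EVEX encodings multiply
a one-byte memory displacement by an insn-dependent power-of-two factor
N; the d8s_* values added to the tables below express that factor as a
shift count. A hedged standalone sketch of just the arithmetic, not the
emulator's actual decode path:

#include <stdint.h>
#include <stdio.h>

/* Effective displacement = disp8 * N, with N = 1 << d8s.  For "full
 * vector" operands d8s is the log2 of the vector length in bytes. */
static int32_t evex_disp(int8_t disp8, unsigned int d8s)
{
    return disp8 * (1 << d8s);
}

int main(void)
{
    printf("disp8 1, zmm (d8s 6):  %d\n", evex_disp(1, 6));  /*  64 */
    printf("disp8 -2, xmm (d8s 4): %d\n", evex_disp(-2, 4)); /* -32 */
    return 0;
}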

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -167,6 +167,24 @@ static const struct test avx512f_all[] =
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSNX(prol,        66,   0f, 72, 1, vl,     dq, vl),
+    INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
+    INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
+    INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
+    INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
+    INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
+    INSNX(psllq,       66,   0f, 73, 6, vl,      q, vl),
+    INSN(psllv,        66, 0f38, 47,    vl,     dq, vl),
+    INSNX(psra,        66,   0f, 72, 4, vl,     dq, vl),
+    INSN(psrad,        66,   0f, e2,    el_4,    d, vl),
+    INSN(psraq,        66,   0f, e2,    el_2,    q, vl),
+    INSN(psrav,        66, 0f38, 46,    vl,     dq, vl),
+    INSN(psrld,        66,   0f, d2,    el_4,    d, vl),
+    INSNX(psrld,       66,   0f, 72, 2, vl,      d, vl),
+    INSN(psrlq,        66,   0f, d3,    el_2,    q, vl),
+    INSNX(psrlq,       66,   0f, 73, 2, vl,      q, vl),
+    INSN(psrlv,        66, 0f38, 45,    vl,     dq, vl),
     INSN(psubd,        66,   0f, fa,    vl,      d, vl),
     INSN(psubq,        66,   0f, fb,    vl,      q, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
@@ -230,6 +248,17 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
+    INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
+    INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
+    INSNX(psllw,      66,   0f, 71, 6, vl,    w, vl),
+    INSN(psravw,      66, 0f38, 11,    vl,    w, vl),
+    INSN(psraw,       66,   0f, e1,    el_8,  w, vl),
+    INSNX(psraw,      66,   0f, 71, 4, vl,    w, vl),
+    INSNX(psrldq,     66,   0f, 73, 3, vl,    b, vl),
+    INSN(psrlvw,      66, 0f38, 10,    vl,    w, vl),
+    INSN(psrlw,       66,   0f, d1,    el_8,  w, vl),
+    INSNX(psrlw,      66,   0f, 71, 2, vl,    w, vl),
     INSN(psubb,       66,   0f, f8,    vl,    b, vl),
     INSN(psubsb,      66,   0f, e8,    vl,    b, vl),
     INSN(psubsw,      66,   0f, e9,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -319,7 +319,7 @@ static const struct twobyte_table {
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
-    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
+    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
@@ -366,19 +366,19 @@ static const struct twobyte_table {
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
     [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -434,9 +434,9 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
-    [0x10] = { .simd_size = simd_packed_int },
+    [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
-    [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
+    [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x45 ... 0x47] = { .simd_size = simd_packed_int },
+    [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
@@ -5971,10 +5971,15 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x14): /* vprorv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x15): /* vprolv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x39): /* vpmins{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3b): /* vpminu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3d): /* vpmaxs{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6592,6 +6597,9 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
@@ -6891,6 +6899,37 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x71): /* Grp12 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsraw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512bw_shift_imm:
+            fault_suppression = false;
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512bw_imm;
+        }
+        goto unrecognized_insn;
+
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x72): /* Grp13 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpslld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(evex.w, EXC_UD);
+            /* fall through */
+        case 0: /* vpror{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 1: /* vprol{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsra{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512f_shift_imm:
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512f_imm_no_sae;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x73):        /* Grp14 */
         switch ( modrm_reg & 7 )
         {
@@ -6916,6 +6955,19 @@ x86_emulate(
         }
         goto unrecognized_insn;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x73): /* Grp14 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(!evex.w, EXC_UD);
+            goto avx512f_shift_imm;
+        case 3: /* vpsrldq $imm8,[xyz]mm/mem,[xyz]mm */
+        case 7: /* vpslldq $imm8,[xyz]mm/mem,[xyz]mm */
+            goto avx512bw_shift_imm;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x77):        /* emms */
     case X86EMUL_OPC_VEX(0x0f, 0x77):    /* vzero{all,upper} */
         if ( vex.opcx != vex_none )
@@ -7853,6 +7905,16 @@ x86_emulate(
         }
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe2): /* vpsra{d,q} xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.br, EXC_UD);
+        fault_suppression = false;
+        if ( b == 0xe2 )
+            goto avx512f_no_sae;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8093,6 +8155,14 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
         generate_exception_if(evex.w || evex.br, EXC_UD);
     avx512_broadcast:
@@ -8849,6 +8919,7 @@ x86_emulate(
         generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
         if ( !(b & 0x20) )
             goto avx512f_imm_no_sae;
+    avx512bw_imm:
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.br, EXC_UD);
         elem_bytes = 1 << evex.w;

* [PATCH v4 22/44] x86emul: support AVX512{F, BW, DQ} extract insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (20 preceding siblings ...)
  2018-09-25 13:39   ` [PATCH v4 21/44] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
@ 2018-09-25 13:40   ` Jan Beulich
  2018-09-25 13:40   ` [PATCH v4 23/44] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
                     ` (21 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:40 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Make use of d8s_dq64.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -201,6 +201,7 @@ static const struct test avx512f_all[] =
 };
 
 static const struct test avx512f_128[] = {
+    INSN(extractps, 66, 0f3a, 17, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -210,10 +211,14 @@ static const struct test avx512f_128[] =
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+    INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
+    INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
+    INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -269,6 +274,12 @@ static const struct test avx512bw_all[]
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
 
+static const struct test avx512bw_128[] = {
+    INSN(pextrb, 66, 0f3a, 14, el, b, el),
+//       pextrw, 66,   0f, c5,     w
+    INSN(pextrw, 66, 0f3a, 15, el, w, el),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -277,13 +288,21 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_128[] = {
+    INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+};
+
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
+    INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
+    INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
@@ -613,7 +632,9 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, no128);
     RUN(avx512f, 512);
     RUN(avx512bw, all);
+    RUN(avx512bw, 128);
     RUN(avx512dq, all);
+    RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -512,9 +512,13 @@ static const struct ext0f3a_table {
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
-    [0x14 ... 0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1 },
+    [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
+    [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
+    [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
+    [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
     [0x18] = { .simd_size = simd_128 },
-    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
@@ -523,7 +527,8 @@ static const struct ext0f3a_table {
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
-    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
@@ -2660,6 +2665,8 @@ x86_decode_0f3a(
      ... X86EMUL_OPC_66(0, 0x17):     /* pextr*, extractps */
     case X86EMUL_OPC_VEX_66(0, 0x14)
      ... X86EMUL_OPC_VEX_66(0, 0x17): /* vpextr*, vextractps */
+    case X86EMUL_OPC_EVEX_66(0, 0x14)
+     ... X86EMUL_OPC_EVEX_66(0, 0x17): /* vpextr*, vextractps */
     case X86EMUL_OPC_VEX_F2(0, 0xf0): /* rorx */
         break;
 
@@ -8838,9 +8845,9 @@ x86_emulate(
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */
         rex_prefix &= ~REX_B;
-        vex.b = 1;
+        evex.b = vex.b = 1;
         if ( !mode_64bit() )
-            vex.w = 0;
+            evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
         opc[3] = 0xc3;
@@ -8850,7 +8857,10 @@ x86_emulate(
             --opc;
         }
 
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+            copy_EVEX(opc, evex);
+        else
+            copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=m" (dst.val) : "a" (&dst.val));
         put_stub(stub);
 
@@ -8870,6 +8880,52 @@ x86_emulate(
         opc = init_prefixes(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
+        generate_exception_if(ea.type != OP_REG, EXC_UD);
+        /* Convert to alternative encoding: We want to use a memory operand. */
+        evex.opcx = ext_0f3a;
+        b = 0x15;
+        modrm <<= 3;
+        evex.r = evex.b;
+        evex.R = evex.x;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x14): /* vpextrb $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x15): /* vpextrw $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x16): /* vpextr{d,q} $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x17): /* vextractps $imm8,xmm,r/m */
+        generate_exception_if((evex.lr || evex.reg != 0xf || !evex.RX ||
+                               evex.opmsk || evex.br),
+                              EXC_UD);
+        if ( !(b & 2) )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( !(b & 1) )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto pextr;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(evex.lr != 2 || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
     {
         uint32_t mxcsr;

* [PATCH v4 23/44] x86emul: support AVX512{F, BW, DQ} insert insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (21 preceding siblings ...)
  2018-09-25 13:40   ` [PATCH v4 22/44] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
@ 2018-09-25 13:40   ` Jan Beulich
  2018-09-25 13:41   ` [PATCH v4 24/44] x86emul: basic AVX512F testing Jan Beulich
                     ` (20 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:40 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also correct the comment on the AVX form of VINSERTPS (its memory
operand is m32, not m128).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Make use of d8s_dq64.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -202,6 +202,7 @@ static const struct test avx512f_all[] =
 
 static const struct test avx512f_128[] = {
     INSN(extractps, 66, 0f3a, 17, el,    d, el),
+    INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -213,12 +214,16 @@ static const struct test avx512f_no128[]
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
+    INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
+    INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
+    INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
+    INSN(inserti64x4,    66, 0f3a, 3a, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -278,6 +283,8 @@ static const struct test avx512bw_128[]
     INSN(pextrb, 66, 0f3a, 14, el, b, el),
 //       pextrw, 66,   0f, c5,     w
     INSN(pextrw, 66, 0f3a, 15, el, w, el),
+    INSN(pinsrb, 66, 0f3a, 20, el, b, el),
+    INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
 static const struct test avx512dq_all[] = {
@@ -290,6 +297,7 @@ static const struct test avx512dq_all[]
 
 static const struct test avx512dq_128[] = {
     INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+    INSN(pinsr, 66, 0f3a, 22, el, dq64, el),
 };
 
 static const struct test avx512dq_no128[] = {
@@ -297,12 +305,16 @@ static const struct test avx512dq_no128[
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
+    INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
+    INSN(inserti64x2,    66, 0f3a, 38, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
+    INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
+    INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -360,7 +360,7 @@ static const struct twobyte_table {
     [0xc1] = { DstMem|SrcReg|ModRM },
     [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
-    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
+    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int, 1 },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
     [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
@@ -516,17 +516,19 @@ static const struct ext0f3a_table {
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
     [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
-    [0x18] = { .simd_size = simd_128 },
+    [0x18] = { .simd_size = simd_128, .d8s = 4 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x20] = { .simd_size = simd_none },
-    [0x21] = { .simd_size = simd_other },
-    [0x22] = { .simd_size = simd_none },
+    [0x20] = { .simd_size = simd_none, .d8s = 0 },
+    [0x21] = { .simd_size = simd_other, .d8s = 2 },
+    [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
-    [0x38] = { .simd_size = simd_128 },
+    [0x38] = { .simd_size = simd_128, .d8s = 4 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -2570,6 +2572,7 @@ x86_decode_twobyte(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
     case X86EMUL_OPC_VEX_66(0, 0xc4): /* vpinsrw */
+    case X86EMUL_OPC_EVEX_66(0, 0xc4): /* vpinsrw */
         state->desc = DstReg | SrcMem16;
         break;
 
@@ -2672,6 +2675,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x20):     /* pinsrb */
     case X86EMUL_OPC_VEX_66(0, 0x20): /* vpinsrb */
+    case X86EMUL_OPC_EVEX_66(0, 0x20): /* vpinsrb */
         state->desc = DstImplicit | SrcMem;
         if ( modrm_mod != 3 )
             state->desc |= ByteOp;
@@ -2679,6 +2683,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x22):     /* pinsr{d,q} */
     case X86EMUL_OPC_VEX_66(0, 0x22): /* vpinsr{d,q} */
+    case X86EMUL_OPC_EVEX_66(0, 0x22): /* vpinsr{d,q} */
         state->desc = DstImplicit | SrcMem;
         break;
 
@@ -7695,6 +7700,23 @@ x86_emulate(
         ea.type = OP_MEM;
         goto simd_0f_int_imm8;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc4):   /* vpinsrw $imm8,r32/m16,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x20): /* vpinsrb $imm8,r32/m8,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x22): /* vpinsr{d,q} $imm8,r/m,xmm,xmm */
+        generate_exception_if(evex.lr || evex.opmsk || evex.br, EXC_UD);
+        if ( b & 2 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        if ( !mode_64bit() )
+            evex.w = 0;
+        memcpy(mmvalp, &src.val, op_bytes);
+        ea.type = OP_MEM;
+        op_bytes = src.bytes;
+        d = SrcMem16; /* Fake for the common SIMD code below. */
+        state->simd_size = simd_other;
+        goto avx512f_imm_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0xc5):      /* pextrw $imm8,{,x}mm,reg */
     case X86EMUL_OPC_VEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
         generate_exception_if(vex.l, EXC_UD);
@@ -8906,8 +8928,12 @@ x86_emulate(
         opc = init_evex(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x18): /* vinsertf32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinsertf64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x38): /* vinserti32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinserti64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
@@ -8916,8 +8942,12 @@ x86_emulate(
         fault_suppression = false;
         goto avx512f_imm_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1a): /* vinsertf32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinsertf64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3a): /* vinserti32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinserti64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
         if ( !evex.w )
@@ -9010,13 +9040,19 @@ x86_emulate(
         op_bytes = 4;
         goto simd_0f3a_common;
 
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
         op_bytes = 4;
         /* fall through */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x41): /* vdppd $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
+        op_bytes = 4;
+        generate_exception_if(evex.lr || evex.w || evex.opmsk || evex.br,
+                              EXC_UD);
+        goto avx512f_imm_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )

* [PATCH v4 24/44] x86emul: basic AVX512F testing
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (22 preceding siblings ...)
  2018-09-25 13:40   ` [PATCH v4 23/44] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
@ 2018-09-25 13:41   ` Jan Beulich
  2018-09-25 13:41   ` [PATCH v4 25/44] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
                     ` (19 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:41 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Test a variety of the insns which have already been implemented.
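
A minimal standalone illustration (assuming an AVX512F-capable compiler
and CPU; build with e.g. gcc -mavx512f) of the opmask-based eq()
approach used by the harness additions below: a compare produces a mask
register, which is then checked against the all-ones value for the
element count.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512i x = _mm512_set1_epi32(7);
    __m512i y = _mm512_set1_epi32(7);
    __mmask16 m = _mm512_cmpeq_epi32_mask(x, y); /* one bit per dword */

    /* 16 dword elements -> the ALL_TRUE equivalent is 0xffff. */
    printf("equal: %s\n", m == 0xffff ? "yes" : "no");
    return 0;
}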

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Make eq() also work for 4- and 8-byte integer element sizes.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -13,7 +13,7 @@ all: $(TARGET)
 run: $(TARGET)
 	./$(TARGET)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -60,6 +60,9 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
+avx512f-vecs := 64
+avx512f-ints := 4 8
+avx512f-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
@@ -145,7 +148,7 @@ $(addsuffix .c,$(SG)):
 
 $(addsuffix .h,$(SIMD) $(FMA) $(SG)): simd.h
 
-xop.h: simd-fma.c
+xop.h avx512f.h: simd-fma.c
 
 endif # 32-bit override
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -2,7 +2,41 @@
 
 ENTRY(simd_test);
 
-#if VEC_SIZE == 8 && defined(__SSE__)
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if VEC_SIZE == 4
+#  define eq(x, y) ({ \
+    float x_ = (x)[0]; \
+    float __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpss $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif VEC_SIZE == 8
+#  define eq(x, y) ({ \
+    double x_ = (x)[0]; \
+    double __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpsd $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif FLOAT_SIZE == 4
+/*
+ * gcc's (up to at least 8.2) __builtin_ia32_cmpps256_mask() has an anomaly in
+ * that its return type is QI rather than UQI, and hence the value would get
+ * sign-extended before comparing to ALL_TRUE. The same oddity does not matter
+ * for __builtin_ia32_cmppd256_mask(), as only 4 bits are significant there.
+ * Hence the extra " & ALL_TRUE".
+ */
+#  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
+# elif FLOAT_SIZE == 8
+#  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif INT_SIZE == 4 || UINT_SIZE == 4
+#  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+# endif
+#elif VEC_SIZE == 8 && defined(__SSE__)
 # define to_bool(cmp) (__builtin_ia32_pmovmskb(cmp) == 0xff)
 #elif VEC_SIZE == 16
 # if defined(__AVX__) && defined(FLOAT_SIZE)
@@ -93,6 +127,50 @@ static inline bool _to_bool(byte_vec_t b
     touch(x); \
     __builtin_ia32_pfrcpit2(__builtin_ia32_pfrsqit1(__builtin_ia32_pfmul(t_, t_), x), t_); \
 })
+#elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if VEC_SIZE > FLOAT_SIZE
+#  if FLOAT_SIZE == 4
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastss %1, %0" \
+          : "=v" (t_) : "m" (*(float[1]){ x }) ); \
+    t_; \
+})
+#   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#   define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#   if VEC_SIZE == 16
+#    define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
+#    define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
+#    define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#   endif
+#  elif FLOAT_SIZE == 8
+#   if VEC_SIZE >= 32
+#    define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastsd %1, %0" : "=v" (t_) \
+          : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#   else
+#    define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#   endif
+#   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#   define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#   if VEC_SIZE == 16
+#    define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
+#    define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
+#    define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#   endif
+#  endif
+# endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
 # if VEC_SIZE == 32 && defined(__AVX__)
 #  if defined(__AVX2__)
@@ -191,7 +269,30 @@ static inline bool _to_bool(byte_vec_t b
 #  define sqrt(x) scalar_1op(x, "sqrtsd %[in], %[out]")
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE2__)
+#if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
+     defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 4 || UINT_SIZE == 4
+#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+# endif
+# if INT_SIZE == 4
+#  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
+#  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 4
+#  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# endif
+#elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
 #  define interleave_lo(x, y) ((vec_t)__builtin_ia32_punpcklbw128((vqi_t)(x), (vqi_t)(y)))
@@ -587,6 +688,10 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if defined(__AVX512F__) && defined(FLOAT_SIZE)
+# include "simd-fma.c"
+#endif
+
 int simd_test(void)
 {
     unsigned int i, j;
@@ -1034,7 +1139,8 @@ int simd_test(void)
 # endif
 #endif
 
-#if defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)
+#if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
+    (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
 #endif
 
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,9 +70,111 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE == 16
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
+# define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
+#elif VEC_SIZE == 32
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 256 ## s(a)
+#elif VEC_SIZE == 64
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 512 ## s(a)
+# define BR(n, s, a...)  __builtin_ia32_ ## n ## 512 ## s(a, 4)
+#endif
+#ifndef B_
+# define B_ B
+#endif
+#ifndef BR
+# define BR B
+# define BR_ B_
+#endif
+#ifndef BR_
+# define BR_ BR
+#endif
+
+#ifdef __AVX512F__
+
+/*
+ * The original plan was to effect use of EVEX encodings for scalar as well as
+ * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
+ * only of course) XMM16-XMM31 only. All sorts of compiler errors result when
+ * doing this with gcc 8.2. Therefore resort to injecting {evex} prefixes,
+ * which has the benefit of also working for 32-bit. Granted, there is a lot of
+ * escaping to get right here.
+ */
+asm ( ".macro override insn    \n\t"
+      ".macro $\\insn o:vararg \n\t"
+      ".purgem \\insn          \n\t"
+      "{evex} \\insn \\(\\)o   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\(\\))o     \n\t"
+      ".endm                   \n\t"
+      ".endm                   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\)o         \n\t"
+      ".endm                   \n\t"
+      ".endm" );
+
+#define OVR(n) asm ( "override v" #n )
+#define OVR_SFP(n) OVR(n ## sd); OVR(n ## ss)
+
+#ifdef __AVX512VL__
+# ifdef __AVX512BW__
+#  define OVR_BW(n) OVR(p ## n ## b); OVR(p ## n ## w)
+# else
+#  define OVR_BW(n)
+# endif
+# define OVR_DQ(n) OVR(p ## n ## d); OVR(p ## n ## q)
+# define OVR_VFP(n) OVR(n ## pd); OVR(n ## ps)
+#else
+# define OVR_BW(n)
+# define OVR_DQ(n)
+# define OVR_VFP(n)
+#endif
+
+#define OVR_FMA(n, w) OVR_ ## w(n ## 132); OVR_ ## w(n ## 213); \
+                      OVR_ ## w(n ## 231)
+#define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
+#define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
+
+OVR_SFP(broadcast);
+OVR_SFP(comi);
+OVR_FP(add);
+OVR_FP(div);
+OVR(extractps);
+OVR_FMA(fmadd, FP);
+OVR_FMA(fmsub, FP);
+OVR_FMA(fnmadd, FP);
+OVR_FMA(fnmsub, FP);
+OVR(insertps);
+OVR_FP(max);
+OVR_FP(min);
+OVR(movd);
+OVR(movq);
+OVR_SFP(mov);
+OVR_FP(mul);
+OVR_FP(sqrt);
+OVR_FP(sub);
+OVR_SFP(ucomi);
+
+#undef OVR_VFP
+#undef OVR_SFP
+#undef OVR_INT
+#undef OVR_FP
+#undef OVR_FMA
+#undef OVR_DQ
+#undef OVR_BW
+#undef OVR
+
+#endif
+
 /*
  * Suppress value propagation by the compiler, preventing unwanted
  * optimization. This at once makes the compiler use memory operands
  * more often, which for our purposes is the more interesting case.
  */
 #define touch(var) asm volatile ( "" : "+m" (var) )
+
+static inline vec_t undef(void)
+{
+    vec_t v = v;
+    return v;
+}
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -1,10 +1,9 @@
+#if !defined(__XOP__) && !defined(__AVX512F__)
 #include "simd.h"
-
-#ifndef __XOP__
 ENTRY(fma_test);
 #endif
 
-#if VEC_SIZE < 16
+#if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
 #elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
@@ -24,7 +23,13 @@ ENTRY(fma_test);
 # define eq(x, y) to_bool((x) == (y))
 #endif
 
-#if VEC_SIZE == 16
+#if defined(__AVX512F__) && VEC_SIZE > FLOAT_SIZE
+# if FLOAT_SIZE == 4
+#  define fmaddsub(x, y, z) BR(vfmaddsubps, _mask, x, y, z, ~0)
+# elif FLOAT_SIZE == 8
+#  define fmaddsub(x, y, z) BR(vfmaddsubpd, _mask, x, y, z, ~0)
+# endif
+#elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
 #  define addsub(x, y) __builtin_ia32_addsubps(x, y)
 #  if defined(__FMA4__) || defined(__FMA__)
@@ -50,6 +55,10 @@ ENTRY(fma_test);
 # endif
 #endif
 
+#if defined(fmaddsub) && !defined(addsub)
+# define addsub(x, y) fmaddsub(x, broadcast(1), y)
+#endif
+
 int fma_test(void)
 {
     unsigned int i;
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -21,6 +21,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-opmask.h"
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
+#include "avx512f.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -248,6 +249,14 @@ static const struct {
     SIMD(OPMASK/b,    avx512dq_opmask,         1),
     SIMD(OPMASK/d,    avx512bw_opmask,         4),
     SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(AVX512F f32 scalar,  avx512f,        f4),
+    SIMD(AVX512F f32x16,      avx512f,      64f4),
+    SIMD(AVX512F f64 scalar,  avx512f,        f8),
+    SIMD(AVX512F f64x8,       avx512f,      64f8),
+    SIMD(AVX512F s32x16,      avx512f,      64i4),
+    SIMD(AVX512F u32x16,      avx512f,      64u4),
+    SIMD(AVX512F s64x8,       avx512f,      64i8),
+    SIMD(AVX512F u64x8,       avx512f,      64u8),
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v4 25/44] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
From: Jan Beulich @ 2018-09-25 13:41 UTC (permalink / raw)
  To: xen-devel, Jan Beulich; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note that the pbroadcastw table entry in evex-disp8.c is slightly
different from what one would expect, due to the insn requiring EVEX.W
to be zero.
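
Purely for illustration (not patch code; the names are made up and
-mavx512f is assumed), the register-source form added with opcode 0x7c
can be exercised from C along these lines, mirroring the broadcast2()
test helper below:

typedef int __attribute__((vector_size(64))) s32x16_t;

static inline s32x16_t bcast_gpr_d(int x)
{
    s32x16_t t_;

    /* The %k modifier forces the 32-bit GPR name, matching the insn's
     * r32 source form. */
    asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) );

    return t_;
}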

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -153,6 +153,9 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+//       pbroadcast,   66, 0f38, 7c,          dq64
+    INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
+    INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
     INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
     INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
@@ -211,6 +214,7 @@ static const struct test avx512f_128[] =
 
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcasti32x4, 66, 0f38, 5a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
@@ -220,6 +224,7 @@ static const struct test avx512f_no128[]
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(broadcasti64x4, 66, 0f38, 5b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
     INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
@@ -239,6 +244,10 @@ static const struct test avx512bw_all[]
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
+//       pbroadcastb, 66, 0f38, 7a,           b
+    INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
+//       pbroadcastw, 66, 0f38, 7b,           b
     INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
@@ -290,6 +299,7 @@ static const struct test avx512bw_128[]
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
+    INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
@@ -303,6 +313,7 @@ static const struct test avx512dq_128[]
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(broadcasti64x2, 66, 0f38, 5a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
     INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
@@ -311,6 +322,7 @@ static const struct test avx512dq_no128[
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(broadcasti32x8, 66, 0f38, 5b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
     INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -272,9 +272,33 @@ static inline bool _to_bool(byte_vec_t b
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if INT_SIZE == 4 || UINT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastd %1, %0" \
+          : "=v" (t_) : "m" (*(int[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(long long[1]){ x }) ); \
+    t_; \
+})
+#  ifdef __x86_64__
+#   define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastq %1, %0" : "=v" (t_) : "r" ((x) + 0ULL) ); \
+    t_; \
+})
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
@@ -971,10 +995,14 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
-#if defined(broadcast)
+#ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#ifdef broadcast2
+    if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -454,9 +454,13 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
-    [0x5a] = { .simd_size = simd_128, .two_op = 1 },
-    [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
+    [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
+    [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
+    [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x78] = { .simd_size = simd_other, .two_op = 1 },
+    [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
+    [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -2620,6 +2624,11 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
+    case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
+    case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
+        break;
+
     case 0xf0: /* movbe / crc32 */
         state->desc |= repne_prefix() ? ByteOp : Mov;
         if ( rep_prefix() )
@@ -8193,6 +8202,8 @@ x86_emulate(
         goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
+        op_bytes = elem_bytes;
         generate_exception_if(evex.w || evex.br, EXC_UD);
     avx512_broadcast:
         /*
@@ -8211,17 +8222,27 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
                                             /* vbroadcastf64x4 m256,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
+                                            /* vbroadcasti64x4 m256,zmm{k} */
         generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
                                             /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(!evex.lr, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
+                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
+        if ( b == 0x59 )
+            op_bytes = 8;
+        generate_exception_if(evex.br, EXC_UD);
         if ( !evex.w )
             host_and_vcpu_must_have(avx512dq);
         goto avx512_broadcast;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
                                             /* vbroadcastf64x2 m128,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
         generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.br,
                               EXC_UD);
         if ( evex.w )
@@ -8415,6 +8436,45 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+        op_bytes = elem_bytes = 1 << (b & 1);
+        /* See the comment at the avx512_broadcast label. */
+        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
+        generate_exception_if((ea.type != OP_REG || evex.br ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0xf8;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "+m" (src.val) : "a" (src.val));
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v4 26/44] x86emul: basic AVX512VL testing
From: Jan Beulich @ 2018-09-25 13:42 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Test the 128- and 256-bit variants of the insns which have been
implemented already.
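
As a minimal sketch of what the insn overrides arrange for
(illustrative only; assumes AVX512VL hardware, -mavx512vl, and a gas
new enough to understand the {evex} pseudo-prefix):

typedef int __attribute__((vector_size(16))) s32x4_t;

static inline s32x4_t add_evex128(s32x4_t x, s32x4_t y)
{
    s32x4_t r_;

    /* The pseudo-prefix makes gas emit the EVEX (AVX512VL) encoding
     * even where the shorter VEX one would also do. */
    asm ( "%{evex%} vpaddd %2, %1, %0" : "=v" (r_) : "v" (x), "vm" (y) );

    return r_;
}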

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -60,7 +60,7 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
-avx512f-vecs := 64
+avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
 
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -5,13 +5,13 @@ ENTRY(fma_test);
 
 #if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
-#elif VEC_SIZE == 16
+#elif VEC_SIZE == 16 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
 #  define to_bool(cmp) __builtin_ia32_vtestcpd(cmp, (vec_t){} == 0)
 # endif
-#elif VEC_SIZE == 32
+#elif VEC_SIZE == 32 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps256(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -533,7 +533,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define rotr(x, n) ((vec_t)__builtin_ia32_palignr128((vdi_t)(x), (vdi_t)(x), (n) * 64))
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE4_1__)
+#if VEC_SIZE == 16 && defined(__SSE4_1__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)__builtin_ia32_pmaxsb128((vqi_t)(x), (vqi_t)(y)))
 #  define min(x, y) ((vec_t)__builtin_ia32_pminsb128((vqi_t)(x), (vqi_t)(y)))
@@ -587,7 +587,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define mix(x, y) __builtin_ia32_blendpd(x, y, 0b10)
 # endif
 #endif
-#if VEC_SIZE == 32 && defined(__AVX__)
+#if VEC_SIZE == 32 && defined(__AVX__) && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define dot_product(x, y) ({ \
     vec_t t_ = __builtin_ia32_dpps256(x, y, 0b11110001); \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -92,6 +92,15 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+#if VEC_SIZE < 64
+# pragma GCC target ( "avx512vl" )
+#endif
+
+#define REN(insn, old, new)                      \
+    asm ( ".macro v" #insn #old " o:vararg \n\t" \
+          "v" #insn #new " \\o             \n\t" \
+          ".endm" )
+
 /*
  * The original plan was to effect use of EVEX encodings for scalar as well as
  * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
@@ -135,25 +144,88 @@ asm ( ".macro override insn    \n\t"
 #define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
 #define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
 
+OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
+OVR_INT(add);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
+OVR_FMA(fmaddsub, VFP);
 OVR_FMA(fmsub, FP);
+OVR_FMA(fmsubadd, VFP);
 OVR_FMA(fnmadd, FP);
 OVR_FMA(fnmsub, FP);
 OVR(insertps);
 OVR_FP(max);
+OVR_INT(maxs);
+OVR_INT(maxu);
 OVR_FP(min);
+OVR_INT(mins);
+OVR_INT(minu);
 OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
+OVR_VFP(mova);
+OVR_VFP(movnt);
+OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(shuf);
+OVR_INT(sll);
+OVR_DQ(sllv);
 OVR_FP(sqrt);
+OVR_INT(sra);
+OVR_DQ(srav);
+OVR_INT(srl);
+OVR_DQ(srlv);
 OVR_FP(sub);
+OVR_INT(sub);
 OVR_SFP(ucomi);
+OVR_VFP(unpckh);
+OVR_VFP(unpckl);
+
+#ifdef __AVX512VL__
+# if ELEM_SIZE == 8 && defined(__AVX512DQ__)
+REN(extract, f128, f64x2);
+REN(extract, i128, i64x2);
+REN(insert, f128, f64x2);
+REN(insert, i128, i64x2);
+# else
+REN(extract, f128, f32x4);
+REN(extract, i128, i32x4);
+REN(insert, f128, f32x4);
+REN(insert, i128, i32x4);
+# endif
+# if ELEM_SIZE == 8
+REN(movdqa, , 64);
+REN(movdqu, , 64);
+REN(pand, , q);
+REN(pandn, , q);
+REN(por, , q);
+REN(pxor, , q);
+# else
+#  if ELEM_SIZE == 1 && defined(__AVX512BW__)
+REN(movdq, a, u8);
+REN(movdqu, , 8);
+#  elif ELEM_SIZE == 2 && defined(__AVX512BW__)
+REN(movdq, a, u16);
+REN(movdqu, , 16);
+#  else
+REN(movdqa, , 32);
+REN(movdqu, , 32);
+#  endif
+REN(pand, , d);
+REN(pandn, , d);
+REN(por, , d);
+REN(pxor, , d);
+# endif
+OVR(movntdq);
+OVR(movntdqa);
+OVR(pmulld);
+OVR(pmuldq);
+OVR(pmuludq);
+#endif
 
 #undef OVR_VFP
 #undef OVR_SFP
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -88,6 +88,11 @@ static bool simd_check_avx512f(void)
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
 
+static bool simd_check_avx512f_vl(void)
+{
+    return cpu_has_avx512f && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512dq(void)
 {
     return cpu_has_avx512dq;
@@ -142,11 +147,21 @@ static const struct {
       .check_cpu = simd_check_ ## feat,                             \
       .set_regs = simd_set_regs,                                    \
       .check_regs = simd_check_regs }
+#define AVX512VL_(bits, desc, feat, form)                          \
+    { .code = feat ## _x86_ ## bits ## _D ## _ ## form,            \
+      .size = sizeof(feat ## _x86_ ## bits ## _D ## _ ## form),    \
+      .bitness = bits, .name = "AVX512" #desc,                     \
+      .check_cpu = simd_check_ ## feat ## _vl,                     \
+      .set_regs = simd_set_regs,                                   \
+      .check_regs = simd_check_regs }
 #ifdef __x86_64__
 # define SIMD(desc, feat, form) SIMD_(64, desc, feat, form), \
                                 SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(64, desc, feat, form), \
+                                    AVX512VL_(32, desc, feat, form)
 #else
 # define SIMD(desc, feat, form) SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(32, desc, feat, form)
 #endif
     SIMD(3DNow! single,          _3dnow,     8f4),
     SIMD(SSE scalar single,      sse,         f4),
@@ -257,6 +272,20 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    AVX512VL(VL f32x4,        avx512f,      16f4),
+    AVX512VL(VL f64x2,        avx512f,      16f8),
+    AVX512VL(VL f32x8,        avx512f,      32f4),
+    AVX512VL(VL f64x4,        avx512f,      32f8),
+    AVX512VL(VL s32x4,        avx512f,      16i4),
+    AVX512VL(VL u32x4,        avx512f,      16u4),
+    AVX512VL(VL s32x8,        avx512f,      32i4),
+    AVX512VL(VL u32x8,        avx512f,      32u4),
+    AVX512VL(VL s64x2,        avx512f,      16i8),
+    AVX512VL(VL u64x2,        avx512f,      16u8),
+    AVX512VL(VL s64x4,        avx512f,      32i8),
+    AVX512VL(VL u64x4,        avx512f,      32u8),
+#undef AVX512VL_
+#undef AVX512VL
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v4 27/44] x86emul: support AVX512{F, BW} zero- and sign-extending moves
From: Jan Beulich @ 2018-09-25 13:43 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note that the testing in simd.c doesn't really follow the ISA extension
pattern: to fit the scheme, widening from byte- and word-granular
vectors can (currently) only sensibly happen in the AVX512BW case (and
hence the respective abstraction macros will be added there rather than
here).
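
A hedged sketch (not patch code; invented names, -mavx512f assumed) of
the widening these insns perform, taking vpmovsxdq as example:

typedef int __attribute__((vector_size(32))) s32x8_t;
typedef long long __attribute__((vector_size(64))) s64x8_t;

static inline s64x8_t widen_sx_dq(s32x8_t x)
{
    s64x8_t r_;

    /* Each 32-bit element gets sign-extended into a 64-bit one; the
     * vpmovzx* forms zero-extend instead. */
    asm ( "vpmovsxdq %1, %0" : "=v" (r_) : "v" (x) );

    return r_;
}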

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -166,6 +166,16 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
+    INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
+    INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
+    INSN(pmovzxwq,     66, 0f38, 34,    vl_4,    w, vl),
+    INSN(pmovzxdq,     66, 0f38, 35,    vl_2, d_nb, vl),
     INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
@@ -263,6 +273,8 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -443,13 +443,23 @@ static const struct ext0f38_table {
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
+    [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x23] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x24] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
-    [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
+    [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x32] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x33] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x34] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x35] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
@@ -8308,6 +8318,25 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x25): /* vpmovsxdq {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
+        elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -305,10 +305,12 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxdq, _mask, x, (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 4
 #  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -222,6 +222,16 @@ REN(pxor, , d);
 # endif
 OVR(movntdq);
 OVR(movntdqa);
+OVR(pmovsxbd);
+OVR(pmovsxbq);
+OVR(pmovsxdq);
+OVR(pmovsxwd);
+OVR(pmovsxwq);
+OVR(pmovzxbd);
+OVR(pmovzxbq);
+OVR(pmovzxdq);
+OVR(pmovzxwd);
+OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);





* [PATCH v4 28/44] x86emul: support AVX512{F, BW} down conversion moves
From: Jan Beulich @ 2018-09-25 13:43 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Note that the vpmov{,s,us}{d,q}w table entries in evex-disp8.c are
slightly different from what one would expect, due to the insns
requiring EVEX.W to be zero.
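
For the opposite direction, an illustrative sketch (invented names,
-mavx512f assumed) of the truncating vpmovqd covered here:

typedef long long __attribute__((vector_size(64))) s64x8_t;
typedef int __attribute__((vector_size(32))) s32x8_t;

static inline s32x8_t narrow_qd(s64x8_t x)
{
    s32x8_t r_;

    /* Plain vpmovqd truncates each 64-bit element to 32 bits, while
     * the saturating forms (vpmovsqd, vpmovusqd) clamp instead. */
    asm ( "vpmovqd %1, %0" : "=v" (r_) : "v" (x) );

    return r_;
}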

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Also #UD when evex.z is set with a memory operand.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -166,11 +166,26 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovdb,       f3, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovdw,       f3, 0f38, 33,    vl_2,    b, vl),
+    INSN(pmovqb,       f3, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovqd,       f3, 0f38, 35,    vl_2, d_nb, vl),
+    INSN(pmovqw,       f3, 0f38, 34,    vl_4,    b, vl),
+    INSN(pmovsdb,      f3, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsdw,      f3, 0f38, 23,    vl_2,    b, vl),
+    INSN(pmovsqb,      f3, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsqd,      f3, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovsqw,      f3, 0f38, 24,    vl_4,    b, vl),
     INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
     INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
     INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
     INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
     INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovusdb,     f3, 0f38, 11,    vl_4,    b, vl),
+    INSN(pmovusdw,     f3, 0f38, 13,    vl_2,    b, vl),
+    INSN(pmovusqb,     f3, 0f38, 12,    vl_8,    b, vl),
+    INSN(pmovusqd,     f3, 0f38, 15,    vl_2, d_nb, vl),
+    INSN(pmovusqw,     f3, 0f38, 14,    vl_4,    b, vl),
     INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
     INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
     INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
@@ -273,7 +288,10 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+    INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -271,6 +271,17 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextracti{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextracti32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -285,6 +296,7 @@ static inline bool _to_bool(byte_vec_t b
 })
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -714,6 +726,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if VEC_SIZE >= 16
+
+# if !defined(low_half) && defined(HALF_SIZE)
+static inline half_t low_half(vec_t x)
+{
+#  if HALF_SIZE < VEC_SIZE
+    half_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 2; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1081,6 +1114,21 @@ int simd_test(void)
 
 #endif
 
+#if defined(widen1) && defined(shrink1)
+    {
+        half_t aux1 = low_half(src), aux2;
+
+        touch(aux1);
+        x = widen1(aux1);
+        touch(x);
+        aux2 = shrink1(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 2; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
 #ifdef dup_lo
     touch(src);
     x = dup_lo(src);
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,23 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE >= 16
+
+# if ELEM_COUNT >= 2
+#  if VEC_SIZE > 32
+#   define HALF_SIZE (VEC_SIZE / 2)
+#  else
+#   define HALF_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(HALF_SIZE))) half_t;
+typedef char __attribute__((vector_size(HALF_SIZE))) vqi_half_t;
+typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
+typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
+typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+# endif
+
+#endif
+
 #if VEC_SIZE == 16
 # define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
 # define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3056,7 +3056,22 @@ x86_decode(
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
             if ( evex_encoded() )
-                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+            {
+                /*
+                 * VPMOVUS* are identical to VPMOVS* Disp8-scaling-wise, but
+                 * their attributes don't match those of the vex_66 encoded
+                 * insns with the same base opcodes. Rather than adding new
+                 * columns to the table, handle this here for now.
+                 */
+                if ( evex.pfx != vex_f3 || (b & 0xf8) != 0x10 )
+                    disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+                else
+                {
+                    disp8scale = decode_disp8scale(ext0f38_table[b + 0x10].d8s,
+                                                   state);
+                    state->simd_size = simd_other;
+                }
+            }
             break;
 
         case ext_0f3a:
@@ -8318,10 +8333,14 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10): /* vpmovuswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20): /* vpmovswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30): /* vpmovwb [xyz]mm,{x,y}mm/mem{k} */
         host_and_vcpu_must_have(avx512bw);
-        /* fall through */
+        if ( evex.pfx != vex_f3 )
+        {
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
@@ -8332,7 +8351,28 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
-        generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+            generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        }
+        else
+        {
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x11): /* vpmovusdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x12): /* vpmovusqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x13): /* vpmovusdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x14): /* vpmovusqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* vpmovusqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x21): /* vpmovsdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x22): /* vpmovsqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x23): /* vpmovsdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x24): /* vpmovsqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* vpmovsqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x31): /* vpmovdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x32): /* vpmovqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x33): /* vpmovdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x34): /* vpmovqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* vpmovqd [xyz]mm,{x,y}mm/mem{k} */
+            generate_exception_if(evex.w || (ea.type == OP_MEM && evex.z), EXC_UD);
+            d = DstMem | SrcReg | TwoOp;
+        }
         op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;





* [PATCH v4 29/44] x86emul: support AVX512{F, BW} integer unpack insns
From: Jan Beulich @ 2018-09-25 13:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

There's once again one extra twobyte_table[] entry which gets its Disp8
shift value set right away without support getting implemented just
yet, again to avoid needlessly splitting groups of entries.
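
To recall the lane-local semantics these table entries encode, a small
sketch (not patch code; -mavx assumed) of the 128-bit vpunpckldq case,
where the result interleaves the low halves of both sources:

typedef int __attribute__((vector_size(16))) s32x4_t;

/* r = { x[0], y[0], x[1], y[1] }; vpunpckhdq covers the high halves. */
static inline s32x4_t unpack_lo_d(s32x4_t x, s32x4_t y)
{
    s32x4_t r_;

    asm ( "vpunpckldq %2, %1, %0" : "=v" (r_) : "v" (x), "vm" (y) );

    return r_;
}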

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -218,6 +218,10 @@ static const struct test avx512f_all[] =
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
+    INSN(punpckhdq,    66,   0f, 6a,    vl,      d, vl),
+    INSN(punpckhqdq,   66,   0f, 6d,    vl,      q, vl),
+    INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
+    INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -316,6 +320,10 @@ static const struct test avx512bw_all[]
     INSN(psubw,       66,   0f, f9,    vl,    w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
+    INSN(punpckhbw,   66,   0f, 68,    vl,    b, vl),
+    INSN(punpckhwd,   66,   0f, 69,    vl,    w, vl),
+    INSN(punpcklbw,   66,   0f, 60,    vl,    b, vl),
+    INSN(punpcklwd,   66,   0f, 61,    vl,    w, vl),
 };
 
 static const struct test avx512bw_128[] = {
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -294,6 +294,10 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
@@ -311,6 +315,10 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -252,6 +252,10 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(punpckhdq);
+OVR(punpckhqdq);
+OVR(punpckldq);
+OVR(punpcklqdq);
 #endif
 
 #undef OVR_VFP
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -312,10 +312,10 @@ static const struct twobyte_table {
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
+    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
@@ -6643,6 +6643,12 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6671,6 +6677,13 @@ x86_emulate(
         elem_bytes = 1 << (b & 1);
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x62): /* vpunpckldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6a): /* vpunpckhdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        fault_suppression = false;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
         op_bytes = 16 << evex.lr;
@@ -6697,6 +6710,10 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */





* [PATCH v4 30/44] x86emul: support AVX512{F, BW, _VBMI} full permute insns
From: Jan Beulich @ 2018-09-25 13:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Take the liberty and also correct the (public interface) name of the
AVX512_VBMI feature flag, on the assumption that no external consumer
has actually been using that flag so far. Furthermore make it have
AVX512BW instead of AVX512F as a prerequisite, since it requires full
64-bit mask registers (the upper 48 bits of which can't be accessed
other than through XSAVE/XRSTOR without AVX512BW support).
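
A brief sketch (illustrative only; -mavx512f assumed) of the two-table
permute semantics, using vpermi2d, whose index operand gets overwritten
with the result:

typedef int __attribute__((vector_size(64))) s32x16_t;

/* For each i: result[i] = (idx[i] & 16 ? y : x)[idx[i] & 15]. */
static inline s32x16_t permi2_d(s32x16_t idx, s32x16_t x, s32x16_t y)
{
    asm ( "vpermi2d %2, %1, %0" : "+v" (idx) : "v" (x), "vm" (y) );

    return idx;
}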

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -162,6 +162,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
+    INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -283,6 +287,8 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
+    INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -367,6 +373,11 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512_vbmi_all[] = {
+    INSN(permi2b,       66, 0f38, 75, vl, b, vl),
+    INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -699,4 +710,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -144,6 +144,9 @@ static inline bool _to_bool(byte_vec_t b
 #    define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #    define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #    define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#   else
+#    define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
+#    define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #   endif
 #  elif FLOAT_SIZE == 8
 #   if VEC_SIZE >= 32
@@ -168,6 +171,9 @@ static inline bool _to_bool(byte_vec_t b
 #    define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #    define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #    define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#   else
+#    define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
+#    define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
 #   endif
 #  endif
 # endif
@@ -297,6 +303,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -318,6 +327,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
@@ -763,6 +775,7 @@ int simd_test(void)
 {
     unsigned int i, j;
     vec_t x, y, z, src, inv, alt, sh;
+    vint_t interleave_lo, interleave_hi;
 
     for ( i = 0, j = ELEM_SIZE << 3; i < ELEM_COUNT; ++i )
     {
@@ -776,6 +789,9 @@ int simd_test(void)
         if ( !(i & (i + 1)) )
             --j;
         sh[i] = j;
+
+        interleave_lo[i] = ((i & 1) * ELEM_COUNT) | (i >> 1);
+        interleave_hi[i] = interleave_lo[i] + (ELEM_COUNT / 2);
     }
 
     touch(src);
@@ -1069,7 +1085,7 @@ int simd_test(void)
     x = src * alt;
     y = interleave_lo(x, alt < 0);
     touch(x);
-    z = widen1(x);
+    z = widen1(low_half(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1101,7 +1117,7 @@ int simd_test(void)
 
 # ifdef widen1
     touch(src);
-    x = widen1(src);
+    x = widen1(low_half(src));
     touch(src);
     if ( !eq(x, y) ) return __LINE__;
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,16 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if ELEM_SIZE == 1
+typedef vqi_t vint_t;
+#elif ELEM_SIZE == 2
+typedef vhi_t vint_t;
+#elif ELEM_SIZE == 4
+typedef vsi_t vint_t;
+#elif ELEM_SIZE == 8
+typedef vdi_t vint_t;
+#endif
+
 #if VEC_SIZE >= 16
 
 # if ELEM_COUNT >= 2
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -279,6 +279,16 @@ static inline uint64_t xgetbv(uint32_t x
     (res.b & (1U << 31)) != 0; \
 })
 
+#define cpu_has_avx512_vbmi ({ \
+    struct cpuid_leaf res; \
+    emul_test_cpuid(1, 0, &res, NULL); \
+    if ( !(res.c & (1U << 27)) || ((xgetbv(0) & 0xe6) != 0xe6) ) \
+        res.c = 0; \
+    else \
+        emul_test_cpuid(7, 0, &res, NULL); \
+    (res.c & (1U << 1)) != 0; \
+})
+
 int emul_test_cpuid(
     uint32_t leaf,
     uint32_t subleaf,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -468,9 +468,13 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
     [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
+    [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -1852,6 +1856,7 @@ static bool vcpu_has(
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
+#define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -6011,6 +6016,11 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x76): /* vpermi2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x77): /* vpermi2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7e): /* vpermt2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7f): /* vpermt2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8522,6 +8532,16 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512_vbmi);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -107,6 +107,9 @@
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
+/* CPUID level 0x00000007:0.ecx */
+#define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
+
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
 
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -224,7 +224,7 @@ XEN_CPUFEATURE(AVX512VL,      5*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.ecx, word 6 */
 XEN_CPUFEATURE(PREFETCHWT1,   6*32+ 0) /*A  PREFETCHWT1 instruction */
-XEN_CPUFEATURE(AVX512VBMI,    6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -254,12 +254,17 @@ def crunch_numbers(state):
         AVX2: [AVX512F],
 
         # AVX512F is taken to mean hardware support for 512bit registers
-        # (which in practice depends on the EVEX prefix to encode), and the
-        # instructions themselves. All further AVX512 features are built on
-        # top of AVX512F
+        # (which in practice depends on the EVEX prefix to encode) as well
+        # as mask registers, and the instructions themselves. All further
+        # AVX512 features are built on top of AVX512F
         AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
-                  AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
-                  AVX512_4FMAPS, AVX512_VPOPCNTDQ],
+                  AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
+                  AVX512_VPOPCNTDQ],
+
+        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # dependents of AVX512BW (as they require wider than 16-bit mask
+        # registers), despite the SDM not formally making this connection.
+        AVX512BW: [AVX512_VBMI],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors





* [PATCH v4 31/44] x86emul: support AVX512{F, BW} integer shuffle insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (29 preceding siblings ...)
  2018-09-25 13:44   ` [PATCH v4 30/44] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
@ 2018-09-25 13:46   ` Jan Beulich
  2018-09-25 13:46   ` [PATCH v4 32/44] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
                     ` (12 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:46 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Also include shuff{32x4,64x2}, as they are very similar to shufi{32x4,64x2}.
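
For reference, a rough C model of the 128-bit lane selection these
insns perform on a 512-bit operand (illustrative only, not patch code):

#include <stdint.h>

/*
 * vshufi64x2-style lane shuffle: the two low destination lanes are
 * picked from src1, the two high ones from src2, each via a 2-bit
 * immediate field. A lane is 128 bits, i.e. two qwords here.
 */
static void shuf64x2_512(uint64_t dst[8], const uint64_t src1[8],
                         const uint64_t src2[8], uint8_t imm)
{
    for ( unsigned int lane = 0; lane < 4; ++lane )
    {
        const uint64_t *src = lane < 2 ? src1 : src2;
        unsigned int sel = (imm >> (lane * 2)) & 3;

        dst[lane * 2] = src[sel * 2];
        dst[lane * 2 + 1] = src[sel * 2 + 1];
    }
}

Passing the same register for both sources with an immediate of
0b00011011 reverses all four lanes; the swap() helpers added to the
test harness combine this with an in-lane shuffle to reverse a whole
vector.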

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Move OVR() addition into __AVX512VL__ conditional. Correct comments.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -203,6 +203,7 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
     INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
@@ -253,6 +254,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
+    INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
+    INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
+    INSN(shufi64x2,      66, 0f3a, 43, vl,    q, vl),
 };
 
 static const struct test avx512f_512[] = {
@@ -307,6 +312,9 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSN(pshufb,      66, 0f38, 00,    vl,    b, vl),
+    INSN(pshufhw,     f3,   0f, 70,    vl,    w, vl),
+    INSN(pshuflw,     f2,   0f, 70,    vl,    w, vl),
     INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
     INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
     INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -147,6 +147,10 @@ static inline bool _to_bool(byte_vec_t b
 #   else
 #    define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #    define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
+#    define swap(x) ({ \
+    vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
+})
 #   endif
 #  elif FLOAT_SIZE == 8
 #   if VEC_SIZE >= 32
@@ -174,6 +178,10 @@ static inline bool _to_bool(byte_vec_t b
 #   else
 #    define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #    define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
+#    define swap(x) ({ \
+    vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
+})
 #   endif
 #  endif
 # endif
@@ -303,9 +311,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
+                               VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
+                             0b00011011, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -327,9 +340,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b01001110, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
+                                      VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -119,6 +119,12 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+/* Sadly there are a few exceptions to the general naming rules. */
+#define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
+#define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
+#define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
+#define __builtin_ia32_shuf_i64x2_512_mask __builtin_ia32_shuf_i64x2_mask
+
 #if VEC_SIZE < 64
 # pragma GCC target ( "avx512vl" )
 #endif
@@ -262,6 +268,7 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(pshufd);
 OVR(punpckhdq);
 OVR(punpckhqdq);
 OVR(punpckldq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -318,7 +318,7 @@ static const struct twobyte_table {
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
+    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other, d8s_vl },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
@@ -432,7 +432,8 @@ static const struct ext0f38_table {
     uint8_t vsib:1;
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
-    [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
@@ -543,6 +544,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
+    [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
@@ -552,6 +554,7 @@ static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
+    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6664,6 +6667,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6919,6 +6923,20 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x70): /* vpshufd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x70): /* vpshufhw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x70): /* vpshuflw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx == vex_66 )
+            generate_exception_if(evex.w, EXC_UD);
+        else
+        {
+            host_and_vcpu_must_have(avx512bw);
+            generate_exception_if(evex.br, EXC_UD);
+        }
+        d = (d & ~SrcMask) | SrcMem | TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_imm_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x71):    /* Grp12 */
     case X86EMUL_OPC_VEX_66(0x0f, 0x71):
     CASE_SIMD_PACKED_INT(0x0f, 0x72):    /* Grp13 */
@@ -9104,7 +9122,13 @@ x86_emulate(
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x23): /* vshuff32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshuff64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x43): /* vshufi32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshufi64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
         fault_suppression = false;
         goto avx512f_imm_no_sae;
 





* [PATCH v4 32/44] x86emul: support AVX512{BW, DQ} mask move insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (30 preceding siblings ...)
  2018-09-25 13:46   ` [PATCH v4 31/44] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
@ 2018-09-25 13:46   ` Jan Beulich
  2018-09-25 13:47   ` [PATCH v4 33/44] x86emul: basic AVX512BW testing Jan Beulich
                     ` (11 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:46 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Entries are added to the tables in evex-disp8.c despite these insns
not allowing for memory operands, so that in the end the tables give a
complete picture of the supported EVEX-encoded insns.
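
The underlying semantics are simple enough to model in a few lines of C
(byte granularity shown; illustrative sketch only, not patch code):

#include <stdint.h>

/* vpmovm2b: each mask bit expands to an all-ones / all-zeroes byte. */
static void pmovm2b(uint8_t *dst, uint64_t k, unsigned int n)
{
    for ( unsigned int i = 0; i < n; ++i )
        dst[i] = (k >> i) & 1 ? 0xff : 0x00;
}

/* vpmovb2m: the top bit of each byte becomes the respective mask bit. */
static uint64_t pmovb2m(const uint8_t *src, unsigned int n)
{
    uint64_t k = 0;

    for ( unsigned int i = 0; i < n; ++i )
        k |= (uint64_t)(src[i] >> 7) << i;

    return k;
}

Applying the two back to back is an identity on the mask, which is what
the opmask.S addition at the bottom of this patch relies on.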

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -303,9 +303,12 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+//       pmovb2m,     f3, 0f38, 29,           b
+//       pmovm2,      f3, 0f38, 28,          bw
     INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+//       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
@@ -353,6 +356,9 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
+//       pmovd2m,        f3, 0f38, 39,        d
+//       pmovm2,         f3, 0f38, 38,       dq
+//       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
--- a/tools/tests/x86_emulator/opmask.S
+++ b/tools/tests/x86_emulator/opmask.S
@@ -12,17 +12,23 @@
 
 #if SIZE == 1
 # define _(x) x##b
+# define _v(x, t) _v_(x##q, t)
 #elif SIZE == 2
 # define _(x) x##w
+# define _v(x, t) _v_(x##d, t)
 # define WIDEN(x) x##bw
 #elif SIZE == 4
 # define _(x) x##d
+# define _v(x, t) _v_(x##w, t)
 # define WIDEN(x) x##wd
 #elif SIZE == 8
 # define _(x) x##q
+# define _v(x, t) _v_(x##b, t)
 # define WIDEN(x) x##dq
 #endif
 
+#define _v_(x, t) v##x##t
+
     .macro check res1:req, res2:req, line:req
     _(kmov)       %\res1, DATA(out)
 #if SIZE < 8 || !defined(__i386__)
@@ -131,6 +137,15 @@ _start:
 
 #endif
 
+#if SIZE > 2 ? defined(__AVX512BW__) : defined(__AVX512DQ__)
+
+    _(kmov)       DATA(in1), %k0
+    _v(pmovm2,)   %k0, %zmm7
+    _v(pmov,2m)   %zmm7, %k3
+    check         k0, k3, __LINE__
+
+#endif
+
     xor           %eax, %eax
     ret
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -8422,6 +8422,21 @@ x86_emulate(
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x29): /* vpmov{b,w}2m [xyz]mm,k */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x39): /* vpmov{d,q}2m [xyz]mm,k */
+        generate_exception_if(!evex.r || !evex.R, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x28): /* vpmovm2{b,w} k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x38): /* vpmovm2{d,q} k,[xyz]mm */
+        if ( b & 0x10 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.opmsk || ea.type != OP_REG, EXC_UD);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);






* [PATCH v4 33/44] x86emul: basic AVX512BW testing
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (31 preceding siblings ...)
  2018-09-25 13:46   ` [PATCH v4 32/44] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
@ 2018-09-25 13:47   ` Jan Beulich
  2018-09-25 13:48   ` [PATCH v4 34/44] x86emul: basic AVX512DQ testing Jan Beulich
                     ` (10 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:47 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Test various insns which have already been implemented.
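
One property several of the new checks depend on is that widening and
then shrinking again is an identity on the low part of a vector. A
scalar model (hedged sketch only; the tests use the vpmov*-based
widenN()/shrinkN() macros added below):

#include <stdint.h>

/*
 * widen2/shrink2 round trip at byte granularity: sign-extending the
 * low quarter's bytes to dwords (vpmovsxbd) and truncating them again
 * (vpmovdb) must reproduce the original bytes.
 */
static int widen2_roundtrip_ok(const int8_t *src, unsigned int n)
{
    for ( unsigned int i = 0; i < n / 4; ++i )
    {
        int32_t wide = src[i];        /* widen2() */

        if ( (int8_t)wide != src[i] ) /* shrink2() */
            return 0;
    }

    return 1;
}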

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Add __AVX512VL__ conditional around majority of OVR() additions.
    Correct eq() for 1- and 2-byte cases.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -13,7 +13,7 @@ all: $(TARGET)
 run: $(TARGET)
 	./$(TARGET)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -63,6 +63,9 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512bw-vecs := $(avx512f-vecs)
+avx512bw-ints := 1 2
+avx512bw-flts :=
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -31,6 +31,10 @@ ENTRY(simd_test);
 #  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
 # elif FLOAT_SIZE == 8
 #  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif (INT_SIZE == 1 || UINT_SIZE == 1) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# elif (INT_SIZE == 2 || UINT_SIZE == 2) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqw, _mask, (vhi_t)(x), (vhi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 4 || UINT_SIZE == 4
 #  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 8 || UINT_SIZE == 8
@@ -368,6 +372,87 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # endif
+#elif (INT_SIZE == 1 || UINT_SIZE == 1 || INT_SIZE == 2 || UINT_SIZE == 2) && \
+      defined(__AVX512BW__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 1 || UINT_SIZE == 1
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastb %1, %0" \
+          : "=v" (t_) : "m" (*(char[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastb %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  elif defined(__AVX512VBMI__)
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
+#  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+# elif INT_SIZE == 2 || UINT_SIZE == 2
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastw %1, %0" \
+          : "=v" (t_) : "m" (*(short[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastw %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(pshufhw, _mask, \
+                                      B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
+                                      0b00011011, (vhi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+# endif
+# if INT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovsxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 2
+#  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
+#  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
+#  define widen1(x) ((vec_t)B(pmovsxwd, _mask, x, (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxwq, _mask, x, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 2
+#  define max(x, y) ((vec_t)B(pmaxuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define mul_hi(x, y) ((vec_t)B(pmulhuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxwd, _mask, (vhi_half_t)(x), (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxwq, _mask, (vhi_quarter_t)(x), (vdi_t)undef(), ~0))
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
@@ -559,7 +644,7 @@ static inline bool _to_bool(byte_vec_t b
 #  endif
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSSE3__)
+#if VEC_SIZE == 16 && defined(__SSSE3__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define abs(x) ((vec_t)__builtin_ia32_pabsb128((vqi_t)(x)))
 # elif INT_SIZE == 2
@@ -783,6 +868,40 @@ static inline half_t low_half(vec_t x)
 }
 # endif
 
+# if !defined(low_quarter) && defined(QUARTER_SIZE)
+static inline quarter_t low_quarter(vec_t x)
+{
+#  if QUARTER_SIZE < VEC_SIZE
+    quarter_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+# if !defined(low_eighth) && defined(EIGHTH_SIZE)
+static inline eighth_t low_eighth(vec_t x)
+{
+#  if EIGHTH_SIZE < VEC_SIZE
+    eighth_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
 #endif
 
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
@@ -1111,7 +1230,7 @@ int simd_test(void)
     y = interleave_lo(alt < 0, alt < 0);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen2(x);
+    z = widen2(low_quarter(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1120,7 +1239,7 @@ int simd_test(void)
     y = interleave_lo(y, y);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen3(x);
+    z = widen3(low_eighth(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 #  endif
@@ -1142,14 +1261,14 @@ int simd_test(void)
 
 # ifdef widen2
     touch(src);
-    x = widen2(src);
+    x = widen2(low_quarter(src));
     touch(src);
     if ( !eq(x, z) ) return __LINE__;
 # endif
 
 # ifdef widen3
     touch(src);
-    x = widen3(src);
+    x = widen3(low_eighth(src));
     touch(src);
     if ( !eq(x, interleave_lo(z, (vec_t){})) ) return __LINE__;
 # endif
@@ -1169,6 +1288,36 @@ int simd_test(void)
             if ( aux2[i] != src[i] )
                 return __LINE__;
     }
+#endif
+
+#if defined(widen2) && defined(shrink2)
+    {
+        quarter_t aux1 = low_quarter(src), aux2;
+
+        touch(aux1);
+        x = widen2(aux1);
+        touch(x);
+        aux2 = shrink2(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 4; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
+#if defined(widen3) && defined(shrink3)
+    {
+        eighth_t aux1 = low_eighth(src), aux2;
+
+        touch(aux1);
+        x = widen3(aux1);
+        touch(x);
+        aux2 = shrink3(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 8; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
 #endif
 
 #ifdef dup_lo
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -95,6 +95,32 @@ typedef int __attribute__((vector_size(H
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
 # endif
 
+# if ELEM_COUNT >= 4
+#  if VEC_SIZE > 64
+#   define QUARTER_SIZE (VEC_SIZE / 4)
+#  else
+#   define QUARTER_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(QUARTER_SIZE))) quarter_t;
+typedef char __attribute__((vector_size(QUARTER_SIZE))) vqi_quarter_t;
+typedef short __attribute__((vector_size(QUARTER_SIZE))) vhi_quarter_t;
+typedef int __attribute__((vector_size(QUARTER_SIZE))) vsi_quarter_t;
+typedef long long __attribute__((vector_size(QUARTER_SIZE))) vdi_quarter_t;
+# endif
+
+# if ELEM_COUNT >= 8
+#  if VEC_SIZE > 128
+#   define EIGHTH_SIZE (VEC_SIZE / 8)
+#  else
+#   define EIGHTH_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(EIGHTH_SIZE))) eighth_t;
+typedef char __attribute__((vector_size(EIGHTH_SIZE))) vqi_eighth_t;
+typedef short __attribute__((vector_size(EIGHTH_SIZE))) vhi_eighth_t;
+typedef int __attribute__((vector_size(EIGHTH_SIZE))) vsi_eighth_t;
+typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
+# endif
+
 #endif
 
 #if VEC_SIZE == 16
@@ -182,6 +208,9 @@ OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
 OVR_INT(add);
+OVR_BW(adds);
+OVR_BW(addus);
+OVR_BW(avg);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
@@ -214,6 +243,8 @@ OVR_INT(srl);
 OVR_DQ(srlv);
 OVR_FP(sub);
 OVR_INT(sub);
+OVR_BW(subs);
+OVR_BW(subus);
 OVR_SFP(ucomi);
 OVR_VFP(unpckh);
 OVR_VFP(unpckl);
@@ -275,6 +306,31 @@ OVR(punpckldq);
 OVR(punpcklqdq);
 #endif
 
+#ifdef __AVX512BW__
+OVR(pextrb);
+OVR(pextrw);
+OVR(pinsrb);
+OVR(pinsrw);
+# ifdef __AVX512VL__
+OVR(pmaddwd);
+OVR(pmovsxbw);
+OVR(pmovzxbw);
+OVR(pmulhuw);
+OVR(pmulhw);
+OVR(pmullw);
+OVR(psadbw);
+OVR(pshufb);
+OVR(pshufhw);
+OVR(pshuflw);
+OVR(punpckhbw);
+OVR(punpckhwd);
+OVR(punpcklbw);
+OVR(punpcklwd);
+OVR(slldq);
+OVR(srldq);
+# endif
+#endif
+
 #undef OVR_VFP
 #undef OVR_SFP
 #undef OVR_INT
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512bw.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -105,6 +106,11 @@ static bool simd_check_avx512bw(void)
 }
 #define simd_check_avx512bw_opmask simd_check_avx512bw
 
+static bool simd_check_avx512bw_vl(void)
+{
+    return cpu_has_avx512bw && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -284,6 +290,18 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512BW s8x64,     avx512bw,      64i1),
+    SIMD(AVX512BW u8x64,     avx512bw,      64u1),
+    SIMD(AVX512BW s16x32,    avx512bw,      64i2),
+    SIMD(AVX512BW u16x32,    avx512bw,      64u2),
+    AVX512VL(BW+VL s8x16,    avx512bw,      16i1),
+    AVX512VL(BW+VL u8x16,    avx512bw,      16u1),
+    AVX512VL(BW+VL s8x32,    avx512bw,      32i1),
+    AVX512VL(BW+VL u8x32,    avx512bw,      32u1),
+    AVX512VL(BW+VL s16x8,    avx512bw,      16i2),
+    AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
+    AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
+    AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_





* [PATCH v4 34/44] x86emul: basic AVX512DQ testing
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (32 preceding siblings ...)
  2018-09-25 13:47   ` [PATCH v4 33/44] x86emul: basic AVX512BW testing Jan Beulich
@ 2018-09-25 13:48   ` Jan Beulich
  2018-09-25 13:48   ` [PATCH v4 35/44] x86emul: support AVX512F move high/low insns Jan Beulich
                     ` (9 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:48 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Test various insns which have already been implemented.
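
Among the insns covered are the sub-vector broadcast/insert ones. A
rough C model of the insert semantics (illustrative only, not patch
code):

#include <stdint.h>

/*
 * vinserti64x4-style insert on a 512-bit vector: bit 0 of the
 * immediate selects which 256-bit half of dst gets replaced by src;
 * the other half is left alone.
 */
static void insert64x4(uint64_t dst[8], const uint64_t src[4],
                       unsigned int imm)
{
    unsigned int base = (imm & 1) * 4;

    for ( unsigned int i = 0; i < 4; ++i )
        dst[base + i] = src[i];
}

The new test cases use this to cross-check broadcasts: inserting the
low half of a vector into its own upper half has to match broadcasting
that low half.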

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Wrap OVR(pmullq) in __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -13,7 +13,7 @@ all: $(TARGET)
 run: $(TARGET)
 	./$(TARGET)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -66,9 +66,12 @@ avx512f-flts := 4 8
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
+avx512dq-vecs := $(avx512f-vecs)
+avx512dq-ints := $(avx512f-ints)
+avx512dq-flts := $(avx512f-flts)
 
 avx512f-opmask-vecs := 2
-avx512dq-opmask-vecs := 1
+avx512dq-opmask-vecs := 1 2
 avx512bw-opmask-vecs := 4 8
 
 # For AVX and later, have the compiler avoid XMM0 to widen coverage of
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -121,6 +121,34 @@ typedef int __attribute__((vector_size(E
 typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
 # endif
 
+# define DECL_PAIR(w) \
+typedef w ## _t pair_t; \
+typedef vsi_ ## w ## _t vsi_pair_t; \
+typedef vdi_ ## w ## _t vdi_pair_t
+# define DECL_QUARTET(w) \
+typedef w ## _t quartet_t; \
+typedef vsi_ ## w ## _t vsi_quartet_t; \
+typedef vdi_ ## w ## _t vdi_quartet_t
+# define DECL_OCTET(w) \
+typedef w ## _t octet_t; \
+typedef vsi_ ## w ## _t vsi_octet_t; \
+typedef vdi_ ## w ## _t vdi_octet_t
+
+# if ELEM_COUNT == 4
+DECL_PAIR(half);
+# elif ELEM_COUNT == 8
+DECL_PAIR(quarter);
+DECL_QUARTET(half);
+# elif ELEM_COUNT == 16
+DECL_PAIR(eighth);
+DECL_QUARTET(quarter);
+DECL_OCTET(half);
+# endif
+
+# undef DECL_OCTET
+# undef DECL_QUARTET
+# undef DECL_PAIR
+
 #endif
 
 #if VEC_SIZE == 16
@@ -146,6 +174,14 @@ typedef long long __attribute__((vector_
 #ifdef __AVX512F__
 
 /* Sadly there are a few exceptions to the general naming rules. */
+#define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
+#define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+#define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
+#define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
+#define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
+#define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
+#define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
+#define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
 #define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 #define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 #define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -331,6 +367,20 @@ OVR(srldq);
 # endif
 #endif
 
+#ifdef __AVX512DQ__
+OVR_VFP(and);
+OVR_VFP(andn);
+OVR_VFP(or);
+OVR(pextrd);
+OVR(pextrq);
+OVR(pinsrd);
+OVR(pinsrq);
+# ifdef __AVX512VL__
+OVR(pmullq);
+# endif
+OVR_VFP(xor);
+#endif
+
 #undef OVR_VFP
 #undef OVR_SFP
 #undef OVR_INT
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -134,6 +134,27 @@ static inline bool _to_bool(byte_vec_t b
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if VEC_SIZE > FLOAT_SIZE
+#  if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
+       (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
+       (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#   define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+#  endif
+#  if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
+       (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#   define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+#  endif
 #  if FLOAT_SIZE == 4
 #   define broadcast(x) ({ \
     vec_t t_; \
@@ -141,6 +162,17 @@ static inline bool _to_bool(byte_vec_t b
           : "=v" (t_) : "m" (*(float[1]){ x }) ); \
     t_; \
 })
+#   if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#    define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcastf32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#   endif
+#   if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#    define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
+#    define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
+#   endif
 #   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #   define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
@@ -149,6 +181,13 @@ static inline bool _to_bool(byte_vec_t b
 #    define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #    define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
 #   else
+#    define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
+#    define insert_pair(x, y, p) \
+    B(insertf32x4_, _mask, x, \
+      /* Cast needed below to work around gcc 7.x quirk. */ \
+      (p) & 1 ? (typeof(y))__builtin_ia32_shufps(y, y, 0b01000100) : (y), \
+      (p) >> 1, x, 3 << ((p) * 2))
+#    define insert_quartet(x, y, p) B(insertf32x4_, _mask, x, y, p, undef(), ~0)
 #    define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #    define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #    define swap(x) ({ \
@@ -172,6 +211,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #   endif
+#   if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#    define broadcast_pair(x) B(broadcastf64x2_, _mask, x, undef(), ~0)
+#    define insert_pair(x, y, p) B(insertf64x2_, _mask, x, y, p, undef(), ~0)
+#   endif
+#   if VEC_SIZE == 64
+#    define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
+#    define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
+#   endif
 #   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #   define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
@@ -300,6 +347,16 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 # endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextracti32x4 */ || \
+       (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -312,11 +369,30 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  ifdef __AVX512DQ__
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcasti32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) ((vec_t)B(broadcasti32x8_, _mask, (vsi_octet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_octet(x, y, p) ((vec_t)B(inserti32x8_, _mask, (vsi_t)(x), (vsi_octet_t)(y), p, (vsi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti32x4_, _mask, (vsi_quartet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_pair(x, y, p) \
+    (vec_t)(B(inserti32x4_, _mask, (vsi_t)(x), \
+              /* First cast needed below to work around gcc 7.x quirk. */ \
+              (p) & 1 ? (vsi_pair_t)__builtin_ia32_pshufd((vsi_pair_t)(y), 0b01000100) \
+                      : (vsi_pair_t)(y), \
+              (p) >> 1, (vsi_t)(x), 3 << ((p) * 2)))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti32x4_, _mask, (vsi_t)(x), (vsi_quartet_t)(y), p, (vsi_t)undef(), ~0))
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
@@ -341,6 +417,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ((vec_t)B(broadcasti64x2_, _mask, (vdi_pair_t)(x), (vdi_t)undef(), ~0))
+#   define insert_pair(x, y, p) ((vec_t)B(inserti64x2_, _mask, (vdi_t)(x), (vdi_pair_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti64x4_, , (vdi_quartet_t)(x), (vdi_t)undef(), ~0))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti64x4_, _mask, (vdi_t)(x), (vdi_quartet_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
@@ -892,7 +976,7 @@ static inline eighth_t low_eighth(vec_t
     eighth_t y;
     unsigned int i;
 
-    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+    for ( i = 0; i < ELEM_COUNT / 8; ++i )
         y[i] = x[i];
 
     return y;
@@ -904,6 +988,50 @@ static inline eighth_t low_eighth(vec_t
 
 #endif
 
+#ifdef broadcast_pair
+# if ELEM_COUNT == 4
+#  define broadcast_half broadcast_pair
+# elif ELEM_COUNT == 8
+#  define broadcast_quarter broadcast_pair
+# elif ELEM_COUNT == 16
+#  define broadcast_eighth broadcast_pair
+# endif
+#endif
+
+#ifdef insert_pair
+# if ELEM_COUNT == 4
+#  define insert_half insert_pair
+# elif ELEM_COUNT == 8
+#  define insert_quarter insert_pair
+# elif ELEM_COUNT == 16
+#  define insert_eighth insert_pair
+# endif
+#endif
+
+#ifdef broadcast_quartet
+# if ELEM_COUNT == 8
+#  define broadcast_half broadcast_quartet
+# elif ELEM_COUNT == 16
+#  define broadcast_quarter broadcast_quartet
+# endif
+#endif
+
+#ifdef insert_quartet
+# if ELEM_COUNT == 8
+#  define insert_half insert_quartet
+# elif ELEM_COUNT == 16
+#  define insert_quarter insert_quartet
+# endif
+#endif
+
+#if defined(broadcast_octet) && ELEM_COUNT == 16
+# define broadcast_half broadcast_octet
+#endif
+
+#if defined(insert_octet) && ELEM_COUNT == 16
+# define insert_half insert_octet
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1199,6 +1327,60 @@ int simd_test(void)
     if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#if defined(broadcast_half) && defined(insert_half)
+    {
+        half_t aux = low_half(src);
+
+        touch(aux);
+        x = broadcast_half(aux);
+        touch(aux);
+        y = insert_half(src, aux, 1);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_quarter) && defined(insert_quarter)
+    {
+        quarter_t aux = low_quarter(src);
+
+        touch(aux);
+        x = broadcast_quarter(aux);
+        touch(aux);
+        y = insert_quarter(src, aux, 1);
+        touch(aux);
+        y = insert_quarter(y, aux, 2);
+        touch(aux);
+        y = insert_quarter(y, aux, 3);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_eighth) && defined(insert_eighth) && \
+    /* At least gcc 7.3 "optimizes" away all insert_eighth() calls below. */ \
+    __GNUC__ >= 8
+    {
+        eighth_t aux = low_eighth(src);
+
+        touch(aux);
+        x = broadcast_eighth(aux);
+        touch(aux);
+        y = insert_eighth(src, aux, 1);
+        touch(aux);
+        y = insert_eighth(y, aux, 2);
+        touch(aux);
+        y = insert_eighth(y, aux, 3);
+        touch(aux);
+        y = insert_eighth(y, aux, 4);
+        touch(aux);
+        y = insert_eighth(y, aux, 5);
+        touch(aux);
+        y = insert_eighth(y, aux, 6);
+        touch(aux);
+        y = insert_eighth(y, aux, 7);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -23,6 +23,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
 #include "avx512bw.h"
+#include "avx512dq.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -100,6 +101,11 @@ static bool simd_check_avx512dq(void)
 }
 #define simd_check_avx512dq_opmask simd_check_avx512dq
 
+static bool simd_check_avx512dq_vl(void)
+{
+    return cpu_has_avx512dq && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -267,9 +273,10 @@ static const struct {
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
     SIMD(OPMASK/w,     avx512f_opmask,         2),
-    SIMD(OPMASK/b,    avx512dq_opmask,         1),
-    SIMD(OPMASK/d,    avx512bw_opmask,         4),
-    SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(OPMASK+DQ/b, avx512dq_opmask,         1),
+    SIMD(OPMASK+DQ/w, avx512dq_opmask,         2),
+    SIMD(OPMASK+BW/d, avx512bw_opmask,         4),
+    SIMD(OPMASK+BW/q, avx512bw_opmask,         8),
     SIMD(AVX512F f32 scalar,  avx512f,        f4),
     SIMD(AVX512F f32x16,      avx512f,      64f4),
     SIMD(AVX512F f64 scalar,  avx512f,        f8),
@@ -302,6 +309,24 @@ static const struct {
     AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
     AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
     AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
+    SIMD(AVX512DQ f32x16,    avx512dq,      64f4),
+    SIMD(AVX512DQ f64x8,     avx512dq,      64f8),
+    SIMD(AVX512DQ s32x16,    avx512dq,      64i4),
+    SIMD(AVX512DQ u32x16,    avx512dq,      64u4),
+    SIMD(AVX512DQ s64x8,     avx512dq,      64i8),
+    SIMD(AVX512DQ u64x8,     avx512dq,      64u8),
+    AVX512VL(DQ+VL f32x4,    avx512dq,      16f4),
+    AVX512VL(DQ+VL f64x2,    avx512dq,      16f8),
+    AVX512VL(DQ+VL f32x8,    avx512dq,      32f4),
+    AVX512VL(DQ+VL f64x4,    avx512dq,      32f8),
+    AVX512VL(DQ+VL s32x4,    avx512dq,      16i4),
+    AVX512VL(DQ+VL u32x4,    avx512dq,      16u4),
+    AVX512VL(DQ+VL s32x8,    avx512dq,      32i4),
+    AVX512VL(DQ+VL u32x8,    avx512dq,      32u4),
+    AVX512VL(DQ+VL s64x2,    avx512dq,      16i8),
+    AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
+    AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
+    AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_





* [PATCH v4 35/44] x86emul: support AVX512F move high/low insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (33 preceding siblings ...)
  2018-09-25 13:48   ` [PATCH v4 34/44] x86emul: basic AVX512DQ testing Jan Beulich
@ 2018-09-25 13:48   ` Jan Beulich
  2018-09-25 13:49   ` [PATCH v4 36/44] x86emul: support AVX512F move duplicate insns Jan Beulich
                     ` (8 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:48 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

No explicit test harness additions are needed other than the
overrides, as the compiler already makes use of these insns.
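
As a reminder, the scalar-model semantics involved (illustrative C, not
patch code):

#include <stdint.h>

/*
 * Viewing an xmm register as two qwords: vmovlp{s,d} replaces the low
 * 64 bits from memory and vmovhp{s,d} the high 64 bits, leaving the
 * respective other half untouched. The opcode 13/17 forms store the
 * selected half back to memory instead.
 */
static void movlp(uint64_t xmm[2], uint64_t m64) { xmm[0] = m64; }
static void movhp(uint64_t xmm[2], uint64_t m64) { xmm[1] = m64; }

vmovhlps/vmovlhps are the register-only variants of the same operation,
which is why the emulator change below restricts only the double
precision and store forms to memory operands.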

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -242,6 +242,16 @@ static const struct test avx512f_128[] =
     INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
+//       movhlps,     ,   0f, 12,        d
+    INSN(movhpd,    66,   0f, 16, el,    q, vl),
+    INSN(movhpd,    66,   0f, 17, el,    q, vl),
+    INSN(movhps,      ,   0f, 16, el_2,  d, vl),
+    INSN(movhps,      ,   0f, 17, el_2,  d, vl),
+//       movlhps,     ,   0f, 16,        d
+    INSN(movlpd,    66,   0f, 12, el,    q, vl),
+    INSN(movlpd,    66,   0f, 13, el,    q, vl),
+    INSN(movlps,      ,   0f, 12, el_2,  d, vl),
+    INSN(movlps,      ,   0f, 13, el_2,  d, vl),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
     INSN(movq,      66,   0f, d6, el,    q, el),
 };
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -266,6 +266,12 @@ OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
 OVR_VFP(mova);
+OVR(movhlps);
+OVR(movhpd);
+OVR(movhps);
+OVR(movlhps);
+OVR(movlpd);
+OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -286,11 +286,11 @@ static const struct twobyte_table {
     [0x0f] = { ModRM|SrcImmByte },
     [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
-    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
     [0x20 ... 0x21] = { DstMem|SrcImplicit|ModRM },
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
@@ -6000,6 +6000,26 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_fp;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x12):   /* vmovlpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x13): /* vmovlp{s,d} xmm,m64 */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x16):   /* vmovhpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x17): /* vmovhp{s,d} xmm,m64 */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x12):      /* vmovlps m64,xmm,xmm */
+                                            /* vmovhlps xmm,xmm,xmm */
+    case X86EMUL_OPC_EVEX(0x0f, 0x16):      /* vmovhps m64,xmm,xmm */
+                                            /* vmovlhps xmm,xmm,xmm */
+        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( (d & DstMask) != DstMem )
+            d &= ~TwoOp;
+        op_bytes = 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     case X86EMUL_OPC_F3(0x0f, 0x12):       /* movsldup xmm/m128,xmm */
     case X86EMUL_OPC_VEX_F3(0x0f, 0x12):   /* vmovsldup {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F2(0x0f, 0x12):       /* movddup xmm/m64,xmm */






* [PATCH v4 36/44] x86emul: support AVX512F move duplicate insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (34 preceding siblings ...)
  2018-09-25 13:48   ` [PATCH v4 35/44] x86emul: support AVX512F move high/low insns Jan Beulich
@ 2018-09-25 13:49   ` Jan Beulich
  2018-09-25 13:49   ` [PATCH v4 37/44] x86emul: support AVX512{F, BW, VBMI} permute insns Jan Beulich
                     ` (7 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:49 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Judging from insn prefixes, these are scalar insns, but their (memory)
operands are vector ones (with the exception of 128-bit VMOVDDUP). This
requires some adjustments to the disp8scale calculation code.

No explicit test harness additions other than the overrides, as the
compiler already makes use of the insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
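
To illustrate the disp8scale adjustment below: EVEX compressed
displacements scale by the memory operand size, which for VMOVDDUP is
just 8 bytes at 128-bit vector length but a full vector beyond that,
while VMOVS{L,H}DUP always read a full vector. A sketch mirroring the
decode change (helper name made up):

    #include <stdbool.h>

    /* log2 of the memory operand size; lr: 0/1/2 for 128/256/512-bit. */
    static unsigned int movdup_disp8scale(bool ddup, unsigned int lr)
    {
        if ( ddup )
            return lr ? 4 + lr : 3; /* m64 for the xmm form, else m256/m512 */
        return 4 + lr;              /* vmovs{l,h}dup: m128/m256/m512 */
    }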

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -146,6 +146,8 @@ static const struct test avx512f_all[] =
     INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
     INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
     INSN_PFP_NB(movnt,       0f, 2b),
+    INSN(movshdup,     f3,   0f, 16,    vl,   d_nb, vl),
+    INSN(movsldup,     f3,   0f, 12,    vl,   d_nb, vl),
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
@@ -242,6 +244,7 @@ static const struct test avx512f_128[] =
     INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
+    INSN(movddup,   f2,   0f, 12, el,    q, el),
 //       movhlps,     ,   0f, 12,        d
     INSN(movhpd,    66,   0f, 16, el,    q, vl),
     INSN(movhpd,    66,   0f, 17, el,    q, vl),
@@ -264,6 +267,7 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(movddup,        f2,   0f, 12, vl, q_nb, vl),
     INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
     INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
     INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -326,8 +326,11 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 # endif
+OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
+OVR(movshdup);
+OVR(movsldup);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3036,6 +3036,15 @@ x86_decode(
 
             switch ( b )
             {
+            case 0x12: /* vmovsldup / vmovddup */
+                if ( evex.pfx == vex_f2 )
+                    disp8scale = evex.lr ? 4 + evex.lr : 3;
+                /* fall through */
+            case 0x16: /* vmovshdup */
+                if ( evex.pfx == vex_f3 )
+                    disp8scale = 4 + evex.lr;
+                break;
+
             case 0x20: /* mov cr,reg */
             case 0x21: /* mov dr,reg */
             case 0x22: /* mov reg,cr */
@@ -6035,6 +6044,20 @@ x86_emulate(
         host_and_vcpu_must_have(sse3);
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x12):   /* vmovsldup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x12):   /* vmovddup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x16):   /* vmovshdup [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if((evex.br ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = !(evex.pfx & VEX_PREFIX_DOUBLE_MASK) || evex.lr
+                   ? 16 << evex.lr : 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x14): /* vunpcklp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),






* [PATCH v4 37/44] x86emul: support AVX512{F, BW, VBMI} permute insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (35 preceding siblings ...)
  2018-09-25 13:49   ` [PATCH v4 36/44] x86emul: support AVX512F move duplicate insns Jan Beulich
@ 2018-09-25 13:49   ` Jan Beulich
  2018-09-25 13:50   ` [PATCH v4 38/44] x86emul: support AVX512BW pack insns Jan Beulich
                     ` (6 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:49 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
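
The swap2() overrides added to the harness below rely on VPERMILPS
selecting, independently within each 128-bit lane, one of the lane's
four dwords per destination slot (two immediate bits each), so imm8
0b00011011 reverses every lane. An illustrative model of the insn's
semantics (not the emulator's code; dst must not alias src):

    /* VPERMILPS $imm8 semantics; nlanes = VL / 128. */
    static void vpermilps_imm(float *dst, const float *src,
                              unsigned int imm8, unsigned int nlanes)
    {
        for ( unsigned int l = 0; l < nlanes; ++l )
            for ( unsigned int i = 0; i < 4; ++i )
                dst[l * 4 + i] = src[l * 4 + ((imm8 >> (2 * i)) & 3)];
    }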

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -166,6 +166,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
+    INSN(permilpd,     66, 0f3a, 05,    vl,      q, vl),
+    INSN(permilps,     66, 0f38, 0c,    vl,      d, vl),
+    INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
@@ -268,6 +272,10 @@ static const struct test avx512f_no128[]
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
     INSN(movddup,        f2,   0f, 12, vl, q_nb, vl),
+    INSN(perm,           66, 0f38, 36, vl,   dq, vl),
+    INSN(perm,           66, 0f38, 16, vl,   sd, vl),
+    INSN(permpd,         66, 0f3a, 01, vl,    q, vl),
+    INSN(permq,          66, 0f3a, 00, vl,    q, vl),
     INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
     INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
     INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
@@ -306,6 +314,7 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
@@ -402,6 +411,7 @@ static const struct test avx512dq_512[]
 };
 
 static const struct test avx512_vbmi_all[] = {
+    INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -180,6 +180,7 @@ static inline bool _to_bool(byte_vec_t b
 #    define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #    define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #    define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#    define swap2(x) B_(vpermilps, _mask, x, 0b00011011, undef(), ~0)
 #   else
 #    define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
 #    define insert_pair(x, y, p) \
@@ -194,6 +195,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
 })
+#    define swap2(x) B(vpermilps, _mask, \
+                       B(shuf_f32x4_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b00011011, undef(), ~0)
 #   endif
 #  elif FLOAT_SIZE == 8
 #   if VEC_SIZE >= 32
@@ -226,6 +231,7 @@ static inline bool _to_bool(byte_vec_t b
 #    define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #    define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #    define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#    define swap2(x) B_(vpermilpd, _mask, x, 0b01, undef(), ~0)
 #   else
 #    define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #    define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
@@ -233,6 +239,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
 })
+#    define swap2(x) B(vpermilpd, _mask, \
+                       B(shuf_f64x2_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b01010101, undef(), ~0)
 #   endif
 #  endif
 # endif
@@ -399,6 +409,7 @@ static inline bool _to_bool(byte_vec_t b
                              B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
                                VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
                              0b00011011, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -436,8 +447,17 @@ static inline bool _to_bool(byte_vec_t b
                              (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
                                       VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
                              0b01001110, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  if VEC_SIZE == 32
+#   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
+#  elif VEC_SIZE == 64
+#   define swap3(x) ({ \
+    vdi_t t_ = B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0); \
+    B(shuf_i64x2_, _mask, t_, t_, 0b01001110, (vdi_t)undef(), ~0); \
+})
+#  endif
 # endif
 # if INT_SIZE == 4
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
@@ -483,6 +503,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+#  ifdef __AVX512VBMI__
+#   define swap2(x) ((vec_t)B(permvarqi, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  endif
 # elif INT_SIZE == 2 || UINT_SIZE == 2
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -511,6 +534,7 @@ static inline bool _to_bool(byte_vec_t b
                               (0b01010101010101010101010101010101 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+#  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
@@ -1319,6 +1343,12 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
+#ifdef swap3
+    touch(src);
+    if ( !eq(swap3(src), inv) ) return __LINE__;
+    touch(src);
+#endif
+
 #ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -275,6 +275,8 @@ OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(perm);
+OVR_VFP(permil);
 OVR_VFP(shuf);
 OVR_INT(sll);
 OVR_DQ(sllv);
@@ -331,6 +333,8 @@ OVR(movntdq);
 OVR(movntdqa);
 OVR(movshdup);
 OVR(movsldup);
+OVR(permd);
+OVR(permq);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -434,7 +434,8 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
-    [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
+    [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -477,6 +478,7 @@ static const struct ext0f38_table {
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
+    [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -522,10 +524,10 @@ static const struct ext0f3a_table {
     uint8_t four_op:1;
     disp8scale_t d8s:4;
 } ext0f3a_table[256] = {
-    [0x00] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
+    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
-    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
@@ -8062,6 +8064,9 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0c): /* vpermilps [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0d): /* vpermilpd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         if ( b == 0xe2 )
             goto avx512f_no_sae;
@@ -8406,6 +8411,12 @@ x86_emulate(
         generate_exception_if(!vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
@@ -8610,6 +8621,7 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         if ( !evex.w )
             host_and_vcpu_must_have(avx512_vbmi);
         else
@@ -9036,6 +9048,12 @@ x86_emulate(
         generate_exception_if(!vex.l || !vex.w, EXC_UD);
         goto simd_0f_imm8_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x00): /* vpermq $imm8,{y,z}mm/mem,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x01): /* vpermpd $imm8,{y,z}mm/mem,{y,z}mm{k} */
+        generate_exception_if(!evex.lr || !evex.w, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x38): /* vinserti128 $imm8,xmm/m128,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x39): /* vextracti128 $imm8,ymm,xmm/m128 */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x46): /* vperm2i128 $imm8,ymm/m256,ymm,ymm */
@@ -9055,6 +9073,12 @@ x86_emulate(
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x04): /* vpermilps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x05): /* vpermilpd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm_no_sae;
+
     case X86EMUL_OPC_66(0x0f3a, 0x08): /* roundps $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x09): /* roundpd $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x0a): /* roundss $imm8,xmm/m128,xmm */





* [PATCH v4 38/44] x86emul: support AVX512BW pack insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (36 preceding siblings ...)
  2018-09-25 13:49   ` [PATCH v4 37/44] x86emul: support AVX512{F, BW, VBMI} permute insns Jan Beulich
@ 2018-09-25 13:50   ` Jan Beulich
  2018-09-25 13:51   ` [PATCH v4 39/44] x86emul: support AVX512F floating-point conversion insns Jan Beulich
                     ` (5 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:50 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

No further test harness additions - what is there is good enough for
these rather "regular" insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
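
As a reminder of what's being emulated (a sketch, not emulator code):
the pack insns narrow each source element with saturation, e.g.
VPACKSSDW converts each signed dword to a signed word like this:

    #include <stdint.h>

    /* Per-element saturation of VPACKSSDW; VPACKUSDW would clamp to
       [0, 0xffff] instead. */
    static int16_t saturate_sdword(int32_t v)
    {
        if ( v > INT16_MAX )
            return INT16_MAX;
        if ( v < INT16_MIN )
            return INT16_MIN;
        return v;
    }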

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -296,6 +296,10 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
+    INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
+    INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
+    INSN(packuswb,    66,   0f, 67,    vl,    w, vl),
     INSN(paddb,       66,   0f, fc,    vl,    b, vl),
     INSN(paddsb,      66,   0f, ec,    vl,    b, vl),
     INSN(paddsw,      66,   0f, ed,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -361,6 +361,10 @@ OVR(pextrw);
 OVR(pinsrb);
 OVR(pinsrw);
 # ifdef __AVX512VL__
+OVR(packssdw);
+OVR(packsswb);
+OVR(packusdw);
+OVR(packuswb);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
-    [0x2b] = { .simd_size = simd_packed_int },
+    [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
@@ -6707,6 +6707,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         op_bytes = 16 << evex.lr;
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x63): /* vpacksswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x67): /* vpackuswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6769,6 +6771,12 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6b): /* vpackssdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2b): /* vpackusdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;






* [PATCH v4 39/44] x86emul: support AVX512F floating-point conversion insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (37 preceding siblings ...)
  2018-09-25 13:50   ` [PATCH v4 38/44] x86emul: support AVX512BW pack insns Jan Beulich
@ 2018-09-25 13:51   ` Jan Beulich
  2018-09-25 13:52   ` [PATCH v4 40/44] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
                     ` (4 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:51 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

VCVTPS2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size change for twobyte_table[0x5a] is benign to pre-existing
code, but allows decode_disp8scale() to work as is here.

Also correct the comment on an AVX counterpart.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
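
The test harness addition below hand-encodes binary16 constants (e.g.
0x3c00 is 1.0 and 0xb800 is -0.5). For checking such constants, a
small illustrative decoder (not part of the patch):

    #include <math.h>
    #include <stdint.h>

    /* Decode an IEEE 754 binary16 value. */
    static float half_to_float(uint16_t h)
    {
        unsigned int exp = (h >> 10) & 0x1f, frac = h & 0x3ff;
        float v;

        if ( exp == 0x1f )
            v = frac ? NAN : INFINITY;               /* NaN / infinity */
        else if ( exp )
            v = ldexpf(0x400 | frac, (int)exp - 25); /* normal */
        else
            v = ldexpf(frac, -24);                   /* zero / subnormal */
        return (h & 0x8000) ? -v : v;
    }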

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -98,6 +98,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
+    INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
+    INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -176,6 +176,8 @@ static inline bool _to_bool(byte_vec_t b
 #   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #   define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#   define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
+#   define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #   if VEC_SIZE == 16
 #    define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #    define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -68,6 +68,7 @@ typedef short __attribute__((vector_size
 typedef int __attribute__((vector_size(VEC_SIZE))) vsi_t;
 #if VEC_SIZE >= 8
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
+typedef double __attribute__((vector_size(VEC_SIZE))) vdf_t;
 #endif
 
 #if ELEM_SIZE == 1
@@ -93,6 +94,7 @@ typedef char __attribute__((vector_size(
 typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
 typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+typedef float __attribute__((vector_size(HALF_SIZE))) vsf_half_t;
 # endif
 
 # if ELEM_COUNT >= 4
@@ -328,6 +330,13 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 # endif
+OVR(cvtpd2psx);
+OVR(cvtpd2psy);
+OVR(cvtph2ps);
+OVR(cvtps2pd);
+OVR(cvtps2ph);
+OVR(cvtsd2ss);
+OVR(cvtss2sd);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3842,6 +3842,49 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vcvtph2ps 32(%ecx),%zmm7{%k4}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vcvtph2ps);
+        decl_insn(evex_vcvtps2ph);
+
+        asm volatile ( "vpternlogd $0x81, %%zmm7, %%zmm7, %%zmm7\n\t"
+                       "kmovw %1,%%k4\n"
+                       put_insn(evex_vcvtph2ps, "vcvtph2ps 32(%0), %%zmm7%{%%k4%}")
+                       :: "c" (NULL), "r" (0x3333) );
+
+        set_insn(evex_vcvtph2ps);
+        memset(res, 0xff, 128);
+        res[8] = 0x40003c00; /* (1.0, 2.0) */
+        res[10] = 0x44004200; /* (3.0, 4.0) */
+        res[12] = 0x3400b800; /* (-.5, .25) */
+        res[14] = 0xbc000000; /* (0.0, -1.) */
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovups %%zmm7, %0" : "=m" (res[16]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtph2ps) )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vcvtps2ph $0,%zmm3,64(%edx){%k4}...");
+        asm volatile ( "vmovups %0, %%zmm3\n"
+                       put_insn(evex_vcvtps2ph, "vcvtps2ph $0, %%zmm3, 128(%1)%{%%k4%}")
+                       :: "m" (res[16]), "d" (NULL) );
+
+        set_insn(evex_vcvtps2ph);
+        regs.edx = (unsigned long)res;
+        memset(res + 32, 0xcc, 32);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtps2ph) )
+            goto fail;
+        res[15] = res[13] = res[11] = res[9] = 0xcccccccc;
+        if ( memcmp(res + 8, res + 32, 32) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -310,7 +310,8 @@ static const struct twobyte_table {
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -437,7 +438,7 @@ static const struct ext0f38_table {
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x13] = { .simd_size = simd_other, .two_op = 1 },
+    [0x13] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
@@ -541,7 +542,7 @@ static const struct ext0f3a_table {
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
+    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
@@ -3059,6 +3060,11 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x5a: /* vcvtps2pd needs special casing */
+                if ( disp8scale && !evex.pfx && !evex.br )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -5966,6 +5972,7 @@ x86_emulate(
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5d):    /* vmin{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    avx512f_all_fp:
         generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
                                (ea.type == OP_MEM && evex.br &&
                                 (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
@@ -6523,7 +6530,7 @@ x86_emulate(
         goto simd_zmm;
 
     CASE_SIMD_ALL_FP(, 0x0f, 0x5a):        /* cvt{p,s}{s,d}2{p,s}{s,d} xmm/mem,xmm */
-    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} xmm/mem,xmm */
+    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} {x,y}mm/mem,{x,y}mm */
                                            /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm */
         op_bytes = 4 << (((vex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + vex.l) +
                          !!(vex.pfx & VEX_PREFIX_DOUBLE_MASK));
@@ -6532,6 +6539,12 @@ x86_emulate(
             goto simd_0f_sse2;
         goto simd_0f_avx;
 
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5a):   /* vcvtp{s,d}2p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+                                           /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm{k} */
+        op_bytes = 4 << (((evex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + evex.lr) +
+                         evex.w);
+        goto avx512f_all_fp;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x5b):     /* cvt{ps,dq}2{dq,ps} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x5b): /* vcvt{ps,dq}2{dq,ps} {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F3(0x0f, 0x5b):       /* cvttps2dq xmm/mem,xmm */
@@ -8414,6 +8427,15 @@ x86_emulate(
         op_bytes = 8 << vex.l;
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w || (ea.type == OP_MEM && evex.br), EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.br )
+            avx512_vlen_check(false);
+        op_bytes = 8 << evex.lr;
+        elem_bytes = 2;
+        goto simd_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
         generate_exception_if(!vex.l || vex.w, EXC_UD);
@@ -9237,27 +9259,79 @@ x86_emulate(
         goto avx512f_imm_no_sae;
 
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,[xyz]mm,{x,y}mm/mem{k} */
     {
         uint32_t mxcsr;
 
-        generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
-        host_and_vcpu_must_have(f16c);
         fail_if(!ops->write);
+        if ( evex_encoded() )
+        {
+            generate_exception_if((evex.w || evex.reg != 0xf || !evex.RX ||
+                                   (ea.type == OP_MEM && (evex.z || evex.br))),
+                                  EXC_UD);
+            host_and_vcpu_must_have(avx512f);
+            avx512_vlen_check(false);
+            opc = init_evex(stub);
+        }
+        else
+        {
+            generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
+            host_and_vcpu_must_have(f16c);
+            opc = init_prefixes(stub);
+        }
+
+        op_bytes = 8 << evex.lr;
 
-        opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
         if ( ea.type == OP_MEM )
         {
             /* Convert memory operand to (%rAX). */
             vex.b = 1;
+            evex.b = 1;
             opc[1] &= 0x38;
         }
         opc[2] = imm1;
-        insn_bytes = PFX_BYTES + 3;
+        if ( evex_encoded() )
+        {
+            unsigned int full = 0;
+
+            insn_bytes = EVEX_PFX_BYTES + 3;
+            copy_EVEX(opc, evex);
+
+            if ( ea.type == OP_MEM && evex.opmsk )
+            {
+                full = 0xffff >> (16 - op_bytes / 2);
+                op_mask &= full;
+                if ( !op_mask )
+                    goto complete_insn;
+
+                first_byte = __builtin_ctz(op_mask);
+                op_mask >>= first_byte;
+                full >>= first_byte;
+                first_byte <<= 1;
+                op_bytes = (32 - __builtin_clz(op_mask)) << 1;
+
+                /*
+                 * We may need to read (parts of) the memory operand for the
+                 * purpose of merging in order to avoid splitting the write
+                 * below into multiple ones.
+                 */
+                if ( op_mask != full &&
+                     (rc = ops->read(ea.mem.seg,
+                                     truncate_ea(ea.mem.off + first_byte),
+                                     (void *)mmvalp + first_byte, op_bytes,
+                                     ctxt)) != X86EMUL_OKAY )
+                    goto done;
+            }
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 3;
+            copy_VEX(opc, vex);
+        }
         opc[3] = 0xc3;
 
-        copy_VEX(opc, vex);
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
                     "=m" (*mmvalp), [mxcsr] "=m" (mxcsr) : "a" (mmvalp));
@@ -9266,7 +9340,8 @@ x86_emulate(
 
         if ( ea.type == OP_MEM )
         {
-            rc = ops->write(ea.mem.seg, ea.mem.off, mmvalp, 8 << vex.l, ctxt);
+            rc = ops->write(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
+                            (void *)mmvalp + first_byte, op_bytes, ctxt);
             if ( rc != X86EMUL_OKAY )
             {
                 asm volatile ( "ldmxcsr %0" :: "m" (mxcsr) );
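
The opmask handling added above narrows the store to the span between
the first and last set mask bits (2 bytes per element), reading the
destination first when the mask has holes so that merging doesn't
split the write. Schematically (illustrative helper, not the patch's
exact code):

    /* op_mask must be non-zero; yields byte offset and length. */
    static void masked_span(unsigned int op_mask,
                            unsigned int *first_byte, unsigned int *op_bytes)
    {
        unsigned int first = __builtin_ctz(op_mask);

        *first_byte = first * 2;
        *op_bytes = (32 - __builtin_clz(op_mask >> first)) * 2;
    }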





* [PATCH v4 40/44] x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (38 preceding siblings ...)
  2018-09-25 13:51   ` [PATCH v4 39/44] x86emul: support AVX512F floating-point conversion insns Jan Beulich
@ 2018-09-25 13:52   ` Jan Beulich
  2018-09-25 13:53   ` [PATCH v4 41/44] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
                     ` (3 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:52 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

... including the two AVX512DQ forms which share these encodings, just
with EVEX.W set there.

VCVTDQ2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size changes for the twobyte_table[] entries are benign to
pre-existing code, but allow decode_disp8scale() to work as is here.

The placement of the 0xe6 case block, wrong at this point, once again
anticipates further case labels being added.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
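
To help follow the 0xe6 (and 0x5b) handling below: the memory operand
covers a full vector for quad-word-element sources, but only half a
vector for VCVTDQ2PD's dword source, matching the op_bytes computation
in the patch. Illustrative helper (name made up):

    /* Memory operand size in bytes; w = EVEX.W, lr = 0/1/2. */
    static unsigned int cvt_0xe6_op_bytes(unsigned int w, unsigned int lr)
    {
        return 8 << (w + lr); /* dq2pd: 8/16/32; pd2dq, qq2pd: 16/32/64 */
    }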

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -98,8 +98,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
+    INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
+    INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
@@ -388,6 +392,8 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
+    INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -92,6 +92,13 @@ static inline bool _to_bool(byte_vec_t b
 # define to_int(x) ((vec_t){ (int)(x)[0] })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
+#elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+# elif FLOAT_SIZE == 8
+#  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
 #  define to_int(x) __builtin_ia32_cvtdq2ps(__builtin_ia32_cvtps2dq(x))
@@ -1136,15 +1143,21 @@ int simd_test(void)
     touch(src);
     if ( !eq(x * -alt, -src) ) return __LINE__;
 
-# if defined(recip) && defined(to_int)
+# ifdef to_int
 
     touch(src);
+    x = to_int(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+
+#  ifdef recip
+    touch(src);
     x = recip(src);
     touch(src);
     touch(x);
     if ( !eq(to_int(recip(x)), src) ) return __LINE__;
 
-#  ifdef rsqrt
+#   ifdef rsqrt
     x = src * src;
     touch(x);
     y = rsqrt(x);
@@ -1152,6 +1165,7 @@ int simd_test(void)
     if ( !eq(to_int(recip(y)), src) ) return __LINE__;
     touch(src);
     if ( !eq(to_int(y), to_int(recip(src))) ) return __LINE__;
+#   endif
 #  endif
 
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -244,6 +244,7 @@ asm ( ".macro override insn    \n\t"
 OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
+OVR_VFP(cvtdq2);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
@@ -330,13 +331,19 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 # endif
+OVR(cvtpd2dqx);
+OVR(cvtpd2dqy);
 OVR(cvtpd2psx);
 OVR(cvtpd2psy);
 OVR(cvtph2ps);
+OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
 OVR(cvtss2sd);
+OVR(cvttpd2dqx);
+OVR(cvttpd2dqy);
+OVR(cvttps2dq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -311,7 +311,7 @@ static const struct twobyte_table {
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -375,7 +375,7 @@ static const struct twobyte_table {
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
@@ -3069,6 +3069,11 @@ x86_decode(
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
                 break;
+
+            case 0xe6: /* vcvtdq2pd needs special casing */
+                if ( disp8scale && evex.pfx == vex_f3 && !evex.w && !evex.br )
+                    --disp8scale;
+                break;
             }
             break;
 
@@ -6553,6 +6558,22 @@ x86_emulate(
         op_bytes = 16 << vex.l;
         goto simd_0f_cvt;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x5b): /* vcvtps2dq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x5b): /* vcvttps2dq [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x60):    /* punpcklbw {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x60): /* vpunpcklbw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0x61):    /* punpcklwd {,x}mm/mem,{,x}mm */
@@ -7211,6 +7232,27 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe6):   /* vcvttpd2dq [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx != vex_f3 )
+            host_and_vcpu_must_have(avx512f);
+        else if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+        {
+            host_and_vcpu_must_have(avx512f);
+            generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        }
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 8 << (evex.w + evex.lr);
+        goto simd_zmm;
+
     case X86EMUL_OPC_F2(0x0f, 0xf0):     /* lddqu m128,xmm */
     case X86EMUL_OPC_VEX_F2(0x0f, 0xf0): /* vlddqu mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v4 41/44] x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (39 preceding siblings ...)
  2018-09-25 13:52   ` [PATCH v4 40/44] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
@ 2018-09-25 13:53   ` Jan Beulich
  2018-09-25 13:53   ` [PATCH v4 42/44] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
                     ` (2 subsequent siblings)
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

VCVT{,T}S{S,D}2SI use EVEX.W for their destination (register) operand
size rather than for that of their (possibly memory) source, and hence
need a "manual" override of disp8scale.

Slightly adjust the scalar to_int() in the test harness to increase
the chances of the operand ending up in memory.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
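
The "manual" override boils down to tying the compressed displacement
to the scalar source element size - 4 bytes for the F3 (ss) forms, 8
for the F2 (sd) ones - with EVEX.W only sizing the destination GPR.
Schematically (illustrative helper):

    #include <stdbool.h>

    /* disp8 scale, i.e. log2 of the memory access size. */
    static unsigned int cvts2si_disp8scale(bool sd /* F2 forms */)
    {
        return sd ? 3 : 2; /* m64 vs m32 source */
    }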

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -106,8 +106,16 @@ static const struct test avx512f_all[] =
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
+    INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
+    INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -715,8 +723,9 @@ static void test_group(const struct test
                 break;
 
             case ESZ_dq:
-                test_pair(&tests[i], vl[j], ESZ_d, "d", ESZ_q, "q",
-                          instr, ctxt);
+                test_pair(&tests[i], vl[j], ESZ_d,
+                          strncmp(tests[i].mnemonic, "cvt", 3) ? "d" : "l",
+                          ESZ_q, "q", instr, ctxt);
                 break;
 
 #ifdef __i386__
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -89,7 +89,7 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 
 #if VEC_SIZE == FLOAT_SIZE
-# define to_int(x) ((vec_t){ (int)(x)[0] })
+# define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -340,10 +340,28 @@ OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
+OVR(cvtsd2si);
+OVR(cvtsd2sil);
+OVR(cvtsd2siq);
+OVR(cvtsi2sd);
+OVR(cvtsi2sdl);
+OVR(cvtsi2sdq);
+OVR(cvtsi2ss);
+OVR(cvtsi2ssl);
+OVR(cvtsi2ssq);
 OVR(cvtss2sd);
+OVR(cvtss2si);
+OVR(cvtss2sil);
+OVR(cvtss2siq);
 OVR(cvttpd2dqx);
 OVR(cvttpd2dqy);
 OVR(cvttps2dq);
+OVR(cvttsd2si);
+OVR(cvttsd2sil);
+OVR(cvttsd2siq);
+OVR(cvttss2si);
+OVR(cvttss2sil);
+OVR(cvttss2siq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -296,7 +296,7 @@ static const struct twobyte_table {
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
     [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp, simd_none, d8s_dq },
@@ -3060,6 +3060,12 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x2c: /* vcvtts{s,d}2si need special casing */
+            case 0x2d: /* vcvts{s,d}2si need special casing */
+                if ( evex_encoded() )
+                    disp8scale = 2 + (evex.pfx & VEX_PREFIX_DOUBLE_MASK);
+                break;
+
             case 0x5a: /* vcvtps2pd needs special casing */
                 if ( disp8scale && !evex.pfx && !evex.br )
                     --disp8scale;
@@ -6168,6 +6174,26 @@ x86_emulate(
         state->simd_size = simd_none;
         goto simd_0f_rm;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+        generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.br),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        if ( ea.type == OP_MEM )
+        {
+            rc = read_ulong(ea.mem.seg, ea.mem.off, &src.val,
+                            rex_prefix & REX_W ? 8 : 4, ctxt, ops);
+            if ( rc != X86EMUL_OKAY )
+                goto done;
+        }
+        else
+            src.val = rex_prefix & REX_W ? *ea.reg : (uint32_t)*ea.reg;
+
+        state->simd_size = simd_none;
+        goto avx512f_rm;
+
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2c):     /* cvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_VEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2d):     /* cvts{s,d}2si xmm/mem,reg */
@@ -6189,14 +6215,17 @@ x86_emulate(
         }
 
         opc = init_prefixes(stub);
+    cvts_2si:
         opc[0] = b;
         /* Convert GPR destination to %rAX and memory operand to (%rCX). */
         rex_prefix &= ~REX_R;
         vex.r = 1;
+        evex.r = 1;
         if ( ea.type == OP_MEM )
         {
             rex_prefix &= ~REX_B;
             vex.b = 1;
+            evex.b = 1;
             opc[1] = 0x01;
 
             rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp,
@@ -6207,11 +6236,22 @@ x86_emulate(
         else
             opc[1] = modrm & 0xc7;
         if ( !mode_64bit() )
+        {
             vex.w = 0;
-        insn_bytes = PFX_BYTES + 2;
+            evex.w = 0;
+        }
+        if ( evex_encoded() )
+        {
+            insn_bytes = EVEX_PFX_BYTES + 2;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 2;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
         opc[2] = 0xc3;
 
-        copy_REX_VEX(opc, rex_prefix, vex);
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
 
@@ -6219,6 +6259,17 @@ x86_emulate(
         state->simd_size = simd_none;
         break;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+        generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
+                               (ea.type != OP_REG && evex.br)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto cvts_2si;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2e):     /* ucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2f):     /* comis{s,d} xmm/mem,xmm */
@@ -6870,6 +6921,7 @@ x86_emulate(
         host_and_vcpu_must_have(avx512f);
         get_fpu(X86EMUL_FPU_zmm);
 
+    avx512f_rm:
         opc = init_evex(stub);
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */





* [PATCH v4 42/44] x86emul: support AVX512DQ packed quad-int/FP conversion insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (40 preceding siblings ...)
  2018-09-25 13:53   ` [PATCH v4 41/44] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
@ 2018-09-25 13:53   ` Jan Beulich
  2018-09-25 13:54   ` [PATCH v4 43/44] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
  2018-09-25 13:55   ` [PATCH v4 44/44] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

VCVT{,T}PS2QQ, sharing their main opcodes with others, once again need
"manual" overrides of disp8scale.

While not directly related here, also add a scalar variant of to_wint()
to the test harness.
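
Purely to illustrate the Disp8 rule being overridden above (a rough
sketch of the net effect on the displacement; this is not the
emulator's actual code, and the helper name is made up):

    /*
     * An EVEX Disp8 is scaled by the width of the memory operand.  A
     * "full" operand covers 16 << EVEX.L'L bytes, but VCVT{,T}PS2QQ
     * read only half a vector, hence the halved scale; embedded
     * broadcast instead scales by the 4-byte source element.
     */
    static int32_t ps2qq_disp(int8_t disp8, unsigned int lr, bool br)
    {
        unsigned int scale = br ? 2 : 3 + lr;

        return disp8 * (1 << scale);
    }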

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -400,8 +400,12 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
+    INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -90,14 +90,35 @@ static inline bool _to_bool(byte_vec_t b
 
 #if VEC_SIZE == FLOAT_SIZE
 # define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
+# ifdef __x86_64__
+#  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) ({ \
+    vsf_half_t t_ = low_half(x); \
+    vdi_t lo_, hi_; \
+    touch(t_); \
+    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    t_ = high_half(x); \
+    touch(t_); \
+    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    touch(lo_); touch(hi_); \
+    insert_half(insert_half(undef(), \
+                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+})
+#  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
@@ -121,6 +142,14 @@ static inline bool _to_bool(byte_vec_t b
 })
 #endif
 
+#if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
+# define low_half(x) (x)
+# define high_half(x) B_(movhlps, , undef(), x)
+# define insert_pair(x, y, p) \
+    ((p) ? B_(movlhps, , x, y) \
+         : ({ vec_t t_ = (x); t_[0] = (y)[0]; t_[1] = (y)[1]; t_; }))
+#endif
+
 #if VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW_A__)
 # define max __builtin_ia32_pfmax
 # define min __builtin_ia32_pfmin
@@ -144,13 +173,16 @@ static inline bool _to_bool(byte_vec_t b
 #  if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
        (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
        (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
-#   define low_half(x) ({ \
+#   define _half(x, lh) ({ \
     half_t t_; \
-    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+    asm ( "vextractf%c[w]x%c[n] %[sel], %[s], %[d]" \
           : [d] "=m" (t_) \
-          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+          : [s] "v" (x), [sel] "i" (lh), \
+            [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
     t_; \
 })
+#   define low_half(x)  _half(x, 0)
+#   define high_half(x) _half(x, 1)
 #  endif
 #  if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
        (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
@@ -1170,6 +1202,13 @@ int simd_test(void)
 
 # endif
 
+# ifdef to_wint
+    touch(src);
+    x = to_wint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
 # ifdef sqrt
     x = src * src;
     touch(x);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -325,6 +325,8 @@ static const struct twobyte_table {
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3071,6 +3073,12 @@ x86_decode(
                     --disp8scale;
                 break;
 
+            case 0x7a: /* vcvttps2qq needs special casing */
+            case 0x7b: /* vcvtps2qq needs special casing */
+                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.br )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -7293,7 +7301,13 @@ x86_emulate(
         if ( evex.pfx != vex_f3 )
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
+        {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2qq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512dq);
+        }
         else
         {
             host_and_vcpu_must_have(avx512f);

* [PATCH v4 43/44] x86emul: support AVX512{F, DQ} uint-to-FP conversion insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (41 preceding siblings ...)
  2018-09-25 13:53   ` [PATCH v4 42/44] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
@ 2018-09-25 13:54   ` Jan Beulich
  2018-09-25 13:55   ` [PATCH v4 44/44] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:54 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Some "manual" overrides of disp8scale are needed here again. In
particular, the code ends up simpler when using d8s_dq64 in the
twobyte_table[] entry.
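
For reference (a sketch only, not the emulator's code, and assuming
64-bit mode, where EVEX.W selects the operand size): with d8s_dq64
the scaling simply follows the GPR element size, i.e.

    /* Illustrative: Disp8 scaling for vcvtusi2s{s,d} r/m,xmm,xmm. */
    unsigned int scale = 2 + evex.w;     /* 4- or 8-byte element */
    int32_t disp = disp8 * (1 << scale);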

Test harness additions will be done once the reverse conversions are
also available.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -116,6 +116,10 @@ static const struct test avx512f_all[] =
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
+    INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
+    INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
+    INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -406,6 +410,8 @@ static const struct test avx512dq_all[]
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
+    INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -326,7 +326,7 @@ static const struct twobyte_table {
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3073,12 +3073,16 @@ x86_decode(
                     --disp8scale;
                 break;
 
-            case 0x7a: /* vcvttps2qq needs special casing */
-            case 0x7b: /* vcvtps2qq needs special casing */
-                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.br )
+            case 0x7a: /* vcvttps2qq and vcvtudq2pd need special casing */
+                if ( disp8scale && evex.pfx != vex_f2 && !evex.w && !evex.br )
                     --disp8scale;
                 break;
 
+            case 0x7b: /* vcvtp{s,d}2qq need special casing */
+                if ( disp8scale && evex.pfx == vex_66 )
+                    disp8scale = (evex.br ? 2 : 3 + evex.lr) + evex.w;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -6183,6 +6187,7 @@ x86_emulate(
         goto simd_0f_rm;
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x7b): /* vcvtusi2s{s,d} r/m,xmm,xmm */
         generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.br),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
@@ -6623,6 +6628,8 @@ x86_emulate(
         /* fall through */
     case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
                                           /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x7a): /* vcvtudq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtuqq2ps [xyz]mm/mem,{x,y}mm{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
@@ -7296,6 +7303,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
         generate_exception_if(!evex.w, EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7a):   /* vcvtudq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtuqq2pd [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
         if ( evex.pfx != vex_f3 )

* [PATCH v4 44/44] x86emul: support AVX512{F, DQ} FP-to-uint conversion insns
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                     ` (42 preceding siblings ...)
  2018-09-25 13:54   ` [PATCH v4 43/44] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
@ 2018-09-25 13:55   ` Jan Beulich
  43 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-25 13:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

Along the lines of prior patches, VCVT{,T}PS2UQQ as well as
VCVT{,T}S{S,D}2USI need "manual" overrides of disp8scale.

The twobyte_table[] entries get altered, with their prior values
now put in place in x86_decode_twobyte().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -101,21 +101,29 @@ static const struct test avx512f_all[] =
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
+    INSN(cvtpd2udq,      ,   0f, 79,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtps2udq,      ,   0f, 79,    vl,      d, vl),
     INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
+    INSN(cvtsd2usi,    f2,   0f, 79,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
     INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
     INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvtss2usi,    f3,   0f, 79,    el,      d, el),
     INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttpd2udq,     ,   0f, 78,    vl,      q, vl),
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttps2udq,     ,   0f, 78,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttsd2usi,   f2,   0f, 78,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvttss2usi,   f3,   0f, 78,    el,      d, el),
     INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
     INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
@@ -405,11 +413,15 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtpd2uqq,      66,   0f, 79,   vl,  q, vl),
     INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
+    INSN(cvtps2uqq,      66,   0f, 79, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttpd2uqq,     66,   0f, 78,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -93,31 +93,65 @@ static inline bool _to_bool(byte_vec_t b
 # ifdef __x86_64__
 #  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
 # endif
+# ifdef __AVX512F__
+/*
+ * Sadly even gcc 9.x, at the time of writing, does not carry out at least
+ * uint -> FP conversions using VCVTUSI2S{S,D}, so we need to use builtins
+ * or inline assembly here. The full-vector parameter types of the builtins
+ * aren't very helpful for our purposes, so use inline assembly.
+ */
+#  if FLOAT_SIZE == 4
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    float __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtss2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2ss%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  elif FLOAT_SIZE == 8
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    double __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtsd2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2sd%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  endif
+#  define to_uint(x) to_u_int(int, x)
+#  ifdef __x86_64__
+#   define to_uwint(x) to_u_int(long, x)
+#  endif
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  define to_uint(x) BR(cvtudq2ps, _mask, BR(cvtps2udq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
-#   define to_wint(x) ({ \
+#   define to_w_int(x, s) ({ \
     vsf_half_t t_ = low_half(x); \
     vdi_t lo_, hi_; \
     touch(t_); \
-    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    lo_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     t_ = high_half(x); \
     touch(t_); \
-    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    hi_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     touch(lo_); touch(hi_); \
     insert_half(insert_half(undef(), \
-                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
-                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+                            BR(cvt ## s ## qq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvt ## s ## qq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
 })
+#   define to_wint(x) to_w_int(x, )
+#   define to_uwint(x) to_w_int(x, u)
 #  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  define to_uint(x) B(cvtudq2pd, _mask, BR(cvtpd2udq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
 #   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#   define to_uwint(x) BR(cvtuqq2pd, _mask, BR(cvtpd2uqq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
 #  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
@@ -1208,6 +1242,20 @@ int simd_test(void)
     touch(src);
     if ( !eq(x, src) ) return __LINE__;
 # endif
+
+# ifdef to_uint
+    touch(src);
+    x = to_uint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
+# ifdef to_uwint
+    touch(src);
+    x = to_uwint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
 
 # ifdef sqrt
     x = src * src;
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -323,8 +323,7 @@ static const struct twobyte_table {
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
-    [0x78] = { ImplicitOps|ModRM },
-    [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x78 ... 0x79] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -2507,6 +2506,8 @@ x86_decode_twobyte(
         break;
 
     case 0x78:
+        state->desc = ImplicitOps;
+        state->simd_size = simd_none;
         switch ( vex.pfx )
         {
         case vex_66: /* extrq $imm8, $imm8, xmm */
@@ -2519,7 +2520,7 @@ x86_decode_twobyte(
     case 0x10 ... 0x18:
     case 0x28 ... 0x2f:
     case 0x50 ... 0x77:
-    case 0x79 ... 0x7d:
+    case 0x7a ... 0x7d:
     case 0x7f:
     case 0xc2 ... 0xc3:
     case 0xc5 ... 0xc6:
@@ -2541,6 +2542,12 @@ x86_decode_twobyte(
         op_bytes = mode_64bit() ? 8 : 4;
         break;
 
+    case 0x79:
+        state->desc = DstReg | SrcMem;
+        state->simd_size = simd_packed_int;
+        ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
+        break;
+
     case 0x7e:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         if ( vex.pfx == vex_f3 ) /* movq xmm/m64,xmm */
@@ -3062,6 +3069,18 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x78:
+            case 0x79:
+                if ( !evex.pfx )
+                    break;
+                /* vcvt{,t}ps2uqq need special casing */
+                if ( evex.pfx == vex_66 )
+                {
+                    if ( !evex.w && !evex.br )
+                        --disp8scale;
+                    break;
+                }
+                /* vcvt{,t}s{s,d}2usi need special casing: fall through */
             case 0x2c: /* vcvtts{s,d}2si need special casing */
             case 0x2d: /* vcvts{s,d}2si need special casing */
                 if ( evex_encoded() )
@@ -6274,6 +6293,8 @@ x86_emulate(
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x78): /* vcvtts{s,d}2usi xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x79): /* vcvts{s,d}2usi xmm/mem,reg */
         generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
                                (ea.type != OP_REG && evex.br)),
                               EXC_UD);
@@ -6633,7 +6654,11 @@ x86_emulate(
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
+        {
+    case X86EMUL_OPC_EVEX(0x0f, 0x78):    /* vcvttp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX(0x0f, 0x79):    /* vcvtp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512f);
+        }
         if ( ea.type == OP_MEM || !evex.br )
             avx512_vlen_check(false);
         d |= TwoOp;
@@ -7311,6 +7336,10 @@ x86_emulate(
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
         {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x78):   /* vcvttps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2uqq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x79):   /* vcvtps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2uqq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */

* Re: [PATCH v4 01/44] x86emul: support AVX512 opmask insns
  2018-09-25 13:25   ` [PATCH v4 01/44] x86emul: support AVX512 opmask insns Jan Beulich
@ 2018-09-26  6:06     ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-09-26  6:06 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu

>>> On 25.09.18 at 15:25, <JBeulich@suse.com> wrote:
> These are all VEX encoded, so the EVEX decoding logic remains unused
> at this point.
> 
> The new testcase is deliberately coded in assembly, as a C one would
> have become almost unreadable due to the overwhelming amount of
> __builtin_...() that would need to be used. After all the compiler has
> no underlying type (yet) that could be operated on without builtins,
> other than the vector types used for "normal" SIMD insns.

I've added

"Note that outside of 64-bit mode and despite the SDM not currently
 saying so, VEX.W is ignored for the KMOV{D,Q} encodings to/from GPRs,
 just like e.g. for the similar VMOV{D,Q}."

here, to clarify why the implementation does not seem to match the
SDM in this regard. I've asked Intel to correct the SDM.
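
For comparison, this is the same shape the emulator already uses in
the cvt* paths (a sketch of the existing pattern, not new code):

    if ( !mode_64bit() )
    {
        vex.w = 0;       /* VEX.W is ignored outside 64-bit mode */
        evex.w = 0;
    }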

Jan

* Re: [PATCH v3 01/34] x86emul: support AVX512 opmask insns
  2018-09-18 11:53   ` [PATCH v3 01/34] x86emul: support AVX512 opmask insns Jan Beulich
@ 2018-10-25 18:32     ` Andrew Cooper
  2018-10-26  9:03       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-10-25 18:32 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 18/09/18 12:53, Jan Beulich wrote:
> @@ -1187,6 +1188,11 @@ static int _get_fpu(
>              return X86EMUL_UNHANDLEABLE;
>          break;
>  
> +    case X86EMUL_FPU_opmask:
> +        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
> +            return X86EMUL_UNHANDLEABLE;
> +        break;

I see this follows the pattern from X86EMUL_FPU_ymm, but why the SSE
check?  It is not relevant at this point - if xcr0.opmask is set, the
opmask instructions should be usable.

~Andrew

* Re: [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes
  2018-09-18 11:53   ` [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes Jan Beulich
  2018-09-18 16:05     ` Paul Durrant
@ 2018-10-25 18:36     ` Andrew Cooper
  2018-10-26  9:04       ` Jan Beulich
  1 sibling, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-10-25 18:36 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Paul Durrant, Wei Liu

On 18/09/18 12:53, Jan Beulich wrote:
> This is needed before enabling any AVX512 insns in the emulator. Change
> the way alignment is enforced at the same time.
>
> Add a check that the buffer won't actually overflow, and while at it
> also convert the check for accesses to not cross page boundaries.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: New.
>
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -866,7 +866,18 @@ static int hvmemul_phys_mmio_access(
>      int rc = X86EMUL_OKAY;
>  
>      /* Accesses must fall within a page. */
> -    BUG_ON((gpa & ~PAGE_MASK) + size > PAGE_SIZE);
> +    if ( (gpa & ~PAGE_MASK) + size > PAGE_SIZE )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    /* Accesses must not overflow the cache's buffer. */
> +    if ( size > sizeof(cache->buffer) )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return X86EMUL_UNHANDLEABLE;
> +    }
>  
>      /*
>       * hvmemul_do_io() cannot handle non-power-of-2 accesses or
> --- a/xen/include/asm-x86/hvm/vcpu.h
> +++ b/xen/include/asm-x86/hvm/vcpu.h
> @@ -42,15 +42,14 @@ struct hvm_vcpu_asid {
>  };
>  
>  /*
> - * We may read or write up to m256 as a number of device-model
> + * We may read or write up to m512 as a number of device-model
>   * transactions.
>   */
>  struct hvm_mmio_cache {
>      unsigned long gla;
>      unsigned int size;
>      uint8_t dir;
> -    uint8_t pad[3]; /* make buffer[] long-aligned */
> -    uint8_t buffer[32];
> +    uint8_t buffer[64] __aligned(sizeof(long));

Don't we want it 16-byte aligned, rather than 8?

~Andrew

* Re: [PATCH v3 01/34] x86emul: support AVX512 opmask insns
  2018-10-25 18:32     ` Andrew Cooper
@ 2018-10-26  9:03       ` Jan Beulich
  2018-10-26 11:29         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-10-26  9:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 25.10.18 at 20:32, <andrew.cooper3@citrix.com> wrote:
> On 18/09/18 12:53, Jan Beulich wrote:
>> @@ -1187,6 +1188,11 @@ static int _get_fpu(
>>              return X86EMUL_UNHANDLEABLE;
>>          break;
>>  
>> +    case X86EMUL_FPU_opmask:
>> +        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
>> +            return X86EMUL_UNHANDLEABLE;
>> +        break;
> 
> I see this follows the pattern from X86EMUL_FPU_ymm, but why the SSE
> check?  It is not relevant at this point - if xcr0.opmask is set, the
> opmask instructions should be usable.

I would agree with you from a functional POV, but please see the
last row of the table named "OS XSAVE Enabling Requirements of
Instruction Categories" in SDM Vol 2.
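
Expressed as a single check, that table row amounts to (merely a
sketch of the requirement, equivalent to the hunk quoted above):

    /* Opmask insns: both XCR0.SSE and XCR0.OPMASK need to be set. */
    if ( (xcr0 & (X86_XCR0_SSE | X86_XCR0_OPMASK)) !=
         (X86_XCR0_SSE | X86_XCR0_OPMASK) )
        return X86EMUL_UNHANDLEABLE;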

Jan

* Re: [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes
  2018-10-25 18:36     ` Andrew Cooper
@ 2018-10-26  9:04       ` Jan Beulich
  2018-10-26 11:29         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-10-26  9:04 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Paul Durrant, Wei Liu

>>> On 25.10.18 at 20:36, <andrew.cooper3@citrix.com> wrote:
> On 18/09/18 12:53, Jan Beulich wrote:
>> This is needed before enabling any AVX512 insns in the emulator. Change
>> the way alignment is enforced at the same time.
>>
>> Add a check that the buffer won't actually overflow, and while at it
>> also convert the check for accesses to not cross page boundaries.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v3: New.
>>
>> --- a/xen/arch/x86/hvm/emulate.c
>> +++ b/xen/arch/x86/hvm/emulate.c
>> @@ -866,7 +866,18 @@ static int hvmemul_phys_mmio_access(
>>      int rc = X86EMUL_OKAY;
>>  
>>      /* Accesses must fall within a page. */
>> -    BUG_ON((gpa & ~PAGE_MASK) + size > PAGE_SIZE);
>> +    if ( (gpa & ~PAGE_MASK) + size > PAGE_SIZE )
>> +    {
>> +        ASSERT_UNREACHABLE();
>> +        return X86EMUL_UNHANDLEABLE;
>> +    }
>> +
>> +    /* Accesses must not overflow the cache's buffer. */
>> +    if ( size > sizeof(cache->buffer) )
>> +    {
>> +        ASSERT_UNREACHABLE();
>> +        return X86EMUL_UNHANDLEABLE;
>> +    }
>>  
>>      /*
>>       * hvmemul_do_io() cannot handle non-power-of-2 accesses or
>> --- a/xen/include/asm-x86/hvm/vcpu.h
>> +++ b/xen/include/asm-x86/hvm/vcpu.h
>> @@ -42,15 +42,14 @@ struct hvm_vcpu_asid {
>>  };
>>  
>>  /*
>> - * We may read or write up to m256 as a number of device-model
>> + * We may read or write up to m512 as a number of device-model
>>   * transactions.
>>   */
>>  struct hvm_mmio_cache {
>>      unsigned long gla;
>>      unsigned int size;
>>      uint8_t dir;
>> -    uint8_t pad[3]; /* make buffer[] long-aligned */
>> -    uint8_t buffer[32];
>> +    uint8_t buffer[64] __aligned(sizeof(long));
> 
> Don't we want it 16-byte aligned, rather than 8?

Why? We don't access the buffer via SIMD insns.

Jan

* Re: [PATCH v3 01/34] x86emul: support AVX512 opmask insns
  2018-10-26  9:03       ` Jan Beulich
@ 2018-10-26 11:29         ` Andrew Cooper
  2018-10-26 11:59           ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-10-26 11:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu

On 26/10/18 10:03, Jan Beulich wrote:
>>>> On 25.10.18 at 20:32, <andrew.cooper3@citrix.com> wrote:
>> On 18/09/18 12:53, Jan Beulich wrote:
>>> @@ -1187,6 +1188,11 @@ static int _get_fpu(
>>>              return X86EMUL_UNHANDLEABLE;
>>>          break;
>>>  
>>> +    case X86EMUL_FPU_opmask:
>>> +        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
>>> +            return X86EMUL_UNHANDLEABLE;
>>> +        break;
>>> I see this follows the pattern from X86EMUL_FPU_ymm, but why the SSE
>> check?  It is not relevant at this point - if xcr0.opmask is set, the
>> opmask instructions should be usable.
> I would agree with you from a functional POV, but please see the
> last row of the table named "OS XSAVE Enabling Requirements of
> Instruction Categories" in SDM Vol 2.

Hmm.  That table is in contradiction to Vol 3 Figure 2-8 and associated
description, which enumerates the #GP conditions of XSETBV.

In particular, it suggests that OPMASK must be set in union with the ZMM
bits, and cannot be set without YMM.

IMO, we can require/expect get_xcr0() to pass an architecturally valid
result.  I don't think we should be checking the transitive closure of
feature dependences here.
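
To make that concrete, my reading of the XSETBV #GP conditions is
roughly the below (a sketch only; the X86_XCR0_* names follow the
existing ones, and the two ZMM ones are assumptions on my part):

    static bool xcr0_arch_valid(uint64_t xcr0)
    {
        const uint64_t avx512 =
            X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;

        if ( !(xcr0 & X86_XCR0_FP) )
            return false;              /* x87 bit must always be set */
        if ( (xcr0 & X86_XCR0_YMM) && !(xcr0 & X86_XCR0_SSE) )
            return false;              /* YMM requires SSE */
        if ( (xcr0 & avx512) &&
             ((xcr0 & avx512) != avx512 || !(xcr0 & X86_XCR0_YMM)) )
            return false;              /* AVX512 bits move as a group */
        return true;
    }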

~Andrew

* Re: [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes
  2018-10-26  9:04       ` Jan Beulich
@ 2018-10-26 11:29         ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-10-26 11:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Paul Durrant, Wei Liu

On 26/10/18 10:04, Jan Beulich wrote:
>>>> On 25.10.18 at 20:36, <andrew.cooper3@citrix.com> wrote:
>> On 18/09/18 12:53, Jan Beulich wrote:
>>> This is needed before enabling any AVX512 insns in the emulator. Change
>>> the way alignment is enforced at the same time.
>>>
>>> Add a check that the buffer won't actually overflow, and while at it
>>> also convert the check for accesses to not cross page boundaries.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> v3: New.
>>>
>>> --- a/xen/arch/x86/hvm/emulate.c
>>> +++ b/xen/arch/x86/hvm/emulate.c
>>> @@ -866,7 +866,18 @@ static int hvmemul_phys_mmio_access(
>>>      int rc = X86EMUL_OKAY;
>>>  
>>>      /* Accesses must fall within a page. */
>>> -    BUG_ON((gpa & ~PAGE_MASK) + size > PAGE_SIZE);
>>> +    if ( (gpa & ~PAGE_MASK) + size > PAGE_SIZE )
>>> +    {
>>> +        ASSERT_UNREACHABLE();
>>> +        return X86EMUL_UNHANDLEABLE;
>>> +    }
>>> +
>>> +    /* Accesses must not overflow the cache's buffer. */
>>> +    if ( size > sizeof(cache->buffer) )
>>> +    {
>>> +        ASSERT_UNREACHABLE();
>>> +        return X86EMUL_UNHANDLEABLE;
>>> +    }
>>>  
>>>      /*
>>>       * hvmemul_do_io() cannot handle non-power-of-2 accesses or
>>> --- a/xen/include/asm-x86/hvm/vcpu.h
>>> +++ b/xen/include/asm-x86/hvm/vcpu.h
>>> @@ -42,15 +42,14 @@ struct hvm_vcpu_asid {
>>>  };
>>>  
>>>  /*
>>> - * We may read or write up to m256 as a number of device-model
>>> + * We may read or write up to m512 as a number of device-model
>>>   * transactions.
>>>   */
>>>  struct hvm_mmio_cache {
>>>      unsigned long gla;
>>>      unsigned int size;
>>>      uint8_t dir;
>>> -    uint8_t pad[3]; /* make buffer[] long-aligned */
>>> -    uint8_t buffer[32];
>>> +    uint8_t buffer[64] __aligned(sizeof(long));
>> Don't we want it 16-byte aligned, rather than 8?
> Why? We don't access the buffer via SIMD insns.

Fair point.  Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

* Re: [PATCH v4 17/44] x86emul/test: introduce eq()
  2018-09-25 13:36   ` [PATCH v4 17/44] x86emul/test: introduce eq() Jan Beulich
@ 2018-10-26 11:31     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-10-26 11:31 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/18 14:36, Jan Beulich wrote:
> In preparation for sensible to-boolean conversion on AVX512, wrap
> another abstraction function around the present to_bool(<x> == <y>), to
> get rid of the open-coded == (which will get in the way of using
> built-in functions instead). For the future AVX512 use scalar operands
> can't be used then anymore: Use (vec_t){} when the operand is zero,
> and broadcast (if available) otherwise (assume pre-AVX512 when broadcast
> is not available, in which case a plain scalar is still fine).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

* Re: [PATCH v3 01/34] x86emul: support AVX512 opmask insns
  2018-10-26 11:29         ` Andrew Cooper
@ 2018-10-26 11:59           ` Jan Beulich
  2018-10-26 12:19             ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-10-26 11:59 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 26.10.18 at 13:29, <andrew.cooper3@citrix.com> wrote:
> On 26/10/18 10:03, Jan Beulich wrote:
>>>>> On 25.10.18 at 20:32, <andrew.cooper3@citrix.com> wrote:
>>> On 18/09/18 12:53, Jan Beulich wrote:
>>>> @@ -1187,6 +1188,11 @@ static int _get_fpu(
>>>>              return X86EMUL_UNHANDLEABLE;
>>>>          break;
>>>>  
>>>> +    case X86EMUL_FPU_opmask:
>>>> +        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
>>>> +            return X86EMUL_UNHANDLEABLE;
>>>> +        break;
>>> I see this follows the pattern from X86EMUL_FPU_ymm, but why the SSE
>>> check?  It is not relevant at this point - if xcr0.opmask is set, the
>>> opmask instructions should be usable.
>> I would agree with you from a functional POV, but please see the
>> last row of the table named "OS XSAVE Enabling Requirements of
>> Instruction Categories" in SDM Vol 2.
> 
> Hmm.  That table is in contradiction to Vol 3 Figure 2-8 and associated
> description, which enumerates the #GP conditions of XSETBV.
> 
> In particular, it suggests that OPMASK must be set in union with the ZMM
> bits, and cannot be set without YMM.

Hmm, true. However ...

> IMO, we can require/expect get_xcr0() to pass an architecturally valid
> result.  I don't think we should be checking the transitive closure of
> feature dependences here.

... I'd prefer if the code remained consistent, and hence the
analogy with X86EMUL_FPU_ymm should remain, especially since
this way it's also more consistent with Vol 2's #UD condition table.

Furthermore I consider some of the restrictions in table 2-8 quite
a bit too restrictive, and hence I wouldn't be surprised if they
got relaxed at some point. It has puzzled me from the very
beginning that e.g. Hi16_ZMM needs to be set by 32-bit OSes
despite there being no means to access the upper 16 ZMM
registers. By keeping the code as is we'd have no issue with
any such relaxation.

Jan

* Re: [PATCH v3 01/34] x86emul: support AVX512 opmask insns
  2018-10-26 11:59           ` Jan Beulich
@ 2018-10-26 12:19             ` Andrew Cooper
  2018-10-26 12:34               ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-10-26 12:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu

On 26/10/18 12:59, Jan Beulich wrote:
>>>> On 26.10.18 at 13:29, <andrew.cooper3@citrix.com> wrote:
>> On 26/10/18 10:03, Jan Beulich wrote:
>>>>>> On 25.10.18 at 20:32, <andrew.cooper3@citrix.com> wrote:
>>>> On 18/09/18 12:53, Jan Beulich wrote:
>>>>> @@ -1187,6 +1188,11 @@ static int _get_fpu(
>>>>>              return X86EMUL_UNHANDLEABLE;
>>>>>          break;
>>>>>  
>>>>> +    case X86EMUL_FPU_opmask:
>>>>> +        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
>>>>> +            return X86EMUL_UNHANDLEABLE;
>>>>> +        break;
>>>> I see this follows the pattern from X86EMUL_FPU_ymm, but why the SSE
>>>> check?  It is not relevant at this point - if xcr0.opmask is set, the
>>>> opmask instructions should be usable.
>>> I would agree with you from a functional POV, but please see the
>>> last row of the table named "OS XSAVE Enabling Requirements of
>>> Instruction Categories" in SDM Vol 2.
>> Hmm.  That table is in contradiction to Vol 3 Figure 2-8 and associated
>> description, which enumerates the #GP conditions of XSETBV.
>>
>> In particular, it suggests that OPMASK must be set in union with the ZMM
>> bits, and cannot be set without YMM.
> Hmm, true. However ...
>
>> IMO, we can require/expect get_xcr0() to pass an architecturally valid
>> result.  I don't think we should be checking the transitive closure of
>> feature dependences here.
> ... I'd prefer if the code remained consistent, and hence the
> analogy with X86EMUL_FPU_ymm should remain, especially since
> this way it's also more consistent with Vol 2's #UD condition table.
>
> Furthermore I consider some of the restrictions in table 2-8 quite
> a bit too restrictive, and hence I wouldn't be surprised if they
> got relaxed at some point. It has puzzled me from the very
> beginning that e.g. Hi16_ZMM needs to be set by 32-bit OSes
> despite there being no means to access the upper 16 ZMM
> registers.

It almost certainly simplifies the circuitry.  The high parts will be 0
(due to the zero extending property of VEX/EVEX) and unallocated in the
physical register file.

>  By keeping the code as is we'd have no issue with
> any such relaxation.

Right.  For now, let's leave this consistent with the surrounding code.
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

I think we need Intel to fix their docs, and based on that, I think
we should make some uniform changes to the handling to remove unnecessary
dependency links.

~Andrew

* Re: [PATCH v3 01/34] x86emul: support AVX512 opmask insns
  2018-10-26 12:19             ` Andrew Cooper
@ 2018-10-26 12:34               ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-10-26 12:34 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 26.10.18 at 14:19, <andrew.cooper3@citrix.com> wrote:
> On 26/10/18 12:59, Jan Beulich wrote:
>>>>> On 26.10.18 at 13:29, <andrew.cooper3@citrix.com> wrote:
>>> On 26/10/18 10:03, Jan Beulich wrote:
>>>>>>> On 25.10.18 at 20:32, <andrew.cooper3@citrix.com> wrote:
>>>>> On 18/09/18 12:53, Jan Beulich wrote:
>>>>>> @@ -1187,6 +1188,11 @@ static int _get_fpu(
>>>>>>              return X86EMUL_UNHANDLEABLE;
>>>>>>          break;
>>>>>>  
>>>>>> +    case X86EMUL_FPU_opmask:
>>>>>> +        if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_OPMASK) )
>>>>>> +            return X86EMUL_UNHANDLEABLE;
>>>>>> +        break;
>>>>> I see this follows the pattern from X86EMUL_FPU_ymm, but why the SSE
>>>>> check?  It is not relevant at this point - if xcr0.opmask is set, the
>>>>> opmask instructions should be usable.
>>>> I would agree with you from a functional POV, but please see the
>>>> last row of the table named "OS XSAVE Enabling Requirements of
>>>> Instruction Categories" in SDM Vol 2.
>>> Hmm.  That table is in contradiction to Vol 3 Figure 2-8 and associated
>>> description, which enumerates the #GP conditions of XSETBV.
>>>
>>> In particular, it suggests that OPMASK must be set in union with the ZMM
>>> bits, and cannot be set without YMM.
>> Hmm, true. However ...
>>
>>> IMO, we can require/expect get_xcr0() to pass an architecturally valid
>>> result.  I don't think we should be checking the transitive closure of
>>> feature dependences here.
>> ... I'd prefer if the code remained consistent, and hence the
>> analogy with X86EMUL_FPU_ymm should remain, especially since
>> this way it's also more consistent with Vol 2's #UD condition table.
>>
>> Furthermore I consider some of the restrictions in table 2-8 quite
>> a bit too restrictive, and hence I wouldn't be surprised if they
>> got relaxed at some point. It has puzzled me from the very
>> beginning that e.g. Hi16_ZMM needs to be set by 32-bit OSes
>> despite there being no means to access the upper 16 ZMM
>> registers.
> 
> It almost certainly simplifies the circuitry.  The high parts will be 0
> (due to the zero extending property of VEX/EVEX) and unallocated in the
> physical register file.

High parts? We're talking of entirely distinct registers (zmm16 -
zmm31) here. But I agree it's likely a simplification for them, just
perhaps elsewhere.

>>  By keeping the code as is we'd have no issue with
>> any such relaxation.
> 
> Right.  For now, let's leave this consistent with the surrounding code.
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

> I think we need Intel to fix their docs, and based on that, I think
> we should make some uniform changes to the handling to remove unnecessary
> dependency links.

I'll fire off a request.

Jan

* Re: [PATCH v4 03/44] x86emul: correct EVEX decoding
  2018-09-25 13:27   ` [PATCH v4 03/44] x86emul: correct EVEX decoding Jan Beulich
@ 2018-10-26 14:33     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-10-26 14:33 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:27, Jan Beulich wrote:
> Fix an inverted pair of checks, drop an incorrect instance of #UD
> raising for non-64-bit mode, and add further generic checks.
>
> Note: Other than SDM Vol 2 rev 067 states, EVEX.V' is _not_ ignored

"Despite what SDM ..." would be a more normal way of phrasing this, I think.


>       outside of 64-bit mode when the field does not encode a register.
>       Just like EVEX.VVVV is required to be 0b1111 in that case, EVEX.V'
>       is required to be 1 there.
>
> Also rename the bcst field to br, as #UD generation for individual insns
> will need to consider both of its possible meanings.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

I still don't like how hard it is to compare the code to the manual, but
given no plausible solution, I'm not sure what to do.

~Andrew

* Re: [PATCH v4 04/44] x86emul: generalize vector length handling for AVX512/EVEX
  2018-09-25 13:28   ` [PATCH v4 04/44] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
@ 2018-10-26 16:10     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-10-26 16:10 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:28, Jan Beulich wrote:
> To allow for some code sharing where possible, copy VEX.L into EVEX.LR
> even for VEX (or XOP) encoded insns. Make operand size determination
> use this right away, at the same time adding consistency checks for the
> EVEX scalar insn cases (the non-scalar ones aren't uniform enough for
> the checking to be done in a central place like this).
>
> Note that the broadcast case is not handled here, but will be taken care
> of elsewhere (in just a single place rather than at least two).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

* Re: [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling
  2018-09-25 13:29   ` [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
@ 2018-11-12 17:42     ` Andrew Cooper
  2018-11-13 11:12       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-12 17:42 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/18 14:29, Jan Beulich wrote:
> Besides the already existing tests (which are going to be extended once
> respective ISA extension support is complete), let's also ensure for
> every individual insn that their Disp8 scaling (and memory access width)
> are correct.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I can see what you're attempting to do, but you now have two
implementations of the EVEX disp8 logic written by yourself.  AFAICT,
this doesn't actually check that the behaviour of the instruction in
hardware matches your model of the instruction - it checks that two of
your models are the same.

The only way I can think of to test the emulator model against hardware
is to start with two memory areas poisoned with a non-repeating pattern,
and a src/dst register poisoned with a different non-repeating pattern.
Then execute a real instruction stub against one area, emulate the same
instruction against the other, and memcmp() the two memory regions.

That way, a systematic error in the two models won't cancel out to "all ok".
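
Something like this (a rough sketch; all of the names are invented):

    uint8_t hw[64], emul[64];

    poison(hw, sizeof(hw));             /* non-repeating pattern */
    memcpy(emul, hw, sizeof(emul));
    exec_insn_stub(hw);                 /* real instruction on hardware */
    emulate_insn(emul);                 /* the emulator's model */
    if ( memcmp(hw, emul, sizeof(hw)) )
        printf("hardware/emulator mismatch\n");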

Also, some observations about the code as presented.

> --- /dev/null
> +++ b/tools/tests/x86_emulator/evex-disp8.c
> @@ -0,0 +1,452 @@
> +#include <stdarg.h>
> +#include <stdio.h>
> +
> +#include "x86-emulate.h"

This now needs rearranging to avoid:

x86-emulate.h:30:3: error: #error "Must not include <stdio.h> before
x86-emulate.h"
 # error "Must not include <stdio.h> before x86-emulate.h"

> +
> +struct test {
> +    const char *mnemonic;
> +    unsigned int opc:8;
> +    unsigned int spc:2;
> +    unsigned int pfx:2;
> +    unsigned int vsz:3;
> +    unsigned int esz:4;
> +    unsigned int scale:1;
> +    unsigned int ext:3;
> +};
> +
> +enum spc {
> +    SPC_invalid,
> +    SPC_0f,
> +    SPC_0f38,
> +    SPC_0f3a,
> +};
> +
> +enum pfx {
> +    PFX_,
> +    PFX_66,
> +    PFX_f3,
> +    PFX_f2
> +};
> +
> +enum vl {
> +    VL_128,
> +    VL_256,
> +    VL_512,
> +};
> +
> +enum scale {
> +    SC_vl,
> +    SC_el,
> +};
> +
> +enum vsz {
> +    VSZ_vl,
> +    VSZ_vl_2, /* VL / 2 */
> +    VSZ_vl_4, /* VL / 4 */
> +    VSZ_vl_8, /* VL / 8 */
> +    /* "no broadcast" implied from here on. */
> +    VSZ_el,
> +    VSZ_el_2, /* EL * 2 */
> +    VSZ_el_4, /* EL * 4 */
> +    VSZ_el_8, /* EL * 8 */
> +};
> +

These acronyms get increasingly difficult to follow.  What is el in this
context?

> +enum esz {
> +    ESZ_d,
> +    ESZ_q,
> +    ESZ_dq,
> +    ESZ_sd,
> +    ESZ_d_nb,
> +    ESZ_q_nb,
> +    /* "no broadcast" implied from here on. */
> +#ifdef __i386__
> +    ESZ_d_WIG,
> +#endif
> +    ESZ_b,
> +    ESZ_w,
> +    ESZ_bw,
> +};
> +
> +#ifndef __i386__
> +# define ESZ_dq64 ESZ_dq
> +#else
> +# define ESZ_dq64 ESZ_d_WIG
> +#endif
> +
> +#define INSNX(m, p, sp, o, e, vs, es, sc) { \
> +    .mnemonic = #m, .opc = 0x##o, .spc = SPC_##sp, .pfx = PFX_##p, \
> +    .vsz = VSZ_##vs, .esz = ESZ_##es, .scale = SC_##sc, .ext = 0##e \
> +}
> +#define INSN(m, p, sp, o, vs, es, sc) INSNX(m, p, sp, o, 0, vs, es, sc)
> +#define INSN_PFP(m, sp, o) \
> +    INSN(m##pd, 66, sp, o, vl, q, vl), \
> +    INSN(m##ps,   , sp, o, vl, d, vl)
> +#define INSN_PFP_NB(m, sp, o) \
> +    INSN(m##pd, 66, sp, o, vl, q_nb, vl), \
> +    INSN(m##ps,   , sp, o, vl, d_nb, vl)
> +#define INSN_SFP(m, sp, o) \
> +    INSN(m##sd, f2, sp, o, el, q, el), \
> +    INSN(m##ss, f3, sp, o, el, d, el)
> +
> +#define INSN_FP(m, sp, o) \
> +    INSN_PFP(m, sp, o), \
> +    INSN_SFP(m, sp, o)
> +
> +static const struct test avx512f_all[] = {
> +    INSN_SFP(mov,            0f, 10),
> +    INSN_SFP(mov,            0f, 11),
> +    INSN_PFP_NB(mova,        0f, 28),
> +    INSN_PFP_NB(mova,        0f, 29),
> +    INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
> +    INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
> +    INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
> +    INSN(movdqa64,     66,   0f, 7f,    vl,   q_nb, vl),
> +    INSN(movdqu32,     f3,   0f, 6f,    vl,   d_nb, vl),
> +    INSN(movdqu32,     f3,   0f, 7f,    vl,   d_nb, vl),
> +    INSN(movdqu64,     f3,   0f, 6f,    vl,   q_nb, vl),
> +    INSN(movdqu64,     f3,   0f, 7f,    vl,   q_nb, vl),
> +    INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
> +    INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
> +    INSN_PFP_NB(movnt,       0f, 2b),
> +    INSN_PFP_NB(movu,        0f, 10),
> +    INSN_PFP_NB(movu,        0f, 11),
> +};
> +
> +static const struct test avx512f_128[] = {
> +    INSN(mov,       66,   0f, 6e, el, dq64, el),
> +    INSN(mov,       66,   0f, 7e, el, dq64, el),
> +    INSN(movq,      f3,   0f, 7e, el,    q, el),
> +    INSN(movq,      66,   0f, d6, el,    q, el),
> +};
> +
> +static const struct test avx512bw_all[] = {
> +    INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
> +    INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
> +    INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
> +    INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
> +};
> +
> +static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
> +static const unsigned char vl_128[] = { VL_128 };

What are these for, and why is vl_all[]'s VL_128 out of order?

> +
> +/*
> + * This table, indicating the presence of an immediate (byte) for an opcode
> + * space 0f major opcode, is indexed by high major opcode byte nibble, with
> + * each table element then bit-indexed by low major opcode byte nibble.
> + */
> +static const uint16_t imm0f[16] = {
> +    [0x7] = (1 << 0x0) /* vpshuf* */ |
> +            (1 << 0x1) /* vps{ll,ra,rl}w */ |
> +            (1 << 0x2) /* vps{l,r}ld, vp{rol,ror,sra}{d,q} */ |
> +            (1 << 0x3) /* vps{l,r}l{,d}q */,
> +    [0xc] = (1 << 0x2) /* vcmp{p,s}{d,s} */ |
> +            (1 << 0x4) /* vpinsrw */ |
> +            (1 << 0x5) /* vpextrw */ |
> +            (1 << 0x6) /* vshufp{d,s} */,
> +};
> +
> +static struct x86_emulate_ops emulops;
> +
> +static unsigned int accessed[3 * 64];

What are the expected properties?  Why 3 * ?

~Andrew

* Re: [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro
  2018-09-25 13:29   ` [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
@ 2018-11-12 17:50     ` Andrew Cooper
  2018-11-13 11:42       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-12 17:50 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/18 14:29, Jan Beulich wrote:
> In order to be able to verify the 32-bit variant builds and runs,
> introduce a respective target (and the necessary other adjustments).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I tried this, but got:

make: Entering directory '/local/xen.git/tools/tests/x86_emulator/32'
gcc -Wall -Werror -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -Wdeclaration-after-statement -m32 -I/local/xen.git/tools/tests/x86_emulator/32/../../../../tools/include -I.  -D__XEN_TOOLS__ -c -g -o x86-emulate.o ../x86-emulate.c
gcc -Wall -Werror -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -Wdeclaration-after-statement -m32 -I/local/xen.git/tools/tests/x86_emulator/32/../../../../tools/include -I.  -o test_x86_emulator x86-emulate.o ../test_x86_emulator.o ../evex-disp8.o ../wrappers.o
/usr/bin/ld: i386:x86-64 architecture of input file `../test_x86_emulator.o' is incompatible with i386 output
/usr/bin/ld: i386:x86-64 architecture of input file `../evex-disp8.o' is incompatible with i386 output
/usr/bin/ld: i386:x86-64 architecture of input file `../wrappers.o' is incompatible with i386 output
collect2: error: ld returned 1 exit status
../Makefile:153: recipe for target 'test_x86_emulator' failed


> ---
> v4: Moved ahead in series.
> v3: New.
>
> --- a/.gitignore
> +++ b/.gitignore
> @@ -240,6 +240,7 @@ tools/security/xensec_tool
>  tools/tests/depriv/depriv-fd-checker
>  tools/tests/x86_emulator/*.bin
>  tools/tests/x86_emulator/*.tmp
> +tools/tests/x86_emulator/32/x86_emulate
>  tools/tests/x86_emulator/3dnow*.[ch]
>  tools/tests/x86_emulator/asm
>  tools/tests/x86_emulator/avx*.[ch]
> --- /dev/null
> +++ b/tools/tests/x86_emulator/32/Makefile
> @@ -0,0 +1,4 @@
> +override XEN_COMPILE_ARCH := x86_32
> +XEN_ROOT = $(CURDIR)/../../../..
> +VPATH += ..
> +include ../Makefile
> --- a/tools/tests/x86_emulator/Makefile
> +++ b/tools/tests/x86_emulator/Makefile
> @@ -1,5 +1,7 @@
>  
> +ifeq ($(XEN_ROOT),)
>  XEN_ROOT=$(CURDIR)/../../..
> +endif

How about ?= ?

~Andrew


* Re: [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling
  2018-11-12 17:42     ` Andrew Cooper
@ 2018-11-13 11:12       ` Jan Beulich
  2018-11-13 15:45         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-13 11:12 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 12.11.18 at 18:42, <andrew.cooper3@citrix.com> wrote:
> On 25/09/18 14:29, Jan Beulich wrote:
>> Besides the already existing tests (which are going to be extended once
>> respective ISA extension support is complete), let's also ensure for
>> every individual insn that their Disp8 scaling (and memory access width)
>> are correct.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> I can see what you're attempting to do, but you now have two
> implementations of the EVEX disp8 logic written by yourself.  AFAICT,
> this doesn't actually check that the behaviour of the instruction in
> hardware matches your model of the instruction - it checks that two of
> your models are the same.

Correct, but I've specifically tried to make the two models sufficiently
different.

> The only way I can think of testing the emulator model against hardware
> is to start with two memory area poisoned with a non-repeating pattern,
> and a src/dst register poisoned with a different non-repeating pattern. 
> Then, execute a real instruction stub, emulate the other and memcmp()
> the two memory regions.

That's what some of the tests added right in patch 5 do. Did you
intentionally skip that patch while reviewing?

> That way, a systematic error in the two models won't cancel out to "all ok".

Hence the two different models. I certainly realize the risk you
name.

>> --- /dev/null
>> +++ b/tools/tests/x86_emulator/evex-disp8.c
>> @@ -0,0 +1,452 @@
>> +#include <stdarg.h>
>> +#include <stdio.h>
>> +
>> +#include "x86-emulate.h"
> 
> This now needs rearranging to avoid:
> 
> x86-emulate.h:30:3: error: #error "Must not include <stdio.h> before
> x86-emulate.h"
>  # error "Must not include <stdio.h> before x86-emulate.h"

Yes, I've already re-based over that other change.

>> +enum vl {
>> +    VL_128,
>> +    VL_256,
>> +    VL_512,
>> +};
>> +
>> +enum scale {
>> +    SC_vl,
>> +    SC_el,
>> +};
>> +
>> +enum vsz {
>> +    VSZ_vl,
>> +    VSZ_vl_2, /* VL / 2 */
>> +    VSZ_vl_4, /* VL / 4 */
>> +    VSZ_vl_8, /* VL / 8 */
>> +    /* "no broadcast" implied from here on. */
>> +    VSZ_el,
>> +    VSZ_el_2, /* EL * 2 */
>> +    VSZ_el_4, /* EL * 4 */
>> +    VSZ_el_8, /* EL * 8 */
>> +};
>> +
> 
> These acronyms get increasingly difficult to follow.  What is el in this
> context?

VL -> vector length
EL -> element length

>> +static const struct test avx512f_all[] = {
>> +    INSN_SFP(mov,            0f, 10),
>> +    INSN_SFP(mov,            0f, 11),
>> +    INSN_PFP_NB(mova,        0f, 28),
>> +    INSN_PFP_NB(mova,        0f, 29),
>> +    INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
>> +    INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
>> +    INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
>> +    INSN(movdqa64,     66,   0f, 7f,    vl,   q_nb, vl),
>> +    INSN(movdqu32,     f3,   0f, 6f,    vl,   d_nb, vl),
>> +    INSN(movdqu32,     f3,   0f, 7f,    vl,   d_nb, vl),
>> +    INSN(movdqu64,     f3,   0f, 6f,    vl,   q_nb, vl),
>> +    INSN(movdqu64,     f3,   0f, 7f,    vl,   q_nb, vl),
>> +    INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
>> +    INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
>> +    INSN_PFP_NB(movnt,       0f, 2b),
>> +    INSN_PFP_NB(movu,        0f, 10),
>> +    INSN_PFP_NB(movu,        0f, 11),
>> +};
>> +
>> +static const struct test avx512f_128[] = {
>> +    INSN(mov,       66,   0f, 6e, el, dq64, el),
>> +    INSN(mov,       66,   0f, 7e, el, dq64, el),
>> +    INSN(movq,      f3,   0f, 7e, el,    q, el),
>> +    INSN(movq,      66,   0f, d6, el,    q, el),
>> +};
>> +
>> +static const struct test avx512bw_all[] = {
>> +    INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
>> +    INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
>> +    INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
>> +    INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
>> +};
>> +
>> +static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
>> +static const unsigned char vl_128[] = { VL_128 };
> 
> What are these for, and why is vl_all[]'s VL_128 out of order?

The RUN() macro invocations (further down) reference one of them
each, to indicate what vector lengths to test. The first array
entry does always get used, while subsequent entries (if any)
require AVX512VL to be available. See the conditional at the top
of the inner loop in test_group().
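
I.e. the shape is roughly this (a sketch only - the identifiers below
are made up for illustration):

    for ( j = 0; j < nr_vl; ++j )
    {
        /* The first entry always runs; the rest additionally need AVX512VL. */
        if ( j && !cpu_has_avx512vl )
            continue;

        test_one(test, vls[j]);
    }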

>> +
>> +/*
>> + * This table, indicating the presence of an immediate (byte) for an opcode
>> + * space 0f major opcode, is indexed by high major opcode byte nibble, with
>> + * each table element then bit-indexed by low major opcode byte nibble.
>> + */
>> +static const uint16_t imm0f[16] = {
>> +    [0x7] = (1 << 0x0) /* vpshuf* */ |
>> +            (1 << 0x1) /* vps{ll,ra,rl}w */ |
>> +            (1 << 0x2) /* vps{l,r}ld, vp{rol,ror,sra}{d,q} */ |
>> +            (1 << 0x3) /* vps{l,r}l{,d}q */,
>> +    [0xc] = (1 << 0x2) /* vcmp{p,s}{d,s} */ |
>> +            (1 << 0x4) /* vpinsrw */ |
>> +            (1 << 0x5) /* vpextrw */ |
>> +            (1 << 0x6) /* vshufp{d,s} */,
>> +};
>> +
>> +static struct x86_emulate_ops emulops;
>> +
>> +static unsigned int accessed[3 * 64];
> 
> What are the expected properties?  Why 3 * ?

See record_access(): The instructions under test all get a Disp8 value
of 1 encoded. In order to be able to sensibly see how exactly things
go wrong (during debugging), it simply helps to cover the entire range
from zero to 3 times the (maximum) vector length. All accesses farther
out of bounds than by vector length will not be recorded here, and
hence fail "silently".

Jan



* Re: [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro
  2018-11-12 17:50     ` Andrew Cooper
@ 2018-11-13 11:42       ` Jan Beulich
  2018-11-13 15:58         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-13 11:42 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 12.11.18 at 18:50, <andrew.cooper3@citrix.com> wrote:
> On 25/09/18 14:29, Jan Beulich wrote:
>> In order to be able to verify the 32-bit variant builds and runs,
>> introduce a respective target (and the necessary other adjustments).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> I tried this, but got:
> 
> make: Entering directory '/local/xen.git/tools/tests/x86_emulator/32'
> gcc -Wall -Werror -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -Wdeclaration-after-statement -m32 -I/local/xen.git/tools/tests/x86_emulator/32/../../../../tools/include -I.  -D__XEN_TOOLS__ -c -g -o x86-emulate.o ../x86-emulate.c
> gcc -Wall -Werror -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -Wdeclaration-after-statement -m32 -I/local/xen.git/tools/tests/x86_emulator/32/../../../../tools/include -I.  -o test_x86_emulator x86-emulate.o ../test_x86_emulator.o ../evex-disp8.o ../wrappers.o
> /usr/bin/ld: i386:x86-64 architecture of input file `../test_x86_emulator.o' is incompatible with i386 output
> /usr/bin/ld: i386:x86-64 architecture of input file `../evex-disp8.o' is incompatible with i386 output
> /usr/bin/ld: i386:x86-64 architecture of input file `../wrappers.o' is incompatible with i386 output
> collect2: error: ld returned 1 exit status
> ../Makefile:153: recipe for target 'test_x86_emulator' failed

Hmm, nothing I can reproduce. I don't understand why it tries to use
../*.o instead of building 32-bit variants in the current directory.
Perhaps a make version difference in how VPATH gets handled? But
I know I've tried with different versions, which will make it rather
hard for me to diagnose the issue. Could you see whether replacing

VPATH += ..

by

vpath %.c ..

helps in your case?

>> --- a/tools/tests/x86_emulator/Makefile
>> +++ b/tools/tests/x86_emulator/Makefile
>> @@ -1,5 +1,7 @@
>>  
>> +ifeq ($(XEN_ROOT),)
>>  XEN_ROOT=$(CURDIR)/../../..
>> +endif
> 
> How about ?= ?

I can switch to that (provided I run into no issues).

Jan




* Re: [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling
  2018-11-13 11:12       ` Jan Beulich
@ 2018-11-13 15:45         ` Andrew Cooper
  2018-11-14 14:17           ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 15:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu

On 13/11/2018 11:12, Jan Beulich wrote:
>>>> On 12.11.18 at 18:42, <andrew.cooper3@citrix.com> wrote:
>> On 25/09/18 14:29, Jan Beulich wrote:
>>> Besides the already existing tests (which are going to be extended once
>>> respective ISA extension support is complete), let's also ensure for
>>> every individual insn that their Disp8 scaling (and memory access width)
>>> are correct.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> I can see what you're attempting to do, but you now have two
>> implementations of the EVEX disp8 logic written by yourself.  AFAICT,
>> this doesn't actually check that the behaviour of the instruction in
>> hardware matches your model of the instruction - it checks that two of
>> your models are the same.
> Correct, but I've specifically tried to make the two models sufficiently
> different.
>
>> The only way I can think of testing the emulator model against hardware
>> is to start with two memory area poisoned with a non-repeating pattern,
>> and a src/dst register poisoned with a different non-repeating pattern. 
>> Then, execute a real instruction stub, emulate the other and memcmp()
>> the two memory regions.
> That's what some of the tests added right in patch 5 do. Did you
> intentionally skip that patch while reviewing?

I intentionally wanted to understand this patch first.

>
>> That way, a systematic error in the two models won't cancel out to "all ok".
> Hence the two different models. I certainly realize the risk you
> name.
>
>>> --- /dev/null
>>> +++ b/tools/tests/x86_emulator/evex-disp8.c
>>> @@ -0,0 +1,452 @@
>>> +#include <stdarg.h>
>>> +#include <stdio.h>
>>> +
>>> +#include "x86-emulate.h"
>> This now needs rearranging to avoid:
>>
>> x86-emulate.h:30:3: error: #error "Must not include <stdio.h> before
>> x86-emulate.h"
>>  # error "Must not include <stdio.h> before x86-emulate.h"
> Yes, I've already re-based over that other change.
>
>>> +enum vl {
>>> +    VL_128,
>>> +    VL_256,
>>> +    VL_512,
>>> +};
>>> +
>>> +enum scale {
>>> +    SC_vl,
>>> +    SC_el,
>>> +};
>>> +
>>> +enum vsz {
>>> +    VSZ_vl,
>>> +    VSZ_vl_2, /* VL / 2 */
>>> +    VSZ_vl_4, /* VL / 4 */
>>> +    VSZ_vl_8, /* VL / 8 */
>>> +    /* "no broadcast" implied from here on. */
>>> +    VSZ_el,
>>> +    VSZ_el_2, /* EL * 2 */
>>> +    VSZ_el_4, /* EL * 4 */
>>> +    VSZ_el_8, /* EL * 8 */
>>> +};
>>> +
>> These acronyms get increasingly difficult to follow.  What is el in this
>> context?
> VL -> vector length
> EL -> element length

Can you at least leave trailing comments after the identifiers for the
benefit of people other than you reading the code?
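
E.g. something along the lines of (exact wording up to you):

    enum scale {
        SC_vl, /* Disp8 scaled by vector length */
        SC_el, /* Disp8 scaled by element length */
    };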

>
>>> +static const struct test avx512f_all[] = {
>>> +    INSN_SFP(mov,            0f, 10),
>>> +    INSN_SFP(mov,            0f, 11),
>>> +    INSN_PFP_NB(mova,        0f, 28),
>>> +    INSN_PFP_NB(mova,        0f, 29),
>>> +    INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
>>> +    INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
>>> +    INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
>>> +    INSN(movdqa64,     66,   0f, 7f,    vl,   q_nb, vl),
>>> +    INSN(movdqu32,     f3,   0f, 6f,    vl,   d_nb, vl),
>>> +    INSN(movdqu32,     f3,   0f, 7f,    vl,   d_nb, vl),
>>> +    INSN(movdqu64,     f3,   0f, 6f,    vl,   q_nb, vl),
>>> +    INSN(movdqu64,     f3,   0f, 7f,    vl,   q_nb, vl),
>>> +    INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
>>> +    INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
>>> +    INSN_PFP_NB(movnt,       0f, 2b),
>>> +    INSN_PFP_NB(movu,        0f, 10),
>>> +    INSN_PFP_NB(movu,        0f, 11),
>>> +};
>>> +
>>> +static const struct test avx512f_128[] = {
>>> +    INSN(mov,       66,   0f, 6e, el, dq64, el),
>>> +    INSN(mov,       66,   0f, 7e, el, dq64, el),
>>> +    INSN(movq,      f3,   0f, 7e, el,    q, el),
>>> +    INSN(movq,      66,   0f, d6, el,    q, el),
>>> +};
>>> +
>>> +static const struct test avx512bw_all[] = {
>>> +    INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
>>> +    INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
>>> +    INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
>>> +    INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
>>> +};
>>> +
>>> +static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
>>> +static const unsigned char vl_128[] = { VL_128 };
>> What are these for, and why is vl_all[]'s VL_128 out of order?
> The RUN() macro invocations (further down) reference one of them
> each, to indicate what vector lengths to test. The first array
> entry does always get used, while subsequent entries (if any)
> require AVX512VL to be available. See the conditional at the top
> of the inner loop in test_group().

After re-reading the apparently relevant bits of Vol 1, 2 and 3, I'm
still actually none the wiser as to which AVX512 feature bits mean what.

Is there a chapter with an overview that I've overlooked, or if not, can
we see about putting one together?

>
>>> +
>>> +/*
>>> + * This table, indicating the presence of an immediate (byte) for an opcode
>>> + * space 0f major opcode, is indexed by high major opcode byte nibble, with
>>> + * each table element then bit-indexed by low major opcode byte nibble.
>>> + */
>>> +static const uint16_t imm0f[16] = {
>>> +    [0x7] = (1 << 0x0) /* vpshuf* */ |
>>> +            (1 << 0x1) /* vps{ll,ra,rl}w */ |
>>> +            (1 << 0x2) /* vps{l,r}ld, vp{rol,ror,sra}{d,q} */ |
>>> +            (1 << 0x3) /* vps{l,r}l{,d}q */,
>>> +    [0xc] = (1 << 0x2) /* vcmp{p,s}{d,s} */ |
>>> +            (1 << 0x4) /* vpinsrw */ |
>>> +            (1 << 0x5) /* vpextrw */ |
>>> +            (1 << 0x6) /* vshufp{d,s} */,
>>> +};
>>> +
>>> +static struct x86_emulate_ops emulops;
>>> +
>>> +static unsigned int accessed[3 * 64];
>> What are the expected properties?  Why 3 * ?
> See record_access(): The instructions under test all get a Disp8 value
> of 1 encoded. In order to be able to sensibly see how exactly things
> go wrong (during debugging), it simply helps to cover the entire range
> from zero to 3 times the (maximum) vector length. All accesses farther
> out of bounds than by vector length will not be recorded here, and
> hence fail "silently".

Please can you put a short description in a comment somewhere around
about here.

~Andrew


* Re: [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro
  2018-11-13 11:42       ` Jan Beulich
@ 2018-11-13 15:58         ` Andrew Cooper
  2018-11-14  8:42           ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 15:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu

On 13/11/2018 11:42, Jan Beulich wrote:
>>>> On 12.11.18 at 18:50, <andrew.cooper3@citrix.com> wrote:
>> On 25/09/18 14:29, Jan Beulich wrote:
>>> In order to be able to verify the 32-bit variant builds and runs,
>>> introduce a respective target (and the necessary other adjustments).
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> I tried this, but got:
>>
>> make: Entering directory '/local/xen.git/tools/tests/x86_emulator/32'
>> gcc -Wall -Werror -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -Wdeclaration-after-statement -m32 -I/local/xen.git/tools/tests/x86_emulator/32/../../../../tools/include -I.  -D__XEN_TOOLS__ -c -g -o x86-emulate.o ../x86-emulate.c
>> gcc -Wall -Werror -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -Wdeclaration-after-statement -m32 -I/local/xen.git/tools/tests/x86_emulator/32/../../../../tools/include -I.  -o test_x86_emulator x86-emulate.o ../test_x86_emulator.o ../evex-disp8.o ../wrappers.o
>> /usr/bin/ld: i386:x86-64 architecture of input file `../test_x86_emulator.o' is incompatible with i386 output
>> /usr/bin/ld: i386:x86-64 architecture of input file `../evex-disp8.o' is incompatible with i386 output
>> /usr/bin/ld: i386:x86-64 architecture of input file `../wrappers.o' is incompatible with i386 output
>> collect2: error: ld returned 1 exit status
>> ../Makefile:153: recipe for target 'test_x86_emulator' failed
> Hmm, nothing I can reproduce. I don't understand why it tries to use
> ../*.o instead of building 32-bit variants in the current directory.
> Perhaps a make version difference in how VPATH gets handled? But
> I know I've tried with different versions, which will make it rather
> hard for me to diagnose the issue. Could you see whether replacing
>
> VPATH += ..
>
> by
>
> vpath %.c ..
>
> helps in your case?

Sadly not.  That turns into:

make: Entering directory '/local/xen.git/tools/tests/x86_emulator/32'
make: *** No rule to make target 'x86-emulate.h', needed by
'x86-emulate.o'.  Stop.
make: Leaving directory '/local/xen.git/tools/tests/x86_emulator/32'

I've noticed another issue.  The clean target now rebuilds all the
binary fragments before deleting them again.

~Andrew


* Re: [PATCH v4 05/44] x86emul: support basic AVX512 moves
  2018-09-25 13:28   ` [PATCH v4 05/44] x86emul: support basic AVX512 moves Jan Beulich
@ 2018-11-13 17:12     ` Andrew Cooper
  2018-11-14 14:35       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 17:12 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:28, Jan Beulich wrote:
> +#define avx512_vlen_check(lig) do { \
> +    switch ( evex.lr ) \
> +    { \
> +    default: \
> +        generate_exception(EXC_UD); \
> +    case 2: \
> +        break; \
> +    case 0: case 1: \
> +        if (!(lig)) \

if ( !(lig) )

> +            host_and_vcpu_must_have(avx512vl); \
> +        break; \
> +    } \
> +} while ( false )
> +
>  static bool is_aligned(enum x86_segment seg, unsigned long offs,
>                         unsigned int size, struct x86_emulate_ctxt *ctxt,
>                         const struct x86_emulate_ops *ops)
> @@ -3272,6 +3387,7 @@ x86_emulate(
>      b = ctxt->opcode;
>      d = state.desc;
>  #define state (&state)
> +    elem_bytes = 4 << evex.w;

evex.w isn't filled by this point, is it?  We only fill evex.lr in the
!evex_encoded() case AFAICT.

>  
>      generate_exception_if(state->not_64bit && mode_64bit(), EXC_UD);
>  
> @@ -6348,6 +6521,41 @@ x86_emulate(
>          ASSERT(!state->simd_size);
>          break;
>  
> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x7e): /* vmov{d,q} xmm,r/m */
> +        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
> +                               evex.reg != 0xf || !evex.RX),

Are the inner brackets necessary?

> @@ -8819,6 +9070,44 @@ x86_emulate(
>                                    !is_aligned(ea.mem.seg, ea.mem.off, op_bytes,
>                                                ctxt, ops),
>                                    EXC_GP, 0);
> +
> +            if ( evex.br )
> +            {
> +                ASSERT((d & DstMask) != DstMem);
> +                op_bytes = elem_bytes;
> +            }
> +            if ( evex.opmsk )
> +            {
> +                ASSERT(!(op_bytes % elem_bytes));
> +                full = ~0ULL >> (64 - op_bytes / elem_bytes);

I think we want a path which checks elem_bytes != 0 which is
release-build safe.  This feels like an XSA waiting to happen.

~Andrew


* Re: [PATCH v4 08/44] x86emul: use AVX512 logic for emulating V{, P}MASKMOV*
  2018-09-25 13:30   ` [PATCH v4 08/44] x86emul: use AVX512 logic for emulating V{, P}MASKMOV* Jan Beulich
@ 2018-11-13 18:12     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 18:12 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:30, Jan Beulich wrote:
> The more generic AVX512 implementation allows quite a bit of insn-
> specific code to be dropped/shared.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v4 09/44] x86emul: support AVX512F legacy-equivalent arithmetic FP insns
  2018-09-25 13:31   ` [PATCH v4 09/44] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
@ 2018-11-13 18:21     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 18:21 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:31, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v4 10/44] x86emul: support AVX512DQ logic FP insns
  2018-09-25 13:32   ` [PATCH v4 10/44] x86emul: support AVX512DQ logic " Jan Beulich
@ 2018-11-13 18:56     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 18:56 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:32, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v4 11/44] x86emul: support AVX512F "normal" FP compare insns
  2018-09-25 13:32   ` [PATCH v4 11/44] x86emul: support AVX512F "normal" FP compare insns Jan Beulich
@ 2018-11-13 19:04     ` Andrew Cooper
  2018-11-14 14:41       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 19:04 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

What do you consider "normal" here?

On 25/09/2018 14:32, Jan Beulich wrote:
> Also correct the AVX counterpart's comment.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

The content itself Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>,
but preferably with a better description.


* Re: [PATCH v4 12/44] x86emul: support AVX512F misc legacy-equivalent FP insns
  2018-09-25 13:33   ` [PATCH v4 12/44] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
@ 2018-11-13 19:17     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 19:17 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:33, Jan Beulich wrote:
> Also correct an AVX counterpart's comment.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v4 13/44] x86emul: support AVX512F fused-multiply-add insns
  2018-09-25 13:33   ` [PATCH v4 13/44] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
@ 2018-11-13 19:28     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 19:28 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:33, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v4 14/44] x86emul: support AVX512F legacy-equivalent logic insns
  2018-09-25 13:34   ` [PATCH v4 14/44] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
@ 2018-11-13 19:30     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 19:30 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:34, Jan Beulich wrote:
> Plus vpternlog{d,q} as being extensively used by the compiler, in order
> to facilitate test enabling in the harness as soon as possible. Also the
> twobyte_table[] entries for a few more insns get their .d8s field set
> right away, in order to not split and later re-combine the groups.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v4 15/44] x86emul: support AVX512{F, DQ} FP broadcast insns
  2018-09-25 13:35   ` [PATCH v4 15/44] x86emul: support AVX512{F, DQ} FP broadcast insns Jan Beulich
@ 2018-11-13 19:37     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 19:37 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:35, Jan Beulich wrote:
> @@ -7979,6 +7991,42 @@ x86_emulate(
>          dst.type = OP_NONE;
>          break;
>  
> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
> +        generate_exception_if(evex.w || evex.br, EXC_UD);
> +    avx512_broadcast:
> +        /*
> +         * For the respective code below the main switch() to work we need to
> +         * fold op_mask here: A source element gets read whenever any of its
> +         * respective destination elements' mask bits is set.
> +         */
> +        if ( fault_suppression )
> +        {
> +            n = 1 << ((b & 3) - evex.w);
> +            ASSERT(op_bytes == n * elem_bytes);
> +            for ( i = n; i < (16 << evex.lr) / elem_bytes; i += n )

As before, we should have a release-build safe check of elem_bytes.

~Andrew


* Re: [PATCH v4 16/44] x86emul: support AVX512F v{, u}comis{d, s} insns
  2018-09-25 13:35   ` [PATCH v4 16/44] x86emul: support AVX512F v{, u}comis{d, s} insns Jan Beulich
@ 2018-11-13 19:39     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-13 19:39 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu

On 25/09/2018 14:35, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro
  2018-11-13 15:58         ` Andrew Cooper
@ 2018-11-14  8:42           ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-14  8:42 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 13.11.18 at 16:58, <andrew.cooper3@citrix.com> wrote:
> On 13/11/2018 11:42, Jan Beulich wrote:
>>>>> On 12.11.18 at 18:50, <andrew.cooper3@citrix.com> wrote:
>>> On 25/09/18 14:29, Jan Beulich wrote:
>>>> In order to be able to verify the 32-bit variant builds and runs,
>>>> introduce a respective target (and the necessary other adjustments).
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> I tried this, but got:
>>>
>>> make: Entering directory '/local/xen.git/tools/tests/x86_emulator/32'
>>> gcc -Wall -Werror -Wstrict-prototypes -O2 -fomit-frame-pointer 
> -fno-strict-aliasing -Wdeclaration-after-statement -m32 
> -I/local/xen.git/tools/tests/x86_emulator/32/../../../../tools/include -I.  
> -D__XEN_TOOLS__ -c -g -o x86-emulate.o ../x86-emulate.c
>>> gcc -Wall -Werror -Wstrict-prototypes -O2 -fomit-frame-pointer 
> -fno-strict-aliasing -Wdeclaration-after-statement -m32 
> -I/local/xen.git/tools/tests/x86_emulator/32/../../../../tools/include -I.  
> -o test_x86_emulator x86-emulate.o ../test_x86_emulator.o ../evex-disp8.o 
> ../wrappers.o
>>> /usr/bin/ld: i386:x86-64 architecture of input file `../test_x86_emulator.o' 
> is incompatible with i386 output
>>> /usr/bin/ld: i386:x86-64 architecture of input file `../evex-disp8.o' is 
> incompatible with i386 output
>>> /usr/bin/ld: i386:x86-64 architecture of input file `../wrappers.o' is 
> incompatible with i386 output
>>> collect2: error: ld returned 1 exit status
>>> ../Makefile:153: recipe for target 'test_x86_emulator' failed
>> Hmm, nothing I can reproduce. I don't understand why it tries to use
>> ../*.o instead of building 32-bit variants in the current directory.
>> Perhaps a make version difference in how VPATH gets handled? But
>> I know I've tried with different versions, which will make it rather
>> hard for me to diagnose the issue. Could you see whether replacing
>>
>> VPATH += ..
>>
>> by
>>
>> vpath %.c ..
>>
>> helps in your case?
> 
> Sadly not.  That turns into:
> 
> make: Entering directory '/local/xen.git/tools/tests/x86_emulator/32'
> make: *** No rule to make target 'x86-emulate.h', needed by
> 'x86-emulate.o'.  Stop.
> make: Leaving directory '/local/xen.git/tools/tests/x86_emulator/32'

Well, I had realized late last night that an equivalent line for %.h
would then also be needed (which was in the first place why I used
the single VPATH line instead, seeing that it worked fine for me).
Since the above shows that it'll work in general, I'll switch to this
(making further additions if need be).

> I've noticed another issue.  The clean target now rebuilds all the
> binary fragments before deleting them again.

Oh, yes. Easily taken care of by splitting the dependency line from
the rule one.

Jan




* Re: [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling
  2018-11-13 15:45         ` Andrew Cooper
@ 2018-11-14 14:17           ` Jan Beulich
  2018-11-14 14:42             ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-14 14:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 13.11.18 at 16:45, <andrew.cooper3@citrix.com> wrote:
> On 13/11/2018 11:12, Jan Beulich wrote:
>>>>> On 12.11.18 at 18:42, <andrew.cooper3@citrix.com> wrote:
>>> On 25/09/18 14:29, Jan Beulich wrote:
>>>> +enum vl {
>>>> +    VL_128,
>>>> +    VL_256,
>>>> +    VL_512,
>>>> +};
>>>> +
>>>> +enum scale {
>>>> +    SC_vl,
>>>> +    SC_el,
>>>> +};
>>>> +
>>>> +enum vsz {
>>>> +    VSZ_vl,
>>>> +    VSZ_vl_2, /* VL / 2 */
>>>> +    VSZ_vl_4, /* VL / 4 */
>>>> +    VSZ_vl_8, /* VL / 8 */
>>>> +    /* "no broadcast" implied from here on. */
>>>> +    VSZ_el,
>>>> +    VSZ_el_2, /* EL * 2 */
>>>> +    VSZ_el_4, /* EL * 4 */
>>>> +    VSZ_el_8, /* EL * 8 */
>>>> +};
>>>> +
>>> These acronyms get increasingly difficult to follow.  What is el in this
>>> context?
>> VL -> vector length
>> EL -> element length
> 
> Can you at least leave trailing comments after the identifiers for the
> benefit of people other than you reading the code?

Will do.

>>>> +static const struct test avx512f_all[] = {
>>>> +    INSN_SFP(mov,            0f, 10),
>>>> +    INSN_SFP(mov,            0f, 11),
>>>> +    INSN_PFP_NB(mova,        0f, 28),
>>>> +    INSN_PFP_NB(mova,        0f, 29),
>>>> +    INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
>>>> +    INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
>>>> +    INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
>>>> +    INSN(movdqa64,     66,   0f, 7f,    vl,   q_nb, vl),
>>>> +    INSN(movdqu32,     f3,   0f, 6f,    vl,   d_nb, vl),
>>>> +    INSN(movdqu32,     f3,   0f, 7f,    vl,   d_nb, vl),
>>>> +    INSN(movdqu64,     f3,   0f, 6f,    vl,   q_nb, vl),
>>>> +    INSN(movdqu64,     f3,   0f, 7f,    vl,   q_nb, vl),
>>>> +    INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
>>>> +    INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
>>>> +    INSN_PFP_NB(movnt,       0f, 2b),
>>>> +    INSN_PFP_NB(movu,        0f, 10),
>>>> +    INSN_PFP_NB(movu,        0f, 11),
>>>> +};
>>>> +
>>>> +static const struct test avx512f_128[] = {
>>>> +    INSN(mov,       66,   0f, 6e, el, dq64, el),
>>>> +    INSN(mov,       66,   0f, 7e, el, dq64, el),
>>>> +    INSN(movq,      f3,   0f, 7e, el,    q, el),
>>>> +    INSN(movq,      66,   0f, d6, el,    q, el),
>>>> +};
>>>> +
>>>> +static const struct test avx512bw_all[] = {
>>>> +    INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
>>>> +    INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
>>>> +    INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
>>>> +    INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
>>>> +};
>>>> +
>>>> +static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
>>>> +static const unsigned char vl_128[] = { VL_128 };
>>> What are these for, and why is vl_all[]'s VL_128 out of order?
>> The RUN() macro invocations (further down) reference one of them
>> each, to indicate what vector lengths to test. The first array
>> entry does always get used, while subsequent entries (if any)
>> require AVX512VL to be available. See the conditional at the top
>> of the inner loop in test_group().
> 
> After re-reading the apparently relevant bits of Vol 1, 2 and 3, I'm
> still actually none the wiser as to which AVX512 feature bits mean what.

What feature bits are you talking about? The context above doesn't
refer to any, at least not directly.

> Is there a chapter with an overview that I've overlooked, or if not, can
> we see about putting one together?

I'd have to understand (as per above) what exactly you're after in
order to answer this and/or put something together. What I had
done in the year that I had been waiting for a machine to be able
to test all of this was to put together two classifying tables. One is
what I'm slowly migrating into the tables right here. The other is
taking a slightly different view, grouping them by memory access
width, Disp8 scaling behavior, and "tuple type":

MNEMONIC		Disp8 Scale		Mem Width		Category
======================================================================================
vpcompressb		0			VL			Tuple1 Scalar
vpexpandb
--------------------------------------------------------------------------------------
vpbroadcastb		0			Disp8 Scale		Tuple1 Scalar
vpextrb
vpinsrb
======================================================================================
vpcompressw		1			VL			Tuple1 Scalar
vpexpandw
--------------------------------------------------------------------------------------
vpbroadcastw		1			Disp8 Scale		Tuple1 Scalar
vpextrw
vpinsrw
======================================================================================
vcompressps		2			VL			Tuple1 Scalar
vexpandps
vpcompressd
vpexpandd

etc.

But I didn't compile anything matching up with feature flags, other
than the first having tables grouped by features. I did a lot of
binutils work though during that time as well, to get things into a
more sane state there. An original thought of mine was whether
we could re-use their opcode table, which is why I first wanted
to get it into better shape.

(A 3rd table then turned out helpful for implementing the myriad of
conversion insns, but that's just a re-ordered derivation of the first
one here, going by encoding instead.)

>>>> +
>>>> +/*
>>>> + * This table, indicating the presence of an immediate (byte) for an opcode
>>>> + * space 0f major opcode, is indexed by high major opcode byte nibble, with
>>>> + * each table element then bit-indexed by low major opcode byte nibble.
>>>> + */
>>>> +static const uint16_t imm0f[16] = {
>>>> +    [0x7] = (1 << 0x0) /* vpshuf* */ |
>>>> +            (1 << 0x1) /* vps{ll,ra,rl}w */ |
>>>> +            (1 << 0x2) /* vps{l,r}ld, vp{rol,ror,sra}{d,q} */ |
>>>> +            (1 << 0x3) /* vps{l,r}l{,d}q */,
>>>> +    [0xc] = (1 << 0x2) /* vcmp{p,s}{d,s} */ |
>>>> +            (1 << 0x4) /* vpinsrw */ |
>>>> +            (1 << 0x5) /* vpextrw */ |
>>>> +            (1 << 0x6) /* vshufp{d,s} */,
>>>> +};
>>>> +
>>>> +static struct x86_emulate_ops emulops;
>>>> +
>>>> +static unsigned int accessed[3 * 64];
>>> What are the expected properties?  Why 3 * ?
>> See record_access(): The instructions under test all get a Disp8 value
>> of 1 encoded. In order to be able to sensibly see how exactly things
>> go wrong (during debugging), it simply helps to cover the entire range
>> from zero to 3 times the (maximum) vector length. All accesses farther
>> out of bounds than by vector length will not be recorded here, and
>> hence fail "silently".
> 
> Please can you put a short description in a comment somewhere around
> about here.

Again, will do.

Jan



* Re: [PATCH v4 05/44] x86emul: support basic AVX512 moves
  2018-11-13 17:12     ` Andrew Cooper
@ 2018-11-14 14:35       ` Jan Beulich
  2018-11-14 16:26         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-14 14:35 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 13.11.18 at 18:12, <andrew.cooper3@citrix.com> wrote:
> On 25/09/2018 14:28, Jan Beulich wrote:
>> +#define avx512_vlen_check(lig) do { \
>> +    switch ( evex.lr ) \
>> +    { \
>> +    default: \
>> +        generate_exception(EXC_UD); \
>> +    case 2: \
>> +        break; \
>> +    case 0: case 1: \
>> +        if (!(lig)) \
> 
> if ( !(lig) )

Oops. By now I've looked at this many dozen times, and I've
never noticed the style issue.

>> @@ -3272,6 +3387,7 @@ x86_emulate(
>>      b = ctxt->opcode;
>>      d = state.desc;
>>  #define state (&state)
>> +    elem_bytes = 4 << evex.w;
> 
> evex.w isn't filled by this point, is it?  We only fill evex.lr in the
> !evex_encoded() case AFAICT.

Well, that's another bit of (pre-existing) trickery: When we decode
these special prefixes (VEX, XOP, and EVEX) we first read the bytes
into vex.raw[]. The code dealing with the EVEX case then copies
the two vex.raw[] bytes into evex.raw[]. Thus the common fields
(between the prefix encodings) are filled uniformly, and uses
through vex are fine also for the EVEX case. I think this is better
than littering around many ?: expressions.

As an aside - if the above didn't work, none of the testing would
have succeeded.

>> @@ -6348,6 +6521,41 @@ x86_emulate(
>>          ASSERT(!state->simd_size);
>>          break;
>>  
>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x7e): /* vmov{d,q} xmm,r/m */
>> +        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
>> +                               evex.reg != 0xf || !evex.RX),
> 
> Are the inner brackets necessary?

I'd be happy to drop them - I've put them there mostly for you,
who want whatever tool you use to properly deal with indentation on
such wrapped lines. Since I don't know the exact rules that tool
follows, I may have gone too far, but then again I think the
resulting different indentation between the two lines above and
the next line (holding the other macro argument) isn't unhelpful.

>> @@ -8819,6 +9070,44 @@ x86_emulate(
>>                                    !is_aligned(ea.mem.seg, ea.mem.off, op_bytes,
>>                                                ctxt, ops),
>>                                    EXC_GP, 0);
>> +
>> +            if ( evex.br )
>> +            {
>> +                ASSERT((d & DstMask) != DstMem);
>> +                op_bytes = elem_bytes;
>> +            }
>> +            if ( evex.opmsk )
>> +            {
>> +                ASSERT(!(op_bytes % elem_bytes));
>> +                full = ~0ULL >> (64 - op_bytes / elem_bytes);
> 
> I think we want a path which checks elem_bytes != 0 which is
> release-build safe.  This feels like an XSA waiting to happen.

Nothing _ever_ sets (or should set) elem_bytes to zero, and it gets
initialized to a non-zero value right in this patch. When writing this
code I indeed did think about adding a check against zero, but I
couldn't figure what half way sensible action (other than BUG()ing)
I could take in that case. Yet BUG() is in no way better than hitting
#DE on the division.

Jan




* Re: [PATCH v4 11/44] x86emul: support AVX512F "normal" FP compare insns
  2018-11-13 19:04     ` Andrew Cooper
@ 2018-11-14 14:41       ` Jan Beulich
  2018-11-14 14:45         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-14 14:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 13.11.18 at 20:04, <andrew.cooper3@citrix.com> wrote:
> What do you consider "normal" here?

Would "ordinary" be better (I've used quotes exactly because I
couldn't think of anything better)? V{,U}COMIS{S,D} are the ones
not dealt with here.

> On 25/09/2018 14:32, Jan Beulich wrote:
>> Also correct the AVX counterpart's comment.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> The content itself Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>,
> but preferably with a better description.

Thanks - let me know whether "normal" or "ordinary" are okay with
you, or what better suggestions you have.

Jan




* Re: [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling
  2018-11-14 14:17           ` Jan Beulich
@ 2018-11-14 14:42             ` Andrew Cooper
  2018-11-14 14:58               ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-14 14:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu

On 14/11/2018 14:17, Jan Beulich wrote:
>>>>> +static const struct test avx512f_all[] = {
>>>>> +    INSN_SFP(mov,            0f, 10),
>>>>> +    INSN_SFP(mov,            0f, 11),
>>>>> +    INSN_PFP_NB(mova,        0f, 28),
>>>>> +    INSN_PFP_NB(mova,        0f, 29),
>>>>> +    INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
>>>>> +    INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
>>>>> +    INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
>>>>> +    INSN(movdqa64,     66,   0f, 7f,    vl,   q_nb, vl),
>>>>> +    INSN(movdqu32,     f3,   0f, 6f,    vl,   d_nb, vl),
>>>>> +    INSN(movdqu32,     f3,   0f, 7f,    vl,   d_nb, vl),
>>>>> +    INSN(movdqu64,     f3,   0f, 6f,    vl,   q_nb, vl),
>>>>> +    INSN(movdqu64,     f3,   0f, 7f,    vl,   q_nb, vl),
>>>>> +    INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
>>>>> +    INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
>>>>> +    INSN_PFP_NB(movnt,       0f, 2b),
>>>>> +    INSN_PFP_NB(movu,        0f, 10),
>>>>> +    INSN_PFP_NB(movu,        0f, 11),
>>>>> +};
>>>>> +
>>>>> +static const struct test avx512f_128[] = {
>>>>> +    INSN(mov,       66,   0f, 6e, el, dq64, el),
>>>>> +    INSN(mov,       66,   0f, 7e, el, dq64, el),
>>>>> +    INSN(movq,      f3,   0f, 7e, el,    q, el),
>>>>> +    INSN(movq,      66,   0f, d6, el,    q, el),
>>>>> +};
>>>>> +
>>>>> +static const struct test avx512bw_all[] = {
>>>>> +    INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
>>>>> +    INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
>>>>> +    INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
>>>>> +    INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
>>>>> +};
>>>>> +
>>>>> +static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
>>>>> +static const unsigned char vl_128[] = { VL_128 };
>>>> What are these for, and why is vl_all[]'s VL_128 out of order?
>>> The RUN() macro invocations (further down) reference one of them
>>> each, to indicate what vector lengths to test. The first array
>>> entry does always get used, while subsequent entries (if any)
>>> require AVX512VL to be available. See the conditional at the top
>>> of the inner loop in test_group().
>> After re-reading the apparently relevant bits of Vol 1, 2 and 3, I'm
>> still actually none the wiser as to which AVX512 feature bits mean what.
> What feature bits are you talking about? The context above doesn't
> refer to any, at least not directly.

I was referring to the AVX512 cpuid flags.

For example, it took me until writing that comment to realise that the
VL feature bit behaved in the opposite way to how I expected it to
behave.  (I.e. it allows you to encode EVEX instructions which don't
refer to %zmm registers).

Having said all of this, after searching around online, I think the
Wikipedia AVX-512 page is probably the closest to what I was looking for,
so perhaps writing our own breakdown isn't the best idea.

~Andrew


* Re: [PATCH v4 11/44] x86emul: support AVX512F "normal" FP compare insns
  2018-11-14 14:41       ` Jan Beulich
@ 2018-11-14 14:45         ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-14 14:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu

On 14/11/2018 14:41, Jan Beulich wrote:
>>>> On 13.11.18 at 20:04, <andrew.cooper3@citrix.com> wrote:
>> What do you consider "normal" here?
> Would "ordinary" be better (I've used quotes exactly because I
> couldn't think of anything better)? V{,U}COMIS{S,D} are the ones
> not dealt with here.
>
>> On 25/09/2018 14:32, Jan Beulich wrote:
>>> Also correct the AVX counterpart's comment.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> The content itself Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>,
>> but preferably with a better description.
> Thanks - let me know whether "normal" or "ordinary" are okay with
> you, or what better suggestions you have.

How about just "Support most AVX512F FP compare insns", and note that
V{,U}COMIS{S,D} is done later in the commit message?

~Andrew


* Re: [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling
  2018-11-14 14:42             ` Andrew Cooper
@ 2018-11-14 14:58               ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-14 14:58 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 14.11.18 at 15:42, <andrew.cooper3@citrix.com> wrote:
> On 14/11/2018 14:17, Jan Beulich wrote:
>>> After re-reading the apparently relevant bits of Vol 1, 2 and 3, I'm
>>> still actually none the wiser as to which AVX512 feature bits mean what.
>> What feature bits are you talking about? The context above doesn't
>> refer to any, at least not directly.
> 
> I was referring to the AVX512 cpuid flags.
> 
> For example, it took me until writing that comment to realise that the
> VL feature bit behaved in the opposite way to how I expected it to
> behave.  (I.e. it allows you to encode EVEX instructions which don't
> refer to %zmm register).

Oh. AVX512, as its name says, implies 512-bit vectors. Shorter
forms are what they consider the Vector Length Extension (where
"extension" does _not_ refer to vector length, but to the wider
capabilities).
relate to problems they have when AVX512F (i.e. without
AVX512VL) code wants to interact with code dealing with shorter
vectors. And that's despite some AVX512F insns themselves
having shorter operands (when element width widens or shrinks).
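
In emulator terms (LIG insns aside) this is exactly what the
avx512_vlen_check() logic from patch 5 expresses:

    switch ( evex.lr )
    {
    case 2:            /* 512-bit form: plain AVX512F suffices */
        break;
    case 0: case 1:    /* 128-/256-bit form: additionally needs AVX512VL */
        host_and_vcpu_must_have(avx512vl);
        break;
    default:
        generate_exception(EXC_UD);
    }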

> Having said all of this, having searched about online, I think the
> Wikipedia AVX-512 page is probably the closes to what I was looking for,
> so perhaps us writing our own breakdown isn't the best idea.

Indeed. Plus the SDM insn pages look pretty correct in this particular
regard.

Jan




* Re: [PATCH v4 05/44] x86emul: support basic AVX512 moves
  2018-11-14 14:35       ` Jan Beulich
@ 2018-11-14 16:26         ` Andrew Cooper
  2018-11-15  9:54           ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-14 16:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu

On 14/11/2018 14:35, Jan Beulich wrote:
>>>> On 13.11.18 at 18:12, <andrew.cooper3@citrix.com> wrote:
>> On 25/09/2018 14:28, Jan Beulich wrote:
>>> +#define avx512_vlen_check(lig) do { \
>>> +    switch ( evex.lr ) \
>>> +    { \
>>> +    default: \
>>> +        generate_exception(EXC_UD); \
>>> +    case 2: \
>>> +        break; \
>>> +    case 0: case 1: \
>>> +        if (!(lig)) \
>> if ( !(lig) )
> Oops. By now I've looked at this many dozen times, and I've
> never noticed the style issue.

It is amazing how you get used to the code when you stare at it for
weeks on end.  In this case however, I expect it might be easier to spot
if the \ were aligned on the RHS.  (I still find aligned \ far easier to
read.)
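
E.g. (the same macro, purely as a formatting illustration, with the
spacing also fixed):

    #define avx512_vlen_check(lig) do {                \
        switch ( evex.lr )                             \
        {                                              \
        default:                                       \
            generate_exception(EXC_UD);                \
        case 2:                                        \
            break;                                     \
        case 0: case 1:                                \
            if ( !(lig) )                              \
                host_and_vcpu_must_have(avx512vl);     \
            break;                                     \
        }                                              \
    } while ( false )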

>
>>> @@ -3272,6 +3387,7 @@ x86_emulate(
>>>      b = ctxt->opcode;
>>>      d = state.desc;
>>>  #define state (&state)
>>> +    elem_bytes = 4 << evex.w;
>> evex.w isn't filled by this point, is it?  We only fill evex.lr in the
>> !evex_encoded() case AFAICT.
> Well, that's another bit of (pre-existing) trickery: When we decode
> these special prefixes (VEX, XOP, and EVEX) we first read the bytes
> into vex.raw[]. The code dealing with the EVEX case then copies
> the two vex.raw[] bytes into evex.raw[].

Oh - I was looking for that, but failed to spot it.  Where is that?

>  Thus the common fields
> (between the prefix encodings) are filled uniformly, and uses
> through vex are fine also for the EVEX case. I think this is better
> than littering around many ?: expressions.
>
> As an aside - if the above didn't work, none of the testing would
> have succeeded.

I hoped as much, but in the absence of finding any suitable code, I
thought I'd ask.

>
>>> @@ -6348,6 +6521,41 @@ x86_emulate(
>>>          ASSERT(!state->simd_size);
>>>          break;
>>>  
>>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x7e): /* vmov{d,q} xmm,r/m */
>>> +        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
>>> +                               evex.reg != 0xf || !evex.RX),
>> Are the inner brackets necessary?
> I'd be happy to drop them - I've put them there mostly for you,
> who you want whatever tool to properly deal with indentation on
> such wrapped lines. Since I don't know the exact rules that tool
> follows, I may have gone too far, but then again I think the
> resulting different indentation between the two lines above and
> the next line (holding the other macro argument) isn't unhelpful.

BSD style already specifies that function parameters are aligned
vertically after the (, so this case is fine without.

The problematic case is bare block continuations (especially on return
statements) where the BSD style is 4 spaces in from the outer block.
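
i.e. the example above is fine as:

    generate_exception_if(evex.lr || evex.opmsk || evex.br ||
                          evex.reg != 0xf || !evex.RX,
                          EXC_UD);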

>
>>> @@ -8819,6 +9070,44 @@ x86_emulate(
>>>                                    !is_aligned(ea.mem.seg, ea.mem.off, op_bytes,
>>>                                                ctxt, ops),
>>>                                    EXC_GP, 0);
>>> +
>>> +            if ( evex.br )
>>> +            {
>>> +                ASSERT((d & DstMask) != DstMem);
>>> +                op_bytes = elem_bytes;
>>> +            }
>>> +            if ( evex.opmsk )
>>> +            {
>>> +                ASSERT(!(op_bytes % elem_bytes));
>>> +                full = ~0ULL >> (64 - op_bytes / elem_bytes);
>> I think we want a path which checks elem_bytes != 0 which is
>> release-build safe.  This feels like an XSA waiting to happen.
> Nothing _ever_ sets (or should set) elem_bytes to zero, and it gets
> initialized to a non-zero value right in this patch. When writing this
> code I indeed did think about adding a check against zero, but I
> couldn't figure what half way sensible action (other than BUG()ing)
> I could take in that case. Yet BUG() is in no way better than hitting
> #DE on the division.

An { ASSERT_UNREACHABLE(); return X86EMUL_UNHANDLEABLE; } block would be
better than BUG(), because at least it won't crash a release
hypervisor.  (At least being unsigned division, we don't have the -1
case to worry about).
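
i.e. a sketch of what I have in mind:

    if ( !elem_bytes )
    {
        ASSERT_UNREACHABLE();
        return X86EMUL_UNHANDLEABLE;
    }
    full = ~0ULL >> (64 - op_bytes / elem_bytes);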

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v4 05/44] x86emul: support basic AVX512 moves
  2018-11-14 16:26         ` Andrew Cooper
@ 2018-11-15  9:54           ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-15  9:54 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu

>>> On 14.11.18 at 17:26, <andrew.cooper3@citrix.com> wrote:
> On 14/11/2018 14:35, Jan Beulich wrote:
>>>>> On 13.11.18 at 18:12, <andrew.cooper3@citrix.com> wrote:
>>> On 25/09/2018 14:28, Jan Beulich wrote:
>>>> @@ -3272,6 +3387,7 @@ x86_emulate(
>>>>      b = ctxt->opcode;
>>>>      d = state.desc;
>>>>  #define state (&state)
>>>> +    elem_bytes = 4 << evex.w;
>>> evex.w isn't filled by this point, is it?  We only fill evex.lr in the
>>> !evex_encoded() case AFAICT.
>> Well, that's another bit of (pre-existing) trickery: When we decode
>> these special prefixes (VEX, XOP, and EVEX) we first read the bytes
>> into vex.raw[]. The code dealing with the EVEX case then copies
>> the two vex.raw[] bytes into evex.raw[].
> 
> Oh - I was looking for that, but failed to spot it.  Where is that?

                    switch ( b )
                    {
                    case 0x62:
                        opcode = X86EMUL_OPC_EVEX_;
                        evex.raw[0] = vex.raw[0];
                        evex.raw[1] = vex.raw[1];
                        evex.raw[2] = insn_fetch_type(uint8_t);

(in the middle of ModRM processing in x86_decode())

>>>> @@ -6348,6 +6521,41 @@ x86_emulate(
>>>>          ASSERT(!state->simd_size);
>>>>          break;
>>>>  
>>>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
>>>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x7e): /* vmov{d,q} xmm,r/m */
>>>> +        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
>>>> +                               evex.reg != 0xf || !evex.RX),
>>> Are the inner brackets necessary?
>> I'd be happy to drop them - I've put them there mostly for you,
>> since you want whatever tool you use to deal properly with
>> indentation on such wrapped lines. Since I don't know the exact
>> rules that tool follows, I may have gone too far, but then again
>> I think the resulting difference in indentation between the two
>> lines above and the next line (holding the other macro argument)
>> isn't unhelpful.
> 
> BSD style already specifies that function parameters are aligned
> vertically after the (, so this case is fine without them.
> 
> The problematic case is bare block continuations (especially on return
> statements) where the BSD style is 4 spaces in from the outer block.

So considering the second aspect I've mentioned, would you
nevertheless prefer the parens here (and perhaps elsewhere)
to be dropped?

>>>> @@ -8819,6 +9070,44 @@ x86_emulate(
>>>>                                    !is_aligned(ea.mem.seg, ea.mem.off, op_bytes,
>>>>                                                ctxt, ops),
>>>>                                    EXC_GP, 0);
>>>> +
>>>> +            if ( evex.br )
>>>> +            {
>>>> +                ASSERT((d & DstMask) != DstMem);
>>>> +                op_bytes = elem_bytes;
>>>> +            }
>>>> +            if ( evex.opmsk )
>>>> +            {
>>>> +                ASSERT(!(op_bytes % elem_bytes));
>>>> +                full = ~0ULL >> (64 - op_bytes / elem_bytes);
>>> I think we want a release-build-safe check that elem_bytes != 0.
>>> This feels like an XSA waiting to happen.
>> Nothing _ever_ sets (or should set) elem_bytes to zero, and it gets
>> initialized to a non-zero value right in this patch. When writing this
>> code I did indeed think about adding a check against zero, but I
>> couldn't figure out what halfway-sensible action (other than BUG()ing)
>> I could take in that case. Yet BUG() is in no way better than hitting
>> #DE on the division.
> 
> An { ASSERT_UNREACHABLE(); return X86EMUL_UNHANDLEABLE; } block would
> be better than BUG(), because at least it won't crash a release
> hypervisor.  (And being unsigned division, we at least don't have the
> -1 case to worry about.)

Will do, perhaps introducing (in a prereq patch) an IMPOSSIBLE()
construct abstracting this away, as there's already one such
instance in the code.
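
Something along the lines of (a sketch, naming subject to change):

    #define IMPOSSIBLE(p)                                   \
    do {                                                    \
        if ( unlikely(p) )                                  \
        {                                                   \
            ASSERT_UNREACHABLE();                           \
            goto unhandleable;                              \
        }                                                   \
    } while (0)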

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 00/47] x86emul: fair parts of AVX512 support
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (8 preceding siblings ...)
  2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
@ 2018-11-19 10:00 ` Jan Beulich
  2018-11-19 10:13   ` [PATCH v5 01/47] x86emul: introduce IMPOSSIBLE() Jan Beulich
                     ` (46 more replies)
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                   ` (2 subsequent siblings)
  12 siblings, 47 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

01: introduce IMPOSSIBLE()
02: support basic AVX512 moves
03: test for correct EVEX Disp8 scaling
04: also allow running the 32-bit harness on a 64-bit distro
05: use AVX512 logic for emulating V{,P}MASKMOV*
06: support AVX512F legacy-equivalent arithmetic FP insns
07: support AVX512DQ logic FP insns
08: support basic AVX512F FP compare insns
09: support AVX512F misc legacy-equivalent FP insns
10: support AVX512F fused-multiply-add insns
11: support AVX512F legacy-equivalent logic insns
12: support AVX512{F,DQ} FP broadcast insns
13: support AVX512F v{,u}comis{d,s} insns
14: support AVX512{F,BW} packed integer compare insns
15: support AVX512{F,BW} packed integer arithmetic insns
16: use simd_128 also for legacy vector shift insns
17: support AVX512{F,BW} shift/rotate insns
18: support AVX512{F,BW,DQ} extract insns
19: support AVX512{F,BW,DQ} insert insns
20: basic AVX512F testing
21: support AVX512{F,BW,DQ} integer broadcast insns
22: basic AVX512VL testing
23: support AVX512{F,BW} zero- and sign-extending moves
24: support AVX512{F,BW} down conversion moves
25: support AVX512{F,BW} integer unpack insns
26: support AVX512{F,BW,_VBMI} full permute insns
27: support AVX512{F,BW} integer shuffle insns
28: support AVX512{BW,DQ} mask move insns
29: basic AVX512BW testing
30: basic AVX512DQ testing
31: support AVX512F move high/low insns
32: support AVX512F move duplicate insns
33: support AVX512{F,BW,VBMI} permute insns
34: support AVX512BW pack insns
35: support AVX512F floating-point conversion insns
36: support AVX512F legacy-equivalent packed int/FP conversion insns
37: support AVX512F legacy-equivalent scalar int/FP conversion insns
38: support AVX512DQ packed quad-int/FP conversion insns
39: support AVX512{F,DQ} uint-to-FP conversion insns
40: support AVX512{F,DQ} FP-to-uint conversion insns
41: support remaining AVX512F legacy-equivalent insns
42: support remaining AVX512BW legacy-equivalent insns
43: support AVX512{F,ER} reciprocal insns
44: support AVX512F floating point manipulation insns
45: support AVX512DQ floating point manipulation insns
46: support AVX512{F,_VBMI2} compress/expand insns
47: support remaining misc AVX512{F,BW} insns

This adds support for all AVX512BW, AVX512DQ, and AVX512ER insns
as well as everything in AVX512F except for the scatter/gather ones.
For a few other extensions, parts get implemented where they have
close relatives among the above sets. Where applicable, AVX512VL
variants get supported along with their base flavors.

While I'm not aware of any conflicts, in case of doubt this series
applies on top of the still un-acked v2 of "x86emul: VME/PVI mode
fixes".

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 01/47] x86emul: introduce IMPOSSIBLE()
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
@ 2018-11-19 10:13   ` Jan Beulich
  2018-11-19 18:11     ` Andrew Cooper
  2018-11-19 10:13   ` [PATCH v5 02/47] x86emul: support basic AVX512 moves Jan Beulich
                     ` (45 subsequent siblings)
  46 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:13 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This abstracts away the common pattern of pairing ASSERT_UNREACHABLE()
(for debug builds) with a return value of X86EMUL_UNHANDLEABLE (for
release builds).
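
This allows shrinking open-coded sequences like

    if ( (d & DstMask) != DstMem )
    {
        ASSERT_UNREACHABLE();
        rc = X86EMUL_UNHANDLEABLE;
        goto done;
    }

down to just IMPOSSIBLE((d & DstMask) != DstMem).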

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1017,6 +1017,15 @@ do {
     if ( rc ) goto done;                                \
 } while (0)
 
+#define IMPOSSIBLE(p)                                   \
+do {                                                    \
+    if ( unlikely(p) )                                  \
+    {                                                   \
+        ASSERT_UNREACHABLE();                           \
+        goto unhandleable;                              \
+    }                                                   \
+} while (0)
+
 static inline int mkec(uint8_t e, int32_t ec, ...)
 {
     return (e < 32 && ((1u << e) & EXC_HAS_EC)) ? ec : X86_EVENT_NO_EC;
@@ -8828,12 +8837,7 @@ x86_emulate(
                 dst.type = OP_NONE;
                 break;
             default:
-                if ( (d & DstMask) != DstMem )
-                {
-                    ASSERT_UNREACHABLE();
-                    rc = X86EMUL_UNHANDLEABLE;
-                    goto done;
-                }
+                IMPOSSIBLE((d & DstMask) != DstMem);
                 break;
             }
             if ( (d & DstMask) == DstMem )
@@ -8960,9 +8964,11 @@ x86_emulate(
             stub.func);
     generate_exception_if(stub_exn.info.fields.trapnr == EXC_UD, EXC_UD);
     domain_crash(current->domain);
+#endif
+
+ unhandleable:
     rc = X86EMUL_UNHANDLEABLE;
     goto done;
-#endif
 }
 
 #undef op_bytes





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 02/47] x86emul: support basic AVX512 moves
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
  2018-11-19 10:13   ` [PATCH v5 01/47] x86emul: introduce IMPOSSIBLE() Jan Beulich
@ 2018-11-19 10:13   ` Jan Beulich
  2018-11-19 18:35     ` Andrew Cooper
  2018-11-19 10:14   ` [PATCH v5 03/47] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
                     ` (44 subsequent siblings)
  46 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:13 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note: SDM Vol 2 rev 067 is not really consistent about EVEX.L'L for LIG
      insns - the only place where this is made explicit is a table in
      the section titled "Vector Length Orthogonality": while values 0,
      1, and 2 are tolerated, a value of 3 uniformly leads to #UD.
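
Expressed as a check (this is what the avx512_vlen_check() helper added
here implements; "lig" identifies insns which ignore EVEX.L'L):

    if ( evex.lr == 3 )               /* L'L == 11b: always #UD */
        generate_exception(EXC_UD);
    else if ( evex.lr != 2 && !lig )  /* 128-/256-bit forms need AVX512VL */
        host_and_vcpu_must_have(avx512vl);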

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Use IMPOSSIBLE() to guard against division by zero. Correct style.
    Re-base.
v4: Introduce d8s_dq64 to deal with 32-bit mode VMOVD with EVEX.W set.
    Adjust a comment.
v3: Restrict k-reg reading to insns with memory operand. Shrink scope of
    "disp8scale".
v2: Move "full" into more narrow scope.

--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -1985,6 +1985,53 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovq %xmm1,32(%edx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_mem);
+
+        asm volatile ( "pcmpgtb %%xmm1, %%xmm1\n"
+                       put_insn(evex_vmovq_to_mem, "%{evex%} vmovq %%xmm1, 32(%0)")
+                       :: "d" (NULL) );
+
+        memset(res, 0xdb, 64);
+        set_insn(evex_vmovq_to_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_mem) ||
+             *((uint64_t *)res + 4) ||
+             memcmp(res, res + 10, 24) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovq 32(%edx),%xmm0...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_from_mem);
+
+        asm volatile ( "pcmpeqb %%xmm0, %%xmm0\n"
+                       put_insn(evex_vmovq_from_mem, "%{evex%} vmovq 32(%0), %%xmm0")
+                       :: "d" (NULL) );
+
+        set_insn(evex_vmovq_from_mem);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_from_mem) )
+            goto fail;
+        asm ( "vmovq %1, %%xmm1\n\t"
+              "vpcmpeqq %%zmm0, %%zmm1, %%k0\n"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movdqu %xmm2,(%ecx)...");
     if ( stack_exec && cpu_has_sse2 )
     {
@@ -2085,6 +2132,118 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vmovdqu32 %zmm2,(%ecx){%k1}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovdqu32_to_mem);
+
+        memset(res, 0x55, 128);
+
+        asm volatile ( "vpcmpeqd %%ymm2, %%ymm2, %%ymm2\n\t"
+                       "kmovw %1,%%k1\n"
+                       put_insn(vmovdqu32_to_mem,
+                                "vmovdqu32 %%zmm2, (%0)%{%%k1%}")
+                       :: "c" (NULL), "rm" (res[0]) );
+        set_insn(vmovdqu32_to_mem);
+
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || memcmp(res + 16, res + 24, 32) ||
+             !check_eip(vmovdqu32_to_mem) )
+            goto fail;
+
+        res[16] = ~0; res[18] = ~0; res[20] = ~0; res[22] = ~0;
+        res[24] =  0; res[26] =  0; res[28] =  0; res[30] =  0;
+        if ( memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu32 64(%edx),%zmm2{%k2}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovdqu32_from_mem);
+
+        asm volatile ( "knotw %%k1, %%k2\n"
+                       put_insn(vmovdqu32_from_mem,
+                                "vmovdqu32 64(%0), %%zmm2%{%%k2%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovdqu32_from_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovdqu32_from_mem) )
+            goto fail;
+        asm ( "vpcmpeqd %1, %%zmm2, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[0]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu16 %zmm3,(%ecx){%k1}...");
+    if ( stack_exec && cpu_has_avx512bw )
+    {
+        decl_insn(vmovdqu16_to_mem);
+
+        memset(res, 0x55, 128);
+
+        asm volatile ( "vpcmpeqw %%ymm3, %%ymm3, %%ymm3\n\t"
+                       "kmovd %1,%%k1\n"
+                       put_insn(vmovdqu16_to_mem,
+                                "vmovdqu16 %%zmm3, (%0)%{%%k1%}")
+                       :: "c" (NULL), "rm" (res[0]) );
+        set_insn(vmovdqu16_to_mem);
+
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || memcmp(res + 16, res + 24, 32) ||
+             !check_eip(vmovdqu16_to_mem) )
+            goto fail;
+
+        for ( i = 16; i < 24; ++i )
+            res[i] |= 0x0000ffff;
+        for ( ; i < 32; ++i )
+            res[i] &= 0xffff0000;
+        if ( memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovdqu16 64(%edx),%zmm3{%k2}...");
+    if ( stack_exec && cpu_has_avx512bw )
+    {
+        decl_insn(vmovdqu16_from_mem);
+
+        asm volatile ( "knotd %%k1, %%k2\n"
+                       put_insn(vmovdqu16_from_mem,
+                                "vmovdqu16 64(%0), %%zmm3%{%%k2%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovdqu16_from_mem);
+        regs.ecx = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovdqu16_from_mem) )
+            goto fail;
+        asm ( "vpcmpeqw %1, %%zmm3, %%k0\n\t"
+              "kmovd %%k0, %0" : "=r" (rc) : "m" (res[0]) );
+        if ( rc != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movsd %xmm5,(%ecx)...");
     memset(res, 0x77, 64);
     memset(res + 10, 0x66, 8);
@@ -2186,6 +2345,71 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vmovsd %xmm5,16(%ecx){%k3}...");
+    memset(res, 0x88, 128);
+    memset(res + 20, 0x77, 8);
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovsd_masked_to_mem);
+
+        asm volatile ( "vbroadcastsd %0, %%ymm5\n\t"
+                       "kxorw %%k3, %%k3, %%k3\n"
+                       put_insn(vmovsd_masked_to_mem,
+                                "vmovsd %%xmm5, 16(%1)%{%%k3%}")
+                       :: "m" (res[20]), "c" (NULL) );
+
+        set_insn(vmovsd_masked_to_mem);
+        regs.ecx = 0;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(vmovsd_masked_to_mem) )
+            goto fail;
+
+        asm volatile ( "kmovw %0, %%k3\n" :: "m" (res[20]) );
+
+        set_insn(vmovsd_masked_to_mem);
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(vmovsd_masked_to_mem) ||
+             memcmp(res, res + 16, 64) )
+            goto fail;
+
+        printf("okay\n");
+    }
+    else
+    {
+        printf("skipped\n");
+        memset(res + 4, 0x77, 8);
+    }
+
+    printf("%-40s", "Testing vmovaps (%edx),%zmm7{%k3}{z}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vmovaps_masked_from_mem);
+
+        asm volatile ( "vpcmpeqd %%xmm7, %%xmm7, %%xmm7\n\t"
+                       "vbroadcastss %%xmm7, %%zmm7\n"
+                       put_insn(vmovaps_masked_from_mem,
+                                "vmovaps (%0), %%zmm7%{%%k3%}%{z%}")
+                       :: "d" (NULL) );
+
+        set_insn(vmovaps_masked_from_mem);
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vmovaps_masked_from_mem) )
+            goto fail;
+        asm ( "vcmpeqps %1, %%zmm7, %%k0\n\t"
+              "vxorps %%xmm0, %%xmm0, %%xmm0\n\t"
+              "vcmpeqps %%zmm0, %%zmm7, %%k1\n\t"
+              "kxorw %%k1, %%k0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[16]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movd %mm3,32(%ecx)...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2341,6 +2565,55 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovd %xmm3,32(%ecx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_to_mem);
+
+        asm volatile ( "pcmpeqb %%xmm3, %%xmm3\n"
+                       put_insn(evex_vmovd_to_mem,
+                                "%{evex%} vmovd %%xmm3, 32(%0)")
+                       :: "c" (NULL) );
+
+        memset(res, 0xbd, 64);
+        set_insn(evex_vmovd_to_mem);
+        regs.ecx = (unsigned long)res;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovd_to_mem) ||
+             res[8] + 1 ||
+             memcmp(res, res + 9, 28) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovd 32(%ecx),%xmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_from_mem);
+
+        asm volatile ( "pcmpeqb %%xmm4, %%xmm4\n"
+                       put_insn(evex_vmovd_from_mem,
+                                "%{evex%} vmovd 32(%0), %%xmm4")
+                       :: "c" (NULL) );
+
+        set_insn(evex_vmovd_from_mem);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovd_from_mem) )
+            goto fail;
+        asm ( "vmovd %1, %%xmm0\n\t"
+              "vpcmpeqd %%zmm4, %%zmm0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movd %mm3,%ebx...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2507,6 +2780,57 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovd %xmm2,%ebx...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_to_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpeqb %%xmm2, %%xmm2\n"
+                       put_insn(evex_vmovd_to_reg,
+                                "%{evex%} vmovd %%xmm2, %%ebx")
+                       :: );
+
+        set_insn(evex_vmovd_to_reg);
+#ifdef __x86_64__
+        regs.rbx = 0xbdbdbdbdbdbdbdbdUL;
+#else
+        regs.ebx = 0xbdbdbdbdUL;
+#endif
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(evex_vmovd_to_reg) ||
+             regs.ebx != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing {evex} vmovd %ebx,%xmm1...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovd_from_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpgtb %%xmm1, %%xmm1\n"
+                       put_insn(evex_vmovd_from_reg,
+                                "%{evex%} vmovd %%ebx, %%xmm1")
+                       :: );
+
+        set_insn(evex_vmovd_from_reg);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( (rc != X86EMUL_OKAY) || !check_eip(evex_vmovd_from_reg) )
+            goto fail;
+        asm ( "vmovd %1, %%xmm0\n\t"
+              "vpcmpeqd %%zmm1, %%zmm0, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "m" (res[8]) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #ifdef __x86_64__
     printf("%-40s", "Testing movq %mm3,32(%ecx)...");
     if ( stack_exec && cpu_has_mmx )
@@ -2584,6 +2908,36 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing {evex} vmovq %xmm11,32(%ecx)...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_mem2);
+
+        asm volatile ( "pcmpeqb %%xmm11, %%xmm11\n"
+#if 0 /* This may not work, as the assembler might pick opcode D6. */
+                       put_insn(evex_vmovq_to_mem2,
+                                "{evex} vmovq %%xmm11, 32(%0)")
+#else
+                       put_insn(evex_vmovq_to_mem2,
+                                ".byte 0x62, 0xf1, 0xfd, 0x08, 0x7e, 0x49, 0x04")
+#endif
+                       :: "c" (NULL) );
+
+        memset(res, 0xbd, 64);
+        set_insn(evex_vmovq_to_mem2);
+        regs.ecx = (unsigned long)res;
+        regs.edx = 0;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_mem2) ||
+             *((long *)res + 4) + 1 ||
+             memcmp(res, res + 10, 24) ||
+             memcmp(res, res + 6, 8) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing movq %mm3,%rbx...");
     if ( stack_exec && cpu_has_mmx )
     {
@@ -2643,6 +2997,28 @@ int main(int argc, char **argv)
     }
     else
         printf("skipped\n");
+
+    printf("%-40s", "Testing vmovq %xmm22,%rbx...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovq_to_reg);
+
+        /* See comment next to movd above. */
+        asm volatile ( "pcmpeqq %%xmm2, %%xmm2\n\t"
+                       "vmovq %%xmm2, %%xmm22\n"
+                       put_insn(evex_vmovq_to_reg, "vmovq %%xmm22, %%rbx")
+                       :: );
+
+        set_insn(evex_vmovq_to_reg);
+        regs.rbx = 0xbdbdbdbdbdbdbdbdUL;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovq_to_reg) ||
+             regs.rbx + 1 )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
 #endif
 
     printf("%-40s", "Testing maskmovq %mm4,%mm4...");
@@ -2812,6 +3188,32 @@ int main(int argc, char **argv)
             goto fail;
         printf("okay\n");
     }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing vmovntdqa 64(%ecx),%zmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vmovntdqa);
+
+        asm volatile ( "vpxor %%xmm4, %%xmm4, %%xmm4\n"
+                       put_insn(evex_vmovntdqa, "vmovntdqa 64(%0), %%zmm4")
+                       :: "c" (NULL) );
+
+        set_insn(evex_vmovntdqa);
+        memset(res, 0x55, 192);
+        memset(res + 16, 0xff, 64);
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vmovntdqa) )
+            goto fail;
+        asm ( "vpbroadcastd %1, %%zmm2\n\t"
+              "vpcmpeqd %%zmm4, %%zmm2, %%k0\n\t"
+              "kmovw %%k0, %0" : "=r" (rc) : "0" (~0) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
     else
         printf("skipped\n");
 
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -222,6 +222,7 @@ int emul_test_get_fpu(
         if ( cpu_has_avx )
             break;
     case X86EMUL_FPU_opmask:
+    case X86EMUL_FPU_zmm:
         if ( cpu_has_avx512f )
             break;
     default:
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -132,6 +132,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
+#define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -243,9 +243,27 @@ enum simd_opsize {
 };
 typedef uint8_t simd_opsize_t;
 
+enum disp8scale {
+    /* Values 0 ... 4 are explicit sizes. */
+    d8s_bw = 5,
+    d8s_dq,
+    /* EVEX.W ignored outside of 64-bit mode */
+    d8s_dq64,
+    /*
+     * All further values must strictly be last and in the order
+     * given so that arithmetic on the values works.
+     */
+    d8s_vl,
+    d8s_vl_by_2,
+    d8s_vl_by_4,
+    d8s_vl_by_8,
+};
+typedef uint8_t disp8scale_t;
+
 static const struct twobyte_table {
     opcode_desc_t desc;
-    simd_opsize_t size;
+    simd_opsize_t size:4;
+    disp8scale_t d8s:4;
 } twobyte_table[256] = {
     [0x00] = { ModRM },
     [0x01] = { ImplicitOps|ModRM },
@@ -260,8 +278,8 @@ static const struct twobyte_table {
     [0x0d] = { ImplicitOps|ModRM },
     [0x0e] = { ImplicitOps },
     [0x0f] = { ModRM|SrcImmByte },
-    [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp },
-    [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp },
+    [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
+    [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
     [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
@@ -270,10 +288,10 @@ static const struct twobyte_table {
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
     [0x20 ... 0x21] = { DstMem|SrcImplicit|ModRM },
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
-    [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp },
-    [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp },
+    [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
+    [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp },
+    [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp },
     [0x30 ... 0x35] = { ImplicitOps },
@@ -292,8 +310,8 @@ static const struct twobyte_table {
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
-    [0x6e] = { DstImplicit|SrcMem|ModRM|Mov },
-    [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int },
+    [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
+    [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
@@ -301,8 +319,8 @@ static const struct twobyte_table {
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x7e] = { DstMem|SrcImplicit|ModRM|Mov },
-    [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int },
+    [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
+    [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x80 ... 0x8f] = { DstImplicit|SrcImm },
     [0x90 ... 0x9f] = { ByteOp|DstMem|SrcNone|ModRM|Mov },
     [0xa0 ... 0xa1] = { ImplicitOps|Mov },
@@ -344,14 +362,14 @@ static const struct twobyte_table {
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
-    [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int },
+    [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -406,6 +424,7 @@ static const struct ext0f38_table {
     uint8_t to_mem:1;
     uint8_t two_op:1;
     uint8_t vsib:1;
+    disp8scale_t d8s:4;
 } ext0f38_table[256] = {
     [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
@@ -418,7 +437,7 @@ static const struct ext0f38_table {
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
-    [0x2a] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_other },
     [0x2e ... 0x2f] = { .simd_size = simd_other, .to_mem = 1 },
@@ -656,6 +675,22 @@ union evex {
     };
 };
 
+#define EVEX_PFX_BYTES 4
+#define init_evex(stub) ({ \
+    uint8_t *buf_ = get_stub(stub); \
+    buf_[0] = 0x62; \
+    buf_ + EVEX_PFX_BYTES; \
+})
+
+#define copy_EVEX(ptr, evex) ({ \
+    if ( !mode_64bit() ) \
+        (evex).reg |= 8; \
+    (ptr)[1 - EVEX_PFX_BYTES] = (evex).raw[0]; \
+    (ptr)[2 - EVEX_PFX_BYTES] = (evex).raw[1]; \
+    (ptr)[3 - EVEX_PFX_BYTES] = (evex).raw[2]; \
+    container_of((ptr) + 1 - EVEX_PFX_BYTES, typeof(evex), raw[0]); \
+})
+
 #define rep_prefix()   (vex.pfx >= vex_f3)
 #define repe_prefix()  (vex.pfx == vex_f3)
 #define repne_prefix() (vex.pfx == vex_f2)
@@ -768,6 +803,7 @@ typedef union {
     uint64_t mmx;
     uint64_t __attribute__ ((aligned(16))) xmm[2];
     uint64_t __attribute__ ((aligned(32))) ymm[4];
+    uint64_t __attribute__ ((aligned(64))) zmm[8];
 } mmval_t;
 
 /*
@@ -1201,6 +1237,11 @@ static int _get_fpu(
 
     switch ( type )
     {
+    case X86EMUL_FPU_zmm:
+        if ( !(xcr0 & X86_XCR0_ZMM) || !(xcr0 & X86_XCR0_HI_ZMM) ||
+             !(xcr0 & X86_XCR0_OPMASK) )
+            return X86EMUL_UNHANDLEABLE;
+        /* fall through */
     case X86EMUL_FPU_ymm:
         if ( !(xcr0 & X86_XCR0_SSE) || !(xcr0 & X86_XCR0_YMM) )
             return X86EMUL_UNHANDLEABLE;
@@ -1787,6 +1828,7 @@ static bool vcpu_has(
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
+#define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -2160,6 +2202,65 @@ static unsigned long *decode_vex_gpr(
     return decode_gpr(regs, ~vex_reg & (mode_64bit() ? 0xf : 7));
 }
 
+static unsigned int decode_disp8scale(enum disp8scale scale,
+                                      const struct x86_emulate_state *state)
+{
+    switch ( scale )
+    {
+    case d8s_bw:
+        return state->evex.w;
+
+    default:
+        if ( scale < d8s_vl )
+            return scale;
+        if ( state->evex.br )
+        {
+    case d8s_dq:
+            return 2 + state->evex.w;
+        }
+        break;
+
+    case d8s_dq64:
+        return 2 + (state->op_bytes == 8);
+    }
+
+    switch ( state->simd_size )
+    {
+    case simd_any_fp:
+    case simd_single_fp:
+        if ( !(state->evex.pfx & VEX_PREFIX_SCALAR_MASK) )
+            break;
+        /* fall through */
+    case simd_scalar_opc:
+    case simd_scalar_vexw:
+        return 2 + state->evex.w;
+
+    case simd_128:
+        /* These should have an explicit size specified. */
+        ASSERT_UNREACHABLE();
+        return 4;
+
+    default:
+        break;
+    }
+
+    return 4 + state->evex.lr - (scale - d8s_vl);
+}
+
+#define avx512_vlen_check(lig) do { \
+    switch ( evex.lr ) \
+    { \
+    default: \
+        generate_exception(EXC_UD); \
+    case 2: \
+        break; \
+    case 0: case 1: \
+        if ( !(lig) ) \
+            host_and_vcpu_must_have(avx512vl); \
+        break; \
+    } \
+} while ( false )
+
 static bool is_aligned(enum x86_segment seg, unsigned long offs,
                        unsigned int size, struct x86_emulate_ctxt *ctxt,
                        const struct x86_emulate_ops *ops)
@@ -2406,6 +2507,7 @@ x86_decode_twobyte(
         if ( vex.pfx == vex_f3 ) /* movq xmm/m64,xmm */
         {
     case X86EMUL_OPC_VEX_F3(0, 0x7e): /* vmovq xmm/m64,xmm */
+    case X86EMUL_OPC_EVEX_F3(0, 0x7e): /* vmovq xmm/m64,xmm */
             state->desc = DstImplicit | SrcMem | TwoOp;
             state->simd_size = simd_other;
             /* Avoid the state->desc clobbering of TwoOp below. */
@@ -2476,7 +2578,7 @@ x86_decode_twobyte(
     }
 
     /*
-     * Scalar forms of most VEX-encoded TwoOp instructions have
+     * Scalar forms of most VEX-/EVEX-encoded TwoOp instructions have
      * three operands.  Those which do really have two operands
      * should have exited earlier.
      */
@@ -2841,6 +2943,8 @@ x86_decode(
 
     if ( d & ModRM )
     {
+        unsigned int disp8scale = 0;
+
         d &= ~ModRM;
 #undef ModRM /* Only its aliases are valid to use from here on. */
         modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3);
@@ -2883,6 +2987,9 @@ x86_decode(
             break;
 
         case ext_0f:
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(twobyte_table[b].d8s, state);
+
             switch ( b )
             {
             case 0x20: /* mov cr,reg */
@@ -2896,6 +3003,11 @@ x86_decode(
                  */
                 modrm_mod = 3;
                 break;
+
+            case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
+                if ( disp8scale == 2 && evex.pfx == vex_f3 )
+                    disp8scale = 3;
+                break;
             }
             break;
 
@@ -2907,6 +3019,8 @@ x86_decode(
             if ( ext0f38_table[b].vsib )
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
             break;
 
         case ext_8f09:
@@ -2975,7 +3089,7 @@ x86_decode(
                     ea.mem.off = insn_fetch_type(int16_t);
                 break;
             case 1:
-                ea.mem.off += insn_fetch_type(int8_t);
+                ea.mem.off += insn_fetch_type(int8_t) << disp8scale;
                 break;
             case 2:
                 ea.mem.off += insn_fetch_type(int16_t);
@@ -3034,7 +3148,7 @@ x86_decode(
                 pc_rel = mode_64bit();
                 break;
             case 1:
-                ea.mem.off += insn_fetch_type(int8_t);
+                ea.mem.off += insn_fetch_type(int8_t) << disp8scale;
                 break;
             case 2:
                 ea.mem.off += insn_fetch_type(int32_t);
@@ -3235,10 +3349,11 @@ x86_emulate(
     struct x86_emulate_state state;
     int rc, cr4_rc;
     uint8_t b, d, *opc = NULL;
-    unsigned int first_byte = 0, insn_bytes = 0;
+    unsigned int first_byte = 0, elem_bytes, insn_bytes = 0;
+    uint64_t op_mask = ~0ULL;
     bool singlestep = (_regs.eflags & X86_EFLAGS_TF) &&
 	    !is_branch_step(ctxt, ops);
-    bool sfence = false;
+    bool sfence = false, fault_suppression = false;
     struct operand src = { .reg = PTR_POISON };
     struct operand dst = { .reg = PTR_POISON };
     unsigned long cr4 = 0;
@@ -3286,6 +3401,7 @@ x86_emulate(
     b = ctxt->opcode;
     d = state.desc;
 #define state (&state)
+    elem_bytes = 4 << evex.w;
 
     generate_exception_if(state->not_64bit && mode_64bit(), EXC_UD);
 
@@ -3360,6 +3476,28 @@ x86_emulate(
         break;
     }
 
+    /* With a memory operand, fetch the mask register in use (if any). */
+    if ( ea.type == OP_MEM && evex.opmsk )
+    {
+        uint8_t *stb = get_stub(stub);
+
+        /* KMOV{W,Q} %k<n>, (%rax) */
+        stb[0] = 0xc4;
+        stb[1] = 0xe1;
+        stb[2] = cpu_has_avx512bw ? 0xf8 : 0x78;
+        stb[3] = 0x91;
+        stb[4] = evex.opmsk << 3;
+        insn_bytes = 5;
+        stb[5] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+
+        insn_bytes = 0;
+        put_stub(stub);
+
+        fault_suppression = true;
+    }
+
     /* Decode (but don't fetch) the destination operand: register or memory. */
     switch ( d & DstMask )
     {
@@ -5716,6 +5854,41 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 2;
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2b): /* vmovntp{s,d} [xyz]mm,mem */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk, EXC_UD);
+        sfence = true;
+        fault_suppression = false;
+        /* fall through */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x10): /* vmovup{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x10): /* vmovs{s,d} mem,xmm{k} */
+                                            /* vmovs{s,d} xmm,xmm,xmm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x11): /* vmovup{s,d} [xyz]mm,[xyz]mm/mem{k} */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x11): /* vmovs{s,d} xmm,mem{k} */
+                                            /* vmovs{s,d} xmm,xmm,xmm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x28): /* vmovap{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x29): /* vmovap{s,d} [xyz]mm,[xyz]mm/mem{k} */
+        /* vmovs{s,d} to/from memory have only two operands. */
+        if ( (b & ~1) == 0x10 && ea.type == OP_MEM )
+            d |= TwoOp;
+        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+    simd_zmm:
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            evex.b = 1;
+            opc[1] &= 0x38;
+        }
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        break;
+
     case X86EMUL_OPC_66(0x0f, 0x12):       /* movlpd m64,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x12):   /* vmovlpd m64,xmm,xmm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0x13):     /* movlp{s,d} xmm,m64 */
@@ -6355,6 +6528,41 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7e): /* vmov{d,q} xmm,r/m */
+        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert memory/GPR operand to (%rAX). */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0x38;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
+        dst.val = src.val;
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7e): /* vmovq xmm/m64,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
+        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.br,
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        d |= TwoOp;
+        op_bytes = 8;
+        goto simd_zmm;
+
     case X86EMUL_OPC_66(0x0f, 0xe7):     /* movntdq xmm,m128 */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe7): /* vmovntdq {x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
@@ -6375,6 +6583,30 @@ x86_emulate(
             goto simd_0f_avx;
         goto simd_0f_sse2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe7): /* vmovntdq [xyz]mm,mem */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk || evex.w,
+                              EXC_UD);
+        sfence = true;
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6f): /* vmovdqa{32,64} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x6f): /* vmovdqu{32,64} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7f): /* vmovdqa{32,64} [xyz]mm,[xyz]mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7f): /* vmovdqu{32,64} [xyz]mm,[xyz]mm/mem{k} */
+    vmovdqa:
+        generate_exception_if(evex.br, EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x6f): /* vmovdqu{8,16} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x7f): /* vmovdqu{8,16} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512bw);
+        elem_bytes = 1 << evex.w;
+        goto vmovdqa;
+
     case X86EMUL_OPC_VEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
         generate_exception_if(vex.l, EXC_UD);
         d |= TwoOp;
@@ -7739,6 +7971,15 @@ x86_emulate(
         }
         goto movdqa;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2a): /* vmovntdqa mem,[xyz]mm */
+        generate_exception_if(ea.type != OP_MEM || evex.opmsk || evex.w,
+                              EXC_UD);
+        /* Ignore the non-temporal hint for now, using vmovdqa32 instead. */
+        asm volatile ( "mfence" ::: "memory" );
+        b = 0x6f;
+        evex.opcx = vex_0f;
+        goto vmovdqa;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2c): /* vmaskmovps mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2d): /* vmaskmovpd mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2e): /* vmaskmovps {x,y}mm,{x,y}mm,mem */
@@ -8792,17 +9033,27 @@ x86_emulate(
     else if ( state->simd_size )
     {
         generate_exception_if(!op_bytes, EXC_UD);
-        generate_exception_if(vex.opcx && (d & TwoOp) && vex.reg != 0xf,
+        generate_exception_if((vex.opcx && (d & TwoOp) &&
+                               (vex.reg != 0xf || (evex_encoded() && !evex.RX))),
                               EXC_UD);
 
         if ( !opc )
             BUG();
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+        {
+            opc[insn_bytes - EVEX_PFX_BYTES] = 0xc3;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            opc[insn_bytes - PFX_BYTES] = 0xc3;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
 
         if ( ea.type == OP_MEM )
         {
             uint32_t mxcsr = 0;
+            uint64_t full = 0;
 
             if ( op_bytes < 16 ||
                  (vex.opcx
@@ -8824,6 +9075,45 @@ x86_emulate(
                                   !is_aligned(ea.mem.seg, ea.mem.off, op_bytes,
                                               ctxt, ops),
                                   EXC_GP, 0);
+
+            IMPOSSIBLE(elem_bytes <= 0);
+            if ( evex.br )
+            {
+                ASSERT((d & DstMask) != DstMem);
+                op_bytes = elem_bytes;
+            }
+            if ( evex.opmsk )
+            {
+                ASSERT(!(op_bytes % elem_bytes));
+                full = ~0ULL >> (64 - op_bytes / elem_bytes);
+                op_mask &= full;
+            }
+            if ( fault_suppression )
+            {
+                if ( !op_mask )
+                    goto simd_no_mem;
+                if ( !evex.br )
+                {
+                    first_byte = __builtin_ctzll(op_mask);
+                    op_mask >>= first_byte;
+                    full >>= first_byte;
+                    first_byte *= elem_bytes;
+                    op_bytes = (64 - __builtin_clzll(op_mask)) * elem_bytes;
+                }
+            }
+            /*
+             * Independent of fault suppression we may need to read (parts of)
+             * the memory operand for the purpose of merging without splitting
+             * the write below into multiple ones. Note that the EVEX.Z check
+             * here isn't strictly needed, due to there not currently being
+             * any instructions allowing zeroing-merging on memory writes (and
+             * we raise #UD during DstMem processing far above in this case),
+             * yet conceptually the read is then unnecessary.
+             */
+            if ( evex.opmsk && !evex.z && (d & DstMask) == DstMem &&
+                 op_mask != full )
+                d = (d & ~SrcMask) | SrcMem;
+
             switch ( d & SrcMask )
             {
             case SrcMem:
@@ -8865,7 +9155,10 @@ x86_emulate(
             }
         }
         else
+        {
+        simd_no_mem:
             dst.type = OP_NONE;
+        }
 
         /* {,v}maskmov{q,dqu}, as an exception, uses rDI. */
         if ( likely((ctxt->opcode & ~(X86EMUL_OPC_PFX_MASK |
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -171,6 +171,7 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_xmm, /* SSE instruction set (%xmm0-%xmm7/15) */
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
     X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
+    X86EMUL_FPU_zmm, /* AVX512 instruction set (%zmm0-%zmm7/31) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -105,6 +105,7 @@
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
+#define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 03/47] x86emul: test for correct EVEX Disp8 scaling
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
  2018-11-19 10:13   ` [PATCH v5 01/47] x86emul: introduce IMPOSSIBLE() Jan Beulich
  2018-11-19 10:13   ` [PATCH v5 02/47] x86emul: support basic AVX512 moves Jan Beulich
@ 2018-11-19 10:14   ` Jan Beulich
  2018-11-19 18:35     ` Andrew Cooper
  2018-11-19 10:14   ` [PATCH v5 04/47] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
                     ` (43 subsequent siblings)
  46 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:14 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Besides the already existing tests (which are going to be extended once
respective ISA extension support is complete), let's also ensure for
every individual insn that its Disp8 scaling (and memory access width)
is correct.
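
As a reminder of the property under test: EVEX encodings scale 8-bit
displacements by an insn-specific factor before use, per the emulator's

    ea.mem.off += insn_fetch_type(int8_t) << disp8scale;

adjustment. For example a full 512-bit vmovdqa32 access has a scaling
factor of 64, so a raw Disp8 of 1 must result in an access at bytes
64-127; accesses below the scaling factor or at/above twice the scaling
factor indicate a scaling (or access width) bug, which is exactly what
the accessed[] tracking added here flags.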

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Fix upper bound of first access check loop. Add comments. Re-base.
v4: Introduce ESZ_d_WIG.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -144,7 +144,7 @@ $(addsuffix .h,$(SIMD) $(FMA) $(SG)): si
 
 xop.h: simd-fma.c
 
-$(TARGET): x86-emulate.o cpuid.o test_x86_emulator.o wrappers.o
+$(TARGET): x86-emulate.o cpuid.o test_x86_emulator.o evex-disp8.o wrappers.o
 	$(HOSTCC) $(HOSTCFLAGS) -o $@ $^
 
 .PHONY: clean
@@ -171,7 +171,7 @@ x86.h := $(addprefix $(XEN_ROOT)/tools/i
                      x86-vendors.h x86-defns.h msr-index.h)
 x86_emulate.h := x86-emulate.h x86_emulate/x86_emulate.h $(x86.h)
 
-x86-emulate.o test_x86_emulator.o wrappers.o: %.o: %.c $(x86_emulate.h)
+x86-emulate.o test_x86_emulator.o evex-disp8.o wrappers.o: %.o: %.c $(x86_emulate.h)
 	$(HOSTCC) $(HOSTCFLAGS) -c -g -o $@ $<
 
 x86-emulate.o: x86_emulate/x86_emulate.c
--- /dev/null
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -0,0 +1,471 @@
+#include "x86-emulate.h"
+
+#include <stdarg.h>
+#include <stdio.h>
+
+struct test {
+    const char *mnemonic;
+    unsigned int opc:8;
+    unsigned int spc:2;
+    unsigned int pfx:2;
+    unsigned int vsz:3;
+    unsigned int esz:4;
+    unsigned int scale:1;
+    unsigned int ext:3;
+};
+
+enum spc {
+    SPC_invalid,
+    SPC_0f,
+    SPC_0f38,
+    SPC_0f3a,
+};
+
+enum pfx {
+    PFX_,
+    PFX_66,
+    PFX_f3,
+    PFX_f2
+};
+
+enum vl {
+    VL_128,
+    VL_256,
+    VL_512,
+};
+
+enum scale { /* scale by memory operand ... */
+    SC_vl,   /* ... vector length */
+    SC_el,   /* ... element length */
+};
+
+/*
+ * Vector size is determined either from EVEX.L'L (VL) or vector
+ * element size (EL), often controlled by EVEX.W (see enum esz).
+ */
+enum vsz {
+    VSZ_vl,
+    VSZ_vl_2, /* VL / 2 */
+    VSZ_vl_4, /* VL / 4 */
+    VSZ_vl_8, /* VL / 8 */
+    /* "no broadcast" implied from here on. */
+    VSZ_el,
+    VSZ_el_2, /* EL * 2 */
+    VSZ_el_4, /* EL * 4 */
+    VSZ_el_8, /* EL * 8 */
+};
+
+/*
+ * Vector element size is an opcode attribute, often determined by
+ * EVEX.W (in which case enumerators below name two sizes). Instructions
+ * accessing GPRs often use EVEX.W to select between 32- and 64-bit GPR
+ * width, but this distinction goes away outside of 64-bit mode (and EVEX.W
+ * is ignored there).
+ */
+enum esz {
+    ESZ_d,
+    ESZ_q,
+    ESZ_dq,
+    ESZ_sd,
+    ESZ_d_nb,
+    ESZ_q_nb,
+    /* "no broadcast" implied from here on. */
+#ifdef __i386__
+    ESZ_d_WIG,
+#endif
+    ESZ_b,
+    ESZ_w,
+    ESZ_bw,
+};
+
+#ifndef __i386__
+# define ESZ_dq64 ESZ_dq
+#else
+# define ESZ_dq64 ESZ_d_WIG
+#endif
+
+#define INSNX(m, p, sp, o, e, vs, es, sc) { \
+    .mnemonic = #m, .opc = 0x##o, .spc = SPC_##sp, .pfx = PFX_##p, \
+    .vsz = VSZ_##vs, .esz = ESZ_##es, .scale = SC_##sc, .ext = 0##e \
+}
+#define INSN(m, p, sp, o, vs, es, sc) INSNX(m, p, sp, o, 0, vs, es, sc)
+#define INSN_PFP(m, sp, o) \
+    INSN(m##pd, 66, sp, o, vl, q, vl), \
+    INSN(m##ps,   , sp, o, vl, d, vl)
+#define INSN_PFP_NB(m, sp, o) \
+    INSN(m##pd, 66, sp, o, vl, q_nb, vl), \
+    INSN(m##ps,   , sp, o, vl, d_nb, vl)
+#define INSN_SFP(m, sp, o) \
+    INSN(m##sd, f2, sp, o, el, q, el), \
+    INSN(m##ss, f3, sp, o, el, d, el)
+
+#define INSN_FP(m, sp, o) \
+    INSN_PFP(m, sp, o), \
+    INSN_SFP(m, sp, o)
+
+static const struct test avx512f_all[] = {
+    INSN_SFP(mov,            0f, 10),
+    INSN_SFP(mov,            0f, 11),
+    INSN_PFP_NB(mova,        0f, 28),
+    INSN_PFP_NB(mova,        0f, 29),
+    INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
+    INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
+    INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
+    INSN(movdqa64,     66,   0f, 7f,    vl,   q_nb, vl),
+    INSN(movdqu32,     f3,   0f, 6f,    vl,   d_nb, vl),
+    INSN(movdqu32,     f3,   0f, 7f,    vl,   d_nb, vl),
+    INSN(movdqu64,     f3,   0f, 6f,    vl,   q_nb, vl),
+    INSN(movdqu64,     f3,   0f, 7f,    vl,   q_nb, vl),
+    INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
+    INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
+    INSN_PFP_NB(movnt,       0f, 2b),
+    INSN_PFP_NB(movu,        0f, 10),
+    INSN_PFP_NB(movu,        0f, 11),
+};
+
+static const struct test avx512f_128[] = {
+    INSN(mov,       66,   0f, 6e, el, dq64, el),
+    INSN(mov,       66,   0f, 7e, el, dq64, el),
+    INSN(movq,      f3,   0f, 7e, el,    q, el),
+    INSN(movq,      66,   0f, d6, el,    q, el),
+};
+
+static const struct test avx512bw_all[] = {
+    INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
+    INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
+    INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
+    INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+};
+
+static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
+static const unsigned char vl_128[] = { VL_128 };
+
+/*
+ * This table, indicating the presence of an immediate (byte) for an opcode
+ * space 0f major opcode, is indexed by high major opcode byte nibble, with
+ * each table element then bit-indexed by low major opcode byte nibble.
+ */
+static const uint16_t imm0f[16] = {
+    [0x7] = (1 << 0x0) /* vpshuf* */ |
+            (1 << 0x1) /* vps{ll,ra,rl}w */ |
+            (1 << 0x2) /* vps{l,r}ld, vp{rol,ror,sra}{d,q} */ |
+            (1 << 0x3) /* vps{l,r}l{,d}q */,
+    [0xc] = (1 << 0x2) /* vcmp{p,s}{d,s} */ |
+            (1 << 0x4) /* vpinsrw */ |
+            (1 << 0x5) /* vpextrw */ |
+            (1 << 0x6) /* vshufp{d,s} */,
+};
+
+static struct x86_emulate_ops emulops;
+
+/*
+ * Access tracking (at byte granularity) is used on the first 64 bytes of
+ * address space. Instructions get encoded with a raw Disp8 value of 1,
+ * which then gets scaled accordingly. Hence accesses below the address
+ * <scaling factor>
+ * as well as at or above 2 * <scaling factor> are indications of bugs. To
+ * aid diagnosis / debugging, track all accesses below 3 * <scaling factor>.
+ * With AVX512 the maximum scaling factor is 64.
+ */
+static unsigned int accessed[3 * 64];
+
+static bool record_access(enum x86_segment seg, unsigned long offset,
+                          unsigned int bytes)
+{
+    while ( bytes-- )
+    {
+        if ( offset >= ARRAY_SIZE(accessed) )
+            return false;
+        ++accessed[offset++];
+    }
+
+    return true;
+}
+
+static int read(enum x86_segment seg, unsigned long offset, void *p_data,
+                unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+    if ( !record_access(seg, offset, bytes) )
+        return X86EMUL_UNHANDLEABLE;
+    memset(p_data, 0, bytes);
+    return X86EMUL_OKAY;
+}
+
+static int write(enum x86_segment seg, unsigned long offset, void *p_data,
+                 unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+    if ( !record_access(seg, offset, bytes) )
+        return X86EMUL_UNHANDLEABLE;
+    return X86EMUL_OKAY;
+}
+
+static void test_one(const struct test *test, enum vl vl,
+                     unsigned char *instr, struct x86_emulate_ctxt *ctxt)
+{
+    unsigned int vsz, esz, i;
+    int rc;
+    bool sg = strstr(test->mnemonic, "gather") ||
+              strstr(test->mnemonic, "scatter");
+    bool imm = test->spc == SPC_0f3a ||
+               (test->spc == SPC_0f &&
+                (imm0f[test->opc >> 4] & (1 << (test->opc & 0xf))));
+    union evex {
+        uint8_t raw[3];
+        struct {
+            uint8_t opcx:2;
+            uint8_t mbz:2;
+            uint8_t R:1;
+            uint8_t b:1;
+            uint8_t x:1;
+            uint8_t r:1;
+            uint8_t pfx:2;
+            uint8_t mbs:1;
+            uint8_t reg:4;
+            uint8_t w:1;
+            uint8_t opmsk:3;
+            uint8_t RX:1;
+            uint8_t bcst:1;
+            uint8_t lr:2;
+            uint8_t z:1;
+        };
+    } evex = {
+        .opcx = test->spc, .pfx = test->pfx, .lr = vl,
+        .R = 1, .b = 1, .x = 1, .r = 1, .mbs = 1,
+        .reg = 0xf, .RX = 1, .opmsk = sg,
+    };
+
+    switch ( test->esz )
+    {
+    case ESZ_b:
+        esz = 1;
+        break;
+
+    case ESZ_w:
+        esz = 2;
+        evex.w = 1;
+        break;
+
+#ifdef __i386__
+    case ESZ_d_WIG:
+        evex.w = 1;
+        /* fall through */
+#endif
+    case ESZ_d: case ESZ_d_nb:
+        esz = 4;
+        break;
+
+    case ESZ_q: case ESZ_q_nb:
+        esz = 8;
+        evex.w = 1;
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    switch ( test->vsz )
+    {
+    case VSZ_vl:
+        vsz = 16 << vl;
+        break;
+
+    case VSZ_vl_2:
+        vsz = 8 << vl;
+        break;
+
+    case VSZ_vl_4:
+        vsz = 4 << vl;
+        break;
+
+    case VSZ_vl_8:
+        vsz = 2 << vl;
+        break;
+
+    case VSZ_el:
+        vsz = esz;
+        break;
+
+    case VSZ_el_2:
+        vsz = esz * 2;
+        break;
+
+    case VSZ_el_4:
+        vsz = esz * 4;
+        break;
+
+    case VSZ_el_8:
+        vsz = esz * 8;
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    /*
+     * Note: SIB addressing is used here, such that S/G insns can be handled
+     * without extra conditionals.
+     */
+    instr[0] = 0x62;
+    instr[1] = evex.raw[0];
+    instr[2] = evex.raw[1];
+    instr[3] = evex.raw[2];
+    instr[4] = test->opc;
+    instr[5] = 0x44 | (test->ext << 3); /* ModR/M */
+    instr[6] = 0x12; /* SIB: base rDX, index none / xMM4 */
+    instr[7] = 1; /* Disp8 */
+    instr[8] = 0; /* immediate, if any */
+
+    asm volatile ( "kxnorw %k1, %k1, %k1" );
+    asm volatile ( "vxorps %xmm4, %xmm4, %xmm4" );
+
+    ctxt->regs->eip = (unsigned long)&instr[0];
+    ctxt->regs->edx = 0;
+    memset(accessed, 0, sizeof(accessed));
+
+    rc = x86_emulate(ctxt, &emulops);
+    if ( rc != X86EMUL_OKAY ||
+         (ctxt->regs->eip != (unsigned long)&instr[8 + imm]) )
+        goto fail;
+
+    for ( i = 0; i < (test->scale == SC_vl ? vsz : esz); ++i )
+         if ( accessed[i] )
+             goto fail;
+    for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
+         if ( accessed[i] != (sg ? vsz / esz : 1) )
+             goto fail;
+    for ( ; i < ARRAY_SIZE(accessed); ++i )
+         if ( accessed[i] )
+             goto fail;
+
+    /* Also check the broadcast case, if available. */
+    if ( test->vsz >= VSZ_el || test->scale != SC_vl )
+        return;
+
+    switch ( test->esz )
+    {
+    case ESZ_d_nb: case ESZ_q_nb:
+    case ESZ_b: case ESZ_w: case ESZ_bw:
+        return;
+
+    case ESZ_d: case ESZ_q:
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    evex.bcst = 1;
+    instr[3] = evex.raw[2];
+
+    ctxt->regs->eip = (unsigned long)&instr[0];
+    memset(accessed, 0, sizeof(accessed));
+
+    rc = x86_emulate(ctxt, &emulops);
+    if ( rc != X86EMUL_OKAY ||
+         (ctxt->regs->eip != (unsigned long)&instr[8 + imm]) )
+        goto fail;
+
+    for ( i = 0; i < esz; ++i )
+         if ( accessed[i] )
+             goto fail;
+    for ( ; i < esz * 2; ++i )
+         if ( accessed[i] != 1 )
+             goto fail;
+    for ( ; i < ARRAY_SIZE(accessed); ++i )
+         if ( accessed[i] )
+             goto fail;
+
+    return;
+
+ fail:
+    printf("failed (v%s%s %u-bit)\n", test->mnemonic,
+           evex.bcst ? "/bcst" : "", 128 << vl);
+    exit(1);
+}
+
+static void test_pair(const struct test *tmpl, enum vl vl,
+                      enum esz esz1, const char *suffix1,
+                      enum esz esz2, const char *suffix2,
+                      unsigned char *instr, struct x86_emulate_ctxt *ctxt)
+{
+    struct test test = *tmpl;
+    char mnemonic[24];
+
+    test.esz = esz1;
+    snprintf(mnemonic, ARRAY_SIZE(mnemonic), "%s%s", tmpl->mnemonic, suffix1);
+    test.mnemonic = mnemonic;
+    test_one(&test, vl, instr, ctxt);
+
+    test.esz = esz2;
+    snprintf(mnemonic, ARRAY_SIZE(mnemonic), "%s%s", tmpl->mnemonic, suffix2);
+    test.mnemonic = mnemonic;
+    test_one(&test, vl, instr, ctxt);
+}
+
+static void test_group(const struct test tests[], unsigned int nr_test,
+                       const unsigned char vl[], unsigned int nr_vl,
+                       void *instr, struct x86_emulate_ctxt *ctxt)
+{
+    unsigned int i, j;
+
+    for ( i = 0; i < nr_test; ++i )
+    {
+        for ( j = 0; j < nr_vl; ++j )
+        {
+            if ( vl[0] == VL_512 && vl[j] != VL_512 && !cpu_has_avx512vl )
+                continue;
+
+            switch ( tests[i].esz )
+            {
+            default:
+                test_one(&tests[i], vl[j], instr, ctxt);
+                break;
+
+            case ESZ_bw:
+                test_pair(&tests[i], vl[j], ESZ_b, "b", ESZ_w, "w",
+                          instr, ctxt);
+                break;
+
+            case ESZ_dq:
+                test_pair(&tests[i], vl[j], ESZ_d, "d", ESZ_q, "q",
+                          instr, ctxt);
+                break;
+
+#ifdef __i386__
+            case ESZ_d_WIG:
+                test_pair(&tests[i], vl[j], ESZ_d, "/W0",
+                          ESZ_d_WIG, "/W1", instr, ctxt);
+                break;
+#endif
+
+            case ESZ_sd:
+                test_pair(&tests[i], vl[j],
+                          ESZ_d, tests[i].vsz < VSZ_el ? "ps" : "ss",
+                          ESZ_q, tests[i].vsz < VSZ_el ? "pd" : "sd",
+                          instr, ctxt);
+                break;
+            }
+        }
+    }
+}
+
+void evex_disp8_test(void *instr, struct x86_emulate_ctxt *ctxt,
+                     const struct x86_emulate_ops *ops)
+{
+    emulops = *ops;
+    emulops.read = read;
+    emulops.write = write;
+
+#define RUN(feat, vl) do { \
+    if ( cpu_has_##feat ) \
+    { \
+        printf("%-40s", "Testing " #feat "/" #vl " disp8 handling..."); \
+        test_group(feat ## _ ## vl, ARRAY_SIZE(feat ## _ ## vl), \
+                   vl_ ## vl, ARRAY_SIZE(vl_ ## vl), instr, ctxt); \
+        printf("okay\n"); \
+    } \
+} while ( false )
+
+    RUN(avx512f, all);
+    RUN(avx512f, 128);
+    RUN(avx512bw, all);
+}
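
(Editorial aside, not part of the patch: a minimal sketch, with made-up
names, of the Disp8 scaling rule the harness above verifies. An 8-bit
displacement in an EVEX-encoded insn is implicitly multiplied by a factor
N before use; per the tables above, N is the full memory operand size for
SC_vl entries, the element size for SC_el entries, and the element size
again once embedded broadcast is in effect, as the broadcast re-test in
test_one() shows:

    static unsigned int disp8_scale(bool bcst, enum scale sc,
                                    unsigned int opnd_bytes,
                                    unsigned int elem_bytes)
    {
        if ( bcst )                 /* embedded broadcast: one element */
            return elem_bytes;
        return sc == SC_vl ? opnd_bytes : elem_bytes;
    }

With a raw Disp8 of 1 the emulated accesses are hence expected to fall
into [N, N + <operand size>) and nowhere else, which is exactly what the
accessed[] checks assert.)
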
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3795,6 +3795,9 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    if ( stack_exec )
+        evex_disp8_test(instr, &ctxt, &emulops);
+
     for ( j = 0; j < ARRAY_SIZE(blobs); j++ )
     {
         if ( blobs[j].check_cpu && !blobs[j].check_cpu() )
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -96,6 +96,9 @@ WRAP(puts);
 
 #include "x86_emulate/x86_emulate.h"
 
+void evex_disp8_test(void *instr, struct x86_emulate_ctxt *ctxt,
+                     const struct x86_emulate_ops *ops);
+
 static inline uint64_t xgetbv(uint32_t xcr)
 {
     uint32_t lo, hi;





* [PATCH v5 04/47] x86emul: also allow running the 32-bit harness on a 64-bit distro
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (2 preceding siblings ...)
  2018-11-19 10:14   ` [PATCH v5 03/47] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
@ 2018-11-19 10:14   ` Jan Beulich
  2018-11-19 18:40     ` Andrew Cooper
  2018-11-19 10:15   ` [PATCH v5 05/47] x86emul: use AVX512 logic for emulating V{,P}MASKMOV* Jan Beulich
                     ` (42 subsequent siblings)
  46 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:14 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

In order to be able to verify that the 32-bit variant builds and runs,
introduce a corresponding target (and the necessary other adjustments).
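
(Editorial note, not part of the commit message: judging from the diff
below, the intended usage on a 64-bit distro would presumably be "make
run32" from tools/tests/x86_emulator/, with "make clean32" to tidy up;
the nested 32/ directory simply re-includes the main Makefile with
XEN_COMPILE_ARCH overridden to x86_32.)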

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Use vpath directive instead of VPATH variable. Use ?= instead of
    ifeq(). Don't have clean32 depend on generated headers. Re-base.
v4: Moved ahead in series.
v3: New.

--- a/.gitignore
+++ b/.gitignore
@@ -240,6 +240,7 @@ tools/security/xensec_tool
 tools/tests/depriv/depriv-fd-checker
 tools/tests/x86_emulator/*.bin
 tools/tests/x86_emulator/*.tmp
+tools/tests/x86_emulator/32/x86_emulate
 tools/tests/x86_emulator/3dnow*.[ch]
 tools/tests/x86_emulator/asm
 tools/tests/x86_emulator/avx*.[ch]
--- /dev/null
+++ b/tools/tests/x86_emulator/32/Makefile
@@ -0,0 +1,5 @@
+override XEN_COMPILE_ARCH := x86_32
+XEN_ROOT = $(CURDIR)/../../../..
+vpath %.c ..
+vpath %.h ..
+include ../Makefile
--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -1,5 +1,5 @@
 
-XEN_ROOT=$(CURDIR)/../../..
+XEN_ROOT ?= $(CURDIR)/../../..
 include $(XEN_ROOT)/tools/Rules.mk
 
 TARGET := test_x86_emulator
@@ -23,6 +23,12 @@ TESTCASES := blowfish $(SIMD) $(FMA) $(S
 
 OPMASK := avx512f avx512dq avx512bw
 
+ifeq ($(origin XEN_COMPILE_ARCH),override)
+
+HOSTCFLAGS += -m32
+
+else
+
 blowfish-cflags := ""
 blowfish-cflags-x86_32 := "-mno-accumulate-outgoing-args -Dstatic="
 
@@ -144,6 +150,8 @@ $(addsuffix .h,$(SIMD) $(FMA) $(SG)): si
 
 xop.h: simd-fma.c
 
+endif # 32-bit override
+
 $(TARGET): x86-emulate.o cpuid.o test_x86_emulator.o evex-disp8.o wrappers.o
 	$(HOSTCC) $(HOSTCFLAGS) -o $@ $^
 
@@ -158,6 +166,16 @@ distclean: clean
 .PHONY: install uninstall
 install uninstall:
 
+.PHONY: run32 clean32
+ifeq ($(XEN_TARGET_ARCH),x86_64)
+run32: $(addsuffix .h,$(TESTCASES)) $(addsuffix -opmask.h,$(OPMASK))
+run32 clean32: %32:
+	$(MAKE) -C 32 $*
+clean: clean32
+else
+run32 clean32: %32: %
+endif
+
 x86_emulate:
 	[ -L $@ ] || ln -sf $(XEN_ROOT)/xen/arch/x86/$@
 






* [PATCH v5 05/47] x86emul: use AVX512 logic for emulating V{,P}MASKMOV*
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (3 preceding siblings ...)
  2018-11-19 10:14   ` [PATCH v5 04/47] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
@ 2018-11-19 10:15   ` Jan Beulich
  2018-11-19 10:16   ` [PATCH v5 06/47] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
                     ` (41 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:15 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

The more generic AVX512 implementation allows quite a bit of insn-
specific code to be dropped/shared.
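
As an editorial illustration of the shared element-wise behavior (a
minimal sketch with made-up names, not code from the patch):

    /*
     * Masked load: element i is transferred only if mask bit i is set;
     * VMASKMOV zeroes unselected destination elements, and elements with
     * a clear mask bit must not fault - hence the reuse of the AVX512
     * fault suppression logic below.
     */
    static void maskmov_load32(uint32_t *dst, const uint32_t *src,
                               uint32_t mask, unsigned int nelem)
    {
        unsigned int i;

        for ( i = 0; i < nelem; ++i )
            dst[i] = (mask & (1u << i)) ? src[i] : 0;
    }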

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -439,8 +439,8 @@ static const struct ext0f38_table {
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
-    [0x2c ... 0x2d] = { .simd_size = simd_other },
-    [0x2e ... 0x2f] = { .simd_size = simd_other, .to_mem = 1 },
+    [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
+    [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int },
     [0x40] = { .simd_size = simd_packed_int },
@@ -449,8 +449,8 @@ static const struct ext0f38_table {
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
-    [0x8c] = { .simd_size = simd_other },
-    [0x8e] = { .simd_size = simd_other, .to_mem = 1 },
+    [0x8c] = { .simd_size = simd_packed_int },
+    [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp },
     [0x99] = { .simd_size = simd_scalar_vexw },
@@ -7989,6 +7989,8 @@ x86_emulate(
 
         generate_exception_if(ea.type != OP_MEM || vex.w, EXC_UD);
         host_and_vcpu_must_have(avx);
+        elem_bytes = 4 << (b & 1);
+    vmaskmov:
         get_fpu(X86EMUL_FPU_ymm);
 
         /*
@@ -8003,7 +8005,7 @@ x86_emulate(
         opc = init_prefixes(stub);
         pvex = copy_VEX(opc, vex);
         pvex->opcx = vex_0f;
-        if ( !(b & 1) )
+        if ( elem_bytes == 4 )
             pvex->pfx = vex_none;
         opc[0] = 0x50; /* vmovmskp{s,d} */
         /* Use %rax as GPR destination and VEX.vvvv as source. */
@@ -8016,21 +8018,9 @@ x86_emulate(
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
         put_stub(stub);
 
-        if ( !ea.val )
-            goto complete_insn;
-
-        op_bytes = 4 << (b & 1);
-        first_byte = __builtin_ctz(ea.val);
-        ea.val >>= first_byte;
-        first_byte *= op_bytes;
-        op_bytes *= 32 - __builtin_clz(ea.val);
-
-        /*
-         * Even for the memory write variant a memory read is needed, unless
-         * all set mask bits are contiguous.
-         */
-        if ( ea.val & (ea.val + 1) )
-            d = (d & ~SrcMask) | SrcMem;
+        evex.opmsk = 1; /* fake */
+        op_mask = ea.val;
+        fault_suppression = true;
 
         opc = init_prefixes(stub);
         opc[0] = b;
@@ -8081,63 +8071,10 @@ x86_emulate(
 
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
-    {
-        typeof(vex) *pvex;
-        unsigned int mask = vex.w ? 0x80808080U : 0x88888888U;
-
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
         host_and_vcpu_must_have(avx2);
-        get_fpu(X86EMUL_FPU_ymm);
-
-        /*
-         * While we can't reasonably provide fully correct behavior here
-         * (in particular, for writes, avoiding the memory read in anticipation
-         * of all elements in the range eventually being written), we can (and
-         * should) still limit the memory access to the smallest possible range
-         * (suppressing it altogether if all mask bits are clear), to provide
-         * correct faulting behavior. Read the mask bits via vmovmskp{s,d}
-         * for that purpose.
-         */
-        opc = init_prefixes(stub);
-        pvex = copy_VEX(opc, vex);
-        pvex->opcx = vex_0f;
-        opc[0] = 0xd7; /* vpmovmskb */
-        /* Use %rax as GPR destination and VEX.vvvv as source. */
-        pvex->r = 1;
-        pvex->b = !mode_64bit() || (vex.reg >> 3);
-        opc[1] = 0xc0 | (~vex.reg & 7);
-        pvex->reg = 0xf;
-        opc[2] = 0xc3;
-
-        invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
-        put_stub(stub);
-
-        /* Convert byte granular result to dword/qword granularity. */
-        ea.val &= mask;
-        if ( !ea.val )
-            goto complete_insn;
-
-        first_byte = __builtin_ctz(ea.val) & ~((4 << vex.w) - 1);
-        ea.val >>= first_byte;
-        op_bytes = 32 - __builtin_clz(ea.val);
-
-        /*
-         * Even for the memory write variant a memory read is needed, unless
-         * all set mask bits are contiguous.
-         */
-        if ( ea.val & (ea.val + ~mask + 1) )
-            d = (d & ~SrcMask) | SrcMem;
-
-        opc = init_prefixes(stub);
-        opc[0] = b;
-        /* Convert memory operand to (%rAX). */
-        rex_prefix &= ~REX_B;
-        vex.b = 1;
-        opc[1] = modrm & 0x38;
-        insn_bytes = PFX_BYTES + 2;
-
-        break;
-    }
+        elem_bytes = 4 << vex.w;
+        goto vmaskmov;
 
     case X86EMUL_OPC_VEX_66(0x0f38, 0x90): /* vpgatherd{d,q} {x,y}mm,mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x91): /* vpgatherq{d,q} {x,y}mm,mem,{x,y}mm */





* [PATCH v5 06/47] x86emul: support AVX512F legacy-equivalent arithmetic FP insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (4 preceding siblings ...)
  2018-11-19 10:15   ` [PATCH v5 05/47] x86emul: use AVX512 logic for emulating V{,P}MASKMOV* Jan Beulich
@ 2018-11-19 10:16   ` Jan Beulich
  2018-11-19 10:16   ` [PATCH v5 07/47] x86emul: support AVX512DQ logic " Jan Beulich
                     ` (40 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:16 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -104,6 +104,10 @@ enum esz {
     INSN_SFP(m, sp, o)
 
 static const struct test avx512f_all[] = {
+    INSN_FP(add,             0f, 58),
+    INSN_FP(div,             0f, 5e),
+    INSN_FP(max,             0f, 5f),
+    INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
     INSN_SFP(mov,            0f, 11),
     INSN_PFP_NB(mova,        0f, 28),
@@ -121,6 +125,9 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movnt,       0f, 2b),
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
+    INSN_FP(mul,             0f, 59),
+    INSN_FP(sqrt,            0f, 51),
+    INSN_FP(sub,             0f, 5c),
 };
 
 static const struct test avx512f_128[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -300,12 +300,12 @@ static const struct twobyte_table {
     [0x3a] = { DstReg|SrcImmByte|ModRM },
     [0x40 ... 0x4f] = { DstReg|SrcMem|ModRM|Mov },
     [0x50] = { DstReg|SrcImplicit|ModRM|Mov },
-    [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp },
+    [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp, d8s_vl },
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
-    [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp },
+    [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp },
+    [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -5871,10 +5871,22 @@ x86_emulate(
         if ( (b & ~1) == 0x10 && ea.type == OP_MEM )
             d |= TwoOp;
         generate_exception_if(evex.br, EXC_UD);
-        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+        /* fall through */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x51):    /* vsqrtp{s,d} [xyz]mm/mem,[xyz]mm{k} */
+                                            /* vsqrts{s,d} xmm/m32,xmm,xmm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x58):    /* vadd{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x59):    /* vmul{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5c):    /* vsub{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5d):    /* vmin{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type == OP_MEM && evex.br &&
+                                (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
-        avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
     simd_zmm:
         get_fpu(X86EMUL_FPU_zmm);
         opc = init_evex(stub);
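
(Editorial aside, not part of the patch: with vex_none/vex_66/vex_f3/
vex_f2 numbered 0/1/2/3 in the emulator, VEX_PREFIX_DOUBLE_MASK (bit 0)
is set exactly for the "double" flavors, so the evex.w check added above
encodes the following pairing:

    /* vaddps: pfx = none (0), EVEX.W = 0   vaddpd: pfx = 66 (1), EVEX.W = 1 */
    /* vaddss: pfx = f3   (2), EVEX.W = 0   vaddsd: pfx = f2 (3), EVEX.W = 1 */

Any other combination takes the #UD path.)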






* [PATCH v5 07/47] x86emul: support AVX512DQ logic FP insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (5 preceding siblings ...)
  2018-11-19 10:16   ` [PATCH v5 06/47] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
@ 2018-11-19 10:16   ` Jan Beulich
  2018-11-19 10:17   ` [PATCH v5 08/47] x86emul: support basic AVX512F FP compare insns Jan Beulich
                     ` (39 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:16 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -144,6 +144,13 @@ static const struct test avx512bw_all[]
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
 };
 
+static const struct test avx512dq_all[] = {
+    INSN_PFP(and,              0f, 54),
+    INSN_PFP(andn,             0f, 55),
+    INSN_PFP(or,               0f, 56),
+    INSN_PFP(xor,              0f, 57),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 
@@ -475,4 +482,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, all);
     RUN(avx512f, 128);
     RUN(avx512bw, all);
+    RUN(avx512dq, all);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -302,7 +302,7 @@ static const struct twobyte_table {
     [0x50] = { DstReg|SrcImplicit|ModRM|Mov },
     [0x51] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_any_fp, d8s_vl },
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
-    [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
+    [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
@@ -6336,6 +6336,17 @@ x86_emulate(
         dst.bytes = 4;
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x54): /* vandp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x55): /* vandnp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x56): /* vorp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x57): /* vxorp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type != OP_MEM && evex.br)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512dq);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     CASE_SIMD_ALL_FP(, 0x0f, 0x5a):        /* cvt{p,s}{s,d}2{p,s}{s,d} xmm/mem,xmm */
     CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} xmm/mem,xmm */
                                            /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm */
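
(Editorial aside: these insns are purely bitwise; per 32-bit (ps) or
64-bit (pd) element they compute, in illustrative C,

    dst = src1 & src2;     /* vandp{s,d}  */
    dst = ~src1 & src2;    /* vandnp{s,d} */
    dst = src1 | src2;     /* vorp{s,d}   */
    dst = src1 ^ src2;     /* vxorp{s,d}  */

which is also why the element size only matters for masking and embedded
broadcast.)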






* [PATCH v5 08/47] x86emul: support basic AVX512F FP compare insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (6 preceding siblings ...)
  2018-11-19 10:16   ` [PATCH v5 07/47] x86emul: support AVX512DQ logic " Jan Beulich
@ 2018-11-19 10:17   ` Jan Beulich
  2018-11-19 10:17   ` [PATCH v5 09/47] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
                     ` (38 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:17 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

V{,U}COMIS{S,D} to follow later.

Also correct the AVX counterpart's comment.
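
(Editorial reference, not part of the commit message: the insns' imm8
selects the comparison predicate. The legacy SSE subset is, per the SDM:

    enum {
        CMP_EQ = 0, CMP_LT = 1, CMP_LE = 2, CMP_UNORD = 3,
        CMP_NEQ = 4, CMP_NLT = 5, CMP_NLE = 6, CMP_ORD = 7,
    };

AVX and AVX512 extend this to 32 predicates, and the EVEX form deposits
the per-element results in an opmask register, as the k{k} destination in
the insn comment below indicates.)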

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v5: Adjust title.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -105,6 +105,7 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -352,7 +352,7 @@ static const struct twobyte_table {
     [0xbf] = { DstReg|SrcMem16|ModRM|Mov },
     [0xc0] = { ByteOp|DstMem|SrcReg|ModRM },
     [0xc1] = { DstMem|SrcReg|ModRM },
-    [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp },
+    [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
     [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
@@ -7443,7 +7443,7 @@ x86_emulate(
         goto add;
 
     CASE_SIMD_ALL_FP(, 0x0f, 0xc2):        /* cmp{p,s}{s,d} $imm8,xmm/mem,xmm */
-    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
+    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0xc6):     /* shufp{s,d} $imm8,xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
         d = (d & ~SrcMask) | SrcMem;
@@ -7457,6 +7457,30 @@ x86_emulate(
         }
         goto simd_0f_imm8_avx;
 
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0xc2): /* vcmp{p,s}{s,d} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
+                               (ea.type == OP_MEM && evex.br &&
+                                (evex.pfx & VEX_PREFIX_SCALAR_MASK)) ||
+                               !evex.r || !evex.R || evex.z),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
+        d = (d & ~SrcMask) | SrcMem;
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        opc[0] = b;
+        opc[1] = modrm;
+        if ( ea.type == OP_MEM )
+        {
+            /* convert memory operand to (%rAX) */
+            evex.b = 1;
+            opc[1] &= 0x38;
+        }
+        opc[2] = imm1;
+        insn_bytes = EVEX_PFX_BYTES + 3;
+        break;
+
     case X86EMUL_OPC(0x0f, 0xc3): /* movnti */
         /* Ignore the non-temporal hint for now. */
         vcpu_must_have(sse2);






* [PATCH v5 09/47] x86emul: support AVX512F misc legacy-equivalent FP insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (7 preceding siblings ...)
  2018-11-19 10:17   ` [PATCH v5 08/47] x86emul: support basic AVX512F FP compare insns Jan Beulich
@ 2018-11-19 10:17   ` Jan Beulich
  2018-11-19 10:18   ` [PATCH v5 10/47] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
                     ` (37 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:17 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also correct an AVX counterpart's comment.
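
(Editorial illustration for one of the insns gaining EVEX support here:
per 128-bit lane, vunpcklps interleaves the low elements of its sources,

    dst[0] = src1[0]; dst[1] = src2[0];
    dst[2] = src1[1]; dst[3] = src2[1];

while vunpckhps does the same with elements 2 and 3.)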

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -127,8 +127,11 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
+    INSN_PFP(unpckh,         0f, 15),
+    INSN_PFP(unpckl,         0f, 14),
 };
 
 static const struct test avx512f_128[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -282,7 +282,7 @@ static const struct twobyte_table {
     [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
-    [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp },
+    [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
@@ -356,7 +356,7 @@ static const struct twobyte_table {
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
     [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
-    [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp },
+    [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -5940,6 +5940,17 @@ x86_emulate(
         host_and_vcpu_must_have(sse3);
         goto simd_0f_xmm;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x14): /* vunpcklp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        fault_suppression = false;
+    avx512f_no_sae:
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     case X86EMUL_OPC(0x0f, 0x20): /* mov cr,reg */
     case X86EMUL_OPC(0x0f, 0x21): /* mov dr,reg */
     case X86EMUL_OPC(0x0f, 0x22): /* mov reg,cr */
@@ -6618,11 +6629,9 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_F3(0x0f, 0x7f): /* vmovdqu{32,64} [xyz]mm,[xyz]mm/mem{k} */
     vmovdqa:
         generate_exception_if(evex.br, EXC_UD);
-        host_and_vcpu_must_have(avx512f);
-        avx512_vlen_check(false);
         d |= TwoOp;
         op_bytes = 16 << evex.lr;
-        goto simd_zmm;
+        goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_F2(0x0f, 0x6f): /* vmovdqu{8,16} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f, 0x7f): /* vmovdqu{8,16} [xyz]mm,[xyz]mm/mem{k} */
@@ -7445,7 +7454,7 @@ x86_emulate(
     CASE_SIMD_ALL_FP(, 0x0f, 0xc2):        /* cmp{p,s}{s,d} $imm8,xmm/mem,xmm */
     CASE_SIMD_ALL_FP(_VEX, 0x0f, 0xc2):    /* vcmp{p,s}{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0xc6):     /* shufp{s,d} $imm8,xmm/mem,xmm */
-    CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm */
+    CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         d = (d & ~SrcMask) | SrcMem;
         if ( vex.opcx == vex_none )
         {
@@ -7466,7 +7475,9 @@ x86_emulate(
         host_and_vcpu_must_have(avx512f);
         if ( ea.type == OP_MEM || !evex.br )
             avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
-        d = (d & ~SrcMask) | SrcMem;
+    simd_imm8_zmm:
+        if ( (d & SrcMask) == SrcImmByte )
+            d = (d & ~SrcMask) | SrcMem;
         get_fpu(X86EMUL_FPU_zmm);
         opc = init_evex(stub);
         opc[0] = b;
@@ -7510,6 +7521,15 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         goto simd_0f_to_gpr;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f, 0xc7): /* Grp9 */
     {
         union {





* [PATCH v5 10/47] x86emul: support AVX512F fused-multiply-add insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (8 preceding siblings ...)
  2018-11-19 10:17   ` [PATCH v5 09/47] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
@ 2018-11-19 10:18   ` Jan Beulich
  2018-11-19 10:18   ` [PATCH v5 11/47] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
                     ` (36 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -107,6 +107,36 @@ static const struct test avx512f_all[] =
     INSN_FP(add,             0f, 58),
     INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
+    INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
+    INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
+    INSN(fmadd213,     66, 0f38, a8,    vl,     sd, vl),
+    INSN(fmadd213,     66, 0f38, a9,    el,     sd, el),
+    INSN(fmadd231,     66, 0f38, b8,    vl,     sd, vl),
+    INSN(fmadd231,     66, 0f38, b9,    el,     sd, el),
+    INSN(fmaddsub132,  66, 0f38, 96,    vl,     sd, vl),
+    INSN(fmaddsub213,  66, 0f38, a6,    vl,     sd, vl),
+    INSN(fmaddsub231,  66, 0f38, b6,    vl,     sd, vl),
+    INSN(fmsub132,     66, 0f38, 9a,    vl,     sd, vl),
+    INSN(fmsub132,     66, 0f38, 9b,    el,     sd, el),
+    INSN(fmsub213,     66, 0f38, aa,    vl,     sd, vl),
+    INSN(fmsub213,     66, 0f38, ab,    el,     sd, el),
+    INSN(fmsub231,     66, 0f38, ba,    vl,     sd, vl),
+    INSN(fmsub231,     66, 0f38, bb,    el,     sd, el),
+    INSN(fmsubadd132,  66, 0f38, 97,    vl,     sd, vl),
+    INSN(fmsubadd213,  66, 0f38, a7,    vl,     sd, vl),
+    INSN(fmsubadd231,  66, 0f38, b7,    vl,     sd, vl),
+    INSN(fnmadd132,    66, 0f38, 9c,    vl,     sd, vl),
+    INSN(fnmadd132,    66, 0f38, 9d,    el,     sd, el),
+    INSN(fnmadd213,    66, 0f38, ac,    vl,     sd, vl),
+    INSN(fnmadd213,    66, 0f38, ad,    el,     sd, el),
+    INSN(fnmadd231,    66, 0f38, bc,    vl,     sd, vl),
+    INSN(fnmadd231,    66, 0f38, bd,    el,     sd, el),
+    INSN(fnmsub132,    66, 0f38, 9e,    vl,     sd, vl),
+    INSN(fnmsub132,    66, 0f38, 9f,    el,     sd, el),
+    INSN(fnmsub213,    66, 0f38, ae,    vl,     sd, vl),
+    INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
+    INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
+    INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -452,30 +452,30 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
-    [0x96 ... 0x98] = { .simd_size = simd_packed_fp },
-    [0x99] = { .simd_size = simd_scalar_vexw },
-    [0x9a] = { .simd_size = simd_packed_fp },
-    [0x9b] = { .simd_size = simd_scalar_vexw },
-    [0x9c] = { .simd_size = simd_packed_fp },
-    [0x9d] = { .simd_size = simd_scalar_vexw },
-    [0x9e] = { .simd_size = simd_packed_fp },
-    [0x9f] = { .simd_size = simd_scalar_vexw },
-    [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp },
-    [0xa9] = { .simd_size = simd_scalar_vexw },
-    [0xaa] = { .simd_size = simd_packed_fp },
-    [0xab] = { .simd_size = simd_scalar_vexw },
-    [0xac] = { .simd_size = simd_packed_fp },
-    [0xad] = { .simd_size = simd_scalar_vexw },
-    [0xae] = { .simd_size = simd_packed_fp },
-    [0xaf] = { .simd_size = simd_scalar_vexw },
-    [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp },
-    [0xb9] = { .simd_size = simd_scalar_vexw },
-    [0xba] = { .simd_size = simd_packed_fp },
-    [0xbb] = { .simd_size = simd_scalar_vexw },
-    [0xbc] = { .simd_size = simd_packed_fp },
-    [0xbd] = { .simd_size = simd_scalar_vexw },
-    [0xbe] = { .simd_size = simd_packed_fp },
-    [0xbf] = { .simd_size = simd_scalar_vexw },
+    [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9a] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9b] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9c] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x9e] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x9f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xa9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xaa] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xab] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xac] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xad] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xae] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xaf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xb9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xba] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xbc] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xc8 ... 0xcd] = { .simd_size = simd_other },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
@@ -8292,6 +8292,49 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM )
+        {
+            generate_exception_if(evex.br, EXC_UD);
+            avx512_vlen_check(true);
+        }
+        goto simd_zmm;
+
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xc9):     /* sha1msg1 xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xca):     /* sha1msg2 xmm/m128,xmm */
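
(Editorial reference for the 132/213/231 naming used throughout this
patch, per the SDM, with operand 1 being the destination:

    /* vfmadd132: op1 = op1 * op3 + op2 */
    /* vfmadd213: op1 = op2 * op1 + op3 */
    /* vfmadd231: op1 = op2 * op3 + op1 */

The vfnm* forms negate the product, and vfmaddsub*/vfmsubadd* alternate
between add and subtract across the packed elements.)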





* [PATCH v5 11/47] x86emul: support AVX512F legacy-equivalent logic insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (9 preceding siblings ...)
  2018-11-19 10:18   ` [PATCH v5 10/47] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
@ 2018-11-19 10:18   ` Jan Beulich
  2018-11-19 10:19   ` [PATCH v5 12/47] x86emul: support AVX512{F,DQ} FP broadcast insns Jan Beulich
                     ` (35 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Plus vpternlog{d,q}, which is used extensively by the compiler, in order
to facilitate enabling tests in the harness as soon as possible. Also the
twobyte_table[] entries for a few more insns get their .d8s field set
right away, in order to not split and later re-combine the groups.
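
As an editorial illustration (a minimal C model, not patch code) of what
vpternlog{d,q} computes: for every bit position the three source bits form
a 3-bit index into imm8, and the addressed imm8 bit becomes the result bit:

    static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c,
                              uint8_t imm) /* assumes <stdint.h> */
    {
        uint32_t r = 0;
        unsigned int bit;

        for ( bit = 0; bit < 32; ++bit )
        {
            unsigned int idx = (((a >> bit) & 1) << 2) |
                               (((b >> bit) & 1) << 1) |
                               ((c >> bit) & 1);

            r |= (uint32_t)((imm >> idx) & 1) << bit;
        }

        return r;
    }

With imm = 0x96, for example, this yields a three-input XOR, which is one
reason compilers emit the insn so readily.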

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -157,6 +157,11 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(pand,         66,   0f, db,    vl,     dq, vl),
+    INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
+    INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -364,13 +364,13 @@ static const struct twobyte_table {
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
-    [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
@@ -493,6 +493,7 @@ static const struct ext0f3a_table {
     uint8_t to_mem:1;
     uint8_t two_op:1;
     uint8_t four_op:1;
+    disp8scale_t d8s:4;
 } ext0f3a_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x01] = { .simd_size = simd_packed_fp, .two_op = 1 },
@@ -510,6 +511,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
+    [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
@@ -3023,20 +3025,33 @@ x86_decode(
                 disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
             break;
 
+        case ext_0f3a:
+            /*
+             * Cannot update d here yet, as the immediate operand still
+             * needs fetching.
+             */
+            state->simd_size = ext0f3a_table[b].simd_size;
+            if ( evex_encoded() )
+                disp8scale = decode_disp8scale(ext0f3a_table[b].d8s, state);
+            break;
+
         case ext_8f09:
             if ( ext8f09_table[b].two_op )
                 d |= TwoOp;
             state->simd_size = ext8f09_table[b].simd_size;
             break;
 
-        case ext_0f3a:
         case ext_8f08:
+        case ext_8f0a:
             /*
              * Cannot update d here yet, as the immediate operand still
              * needs fetching.
              */
-        default:
             break;
+
+        default:
+            ASSERT_UNREACHABLE();
+            return X86EMUL_UNIMPLEMENTED;
         }
 
         if ( modrm_mod == 3 )
@@ -3215,7 +3230,6 @@ x86_decode(
         else if ( ext0f3a_table[b].four_op && !mode_64bit() && vex.opcx )
             imm1 &= 0x7f;
         state->desc = d;
-        state->simd_size = ext0f3a_table[b].simd_size;
         rc = x86_decode_0f3a(state, ctxt, ops);
         break;
 
@@ -5945,6 +5959,11 @@ x86_emulate(
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
         fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -7525,6 +7544,8 @@ x86_emulate(
         fault_suppression = false;
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
         avx512_vlen_check(false);





* [PATCH v5 12/47] x86emul: support AVX512{F,DQ} FP broadcast insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (10 preceding siblings ...)
  2018-11-19 10:18   ` [PATCH v5 11/47] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
@ 2018-11-19 10:19   ` Jan Beulich
  2018-11-19 18:44     ` Andrew Cooper
  2018-11-19 10:20   ` [PATCH v5 13/47] x86emul: support AVX512F v{,u}comis{d,s} insns Jan Beulich
                     ` (34 subsequent siblings)
  46 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:19 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Use IMPOSSIBLE() to guard against division by zero.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -105,6 +105,7 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
@@ -176,6 +177,15 @@ static const struct test avx512f_128[] =
     INSN(movq,      66,   0f, d6, el,    q, el),
 };
 
+static const struct test avx512f_no128[] = {
+    INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+};
+
+static const struct test avx512f_512[] = {
+    INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+};
+
 static const struct test avx512bw_all[] = {
     INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
@@ -190,8 +200,19 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_no128[] = {
+    INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
+    INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+};
+
+static const struct test avx512dq_512[] = {
+    INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
+static const unsigned char vl_no128[] = { VL_512, VL_256 };
+static const unsigned char vl_512[] = { VL_512 };
 
 /*
  * This table, indicating the presence of an immediate (byte) for an opcode
@@ -520,6 +541,10 @@ void evex_disp8_test(void *instr, struct
 
     RUN(avx512f, all);
     RUN(avx512f, 128);
+    RUN(avx512f, no128);
+    RUN(avx512f, 512);
     RUN(avx512bw, all);
     RUN(avx512dq, all);
+    RUN(avx512dq, no128);
+    RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -234,10 +234,16 @@ enum simd_opsize {
 
     /*
      * 128 bits of integer or floating point data, with no further
-     * formatting information.
+     * formatting information, or with it encoded by EVEX.W.
      */
     simd_128,
 
+    /*
+     * 256 bits of integer or floating point data, with formatting
+     * encoded by EVEX.W.
+     */
+    simd_256,
+
     /* Operand size encoded in non-standard way. */
     simd_other
 };
@@ -432,8 +438,10 @@ static const struct ext0f38_table {
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x18 ... 0x19] = { .simd_size = simd_scalar_opc, .two_op = 1 },
-    [0x1a] = { .simd_size = simd_128, .two_op = 1 },
+    [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
+    [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
+    [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
     [0x28 ... 0x29] = { .simd_size = simd_packed_int },
@@ -3337,6 +3345,10 @@ x86_decode(
         op_bytes = 16;
         break;
 
+    case simd_256:
+        op_bytes = 32;
+        break;
+
     default:
         op_bytes = 0;
         break;
@@ -7984,6 +7996,43 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+    avx512_broadcast:
+        /*
+         * For the code below the main switch() to work we need to fold
+         * op_mask here: A source element gets read whenever any of the
+         * mask bits of its corresponding destination elements is set.
+         */
+        if ( fault_suppression )
+        {
+            n = 1 << ((b & 3) - evex.w);
+            IMPOSSIBLE(elem_bytes <= 0);
+            ASSERT(op_bytes == n * elem_bytes);
+            for ( i = n; i < (16 << evex.lr) / elem_bytes; i += n )
+                op_mask |= (op_mask >> i) & ((1 << n) - 1);
+        }
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
+                                            /* vbroadcastf64x4 m256,zmm{k} */
+        generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
+                                            /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
+        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcastf64x2 m128,{y,z}mm{k} */
+        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.br,
+                              EXC_UD);
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
     case X86EMUL_OPC_66(0x0f38, 0x20): /* pmovsxbw xmm/m64,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x21): /* pmovsxbd xmm/m32,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x22): /* pmovsxbq xmm/m16,xmm */





* [PATCH v5 13/47] x86emul: support AVX512F v{, u}comis{d, s} insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (11 preceding siblings ...)
  2018-11-19 10:19   ` [PATCH v5 12/47] x86emul: support AVX512{F, DQ} FP broadcast insns Jan Beulich
@ 2018-11-19 10:20   ` Jan Beulich
  2018-11-19 10:21   ` [PATCH v5 14/47] x86emul: support AVX512{F, BW} packed integer compare insns Jan Beulich
                     ` (33 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:20 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v4: Add missing avx512_vlen_check().
v3: New.
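
Not part of the patch, for illustration only: the stub invoked below
executes the insn natively and captures EFLAGS. This standalone sketch
shows the architectural ZF/PF/CF mapping the (v){,u}comis{s,d} insns
produce (the values used are the EFLAGS bit masks; the comi/ucomi
difference in #XM behaviour on quiet NaNs doesn't affect the flags):

    #include <math.h>
    #include <stdio.h>

    /* ZF = 0x40, PF = 0x04, CF = 0x01 in EFLAGS. */
    static unsigned int comis_flags(double a, double b)
    {
        if ( isnan(a) || isnan(b) )
            return 0x45;              /* unordered: ZF|PF|CF */
        if ( a > b )
            return 0x00;
        if ( a < b )
            return 0x01;              /* CF */
        return 0x40;                  /* ZF */
    }

    int main(void)
    {
        printf("%#x %#x %#x\n", comis_flags(1.0, 2.0),
               comis_flags(2.0, 1.0), comis_flags(NAN, 1.0));
        return 0;
    }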

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -107,6 +107,8 @@ static const struct test avx512f_all[] =
     INSN_FP(add,             0f, 58),
     INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
+    INSN(comisd,       66,   0f, 2f,    el,      q, el),
+    INSN(comiss,         ,   0f, 2f,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -166,6 +168,8 @@ static const struct test avx512f_all[] =
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
+    INSN(ucomisd,      66,   0f, 2e,    el,      q, el),
+    INSN(ucomiss,        ,   0f, 2e,    el,      d, el),
     INSN_PFP(unpckh,         0f, 15),
     INSN_PFP(unpckl,         0f, 14),
 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -299,7 +299,7 @@ static const struct twobyte_table {
     [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp },
+    [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp, simd_none, d8s_dq },
     [0x30 ... 0x35] = { ImplicitOps },
     [0x37] = { ImplicitOps },
     [0x38] = { DstReg|SrcMem|ModRM },
@@ -6119,24 +6119,34 @@ x86_emulate(
         }
 
         opc = init_prefixes(stub);
+        op_bytes = 4 << vex.pfx;
+    vcomi:
         opc[0] = b;
         opc[1] = modrm;
         if ( ea.type == OP_MEM )
         {
-            rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, vex.pfx ? 8 : 4,
-                           ctxt);
+            rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, op_bytes, ctxt);
             if ( rc != X86EMUL_OKAY )
                 goto done;
 
             /* Convert memory operand to (%rAX). */
             rex_prefix &= ~REX_B;
             vex.b = 1;
+            evex.b = 1;
             opc[1] &= 0x38;
         }
-        insn_bytes = PFX_BYTES + 2;
+        if ( evex_encoded() )
+        {
+            insn_bytes = EVEX_PFX_BYTES + 2;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 2;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
         opc[2] = 0xc3;
 
-        copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),
                     _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),
                     [eflags] "+g" (_regs.eflags),
@@ -6147,6 +6157,20 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2f): /* vcomis{s,d} xmm/mem,xmm */
+        generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
+                               (ea.type != OP_REG && evex.br) ||
+                               evex.w != evex.pfx),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        op_bytes = 4 << evex.w;
+        goto vcomi;
+
     case X86EMUL_OPC(0x0f, 0x30): /* wrmsr */
         generate_exception_if(!mode_ring0(), EXC_GP, 0);
         fail_if(ops->write_msr == NULL);






* [PATCH v5 14/47] x86emul: support AVX512{F, BW} packed integer compare insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (12 preceding siblings ...)
  2018-11-19 10:20   ` [PATCH v5 13/47] x86emul: support AVX512F v{, u}comis{d, s} insns Jan Beulich
@ 2018-11-19 10:21   ` Jan Beulich
  2018-11-19 19:09     ` Andrew Cooper
  2018-11-19 10:21   ` [PATCH v5 15/47] x86emul: support AVX512{F, BW} packed integer arithmetic insns Jan Beulich
                     ` (32 subsequent siblings)
  46 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:21 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Include VPTEST{,N}M{B,D,Q,W}, as these may once again be used by the
compiler for comparisons against all-zero vectors.

The table entries of a few more insns also get their .d8s fields set
right away, again in order not to split up the groups only to
re-combine them later.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
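
Not part of the patch, for illustration only: a standalone sketch of
the imm8 predicate encoding the new vpcmp{,u}* forms use, shown for
signed 32-bit elements (names are made up; the real insns write the
result to an opmask register):

    #include <stdint.h>
    #include <stdio.h>

    static unsigned int vpcmpd_mask(const int32_t *a, const int32_t *b,
                                    unsigned int nelem, unsigned int imm8)
    {
        unsigned int i, k = 0;

        for ( i = 0; i < nelem; ++i )
        {
            _Bool r = 0;

            switch ( imm8 & 7 )
            {
            case 0: r = a[i] == b[i]; break; /* EQ */
            case 1: r = a[i] <  b[i]; break; /* LT */
            case 2: r = a[i] <= b[i]; break; /* LE */
            case 3: r = 0;            break; /* FALSE */
            case 4: r = a[i] != b[i]; break; /* NE */
            case 5: r = a[i] >= b[i]; break; /* NLT */
            case 6: r = a[i] >  b[i]; break; /* NLE */
            case 7: r = 1;            break; /* TRUE */
            }
            k |= (unsigned int)r << i;
        }

        return k;
    }

    int main(void)
    {
        static const int32_t a[4] = { 1, 5, 3, 3 }, b[4] = { 2, 4, 3, 9 };

        printf("%#x\n", vpcmpd_mask(a, b, 4, 1)); /* LT -> 0x9 */
        return 0;
    }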

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -162,8 +162,16 @@ static const struct test avx512f_all[] =
     INSN_FP(mul,             0f, 59),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
+    INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
+    INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
+    INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
+    INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
+    INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
+    INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
+    INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -195,6 +203,14 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
+    INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
+    INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
+    INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
+    INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
+    INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
+    INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
 
 static const struct test avx512dq_all[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -313,14 +313,14 @@ static const struct twobyte_table {
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
-    [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
@@ -444,13 +444,13 @@ static const struct ext0f38_table {
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
-    [0x28 ... 0x29] = { .simd_size = simd_packed_int },
+    [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
-    [0x36 ... 0x3f] = { .simd_size = simd_packed_int },
+    [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int },
@@ -516,6 +516,7 @@ static const struct ext0f3a_table {
     [0x18] = { .simd_size = simd_128 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
+    [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
     [0x21] = { .simd_size = simd_other },
     [0x22] = { .simd_size = simd_none },
@@ -523,6 +524,7 @@ static const struct ext0f3a_table {
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
     [0x44] = { .simd_size = simd_packed_int },
@@ -6576,6 +6578,32 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x64): /* vpcmpeqb [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x65): /* vpcmpeqw [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x66): /* vpcmpeqd [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x74): /* vpcmpgtb [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x75): /* vpcmpgtw [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f,   0x76): /* vpcmpgtd [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x26): /* vptestm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x27): /* vptestm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x29): /* vpcmpeqq [xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x37): /* vpcmpgtq [xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( b & (ext == ext_0f38 ? 1 : 2) )
+        {
+            generate_exception_if(b != 0x27 && evex.w != (b & 1), EXC_UD);
+            goto avx512f_no_sae;
+        }
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << (ext == ext_0f ? b & 1 : evex.w);
+        avx512_vlen_check(false);
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x6e):    /* mov{d,q} r/m,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
     CASE_SIMD_PACKED_INT(0x0f, 0x7e):    /* mov{d,q} {,x}mm,r/m */
@@ -7582,6 +7610,7 @@ x86_emulate(
                               EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    avx512f_imm8_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
         avx512_vlen_check(false);
@@ -8756,6 +8785,19 @@ x86_emulate(
         break;
     }
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1e): /* vpcmpu{d,q} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1f): /* vpcmp{d,q} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3e): /* vpcmpu{b,w} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3f): /* vpcmp{b,w} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( !(b & 0x20) )
+            goto avx512f_imm8_no_sae;
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x20): /* pinsrb $imm8,r32/m8,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x22): /* pinsr{d,q} $imm8,r/m,xmm */
         host_and_vcpu_must_have(sse4_1);





* [PATCH v5 15/47] x86emul: support AVX512{F, BW} packed integer arithmetic insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (13 preceding siblings ...)
  2018-11-19 10:21   ` [PATCH v5 14/47] x86emul: support AVX512{F, BW} packed integer compare insns Jan Beulich
@ 2018-11-19 10:21   ` Jan Beulich
  2018-11-19 19:18     ` Andrew Cooper
  2018-11-19 10:22   ` [PATCH v5 16/47] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
                     ` (31 subsequent siblings)
  46 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:21 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note: vpadd* / vpsub* et al are placed at a seemingly wrong slot of
the big switch(). This is in anticipation of e.g. vpunpck* being
added to those groups later on (the nearby legacy/VEX-encoded case
labels support this placement).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Move a case block further down.
v3: New.
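
Not part of the patch, for illustration only: the element widths in
the hunks below fall out of the low opcode bits. A standalone sketch
of the GNU "x ?: y" form used for the 0f38 block (yielding x when
non-zero, else y), covering vpminsb/vpminuw/vpmaxsb/vpmaxuw:

    #include <stdio.h>

    int main(void)
    {
        unsigned int b;

        /* Opcodes 0x38/0x3c are byte forms, 0x3a/0x3e word forms. */
        for ( b = 0x38; b <= 0x3e; b += 2 )
            printf("opcode %#x -> elem_bytes %u\n", b, (b & 2) ?: 1);

        return 0;
    }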

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -160,6 +160,8 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(paddd,        66,   0f, fe,    vl,      d, vl),
+    INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
@@ -168,7 +170,16 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
+    INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
+    INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
+    INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
+    INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
+    INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSN(psubd,        66,   0f, fa,    vl,      d, vl),
+    INSN(psubq,        66,   0f, fb,    vl,      q, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
@@ -203,12 +214,39 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(paddb,       66,   0f, fc,    vl,    b, vl),
+    INSN(paddsb,      66,   0f, ec,    vl,    b, vl),
+    INSN(paddsw,      66,   0f, ed,    vl,    w, vl),
+    INSN(paddusb,     66,   0f, dc,    vl,    b, vl),
+    INSN(paddusw,     66,   0f, dd,    vl,    w, vl),
+    INSN(paddw,       66,   0f, fd,    vl,    w, vl),
+    INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
+    INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
     INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
+    INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
+    INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
+    INSN(pmaxub,      66,   0f, de,    vl,    b, vl),
+    INSN(pmaxuw,      66, 0f38, 3e,    vl,    w, vl),
+    INSN(pminsb,      66, 0f38, 38,    vl,    b, vl),
+    INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
+    INSN(pminub,      66,   0f, da,    vl,    b, vl),
+    INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
+    INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
+    INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
+    INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSN(psubb,       66,   0f, f8,    vl,    b, vl),
+    INSN(psubsb,      66,   0f, e8,    vl,    b, vl),
+    INSN(psubsw,      66,   0f, e9,    vl,    w, vl),
+    INSN(psubusb,     66,   0f, d8,    vl,    b, vl),
+    INSN(psubusw,     66,   0f, d9,    vl,    w, vl),
+    INSN(psubw,       66,   0f, f9,    vl,    w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
@@ -217,6 +255,7 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN_PFP(or,               0f, 56),
+    INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -367,21 +367,21 @@ static const struct twobyte_table {
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
-    [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xff] = { ModRM }
 };
 
@@ -451,7 +451,7 @@ static const struct ext0f38_table {
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x40] = { .simd_size = simd_packed_int },
+    [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int },
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
@@ -5978,6 +5978,10 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x39): /* vpmins{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3b): /* vpminu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3d): /* vpmaxs{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6578,6 +6582,31 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd8): /* vpsubusb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd9): /* vpsubusw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdc): /* vpaddusb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xdd): /* vpaddusw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe0): /* vpavgb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe3): /* vpavgw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe5): /* vpmulhw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe8): /* vpsubsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe9): /* vpsubsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xec): /* vpaddsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xed): /* vpaddsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf8): /* vpsubb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << (b & 1);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
         op_bytes = 16 << evex.lr;
@@ -6604,6 +6633,12 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x6e):    /* mov{d,q} r/m,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
     CASE_SIMD_PACKED_INT(0x0f, 0x7e):    /* mov{d,q} {,x}mm,r/m */
@@ -7825,6 +7860,12 @@ x86_emulate(
         op_bytes = vex.pfx ? 16 : 8;
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC(0x0f, 0xd4):        /* paddq mm/m64,mm */
     case X86EMUL_OPC(0x0f, 0xf4):        /* pmuludq mm/m64,mm */
     case X86EMUL_OPC(0x0f, 0xfb):        /* psubq mm/m64,mm */
@@ -7853,6 +7894,16 @@ x86_emulate(
         vcpu_must_have(mmxext);
         goto simd_0f_mmx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xda): /* vpminub [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xde): /* vpmaxub [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe4): /* vpmulhuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xea): /* vpminsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xee): /* vpmaxsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = b & 0x10 ? 1 : 2;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f, 0xe6):       /* cvttpd2dq xmm/mem,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe6):   /* vcvttpd2dq {x,y}mm/mem,xmm */
     case X86EMUL_OPC_F3(0x0f, 0xe6):       /* cvtdq2pd xmm/mem,xmm */
@@ -8227,6 +8278,20 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_2);
         goto simd_0f38_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x38): /* vpminsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3a): /* vpminuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3c): /* vpmaxsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x3e): /* vpmaxuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = b & 2 ?: 1;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x40): /* vpmull{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0xdb):     /* aesimc xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdb): /* vaesimc xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdc):     /* aesenc xmm/m128,xmm,xmm */





* [PATCH v5 16/47] x86emul: use simd_128 also for legacy vector shift insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (14 preceding siblings ...)
  2018-11-19 10:21   ` [PATCH v5 15/47] x86emul: support AVX512{F, BW} packed integer arithmetic insns Jan Beulich
@ 2018-11-19 10:22   ` Jan Beulich
  2018-11-19 19:21     ` Andrew Cooper
  2018-11-19 10:22   ` [PATCH v5 17/47] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
                     ` (30 subsequent siblings)
  46 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:22 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This eliminates a separate case block here, and allows getting away
with fewer new ones when the AVX512 vector shifts get added.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
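
Not part of the patch, for illustration only: a standalone sketch of
the decode special case being introduced - the shift insns' memory
operand is 16 bytes for any SSE/AVX encoding (i.e. with an opcode
prefix or VEX), and 8 bytes only for the prefix-less MMX forms
(parameter names are made up):

    #include <stdio.h>

    static unsigned int shift_src_bytes(unsigned int vex_opcx,
                                        unsigned int vex_pfx)
    {
        return vex_opcx || vex_pfx ? 16 : 8;
    }

    int main(void)
    {
        printf("MMX: %u, SSE/AVX: %u\n",
               shift_src_bytes(0, 0), shift_src_bytes(0, 1));
        return 0;
    }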

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -366,19 +366,19 @@ static const struct twobyte_table {
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128 },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
     [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -3344,7 +3344,8 @@ x86_decode(
         break;
 
     case simd_128:
-        op_bytes = 16;
+        /* The special case here is that of the MMX shift insns. */
+        op_bytes = vex.opcx || vex.pfx ? 16 : 8;
         break;
 
     case simd_256:
@@ -6473,6 +6474,12 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0x75): /* vpcmpeqw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0x76):    /* pcmpeqd {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x76): /* vpcmpeqd {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd1):    /* psrlw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd2):    /* psrld {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xd3):    /* psrlq {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xd4):     /* paddq xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xd4): /* vpaddq {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0xd5):    /* pmullw {,x}mm/mem,{,x}mm */
@@ -6495,6 +6502,10 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0xdf): /* vpandn {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe0):     /* pavgb xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe0): /* vpavgb {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xe1):    /* psraw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xe2):    /* psrad {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xe2): /* vpsrad xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe3):     /* pavgw xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe3): /* vpavgw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xe4):     /* pmulhuw xmm/m128,xmm */
@@ -6517,6 +6528,12 @@ x86_emulate(
     case X86EMUL_OPC_VEX_66(0x0f, 0xee): /* vpmaxsw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0xef):    /* pxor {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xef): /* vpxor {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf1):    /* psllw {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf2):    /* pslld {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf2): /* vpslld xmm/m128,{x,y}mm,{x,y}mm */
+    CASE_SIMD_PACKED_INT(0x0f, 0xf3):    /* psllq {,x}mm/mem,{,x}mm */
+    case X86EMUL_OPC_VEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xf4):     /* pmuludq xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f, 0xf4): /* vpmuludq {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_66(0x0f, 0xf6):     /* psadbw xmm/m128,xmm */
@@ -7841,25 +7858,6 @@ x86_emulate(
         }
         break;
 
-    CASE_SIMD_PACKED_INT(0x0f, 0xd1):    /* psrlw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xd2):    /* psrld {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xd3):    /* psrlq {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xe1):    /* psraw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xe2):    /* psrad {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xe2): /* vpsrad xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf1):    /* psllw {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf2):    /* pslld {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf2): /* vpslld xmm/m128,{x,y}mm,{x,y}mm */
-    CASE_SIMD_PACKED_INT(0x0f, 0xf3):    /* psllq {,x}mm/mem,{,x}mm */
-    case X86EMUL_OPC_VEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,{x,y}mm,{x,y}mm */
-        op_bytes = vex.pfx ? 16 : 8;
-        goto simd_0f_int;
-
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */





* [PATCH v5 17/47] x86emul: support AVX512{F, BW} shift/rotate insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (15 preceding siblings ...)
  2018-11-19 10:22   ` [PATCH v5 16/47] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
@ 2018-11-19 10:22   ` Jan Beulich
  2018-11-19 10:23   ` [PATCH v5 18/47] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
                     ` (29 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:22 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that simd_packed_fp is not really correct for opcode space 0f38
major opcodes 14 and 15, but it is sufficient for the purposes here.
Further adjustments may be needed later for the down-converting
unsigned saturating VPMOV* insns, first and foremost because of the
different Disp8 scaling they use.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
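
Not part of the patch, for illustration only: a standalone sketch of
the compressed displacement (Disp8*N) scaling the new .d8s values
express, limited to the fixed-scale flavor used by e.g. the m128
shift count operands (names are made up):

    #include <stdint.h>
    #include <stdio.h>

    /* For fixed tuples the scale factor is simply 1 << d8s. */
    static int32_t effective_disp(int8_t disp8, unsigned int d8s)
    {
        return (int32_t)disp8 * (1 << d8s);
    }

    int main(void)
    {
        /* The m128 shift counts below use d8s = 4, i.e. a factor of 16. */
        printf("%d\n", effective_disp(1, 4)); /* prints 16 */
        return 0;
    }

The d8s_vl values instead scale by the full vector length (or by the
element size when broadcasting), which is what the test harness
additions verify.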

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -178,6 +178,24 @@ static const struct test avx512f_all[] =
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSNX(prol,        66,   0f, 72, 1, vl,     dq, vl),
+    INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
+    INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
+    INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
+    INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
+    INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
+    INSNX(psllq,       66,   0f, 73, 6, vl,      q, vl),
+    INSN(psllv,        66, 0f38, 47,    vl,     dq, vl),
+    INSNX(psra,        66,   0f, 72, 4, vl,     dq, vl),
+    INSN(psrad,        66,   0f, e2,    el_4,    d, vl),
+    INSN(psraq,        66,   0f, e2,    el_2,    q, vl),
+    INSN(psrav,        66, 0f38, 46,    vl,     dq, vl),
+    INSN(psrld,        66,   0f, d2,    el_4,    d, vl),
+    INSNX(psrld,       66,   0f, 72, 2, vl,      d, vl),
+    INSN(psrlq,        66,   0f, d3,    el_2,    q, vl),
+    INSNX(psrlq,       66,   0f, 73, 2, vl,      q, vl),
+    INSN(psrlv,        66, 0f38, 45,    vl,     dq, vl),
     INSN(psubd,        66,   0f, fa,    vl,      d, vl),
     INSN(psubq,        66,   0f, fb,    vl,      q, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
@@ -241,6 +259,17 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
+    INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
+    INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
+    INSNX(psllw,      66,   0f, 71, 6, vl,    w, vl),
+    INSN(psravw,      66, 0f38, 11,    vl,    w, vl),
+    INSN(psraw,       66,   0f, e1,    el_8,  w, vl),
+    INSNX(psraw,      66,   0f, 71, 4, vl,    w, vl),
+    INSNX(psrldq,     66,   0f, 73, 3, vl,    b, vl),
+    INSN(psrlvw,      66, 0f38, 10,    vl,    w, vl),
+    INSN(psrlw,       66,   0f, d1,    el_8,  w, vl),
+    INSNX(psrlw,      66,   0f, 71, 2, vl,    w, vl),
     INSN(psubb,       66,   0f, f8,    vl,    b, vl),
     INSN(psubsb,      66,   0f, e8,    vl,    b, vl),
     INSN(psubsw,      66,   0f, e9,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -319,7 +319,7 @@ static const struct twobyte_table {
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
-    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
+    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
@@ -366,19 +366,19 @@ static const struct twobyte_table {
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
     [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -434,9 +434,9 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
-    [0x10] = { .simd_size = simd_packed_int },
+    [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
-    [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
+    [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x45 ... 0x47] = { .simd_size = simd_packed_int },
+    [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
@@ -5979,10 +5979,15 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x14): /* vprorv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x15): /* vprolv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x39): /* vpmins{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3b): /* vpminu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3d): /* vpmaxs{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6599,6 +6604,9 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
@@ -6898,6 +6906,37 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x71): /* Grp12 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsraw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512bw_shift_imm:
+            fault_suppression = false;
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512bw_imm;
+        }
+        goto unrecognized_insn;
+
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x72): /* Grp13 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpslld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(evex.w, EXC_UD);
+            /* fall through */
+        case 0: /* vpror{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 1: /* vprol{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsra{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512f_shift_imm:
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512f_imm8_no_sae;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x73):        /* Grp14 */
         switch ( modrm_reg & 7 )
         {
@@ -6923,6 +6962,19 @@ x86_emulate(
         }
         goto unrecognized_insn;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x73): /* Grp14 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(!evex.w, EXC_UD);
+            goto avx512f_shift_imm;
+        case 3: /* vpsrldq $imm8,{x,y}mm,{x,y}mm */
+        case 7: /* vpslldq $imm8,{x,y}mm,{x,y}mm */
+            goto avx512bw_shift_imm;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x77):        /* emms */
     case X86EMUL_OPC_VEX(0x0f, 0x77):    /* vzero{all,upper} */
         if ( vex.opcx != vex_none )
@@ -7858,6 +7910,16 @@ x86_emulate(
         }
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe2): /* vpsra{d,q} xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.br, EXC_UD);
+        fault_suppression = false;
+        if ( b == 0xe2 )
+            goto avx512f_no_sae;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8098,6 +8160,14 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
         generate_exception_if(evex.w || evex.br, EXC_UD);
     avx512_broadcast:
@@ -8855,6 +8925,7 @@ x86_emulate(
         generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
         if ( !(b & 0x20) )
             goto avx512f_imm8_no_sae;
+    avx512bw_imm:
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.br, EXC_UD);
         elem_bytes = 1 << evex.w;





* [PATCH v5 18/47] x86emul: support AVX512{F, BW, DQ} extract insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (16 preceding siblings ...)
  2018-11-19 10:22   ` [PATCH v5 17/47] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
@ 2018-11-19 10:23   ` Jan Beulich
  2018-11-19 10:23   ` [PATCH v5 19/47] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
                     ` (28 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:23 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Make use of d8s_dq64.
v3: New.
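
Not part of the patch, for illustration only: a standalone sketch of
the 128-bit lane selection vextract{f,i}32x4 / vextract{f,i}64x2
perform (masking/zeroing omitted; names are made up):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Only the low log2(lanes) bits of imm8 are significant. */
    static void extract_x4(uint32_t dst[4], const uint32_t *src,
                           unsigned int lanes, unsigned int imm8)
    {
        memcpy(dst, src + 4 * (imm8 & (lanes - 1)), 4 * sizeof(*dst));
    }

    int main(void)
    {
        uint32_t zmm[16], xmm[4];
        unsigned int i;

        for ( i = 0; i < 16; ++i )
            zmm[i] = i;

        extract_x4(xmm, zmm, 4, 2); /* lane 2 of a 4-lane zmm source */
        printf("%u %u %u %u\n",
               xmm[0], xmm[1], xmm[2], xmm[3]); /* 8 9 10 11 */
        return 0;
    }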

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -212,6 +212,7 @@ static const struct test avx512f_all[] =
 };
 
 static const struct test avx512f_128[] = {
+    INSN(extractps, 66, 0f3a, 17, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -221,10 +222,14 @@ static const struct test avx512f_128[] =
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+    INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
+    INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
+    INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -280,6 +285,12 @@ static const struct test avx512bw_all[]
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
 
+static const struct test avx512bw_128[] = {
+    INSN(pextrb, 66, 0f3a, 14, el, b, el),
+//       pextrw, 66,   0f, c5,     w
+    INSN(pextrw, 66, 0f3a, 15, el, w, el),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -288,13 +299,21 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_128[] = {
+    INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+};
+
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
+    INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
+    INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
@@ -632,7 +651,9 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, no128);
     RUN(avx512f, 512);
     RUN(avx512bw, all);
+    RUN(avx512bw, 128);
     RUN(avx512dq, all);
+    RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -512,9 +512,13 @@ static const struct ext0f3a_table {
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
-    [0x14 ... 0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1 },
+    [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
+    [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
+    [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
+    [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
     [0x18] = { .simd_size = simd_128 },
-    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
@@ -523,7 +527,8 @@ static const struct ext0f3a_table {
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
-    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
@@ -2667,6 +2672,8 @@ x86_decode_0f3a(
      ... X86EMUL_OPC_66(0, 0x17):     /* pextr*, extractps */
     case X86EMUL_OPC_VEX_66(0, 0x14)
      ... X86EMUL_OPC_VEX_66(0, 0x17): /* vpextr*, vextractps */
+    case X86EMUL_OPC_EVEX_66(0, 0x14)
+     ... X86EMUL_OPC_EVEX_66(0, 0x17): /* vpextr*, vextractps */
     case X86EMUL_OPC_VEX_F2(0, 0xf0): /* rorx */
         break;
 
@@ -8844,9 +8851,9 @@ x86_emulate(
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */
         rex_prefix &= ~REX_B;
-        vex.b = 1;
+        evex.b = vex.b = 1;
         if ( !mode_64bit() )
-            vex.w = 0;
+            evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
         opc[3] = 0xc3;
@@ -8856,7 +8863,10 @@ x86_emulate(
             --opc;
         }
 
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+            copy_EVEX(opc, evex);
+        else
+            copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=m" (dst.val) : "a" (&dst.val));
         put_stub(stub);
 
@@ -8876,6 +8886,52 @@ x86_emulate(
         opc = init_prefixes(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
+        generate_exception_if(ea.type != OP_REG, EXC_UD);
+        /* Convert to alternative encoding: We want to use a memory operand. */
+        evex.opcx = ext_0f3a;
+        b = 0x15;
+        modrm <<= 3;
+        evex.r = evex.b;
+        evex.R = evex.x;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x14): /* vpextrb $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x15): /* vpextrw $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x16): /* vpextr{d,q} $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x17): /* vextractps $imm8,xmm,r/m */
+        generate_exception_if((evex.lr || evex.reg != 0xf || !evex.RX ||
+                               evex.opmsk || evex.br),
+                              EXC_UD);
+        if ( !(b & 2) )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( !(b & 1) )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto pextr;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(evex.lr != 2 || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
     {
         uint32_t mxcsr;




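
For reference, the .d8s values in the tables above encode the EVEX
Disp8*N scaling as a (log2) memory operand size. A minimal sketch of
the effect, with a hypothetical helper name (this is not the
emulator's actual decode logic):

#include <stdint.h>

/*
 * With EVEX, an 8-bit displacement is implicitly scaled by the memory
 * operand size: e.g. .d8s = 4 means 16-byte granularity, as used by
 * the 128-bit extract forms above.
 */
static int32_t disp8_effective(int8_t disp8, unsigned int d8s)
{
    return disp8 * (1 << d8s);
}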

* [PATCH v5 19/47] x86emul: support AVX512{F, BW, DQ} insert insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (17 preceding siblings ...)
  2018-11-19 10:23   ` [PATCH v5 18/47] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
@ 2018-11-19 10:23   ` Jan Beulich
  2018-11-19 10:24   ` [PATCH v5 20/47] x86emul: basic AVX512F testing Jan Beulich
                     ` (27 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:23 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also correct the comment for the AVX form of VINSERTPS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Make use of d8s_dq64.
v3: New.
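
As a rough C model of the GPR-based insert forms handled here
(illustrative only, with a hypothetical function name; vpinsrd shown,
the other element widths behave analogously):

#include <stdint.h>

/*
 * vpinsrd $imm8,r/m32,xmm,xmm: copy the second source, then replace
 * the element selected by the low imm8 bits with the GPR/memory value.
 */
static void pinsrd(uint32_t dst[4], const uint32_t src2[4],
                   uint32_t val, unsigned int imm8)
{
    unsigned int i;

    for ( i = 0; i < 4; ++i )
        dst[i] = src2[i];
    dst[imm8 & 3] = val;
}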

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -213,6 +213,7 @@ static const struct test avx512f_all[] =
 
 static const struct test avx512f_128[] = {
     INSN(extractps, 66, 0f3a, 17, el,    d, el),
+    INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -224,12 +225,16 @@ static const struct test avx512f_no128[]
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
+    INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
+    INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
+    INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
+    INSN(inserti64x4,    66, 0f3a, 3a, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -289,6 +294,8 @@ static const struct test avx512bw_128[]
     INSN(pextrb, 66, 0f3a, 14, el, b, el),
 //       pextrw, 66,   0f, c5,     w
     INSN(pextrw, 66, 0f3a, 15, el, w, el),
+    INSN(pinsrb, 66, 0f3a, 20, el, b, el),
+    INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
 static const struct test avx512dq_all[] = {
@@ -301,6 +308,7 @@ static const struct test avx512dq_all[]
 
 static const struct test avx512dq_128[] = {
     INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+    INSN(pinsr, 66, 0f3a, 22, el, dq64, el),
 };
 
 static const struct test avx512dq_no128[] = {
@@ -308,12 +316,16 @@ static const struct test avx512dq_no128[
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
+    INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
+    INSN(inserti64x2,    66, 0f3a, 38, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
+    INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
+    INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -360,7 +360,7 @@ static const struct twobyte_table {
     [0xc1] = { DstMem|SrcReg|ModRM },
     [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
-    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
+    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int, 1 },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
     [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
@@ -516,17 +516,19 @@ static const struct ext0f3a_table {
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
     [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
-    [0x18] = { .simd_size = simd_128 },
+    [0x18] = { .simd_size = simd_128, .d8s = 4 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x20] = { .simd_size = simd_none },
-    [0x21] = { .simd_size = simd_other },
-    [0x22] = { .simd_size = simd_none },
+    [0x20] = { .simd_size = simd_none, .d8s = 0 },
+    [0x21] = { .simd_size = simd_other, .d8s = 2 },
+    [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
-    [0x38] = { .simd_size = simd_128 },
+    [0x38] = { .simd_size = simd_128, .d8s = 4 },
+    [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -2577,6 +2579,7 @@ x86_decode_twobyte(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
     case X86EMUL_OPC_VEX_66(0, 0xc4): /* vpinsrw */
+    case X86EMUL_OPC_EVEX_66(0, 0xc4): /* vpinsrw */
         state->desc = DstReg | SrcMem16;
         break;
 
@@ -2679,6 +2682,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x20):     /* pinsrb */
     case X86EMUL_OPC_VEX_66(0, 0x20): /* vpinsrb */
+    case X86EMUL_OPC_EVEX_66(0, 0x20): /* vpinsrb */
         state->desc = DstImplicit | SrcMem;
         if ( modrm_mod != 3 )
             state->desc |= ByteOp;
@@ -2686,6 +2690,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x22):     /* pinsr{d,q} */
     case X86EMUL_OPC_VEX_66(0, 0x22): /* vpinsr{d,q} */
+    case X86EMUL_OPC_EVEX_66(0, 0x22): /* vpinsr{d,q} */
         state->desc = DstImplicit | SrcMem;
         break;
 
@@ -7700,6 +7705,23 @@ x86_emulate(
         ea.type = OP_MEM;
         goto simd_0f_int_imm8;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc4):   /* vpinsrw $imm8,r32/m16,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x20): /* vpinsrb $imm8,r32/m8,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x22): /* vpinsr{d,q} $imm8,r/m,xmm,xmm */
+        generate_exception_if(evex.lr || evex.opmsk || evex.br, EXC_UD);
+        if ( b & 2 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        if ( !mode_64bit() )
+            evex.w = 0;
+        memcpy(mmvalp, &src.val, op_bytes);
+        ea.type = OP_MEM;
+        op_bytes = src.bytes;
+        d = SrcMem16; /* Fake for the common SIMD code below. */
+        state->simd_size = simd_other;
+        goto avx512f_imm8_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0xc5):      /* pextrw $imm8,{,x}mm,reg */
     case X86EMUL_OPC_VEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
         generate_exception_if(vex.l, EXC_UD);
@@ -8912,8 +8934,12 @@ x86_emulate(
         opc = init_evex(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x18): /* vinsertf32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinsertf64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x38): /* vinserti32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinserti64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
@@ -8922,8 +8948,12 @@ x86_emulate(
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1a): /* vinsertf32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinsertf64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3a): /* vinserti32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinserti64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
         if ( !evex.w )
@@ -9016,13 +9046,19 @@ x86_emulate(
         op_bytes = 4;
         goto simd_0f3a_common;
 
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
         op_bytes = 4;
         /* fall through */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x41): /* vdppd $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
+        op_bytes = 4;
+        generate_exception_if(evex.lr || evex.w || evex.opmsk || evex.br,
+                              EXC_UD);
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )





* [PATCH v5 20/47] x86emul: basic AVX512F testing
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (18 preceding siblings ...)
  2018-11-19 10:23   ` [PATCH v5 19/47] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
@ 2018-11-19 10:24   ` Jan Beulich
  2018-11-19 10:25   ` [PATCH v5 21/47] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
                     ` (26 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:24 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test various insns which have already been implemented.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Add VSQRT* tests.
v4: Make eq() also work for 4- and 8-byte integer element sizes.
v3: New.
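
The eq() helpers below all funnel a vector comparison into an opmask
register and compare the result against ALL_TRUE. A minimal C model of
that idea (illustrative only, hypothetical name):

#include <stdbool.h>

/*
 * Models vcmpps/vpcmpeqd & Co: each lane's comparison result becomes
 * one mask bit; the vectors are equal iff all ELEM_COUNT bits are set.
 */
static bool vec_eq(const float *x, const float *y, unsigned int nelem)
{
    unsigned long long mask = 0, all_true = ~0ULL >> (64 - nelem);
    unsigned int i;

    for ( i = 0; i < nelem; ++i )
        mask |= (unsigned long long)(x[i] == y[i]) << i;

    return mask == all_true;
}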

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -63,6 +63,9 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
+avx512f-vecs := 64
+avx512f-ints := 4 8
+avx512f-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
@@ -148,7 +151,7 @@ $(addsuffix .c,$(SG)):
 
 $(addsuffix .h,$(SIMD) $(FMA) $(SG)): simd.h
 
-xop.h: simd-fma.c
+xop.h avx512f.h: simd-fma.c
 
 endif # 32-bit override
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -2,7 +2,41 @@
 
 ENTRY(simd_test);
 
-#if VEC_SIZE == 8 && defined(__SSE__)
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if VEC_SIZE == 4
+#  define eq(x, y) ({ \
+    float x_ = (x)[0]; \
+    float __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpss $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif VEC_SIZE == 8
+#  define eq(x, y) ({ \
+    double x_ = (x)[0]; \
+    double __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpsd $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif FLOAT_SIZE == 4
+/*
+ * gcc's (up to at least 8.2) __builtin_ia32_cmpps256_mask() has an anomaly in
+ * that its return type is QI rather than UQI, and hence the value would get
+ * sign-extended before comparing to ALL_TRUE. The same oddity does not matter
+ * for __builtin_ia32_cmppd256_mask(), as only 4 bits are significant there.
+ * Hence the extra " & ALL_TRUE".
+ */
+#  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
+# elif FLOAT_SIZE == 8
+#  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif INT_SIZE == 4 || UINT_SIZE == 4
+#  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+# endif
+#elif VEC_SIZE == 8 && defined(__SSE__)
 # define to_bool(cmp) (__builtin_ia32_pmovmskb(cmp) == 0xff)
 #elif VEC_SIZE == 16
 # if defined(__AVX__) && defined(FLOAT_SIZE)
@@ -93,6 +127,56 @@ static inline bool _to_bool(byte_vec_t b
     touch(x); \
     __builtin_ia32_pfrcpit2(__builtin_ia32_pfrsqit1(__builtin_ia32_pfmul(t_, t_), x), t_); \
 })
+#elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
+# if FLOAT_SIZE == 4
+#  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
+# elif FLOAT_SIZE == 8
+#  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
+# endif
+#elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastss %1, %0" \
+          : "=v" (t_) : "m" (*(float[1]){ x }) ); \
+    t_; \
+})
+#  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
+#   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
+#   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#  endif
+# elif FLOAT_SIZE == 8
+#  if VEC_SIZE >= 32
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastsd %1, %0" : "=v" (t_) \
+          : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#  else
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#  endif
+#  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
+#   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
+#   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#  endif
+# endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
 # if VEC_SIZE == 32 && defined(__AVX__)
 #  if defined(__AVX2__)
@@ -191,7 +275,30 @@ static inline bool _to_bool(byte_vec_t b
 #  define sqrt(x) scalar_1op(x, "sqrtsd %[in], %[out]")
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE2__)
+#if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
+     defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 4 || UINT_SIZE == 4
+#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+# endif
+# if INT_SIZE == 4
+#  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
+#  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 4
+#  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# endif
+#elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
 #  define interleave_lo(x, y) ((vec_t)__builtin_ia32_punpcklbw128((vqi_t)(x), (vqi_t)(y)))
@@ -587,6 +694,10 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if defined(__AVX512F__) && defined(FLOAT_SIZE)
+# include "simd-fma.c"
+#endif
+
 int simd_test(void)
 {
     unsigned int i, j;
@@ -1034,7 +1145,8 @@ int simd_test(void)
 # endif
 #endif
 
-#if defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)
+#if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
+    (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
 #endif
 
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,9 +70,111 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE == 16
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
+# define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
+#elif VEC_SIZE == 32
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 256 ## s(a)
+#elif VEC_SIZE == 64
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 512 ## s(a)
+# define BR(n, s, a...)  __builtin_ia32_ ## n ## 512 ## s(a, 4)
+#endif
+#ifndef B_
+# define B_ B
+#endif
+#ifndef BR
+# define BR B
+# define BR_ B_
+#endif
+#ifndef BR_
+# define BR_ BR
+#endif
+
+#ifdef __AVX512F__
+
+/*
+ * The original plan was to effect use of EVEX encodings for scalar as well as
+ * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
+ * only of course) XMM16-XMM31 only. All sorts of compiler errors result when
+ * doing this with gcc 8.2. Therefore resort to injecting {evex} prefixes,
+ * which has the benefit of also working for 32-bit. Granted, there is a lot of
+ * escaping to get right here.
+ */
+asm ( ".macro override insn    \n\t"
+      ".macro $\\insn o:vararg \n\t"
+      ".purgem \\insn          \n\t"
+      "{evex} \\insn \\(\\)o   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\(\\))o     \n\t"
+      ".endm                   \n\t"
+      ".endm                   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\)o         \n\t"
+      ".endm                   \n\t"
+      ".endm" );
+
+#define OVR(n) asm ( "override v" #n )
+#define OVR_SFP(n) OVR(n ## sd); OVR(n ## ss)
+
+#ifdef __AVX512VL__
+# ifdef __AVX512BW__
+#  define OVR_BW(n) OVR(p ## n ## b); OVR(p ## n ## w)
+# else
+#  define OVR_BW(n)
+# endif
+# define OVR_DQ(n) OVR(p ## n ## d); OVR(p ## n ## q)
+# define OVR_VFP(n) OVR(n ## pd); OVR(n ## ps)
+#else
+# define OVR_BW(n)
+# define OVR_DQ(n)
+# define OVR_VFP(n)
+#endif
+
+#define OVR_FMA(n, w) OVR_ ## w(n ## 132); OVR_ ## w(n ## 213); \
+                      OVR_ ## w(n ## 231)
+#define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
+#define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
+
+OVR_SFP(broadcast);
+OVR_SFP(comi);
+OVR_FP(add);
+OVR_FP(div);
+OVR(extractps);
+OVR_FMA(fmadd, FP);
+OVR_FMA(fmsub, FP);
+OVR_FMA(fnmadd, FP);
+OVR_FMA(fnmsub, FP);
+OVR(insertps);
+OVR_FP(max);
+OVR_FP(min);
+OVR(movd);
+OVR(movq);
+OVR_SFP(mov);
+OVR_FP(mul);
+OVR_FP(sqrt);
+OVR_FP(sub);
+OVR_SFP(ucomi);
+
+#undef OVR_VFP
+#undef OVR_SFP
+#undef OVR_INT
+#undef OVR_FP
+#undef OVR_FMA
+#undef OVR_DQ
+#undef OVR_BW
+#undef OVR
+
+#endif
+
 /*
  * Suppress value propagation by the compiler, preventing unwanted
  * optimization. This at once makes the compiler use memory operands
  * more often, which for our purposes is the more interesting case.
  */
 #define touch(var) asm volatile ( "" : "+m" (var) )
+
+static inline vec_t undef(void)
+{
+    vec_t v = v;
+    return v;
+}
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -1,10 +1,9 @@
+#if !defined(__XOP__) && !defined(__AVX512F__)
 #include "simd.h"
-
-#ifndef __XOP__
 ENTRY(fma_test);
 #endif
 
-#if VEC_SIZE < 16
+#if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
 #elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
@@ -24,7 +23,13 @@ ENTRY(fma_test);
 # define eq(x, y) to_bool((x) == (y))
 #endif
 
-#if VEC_SIZE == 16
+#if defined(__AVX512F__) && VEC_SIZE > FLOAT_SIZE
+# if FLOAT_SIZE == 4
+#  define fmaddsub(x, y, z) BR(vfmaddsubps, _mask, x, y, z, ~0)
+# elif FLOAT_SIZE == 8
+#  define fmaddsub(x, y, z) BR(vfmaddsubpd, _mask, x, y, z, ~0)
+# endif
+#elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
 #  define addsub(x, y) __builtin_ia32_addsubps(x, y)
 #  if defined(__FMA4__) || defined(__FMA__)
@@ -50,6 +55,10 @@ ENTRY(fma_test);
 # endif
 #endif
 
+#if defined(fmaddsub) && !defined(addsub)
+# define addsub(x, y) fmaddsub(x, broadcast(1), y)
+#endif
+
 int fma_test(void)
 {
     unsigned int i;
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -21,6 +21,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-opmask.h"
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
+#include "avx512f.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -248,6 +249,14 @@ static const struct {
     SIMD(OPMASK/b,    avx512dq_opmask,         1),
     SIMD(OPMASK/d,    avx512bw_opmask,         4),
     SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(AVX512F f32 scalar,  avx512f,        f4),
+    SIMD(AVX512F f32x16,      avx512f,      64f4),
+    SIMD(AVX512F f64 scalar,  avx512f,        f8),
+    SIMD(AVX512F f64x8,       avx512f,      64f8),
+    SIMD(AVX512F s32x16,      avx512f,      64i4),
+    SIMD(AVX512F u32x16,      avx512f,      64u4),
+    SIMD(AVX512F s64x8,       avx512f,      64i8),
+    SIMD(AVX512F u64x8,       avx512f,      64u8),
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v5 21/47] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (19 preceding siblings ...)
  2018-11-19 10:24   ` [PATCH v5 20/47] x86emul: basic AVX512F testing Jan Beulich
@ 2018-11-19 10:25   ` Jan Beulich
  2018-11-19 10:25   ` [PATCH v5 22/47] x86emul: basic AVX512VL testing Jan Beulich
                     ` (25 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:25 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the pbroadcastw table entry in evex-disp8.c is slightly
different from what one would expect, because the insn requires EVEX.W
to be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
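
A rough C model of the memory-operand broadcast forms (illustrative
only, hypothetical name; vpbroadcastd shown). A single element is read
and replicated across all lanes, which is also why the Disp8 scaling
of these table entries is element-sized rather than derived from the
vector length:

#include <stdint.h>

/* vpbroadcastd m32,[xyz]mm: replicate one dword across the vector. */
static void pbroadcastd(uint32_t *dst, unsigned int nelem,
                        const uint32_t *src)
{
    unsigned int i;

    for ( i = 0; i < nelem; ++i )
        dst[i] = *src;
}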

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -164,6 +164,9 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+//       pbroadcast,   66, 0f38, 7c,          dq64
+    INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
+    INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
     INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
     INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
@@ -222,6 +225,7 @@ static const struct test avx512f_128[] =
 
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcasti32x4, 66, 0f38, 5a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
@@ -231,6 +235,7 @@ static const struct test avx512f_no128[]
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(broadcasti64x4, 66, 0f38, 5b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
     INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
@@ -250,6 +255,10 @@ static const struct test avx512bw_all[]
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
+//       pbroadcastb, 66, 0f38, 7a,           b
+    INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
+//       pbroadcastw, 66, 0f38, 7b,           b
     INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
@@ -301,6 +310,7 @@ static const struct test avx512bw_128[]
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
+    INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
@@ -314,6 +324,7 @@ static const struct test avx512dq_128[]
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(broadcasti64x2, 66, 0f38, 5a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
     INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
@@ -322,6 +333,7 @@ static const struct test avx512dq_no128[
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(broadcasti32x8, 66, 0f38, 5b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
     INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -278,9 +278,33 @@ static inline bool _to_bool(byte_vec_t b
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if INT_SIZE == 4 || UINT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastd %1, %0" \
+          : "=v" (t_) : "m" (*(int[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(long long[1]){ x }) ); \
+    t_; \
+})
+#  ifdef __x86_64__
+#   define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastq %1, %0" : "=v" (t_) : "r" ((x) + 0ULL) ); \
+    t_; \
+})
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
@@ -977,10 +1001,14 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
-#if defined(broadcast)
+#ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#ifdef broadcast2
+    if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -454,9 +454,13 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
-    [0x5a] = { .simd_size = simd_128, .two_op = 1 },
-    [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
+    [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
+    [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
+    [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x78] = { .simd_size = simd_other, .two_op = 1 },
+    [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
+    [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -2627,6 +2631,11 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
+    case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
+    case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
+        break;
+
     case 0xf0: /* movbe / crc32 */
         state->desc |= repne_prefix() ? ByteOp : Mov;
         if ( rep_prefix() )
@@ -8198,6 +8207,8 @@ x86_emulate(
         goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
+        op_bytes = elem_bytes;
         generate_exception_if(evex.w || evex.br, EXC_UD);
     avx512_broadcast:
         /*
@@ -8217,17 +8228,27 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
                                             /* vbroadcastf64x4 m256,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
+                                            /* vbroadcasti64x4 m256,zmm{k} */
         generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
                                             /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(!evex.lr, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
+                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
+        if ( b == 0x59 )
+            op_bytes = 8;
+        generate_exception_if(evex.br, EXC_UD);
         if ( !evex.w )
             host_and_vcpu_must_have(avx512dq);
         goto avx512_broadcast;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
                                             /* vbroadcastf64x2 m128,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
         generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.br,
                               EXC_UD);
         if ( evex.w )
@@ -8421,6 +8442,45 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+        op_bytes = elem_bytes = 1 << (b & 1);
+        /* See the comment at the avx512_broadcast label. */
+        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
+        generate_exception_if((ea.type != OP_REG || evex.br ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0xf8;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "+m" (src.val) : "a" (src.val));
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v5 22/47] x86emul: basic AVX512VL testing
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (20 preceding siblings ...)
  2018-11-19 10:25   ` [PATCH v5 21/47] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
@ 2018-11-19 10:25   ` Jan Beulich
  2018-11-19 10:26   ` [PATCH v5 23/47] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
                     ` (24 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:25 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test the 128- and 256-bit variants of the insns which have already
been implemented.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.
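
The REN() helper introduced below renames mnemonics at the assembler
level. As a sketch, REN(pand, , q) should expand to the equivalent of
(assuming the usual gas macro semantics):

asm ( ".macro vpand o:vararg \n\t"
      "vpandq \\o            \n\t"
      ".endm" );

i.e. any "vpand" subsequently emitted by the compiler assembles as the
EVEX-only "vpandq", so the 128- and 256-bit paths really exercise EVEX
encodings.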

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -63,7 +63,7 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
-avx512f-vecs := 64
+avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
 
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -5,13 +5,13 @@ ENTRY(fma_test);
 
 #if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
-#elif VEC_SIZE == 16
+#elif VEC_SIZE == 16 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
 #  define to_bool(cmp) __builtin_ia32_vtestcpd(cmp, (vec_t){} == 0)
 # endif
-#elif VEC_SIZE == 32
+#elif VEC_SIZE == 32 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps256(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -539,7 +539,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define rotr(x, n) ((vec_t)__builtin_ia32_palignr128((vdi_t)(x), (vdi_t)(x), (n) * 64))
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE4_1__)
+#if VEC_SIZE == 16 && defined(__SSE4_1__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)__builtin_ia32_pmaxsb128((vqi_t)(x), (vqi_t)(y)))
 #  define min(x, y) ((vec_t)__builtin_ia32_pminsb128((vqi_t)(x), (vqi_t)(y)))
@@ -593,7 +593,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define mix(x, y) __builtin_ia32_blendpd(x, y, 0b10)
 # endif
 #endif
-#if VEC_SIZE == 32 && defined(__AVX__)
+#if VEC_SIZE == 32 && defined(__AVX__) && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define dot_product(x, y) ({ \
     vec_t t_ = __builtin_ia32_dpps256(x, y, 0b11110001); \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -92,6 +92,15 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+#if VEC_SIZE < 64
+# pragma GCC target ( "avx512vl" )
+#endif
+
+#define REN(insn, old, new)                      \
+    asm ( ".macro v" #insn #old " o:vararg \n\t" \
+          "v" #insn #new " \\o             \n\t" \
+          ".endm" )
+
 /*
  * The original plan was to effect use of EVEX encodings for scalar as well as
  * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
@@ -135,25 +144,88 @@ asm ( ".macro override insn    \n\t"
 #define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
 #define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
 
+OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
+OVR_INT(add);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
+OVR_FMA(fmaddsub, VFP);
 OVR_FMA(fmsub, FP);
+OVR_FMA(fmsubadd, VFP);
 OVR_FMA(fnmadd, FP);
 OVR_FMA(fnmsub, FP);
 OVR(insertps);
 OVR_FP(max);
+OVR_INT(maxs);
+OVR_INT(maxu);
 OVR_FP(min);
+OVR_INT(mins);
+OVR_INT(minu);
 OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
+OVR_VFP(mova);
+OVR_VFP(movnt);
+OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(shuf);
+OVR_INT(sll);
+OVR_DQ(sllv);
 OVR_FP(sqrt);
+OVR_INT(sra);
+OVR_DQ(srav);
+OVR_INT(srl);
+OVR_DQ(srlv);
 OVR_FP(sub);
+OVR_INT(sub);
 OVR_SFP(ucomi);
+OVR_VFP(unpckh);
+OVR_VFP(unpckl);
+
+#ifdef __AVX512VL__
+# if ELEM_SIZE == 8 && defined(__AVX512DQ__)
+REN(extract, f128, f64x2);
+REN(extract, i128, i64x2);
+REN(insert, f128, f64x2);
+REN(insert, i128, i64x2);
+# else
+REN(extract, f128, f32x4);
+REN(extract, i128, i32x4);
+REN(insert, f128, f32x4);
+REN(insert, i128, i32x4);
+# endif
+# if ELEM_SIZE == 8
+REN(movdqa, , 64);
+REN(movdqu, , 64);
+REN(pand, , q);
+REN(pandn, , q);
+REN(por, , q);
+REN(pxor, , q);
+# else
+#  if ELEM_SIZE == 1 && defined(__AVX512BW__)
+REN(movdq, a, u8);
+REN(movdqu, , 8);
+#  elif ELEM_SIZE == 2 && defined(__AVX512BW__)
+REN(movdq, a, u16);
+REN(movdqu, , 16);
+#  else
+REN(movdqa, , 32);
+REN(movdqu, , 32);
+#  endif
+REN(pand, , d);
+REN(pandn, , d);
+REN(por, , d);
+REN(pxor, , d);
+# endif
+OVR(movntdq);
+OVR(movntdqa);
+OVR(pmulld);
+OVR(pmuldq);
+OVR(pmuludq);
+#endif
 
 #undef OVR_VFP
 #undef OVR_SFP
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -88,6 +88,11 @@ static bool simd_check_avx512f(void)
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
 
+static bool simd_check_avx512f_vl(void)
+{
+    return cpu_has_avx512f && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512dq(void)
 {
     return cpu_has_avx512dq;
@@ -142,11 +147,21 @@ static const struct {
       .check_cpu = simd_check_ ## feat,                             \
       .set_regs = simd_set_regs,                                    \
       .check_regs = simd_check_regs }
+#define AVX512VL_(bits, desc, feat, form)                          \
+    { .code = feat ## _x86_ ## bits ## _D ## _ ## form,            \
+      .size = sizeof(feat ## _x86_ ## bits ## _D ## _ ## form),    \
+      .bitness = bits, .name = "AVX512" #desc,                     \
+      .check_cpu = simd_check_ ## feat ## _vl,                     \
+      .set_regs = simd_set_regs,                                   \
+      .check_regs = simd_check_regs }
 #ifdef __x86_64__
 # define SIMD(desc, feat, form) SIMD_(64, desc, feat, form), \
                                 SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(64, desc, feat, form), \
+                                    AVX512VL_(32, desc, feat, form)
 #else
 # define SIMD(desc, feat, form) SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(32, desc, feat, form)
 #endif
     SIMD(3DNow! single,          _3dnow,     8f4),
     SIMD(SSE scalar single,      sse,         f4),
@@ -257,6 +272,20 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    AVX512VL(VL f32x4,        avx512f,      16f4),
+    AVX512VL(VL f64x2,        avx512f,      16f8),
+    AVX512VL(VL f32x8,        avx512f,      32f4),
+    AVX512VL(VL f64x4,        avx512f,      32f8),
+    AVX512VL(VL s32x4,        avx512f,      16i4),
+    AVX512VL(VL u32x4,        avx512f,      16u4),
+    AVX512VL(VL s32x8,        avx512f,      32i4),
+    AVX512VL(VL u32x8,        avx512f,      32u4),
+    AVX512VL(VL s64x2,        avx512f,      16i8),
+    AVX512VL(VL u64x2,        avx512f,      16u8),
+    AVX512VL(VL s64x4,        avx512f,      32i8),
+    AVX512VL(VL u64x4,        avx512f,      32u8),
+#undef AVX512VL_
+#undef AVX512VL
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v5 23/47] x86emul: support AVX512{F, BW} zero- and sign-extending moves
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (21 preceding siblings ...)
  2018-11-19 10:25   ` [PATCH v5 22/47] x86emul: basic AVX512VL testing Jan Beulich
@ 2018-11-19 10:26   ` Jan Beulich
  2018-11-19 10:26   ` [PATCH v5 24/47] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
                     ` (23 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:26 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the testing in simd.c doesn't really follow the ISA extension
pattern: to fit the scheme, extensions from byte- and word-granular
vectors can (currently) sensibly happen only in the AVX512BW case (and
hence the respective abstraction macros will be added there rather than
here).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
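
A minimal C model of the sign-extending flavor (illustrative only,
hypothetical name; vpmovsxdq shown, the byte/word forms behave the
same way on their respective element widths):

#include <stdint.h>

/*
 * vpmovsxdq: each 32-bit source element is sign-extended into a 64-bit
 * destination element; the memory operand is half (vl_2) the size of
 * the destination vector, matching the Disp8 scaling below.
 */
static void pmovsxdq(int64_t *dst, const int32_t *src,
                     unsigned int nelem)
{
    unsigned int i;

    for ( i = 0; i < nelem; ++i )
        dst[i] = src[i]; /* implicit sign extension */
}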

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,6 +177,16 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
+    INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
+    INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
+    INSN(pmovzxwq,     66, 0f38, 34,    vl_4,    w, vl),
+    INSN(pmovzxdq,     66, 0f38, 35,    vl_2, d_nb, vl),
     INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
@@ -274,6 +284,8 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -443,13 +443,23 @@ static const struct ext0f38_table {
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
+    [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x23] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x24] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
-    [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
+    [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x32] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x33] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x34] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x35] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
@@ -8314,6 +8324,25 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x25): /* vpmovsxdq {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
+        elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -311,10 +311,12 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxdq, _mask, x, (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 4
 #  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -222,6 +222,16 @@ REN(pxor, , d);
 # endif
 OVR(movntdq);
 OVR(movntdqa);
+OVR(pmovsxbd);
+OVR(pmovsxbq);
+OVR(pmovsxdq);
+OVR(pmovsxwd);
+OVR(pmovsxwq);
+OVR(pmovzxbd);
+OVR(pmovzxbq);
+OVR(pmovzxdq);
+OVR(pmovzxwd);
+OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);





* [PATCH v5 24/47] x86emul: support AVX512{F, BW} down conversion moves
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (22 preceding siblings ...)
  2018-11-19 10:26   ` [PATCH v5 23/47] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
@ 2018-11-19 10:26   ` Jan Beulich
  2018-11-19 10:28   ` [PATCH v5 25/47] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
                     ` (22 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:26 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the vpmov{,s,us}{d,q}w table entries in evex-disp8.c are
slightly different from what one would expect, because these insns
require EVEX.W to be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Also adjust x86_insn_is_mem_write().
v4: Also #UD when evex.z is set with a memory operand.
v3: New.
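
A rough C model of the three down-conversion flavors (illustrative
only, hypothetical names; the quad-to-dword forms shown):

#include <stdint.h>

/* vpmovqd: plain truncation. */
static void pmovqd(uint32_t *dst, const uint64_t *src, unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
        dst[i] = src[i];
}

/* vpmovsqd: signed saturation. */
static void pmovsqd(int32_t *dst, const int64_t *src, unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
        dst[i] = src[i] < INT32_MIN ? INT32_MIN
                 : src[i] > INT32_MAX ? INT32_MAX : src[i];
}

/* vpmovusqd: unsigned saturation. */
static void pmovusqd(uint32_t *dst, const uint64_t *src, unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
        dst[i] = src[i] > UINT32_MAX ? UINT32_MAX : src[i];
}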

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,11 +177,26 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovdb,       f3, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovdw,       f3, 0f38, 33,    vl_2,    b, vl),
+    INSN(pmovqb,       f3, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovqd,       f3, 0f38, 35,    vl_2, d_nb, vl),
+    INSN(pmovqw,       f3, 0f38, 34,    vl_4,    b, vl),
+    INSN(pmovsdb,      f3, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsdw,      f3, 0f38, 23,    vl_2,    b, vl),
+    INSN(pmovsqb,      f3, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsqd,      f3, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovsqw,      f3, 0f38, 24,    vl_4,    b, vl),
     INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
     INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
     INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
     INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
     INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovusdb,     f3, 0f38, 11,    vl_4,    b, vl),
+    INSN(pmovusdw,     f3, 0f38, 13,    vl_2,    b, vl),
+    INSN(pmovusqb,     f3, 0f38, 12,    vl_8,    b, vl),
+    INSN(pmovusqd,     f3, 0f38, 15,    vl_2, d_nb, vl),
+    INSN(pmovusqw,     f3, 0f38, 14,    vl_4,    b, vl),
     INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
     INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
     INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
@@ -284,7 +299,10 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+    INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -277,6 +277,17 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextracti{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextracti32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -291,6 +302,7 @@ static inline bool _to_bool(byte_vec_t b
 })
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -720,6 +732,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if VEC_SIZE >= 16
+
+# if !defined(low_half) && defined(HALF_SIZE)
+static inline half_t low_half(vec_t x)
+{
+#  if HALF_SIZE < VEC_SIZE
+    half_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 2; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1087,6 +1120,21 @@ int simd_test(void)
 
 #endif
 
+#if defined(widen1) && defined(shrink1)
+    {
+        half_t aux1 = low_half(src), aux2;
+
+        touch(aux1);
+        x = widen1(aux1);
+        touch(x);
+        aux2 = shrink1(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 2; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
 #ifdef dup_lo
     touch(src);
     x = dup_lo(src);
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,23 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE >= 16
+
+# if ELEM_COUNT >= 2
+#  if VEC_SIZE > 32
+#   define HALF_SIZE (VEC_SIZE / 2)
+#  else
+#   define HALF_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(HALF_SIZE))) half_t;
+typedef char __attribute__((vector_size(HALF_SIZE))) vqi_half_t;
+typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
+typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
+typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+# endif
+
+#endif
+
 #if VEC_SIZE == 16
 # define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
 # define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3063,7 +3063,22 @@ x86_decode(
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
             if ( evex_encoded() )
-                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+            {
+                /*
+                 * VPMOVUS* are identical to VPMOVS* Disp8-scaling-wise, but
+                 * their attributes don't match those of the vex_66 encoded
+                 * insns with the same base opcodes. Rather than adding new
+                 * columns to the table, handle this here for now.
+                 */
+                if ( evex.pfx != vex_f3 || (b & 0xf8) != 0x10 )
+                    disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+                else
+                {
+                    disp8scale = decode_disp8scale(ext0f38_table[b + 0x10].d8s,
+                                                   state);
+                    state->simd_size = simd_other;
+                }
+            }
             break;
 
         case ext_0f3a:
@@ -8324,10 +8339,14 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10): /* vpmovuswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20): /* vpmovswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30): /* vpmovwb [xyz]mm,{x,y}mm/mem{k} */
         host_and_vcpu_must_have(avx512bw);
-        /* fall through */
+        if ( evex.pfx != vex_f3 )
+        {
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
@@ -8338,7 +8357,28 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
-        generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+            generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        }
+        else
+        {
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x11): /* vpmovusdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x12): /* vpmovusqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x13): /* vpmovusdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x14): /* vpmovusqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* vpmovusqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x21): /* vpmovsdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x22): /* vpmovsqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x23): /* vpmovsdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x24): /* vpmovsqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* vpmovsqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x31): /* vpmovdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x32): /* vpmovqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x33): /* vpmovdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x34): /* vpmovqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* vpmovqd [xyz]mm,{x,y}mm/mem{k} */
+            generate_exception_if(evex.w || (ea.type == OP_MEM && evex.z), EXC_UD);
+            d = DstMem | SrcReg | TwoOp;
+        }
         op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
@@ -10164,6 +10204,12 @@ x86_insn_is_mem_write(const struct x86_e
     case X86EMUL_OPC(0x0f, 0xab):        /* BTS */
     case X86EMUL_OPC(0x0f, 0xb3):        /* BTR */
     case X86EMUL_OPC(0x0f, 0xbb):        /* BTC */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* VPMOVUS* */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* VPMOVS* */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* VPMOV{D,Q,W}* */
         return true;
 
     case 0xd9:
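
For reference, the three down-conversion flavours can be modelled per
lane roughly as below (an illustrative sketch only, qword-to-dword
case; the helper names are made up and appear nowhere in the patch):

#include <stdint.h>

/* VPMOVQD: plain truncation. */
static uint32_t pmov_trunc(uint64_t src)
{
    return (uint32_t)src;
}

/* VPMOVSQD: signed saturation. */
static uint32_t pmov_sat_s(int64_t src)
{
    if ( src > INT32_MAX )
        return INT32_MAX;
    if ( src < INT32_MIN )
        return (uint32_t)INT32_MIN;
    return (uint32_t)src;
}

/* VPMOVUSQD: unsigned saturation. */
static uint32_t pmov_sat_us(uint64_t src)
{
    return src > UINT32_MAX ? UINT32_MAX : (uint32_t)src;
}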


* [PATCH v5 25/47] x86emul: support AVX512{F, BW} integer unpack insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (23 preceding siblings ...)
  2018-11-19 10:26   ` [PATCH v5 24/47] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
@ 2018-11-19 10:28   ` Jan Beulich
  2018-11-19 10:28   ` [PATCH v5 26/47] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
                     ` (21 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:28 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

There's once again one extra twobyte_table[] entry which gets its Disp8
shift value set right away, without support for the insn itself being
implemented just yet, so as to avoid needlessly splitting groups of
entries.
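
As a reminder of the semantics being wired up, the dword unpacks can be
modelled per 128-bit lane roughly as below (an illustrative sketch
only; the EVEX encodings repeat this for each 128-bit lane, and the
function names are made up):

#include <stdint.h>

/* One 128-bit lane of (v)punpckldq: interleave the low halves. */
static void punpckldq_lane(uint32_t dst[4],
                           const uint32_t s1[4], const uint32_t s2[4])
{
    dst[0] = s1[0]; dst[1] = s2[0];
    dst[2] = s1[1]; dst[3] = s2[1];
}

/* One 128-bit lane of (v)punpckhdq: interleave the high halves. */
static void punpckhdq_lane(uint32_t dst[4],
                           const uint32_t s1[4], const uint32_t s2[4])
{
    dst[0] = s1[2]; dst[1] = s2[2];
    dst[2] = s1[3]; dst[3] = s2[3];
}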

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -229,6 +229,10 @@ static const struct test avx512f_all[] =
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
+    INSN(punpckhdq,    66,   0f, 6a,    vl,      d, vl),
+    INSN(punpckhqdq,   66,   0f, 6d,    vl,      q, vl),
+    INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
+    INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -327,6 +331,10 @@ static const struct test avx512bw_all[]
     INSN(psubw,       66,   0f, f9,    vl,    w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
+    INSN(punpckhbw,   66,   0f, 68,    vl,    b, vl),
+    INSN(punpckhwd,   66,   0f, 69,    vl,    w, vl),
+    INSN(punpcklbw,   66,   0f, 60,    vl,    b, vl),
+    INSN(punpcklwd,   66,   0f, 61,    vl,    w, vl),
 };
 
 static const struct test avx512bw_128[] = {
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -300,6 +300,10 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
@@ -317,6 +321,10 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -252,6 +252,10 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(punpckhdq);
+OVR(punpckhqdq);
+OVR(punpckldq);
+OVR(punpcklqdq);
 #endif
 
 #undef OVR_VFP
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -312,10 +312,10 @@ static const struct twobyte_table {
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
+    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
@@ -6650,6 +6650,12 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6678,6 +6684,13 @@ x86_emulate(
         elem_bytes = 1 << (b & 1);
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x62): /* vpunpckldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6a): /* vpunpckhdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        fault_suppression = false;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
         op_bytes = 16 << evex.lr;
@@ -6704,6 +6717,10 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */


* [PATCH v5 26/47] x86emul: support AVX512{F, BW, _VBMI} full permute insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (24 preceding siblings ...)
  2018-11-19 10:28   ` [PATCH v5 25/47] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
@ 2018-11-19 10:28   ` Jan Beulich
  2018-11-19 10:29   ` [PATCH v5 27/47] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
                     ` (20 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:28 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Take the liberty of also correcting the (public interface) name of the
AVX512_VBMI feature flag, on the assumption that no external consumer
has actually been using that flag so far. Furthermore make AVX512BW
instead of AVX512F its prerequisite, since the insns require full
64-bit mask registers (the upper 48 bits of which can't be accessed
other than through XSAVE/XRSTOR without AVX512BW support).
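
The full-permute semantics being emulated can be sketched roughly as
below (illustrative only; n stands for the element count, and the
function name is made up):

#include <stdint.h>

/* VPERMI2D: the index vector doubles as the destination, and selects
 * from the 2n-element concatenation of tables a and b. */
static void vpermi2d_model(uint32_t idx_dst[], const uint32_t a[],
                           const uint32_t b[], unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
    {
        unsigned int id = idx_dst[i] & (2 * n - 1);

        idx_dst[i] = id < n ? a[id] : b[id - n];
    }
}

VPERMT2* differs only in operand roles: the indices come from a
separate register, while the first table operand is the one being
overwritten.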

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Re-base.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -173,6 +173,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
+    INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -294,6 +298,8 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
+    INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -378,6 +384,11 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512_vbmi_all[] = {
+    INSN(permi2b,       66, 0f38, 75, vl, b, vl),
+    INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -718,4 +729,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -150,6 +150,9 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#  else
+#   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
+#   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -175,6 +178,9 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#  else
+#   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
+#   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -303,6 +309,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -324,6 +333,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
@@ -769,6 +781,7 @@ int simd_test(void)
 {
     unsigned int i, j;
     vec_t x, y, z, src, inv, alt, sh;
+    vint_t interleave_lo, interleave_hi;
 
     for ( i = 0, j = ELEM_SIZE << 3; i < ELEM_COUNT; ++i )
     {
@@ -782,6 +795,9 @@ int simd_test(void)
         if ( !(i & (i + 1)) )
             --j;
         sh[i] = j;
+
+        interleave_lo[i] = ((i & 1) * ELEM_COUNT) | (i >> 1);
+        interleave_hi[i] = interleave_lo[i] + (ELEM_COUNT / 2);
     }
 
     touch(src);
@@ -1075,7 +1091,7 @@ int simd_test(void)
     x = src * alt;
     y = interleave_lo(x, alt < 0);
     touch(x);
-    z = widen1(x);
+    z = widen1(low_half(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1107,7 +1123,7 @@ int simd_test(void)
 
 # ifdef widen1
     touch(src);
-    x = widen1(src);
+    x = widen1(low_half(src));
     touch(src);
     if ( !eq(x, y) ) return __LINE__;
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,16 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if ELEM_SIZE == 1
+typedef vqi_t vint_t;
+#elif ELEM_SIZE == 2
+typedef vhi_t vint_t;
+#elif ELEM_SIZE == 4
+typedef vsi_t vint_t;
+#elif ELEM_SIZE == 8
+typedef vdi_t vint_t;
+#endif
+
 #if VEC_SIZE >= 16
 
 # if ELEM_COUNT >= 2
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -136,6 +136,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
+#define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -468,9 +468,13 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
     [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
+    [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -1862,6 +1866,7 @@ static bool vcpu_has(
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
+#define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -6019,6 +6024,11 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x76): /* vpermi2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x77): /* vpermi2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7e): /* vpermt2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7f): /* vpermt2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8528,6 +8538,16 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512_vbmi);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -107,6 +107,9 @@
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
+/* CPUID level 0x00000007:0.ecx */
+#define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
+
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
 
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -224,7 +224,7 @@ XEN_CPUFEATURE(AVX512VL,      5*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.ecx, word 6 */
 XEN_CPUFEATURE(PREFETCHWT1,   6*32+ 0) /*A  PREFETCHWT1 instruction */
-XEN_CPUFEATURE(AVX512VBMI,    6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -259,12 +259,17 @@ def crunch_numbers(state):
         AVX2: [AVX512F],
 
         # AVX512F is taken to mean hardware support for 512bit registers
-        # (which in practice depends on the EVEX prefix to encode), and the
-        # instructions themselves. All further AVX512 features are built on
-        # top of AVX512F
+        # (which in practice depends on the EVEX prefix to encode) as well
+        # as mask registers, and the instructions themselves. All further
+        # AVX512 features are built on top of AVX512F
         AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
-                  AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
-                  AVX512_4FMAPS, AVX512_VPOPCNTDQ],
+                  AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
+                  AVX512_VPOPCNTDQ],
+
+        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # dependents of AVX512BW (as they require wider than 16-bit mask
+        # registers), despite the SDM not formally making this connection.
+        AVX512BW: [AVX512_VBMI],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors


* [PATCH v5 27/47] x86emul: support AVX512{F, BW} integer shuffle insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (25 preceding siblings ...)
  2018-11-19 10:28   ` [PATCH v5 26/47] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
@ 2018-11-19 10:29   ` Jan Beulich
  2018-11-19 10:30   ` [PATCH v5 28/47] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
                     ` (19 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:29 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also include shuff{32x4,64x2}, as they are very similar to
shufi{32x4,64x2}.
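
The underlying shuffle primitive can be sketched roughly as below
(illustrative only; EVEX-encoded vpshufd repeats this for each 128-bit
lane, and the function name is made up):

#include <stdint.h>

/* One 128-bit lane of (v)pshufd: imm8 holds four 2-bit source element
 * selectors. */
static void pshufd_lane(uint32_t dst[4], const uint32_t src[4],
                        uint8_t imm8)
{
    unsigned int i;

    for ( i = 0; i < 4; ++i )
        dst[i] = src[(imm8 >> (2 * i)) & 3];
}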

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Re-base over changes earlier in the series.
v4: Move OVR() addition into __AVX512VL__ conditional. Correct comments.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -214,6 +214,7 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
     INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
@@ -264,6 +265,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
+    INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
+    INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
+    INSN(shufi64x2,      66, 0f3a, 43, vl,    q, vl),
 };
 
 static const struct test avx512f_512[] = {
@@ -318,6 +323,9 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSN(pshufb,      66, 0f38, 00,    vl,    b, vl),
+    INSN(pshufhw,     f3,   0f, 70,    vl,    w, vl),
+    INSN(pshuflw,     f2,   0f, 70,    vl,    w, vl),
     INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
     INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
     INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -153,6 +153,10 @@ static inline bool _to_bool(byte_vec_t b
 #  else
 #   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
+#   define swap(x) ({ \
+    vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
+})
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -181,6 +185,10 @@ static inline bool _to_bool(byte_vec_t b
 #  else
 #   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
+#   define swap(x) ({ \
+    vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
+})
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -309,9 +317,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
+                               VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
+                             0b00011011, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -333,9 +346,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b01001110, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
+                                      VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -119,6 +119,12 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+/* Sadly there are a few exceptions to the general naming rules. */
+#define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
+#define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
+#define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
+#define __builtin_ia32_shuf_i64x2_512_mask __builtin_ia32_shuf_i64x2_mask
+
 #if VEC_SIZE < 64
 # pragma GCC target ( "avx512vl" )
 #endif
@@ -262,6 +268,7 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(pshufd);
 OVR(punpckhdq);
 OVR(punpckhqdq);
 OVR(punpckldq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -318,7 +318,7 @@ static const struct twobyte_table {
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
+    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other, d8s_vl },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
@@ -432,7 +432,8 @@ static const struct ext0f38_table {
     uint8_t vsib:1;
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
-    [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
@@ -543,6 +544,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
+    [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
@@ -552,6 +554,7 @@ static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
+    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6671,6 +6674,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6926,6 +6930,20 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x70): /* vpshufd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x70): /* vpshufhw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x70): /* vpshuflw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx == vex_66 )
+            generate_exception_if(evex.w, EXC_UD);
+        else
+        {
+            host_and_vcpu_must_have(avx512bw);
+            generate_exception_if(evex.br, EXC_UD);
+        }
+        d = (d & ~SrcMask) | SrcMem | TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_imm8_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x71):    /* Grp12 */
     case X86EMUL_OPC_VEX_66(0x0f, 0x71):
     CASE_SIMD_PACKED_INT(0x0f, 0x72):    /* Grp13 */
@@ -9110,7 +9128,13 @@ x86_emulate(
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x23): /* vshuff32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshuff64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x43): /* vshufi32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshufi64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 

* [PATCH v5 28/47] x86emul: support AVX512{BW, DQ} mask move insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (26 preceding siblings ...)
  2018-11-19 10:29   ` [PATCH v5 27/47] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
@ 2018-11-19 10:30   ` Jan Beulich
  2018-11-19 10:30   ` [PATCH v5 29/47] x86emul: basic AVX512BW testing Jan Beulich
                     ` (18 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:30 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Entries are added to the tables in evex-disp8.c despite these insns not
allowing for memory operands, so that in the end the tables give a
complete picture of the supported EVEX-encoded insns.
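
What the new opmask.S check round-trips can be modelled roughly as
below (illustrative only, byte case; the function names are made up):

#include <stdint.h>

/* VPMOVM2B: materialize each mask bit as an all-ones/all-zeros byte. */
static void vpmovm2b_model(uint8_t v[], uint64_t k, unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
        v[i] = (k >> i) & 1 ? 0xff : 0;
}

/* VPMOVB2M: gather each byte's most significant bit into a mask. */
static uint64_t vpmovb2m_model(const uint8_t v[], unsigned int n)
{
    uint64_t k = 0;
    unsigned int i;

    for ( i = 0; i < n; ++i )
        k |= (uint64_t)(v[i] >> 7) << i;

    return k;
}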

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -314,9 +314,12 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+//       pmovb2m,     f3, 0f38, 29,           b
+//       pmovm2,      f3, 0f38, 28,          bw
     INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+//       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
@@ -364,6 +367,9 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
+//       pmovd2m,        f3, 0f38, 39,        d
+//       pmovm2,         f3, 0f38, 38,       dq
+//       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
--- a/tools/tests/x86_emulator/opmask.S
+++ b/tools/tests/x86_emulator/opmask.S
@@ -12,17 +12,23 @@
 
 #if SIZE == 1
 # define _(x) x##b
+# define _v(x, t) _v_(x##q, t)
 #elif SIZE == 2
 # define _(x) x##w
+# define _v(x, t) _v_(x##d, t)
 # define WIDEN(x) x##bw
 #elif SIZE == 4
 # define _(x) x##d
+# define _v(x, t) _v_(x##w, t)
 # define WIDEN(x) x##wd
 #elif SIZE == 8
 # define _(x) x##q
+# define _v(x, t) _v_(x##b, t)
 # define WIDEN(x) x##dq
 #endif
 
+#define _v_(x, t) v##x##t
+
     .macro check res1:req, res2:req, line:req
     _(kmov)       %\res1, DATA(out)
 #if SIZE < 8 || !defined(__i386__)
@@ -131,6 +137,15 @@ _start:
 
 #endif
 
+#if SIZE > 2 ? defined(__AVX512BW__) : defined(__AVX512DQ__)
+
+    _(kmov)       DATA(in1), %k0
+    _v(pmovm2,)   %k0, %zmm7
+    _v(pmov,2m)   %zmm7, %k3
+    check         k0, k3, __LINE__
+
+#endif
+
     xor           %eax, %eax
     ret
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -8428,6 +8428,21 @@ x86_emulate(
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x29): /* vpmov{b,w}2m [xyz]mm,k */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x39): /* vpmov{d,q}2m [xyz]mm,k */
+        generate_exception_if(!evex.r || !evex.R, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x28): /* vpmovm2{b,w} k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x38): /* vpmovm2{d,q} k,[xyz]mm */
+        if ( b & 0x10 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.opmsk || ea.type != OP_REG, EXC_UD);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);


* [PATCH v5 29/47] x86emul: basic AVX512BW testing
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (27 preceding siblings ...)
  2018-11-19 10:30   ` [PATCH v5 28/47] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
@ 2018-11-19 10:30   ` Jan Beulich
  2018-11-19 10:31   ` [PATCH v5 30/47] x86emul: basic AVX512DQ testing Jan Beulich
                     ` (17 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:30 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have been implemented already.
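
In particular the new widen/shrink round trips rely on the property
sketched below (illustrative only, byte/word case; the function names
are made up):

#include <stdint.h>

/* VPMOVZXBW: zero-extend n bytes to n words. */
static void pmovzxbw_model(uint16_t dst[], const uint8_t src[],
                           unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
        dst[i] = src[i];
}

/* VPMOVWB: truncate n words back to n bytes, undoing the above, so
 * the original bytes are reproduced. */
static void pmovwb_model(uint8_t dst[], const uint16_t src[],
                         unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
        dst[i] = (uint8_t)src[i];
}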

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Add __AVX512VL__ conditional around majority of OVR() additions.
    Correct eq() for 1- and 2-byte cases.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -66,6 +66,9 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512bw-vecs := $(avx512f-vecs)
+avx512bw-ints := 1 2
+avx512bw-flts :=
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -31,6 +31,10 @@ ENTRY(simd_test);
 #  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
 # elif FLOAT_SIZE == 8
 #  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif (INT_SIZE == 1 || UINT_SIZE == 1) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# elif (INT_SIZE == 2 || UINT_SIZE == 2) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqw, _mask, (vhi_t)(x), (vhi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 4 || UINT_SIZE == 4
 #  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 8 || UINT_SIZE == 8
@@ -374,6 +378,87 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # endif
+#elif (INT_SIZE == 1 || UINT_SIZE == 1 || INT_SIZE == 2 || UINT_SIZE == 2) && \
+      defined(__AVX512BW__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 1 || UINT_SIZE == 1
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastb %1, %0" \
+          : "=v" (t_) : "m" (*(char[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastb %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  elif defined(__AVX512VBMI__)
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
+#  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+# elif INT_SIZE == 2 || UINT_SIZE == 2
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastw %1, %0" \
+          : "=v" (t_) : "m" (*(short[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastw %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(pshufhw, _mask, \
+                                      B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
+                                      0b00011011, (vhi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+# endif
+# if INT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovsxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 2
+#  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
+#  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
+#  define widen1(x) ((vec_t)B(pmovsxwd, _mask, x, (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxwq, _mask, x, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 2
+#  define max(x, y) ((vec_t)B(pmaxuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define mul_hi(x, y) ((vec_t)B(pmulhuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxwd, _mask, (vhi_half_t)(x), (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxwq, _mask, (vhi_quarter_t)(x), (vdi_t)undef(), ~0))
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
@@ -565,7 +650,7 @@ static inline bool _to_bool(byte_vec_t b
 #  endif
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSSE3__)
+#if VEC_SIZE == 16 && defined(__SSSE3__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define abs(x) ((vec_t)__builtin_ia32_pabsb128((vqi_t)(x)))
 # elif INT_SIZE == 2
@@ -789,6 +874,40 @@ static inline half_t low_half(vec_t x)
 }
 # endif
 
+# if !defined(low_quarter) && defined(QUARTER_SIZE)
+static inline quarter_t low_quarter(vec_t x)
+{
+#  if QUARTER_SIZE < VEC_SIZE
+    quarter_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+# if !defined(low_eighth) && defined(EIGHTH_SIZE)
+static inline eighth_t low_eighth(vec_t x)
+{
+#  if EIGHTH_SIZE < VEC_SIZE
+    eighth_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 8; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
 #endif
 
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
@@ -1117,7 +1236,7 @@ int simd_test(void)
     y = interleave_lo(alt < 0, alt < 0);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen2(x);
+    z = widen2(low_quarter(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1126,7 +1245,7 @@ int simd_test(void)
     y = interleave_lo(y, y);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen3(x);
+    z = widen3(low_eighth(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 #  endif
@@ -1148,14 +1267,14 @@ int simd_test(void)
 
 # ifdef widen2
     touch(src);
-    x = widen2(src);
+    x = widen2(low_quarter(src));
     touch(src);
     if ( !eq(x, z) ) return __LINE__;
 # endif
 
 # ifdef widen3
     touch(src);
-    x = widen3(src);
+    x = widen3(low_eighth(src));
     touch(src);
     if ( !eq(x, interleave_lo(z, (vec_t){})) ) return __LINE__;
 # endif
@@ -1175,6 +1294,36 @@ int simd_test(void)
             if ( aux2[i] != src[i] )
                 return __LINE__;
     }
+#endif
+
+#if defined(widen2) && defined(shrink2)
+    {
+        quarter_t aux1 = low_quarter(src), aux2;
+
+        touch(aux1);
+        x = widen2(aux1);
+        touch(x);
+        aux2 = shrink2(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 4; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
+#if defined(widen3) && defined(shrink3)
+    {
+        eighth_t aux1 = low_eighth(src), aux2;
+
+        touch(aux1);
+        x = widen3(aux1);
+        touch(x);
+        aux2 = shrink3(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 8; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
 #endif
 
 #ifdef dup_lo
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -95,6 +95,32 @@ typedef int __attribute__((vector_size(H
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
 # endif
 
+# if ELEM_COUNT >= 4
+#  if VEC_SIZE > 64
+#   define QUARTER_SIZE (VEC_SIZE / 4)
+#  else
+#   define QUARTER_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(QUARTER_SIZE))) quarter_t;
+typedef char __attribute__((vector_size(QUARTER_SIZE))) vqi_quarter_t;
+typedef short __attribute__((vector_size(QUARTER_SIZE))) vhi_quarter_t;
+typedef int __attribute__((vector_size(QUARTER_SIZE))) vsi_quarter_t;
+typedef long long __attribute__((vector_size(QUARTER_SIZE))) vdi_quarter_t;
+# endif
+
+# if ELEM_COUNT >= 8
+#  if VEC_SIZE > 128
+#   define EIGHTH_SIZE (VEC_SIZE / 8)
+#  else
+#   define EIGHTH_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(EIGHTH_SIZE))) eighth_t;
+typedef char __attribute__((vector_size(EIGHTH_SIZE))) vqi_eighth_t;
+typedef short __attribute__((vector_size(EIGHTH_SIZE))) vhi_eighth_t;
+typedef int __attribute__((vector_size(EIGHTH_SIZE))) vsi_eighth_t;
+typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
+# endif
+
 #endif
 
 #if VEC_SIZE == 16
@@ -182,6 +208,9 @@ OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
 OVR_INT(add);
+OVR_BW(adds);
+OVR_BW(addus);
+OVR_BW(avg);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
@@ -214,6 +243,8 @@ OVR_INT(srl);
 OVR_DQ(srlv);
 OVR_FP(sub);
 OVR_INT(sub);
+OVR_BW(subs);
+OVR_BW(subus);
 OVR_SFP(ucomi);
 OVR_VFP(unpckh);
 OVR_VFP(unpckl);
@@ -275,6 +306,31 @@ OVR(punpckldq);
 OVR(punpcklqdq);
 #endif
 
+#ifdef __AVX512BW__
+OVR(pextrb);
+OVR(pextrw);
+OVR(pinsrb);
+OVR(pinsrw);
+# ifdef __AVX512VL__
+OVR(pmaddwd);
+OVR(pmovsxbw);
+OVR(pmovzxbw);
+OVR(pmulhuw);
+OVR(pmulhw);
+OVR(pmullw);
+OVR(psadbw);
+OVR(pshufb);
+OVR(pshufhw);
+OVR(pshuflw);
+OVR(punpckhbw);
+OVR(punpckhwd);
+OVR(punpcklbw);
+OVR(punpcklwd);
+OVR(slldq);
+OVR(srldq);
+# endif
+#endif
+
 #undef OVR_VFP
 #undef OVR_SFP
 #undef OVR_INT
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512bw.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -105,6 +106,11 @@ static bool simd_check_avx512bw(void)
 }
 #define simd_check_avx512bw_opmask simd_check_avx512bw
 
+static bool simd_check_avx512bw_vl(void)
+{
+    return cpu_has_avx512bw && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -284,6 +290,18 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512BW s8x64,     avx512bw,      64i1),
+    SIMD(AVX512BW u8x64,     avx512bw,      64u1),
+    SIMD(AVX512BW s16x32,    avx512bw,      64i2),
+    SIMD(AVX512BW u16x32,    avx512bw,      64u2),
+    AVX512VL(BW+VL s8x16,    avx512bw,      16i1),
+    AVX512VL(BW+VL u8x16,    avx512bw,      16u1),
+    AVX512VL(BW+VL s8x32,    avx512bw,      32i1),
+    AVX512VL(BW+VL u8x32,    avx512bw,      32u1),
+    AVX512VL(BW+VL s16x8,    avx512bw,      16i2),
+    AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
+    AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
+    AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_


* [PATCH v5 30/47] x86emul: basic AVX512DQ testing
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (28 preceding siblings ...)
  2018-11-19 10:30   ` [PATCH v5 29/47] x86emul: basic AVX512BW testing Jan Beulich
@ 2018-11-19 10:31   ` Jan Beulich
  2018-11-19 10:31   ` [PATCH v5 31/47] x86emul: support AVX512F move high/low insns Jan Beulich
                     ` (16 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:31 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have been implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Re-base over changes earlier in the series.
v4: Wrap OVR(pmullq) in __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -69,9 +69,12 @@ avx512f-flts := 4 8
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
+avx512dq-vecs := $(avx512f-vecs)
+avx512dq-ints := $(avx512f-ints)
+avx512dq-flts := $(avx512f-flts)
 
 avx512f-opmask-vecs := 2
-avx512dq-opmask-vecs := 1
+avx512dq-opmask-vecs := 1 2
 avx512bw-opmask-vecs := 4 8
 
 # For AVX and later, have the compiler avoid XMM0 to widen coverage of
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -121,6 +121,34 @@ typedef int __attribute__((vector_size(E
 typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
 # endif
 
+# define DECL_PAIR(w) \
+typedef w ## _t pair_t; \
+typedef vsi_ ## w ## _t vsi_pair_t; \
+typedef vdi_ ## w ## _t vdi_pair_t
+# define DECL_QUARTET(w) \
+typedef w ## _t quartet_t; \
+typedef vsi_ ## w ## _t vsi_quartet_t; \
+typedef vdi_ ## w ## _t vdi_quartet_t
+# define DECL_OCTET(w) \
+typedef w ## _t octet_t; \
+typedef vsi_ ## w ## _t vsi_octet_t; \
+typedef vdi_ ## w ## _t vdi_octet_t
+
+# if ELEM_COUNT == 4
+DECL_PAIR(half);
+# elif ELEM_COUNT == 8
+DECL_PAIR(quarter);
+DECL_QUARTET(half);
+# elif ELEM_COUNT == 16
+DECL_PAIR(eighth);
+DECL_QUARTET(quarter);
+DECL_OCTET(half);
+# endif
+
+# undef DECL_OCTET
+# undef DECL_QUARTET
+# undef DECL_PAIR
+
 #endif
 
 #if VEC_SIZE == 16
@@ -146,6 +174,14 @@ typedef long long __attribute__((vector_
 #ifdef __AVX512F__
 
 /* Sadly there are a few exceptions to the general naming rules. */
+#define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
+#define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+#define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
+#define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
+#define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
+#define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
+#define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
+#define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
 #define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 #define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 #define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -331,6 +367,20 @@ OVR(srldq);
 # endif
 #endif
 
+#ifdef __AVX512DQ__
+OVR_VFP(and);
+OVR_VFP(andn);
+OVR_VFP(or);
+OVR(pextrd);
+OVR(pextrq);
+OVR(pinsrd);
+OVR(pinsrq);
+# ifdef __AVX512VL__
+OVR(pmullq);
+# endif
+OVR_VFP(xor);
+#endif
+
 #undef OVR_VFP
 #undef OVR_SFP
 #undef OVR_INT
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -139,6 +139,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
+     (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if FLOAT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -146,6 +167,17 @@ static inline bool _to_bool(byte_vec_t b
           : "=v" (t_) : "m" (*(float[1]){ x }) ); \
     t_; \
 })
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcastf32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
+#   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
+#  endif
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
@@ -155,6 +187,13 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
 #  else
+#   define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
+#   define insert_pair(x, y, p) \
+    B(insertf32x4_, _mask, x, \
+      /* Cast needed below to work around gcc 7.x quirk. */ \
+      (p) & 1 ? (typeof(y))__builtin_ia32_shufps(y, y, 0b01000100) : (y), \
+      (p) >> 1, x, 3 << ((p) * 2))
+#   define insert_quartet(x, y, p) B(insertf32x4_, _mask, x, y, p, undef(), ~0)
 #   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #   define swap(x) ({ \
@@ -178,6 +217,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) B(broadcastf64x2_, _mask, x, undef(), ~0)
+#   define insert_pair(x, y, p) B(insertf64x2_, _mask, x, y, p, undef(), ~0)
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
+#   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
+#  endif
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
@@ -306,6 +353,16 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 # endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextracti32x4 */ || \
+       (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -318,11 +375,30 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  ifdef __AVX512DQ__
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcasti32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) ((vec_t)B(broadcasti32x8_, _mask, (vsi_octet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_octet(x, y, p) ((vec_t)B(inserti32x8_, _mask, (vsi_t)(x), (vsi_octet_t)(y), p, (vsi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti32x4_, _mask, (vsi_quartet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_pair(x, y, p) \
+    (vec_t)(B(inserti32x4_, _mask, (vsi_t)(x), \
+              /* First cast needed below to work around gcc 7.x quirk. */ \
+              (p) & 1 ? (vsi_pair_t)__builtin_ia32_pshufd((vsi_pair_t)(y), 0b01000100) \
+                      : (vsi_pair_t)(y), \
+              (p) >> 1, (vsi_t)(x), 3 << ((p) * 2)))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti32x4_, _mask, (vsi_t)(x), (vsi_quartet_t)(y), p, (vsi_t)undef(), ~0))
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
@@ -347,6 +423,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ((vec_t)B(broadcasti64x2_, _mask, (vdi_pair_t)(x), (vdi_t)undef(), ~0))
+#   define insert_pair(x, y, p) ((vec_t)B(inserti64x2_, _mask, (vdi_t)(x), (vdi_pair_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti64x4_, , (vdi_quartet_t)(x), (vdi_t)undef(), ~0))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti64x4_, _mask, (vdi_t)(x), (vdi_quartet_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
@@ -898,7 +982,7 @@ static inline eighth_t low_eighth(vec_t
     eighth_t y;
     unsigned int i;
 
-    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+    for ( i = 0; i < ELEM_COUNT / 8; ++i )
         y[i] = x[i];
 
     return y;
@@ -910,6 +994,50 @@ static inline eighth_t low_eighth(vec_t
 
 #endif
 
+#ifdef broadcast_pair
+# if ELEM_COUNT == 4
+#  define broadcast_half broadcast_pair
+# elif ELEM_COUNT == 8
+#  define broadcast_quarter broadcast_pair
+# elif ELEM_COUNT == 16
+#  define broadcast_eighth broadcast_pair
+# endif
+#endif
+
+#ifdef insert_pair
+# if ELEM_COUNT == 4
+#  define insert_half insert_pair
+# elif ELEM_COUNT == 8
+#  define insert_quarter insert_pair
+# elif ELEM_COUNT == 16
+#  define insert_eighth insert_pair
+# endif
+#endif
+
+#ifdef broadcast_quartet
+# if ELEM_COUNT == 8
+#  define broadcast_half broadcast_quartet
+# elif ELEM_COUNT == 16
+#  define broadcast_quarter broadcast_quartet
+# endif
+#endif
+
+#ifdef insert_quartet
+# if ELEM_COUNT == 8
+#  define insert_half insert_quartet
+# elif ELEM_COUNT == 16
+#  define insert_quarter insert_quartet
+# endif
+#endif
+
+#if defined(broadcast_octet) && ELEM_COUNT == 16
+# define broadcast_half broadcast_octet
+#endif
+
+#if defined(insert_octet) && ELEM_COUNT == 16
+# define insert_half insert_octet
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1205,6 +1333,60 @@ int simd_test(void)
     if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#if defined(broadcast_half) && defined(insert_half)
+    {
+        half_t aux = low_half(src);
+
+        touch(aux);
+        x = broadcast_half(aux);
+        touch(aux);
+        y = insert_half(src, aux, 1);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_quarter) && defined(insert_quarter)
+    {
+        quarter_t aux = low_quarter(src);
+
+        touch(aux);
+        x = broadcast_quarter(aux);
+        touch(aux);
+        y = insert_quarter(src, aux, 1);
+        touch(aux);
+        y = insert_quarter(y, aux, 2);
+        touch(aux);
+        y = insert_quarter(y, aux, 3);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_eighth) && defined(insert_eighth) && \
+    /* At least gcc 7.3 "optimizes" away all insert_eighth() calls below. */ \
+    __GNUC__ >= 8
+    {
+        eighth_t aux = low_eighth(src);
+
+        touch(aux);
+        x = broadcast_eighth(aux);
+        touch(aux);
+        y = insert_eighth(src, aux, 1);
+        touch(aux);
+        y = insert_eighth(y, aux, 2);
+        touch(aux);
+        y = insert_eighth(y, aux, 3);
+        touch(aux);
+        y = insert_eighth(y, aux, 4);
+        touch(aux);
+        y = insert_eighth(y, aux, 5);
+        touch(aux);
+        y = insert_eighth(y, aux, 6);
+        touch(aux);
+        y = insert_eighth(y, aux, 7);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -23,6 +23,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
 #include "avx512bw.h"
+#include "avx512dq.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -100,6 +101,11 @@ static bool simd_check_avx512dq(void)
 }
 #define simd_check_avx512dq_opmask simd_check_avx512dq
 
+static bool simd_check_avx512dq_vl(void)
+{
+    return cpu_has_avx512dq && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -267,9 +273,10 @@ static const struct {
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
     SIMD(OPMASK/w,     avx512f_opmask,         2),
-    SIMD(OPMASK/b,    avx512dq_opmask,         1),
-    SIMD(OPMASK/d,    avx512bw_opmask,         4),
-    SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(OPMASK+DQ/b, avx512dq_opmask,         1),
+    SIMD(OPMASK+DQ/w, avx512dq_opmask,         2),
+    SIMD(OPMASK+BW/d, avx512bw_opmask,         4),
+    SIMD(OPMASK+BW/q, avx512bw_opmask,         8),
     SIMD(AVX512F f32 scalar,  avx512f,        f4),
     SIMD(AVX512F f32x16,      avx512f,      64f4),
     SIMD(AVX512F f64 scalar,  avx512f,        f8),
@@ -302,6 +309,24 @@ static const struct {
     AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
     AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
     AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
+    SIMD(AVX512DQ f32x16,    avx512dq,      64f4),
+    SIMD(AVX512DQ f64x8,     avx512dq,      64f8),
+    SIMD(AVX512DQ s32x16,    avx512dq,      64i4),
+    SIMD(AVX512DQ u32x16,    avx512dq,      64u4),
+    SIMD(AVX512DQ s64x8,     avx512dq,      64i8),
+    SIMD(AVX512DQ u64x8,     avx512dq,      64u8),
+    AVX512VL(DQ+VL f32x4,    avx512dq,      16f4),
+    AVX512VL(DQ+VL f64x2,    avx512dq,      16f8),
+    AVX512VL(DQ+VL f32x8,    avx512dq,      32f4),
+    AVX512VL(DQ+VL f64x4,    avx512dq,      32f8),
+    AVX512VL(DQ+VL s32x4,    avx512dq,      16i4),
+    AVX512VL(DQ+VL u32x4,    avx512dq,      16u4),
+    AVX512VL(DQ+VL s32x8,    avx512dq,      32i4),
+    AVX512VL(DQ+VL u32x8,    avx512dq,      32u4),
+    AVX512VL(DQ+VL s64x2,    avx512dq,      16i8),
+    AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
+    AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
+    AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
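
A wrinkle in the insert_pair() definitions above may be worth spelling
out: vinsert{f,i}32x4 can only replace a whole 128-bit lane, so placing
a 64-bit pair at an odd position first duplicates the pair across the
lane (the 0b01000100 shuffle) and then relies on the write mask
(3 << ((p) * 2)) to merge just the two target elements. A scalar model
of the mechanism (illustrative only):

#include <stdio.h>

/* Model of insert_pair() for 16 x 4-byte elements; p selects one of
 * eight pair positions, while the underlying insn only addresses
 * 4-element lanes. */
static void insert_pair_model(int x[16], const int pair[2], unsigned int p)
{
    /* The shuffle with control 0b01000100 yields (y0, y1, y0, y1)... */
    int lane[4] = { pair[0], pair[1], pair[0], pair[1] };
    unsigned int i;

    /* ...the insert targets lane p >> 1, and the write mask keeps all
     * elements except the two at pair position p. */
    for ( i = 0; i < 2; ++i )
        x[p * 2 + i] = lane[(p & 1) * 2 + i];
}

int main(void)
{
    int v[16] = { 0 }, pr[2] = { 7, 9 };

    insert_pair_model(v, pr, 3);         /* writes elements 6 and 7 */
    printf("%d %d\n", v[6], v[7]);       /* 7 9 */
    return 0;
}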





* [PATCH v5 31/47] x86emul: support AVX512F move high/low insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (29 preceding siblings ...)
  2018-11-19 10:31   ` [PATCH v5 30/47] x86emul: basic AVX512DQ testing Jan Beulich
@ 2018-11-19 10:31   ` Jan Beulich
  2018-11-19 10:31   ` [PATCH v5 32/47] x86emul: support AVX512F move duplicate insns Jan Beulich
                     ` (15 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:31 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

No explicit test harness additions other than the overrides, as the
compiler already makes use of the insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
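
For readers unfamiliar with the harness: the "overrides" are gas macro
based renames in simd.h which shadow the legacy mnemonics, so that insns
the compiler emits get assembled to the EVEX forms under test. A
stripped down sketch of the idea (not the harness's actual OVR()
machinery, which is more general):

/* Illustration only: a gas .macro named like an insn takes precedence
 * over the insn itself, so any movhpd emitted by the compiler below
 * this point would assemble to the EVEX-encoded vmovhpd instead. */
asm ( ".macro movhpd args:vararg\n\t"
      "{evex} vmovhpd \\args\n\t"
      ".endm" );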

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -253,6 +253,16 @@ static const struct test avx512f_128[] =
     INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
+//       movhlps,     ,   0f, 12,        d
+    INSN(movhpd,    66,   0f, 16, el,    q, vl),
+    INSN(movhpd,    66,   0f, 17, el,    q, vl),
+    INSN(movhps,      ,   0f, 16, el_2,  d, vl),
+    INSN(movhps,      ,   0f, 17, el_2,  d, vl),
+//       movlhps,     ,   0f, 16,        d
+    INSN(movlpd,    66,   0f, 12, el,    q, vl),
+    INSN(movlpd,    66,   0f, 13, el,    q, vl),
+    INSN(movlps,      ,   0f, 12, el_2,  d, vl),
+    INSN(movlps,      ,   0f, 13, el_2,  d, vl),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
     INSN(movq,      66,   0f, d6, el,    q, el),
 };
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -266,6 +266,12 @@ OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
 OVR_VFP(mova);
+OVR(movhlps);
+OVR(movhpd);
+OVR(movhps);
+OVR(movlhps);
+OVR(movlpd);
+OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -286,11 +286,11 @@ static const struct twobyte_table {
     [0x0f] = { ModRM|SrcImmByte },
     [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
-    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
     [0x20 ... 0x21] = { DstMem|SrcImplicit|ModRM },
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
@@ -6008,6 +6008,26 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_fp;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x12):   /* vmovlpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x13): /* vmovlp{s,d} xmm,m64 */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x16):   /* vmovhpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x17): /* vmovhp{s,d} xmm,m64 */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x12):      /* vmovlps m64,xmm,xmm */
+                                            /* vmovhlps xmm,xmm,xmm */
+    case X86EMUL_OPC_EVEX(0x0f, 0x16):      /* vmovhps m64,xmm,xmm */
+                                            /* vmovlhps xmm,xmm,xmm */
+        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( (d & DstMask) != DstMem )
+            d &= ~TwoOp;
+        op_bytes = 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     case X86EMUL_OPC_F3(0x0f, 0x12):       /* movsldup xmm/m128,xmm */
     case X86EMUL_OPC_VEX_F3(0x0f, 0x12):   /* vmovsldup {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F2(0x0f, 0x12):       /* movddup xmm/m64,xmm */






* [PATCH v5 32/47] x86emul: support AVX512F move duplicate insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (30 preceding siblings ...)
  2018-11-19 10:31   ` [PATCH v5 31/47] x86emul: support AVX512F move high/low insns Jan Beulich
@ 2018-11-19 10:31   ` Jan Beulich
  2018-11-19 10:32   ` [PATCH v5 33/47] x86emul: support AVX512{F, BW, VBMI} permute insns Jan Beulich
                     ` (14 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:31 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Judging from their insn prefixes, these are scalar insns, but their
(memory) operands are vector ones (with the exception of 128-bit
VMOVDDUP). For this, some adjustments to the disp8scale calculation code
are needed.

No explicit test harness additions other than the overrides, as the
compiler already makes use of the insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
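
As background: with EVEX encodings an 8-bit displacement is implicitly
scaled by the size of the memory operand ("disp8*N" compression). A
rough standalone model of the scales involved here (the helper name is
made up; this is not the emulator's decode logic):

#include <stdint.h>
#include <stdio.h>

/* log2 of the memory operand size: the full vector (16 << lr bytes)
 * for vmovs{l,h}dup, but only 8 bytes for 128-bit vmovddup. */
static unsigned int disp8scale(unsigned int evex_lr, int is_movddup)
{
    return is_movddup && !evex_lr ? 3 : 4 + evex_lr;
}

int main(void)
{
    int8_t disp8 = 1;

    /* 128-bit vmovddup: an encoded disp8 of 1 means an 8 byte offset. */
    printf("%d\n", (int32_t)disp8 << disp8scale(0, 1));
    /* 512-bit vmovsldup: the same disp8 means a 64 byte offset. */
    printf("%d\n", (int32_t)disp8 << disp8scale(2, 0));
    return 0;
}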

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -157,6 +157,8 @@ static const struct test avx512f_all[] =
     INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
     INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
     INSN_PFP_NB(movnt,       0f, 2b),
+    INSN(movshdup,     f3,   0f, 16,    vl,   d_nb, vl),
+    INSN(movsldup,     f3,   0f, 12,    vl,   d_nb, vl),
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
@@ -253,6 +255,7 @@ static const struct test avx512f_128[] =
     INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
+    INSN(movddup,   f2,   0f, 12, el,    q, el),
 //       movhlps,     ,   0f, 12,        d
     INSN(movhpd,    66,   0f, 16, el,    q, vl),
     INSN(movhpd,    66,   0f, 17, el,    q, vl),
@@ -275,6 +278,7 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(movddup,        f2,   0f, 12, vl, q_nb, vl),
     INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
     INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
     INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -326,8 +326,11 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 # endif
+OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
+OVR(movshdup);
+OVR(movsldup);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3043,6 +3043,15 @@ x86_decode(
 
             switch ( b )
             {
+            case 0x12: /* vmovsldup / vmovddup */
+                if ( evex.pfx == vex_f2 )
+                    disp8scale = evex.lr ? 4 + evex.lr : 3;
+                /* fall through */
+            case 0x16: /* vmovshdup */
+                if ( evex.pfx == vex_f3 )
+                    disp8scale = 4 + evex.lr;
+                break;
+
             case 0x20: /* mov cr,reg */
             case 0x21: /* mov dr,reg */
             case 0x22: /* mov reg,cr */
@@ -6043,6 +6052,20 @@ x86_emulate(
         host_and_vcpu_must_have(sse3);
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x12):   /* vmovsldup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x12):   /* vmovddup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x16):   /* vmovshdup [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if((evex.br ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = !(evex.pfx & VEX_PREFIX_DOUBLE_MASK) || evex.lr
+                   ? 16 << evex.lr : 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x14): /* vunpcklp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),






* [PATCH v5 33/47] x86emul: support AVX512{F, BW, VBMI} permute insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (31 preceding siblings ...)
  2018-11-19 10:31   ` [PATCH v5 32/47] x86emul: support AVX512F move duplicate insns Jan Beulich
@ 2018-11-19 10:32   ` Jan Beulich
  2018-11-19 10:33   ` [PATCH v5 34/47] x86emul: support AVX512BW pack insns Jan Beulich
                     ` (13 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,6 +177,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
+    INSN(permilpd,     66, 0f3a, 05,    vl,      q, vl),
+    INSN(permilps,     66, 0f38, 0c,    vl,      d, vl),
+    INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
@@ -279,6 +283,10 @@ static const struct test avx512f_no128[]
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
     INSN(movddup,        f2,   0f, 12, vl, q_nb, vl),
+    INSN(perm,           66, 0f38, 36, vl,   dq, vl),
+    INSN(perm,           66, 0f38, 16, vl,   sd, vl),
+    INSN(permpd,         66, 0f3a, 01, vl,    q, vl),
+    INSN(permq,          66, 0f3a, 00, vl,    q, vl),
     INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
     INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
     INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
@@ -317,6 +325,7 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
@@ -413,6 +422,7 @@ static const struct test avx512dq_512[]
 };
 
 static const struct test avx512_vbmi_all[] = {
+    INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -186,6 +186,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#   define swap2(x) B_(vpermilps, _mask, x, 0b00011011, undef(), ~0)
 #  else
 #   define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
 #   define insert_pair(x, y, p) \
@@ -200,6 +201,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
 })
+#   define swap2(x) B(vpermilps, _mask, \
+                       B(shuf_f32x4_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b00011011, undef(), ~0)
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -233,6 +238,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#   define swap2(x) B_(vpermilpd, _mask, x, 0b01, undef(), ~0)
 #  else
 #   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
@@ -240,6 +246,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
 })
+#   define swap2(x) B(vpermilpd, _mask, \
+                       B(shuf_f64x2_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b01010101, undef(), ~0)
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -405,6 +415,7 @@ static inline bool _to_bool(byte_vec_t b
                              B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
                                VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
                              0b00011011, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -442,8 +453,17 @@ static inline bool _to_bool(byte_vec_t b
                              (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
                                       VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
                              0b01001110, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  if VEC_SIZE == 32
+#   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
+#  elif VEC_SIZE == 64
+#   define swap3(x) ({ \
+    vdi_t t_ = B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0); \
+    B(shuf_i64x2_, _mask, t_, t_, 0b01001110, (vdi_t)undef(), ~0); \
+})
+#  endif
 # endif
 # if INT_SIZE == 4
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
@@ -489,6 +509,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+#  ifdef __AVX512VBMI__
+#   define swap2(x) ((vec_t)B(permvarqi, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  endif
 # elif INT_SIZE == 2 || UINT_SIZE == 2
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -517,6 +540,7 @@ static inline bool _to_bool(byte_vec_t b
                               (0b01010101010101010101010101010101 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+#  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
@@ -1325,6 +1349,12 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
+#ifdef swap3
+    touch(src);
+    if ( !eq(swap3(src), inv) ) return __LINE__;
+    touch(src);
+#endif
+
 #ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -275,6 +275,8 @@ OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(perm);
+OVR_VFP(permil);
 OVR_VFP(shuf);
 OVR_INT(sll);
 OVR_DQ(sllv);
@@ -331,6 +333,8 @@ OVR(movntdq);
 OVR(movntdqa);
 OVR(movshdup);
 OVR(movsldup);
+OVR(permd);
+OVR(permq);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -434,7 +434,8 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
-    [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
+    [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -477,6 +478,7 @@ static const struct ext0f38_table {
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
+    [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -522,10 +524,10 @@ static const struct ext0f3a_table {
     uint8_t four_op:1;
     disp8scale_t d8s:4;
 } ext0f3a_table[256] = {
-    [0x00] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
+    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
-    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
@@ -8067,6 +8069,9 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0c): /* vpermilps [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0d): /* vpermilpd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         if ( b == 0xe2 )
             goto avx512f_no_sae;
@@ -8412,6 +8417,12 @@ x86_emulate(
         generate_exception_if(!vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
@@ -8616,6 +8627,7 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         if ( !evex.w )
             host_and_vcpu_must_have(avx512_vbmi);
         else
@@ -9042,6 +9054,12 @@ x86_emulate(
         generate_exception_if(!vex.l || !vex.w, EXC_UD);
         goto simd_0f_imm8_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x00): /* vpermq $imm8,{y,z}mm/mem,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x01): /* vpermpd $imm8,{y,z}mm/mem,{y,z}mm{k} */
+        generate_exception_if(!evex.lr || !evex.w, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x38): /* vinserti128 $imm8,xmm/m128,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x39): /* vextracti128 $imm8,ymm,xmm/m128 */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x46): /* vperm2i128 $imm8,ymm/m256,ymm,ymm */
@@ -9061,6 +9079,12 @@ x86_emulate(
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x04): /* vpermilps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x05): /* vpermilpd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_66(0x0f3a, 0x08): /* roundps $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x09): /* roundpd $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x0a): /* roundss $imm8,xmm/m128,xmm */
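
As an aside on the swap2()/swap3() helpers added to the harness above:
full element reversal gets decomposed into a lane reversal
(vshuf{f,i}{32x4,64x2} with control 0b00011011) followed by an in-lane
reversal (vpermilps / vpermq style, again with a reversing control). A
scalar model for the 512-bit, 16 x 32-bit element case (illustrative
only):

#include <assert.h>

/* Reverse 16 elements the way the two-step permute does. */
static void swap_model(int x[16])
{
    int t[16];
    unsigned int lane, i;

    for ( lane = 0; lane < 4; ++lane )     /* reverse the 128-bit lanes */
        for ( i = 0; i < 4; ++i )
            t[lane * 4 + i] = x[(3 - lane) * 4 + i];

    for ( lane = 0; lane < 4; ++lane )     /* reverse within each lane */
        for ( i = 0; i < 4; ++i )
            x[lane * 4 + i] = t[lane * 4 + (3 - i)];
}

int main(void)
{
    int v[16];
    unsigned int i;

    for ( i = 0; i < 16; ++i )
        v[i] = i;
    swap_model(v);
    for ( i = 0; i < 16; ++i )
        assert(v[i] == (int)(15 - i));     /* net effect: full reversal */
    return 0;
}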





* [PATCH v5 34/47] x86emul: support AVX512BW pack insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (32 preceding siblings ...)
  2018-11-19 10:32   ` [PATCH v5 33/47] x86emul: support AVX512{F, BW, VBMI} permute insns Jan Beulich
@ 2018-11-19 10:33   ` Jan Beulich
  2018-11-19 10:33   ` [PATCH v5 35/47] x86emul: support AVX512F floating-point conversion insns Jan Beulich
                     ` (12 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:33 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

No further test harness additions are needed - what is already there is
good enough for these rather "regular" insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
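
For reference, the pack insns narrow each source element with
saturation rather than truncation. A minimal model of the per-element
operation of vpackssdw (signed dword to signed word; standalone sketch):

#include <stdint.h>

static int16_t pack_ss_dw(int32_t v)
{
    if ( v > INT16_MAX )
        return INT16_MAX;
    if ( v < INT16_MIN )
        return INT16_MIN;
    return v;
}

vpackusdw clamps to [0, UINT16_MAX] instead, and the ...wb forms do the
same at byte granularity.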

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -307,6 +307,10 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
+    INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
+    INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
+    INSN(packuswb,    66,   0f, 67,    vl,    w, vl),
     INSN(paddb,       66,   0f, fc,    vl,    b, vl),
     INSN(paddsb,      66,   0f, ec,    vl,    b, vl),
     INSN(paddsw,      66,   0f, ed,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -361,6 +361,10 @@ OVR(pextrw);
 OVR(pinsrb);
 OVR(pinsrw);
 # ifdef __AVX512VL__
+OVR(packssdw);
+OVR(packsswb);
+OVR(packusdw);
+OVR(packuswb);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
-    [0x2b] = { .simd_size = simd_packed_int },
+    [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
@@ -6714,6 +6714,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         op_bytes = 16 << evex.lr;
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x63): /* vpacksswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x67): /* vpackuswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6776,6 +6778,12 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6b): /* vpackssdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2b): /* vpackusdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;






* [PATCH v5 35/47] x86emul: support AVX512F floating-point conversion insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (33 preceding siblings ...)
  2018-11-19 10:33   ` [PATCH v5 34/47] x86emul: support AVX512BW pack insns Jan Beulich
@ 2018-11-19 10:33   ` Jan Beulich
  2018-11-19 10:34   ` [PATCH v5 36/47] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
                     ` (11 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:33 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVTPS2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size change for twobyte_table[0x5a] is benign to pre-existing
code, but allows decode_disp8scale() to work as is here.

Also correct the comment on an AVX counterpart.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Re-base over changes earlier in the series.
v4: New.
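
To make the "manual" override concrete: vcvtps2pd reads only half a
vector from memory (the floats get widened to doubles), so its disp8
scale is one below the full-vector one - which is what the --disp8scale
in x86_decode() achieves. A worked example under the disp8*N rule
(illustration only):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned int evex_lr = 2;               /* 512-bit operation */
    unsigned int scale = 4 + evex_lr - 1;   /* 64-byte vector, 32-byte load */
    int8_t disp8 = 2;

    /* The encoded disp8 of 2 addresses memory 64 bytes onward. */
    printf("%d\n", (int32_t)disp8 << scale);
    return 0;
}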

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,6 +109,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
+    INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
+    INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -181,7 +181,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -68,6 +68,7 @@ typedef short __attribute__((vector_size
 typedef int __attribute__((vector_size(VEC_SIZE))) vsi_t;
 #if VEC_SIZE >= 8
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
+typedef double __attribute__((vector_size(VEC_SIZE))) vdf_t;
 #endif
 
 #if ELEM_SIZE == 1
@@ -93,6 +94,7 @@ typedef char __attribute__((vector_size(
 typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
 typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+typedef float __attribute__((vector_size(HALF_SIZE))) vsf_half_t;
 # endif
 
 # if ELEM_COUNT >= 4
@@ -328,6 +330,13 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 # endif
+OVR(cvtpd2psx);
+OVR(cvtpd2psy);
+OVR(cvtph2ps);
+OVR(cvtps2pd);
+OVR(cvtps2ph);
+OVR(cvtsd2ss);
+OVR(cvtss2sd);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3842,6 +3842,49 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vcvtph2ps 32(%ecx),%zmm7{%k4}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vcvtph2ps);
+        decl_insn(evex_vcvtps2ph);
+
+        asm volatile ( "vpternlogd $0x81, %%zmm7, %%zmm7, %%zmm7\n\t"
+                       "kmovw %1,%%k4\n"
+                       put_insn(evex_vcvtph2ps, "vcvtph2ps 32(%0), %%zmm7%{%%k4%}")
+                       :: "c" (NULL), "r" (0x3333) );
+
+        set_insn(evex_vcvtph2ps);
+        memset(res, 0xff, 128);
+        res[8] = 0x40003c00; /* (1.0, 2.0) */
+        res[10] = 0x44004200; /* (3.0, 4.0) */
+        res[12] = 0x3400b800; /* (-.5, .25) */
+        res[14] = 0xbc000000; /* (0.0, -1.) */
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovups %%zmm7, %0" : "=m" (res[16]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtph2ps) )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vcvtps2ph $0,%zmm3,64(%edx){%k4}...");
+        asm volatile ( "vmovups %0, %%zmm3\n"
+                       put_insn(evex_vcvtps2ph, "vcvtps2ph $0, %%zmm3, 128(%1)%{%%k4%}")
+                       :: "m" (res[16]), "d" (NULL) );
+
+        set_insn(evex_vcvtps2ph);
+        regs.edx = (unsigned long)res;
+        memset(res + 32, 0xcc, 32);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtps2ph) )
+            goto fail;
+        res[15] = res[13] = res[11] = res[9] = 0xcccccccc;
+        if ( memcmp(res + 8, res + 32, 32) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -310,7 +310,8 @@ static const struct twobyte_table {
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -437,7 +438,7 @@ static const struct ext0f38_table {
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x13] = { .simd_size = simd_other, .two_op = 1 },
+    [0x13] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
@@ -541,7 +542,7 @@ static const struct ext0f3a_table {
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
+    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
@@ -3066,6 +3067,11 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x5a: /* vcvtps2pd needs special casing */
+                if ( disp8scale && !evex.pfx && !evex.br )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -5974,6 +5980,7 @@ x86_emulate(
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5d):    /* vmin{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    avx512f_all_fp:
         generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
                                (ea.type == OP_MEM && evex.br &&
                                 (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
@@ -6530,7 +6537,7 @@ x86_emulate(
         goto simd_zmm;
 
     CASE_SIMD_ALL_FP(, 0x0f, 0x5a):        /* cvt{p,s}{s,d}2{p,s}{s,d} xmm/mem,xmm */
-    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} xmm/mem,xmm */
+    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} {x,y}mm/mem,{x,y}mm */
                                            /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm */
         op_bytes = 4 << (((vex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + vex.l) +
                          !!(vex.pfx & VEX_PREFIX_DOUBLE_MASK));
@@ -6539,6 +6546,12 @@ x86_emulate(
             goto simd_0f_sse2;
         goto simd_0f_avx;
 
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5a):   /* vcvtp{s,d}2p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+                                           /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm{k} */
+        op_bytes = 4 << (((evex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + evex.lr) +
+                         evex.w);
+        goto avx512f_all_fp;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x5b):     /* cvt{ps,dq}2{dq,ps} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x5b): /* vcvt{ps,dq}2{dq,ps} {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F3(0x0f, 0x5b):       /* cvttps2dq xmm/mem,xmm */
@@ -8420,6 +8433,15 @@ x86_emulate(
         op_bytes = 8 << vex.l;
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w || (ea.type == OP_MEM && evex.br), EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.br )
+            avx512_vlen_check(false);
+        op_bytes = 8 << evex.lr;
+        elem_bytes = 2;
+        goto simd_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
         generate_exception_if(!vex.l || vex.w, EXC_UD);
@@ -9243,27 +9265,79 @@ x86_emulate(
         goto avx512f_imm8_no_sae;
 
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,[xyz]mm,{x,y}mm/mem{k} */
     {
         uint32_t mxcsr;
 
-        generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
-        host_and_vcpu_must_have(f16c);
         fail_if(!ops->write);
+        if ( evex_encoded() )
+        {
+            generate_exception_if((evex.w || evex.reg != 0xf || !evex.RX ||
+                                   (ea.type == OP_MEM && (evex.z || evex.br))),
+                                  EXC_UD);
+            host_and_vcpu_must_have(avx512f);
+            avx512_vlen_check(false);
+            opc = init_evex(stub);
+        }
+        else
+        {
+            generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
+            host_and_vcpu_must_have(f16c);
+            opc = init_prefixes(stub);
+        }
+
+        op_bytes = 8 << evex.lr;
 
-        opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
         if ( ea.type == OP_MEM )
         {
             /* Convert memory operand to (%rAX). */
             vex.b = 1;
+            evex.b = 1;
             opc[1] &= 0x38;
         }
         opc[2] = imm1;
-        insn_bytes = PFX_BYTES + 3;
+        if ( evex_encoded() )
+        {
+            unsigned int full = 0;
+
+            insn_bytes = EVEX_PFX_BYTES + 3;
+            copy_EVEX(opc, evex);
+
+            if ( ea.type == OP_MEM && evex.opmsk )
+            {
+                full = 0xffff >> (16 - op_bytes / 2);
+                op_mask &= full;
+                if ( !op_mask )
+                    goto complete_insn;
+
+                first_byte = __builtin_ctz(op_mask);
+                op_mask >>= first_byte;
+                full >>= first_byte;
+                first_byte <<= 1;
+                op_bytes = (32 - __builtin_clz(op_mask)) << 1;
+
+                /*
+                 * We may need to read (parts of) the memory operand for the
+                 * purpose of merging in order to avoid splitting the write
+                 * below into multiple ones.
+                 */
+                if ( op_mask != full &&
+                     (rc = ops->read(ea.mem.seg,
+                                     truncate_ea(ea.mem.off + first_byte),
+                                     (void *)mmvalp + first_byte, op_bytes,
+                                     ctxt)) != X86EMUL_OKAY )
+                    goto done;
+            }
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 3;
+            copy_VEX(opc, vex);
+        }
         opc[3] = 0xc3;
 
-        copy_VEX(opc, vex);
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
                     "=m" (*mmvalp), [mxcsr] "=m" (mxcsr) : "a" (mmvalp));
@@ -9272,7 +9346,8 @@ x86_emulate(
 
         if ( ea.type == OP_MEM )
         {
-            rc = ops->write(ea.mem.seg, ea.mem.off, mmvalp, 8 << vex.l, ctxt);
+            rc = ops->write(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
+                            (void *)mmvalp + first_byte, op_bytes, ctxt);
             if ( rc != X86EMUL_OKAY )
             {
                 asm volatile ( "ldmxcsr %0" :: "m" (mxcsr) );





* [PATCH v5 36/47] x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (34 preceding siblings ...)
  2018-11-19 10:33   ` [PATCH v5 35/47] x86emul: support AVX512F floating-point conversion insns Jan Beulich
@ 2018-11-19 10:34   ` Jan Beulich
  2018-11-19 10:35   ` [PATCH v5 37/47] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
                     ` (10 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:34 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

... including the two AVX512DQ forms which share these encodings, just
with EVEX.W set there.

VCVTDQ2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size changes for the twobyte_table[] entries are benign to
pre-existing code, but allow decode_disp8scale() to work as is here.

The placement of the 0xe6 case block, wrong at this point, is once
again in anticipation of further case labels to be added.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
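
The "manual" disp8scale override for VCVTDQ2PD follows the same pattern
as the VCVTPS2PD one earlier in the series: the insn widens dwords to
qwords, so its memory operand is only half the vector length ("vl_2" in
the disp8 test tables), halving the scale. Roughly (illustrative
sketch):

/* Memory operand size of a "vl_2" insn such as vcvtdq2pd: half the
 * 16/32/64 byte vector selected by EVEX.L'L (illustration only). */
static unsigned int vl_2_operand_bytes(unsigned int evex_lr)
{
    return (16u << evex_lr) / 2;    /* 8, 16, or 32 bytes */
}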

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,8 +109,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
+    INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
+    INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
@@ -399,6 +403,8 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
+    INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -92,6 +92,13 @@ static inline bool _to_bool(byte_vec_t b
 # define to_int(x) ((vec_t){ (int)(x)[0] })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
+#elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+# elif FLOAT_SIZE == 8
+#  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
 #  define to_int(x) __builtin_ia32_cvtdq2ps(__builtin_ia32_cvtps2dq(x))
@@ -1142,15 +1149,21 @@ int simd_test(void)
     touch(src);
     if ( !eq(x * -alt, -src) ) return __LINE__;
 
-# if defined(recip) && defined(to_int)
+# ifdef to_int
+
+    touch(src);
+    x = to_int(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
 
+#  ifdef recip
     touch(src);
     x = recip(src);
     touch(src);
     touch(x);
     if ( !eq(to_int(recip(x)), src) ) return __LINE__;
 
-#  ifdef rsqrt
+#   ifdef rsqrt
     x = src * src;
     touch(x);
     y = rsqrt(x);
@@ -1158,6 +1171,7 @@ int simd_test(void)
     if ( !eq(to_int(recip(y)), src) ) return __LINE__;
     touch(src);
     if ( !eq(to_int(y), to_int(recip(src))) ) return __LINE__;
+#   endif
 #  endif
 
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -244,6 +244,7 @@ asm ( ".macro override insn    \n\t"
 OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
+OVR_VFP(cvtdq2);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
@@ -330,13 +331,19 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 # endif
+OVR(cvtpd2dqx);
+OVR(cvtpd2dqy);
 OVR(cvtpd2psx);
 OVR(cvtpd2psy);
 OVR(cvtph2ps);
+OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
 OVR(cvtss2sd);
+OVR(cvttpd2dqx);
+OVR(cvttpd2dqy);
+OVR(cvttps2dq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -311,7 +311,7 @@ static const struct twobyte_table {
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -375,7 +375,7 @@ static const struct twobyte_table {
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
@@ -3076,6 +3076,11 @@ x86_decode(
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
                 break;
+
+            case 0xe6: /* vcvtdq2pd needs special casing */
+                if ( disp8scale && evex.pfx == vex_f3 && !evex.w && !evex.br )
+                    --disp8scale;
+                break;
             }
             break;
 
@@ -6560,6 +6565,22 @@ x86_emulate(
         op_bytes = 16 << vex.l;
         goto simd_0f_cvt;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x5b): /* vcvtps2dq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x5b): /* vcvttps2dq [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x60):    /* punpcklbw {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x60): /* vpunpcklbw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0x61):    /* punpcklwd {,x}mm/mem,{,x}mm */
@@ -7218,6 +7239,27 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe6):   /* vcvttpd2dq [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx != vex_f3 )
+            host_and_vcpu_must_have(avx512f);
+        else if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+        {
+            host_and_vcpu_must_have(avx512f);
+            generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        }
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 8 << (evex.w + evex.lr);
+        goto simd_zmm;
+
     case X86EMUL_OPC_F2(0x0f, 0xf0):     /* lddqu m128,xmm */
     case X86EMUL_OPC_VEX_F2(0x0f, 0xf0): /* vlddqu mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v5 37/47] x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (35 preceding siblings ...)
  2018-11-19 10:34   ` [PATCH v5 36/47] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
@ 2018-11-19 10:35   ` Jan Beulich
  2018-11-19 10:36   ` [PATCH v5 38/47] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
                     ` (9 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVT{,T}S{S,D}2SI use EVEX.W for their destination (register) size
rather than that of their (possibly memory) source operand, and hence
need a "manual" override of disp8scale.

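Concretely, the scale then derives from the element size implied by the
opcode prefix alone. A minimal standalone sketch, with an invented enum
assumed to mirror the emulator's vex_* ordering:

    /* disp8 scale (log2 of N) for vcvt{,t}s{s,d}2si: F3 (scalar
     * single) yields 2 (N=4), F2 (scalar double) yields 3 (N=8);
     * EVEX.W is deliberately not consulted. */
    enum { pfx_none, pfx_66, pfx_f3, pfx_f2 };

    static unsigned int cvt_s2si_disp8scale(unsigned int pfx)
    {
        return 2 + (pfx == pfx_f2);
    }
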
Slightly adjust the scalar to_int() in the test harness, to increase the
chances of the operand ending up in memory.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -117,8 +117,16 @@ static const struct test avx512f_all[] =
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
+    INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
+    INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -734,8 +742,9 @@ static void test_group(const struct test
                 break;
 
             case ESZ_dq:
-                test_pair(&tests[i], vl[j], ESZ_d, "d", ESZ_q, "q",
-                          instr, ctxt);
+                test_pair(&tests[i], vl[j], ESZ_d,
+                          strncmp(tests[i].mnemonic, "cvt", 3) ? "d" : "l",
+                          ESZ_q, "q", instr, ctxt);
                 break;
 
 #ifdef __i386__
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -89,7 +89,7 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 
 #if VEC_SIZE == FLOAT_SIZE
-# define to_int(x) ((vec_t){ (int)(x)[0] })
+# define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -340,10 +340,28 @@ OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
+OVR(cvtsd2si);
+OVR(cvtsd2sil);
+OVR(cvtsd2siq);
+OVR(cvtsi2sd);
+OVR(cvtsi2sdl);
+OVR(cvtsi2sdq);
+OVR(cvtsi2ss);
+OVR(cvtsi2ssl);
+OVR(cvtsi2ssq);
 OVR(cvtss2sd);
+OVR(cvtss2si);
+OVR(cvtss2sil);
+OVR(cvtss2siq);
 OVR(cvttpd2dqx);
 OVR(cvttpd2dqy);
 OVR(cvttps2dq);
+OVR(cvttsd2si);
+OVR(cvttsd2sil);
+OVR(cvttsd2siq);
+OVR(cvttss2si);
+OVR(cvttss2sil);
+OVR(cvttss2siq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -296,7 +296,7 @@ static const struct twobyte_table {
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
     [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp, simd_none, d8s_dq },
@@ -3067,6 +3067,12 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x2c: /* vcvtts{s,d}2si need special casing */
+            case 0x2d: /* vcvts{s,d}2si need special casing */
+                if ( evex_encoded() )
+                    disp8scale = 2 + (evex.pfx & VEX_PREFIX_DOUBLE_MASK);
+                break;
+
             case 0x5a: /* vcvtps2pd needs special casing */
                 if ( disp8scale && !evex.pfx && !evex.br )
                     --disp8scale;
@@ -6176,6 +6182,26 @@ x86_emulate(
         state->simd_size = simd_none;
         goto simd_0f_rm;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+        generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.br),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        if ( ea.type == OP_MEM )
+        {
+            rc = read_ulong(ea.mem.seg, ea.mem.off, &src.val,
+                            rex_prefix & REX_W ? 8 : 4, ctxt, ops);
+            if ( rc != X86EMUL_OKAY )
+                goto done;
+        }
+        else
+            src.val = rex_prefix & REX_W ? *ea.reg : (uint32_t)*ea.reg;
+
+        state->simd_size = simd_none;
+        goto avx512f_rm;
+
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2c):     /* cvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_VEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2d):     /* cvts{s,d}2si xmm/mem,reg */
@@ -6197,14 +6223,17 @@ x86_emulate(
         }
 
         opc = init_prefixes(stub);
+    cvts_2si:
         opc[0] = b;
         /* Convert GPR destination to %rAX and memory operand to (%rCX). */
         rex_prefix &= ~REX_R;
         vex.r = 1;
+        evex.r = 1;
         if ( ea.type == OP_MEM )
         {
             rex_prefix &= ~REX_B;
             vex.b = 1;
+            evex.b = 1;
             opc[1] = 0x01;
 
             rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp,
@@ -6215,11 +6244,22 @@ x86_emulate(
         else
             opc[1] = modrm & 0xc7;
         if ( !mode_64bit() )
+        {
             vex.w = 0;
-        insn_bytes = PFX_BYTES + 2;
+            evex.w = 0;
+        }
+        if ( evex_encoded() )
+        {
+            insn_bytes = EVEX_PFX_BYTES + 2;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 2;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
         opc[2] = 0xc3;
 
-        copy_REX_VEX(opc, rex_prefix, vex);
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
 
@@ -6227,6 +6267,17 @@ x86_emulate(
         state->simd_size = simd_none;
         break;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+        generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
+                               (ea.type != OP_REG && evex.br)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto cvts_2si;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2e):     /* ucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2f):     /* comis{s,d} xmm/mem,xmm */
@@ -6877,6 +6928,7 @@ x86_emulate(
         host_and_vcpu_must_have(avx512f);
         get_fpu(X86EMUL_FPU_zmm);
 
+    avx512f_rm:
         opc = init_evex(stub);
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */





* [PATCH v5 38/47] x86emul: support AVX512DQ packed quad-int/FP conversion insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (36 preceding siblings ...)
  2018-11-19 10:35   ` [PATCH v5 37/47] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
@ 2018-11-19 10:36   ` Jan Beulich
  2018-11-19 10:37   ` [PATCH v5 39/47] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
                     ` (8 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVT{,T}PS2QQ, sharing their main opcodes with others, once again need
"manual" overrides of disp8scale.

While not directly related here, also add a scalar variant of to_wint()
to the test harness.

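The property both to_wint() flavors verify is a float -> int64 -> float
round trip, exact for the small integral lane values the harness uses.
In plain C, as a stand-in for the builtin-based vector variants:

    #include <stdint.h>

    /* Scalar analogue of what to_wint() checks: convert to a 64-bit
     * integer and back (vcvtps2qq / vcvtqq2ps in the vector case);
     * exact for integral inputs of small magnitude. */
    static int wint_roundtrip_ok(float f)
    {
        int64_t w = (int64_t)f;
        return (float)w == f;
    }
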
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -411,8 +411,12 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
+    INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -90,14 +90,35 @@ static inline bool _to_bool(byte_vec_t b
 
 #if VEC_SIZE == FLOAT_SIZE
 # define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
+# ifdef __x86_64__
+#  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) ({ \
+    vsf_half_t t_ = low_half(x); \
+    vdi_t lo_, hi_; \
+    touch(t_); \
+    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    t_ = high_half(x); \
+    touch(t_); \
+    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    touch(lo_); touch(hi_); \
+    insert_half(insert_half(undef(), \
+                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+})
+#  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
@@ -121,6 +142,14 @@ static inline bool _to_bool(byte_vec_t b
 })
 #endif
 
+#if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
+# define low_half(x) (x)
+# define high_half(x) B_(movhlps, , undef(), x)
+# define insert_pair(x, y, p) \
+    ((p) ? B_(movlhps, , x, y) \
+         : ({ vec_t t_ = (x); t_[0] = (y)[0]; t_[1] = (y)[1]; t_; }))
+#endif
+
 #if VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW_A__)
 # define max __builtin_ia32_pfmax
 # define min __builtin_ia32_pfmin
@@ -149,13 +178,16 @@ static inline bool _to_bool(byte_vec_t b
 # if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
      (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
      (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
-#  define low_half(x) ({ \
+#  define _half(x, lh) ({ \
     half_t t_; \
-    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+    asm ( "vextractf%c[w]x%c[n] %[sel], %[s], %[d]" \
           : [d] "=m" (t_) \
-          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+          : [s] "v" (x), [sel] "i" (lh), \
+            [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
     t_; \
 })
+#  define low_half(x)  _half(x, 0)
+#  define high_half(x) _half(x, 1)
 # endif
 # if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
      (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
@@ -1176,6 +1208,13 @@ int simd_test(void)
 
 # endif
 
+# ifdef to_wint
+    touch(src);
+    x = to_wint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
 # ifdef sqrt
     x = src * src;
     touch(x);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -325,6 +325,8 @@ static const struct twobyte_table {
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3078,6 +3080,12 @@ x86_decode(
                     --disp8scale;
                 break;
 
+            case 0x7a: /* vcvttps2qq needs special casing */
+            case 0x7b: /* vcvtps2qq needs special casing */
+                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.br )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -7300,7 +7308,13 @@ x86_emulate(
         if ( evex.pfx != vex_f3 )
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
+        {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2qq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512dq);
+        }
         else
         {
             host_and_vcpu_must_have(avx512f);





* [PATCH v5 39/47] x86emul: support AVX512{F, DQ} uint-to-FP conversion insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (37 preceding siblings ...)
  2018-11-19 10:36   ` [PATCH v5 38/47] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
@ 2018-11-19 10:37   ` Jan Beulich
  2018-11-19 10:37   ` [PATCH v5 40/47] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
                     ` (7 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:37 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Some "manual" overrides of disp8scale are needed here again. In
particular code ends up simpler when using d8s_dq64 in the
twobyte_table[] entry.

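What d8s_dq64 stands for can be sketched as follows (illustrative
helper only; names invented):

    #include <stdbool.h>

    /* With d8s_dq64 the disp8 unit follows the GPR operand size:
     * 4 bytes by default, 8 bytes with EVEX.W set in 64-bit mode. */
    static unsigned int d8s_dq64_unit(bool evex_w, bool mode_64bit)
    {
        return (mode_64bit && evex_w) ? 8 : 4;
    }
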
Test harness additions will be done once the reverse conversions are
also available.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -127,6 +127,10 @@ static const struct test avx512f_all[] =
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
+    INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
+    INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
+    INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -417,6 +421,8 @@ static const struct test avx512dq_all[]
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
+    INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -326,7 +326,7 @@ static const struct twobyte_table {
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3080,12 +3080,16 @@ x86_decode(
                     --disp8scale;
                 break;
 
-            case 0x7a: /* vcvttps2qq needs special casing */
-            case 0x7b: /* vcvtps2qq needs special casing */
-                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.br )
+            case 0x7a: /* vcvttps2qq and vcvtudq2pd need special casing */
+                if ( disp8scale && evex.pfx != vex_f2 && !evex.w && !evex.br )
                     --disp8scale;
                 break;
 
+            case 0x7b: /* vcvtp{s,d}2qq need special casing */
+                if ( disp8scale && evex.pfx == vex_66 )
+                    disp8scale = (evex.br ? 2 : 3 + evex.lr) + evex.w;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -6191,6 +6195,7 @@ x86_emulate(
         goto simd_0f_rm;
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x7b): /* vcvtusi2s{s,d} r/m,xmm,xmm */
         generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.br),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
@@ -6630,6 +6635,8 @@ x86_emulate(
         /* fall through */
     case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
                                           /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x7a): /* vcvtudq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtuqq2ps [xyz]mm/mem,{x,y}mm{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
@@ -7303,6 +7310,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
         generate_exception_if(!evex.w, EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7a):   /* vcvtudq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtuqq2pd [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
         if ( evex.pfx != vex_f3 )





* [PATCH v5 40/47] x86emul: support AVX512{F, DQ} FP-to-uint conversion insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (38 preceding siblings ...)
  2018-11-19 10:37   ` [PATCH v5 39/47] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
@ 2018-11-19 10:37   ` Jan Beulich
  2018-11-19 10:38   ` [PATCH v5 41/47] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
                     ` (6 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:37 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Along the lines of prior patches, VCVT{,T}PS2UQQ as well as
VCVT{,T}S{S,D}2USI need "manual" overrides of disp8scale.

The twobyte_table[] entries get altered, with their prior values now
established in x86_decode_twobyte() instead.

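As a quick illustration of why the unsigned forms are separate insns
rather than aliases of the signed ones - values above INT32_MAX only
round-trip through them (standalone C analogue, not Xen code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t u = 3000000000u;    /* > INT32_MAX */
        float f = (float)u;          /* what vcvtusi2ss would produce */

        /* 3000000000 has enough trailing zero bits to be exactly
         * representable as float, so the round trip is exact. */
        printf("%u -> %f -> %u\n", u, f, (uint32_t)f);
        return 0;
    }
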
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -112,21 +112,29 @@ static const struct test avx512f_all[] =
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
+    INSN(cvtpd2udq,      ,   0f, 79,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtps2udq,      ,   0f, 79,    vl,      d, vl),
     INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
+    INSN(cvtsd2usi,    f2,   0f, 79,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
     INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
     INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvtss2usi,    f3,   0f, 79,    el,      d, el),
     INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttpd2udq,     ,   0f, 78,    vl,      q, vl),
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttps2udq,     ,   0f, 78,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttsd2usi,   f2,   0f, 78,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvttss2usi,   f3,   0f, 78,    el,      d, el),
     INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
     INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
@@ -416,11 +424,15 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtpd2uqq,      66,   0f, 79,   vl,  q, vl),
     INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
+    INSN(cvtps2uqq,      66,   0f, 79, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttpd2uqq,     66,   0f, 78,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -93,31 +93,65 @@ static inline bool _to_bool(byte_vec_t b
 # ifdef __x86_64__
 #  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
 # endif
+# ifdef __AVX512F__
+/*
+ * Sadly even gcc 9.x, at the time of writing, does not carry out at least
+ * uint -> FP conversions using VCVTUSI2S{S,D}, so we need to use builtins
+ * or inline assembly here. The full-vector parameter types of the builtins
+ * aren't very helpful for our purposes, so use inline assembly.
+ */
+#  if FLOAT_SIZE == 4
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    float __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtss2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2ss%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  elif FLOAT_SIZE == 8
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    double __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtsd2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2sd%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  endif
+#  define to_uint(x) to_u_int(int, x)
+#  ifdef __x86_64__
+#   define to_uwint(x) to_u_int(long, x)
+#  endif
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  define to_uint(x) BR(cvtudq2ps, _mask, BR(cvtps2udq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
-#   define to_wint(x) ({ \
+#   define to_w_int(x, s) ({ \
     vsf_half_t t_ = low_half(x); \
     vdi_t lo_, hi_; \
     touch(t_); \
-    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    lo_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     t_ = high_half(x); \
     touch(t_); \
-    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    hi_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     touch(lo_); touch(hi_); \
     insert_half(insert_half(undef(), \
-                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
-                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+                            BR(cvt ## s ## qq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvt ## s ## qq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
 })
+#   define to_wint(x) to_w_int(x, )
+#   define to_uwint(x) to_w_int(x, u)
 #  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  define to_uint(x) B(cvtudq2pd, _mask, BR(cvtpd2udq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
 #   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#   define to_uwint(x) BR(cvtuqq2pd, _mask, BR(cvtpd2uqq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
 #  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
@@ -1214,6 +1248,20 @@ int simd_test(void)
     touch(src);
     if ( !eq(x, src) ) return __LINE__;
 # endif
+
+# ifdef to_uint
+    touch(src);
+    x = to_uint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
+# ifdef to_uwint
+    touch(src);
+    x = to_uwint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
 
 # ifdef sqrt
     x = src * src;
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -323,8 +323,7 @@ static const struct twobyte_table {
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
-    [0x78] = { ImplicitOps|ModRM },
-    [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x78 ... 0x79] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -2514,6 +2513,8 @@ x86_decode_twobyte(
         break;
 
     case 0x78:
+        state->desc = ImplicitOps;
+        state->simd_size = simd_none;
         switch ( vex.pfx )
         {
         case vex_66: /* extrq $imm8, $imm8, xmm */
@@ -2526,7 +2527,7 @@ x86_decode_twobyte(
     case 0x10 ... 0x18:
     case 0x28 ... 0x2f:
     case 0x50 ... 0x77:
-    case 0x79 ... 0x7d:
+    case 0x7a ... 0x7d:
     case 0x7f:
     case 0xc2 ... 0xc3:
     case 0xc5 ... 0xc6:
@@ -2548,6 +2549,12 @@ x86_decode_twobyte(
         op_bytes = mode_64bit() ? 8 : 4;
         break;
 
+    case 0x79:
+        state->desc = DstReg | SrcMem;
+        state->simd_size = simd_packed_int;
+        ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
+        break;
+
     case 0x7e:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         if ( vex.pfx == vex_f3 ) /* movq xmm/m64,xmm */
@@ -3069,6 +3076,18 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x78:
+            case 0x79:
+                if ( !evex.pfx )
+                    break;
+                /* vcvt{,t}ps2uqq need special casing */
+                if ( evex.pfx == vex_66 )
+                {
+                    if ( !evex.w && !evex.br )
+                        --disp8scale;
+                    break;
+                }
+                /* vcvt{,t}s{s,d}2usi need special casing: fall through */
             case 0x2c: /* vcvtts{s,d}2si need special casing */
             case 0x2d: /* vcvts{s,d}2si need special casing */
                 if ( evex_encoded() )
@@ -6282,6 +6301,8 @@ x86_emulate(
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x78): /* vcvtts{s,d}2usi xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x79): /* vcvts{s,d}2usi xmm/mem,reg */
         generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
                                (ea.type != OP_REG && evex.br)),
                               EXC_UD);
@@ -6640,7 +6661,11 @@ x86_emulate(
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
+        {
+    case X86EMUL_OPC_EVEX(0x0f, 0x78):    /* vcvttp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX(0x0f, 0x79):    /* vcvtp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512f);
+        }
         if ( ea.type == OP_MEM || !evex.br )
             avx512_vlen_check(false);
         d |= TwoOp;
@@ -7318,6 +7343,10 @@ x86_emulate(
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
         {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x78):   /* vcvttps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2uqq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x79):   /* vcvtps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2uqq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */





* [PATCH v5 41/47] x86emul: support remaining AVX512F legacy-equivalent insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (39 preceding siblings ...)
  2018-11-19 10:37   ` [PATCH v5 40/47] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
@ 2018-11-19 10:38   ` Jan Beulich
  2018-11-19 10:38   ` [PATCH v5 42/47] x86emul: support remaining AVX512BW " Jan Beulich
                     ` (5 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Plus their AVX512BW counterparts.

Take the opportunity and also eliminate a pair of open-coded instances
of scalar_1op().

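For reference, scalar_1op() is of roughly the following shape (a
reconstructed sketch - the harness's actual macro may differ in
detail):

    /* Run a one-operand scalar insn, with the (memory) source and the
     * register destination referenced as %[in] and %[out]. */
    #define scalar_1op(x, insn) ({ \
        typeof((x)[0]) __attribute__((vector_size(16))) r_; \
        asm ( insn : [out] "=&x" (r_) : [in] "m" (x) ); \
        (vec_t){ r_[0] }; \
    })

The 0b1011 immediate used with the vrndscale and round insns selects
round-toward-zero with the inexact (precision) exception suppressed,
i.e. trunc() semantics.
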
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -192,6 +192,8 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(pabsd,        66, 0f38, 1e,    vl,      d, vl),
+    INSN(pabsq,        66, 0f38, 1f,    vl,      q, vl),
     INSN(paddd,        66,   0f, fe,    vl,      d, vl),
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
@@ -275,6 +277,10 @@ static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
+    INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
+    INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
+    INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -337,6 +343,8 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(pabsb,       66, 0f38, 1c,    vl,    b, vl),
+    INSN(pabsw,       66, 0f38, 1d,    vl,    w, vl),
     INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
     INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
     INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -204,8 +204,10 @@ static inline bool _to_bool(byte_vec_t b
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
+#  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
+#  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
@@ -256,6 +258,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
 #  define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
@@ -309,6 +312,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
+#  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
@@ -541,6 +545,7 @@ static inline bool _to_bool(byte_vec_t b
 #  endif
 # endif
 # if INT_SIZE == 4
+#  define abs(x) B(pabsd, _mask, x, undef(), ~0)
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
@@ -551,6 +556,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
 #  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
+#  define abs(x) ((vec_t)B(pabsq, _mask, (vdi_t)(x), (vdi_t)undef(), ~0))
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 8
@@ -618,6 +624,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
+#  define abs(x) ((vec_t)B(pabsb, _mask, (vqi_t)(x), (vqi_t)undef(), ~0))
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
@@ -630,6 +637,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
 #  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 2
+#  define abs(x) B(pabsw, _mask, x, undef(), ~0)
 #  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
 #  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
@@ -941,19 +949,11 @@ static inline bool _to_bool(byte_vec_t b
 #if VEC_SIZE == FLOAT_SIZE
 # define max(x, y) ((vec_t){({ typeof(x[0]) x_ = (x)[0], y_ = (y)[0]; x_ > y_ ? x_ : y_; })})
 # define min(x, y) ((vec_t){({ typeof(x[0]) x_ = (x)[0], y_ = (y)[0]; x_ < y_ ? x_ : y_; })})
-# ifdef __SSE4_1__
+# if defined(__SSE4_1__) && !defined(__AVX512F__)
 #  if FLOAT_SIZE == 4
-#   define trunc(x) ({ \
-    float __attribute__((vector_size(16))) r_; \
-    asm ( "roundss $0b1011,%1,%0" : "=x" (r_) : "m" (x) ); \
-    (vec_t){ r_[0] }; \
-})
+#   define trunc(x) scalar_1op(x, "roundss $0b1011, %[in], %[out]")
 #  elif FLOAT_SIZE == 8
-#   define trunc(x) ({ \
-    double __attribute__((vector_size(16))) r_; \
-    asm ( "roundsd $0b1011,%1,%0" : "=x" (r_) : "m" (x) ); \
-    (vec_t){ r_[0] }; \
-})
+#   define trunc(x) scalar_1op(x, "roundsd $0b1011, %[in], %[out]")
 #  endif
 # endif
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -184,6 +184,8 @@ DECL_OCTET(half);
 #define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 #define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 #define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+#define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
+#define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
 #define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 #define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 #define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -245,6 +247,7 @@ OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_VFP(cvtdq2);
+OVR_INT(abs);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -446,7 +446,7 @@ static const struct ext0f38_table {
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x1c ... 0x1f] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
     [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
@@ -531,8 +531,8 @@ static const struct ext0f3a_table {
     [0x02] = { .simd_size = simd_packed_int },
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
-    [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
-    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
+    [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc, .d8s = d8s_dq },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
     [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
@@ -6865,6 +6865,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.br, EXC_UD);
         elem_bytes = 1 << (b & 1);
@@ -8246,6 +8248,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1e): /* vpabsd [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1f): /* vpabsq [xyz]mm/mem,[xyz]mm{k} */
         generate_exception_if(evex.w != (b & 1), EXC_UD);
         goto avx512f_no_sae;
 
@@ -9274,6 +9278,17 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_1);
         goto simd_0f3a_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0a): /* vrndscaless $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0b): /* vrndscalesd $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(ea.type != OP_REG && evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x08): /* vrndscaleps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x09): /* vrndscalepd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        avx512_vlen_check(b & 2);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f3a, 0x0f):    /* palignr $imm8,mm/m64,mm */
     case X86EMUL_OPC_66(0x0f3a, 0x0f): /* palignr $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(ssse3);





* [PATCH v5 42/47] x86emul: support remaining AVX512BW legacy-equivalent insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (40 preceding siblings ...)
  2018-11-19 10:38   ` [PATCH v5 41/47] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
@ 2018-11-19 10:38   ` Jan Beulich
  2018-11-19 10:39   ` [PATCH v5 43/47] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
                     ` (4 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -355,6 +355,7 @@ static const struct test avx512bw_all[]
     INSN(paddusb,     66,   0f, dc,    vl,    b, vl),
     INSN(paddusw,     66,   0f, dd,    vl,    w, vl),
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
+    INSN(palignr,     66, 0f3a, 0f,    vl,    b, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
     INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
@@ -370,6 +371,7 @@ static const struct test avx512bw_all[]
     INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
+    INSN(pmaddubsw,   66, 0f38, 04,    vl,    b, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -387,6 +389,7 @@ static const struct test avx512bw_all[]
 //       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
+    INSN(pmulhrsw,    66, 0f38, 0b,    vl,    w, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -580,6 +580,7 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define rotr(x, n) ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), (n) * 8, (vdi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
 #  elif defined(__AVX512VBMI__)
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
@@ -608,6 +609,7 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define rotr(x, n) ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), (n) * 16, (vdi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
                              (vsi_t)B(pshufhw, _mask, \
                                       B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -402,9 +402,12 @@ OVR(packssdw);
 OVR(packsswb);
 OVR(packusdw);
 OVR(packuswb);
+OVR(palignr);
+OVR(pmaddubsw);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
+OVR(pmulhrsw);
 OVR(pmulhuw);
 OVR(pmulhw);
 OVR(pmullw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -435,7 +435,10 @@ static const struct ext0f38_table {
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x01 ... 0x03] = { .simd_size = simd_packed_int },
+    [0x04] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x05 ... 0x0a] = { .simd_size = simd_packed_int },
+    [0x0b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -534,7 +537,8 @@ static const struct ext0f3a_table {
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc, .d8s = d8s_dq },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
-    [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
+    [0x0e] = { .simd_size = simd_packed_int },
+    [0x0f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
@@ -6847,6 +6851,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x04): /* vpmaddubsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6865,6 +6870,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0b): /* vpmulhrsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
@@ -9317,6 +9323,10 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 4;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0f): /* vpalignr $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        goto avx512bw_imm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x14): /* pextrb $imm8,xmm,r/m */
     case X86EMUL_OPC_66(0x0f3a, 0x15): /* pextrw $imm8,xmm,r/m */
     case X86EMUL_OPC_66(0x0f3a, 0x16): /* pextr{d,q} $imm8,xmm,r/m */




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 43/47] x86emul: support AVX512{F, ER} reciprocal insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (41 preceding siblings ...)
  2018-11-19 10:38   ` [PATCH v5 42/47] x86emul: support remaining AVX512BW " Jan Beulich
@ 2018-11-19 10:39   ` Jan Beulich
  2018-11-19 10:39   ` [PATCH v5 44/47] x86emul: support AVX512F floating point manipulation insns Jan Beulich
                     ` (3 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:39 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also include the only other AVX512ER insn pair, VEXP2P{D,S}.

Note that despite the replacement of the SHA insns' table slots there's
no need to special case their decoding: Their insn-specific code already
sets op_bytes (as was required due to simd_other), and TwoOp is of no
relevance for legacy encoded SIMD insns.

The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
to be on the safe side. The SDM does not clarify behavior there, and
it's even more ambiguous here (without AVX512VL in the picture).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Note: AVX512ER parts untested. Intel?
---
v5: New.
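
Purely as commentary (not part of the patch), a minimal standalone
illustration of what one of the new scalar insns computes, mirroring the
harness' scalar_1op() wrapper; it assumes an AVX512ER-capable CPU (or the
emulator sitting underneath):

    #include <stdio.h>

    int main(void)
    {
        float in = 8.0f, out = 0;

        /* Approximate reciprocal, relative error <= 2^-28. */
        asm ( "vrcp28ss %[i], %[o], %[o]" : [o] "+x" (out) : [i] "m" (in) );
        printf("rcp28(%f) ~= %f\n", in, out); /* prints ~0.125 */
        return 0;
    }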

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -72,6 +72,9 @@ avx512bw-flts :=
 avx512dq-vecs := $(avx512f-vecs)
 avx512dq-ints := $(avx512f-ints)
 avx512dq-flts := $(avx512f-flts)
+avx512er-vecs := 64
+avx512er-ints :=
+avx512er-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -277,10 +277,14 @@ static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rcp14,        66, 0f38, 4c,    vl,     sd, vl),
+    INSN(rcp14,        66, 0f38, 4d,    el,     sd, el),
     INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
     INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
     INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
+    INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
+    INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -478,6 +482,14 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512er_512[] = {
+    INSN(exp2,    66, 0f38, c8, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, ca, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, cb, el, sd, el),
+    INSN(rsqrt28, 66, 0f38, cc, vl, sd, vl),
+    INSN(rsqrt28, 66, 0f38, cd, el, sd, el),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -825,5 +837,6 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -203,9 +203,23 @@ static inline bool _to_bool(byte_vec_t b
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -256,6 +270,13 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14ps, _mask, x, undef(), ~0)
+#  endif
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
@@ -311,6 +332,13 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14pd, _mask, x, undef(), ~0)
+#  endif
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -178,14 +178,20 @@ DECL_OCTET(half);
 /* Sadly there are a few exceptions to the general naming rules. */
 #define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
 #define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+#define __builtin_ia32_exp2pd512_mask __builtin_ia32_exp2pd_mask
+#define __builtin_ia32_exp2ps512_mask __builtin_ia32_exp2ps_mask
 #define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
 #define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
 #define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
 #define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 #define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 #define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+#define __builtin_ia32_rcp28pd512_mask __builtin_ia32_rcp28pd_mask
+#define __builtin_ia32_rcp28ps512_mask __builtin_ia32_rcp28ps_mask
 #define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
 #define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
+#define __builtin_ia32_rsqrt28pd512_mask __builtin_ia32_rsqrt28pd_mask
+#define __builtin_ia32_rsqrt28ps512_mask __builtin_ia32_rsqrt28ps_mask
 #define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 #define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 #define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -24,6 +24,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f.h"
 #include "avx512bw.h"
 #include "avx512dq.h"
+#include "avx512er.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -106,6 +107,11 @@ static bool simd_check_avx512dq_vl(void)
     return cpu_has_avx512dq && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx512er(void)
+{
+    return cpu_has_avx512er;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -327,6 +333,10 @@ static const struct {
     AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
     AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
     AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
+    SIMD(AVX512ER f32 scalar,avx512er,        f4),
+    SIMD(AVX512ER f32x16,    avx512er,      64f4),
+    SIMD(AVX512ER f64 scalar,avx512er,        f8),
+    SIMD(AVX512ER f64x8,     avx512er,      64f8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -134,6 +134,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_bmi2       cp.feat.bmi2
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
+#define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -471,6 +471,10 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -510,7 +514,12 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
-    [0xc8 ... 0xcd] = { .simd_size = simd_other },
+    [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xc9] = { .simd_size = simd_other },
+    [0xca] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
     [0xf0] = { .two_op = 1 },
@@ -1874,6 +1883,7 @@ static bool vcpu_has(
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
+#define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
@@ -6145,6 +6155,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -9052,12 +9064,17 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
-        host_and_vcpu_must_have(avx512f);
+    simd_zmm_scalar_sae:
         if ( ea.type == OP_MEM )
         {
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
+            host_and_vcpu_must_have(avx512f);
             generate_exception_if(evex.br, EXC_UD);
             avx512_vlen_check(true);
         }
+        else
+            host_and_vcpu_must_have(avx512f);
         goto simd_zmm;
 
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */
@@ -9070,6 +9087,18 @@ x86_emulate(
         op_bytes = 16;
         goto simd_0f38_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc8): /* vexp2p{s,d} zmm/m512,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xca): /* vrcp28p{s,d} zmm/m512,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcc): /* vrsqrt28p{s,d} zmm/m512,zmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        generate_exception_if(ea.type == OP_MEM && evex.lr != 2, EXC_UD);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcb): /* vrcp28s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcd): /* vrsqrt28s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        goto simd_zmm_scalar_sae;
+
     case X86EMUL_OPC(0x0f38, 0xf0): /* movbe m,r */
     case X86EMUL_OPC(0x0f38, 0xf1): /* movbe r,m */
         vcpu_must_have(movbe);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -103,6 +103,7 @@
 #define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
+#define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 44/47] x86emul: support AVX512F floating point manipulation insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (42 preceding siblings ...)
  2018-11-19 10:39   ` [PATCH v5 43/47] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
@ 2018-11-19 10:39   ` Jan Beulich
  2018-11-19 10:40   ` [PATCH v5 45/47] x86emul: support AVX512DQ " Jan Beulich
                     ` (2 subsequent siblings)
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:39 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
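
As commentary (not part of the patch): the new simd.c test relies on the
identity x == getmant(x) * 2 ^ getexp(x), with vscalefsd doing the final
multiplication by the power of two. A standalone sketch of the same,
assuming an AVX512F-capable CPU (or the emulator underneath):

    #include <stdio.h>

    int main(void)
    {
        double in = 48.0, mant = 0, exp = 0, out;

        asm ( "vgetmantsd $0, %[i], %[m], %[m]"
              : [m] "+x" (mant) : [i] "m" (in) );
        asm ( "vgetexpsd %[i], %[e], %[e]"
              : [e] "+x" (exp) : [i] "m" (in) );
        /* out = mant * 2^exp */
        asm ( "vscalefsd %[e], %[m], %[o]"
              : [o] "=x" (out) : [m] "x" (mant), [e] "m" (exp) );
        printf("%g = %g * 2^%g (= %g)\n", in, mant, exp, out);
        return 0;
    }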

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -140,6 +140,8 @@ static const struct test avx512f_all[] =
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
     INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
+    INSN(fixupimm,     66, 0f3a, 54,    vl,     sd, vl),
+    INSN(fixupimm,     66, 0f3a, 55,    el,     sd, el),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
     INSN(fmadd213,     66, 0f38, a8,    vl,     sd, vl),
@@ -170,6 +172,10 @@ static const struct test avx512f_all[] =
     INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
     INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
     INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
+    INSN(getexp,       66, 0f38, 42,    vl,     sd, vl),
+    INSN(getexp,       66, 0f38, 43,    el,     sd, el),
+    INSN(getmant,      66, 0f3a, 26,    vl,     sd, vl),
+    INSN(getmant,      66, 0f3a, 27,    el,     sd, el),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
@@ -285,6 +291,8 @@ static const struct test avx512f_all[] =
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
     INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
     INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
+    INSN(scalef,       66, 0f38, 2c,    vl,     sd, vl),
+    INSN(scalef,       66, 0f38, 2d,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -174,6 +174,11 @@ static inline bool _to_bool(byte_vec_t b
     asm ( op : [out] "=&x" (r_) : [in] "m" (x) ); \
     (vec_t){ r_[0] }; \
 })
+# define scalar_2op(x, y, op) ({ \
+    typeof((x)[0]) __attribute__((vector_size(16))) r_ = { x[0] }; \
+    asm ( op : [out] "=&x" (r_) : [in1] "[out]" (r_), [in2] "m" (y) ); \
+    (vec_t){ r_[0] }; \
+})
 #endif
 
 #if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
@@ -203,6 +208,8 @@ static inline bool _to_bool(byte_vec_t b
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  define getexp(x) scalar_1op(x, "vgetexpss %[in], %[out], %[out]")
+#  define getmant(x) scalar_1op(x, "vgetmantss $0, %[in], %[out], %[out]")
 #  ifdef __AVX512ER__
 #   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
@@ -210,9 +217,12 @@ static inline bool _to_bool(byte_vec_t b
 #   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
 #  endif
+#  define scale(x, y) scalar_2op(x, y, "vscalefss %[in2], %[in1], %[out]")
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  define getexp(x) scalar_1op(x, "vgetexpsd %[in], %[out], %[out]")
+#  define getmant(x) scalar_1op(x, "vgetmantsd $0, %[in], %[out], %[out]")
 #  ifdef __AVX512ER__
 #   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
@@ -220,6 +230,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
 #  endif
+#  define scale(x, y) scalar_2op(x, y, "vscalefsd %[in2], %[in1], %[out]")
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -267,9 +278,12 @@ static inline bool _to_bool(byte_vec_t b
 #   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
 #   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  define getexp(x) BR(getexpps, _mask, x, undef(), ~0)
+#  define getmant(x) BR(getmantps, _mask, x, 0, undef(), ~0)
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
 #   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
@@ -329,9 +343,12 @@ static inline bool _to_bool(byte_vec_t b
 #   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
 #   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  define getexp(x) BR(getexppd, _mask, x, undef(), ~0)
+#  define getmant(x) BR(getmantpd, _mask, x, 0, undef(), ~0)
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
 #   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
@@ -1759,6 +1776,28 @@ int simd_test(void)
 # endif
 #endif
 
+#if defined(getexp) && defined(getmant)
+    touch(src);
+    x = getmant(src);
+    touch(src);
+    y = getexp(src);
+    touch(src);
+    for ( j = i = 0; i < ELEM_COUNT; ++i )
+    {
+        if ( y[i] != j ) return __LINE__;
+
+        if ( !((i + 1) & (i + 2)) )
+            ++j;
+
+        if ( !(i & (i + 1)) && x[i] != 1 ) return __LINE__;
+    }
+# ifdef scale
+    touch(y);
+    z = scale(x, y);
+    if ( !eq(src, z) ) return __LINE__;
+# endif
+#endif
+
 #if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
     (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3895,6 +3895,44 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vfixupimmpd $0,8(%edx){1to8},%zmm3,%zmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vfixupimmpd);
+        static const struct {
+            double d[4];
+        }
+        src = { { -1, 0, 1, 2 } },
+        dst = { { 3, 4, 5, 6 } },
+        out = { { .5, -1, 90, 2 } };
+
+        asm volatile ( "vbroadcastf64x4 %1, %%zmm3\n\t"
+                       "vbroadcastf64x4 %2, %%zmm4\n"
+                       put_insn(vfixupimmpd,
+                                "vfixupimmpd $0, 8(%0)%{1to8%}, %%zmm3, %%zmm4")
+                       :: "d" (NULL), "m" (src), "m" (dst) );
+
+        set_insn(vfixupimmpd);
+        /*
+         * Nibble (token) mapping (unused ones simply set to zero):
+         * 2 (ZERO)    ->  -1 (0x9)
+         * 3 (POS_ONE) ->  90 (0xc)
+         * 6 (NEG)     -> 1/2 (0xb)
+         * 7 (POS)     -> src (0x1)
+         */
+        res[2] = 0x1b00c900;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovupd %%zmm4, %0" : "=m" (res[0]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(vfixupimmpd) ||
+             memcmp(res + 0, &out, sizeof(out)) ||
+             memcmp(res + 8, &out, sizeof(out)) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -459,7 +459,8 @@ static const struct ext0f38_table {
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
+    [0x2c] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x2d] = { .simd_size = simd_packed_fp, .d8s = d8s_dq },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
@@ -470,6 +471,8 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x42] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -563,6 +566,8 @@ static const struct ext0f3a_table {
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x26] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x27] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
     [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
@@ -577,6 +582,8 @@ static const struct ext0f3a_table {
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4c] = { .simd_size = simd_packed_int, .four_op = 1 },
+    [0x54] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x55] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -2675,6 +2682,10 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x2d): /* vscalefs{d,s} */
+        state->simd_size = simd_scalar_vexw;
+        break;
+
     case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
     case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
     case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
@@ -9029,6 +9040,8 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2c): /* vscalefp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x42): /* vgetexpp{s,d} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -9052,6 +9065,8 @@ x86_emulate(
             avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2d): /* vscalefs{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x43): /* vgetexps{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
@@ -9614,6 +9629,23 @@ x86_emulate(
                               EXC_UD);
         goto avx512f_imm8_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM )
+        {
+            generate_exception_if(evex.br, EXC_UD);
+            avx512_vlen_check(true);
+        }
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 45/47] x86emul: support AVX512DQ floating point manipulation insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (43 preceding siblings ...)
  2018-11-19 10:39   ` [PATCH v5 44/47] x86emul: support AVX512F floating point manipulation insns Jan Beulich
@ 2018-11-19 10:40   ` Jan Beulich
  2018-11-19 10:40   ` [PATCH v5 46/47] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
  2018-11-19 10:41   ` [PATCH v5 47/47] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:40 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512DQ in the insn emulator.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
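
For reference (commentary, not part of the patch), the imm8 of the
vfpclass test added below selects categories by bit, per my reading of
the SDM:

    /* vfpclassp{s,d} / vfpclasss{s,d} imm8 category bits: */
    #define FPC_QNAN       (1u << 0)
    #define FPC_POS_ZERO   (1u << 1)
    #define FPC_NEG_ZERO   (1u << 2)
    #define FPC_POS_INF    (1u << 3)
    #define FPC_NEG_INF    (1u << 4)
    #define FPC_DENORMAL   (1u << 5)
    #define FPC_NEG_FINITE (1u << 6)
    #define FPC_SNAN       (1u << 7)

The test's 0x46 is thus FPC_POS_ZERO | FPC_NEG_ZERO | FPC_NEG_FINITE,
matching its "+/- 0 and neg." comment.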

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -458,11 +458,17 @@ static const struct test avx512dq_all[]
     INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
+    INSN(fpclass,        66, 0f3a, 66,   vl, sd, vl),
+    INSN(fpclass,        66, 0f3a, 67,   el, sd, el),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
 //       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
+    INSN(range,          66, 0f3a, 50,   vl, sd, vl),
+    INSN(range,          66, 0f3a, 51,   el, sd, el),
+    INSN(reduce,         66, 0f3a, 56,   vl, sd, vl),
+    INSN(reduce,         66, 0f3a, 57,   el, sd, el),
     INSN_PFP(xor,              0f, 57),
 };
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -278,10 +278,18 @@ static inline bool _to_bool(byte_vec_t b
 #   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
 #   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  ifdef __AVX512DQ__
+#   define frac(x) B(reduceps, _mask, x, 0b00001011, undef(), ~0)
+#  endif
 #  define getexp(x) BR(getexpps, _mask, x, undef(), ~0)
 #  define getmant(x) BR(getmantps, _mask, x, 0, undef(), ~0)
-#  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
-#  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define max(x, y) BR(rangeps, _mask, x, y, 0b0101, undef(), ~0)
+#   define min(x, y) BR(rangeps, _mask, x, y, 0b0100, undef(), ~0)
+#  else
+#   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  endif
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
 #  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
@@ -343,10 +351,18 @@ static inline bool _to_bool(byte_vec_t b
 #   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
 #   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  ifdef __AVX512DQ__
+#   define frac(x) B(reducepd, _mask, x, 0b00001011, undef(), ~0)
+#  endif
 #  define getexp(x) BR(getexppd, _mask, x, undef(), ~0)
 #  define getmant(x) BR(getmantpd, _mask, x, 0, undef(), ~0)
-#  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
-#  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define max(x, y) BR(rangepd, _mask, x, y, 0b0101, undef(), ~0)
+#   define min(x, y) BR(rangepd, _mask, x, y, 0b0100, undef(), ~0)
+#  else
+#   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  endif
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
 #  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3933,6 +3933,39 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+
+    printf("%-40s", "Testing vfpclasspsz $0x46,64(%edx),%k2...");
+    if ( stack_exec && cpu_has_avx512dq )
+    {
+        decl_insn(vfpclassps);
+
+        asm volatile ( put_insn(vfpclassps,
+                                /* 0x46: check for +/- 0 and neg. */
+                                "vfpclasspsz $0x46, 64(%0), %%k2")
+                       :: "d" (NULL) );
+
+        set_insn(vfpclassps);
+        for ( i = 0; i < 3; ++i )
+        {
+            res[16 + i * 5 + 0] = 0x00000000; /* +0 */
+            res[16 + i * 5 + 1] = 0x80000000; /* -0 */
+            res[16 + i * 5 + 2] = 0x80000001; /* -DEN */
+            res[16 + i * 5 + 3] = 0xff000000; /* -FIN */
+            res[16 + i * 5 + 4] = 0x7f000000; /* +FIN */
+        }
+        res[31] = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vfpclassps) )
+            goto fail;
+        asm volatile ( "kmovw %%k2, %0" : "=g" (rc) );
+        if ( rc != 0xbdef )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -582,10 +582,16 @@ static const struct ext0f3a_table {
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4c] = { .simd_size = simd_packed_int, .four_op = 1 },
+    [0x50] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x51] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x54] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x55] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x56] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x57] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x66] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x67] = { .simd_size = simd_scalar_vexw, .two_op = 1, .d8s = d8s_dq },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -9629,6 +9635,10 @@ x86_emulate(
                               EXC_UD);
         goto avx512f_imm8_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x50): /* vrangep{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x56): /* vreducep{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512dq);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512f);
@@ -9636,11 +9646,16 @@ x86_emulate(
             avx512_vlen_check(false);
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x51): /* vranges{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x57): /* vreduces{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512dq);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
         host_and_vcpu_must_have(avx512f);
         if ( ea.type == OP_MEM )
         {
+    simd_imm8_zmm_scalar:
             generate_exception_if(evex.br, EXC_UD);
             avx512_vlen_check(true);
         }
@@ -9793,6 +9808,14 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x66): /* vfpclassp{d,s} $imm8,[xyz]mm/mem,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x67): /* vfpclasss{d,s} $imm8,[xyz]mm/mem,k{k} */
+        host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( b & 1 )
+            goto simd_imm8_zmm_scalar;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC(0x0f3a, 0xcc):     /* sha1rnds4 $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(sha);
         op_bytes = 16;




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 46/47] x86emul: support AVX512{F, _VBMI2} compress/expand insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (44 preceding siblings ...)
  2018-11-19 10:40   ` [PATCH v5 45/47] x86emul: support AVX512DQ " Jan Beulich
@ 2018-11-19 10:40   ` Jan Beulich
  2018-11-19 10:41   ` [PATCH v5 47/47] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:40 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
---
TBD: Equivalent byte/word element size tests would be nice, but I didn't
     dare to code them up without actually being able to test the result.
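
As commentary (not part of the patch), the reference semantics in plain C
of what gets emulated here, and why op_mask needs compacting: memory is
always accessed densely, no matter how sparsely the mask register's bits
are set (a sketch for dword elements; the names are mine):

    #include <stdbool.h>
    #include <stdint.h>

    static void compress_store(uint32_t *dst, const uint32_t *src,
                               unsigned int n, uint64_t mask)
    {
        unsigned int i, j;

        for ( i = j = 0; i < n; ++i )
            if ( mask & (1ull << i) )
                dst[j++] = src[i]; /* selected elements stored densely */
    }

    static void expand_load(uint32_t *dst, const uint32_t *src,
                            unsigned int n, uint64_t mask, bool zero)
    {
        unsigned int i, j;

        for ( i = j = 0; i < n; ++i )
        {
            if ( mask & (1ull << i) )
                dst[i] = src[j++]; /* elements read densely */
            else if ( zero )
                dst[i] = 0; /* {z} form zeroes masked-off elements */
        }
    }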

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,6 +109,7 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(compress,     66, 0f38, 8a,    vl,     sd, el),
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
@@ -140,6 +141,7 @@ static const struct test avx512f_all[] =
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
     INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
+    INSN(expand,       66, 0f38, 88,    vl,     sd, el),
     INSN(fixupimm,     66, 0f3a, 54,    vl,     sd, vl),
     INSN(fixupimm,     66, 0f3a, 55,    el,     sd, el),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
@@ -213,6 +215,7 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(pcompress,    66, 0f38, 8b,    vl,     dq, el),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
     INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
@@ -221,6 +224,7 @@ static const struct test avx512f_all[] =
     INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
+    INSN(pexpand,      66, 0f38, 89,    vl,     dq, el),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -510,6 +514,11 @@ static const struct test avx512_vbmi_all
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
 
+static const struct test avx512_vbmi2_all[] = {
+    INSN(pcompress, 66, 0f38, 63, vl, bw, el),
+    INSN(pexpand,   66, 0f38, 62, vl, bw, el),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -853,4 +862,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 512);
     RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
+    RUN(avx512_vbmi2, all);
 }
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3966,6 +3966,113 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    /*
+     * The following compress/expand tests are not only making sure the
+     * accessed data is correct, but they also verify (by placing operands
+     * on the mapping boundaries) that elements controlled by clear mask
+     * bits don't get accessed.
+     */
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vpcompressd);
+        decl_insn(vpcompressq);
+        decl_insn(vpexpandd);
+        decl_insn(vpexpandq);
+        static const struct {
+            unsigned int d[16];
+        } dsrc = { { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 } };
+        static const struct {
+            unsigned long long q[8];
+        } qsrc = { { 0, 1, 2, 3, 4, 5, 6, 7 } };
+        unsigned int *ptr = res + MMAP_SZ / sizeof(*res) - 32;
+
+        printf("%-40s", "Testing vpcompressd %zmm1,24*4(%ecx){%k2}...");
+        asm volatile ( "kmovw %1, %%k2\n\t"
+                       "vmovdqu32 %2, %%zmm1\n"
+                       put_insn(vpcompressd,
+                                "vpcompressd %%zmm1, 24*4(%0)%{%%k2%}")
+                       :: "c" (NULL), "r" (0x55aa), "m" (dsrc) );
+
+        memset(ptr, 0xdb, 32 * 4);
+        set_insn(vpcompressd);
+        regs.ecx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressd) ||
+             memcmp(ptr, ptr + 8, 16 * 4) )
+            goto fail;
+        for ( i = 0; i < 4; ++i )
+            if ( ptr[24 + i] != 2 * i + 1 )
+                goto fail;
+        for ( ; i < 8; ++i )
+            if ( ptr[24 + i] != 2 * i )
+                goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandd 8*4(%edx),%zmm3{%k2}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm3, %%zmm3, %%zmm3\n"
+                       put_insn(vpexpandd,
+                                "vpexpandd 8*4(%0), %%zmm3%{%%k2%}%{z%}")
+                       :: "d" (NULL) );
+        set_insn(vpexpandd);
+        regs.edx = (unsigned long)(ptr + 16);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandd) )
+            goto fail;
+        asm ( "vmovdqa32 %%zmm1, %%zmm2%{%%k2%}%{z%}\n\t"
+              "vpcmpeqd %%zmm2, %%zmm3, %%k0\n\t"
+              "kmovw %%k0, %0"
+              : "=r" (rc) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpcompressq %zmm4,12*8(%edx){%k3}...");
+        asm volatile ( "kmovw %1, %%k3\n\t"
+                       "vmovdqu64 %2, %%zmm4\n"
+                       put_insn(vpcompressq,
+                                "vpcompressq %%zmm4, 12*8(%0)%{%%k3%}")
+                       :: "d" (NULL), "r" (0x5a), "m" (qsrc) );
+
+        memset(ptr, 0xdb, 32 * 4);
+        set_insn(vpcompressq);
+        regs.edx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressq) ||
+             memcmp(ptr, ptr + 8, 8 * 8) )
+            goto fail;
+        for ( i = 0; i < 2; ++i )
+        {
+            if ( ptr[(12 + i) * 2] != 2 * i + 1 ||
+                 ptr[(12 + i) * 2 + 1] )
+                goto fail;
+        }
+        for ( ; i < 4; ++i )
+        {
+            if ( ptr[(12 + i) * 2] != 2 * i ||
+                 ptr[(12 + i) * 2 + 1] )
+                goto fail;
+        }
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandq 4*8(%ecx),%zmm5{%k3}{z}...");
+        asm volatile ( "vpternlogq $0x81, %%zmm5, %%zmm5, %%zmm5\n"
+                       put_insn(vpexpandq,
+                                "vpexpandq 4*8(%0), %%zmm5%{%%k3%}%{z%}")
+                       :: "c" (NULL) );
+        set_insn(vpexpandq);
+        regs.ecx = (unsigned long)(ptr + 16);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandq) )
+            goto fail;
+        asm ( "vmovdqa64 %%zmm4, %%zmm6%{%%k3%}%{z%}\n\t"
+              "vpcmpeqq %%zmm5, %%zmm6, %%k0\n\t"
+              "kmovw %%k0, %0"
+              : "=r" (rc) );
+        if ( rc != 0xff )
+            goto fail;
+        printf("okay\n");
+    }
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -59,6 +59,9 @@
     (type *)((char *)mptr__ - offsetof(type, member)); \
 })
 
+#define hweight32 __builtin_popcount
+#define hweight64 __builtin_popcountll
+
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
 extern uint32_t mxcsr_mask;
@@ -138,6 +141,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
+#define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -482,6 +482,8 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
+    [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -489,6 +491,10 @@ static const struct ext0f38_table {
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x88] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_dq },
+    [0x89] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_dq },
+    [0x8a] = { .simd_size = simd_packed_fp, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
+    [0x8b] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
@@ -1901,6 +1907,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
+#define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -8840,6 +8847,36 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x62): /* vpexpand{b,w} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x63): /* vpcompress{b,w} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        elem_bytes = 1 << evex.w;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x88): /* vexpandp{d,s} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x89): /* vpexpand{d,q} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8a): /* vcompressp{d,s} [xyz]mm,[xyz]mm/mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8b): /* vpcompress{d,q} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.br, EXC_UD);
+        avx512_vlen_check(false);
+        /*
+         * For the respective code below the main switch() to work we need to
+         * compact op_mask here: Memory accesses are non-sparse even if the
+         * mask register has sparsely set bits.
+         */
+        if ( likely(fault_suppression) )
+        {
+            n = 1 << ((b & 8 ? 2 : 4) + evex.lr - evex.w);
+            EXPECT(elem_bytes > 0);
+            ASSERT(op_bytes == n * elem_bytes);
+            op_mask &= ~0ULL >> (64 - n);
+            n = hweight64(op_mask);
+            op_bytes = n * elem_bytes;
+            if ( n )
+                op_mask = ~0ULL >> (64 - n);
+        }
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -110,6 +110,7 @@
 
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
+#define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -228,6 +228,7 @@ XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
+XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
 
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -266,10 +266,10 @@ def crunch_numbers(state):
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
                   AVX512_VPOPCNTDQ],
 
-        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask
         # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_VBMI],
+        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v5 47/47] x86emul: support remaining misc AVX512{F, BW} insns
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (45 preceding siblings ...)
  2018-11-19 10:40   ` [PATCH v5 46/47] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
@ 2018-11-19 10:41   ` Jan Beulich
  46 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-19 10:41 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512BW in the insn emulator, and leaves just
the scatter/gather ones open in the AVX512F set.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
---
TBD: The *blendm* inline functions don't reliably produce the intended
     insns, as the respective moves are about as good a fit for the
     compiler when looking for a match for the intended operation. We'd
     need to switch to inline assembly if we wanted to guarantee the
     testing of those insns. Thoughts?
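
One shape such an inline assembly wrapper could take (an untested sketch,
using intrinsics types for brevity; "Yk" restricts the mask to k1...k7):

    #include <immintrin.h>

    /* Force vpblendmd: r[i] = k[i] ? y[i] : x[i]. */
    static inline __m512i blendmd(__m512i x, __m512i y, __mmask16 k)
    {
        __m512i r;

        asm ( "vpblendmd %[y], %[x], %[r]%{%[k]%}"
              : [r] "=v" (r)
              : [x] "v" (x), [y] "vm" (y), [k] "Yk" (k) );
        return r;
    }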

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -105,6 +105,8 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN(align,        66, 0f3a, 03,    vl,     dq, vl),
+    INSN(blendm,       66, 0f38, 65,    vl,     sd, vl),
     INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
@@ -206,6 +208,7 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(pblendm,      66, 0f38, 64,    vl,     dq, vl),
 //       pbroadcast,   66, 0f38, 7c,          dq64
     INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
     INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
@@ -355,6 +358,7 @@ static const struct test avx512f_512[] =
 };
 
 static const struct test avx512bw_all[] = {
+    INSN(dbpsadbw,    66, 0f3a, 42,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
@@ -374,6 +378,7 @@ static const struct test avx512bw_all[]
     INSN(palignr,     66, 0f3a, 0f,    vl,    b, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pblendm,     66, 0f38, 66,    vl,   bw, vl),
     INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
 //       pbroadcastb, 66, 0f38, 7a,           b
     INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -290,7 +290,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  endif
-#  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define mix(x, y) B(blendmps_, _mask, x, y, (0b1010101010101010 & ALL_TRUE))
 #  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
@@ -363,7 +363,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  endif
-#  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define mix(x, y) B(blendmpd_, _mask, x, y, 0b10101010)
 #  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
@@ -557,8 +557,9 @@ static inline bool _to_bool(byte_vec_t b
                              0b00011011, (vsi_t)undef(), ~0))
 #   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
-                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define mix(x, y) ((vec_t)B(blendmd_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b1010101010101010 & ((1 << ELEM_COUNT) - 1))))
+#  define rotr(x, n) ((vec_t)B(alignd, _mask, (vsi_t)(x), (vsi_t)(x), n, (vsi_t)undef(), ~0))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
@@ -595,7 +596,8 @@ static inline bool _to_bool(byte_vec_t b
                              0b01001110, (vsi_t)undef(), ~0))
 #   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  define mix(x, y) ((vec_t)B(blendmq_, _mask, (vdi_t)(x), (vdi_t)(y), 0b10101010))
+#  define rotr(x, n) ((vec_t)B(alignq, _mask, (vdi_t)(x), (vdi_t)(x), n, (vdi_t)undef(), ~0))
 #  if VEC_SIZE == 32
 #   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
 #  elif VEC_SIZE == 64
@@ -647,8 +649,8 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
-                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define mix(x, y) ((vec_t)B(blendmb_, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b1010101010101010101010101010101010101010101010101010101010101010LL & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
@@ -680,8 +682,8 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
-                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define mix(x, y) ((vec_t)B(blendmw_, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b10101010101010101010101010101010 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
 #  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -484,6 +484,7 @@ static const struct ext0f38_table {
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
     [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
+    [0x64 ... 0x66] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -550,6 +551,7 @@ static const struct ext0f3a_table {
     [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
+    [0x03] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -581,8 +583,7 @@ static const struct ext0f3a_table {
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
-    [0x42] = { .simd_size = simd_packed_int },
-    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x42 ... 0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6181,6 +6182,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x64): /* vpblendm{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x65): /* vblendmp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6909,6 +6912,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x0b): /* vpmulhrsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x66): /* vpblendm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.br, EXC_UD);
         elem_bytes = 1 << (b & 1);
@@ -8073,10 +8077,12 @@ x86_emulate(
         goto simd_0f_to_gpr;
 
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        fault_suppression = false;
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x03): /* valign{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_imm8_no_sae:
         host_and_vcpu_must_have(avx512f);
@@ -9410,6 +9416,9 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 4;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x42): /* vdbpsadbw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0f): /* vpalignr $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         goto avx512bw_imm;





* Re: [PATCH v5 01/47] x86emul: introduce IMPOSSIBLE()
  2018-11-19 10:13   ` [PATCH v5 01/47] x86emul: introduce IMPOSSIBLE() Jan Beulich
@ 2018-11-19 18:11     ` Andrew Cooper
  2018-11-20  8:12       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-11-19 18:11 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/11/2018 10:13, Jan Beulich wrote:
> @@ -8828,12 +8837,7 @@ x86_emulate(
>                  dst.type = OP_NONE;
>                  break;
>              default:
> -                if ( (d & DstMask) != DstMem )
> -                {
> -                    ASSERT_UNREACHABLE();
> -                    rc = X86EMUL_UNHANDLEABLE;
> -                    goto done;
> -                }
> +                IMPOSSIBLE((d & DstMask) != DstMem);

IMPOSSIBLE() doesn't really convey the correct meaning here IMO, because
the purpose of the construct is to try and do something safe in the case
that the impossible does happen.

Instead, I'd suggest EXPECT() or REQUIRE() with an inverted condition,
because that better encapsulates the meaning that we expect this always
to be true, but that it might not be.

With a suitable name, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v5 02/47] x86emul: support basic AVX512 moves
  2018-11-19 10:13   ` [PATCH v5 02/47] x86emul: support basic AVX512 moves Jan Beulich
@ 2018-11-19 18:35     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-19 18:35 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/11/2018 10:13, Jan Beulich wrote:
> Note: SDM Vol 2 rev 067 is not really consistent about EVEX.L'L for LIG
>       insns - the only place where this is made explicit is a table in
>       the section titled "Vector Length Orthogonality": While they
>       tolerate 0, 1, and 2, a value of 3 uniformly leads to #UD.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
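
(For reference, the quoted L'L rule boils down to a check of roughly
this shape, using the evex field names seen elsewhere in this series;
sketch only:

    generate_exception_if(evex.lr == 3, EXC_UD);

i.e. values 0-2 are tolerated for LIG insns, while 3 yields #UD.)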


* Re: [PATCH v5 03/47] x86emul: test for correct EVEX Disp8 scaling
  2018-11-19 10:14   ` [PATCH v5 03/47] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
@ 2018-11-19 18:35     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-19 18:35 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/11/2018 10:14, Jan Beulich wrote:
> Besides the already existing tests (which are going to be extended once
> respective ISA extension support is complete), let's also ensure for
> every individual insn that its Disp8 scaling (and memory access width)
> is correct.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
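
(As background to what gets tested: EVEX compresses 8-bit displacements
by the memory access width. With d8s holding the log2 scale factor, as
in the emulator's tables, the effective displacement is, sketch only:

#include <stdint.h>

static int32_t effective_disp8(int8_t disp8, unsigned int d8s)
{
    /* E.g. d8s == 4 means the stored byte is in 16-byte units. */
    return (int32_t)disp8 * (1 << d8s);
}
)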


* Re: [PATCH v5 04/47] x86emul: also allow running the 32-bit harness on a 64-bit distro
  2018-11-19 10:14   ` [PATCH v5 04/47] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
@ 2018-11-19 18:40     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-19 18:40 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/11/2018 10:14, Jan Beulich wrote:
> In order to be able to verify the 32-bit variant builds and runs,
> introduce a respective target (and the necessary other adjustments).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

This seems to all work fine now.


* Re: [PATCH v5 12/47] x86emul: support AVX512{F, DQ} FP broadcast insns
  2018-11-19 10:19   ` [PATCH v5 12/47] x86emul: support AVX512{F, DQ} FP broadcast insns Jan Beulich
@ 2018-11-19 18:44     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-19 18:44 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/11/2018 10:19, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v5 14/47] x86emul: support AVX512{F, BW} packed integer compare insns
  2018-11-19 10:21   ` [PATCH v5 14/47] x86emul: support AVX512{F, BW} packed integer compare insns Jan Beulich
@ 2018-11-19 19:09     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-19 19:09 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/11/2018 10:21, Jan Beulich wrote:
> Include VPTEST{,N}M{B,D,Q,W} as once again possibly used by the compiler
> for comparison against all-zero vectors.
>
> Also table entries for a few more insns get their .d8s field set right
> away, again in order to not split and later re-combine the groups.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
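
(To illustrate the compiler use mentioned above: a zero comparison of
the shape below is what typically ends up as VPTESTNMD; the helper is a
sketch, not part of the patch:

typedef int __attribute__((vector_size(64))) zvsi_t;

static inline _Bool all_zero(zvsi_t x)
{
    unsigned short k_;

    /* Mask bit i is set iff element i of x & x, i.e. of x, is zero. */
    asm ( "vptestnmd %1, %1, %0" : "=k" (k_) : "v" (x) );
    return k_ == 0xffff;
}
)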


* Re: [PATCH v5 15/47] x86emul: support AVX512{F, BW} packed integer arithmetic insns
  2018-11-19 10:21   ` [PATCH v5 15/47] x86emul: support AVX512{F, BW} packed integer arithmetic insns Jan Beulich
@ 2018-11-19 19:18     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-19 19:18 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/11/2018 10:21, Jan Beulich wrote:
> Note: vpadd* / vpsub* et al are put at seemingly the wrong slot of the
> big switch(). This is in anticipation of adding e.g. vpunpck* to those
> groups (see the legacy/VEX encoded case labels nearby to support this).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v5 16/47] x86emul: use simd_128 also for legacy vector shift insns
  2018-11-19 10:22   ` [PATCH v5 16/47] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
@ 2018-11-19 19:21     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-11-19 19:21 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/11/2018 10:22, Jan Beulich wrote:
> @@ -3344,7 +3344,8 @@ x86_decode(
>          break;
>  
>      case simd_128:
> -        op_bytes = 16;
> +        /* The special case here are MMX shift insns. */

"special case here is the MMX" or "special cases here are".

Otherwise, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v5 01/47] x86emul: introduce IMPOSSIBLE()
  2018-11-19 18:11     ` Andrew Cooper
@ 2018-11-20  8:12       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-11-20  8:12 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 19.11.18 at 19:11, <andrew.cooper3@citrix.com> wrote:
> On 19/11/2018 10:13, Jan Beulich wrote:
>> @@ -8828,12 +8837,7 @@ x86_emulate(
>>                  dst.type = OP_NONE;
>>                  break;
>>              default:
>> -                if ( (d & DstMask) != DstMem )
>> -                {
>> -                    ASSERT_UNREACHABLE();
>> -                    rc = X86EMUL_UNHANDLEABLE;
>> -                    goto done;
>> -                }
>> +                IMPOSSIBLE((d & DstMask) != DstMem);
> 
> IMPOSSIBLE() doesn't really convey the correct meaning here IMO, because
> the purpose of the construct is to try and do something safe in the case
> that the impossible does happen.
> 
> Instead, I'd suggest EXPECT() or REQUIRE() with an inverted condition,
> because that better encapsulates the meaning that we expect this always
> to be true, but that it might not be.

Okay, EXPECT() it'll be then.
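
Something along these lines, i.e. retaining the current fallback
behavior (sketch only, exact form to be settled):

#define EXPECT(cond)                        \
    do {                                    \
        if ( unlikely(!(cond)) )            \
        {                                   \
            ASSERT_UNREACHABLE();           \
            rc = X86EMUL_UNHANDLEABLE;      \
            goto done;                      \
        }                                   \
    } while ( false )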

> With a suitable name, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

Jan




* [PATCH v6 00/42] x86emul: fair parts of AVX512 support
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (9 preceding siblings ...)
  2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
@ 2018-12-06  9:45 ` Jan Beulich
  2018-12-06  9:51   ` [PATCH v6 01/42] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
                     ` (41 more replies)
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
  12 siblings, 42 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:45 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

01: support AVX512{F,BW} shift/rotate insns
02: support AVX512{F,BW,DQ} extract insns
03: support AVX512{F,BW,DQ} insert insns
04: basic AVX512F testing
05: support AVX512{F,BW,DQ} integer broadcast insns
06: basic AVX512VL testing
07: support AVX512{F,BW} zero- and sign-extending moves
08: support AVX512{F,BW} down conversion moves
09: support AVX512{F,BW} integer unpack insns
10: support AVX512{F,BW,_VBMI} full permute insns
11: support AVX512{F,BW} integer shuffle insns
12: support AVX512{BW,DQ} mask move insns
13: basic AVX512BW testing
14: basic AVX512DQ testing
15: support AVX512F move high/low insns
16: support AVX512F move duplicate insns
17: support AVX512{F,BW,VBMI} permute insns
18: support AVX512BW pack insns
19: support AVX512F floating-point conversion insns
20: support AVX512F legacy-equivalent packed int/FP conversion insns
21: support AVX512F legacy-equivalent scalar int/FP conversion insns
22: support AVX512DQ packed quad-int/FP conversion insns
23: support AVX512{F,DQ} uint-to-FP conversion insns
24: support AVX512{F,DQ} FP-to-uint conversion insns
25: support remaining AVX512F legacy-equivalent insns
26: support remaining AVX512BW legacy-equivalent insns
27: support AVX512{F,ER} reciprocal insns
28: support AVX512F floating point manipulation insns
29: support AVX512DQ floating point manipulation insns
30: support AVX512{F,_VBMI2} compress/expand insns
31: support remaining misc AVX512{F,BW} insns
32: support AVX512F gather insns
33: add high register S/G test cases
34: support AVX512F scatter insns
35: support AVX512PF insns
36: support AVX512CD insns
37: complete support of AVX512_VBMI insns
38: support of AVX512* population count insns
39: support of AVX512_IFMA insns
40: support remaining AVX512_VBMI2 insns
41: support AVX512_4FMAPS insns
42: support AVX512_4VNNIW insns

This adds support for all AVX512* insns in SDM rev 067 as well as
a few (AVX512_VBMI2) from ISA extensions rev 034.

Jan




* [PATCH v6 01/42] x86emul: support AVX512{F, BW} shift/rotate insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
@ 2018-12-06  9:51   ` Jan Beulich
  2018-12-06  9:51   ` [PATCH v6 02/42] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
                     ` (40 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:51 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that simd_packed_fp for the opcode space 0f38 major opcodes 14 and
15 is not really correct, but sufficient for the purposes here. Further
adjustments may later be needed for the down conversion unsigned
saturating VPMOV* insns, first and foremost for the different Disp8
scaling those ones use.
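
(Sketch of the kind of later adjustment alluded to: VPMOVQB, for
example, writes only VL/8 bytes of memory, so its table entry would
need a correspondingly reduced scale along the lines of

    [0x32] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_8 },

where both the opcode's entry and the d8s_vl_by_8 enumerator are shown
for illustration only, not as they appear in the current tables.)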

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -178,6 +178,24 @@ static const struct test avx512f_all[] =
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSNX(prol,        66,   0f, 72, 1, vl,     dq, vl),
+    INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
+    INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
+    INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
+    INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
+    INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
+    INSNX(psllq,       66,   0f, 73, 6, vl,      q, vl),
+    INSN(psllv,        66, 0f38, 47,    vl,     dq, vl),
+    INSNX(psra,        66,   0f, 72, 4, vl,     dq, vl),
+    INSN(psrad,        66,   0f, e2,    el_4,    d, vl),
+    INSN(psraq,        66,   0f, e2,    el_2,    q, vl),
+    INSN(psrav,        66, 0f38, 46,    vl,     dq, vl),
+    INSN(psrld,        66,   0f, d2,    el_4,    d, vl),
+    INSNX(psrld,       66,   0f, 72, 2, vl,      d, vl),
+    INSN(psrlq,        66,   0f, d3,    el_2,    q, vl),
+    INSNX(psrlq,       66,   0f, 73, 2, vl,      q, vl),
+    INSN(psrlv,        66, 0f38, 45,    vl,     dq, vl),
     INSN(psubd,        66,   0f, fa,    vl,      d, vl),
     INSN(psubq,        66,   0f, fb,    vl,      q, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
@@ -241,6 +259,17 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
+    INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
+    INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
+    INSNX(psllw,      66,   0f, 71, 6, vl,    w, vl),
+    INSN(psravw,      66, 0f38, 11,    vl,    w, vl),
+    INSN(psraw,       66,   0f, e1,    el_8,  w, vl),
+    INSNX(psraw,      66,   0f, 71, 4, vl,    w, vl),
+    INSNX(psrldq,     66,   0f, 73, 3, vl,    b, vl),
+    INSN(psrlvw,      66, 0f38, 10,    vl,    w, vl),
+    INSN(psrlw,       66,   0f, d1,    el_8,  w, vl),
+    INSNX(psrlw,      66,   0f, 71, 2, vl,    w, vl),
     INSN(psubb,       66,   0f, f8,    vl,    b, vl),
     INSN(psubsb,      66,   0f, e8,    vl,    b, vl),
     INSN(psubsw,      66,   0f, e9,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -319,7 +319,7 @@ static const struct twobyte_table {
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
-    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
+    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
@@ -366,19 +366,19 @@ static const struct twobyte_table {
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
     [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -434,9 +434,9 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
-    [0x10] = { .simd_size = simd_packed_int },
+    [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
-    [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
+    [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x45 ... 0x47] = { .simd_size = simd_packed_int },
+    [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
@@ -5987,10 +5987,15 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x14): /* vprorv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x15): /* vprolv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x39): /* vpmins{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3b): /* vpminu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3d): /* vpmaxs{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6608,6 +6613,9 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
@@ -6907,6 +6915,37 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x71): /* Grp12 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsraw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512bw_shift_imm:
+            fault_suppression = false;
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512bw_imm;
+        }
+        goto unrecognized_insn;
+
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x72): /* Grp13 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpslld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(evex.w, EXC_UD);
+            /* fall through */
+        case 0: /* vpror{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 1: /* vprol{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsra{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512f_shift_imm:
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512f_imm8_no_sae;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x73):        /* Grp14 */
         switch ( modrm_reg & 7 )
         {
@@ -6932,6 +6971,19 @@ x86_emulate(
         }
         goto unrecognized_insn;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x73): /* Grp14 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(!evex.w, EXC_UD);
+            goto avx512f_shift_imm;
+        case 3: /* vpsrldq $imm8,{x,y}mm,{x,y}mm */
+        case 7: /* vpslldq $imm8,{x,y}mm,{x,y}mm */
+            goto avx512bw_shift_imm;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x77):        /* emms */
     case X86EMUL_OPC_VEX(0x0f, 0x77):    /* vzero{all,upper} */
         if ( vex.opcx != vex_none )
@@ -7869,6 +7921,16 @@ x86_emulate(
         }
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe2): /* vpsra{d,q} xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.br, EXC_UD);
+        fault_suppression = false;
+        if ( b == 0xe2 )
+            goto avx512f_no_sae;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8109,6 +8171,14 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
         generate_exception_if(evex.w || evex.br, EXC_UD);
     avx512_broadcast:
@@ -8867,6 +8937,7 @@ x86_emulate(
         generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
         if ( !(b & 0x20) )
             goto avx512f_imm8_no_sae;
+    avx512bw_imm:
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.br, EXC_UD);
         elem_bytes = 1 << evex.w;





* [PATCH v6 02/42] x86emul: support AVX512{F, BW, DQ} extract insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
  2018-12-06  9:51   ` [PATCH v6 01/42] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
@ 2018-12-06  9:51   ` Jan Beulich
  2018-12-06  9:51   ` [PATCH v6 03/42] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
                     ` (39 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:51 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Make use of d8s_dq64.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -212,6 +212,7 @@ static const struct test avx512f_all[] =
 };
 
 static const struct test avx512f_128[] = {
+    INSN(extractps, 66, 0f3a, 17, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -221,10 +222,14 @@ static const struct test avx512f_128[] =
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+    INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
+    INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
+    INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -280,6 +285,12 @@ static const struct test avx512bw_all[]
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
 
+static const struct test avx512bw_128[] = {
+    INSN(pextrb, 66, 0f3a, 14, el, b, el),
+//       pextrw, 66,   0f, c5,     w
+    INSN(pextrw, 66, 0f3a, 15, el, w, el),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -288,13 +299,21 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_128[] = {
+    INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+};
+
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
+    INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
+    INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
@@ -632,7 +651,9 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, no128);
     RUN(avx512f, 512);
     RUN(avx512bw, all);
+    RUN(avx512bw, 128);
     RUN(avx512dq, all);
+    RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -512,9 +512,13 @@ static const struct ext0f3a_table {
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
-    [0x14 ... 0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1 },
+    [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
+    [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
+    [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
+    [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
     [0x18] = { .simd_size = simd_128 },
-    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
@@ -523,7 +527,8 @@ static const struct ext0f3a_table {
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
-    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
@@ -2669,6 +2674,8 @@ x86_decode_0f3a(
      ... X86EMUL_OPC_66(0, 0x17):     /* pextr*, extractps */
     case X86EMUL_OPC_VEX_66(0, 0x14)
      ... X86EMUL_OPC_VEX_66(0, 0x17): /* vpextr*, vextractps */
+    case X86EMUL_OPC_EVEX_66(0, 0x14)
+     ... X86EMUL_OPC_EVEX_66(0, 0x17): /* vpextr*, vextractps */
     case X86EMUL_OPC_VEX_F2(0, 0xf0): /* rorx */
         break;
 
@@ -8856,9 +8863,9 @@ x86_emulate(
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */
         rex_prefix &= ~REX_B;
-        vex.b = 1;
+        evex.b = vex.b = 1;
         if ( !mode_64bit() )
-            vex.w = 0;
+            evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
         opc[3] = 0xc3;
@@ -8868,7 +8875,10 @@ x86_emulate(
             --opc;
         }
 
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+            copy_EVEX(opc, evex);
+        else
+            copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=m" (dst.val) : "a" (&dst.val));
         put_stub(stub);
 
@@ -8888,6 +8898,52 @@ x86_emulate(
         opc = init_prefixes(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
+        generate_exception_if(ea.type != OP_REG, EXC_UD);
+        /* Convert to alternative encoding: We want to use a memory operand. */
+        evex.opcx = ext_0f3a;
+        b = 0x15;
+        modrm <<= 3;
+        evex.r = evex.b;
+        evex.R = evex.x;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x14): /* vpextrb $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x15): /* vpextrw $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x16): /* vpextr{d,q} $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x17): /* vextractps $imm8,xmm,r/m */
+        generate_exception_if((evex.lr || evex.reg != 0xf || !evex.RX ||
+                               evex.opmsk || evex.br),
+                              EXC_UD);
+        if ( !(b & 2) )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( !(b & 1) )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto pextr;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(evex.lr != 2 || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
     {
         uint32_t mxcsr;





* [PATCH v6 03/42] x86emul: support AVX512{F, BW, DQ} insert insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
  2018-12-06  9:51   ` [PATCH v6 01/42] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
  2018-12-06  9:51   ` [PATCH v6 02/42] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
@ 2018-12-06  9:51   ` Jan Beulich
  2018-12-06  9:52   ` [PATCH v6 04/42] x86emul: basic AVX512F testing Jan Beulich
                     ` (38 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:51 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also correct the comment of the AVX form of VINSERTPS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Don't refuse to emulate VINSERTPS without AVX512VL.
v4: Make use of d8s_dq64.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -213,6 +213,7 @@ static const struct test avx512f_all[] =
 
 static const struct test avx512f_128[] = {
     INSN(extractps, 66, 0f3a, 17, el,    d, el),
+    INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -224,12 +225,16 @@ static const struct test avx512f_no128[]
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
+    INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
+    INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
+    INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
+    INSN(inserti64x4,    66, 0f3a, 3a, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -289,6 +294,8 @@ static const struct test avx512bw_128[]
     INSN(pextrb, 66, 0f3a, 14, el, b, el),
 //       pextrw, 66,   0f, c5,     w
     INSN(pextrw, 66, 0f3a, 15, el, w, el),
+    INSN(pinsrb, 66, 0f3a, 20, el, b, el),
+    INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
 static const struct test avx512dq_all[] = {
@@ -301,6 +308,7 @@ static const struct test avx512dq_all[]
 
 static const struct test avx512dq_128[] = {
     INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+    INSN(pinsr, 66, 0f3a, 22, el, dq64, el),
 };
 
 static const struct test avx512dq_no128[] = {
@@ -308,12 +316,16 @@ static const struct test avx512dq_no128[
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
+    INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
+    INSN(inserti64x2,    66, 0f3a, 38, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
+    INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
+    INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -360,7 +360,7 @@ static const struct twobyte_table {
     [0xc1] = { DstMem|SrcReg|ModRM },
     [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
-    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
+    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int, 1 },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
     [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
@@ -516,17 +516,19 @@ static const struct ext0f3a_table {
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
     [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
-    [0x18] = { .simd_size = simd_128 },
+    [0x18] = { .simd_size = simd_128, .d8s = 4 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x20] = { .simd_size = simd_none },
-    [0x21] = { .simd_size = simd_other },
-    [0x22] = { .simd_size = simd_none },
+    [0x20] = { .simd_size = simd_none, .d8s = 0 },
+    [0x21] = { .simd_size = simd_other, .d8s = 2 },
+    [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
-    [0x38] = { .simd_size = simd_128 },
+    [0x38] = { .simd_size = simd_128, .d8s = 4 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -2579,6 +2581,7 @@ x86_decode_twobyte(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
     case X86EMUL_OPC_VEX_66(0, 0xc4): /* vpinsrw */
+    case X86EMUL_OPC_EVEX_66(0, 0xc4): /* vpinsrw */
         state->desc = DstReg | SrcMem16;
         break;
 
@@ -2681,6 +2684,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x20):     /* pinsrb */
     case X86EMUL_OPC_VEX_66(0, 0x20): /* vpinsrb */
+    case X86EMUL_OPC_EVEX_66(0, 0x20): /* vpinsrb */
         state->desc = DstImplicit | SrcMem;
         if ( modrm_mod != 3 )
             state->desc |= ByteOp;
@@ -2688,6 +2692,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x22):     /* pinsr{d,q} */
     case X86EMUL_OPC_VEX_66(0, 0x22): /* vpinsr{d,q} */
+    case X86EMUL_OPC_EVEX_66(0, 0x22): /* vpinsr{d,q} */
         state->desc = DstImplicit | SrcMem;
         break;
 
@@ -7711,6 +7716,23 @@ x86_emulate(
         ea.type = OP_MEM;
         goto simd_0f_int_imm8;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc4):   /* vpinsrw $imm8,r32/m16,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x20): /* vpinsrb $imm8,r32/m8,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x22): /* vpinsr{d,q} $imm8,r/m,xmm,xmm */
+        generate_exception_if(evex.lr || evex.opmsk || evex.br, EXC_UD);
+        if ( b & 2 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        if ( !mode_64bit() )
+            evex.w = 0;
+        memcpy(mmvalp, &src.val, op_bytes);
+        ea.type = OP_MEM;
+        op_bytes = src.bytes;
+        d = SrcMem16; /* Fake for the common SIMD code below. */
+        state->simd_size = simd_other;
+        goto avx512f_imm8_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0xc5):      /* pextrw $imm8,{,x}mm,reg */
     case X86EMUL_OPC_VEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
         generate_exception_if(vex.l, EXC_UD);
@@ -8924,8 +8946,12 @@ x86_emulate(
         opc = init_evex(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x18): /* vinsertf32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinsertf64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x38): /* vinserti32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinserti64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
@@ -8934,8 +8960,12 @@ x86_emulate(
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1a): /* vinsertf32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinsertf64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3a): /* vinserti32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinserti64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
         if ( !evex.w )
@@ -9028,13 +9058,20 @@ x86_emulate(
         op_bytes = 4;
         goto simd_0f3a_common;
 
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
         op_bytes = 4;
         /* fall through */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x41): /* vdppd $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.lr || evex.w || evex.opmsk || evex.br,
+                              EXC_UD);
+        op_bytes = 4;
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )





* [PATCH v6 04/42] x86emul: basic AVX512F testing
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (2 preceding siblings ...)
  2018-12-06  9:51   ` [PATCH v6 03/42] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
@ 2018-12-06  9:52   ` Jan Beulich
  2018-12-06  9:53   ` [PATCH v6 05/42] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
                     ` (37 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:52 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have been implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Fix formatting in simd.h.
v5: Add VSQRT* tests.
v4: Make eq() also work for 4- and 8-byte integer element sizes.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -63,6 +63,9 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
+avx512f-vecs := 64
+avx512f-ints := 4 8
+avx512f-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
@@ -170,7 +173,7 @@ $(addsuffix .c,$(SG)):
 
 $(addsuffix .h,$(SIMD) $(FMA) $(SG)): simd.h
 
-xop.h: simd-fma.c
+xop.h avx512f.h: simd-fma.c
 
 endif # 32-bit override
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -2,7 +2,41 @@
 
 ENTRY(simd_test);
 
-#if VEC_SIZE == 8 && defined(__SSE__)
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if VEC_SIZE == 4
+#  define eq(x, y) ({ \
+    float x_ = (x)[0]; \
+    float __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpss $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif VEC_SIZE == 8
+#  define eq(x, y) ({ \
+    double x_ = (x)[0]; \
+    double __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpsd $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif FLOAT_SIZE == 4
+/*
+ * gcc's (up to at least 8.2) __builtin_ia32_cmpps256_mask() has an anomaly in
+ * that its return type is QI rather than UQI, and hence the value would get
+ * sign-extended before comparing to ALL_TRUE. The same oddity does not matter
+ * for __builtin_ia32_cmppd256_mask(), as there only 4 bits are significant.
+ * Hence the extra " & ALL_TRUE".
+ */
+#  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
+# elif FLOAT_SIZE == 8
+#  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif INT_SIZE == 4 || UINT_SIZE == 4
+#  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+# endif
+#elif VEC_SIZE == 8 && defined(__SSE__)
 # define to_bool(cmp) (__builtin_ia32_pmovmskb(cmp) == 0xff)
 #elif VEC_SIZE == 16
 # if defined(__AVX__) && defined(FLOAT_SIZE)
@@ -93,6 +127,56 @@ static inline bool _to_bool(byte_vec_t b
     touch(x); \
     __builtin_ia32_pfrcpit2(__builtin_ia32_pfrsqit1(__builtin_ia32_pfmul(t_, t_), x), t_); \
 })
+#elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
+# if FLOAT_SIZE == 4
+#  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
+# elif FLOAT_SIZE == 8
+#  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
+# endif
+#elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastss %1, %0" \
+          : "=v" (t_) : "m" (*(float[1]){ x }) ); \
+    t_; \
+})
+#  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
+#   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
+#   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#  endif
+# elif FLOAT_SIZE == 8
+#  if VEC_SIZE >= 32
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastsd %1, %0" : "=v" (t_) \
+          : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#  else
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#  endif
+#  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
+#   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
+#   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#  endif
+# endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
 # if VEC_SIZE == 32 && defined(__AVX__)
 #  if defined(__AVX2__)
@@ -191,7 +275,30 @@ static inline bool _to_bool(byte_vec_t b
 #  define sqrt(x) scalar_1op(x, "sqrtsd %[in], %[out]")
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE2__)
+#if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
+     defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 4 || UINT_SIZE == 4
+#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+# endif
+# if INT_SIZE == 4
+#  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
+#  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 4
+#  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# endif
+#elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
 #  define interleave_lo(x, y) ((vec_t)__builtin_ia32_punpcklbw128((vqi_t)(x), (vqi_t)(y)))
@@ -587,6 +694,10 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if defined(__AVX512F__) && defined(FLOAT_SIZE)
+# include "simd-fma.c"
+#endif
+
 int simd_test(void)
 {
     unsigned int i, j;
@@ -1034,7 +1145,8 @@ int simd_test(void)
 # endif
 #endif
 
-#if defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)
+#if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
+    (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
 #endif
 
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,9 +70,111 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE == 16
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
+# define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
+#elif VEC_SIZE == 32
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 256 ## s(a)
+#elif VEC_SIZE == 64
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 512 ## s(a)
+# define BR(n, s, a...)  __builtin_ia32_ ## n ## 512 ## s(a, 4)
+#endif
+#ifndef B_
+# define B_ B
+#endif
+#ifndef BR
+# define BR B
+# define BR_ B_
+#endif
+#ifndef BR_
+# define BR_ BR
+#endif
+
+#ifdef __AVX512F__
+
+/*
+ * The original plan was to effect use of EVEX encodings for scalar as well as
+ * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
+ * only of course) XMM16-XMM31 only. All sorts of compiler errors result when
+ * doing this with gcc 8.2. Therefore resort to injecting {evex} prefixes,
+ * which has the benefit of also working for 32-bit. Granted, there is a lot of
+ * escaping to get right here.
+ */
+asm ( ".macro override insn    \n\t"
+      ".macro $\\insn o:vararg \n\t"
+      ".purgem \\insn          \n\t"
+      "{evex} \\insn \\(\\)o   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\(\\))o     \n\t"
+      ".endm                   \n\t"
+      ".endm                   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\)o         \n\t"
+      ".endm                   \n\t"
+      ".endm" );
+
+# define OVR(n) asm ( "override v" #n )
+# define OVR_SFP(n) OVR(n ## sd); OVR(n ## ss)
+
+# ifdef __AVX512VL__
+#  ifdef __AVX512BW__
+#   define OVR_BW(n) OVR(p ## n ## b); OVR(p ## n ## w)
+#  else
+#   define OVR_BW(n)
+#  endif
+#  define OVR_DQ(n) OVR(p ## n ## d); OVR(p ## n ## q)
+#  define OVR_VFP(n) OVR(n ## pd); OVR(n ## ps)
+# else
+#  define OVR_BW(n)
+#  define OVR_DQ(n)
+#  define OVR_VFP(n)
+# endif
+
+# define OVR_FMA(n, w) OVR_ ## w(n ## 132); OVR_ ## w(n ## 213); \
+                       OVR_ ## w(n ## 231)
+# define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
+# define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
+
+OVR_SFP(broadcast);
+OVR_SFP(comi);
+OVR_FP(add);
+OVR_FP(div);
+OVR(extractps);
+OVR_FMA(fmadd, FP);
+OVR_FMA(fmsub, FP);
+OVR_FMA(fnmadd, FP);
+OVR_FMA(fnmsub, FP);
+OVR(insertps);
+OVR_FP(max);
+OVR_FP(min);
+OVR(movd);
+OVR(movq);
+OVR_SFP(mov);
+OVR_FP(mul);
+OVR_FP(sqrt);
+OVR_FP(sub);
+OVR_SFP(ucomi);
+
+# undef OVR_VFP
+# undef OVR_SFP
+# undef OVR_INT
+# undef OVR_FP
+# undef OVR_FMA
+# undef OVR_DQ
+# undef OVR_BW
+# undef OVR
+
+#endif /* __AVX512F__ */
+
 /*
  * Suppress value propagation by the compiler, preventing unwanted
  * optimization. This at once makes the compiler use memory operands
  * more often, which for our purposes is the more interesting case.
  */
 #define touch(var) asm volatile ( "" : "+m" (var) )
+
+static inline vec_t undef(void)
+{
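+    /* Deliberate self-initialization: the value is a don't-care, used as
+     * the merge source for the _mask builtins invoked with an all-ones mask. */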
+    vec_t v = v;
+    return v;
+}
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -1,10 +1,9 @@
+#if !defined(__XOP__) && !defined(__AVX512F__)
 #include "simd.h"
-
-#ifndef __XOP__
 ENTRY(fma_test);
 #endif
 
-#if VEC_SIZE < 16
+#if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
 #elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
@@ -24,7 +23,13 @@ ENTRY(fma_test);
 # define eq(x, y) to_bool((x) == (y))
 #endif
 
-#if VEC_SIZE == 16
+#if defined(__AVX512F__) && VEC_SIZE > FLOAT_SIZE
+# if FLOAT_SIZE == 4
+#  define fmaddsub(x, y, z) BR(vfmaddsubps, _mask, x, y, z, ~0)
+# elif FLOAT_SIZE == 8
+#  define fmaddsub(x, y, z) BR(vfmaddsubpd, _mask, x, y, z, ~0)
+# endif
+#elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
 #  define addsub(x, y) __builtin_ia32_addsubps(x, y)
 #  if defined(__FMA4__) || defined(__FMA__)
@@ -50,6 +55,10 @@ ENTRY(fma_test);
 # endif
 #endif
 
+#if defined(fmaddsub) && !defined(addsub)
+# define addsub(x, y) fmaddsub(x, broadcast(1), y)
+#endif
+
 int fma_test(void)
 {
     unsigned int i;
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -21,6 +21,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-opmask.h"
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
+#include "avx512f.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -248,6 +249,14 @@ static const struct {
     SIMD(OPMASK/b,    avx512dq_opmask,         1),
     SIMD(OPMASK/d,    avx512bw_opmask,         4),
     SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(AVX512F f32 scalar,  avx512f,        f4),
+    SIMD(AVX512F f32x16,      avx512f,      64f4),
+    SIMD(AVX512F f64 scalar,  avx512f,        f8),
+    SIMD(AVX512F f64x8,       avx512f,      64f8),
+    SIMD(AVX512F s32x16,      avx512f,      64i4),
+    SIMD(AVX512F u32x16,      avx512f,      64u4),
+    SIMD(AVX512F s64x8,       avx512f,      64i8),
+    SIMD(AVX512F u64x8,       avx512f,      64u8),
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v6 05/42] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (3 preceding siblings ...)
  2018-12-06  9:52   ` [PATCH v6 04/42] x86emul: basic AVX512F testing Jan Beulich
@ 2018-12-06  9:53   ` Jan Beulich
  2018-12-06  9:53   ` [PATCH v6 06/42] x86emul: basic AVX512VL testing Jan Beulich
                     ` (36 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the pbroadcastw table entry in evex-disp8.c is slightly
different from what one would expect, due to it requiring EVEX.W to be
zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
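
For reference, the per-element behaviour being emulated here boils down
to a replicated store under the opmask; a minimal scalar sketch (the
function name is invented and the masking handling is simplified for
illustration):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch only: vpbroadcastd replicates one 32-bit element
 * into every lane selected by the opmask; unselected lanes are zeroed
 * (EVEX.z) or left unchanged (merge-masking).
 */
static void vpbroadcastd_model(uint32_t *dst, uint32_t src,
                               unsigned int nelem, uint16_t mask, bool zero)
{
    unsigned int i;

    for ( i = 0; i < nelem; ++i )
    {
        if ( mask & (1u << i) )
            dst[i] = src;
        else if ( zero )
            dst[i] = 0;
        /* else merge-masking: dst[i] stays unchanged */
    }
}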

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -164,6 +164,9 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+//       pbroadcast,   66, 0f38, 7c,          dq64
+    INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
+    INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
     INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
     INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
@@ -222,6 +225,7 @@ static const struct test avx512f_128[] =
 
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcasti32x4, 66, 0f38, 5a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
@@ -231,6 +235,7 @@ static const struct test avx512f_no128[]
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(broadcasti64x4, 66, 0f38, 5b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
     INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
@@ -250,6 +255,10 @@ static const struct test avx512bw_all[]
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
+//       pbroadcastb, 66, 0f38, 7a,           b
+    INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
+//       pbroadcastw, 66, 0f38, 7b,           b
     INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
@@ -301,6 +310,7 @@ static const struct test avx512bw_128[]
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
+    INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
@@ -314,6 +324,7 @@ static const struct test avx512dq_128[]
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(broadcasti64x2, 66, 0f38, 5a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
     INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
@@ -322,6 +333,7 @@ static const struct test avx512dq_no128[
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(broadcasti32x8, 66, 0f38, 5b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
     INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -278,9 +278,33 @@ static inline bool _to_bool(byte_vec_t b
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if INT_SIZE == 4 || UINT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastd %1, %0" \
+          : "=v" (t_) : "m" (*(int[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(long long[1]){ x }) ); \
+    t_; \
+})
+#  ifdef __x86_64__
+#   define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastq %1, %0" : "=v" (t_) : "r" ((x) + 0ULL) ); \
+    t_; \
+})
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
@@ -977,10 +1001,14 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
-#if defined(broadcast)
+#ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#ifdef broadcast2
+    if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -454,9 +454,13 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
-    [0x5a] = { .simd_size = simd_128, .two_op = 1 },
-    [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
+    [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
+    [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
+    [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x78] = { .simd_size = simd_other, .two_op = 1 },
+    [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
+    [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -2629,6 +2633,11 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
+    case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
+    case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
+        break;
+
     case 0xf0: /* movbe / crc32 */
         state->desc |= repne_prefix() ? ByteOp : Mov;
         if ( rep_prefix() )
@@ -8209,6 +8218,8 @@ x86_emulate(
         goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
+        op_bytes = elem_bytes;
         generate_exception_if(evex.w || evex.br, EXC_UD);
     avx512_broadcast:
         /*
@@ -8228,17 +8239,27 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
                                             /* vbroadcastf64x4 m256,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
+                                            /* vbroadcasti64x4 m256,zmm{k} */
         generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
                                             /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(!evex.lr, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
+                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
+        if ( b == 0x59 )
+            op_bytes = 8;
+        generate_exception_if(evex.br, EXC_UD);
         if ( !evex.w )
             host_and_vcpu_must_have(avx512dq);
         goto avx512_broadcast;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
                                             /* vbroadcastf64x2 m128,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
         generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.br,
                               EXC_UD);
         if ( evex.w )
@@ -8432,6 +8453,45 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+        op_bytes = elem_bytes = 1 << (b & 1);
+        /* See the comment at the avx512_broadcast label. */
+        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
+        generate_exception_if((ea.type != OP_REG || evex.br ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0xf8;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "+m" (src.val) : "a" (src.val));
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v6 06/42] x86emul: basic AVX512VL testing
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (4 preceding siblings ...)
  2018-12-06  9:53   ` [PATCH v6 05/42] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
@ 2018-12-06  9:53   ` Jan Beulich
  2018-12-06  9:53   ` [PATCH v6 07/42] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
                     ` (35 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test the 128- and 256-bit variants of the insns which have been
implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Don't enable AVX512VL for scalar tests, nor for S/G ones with index
    wider than data. Re-base over changes earlier in the series.
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.
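
As background for the simd.h changes below: REN() makes the assembler
expand a legacy mnemonic into its EVEX-only counterpart. A
self-contained illustration of the idea, mirroring what
REN(movdqa, , 32) produces (the concrete mnemonic pair is just an
example):

/*
 * Illustrative sketch only: a gas macro shadowing the "vmovdqa"
 * mnemonic, so compiler-emitted uses of it get assembled as the
 * EVEX-encoded "vmovdqa32".
 */
asm ( ".macro vmovdqa o:vararg \n\t"
      "vmovdqa32 \\o           \n\t"
      ".endm" );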

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -63,7 +63,7 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
-avx512f-vecs := 64
+avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
 
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -5,13 +5,13 @@ ENTRY(fma_test);
 
 #if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
-#elif VEC_SIZE == 16
+#elif VEC_SIZE == 16 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
 #  define to_bool(cmp) __builtin_ia32_vtestcpd(cmp, (vec_t){} == 0)
 # endif
-#elif VEC_SIZE == 32
+#elif VEC_SIZE == 32 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps256(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -539,7 +539,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define rotr(x, n) ((vec_t)__builtin_ia32_palignr128((vdi_t)(x), (vdi_t)(x), (n) * 64))
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE4_1__)
+#if VEC_SIZE == 16 && defined(__SSE4_1__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)__builtin_ia32_pmaxsb128((vqi_t)(x), (vqi_t)(y)))
 #  define min(x, y) ((vec_t)__builtin_ia32_pminsb128((vqi_t)(x), (vqi_t)(y)))
@@ -593,7 +593,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define mix(x, y) __builtin_ia32_blendpd(x, y, 0b10)
 # endif
 #endif
-#if VEC_SIZE == 32 && defined(__AVX__)
+#if VEC_SIZE == 32 && defined(__AVX__) && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define dot_product(x, y) ({ \
     vec_t t_ = __builtin_ia32_dpps256(x, y, 0b11110001); \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -92,6 +92,15 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+# if VEC_SIZE > ELEM_SIZE && (defined(VEC_MAX) ? VEC_MAX : VEC_SIZE) < 64
+#  pragma GCC target ( "avx512vl" )
+# endif
+
+# define REN(insn, old, new)                     \
+    asm ( ".macro v" #insn #old " o:vararg \n\t" \
+          "v" #insn #new " \\o             \n\t" \
+          ".endm" )
+
 /*
  * The original plan was to effect use of EVEX encodings for scalar as well as
  * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
@@ -135,25 +144,88 @@ asm ( ".macro override insn    \n\t"
 # define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
 # define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
 
+OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
+OVR_INT(add);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
+OVR_FMA(fmaddsub, VFP);
 OVR_FMA(fmsub, FP);
+OVR_FMA(fmsubadd, VFP);
 OVR_FMA(fnmadd, FP);
 OVR_FMA(fnmsub, FP);
 OVR(insertps);
 OVR_FP(max);
+OVR_INT(maxs);
+OVR_INT(maxu);
 OVR_FP(min);
+OVR_INT(mins);
+OVR_INT(minu);
 OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
+OVR_VFP(mova);
+OVR_VFP(movnt);
+OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(shuf);
+OVR_INT(sll);
+OVR_DQ(sllv);
 OVR_FP(sqrt);
+OVR_INT(sra);
+OVR_DQ(srav);
+OVR_INT(srl);
+OVR_DQ(srlv);
 OVR_FP(sub);
+OVR_INT(sub);
 OVR_SFP(ucomi);
+OVR_VFP(unpckh);
+OVR_VFP(unpckl);
+
+# ifdef __AVX512VL__
+#  if ELEM_SIZE == 8 && defined(__AVX512DQ__)
+REN(extract, f128, f64x2);
+REN(extract, i128, i64x2);
+REN(insert, f128, f64x2);
+REN(insert, i128, i64x2);
+#  else
+REN(extract, f128, f32x4);
+REN(extract, i128, i32x4);
+REN(insert, f128, f32x4);
+REN(insert, i128, i32x4);
+#  endif
+#  if ELEM_SIZE == 8
+REN(movdqa, , 64);
+REN(movdqu, , 64);
+REN(pand, , q);
+REN(pandn, , q);
+REN(por, , q);
+REN(pxor, , q);
+#  else
+#   if ELEM_SIZE == 1 && defined(__AVX512BW__)
+REN(movdq, a, u8);
+REN(movdqu, , 8);
+#   elif ELEM_SIZE == 2 && defined(__AVX512BW__)
+REN(movdq, a, u16);
+REN(movdqu, , 16);
+#   else
+REN(movdqa, , 32);
+REN(movdqu, , 32);
+#   endif
+REN(pand, , d);
+REN(pandn, , d);
+REN(por, , d);
+REN(pxor, , d);
+#  endif
+OVR(movntdq);
+OVR(movntdqa);
+OVR(pmulld);
+OVR(pmuldq);
+OVR(pmuludq);
+# endif
 
 # undef OVR_VFP
 # undef OVR_SFP
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -88,6 +88,11 @@ static bool simd_check_avx512f(void)
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
 
+static bool simd_check_avx512f_vl(void)
+{
+    return cpu_has_avx512f && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512dq(void)
 {
     return cpu_has_avx512dq;
@@ -142,11 +147,21 @@ static const struct {
       .check_cpu = simd_check_ ## feat,                             \
       .set_regs = simd_set_regs,                                    \
       .check_regs = simd_check_regs }
+#define AVX512VL_(bits, desc, feat, form)                          \
+    { .code = feat ## _x86_ ## bits ## _D ## _ ## form,            \
+      .size = sizeof(feat ## _x86_ ## bits ## _D ## _ ## form),    \
+      .bitness = bits, .name = "AVX512" #desc,                     \
+      .check_cpu = simd_check_ ## feat ## _vl,                     \
+      .set_regs = simd_set_regs,                                   \
+      .check_regs = simd_check_regs }
 #ifdef __x86_64__
 # define SIMD(desc, feat, form) SIMD_(64, desc, feat, form), \
                                 SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(64, desc, feat, form), \
+                                    AVX512VL_(32, desc, feat, form)
 #else
 # define SIMD(desc, feat, form) SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(32, desc, feat, form)
 #endif
     SIMD(3DNow! single,          _3dnow,     8f4),
     SIMD(SSE scalar single,      sse,         f4),
@@ -257,6 +272,20 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    AVX512VL(VL f32x4,        avx512f,      16f4),
+    AVX512VL(VL f64x2,        avx512f,      16f8),
+    AVX512VL(VL f32x8,        avx512f,      32f4),
+    AVX512VL(VL f64x4,        avx512f,      32f8),
+    AVX512VL(VL s32x4,        avx512f,      16i4),
+    AVX512VL(VL u32x4,        avx512f,      16u4),
+    AVX512VL(VL s32x8,        avx512f,      32i4),
+    AVX512VL(VL u32x8,        avx512f,      32u4),
+    AVX512VL(VL s64x2,        avx512f,      16i8),
+    AVX512VL(VL u64x2,        avx512f,      16u8),
+    AVX512VL(VL s64x4,        avx512f,      32i8),
+    AVX512VL(VL u64x4,        avx512f,      32u8),
+#undef AVX512VL_
+#undef AVX512VL
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v6 07/42] x86emul: support AVX512{F, BW} zero- and sign-extending moves
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (5 preceding siblings ...)
  2018-12-06  9:53   ` [PATCH v6 06/42] x86emul: basic AVX512VL testing Jan Beulich
@ 2018-12-06  9:53   ` Jan Beulich
  2018-12-06  9:54   ` [PATCH v6 08/42] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
                     ` (34 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the testing in simd.c doesn't really follow the ISA extension
pattern - to fit the scheme, extensions from byte- and word-granular
vectors can (currently) sensibly only happen in the AVX512BW case (and
hence respective abstraction macros will be added there rather than
here).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
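
For reference, the per-element semantics of these moves are plain C
integer conversions; a minimal sketch for the byte-to-dword pair
(function names invented for illustration):

#include <stdint.h>

/* vpmovsxbd, per element: sign-extend byte to dword. */
static int32_t pmovsxbd_elem(int8_t v) { return v; }

/* vpmovzxbd, per element: zero-extend byte to dword. */
static uint32_t pmovzxbd_elem(uint8_t v) { return v; }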

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,6 +177,16 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
+    INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
+    INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
+    INSN(pmovzxwq,     66, 0f38, 34,    vl_4,    w, vl),
+    INSN(pmovzxdq,     66, 0f38, 35,    vl_2, d_nb, vl),
     INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
@@ -274,6 +284,8 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -443,13 +443,23 @@ static const struct ext0f38_table {
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
+    [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x23] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x24] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
-    [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
+    [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x32] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x33] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x34] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x35] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
@@ -8325,6 +8335,25 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x25): /* vpmovsxdq {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
+        elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -311,10 +311,12 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxdq, _mask, x, (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 4
 #  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -222,6 +222,16 @@ REN(pxor, , d);
 #  endif
 OVR(movntdq);
 OVR(movntdqa);
+OVR(pmovsxbd);
+OVR(pmovsxbq);
+OVR(pmovsxdq);
+OVR(pmovsxwd);
+OVR(pmovsxwq);
+OVR(pmovzxbd);
+OVR(pmovzxbq);
+OVR(pmovzxdq);
+OVR(pmovzxwd);
+OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);





* [PATCH v6 08/42] x86emul: support AVX512{F, BW} down conversion moves
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (6 preceding siblings ...)
  2018-12-06  9:53   ` [PATCH v6 07/42] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
@ 2018-12-06  9:54   ` Jan Beulich
  2018-12-06  9:54   ` [PATCH v6 09/42] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
                     ` (33 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:54 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the vpmov{,s,us}{d,q}w table entries in evex-disp8.c are
slightly different from what one would expect, due to them requiring
EVEX.W to be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Also adjust x86_insn_is_mem_write().
v4: Also #UD when evex.z is set with a memory operand.
v3: New.
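
The three flavors differ only in how each element gets narrowed; a
minimal scalar sketch for the dword-to-byte forms (function names
invented for illustration):

#include <stdint.h>

/* vpmovdb: plain truncation. */
static uint8_t pmovdb_elem(uint32_t v) { return (uint8_t)v; }

/* vpmovsdb: signed saturation. */
static int8_t pmovsdb_elem(int32_t v)
{
    return v < -128 ? -128 : v > 127 ? 127 : (int8_t)v;
}

/* vpmovusdb: unsigned saturation. */
static uint8_t pmovusdb_elem(uint32_t v)
{
    return v > 255 ? 255 : (uint8_t)v;
}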

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,11 +177,26 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovdb,       f3, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovdw,       f3, 0f38, 33,    vl_2,    b, vl),
+    INSN(pmovqb,       f3, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovqd,       f3, 0f38, 35,    vl_2, d_nb, vl),
+    INSN(pmovqw,       f3, 0f38, 34,    vl_4,    b, vl),
+    INSN(pmovsdb,      f3, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsdw,      f3, 0f38, 23,    vl_2,    b, vl),
+    INSN(pmovsqb,      f3, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsqd,      f3, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovsqw,      f3, 0f38, 24,    vl_4,    b, vl),
     INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
     INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
     INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
     INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
     INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovusdb,     f3, 0f38, 11,    vl_4,    b, vl),
+    INSN(pmovusdw,     f3, 0f38, 13,    vl_2,    b, vl),
+    INSN(pmovusqb,     f3, 0f38, 12,    vl_8,    b, vl),
+    INSN(pmovusqd,     f3, 0f38, 15,    vl_2, d_nb, vl),
+    INSN(pmovusqw,     f3, 0f38, 14,    vl_4,    b, vl),
     INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
     INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
     INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
@@ -284,7 +299,10 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+    INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -277,6 +277,17 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextracti{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextracti32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -291,6 +302,7 @@ static inline bool _to_bool(byte_vec_t b
 })
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -720,6 +732,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if VEC_SIZE >= 16
+
+# if !defined(low_half) && defined(HALF_SIZE)
+static inline half_t low_half(vec_t x)
+{
+#  if HALF_SIZE < VEC_SIZE
+    half_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 2; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1087,6 +1120,21 @@ int simd_test(void)
 
 #endif
 
+#if defined(widen1) && defined(shrink1)
+    {
+        half_t aux1 = low_half(src), aux2;
+
+        touch(aux1);
+        x = widen1(aux1);
+        touch(x);
+        aux2 = shrink1(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 2; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
 #ifdef dup_lo
     touch(src);
     x = dup_lo(src);
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,23 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE >= 16
+
+# if ELEM_COUNT >= 2
+#  if VEC_SIZE > 32
+#   define HALF_SIZE (VEC_SIZE / 2)
+#  else
+#   define HALF_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(HALF_SIZE))) half_t;
+typedef char __attribute__((vector_size(HALF_SIZE))) vqi_half_t;
+typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
+typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
+typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+# endif
+
+#endif
+
 #if VEC_SIZE == 16
 # define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
 # define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3065,7 +3065,22 @@ x86_decode(
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
             if ( evex_encoded() )
-                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+            {
+                /*
+                 * VPMOVUS* are identical to VPMOVS* Disp8-scaling-wise, but
+                 * their attributes don't match those of the vex_66 encoded
+                 * insns with the same base opcodes. Rather than adding new
+                 * columns to the table, handle this here for now.
+                 */
+                if ( evex.pfx != vex_f3 || (b & 0xf8) != 0x10 )
+                    disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+                else
+                {
+                    disp8scale = decode_disp8scale(ext0f38_table[b + 0x10].d8s,
+                                                   state);
+                    state->simd_size = simd_other;
+                }
+            }
             break;
 
         case ext_0f3a:
@@ -8335,10 +8350,14 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10): /* vpmovuswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20): /* vpmovswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30): /* vpmovwb [xyz]mm,{x,y}mm/mem{k} */
         host_and_vcpu_must_have(avx512bw);
-        /* fall through */
+        if ( evex.pfx != vex_f3 )
+        {
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
@@ -8349,7 +8368,28 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
-        generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+            generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        }
+        else
+        {
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x11): /* vpmovusdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x12): /* vpmovusqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x13): /* vpmovusdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x14): /* vpmovusqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* vpmovusqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x21): /* vpmovsdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x22): /* vpmovsqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x23): /* vpmovsdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x24): /* vpmovsqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* vpmovsqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x31): /* vpmovdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x32): /* vpmovqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x33): /* vpmovdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x34): /* vpmovqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* vpmovqd [xyz]mm,{x,y}mm/mem{k} */
+            generate_exception_if(evex.w || (ea.type == OP_MEM && evex.z), EXC_UD);
+            d = DstMem | SrcReg | TwoOp;
+        }
         op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
@@ -10182,6 +10222,12 @@ x86_insn_is_mem_write(const struct x86_e
     case X86EMUL_OPC(0x0f, 0xab):        /* BTS */
     case X86EMUL_OPC(0x0f, 0xb3):        /* BTR */
     case X86EMUL_OPC(0x0f, 0xbb):        /* BTC */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* VPMOVUS* */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* VPMOVS* */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* VPMOV{D,Q,W}* */
         return true;
 
     case 0xd9:





* [PATCH v6 09/42] x86emul: support AVX512{F, BW} integer unpack insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (7 preceding siblings ...)
  2018-12-06  9:54   ` [PATCH v6 08/42] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
@ 2018-12-06  9:54   ` Jan Beulich
  2018-12-06  9:55   ` [PATCH v6 10/42] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
                     ` (32 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:54 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

There's once again one extra twobyte_table[] entry which gets its Disp8
shift value set right away, without support for the insn itself being
implemented just yet, again to avoid needlessly splitting groups of
entries.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base over changes earlier in the series.
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.
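
As a reminder of the semantics involved: the unpack insns act on each
128-bit lane independently; a minimal scalar sketch of a single
vpunpckldq lane (function name invented for illustration):

#include <stdint.h>

/*
 * One 128-bit lane of vpunpckldq: interleave the low two dwords of both
 * sources; vpunpckhdq does the same with the high two dwords.
 */
static void punpckldq_lane(uint32_t dst[4],
                           const uint32_t a[4], const uint32_t b[4])
{
    uint32_t r0 = a[0], r1 = b[0], r2 = a[1], r3 = b[1];

    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}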

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -229,6 +229,10 @@ static const struct test avx512f_all[] =
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
+    INSN(punpckhdq,    66,   0f, 6a,    vl,      d, vl),
+    INSN(punpckhqdq,   66,   0f, 6d,    vl,      q, vl),
+    INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
+    INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -327,6 +331,10 @@ static const struct test avx512bw_all[]
     INSN(psubw,       66,   0f, f9,    vl,    w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
+    INSN(punpckhbw,   66,   0f, 68,    vl,    b, vl),
+    INSN(punpckhwd,   66,   0f, 69,    vl,    w, vl),
+    INSN(punpcklbw,   66,   0f, 60,    vl,    b, vl),
+    INSN(punpcklwd,   66,   0f, 61,    vl,    w, vl),
 };
 
 static const struct test avx512bw_128[] = {
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -300,6 +300,10 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
@@ -317,6 +321,10 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -252,6 +252,10 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(punpckhdq);
+OVR(punpckhqdq);
+OVR(punpckldq);
+OVR(punpcklqdq);
 # endif
 
 # undef OVR_VFP
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -312,10 +312,10 @@ static const struct twobyte_table {
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
+    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
@@ -6659,6 +6659,12 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6687,6 +6693,13 @@ x86_emulate(
         elem_bytes = 1 << (b & 1);
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x62): /* vpunpckldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6a): /* vpunpckhdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        fault_suppression = false;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
         op_bytes = 16 << evex.lr;
@@ -6713,6 +6726,10 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */





* [PATCH v6 10/42] x86emul: support AVX512{F, BW, _VBMI} full permute insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (8 preceding siblings ...)
  2018-12-06  9:54   ` [PATCH v6 09/42] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
@ 2018-12-06  9:55   ` Jan Beulich
  2018-12-06  9:55   ` [PATCH v6 11/42] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
                     ` (31 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Take the liberty of also correcting the (public interface) name of the
AVX512_VBMI feature flag, on the assumption that no external consumer
has actually been using that flag so far. Furthermore make it have
AVX512BW instead of AVX512F as a prerequisite, since it requires full
64-bit mask registers (the upper 48 bits of which can't be accessed
other than through XSAVE/XRSTOR without AVX512BW support).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Re-base.
v3: New.
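
For reference, the full permutes index into the concatenation of two
source vectors; a minimal scalar sketch of vpermt2d, ignoring masking
(function name invented for illustration):

#include <stdint.h>

/*
 * vpermt2d with nelem elements per vector: the low bits of each index
 * select one of the 2 * nelem elements formed by concatenating the old
 * destination (first table) and src2 (second table).
 */
static void vpermt2d_model(uint32_t *dst, const uint32_t *idx,
                           const uint32_t *src2, unsigned int nelem)
{
    uint32_t tmp[16]; /* enough for up to 512-bit vectors */
    unsigned int i;

    for ( i = 0; i < nelem; ++i )
    {
        unsigned int j = idx[i] & (2 * nelem - 1);

        tmp[i] = j < nelem ? dst[j] : src2[j - nelem];
    }

    for ( i = 0; i < nelem; ++i )
        dst[i] = tmp[i];
}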

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -173,6 +173,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
+    INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -294,6 +298,8 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
+    INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -378,6 +384,11 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512_vbmi_all[] = {
+    INSN(permi2b,       66, 0f38, 75, vl, b, vl),
+    INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -718,4 +729,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -150,6 +150,9 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#  else
+#   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
+#   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -175,6 +178,9 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#  else
+#   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
+#   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -303,6 +309,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -324,6 +333,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
@@ -769,6 +781,7 @@ int simd_test(void)
 {
     unsigned int i, j;
     vec_t x, y, z, src, inv, alt, sh;
+    vint_t interleave_lo, interleave_hi;
 
     for ( i = 0, j = ELEM_SIZE << 3; i < ELEM_COUNT; ++i )
     {
@@ -782,6 +795,9 @@ int simd_test(void)
         if ( !(i & (i + 1)) )
             --j;
         sh[i] = j;
+
+        interleave_lo[i] = ((i & 1) * ELEM_COUNT) | (i >> 1);
+        interleave_hi[i] = interleave_lo[i] + (ELEM_COUNT / 2);
     }
 
     touch(src);
@@ -1075,7 +1091,7 @@ int simd_test(void)
     x = src * alt;
     y = interleave_lo(x, alt < 0);
     touch(x);
-    z = widen1(x);
+    z = widen1(low_half(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1107,7 +1123,7 @@ int simd_test(void)
 
 # ifdef widen1
     touch(src);
-    x = widen1(src);
+    x = widen1(low_half(src));
     touch(src);
     if ( !eq(x, y) ) return __LINE__;
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,16 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if ELEM_SIZE == 1
+typedef vqi_t vint_t;
+#elif ELEM_SIZE == 2
+typedef vhi_t vint_t;
+#elif ELEM_SIZE == 4
+typedef vsi_t vint_t;
+#elif ELEM_SIZE == 8
+typedef vdi_t vint_t;
+#endif
+
 #if VEC_SIZE >= 16
 
 # if ELEM_COUNT >= 2
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -136,6 +136,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
+#define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -468,9 +468,13 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
     [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
+    [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -1861,6 +1865,7 @@ static bool vcpu_has(
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
+#define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -6027,6 +6032,11 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x76): /* vpermi2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x77): /* vpermi2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7e): /* vpermt2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7f): /* vpermt2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8539,6 +8549,16 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512_vbmi);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -107,6 +107,9 @@
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
+/* CPUID level 0x00000007:0.ecx */
+#define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
+
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
 
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -224,7 +224,7 @@ XEN_CPUFEATURE(AVX512VL,      5*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.ecx, word 6 */
 XEN_CPUFEATURE(PREFETCHWT1,   6*32+ 0) /*A  PREFETCHWT1 instruction */
-XEN_CPUFEATURE(AVX512VBMI,    6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -259,12 +259,17 @@ def crunch_numbers(state):
         AVX2: [AVX512F],
 
         # AVX512F is taken to mean hardware support for 512bit registers
-        # (which in practice depends on the EVEX prefix to encode), and the
-        # instructions themselves. All further AVX512 features are built on
-        # top of AVX512F
+        # (which in practice depends on the EVEX prefix to encode) as well
+        # as mask registers, and the instructions themselves. All further
+        # AVX512 features are built on top of AVX512F
         AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
-                  AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
-                  AVX512_4FMAPS, AVX512_VPOPCNTDQ],
+                  AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
+                  AVX512_VPOPCNTDQ],
+
+        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # dependents of AVX512BW (as to requiring wider than 16-bit mask
+        # registers), despite the SDM not formally making this connection.
+        AVX512BW: [AVX512_VBMI],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 11/42] x86emul: support AVX512{F, BW} integer shuffle insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (9 preceding siblings ...)
  2018-12-06  9:55   ` [PATCH v6 10/42] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
@ 2018-12-06  9:55   ` Jan Beulich
  2018-12-06  9:55   ` [PATCH v6 12/42] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
                     ` (30 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also include shuff{32x4,64x2}, which are very similar to shufi{32x4,64x2}.
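
For reference, a scalar sketch (not part of the patch) of the 512-bit
form of vshufi32x4: each 128-bit result lane l is copied whole from
src1 (l < 2) or src2 (l >= 2), with the source lane selected by imm8
bits 2l+1:2l.

    static void shufi32x4_512(int dst[16], const int src1[16],
                              const int src2[16], unsigned int imm)
    {
        unsigned int l, j;

        for ( l = 0; l < 4; ++l )
        {
            unsigned int sel = (imm >> (2 * l)) & 3;

            for ( j = 0; j < 4; ++j )   /* four dwords per 128-bit lane */
                dst[l * 4 + j] = (l < 2 ? src1 : src2)[sel * 4 + j];
        }
    }

The harness's new swap() flavors build on this: a whole-lane reversal
via shuf_{f,i}{32x4,64x2} followed by an in-lane shufps/shufpd/pshufd
reverses the overall element order.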

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base over changes earlier in the series.
v5: Re-base over changes earlier in the series.
v4: Move OVR() addition into __AVX512VL__ conditional. Correct comments.
v3: New.
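
Note on the immediates used by the new swap() flavors: pshufd's imm8
encodes four 2-bit source indices, i.e. (per 128-bit lane)
dst[j] = src[(imm >> (2 * j)) & 3] for j = 0..3, so 0b00011011 reverses
the four dwords of a lane while 0b01001110 swaps its two qword halves.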

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -214,6 +214,7 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
     INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
@@ -264,6 +265,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
+    INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
+    INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
+    INSN(shufi64x2,      66, 0f3a, 43, vl,    q, vl),
 };
 
 static const struct test avx512f_512[] = {
@@ -318,6 +323,9 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSN(pshufb,      66, 0f38, 00,    vl,    b, vl),
+    INSN(pshufhw,     f3,   0f, 70,    vl,    w, vl),
+    INSN(pshuflw,     f2,   0f, 70,    vl,    w, vl),
     INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
     INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
     INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -153,6 +153,10 @@ static inline bool _to_bool(byte_vec_t b
 #  else
 #   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
+#   define swap(x) ({ \
+    vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
+})
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -181,6 +185,10 @@ static inline bool _to_bool(byte_vec_t b
 #  else
 #   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
+#   define swap(x) ({ \
+    vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
+})
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -309,9 +317,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
+                               VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
+                             0b00011011, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -333,9 +346,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b01001110, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
+                                      VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -119,6 +119,12 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+/* Sadly there are a few exceptions to the general naming rules. */
+# define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
+# define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
+# define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
+# define __builtin_ia32_shuf_i64x2_512_mask __builtin_ia32_shuf_i64x2_mask
+
 # if VEC_SIZE > ELEM_SIZE && (defined(VEC_MAX) ? VEC_MAX : VEC_SIZE) < 64
 #  pragma GCC target ( "avx512vl" )
 # endif
@@ -262,6 +268,7 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(pshufd);
 OVR(punpckhdq);
 OVR(punpckhqdq);
 OVR(punpckldq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -318,7 +318,7 @@ static const struct twobyte_table {
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
+    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other, d8s_vl },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
@@ -432,7 +432,8 @@ static const struct ext0f38_table {
     uint8_t vsib:1;
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
-    [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
@@ -543,6 +544,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
+    [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
@@ -552,6 +554,7 @@ static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
+    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6680,6 +6683,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6935,6 +6939,20 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x70): /* vpshufd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x70): /* vpshufhw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x70): /* vpshuflw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx == vex_66 )
+            generate_exception_if(evex.w, EXC_UD);
+        else
+        {
+            host_and_vcpu_must_have(avx512bw);
+            generate_exception_if(evex.br, EXC_UD);
+        }
+        d = (d & ~SrcMask) | SrcMem | TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_imm8_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x71):    /* Grp12 */
     case X86EMUL_OPC_VEX_66(0x0f, 0x71):
     CASE_SIMD_PACKED_INT(0x0f, 0x72):    /* Grp13 */
@@ -9122,7 +9140,13 @@ x86_emulate(
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x23): /* vshuff32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshuff64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x43): /* vshufi32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshufi64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 12/42] x86emul: support AVX512{BW, DQ} mask move insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (10 preceding siblings ...)
  2018-12-06  9:55   ` [PATCH v6 11/42] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
@ 2018-12-06  9:55   ` Jan Beulich
  2018-12-06  9:56   ` [PATCH v6 13/42] x86emul: basic AVX512BW testing Jan Beulich
                     ` (29 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Entries are added to the tables in evex-disp8.c despite these insns not
allowing for memory operands, so that the tables eventually give a
complete picture of the supported EVEX-encoded insns.
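
In outline (a scalar sketch with hypothetical names, not taken from the
patch), the two directions behave as follows, with n being the element
count:

    #include <stdint.h>

    /* vpmovm2{b,w,d,q}: each mask bit becomes an all-ones/all-zeroes element. */
    static void pmovm2(unsigned int n, int64_t *vec, uint64_t mask)
    {
        unsigned int i;

        for ( i = 0; i < n; ++i )
            vec[i] = (mask >> i) & 1 ? -1 : 0;
    }

    /* vpmov{b,w,d,q}2m: each element's most significant bit becomes a mask bit. */
    static uint64_t pmov2m(unsigned int n, const int64_t *vec)
    {
        uint64_t mask = 0;
        unsigned int i;

        for ( i = 0; i < n; ++i )
            mask |= (uint64_t)(vec[i] < 0) << i;

        return mask;
    }

The seemingly inverted suffix pairing in the opmask.S macros follows
from a SIZE-byte mask having SIZE * 8 bits: an 8-bit mask (SIZE == 1),
for example, pairs with the eight quadword elements of a zmm register,
hence kmovb round-trips through vpmovm2q / vpmovq2m.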

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -314,9 +314,12 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+//       pmovb2m,     f3, 0f38, 29,           b
+//       pmovm2,      f3, 0f38, 28,          bw
     INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+//       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
@@ -364,6 +367,9 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
+//       pmovd2m,        f3, 0f38, 39,        d
+//       pmovm2,         f3, 0f38, 38,       dq
+//       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
--- a/tools/tests/x86_emulator/opmask.S
+++ b/tools/tests/x86_emulator/opmask.S
@@ -12,17 +12,23 @@
 
 #if SIZE == 1
 # define _(x) x##b
+# define _v(x, t) _v_(x##q, t)
 #elif SIZE == 2
 # define _(x) x##w
+# define _v(x, t) _v_(x##d, t)
 # define WIDEN(x) x##bw
 #elif SIZE == 4
 # define _(x) x##d
+# define _v(x, t) _v_(x##w, t)
 # define WIDEN(x) x##wd
 #elif SIZE == 8
 # define _(x) x##q
+# define _v(x, t) _v_(x##b, t)
 # define WIDEN(x) x##dq
 #endif
 
+#define _v_(x, t) v##x##t
+
     .macro check res1:req, res2:req, line:req
     _(kmov)       %\res1, DATA(out)
 #if SIZE < 8 || !defined(__i386__)
@@ -131,6 +137,15 @@ _start:
 
 #endif
 
+#if SIZE > 2 ? defined(__AVX512BW__) : defined(__AVX512DQ__)
+
+    _(kmov)       DATA(in1), %k0
+    _v(pmovm2,)   %k0, %zmm7
+    _v(pmov,2m)   %zmm7, %k3
+    check         k0, k3, __LINE__
+
+#endif
+
     xor           %eax, %eax
     ret
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -8439,6 +8439,21 @@ x86_emulate(
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x29): /* vpmov{b,w}2m [xyz]mm,k */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x39): /* vpmov{d,q}2m [xyz]mm,k */
+        generate_exception_if(!evex.r || !evex.R, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x28): /* vpmovm2{b,w} k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x38): /* vpmovm2{d,q} k,[xyz]mm */
+        if ( b & 0x10 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.opmsk || ea.type != OP_REG, EXC_UD);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);






^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 13/42] x86emul: basic AVX512BW testing
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (11 preceding siblings ...)
  2018-12-06  9:55   ` [PATCH v6 12/42] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
@ 2018-12-06  9:56   ` Jan Beulich
  2018-12-06  9:57   ` [PATCH v6 14/42] x86emul: basic AVX512DQ testing Jan Beulich
                     ` (28 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have been implemented already.
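
The new widen/shrink checks verify a truncate-after-extend round trip;
in outline (using the harness's macro names):

    quarter_t aux1 = low_quarter(src), aux2;

    x = widen2(aux1);    /* e.g. vpmovzxbd: 1- to 4-byte elements */
    aux2 = shrink2(x);   /* e.g. vpmovdb: truncate back to 1-byte ones */
    /* aux2[i] == src[i] then needs to hold for all i < ELEM_COUNT / 4 */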

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base over changes earlier in the series.
v4: Add __AVX512VL__ conditional around majority of OVR() additions.
    Correct eq() for 1- and 2-byte cases.
v3: New.
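
Note: the %{evex%} in the new broadcast() constructs emits gas's {evex}
pseudo-prefix (the % characters merely escape the braces in gcc's
extended asm), forcing the EVEX encoding where the assembler could
otherwise legitimately pick the VEX (AVX2) one for 128-/256-bit
operands:

    asm ( "%{evex%} vpbroadcastb %1, %0"
          : "=v" (t_) : "m" (*(char[1]){ x }) );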

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -66,6 +66,9 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512bw-vecs := $(avx512f-vecs)
+avx512bw-ints := 1 2
+avx512bw-flts :=
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -31,6 +31,10 @@ ENTRY(simd_test);
 #  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
 # elif FLOAT_SIZE == 8
 #  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif (INT_SIZE == 1 || UINT_SIZE == 1) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# elif (INT_SIZE == 2 || UINT_SIZE == 2) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqw, _mask, (vhi_t)(x), (vhi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 4 || UINT_SIZE == 4
 #  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 8 || UINT_SIZE == 8
@@ -374,6 +378,87 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # endif
+#elif (INT_SIZE == 1 || UINT_SIZE == 1 || INT_SIZE == 2 || UINT_SIZE == 2) && \
+      defined(__AVX512BW__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 1 || UINT_SIZE == 1
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastb %1, %0" \
+          : "=v" (t_) : "m" (*(char[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastb %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  elif defined(__AVX512VBMI__)
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
+#  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+# elif INT_SIZE == 2 || UINT_SIZE == 2
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastw %1, %0" \
+          : "=v" (t_) : "m" (*(short[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastw %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(pshufhw, _mask, \
+                                      B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
+                                      0b00011011, (vhi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+# endif
+# if INT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovsxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 2
+#  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
+#  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
+#  define widen1(x) ((vec_t)B(pmovsxwd, _mask, x, (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxwq, _mask, x, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 2
+#  define max(x, y) ((vec_t)B(pmaxuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define mul_hi(x, y) ((vec_t)B(pmulhuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxwd, _mask, (vhi_half_t)(x), (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxwq, _mask, (vhi_quarter_t)(x), (vdi_t)undef(), ~0))
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
@@ -565,7 +650,7 @@ static inline bool _to_bool(byte_vec_t b
 #  endif
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSSE3__)
+#if VEC_SIZE == 16 && defined(__SSSE3__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define abs(x) ((vec_t)__builtin_ia32_pabsb128((vqi_t)(x)))
 # elif INT_SIZE == 2
@@ -789,6 +874,40 @@ static inline half_t low_half(vec_t x)
 }
 # endif
 
+# if !defined(low_quarter) && defined(QUARTER_SIZE)
+static inline quarter_t low_quarter(vec_t x)
+{
+#  if QUARTER_SIZE < VEC_SIZE
+    quarter_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+# if !defined(low_eighth) && defined(EIGHTH_SIZE)
+static inline eighth_t low_eighth(vec_t x)
+{
+#  if EIGHTH_SIZE < VEC_SIZE
+    eighth_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
 #endif
 
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
@@ -1117,7 +1236,7 @@ int simd_test(void)
     y = interleave_lo(alt < 0, alt < 0);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen2(x);
+    z = widen2(low_quarter(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1126,7 +1245,7 @@ int simd_test(void)
     y = interleave_lo(y, y);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen3(x);
+    z = widen3(low_eighth(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 #  endif
@@ -1148,14 +1267,14 @@ int simd_test(void)
 
 # ifdef widen2
     touch(src);
-    x = widen2(src);
+    x = widen2(low_quarter(src));
     touch(src);
     if ( !eq(x, z) ) return __LINE__;
 # endif
 
 # ifdef widen3
     touch(src);
-    x = widen3(src);
+    x = widen3(low_eighth(src));
     touch(src);
     if ( !eq(x, interleave_lo(z, (vec_t){})) ) return __LINE__;
 # endif
@@ -1175,6 +1294,36 @@ int simd_test(void)
             if ( aux2[i] != src[i] )
                 return __LINE__;
     }
+#endif
+
+#if defined(widen2) && defined(shrink2)
+    {
+        quarter_t aux1 = low_quarter(src), aux2;
+
+        touch(aux1);
+        x = widen2(aux1);
+        touch(x);
+        aux2 = shrink2(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 4; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
+#if defined(widen3) && defined(shrink3)
+    {
+        eighth_t aux1 = low_eighth(src), aux2;
+
+        touch(aux1);
+        x = widen3(aux1);
+        touch(x);
+        aux2 = shrink3(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 8; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
 #endif
 
 #ifdef dup_lo
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -95,6 +95,32 @@ typedef int __attribute__((vector_size(H
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
 # endif
 
+# if ELEM_COUNT >= 4
+#  if VEC_SIZE > 64
+#   define QUARTER_SIZE (VEC_SIZE / 4)
+#  else
+#   define QUARTER_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(QUARTER_SIZE))) quarter_t;
+typedef char __attribute__((vector_size(QUARTER_SIZE))) vqi_quarter_t;
+typedef short __attribute__((vector_size(QUARTER_SIZE))) vhi_quarter_t;
+typedef int __attribute__((vector_size(QUARTER_SIZE))) vsi_quarter_t;
+typedef long long __attribute__((vector_size(QUARTER_SIZE))) vdi_quarter_t;
+# endif
+
+# if ELEM_COUNT >= 8
+#  if VEC_SIZE > 128
+#   define EIGHTH_SIZE (VEC_SIZE / 8)
+#  else
+#   define EIGHTH_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(EIGHTH_SIZE))) eighth_t;
+typedef char __attribute__((vector_size(EIGHTH_SIZE))) vqi_eighth_t;
+typedef short __attribute__((vector_size(EIGHTH_SIZE))) vhi_eighth_t;
+typedef int __attribute__((vector_size(EIGHTH_SIZE))) vsi_eighth_t;
+typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
+# endif
+
 #endif
 
 #if VEC_SIZE == 16
@@ -182,6 +208,9 @@ OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
 OVR_INT(add);
+OVR_BW(adds);
+OVR_BW(addus);
+OVR_BW(avg);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
@@ -214,6 +243,8 @@ OVR_INT(srl);
 OVR_DQ(srlv);
 OVR_FP(sub);
 OVR_INT(sub);
+OVR_BW(subs);
+OVR_BW(subus);
 OVR_SFP(ucomi);
 OVR_VFP(unpckh);
 OVR_VFP(unpckl);
@@ -275,6 +306,31 @@ OVR(punpckldq);
 OVR(punpcklqdq);
 # endif
 
+# ifdef __AVX512BW__
+OVR(pextrb);
+OVR(pextrw);
+OVR(pinsrb);
+OVR(pinsrw);
+#  ifdef __AVX512VL__
+OVR(pmaddwd);
+OVR(pmovsxbw);
+OVR(pmovzxbw);
+OVR(pmulhuw);
+OVR(pmulhw);
+OVR(pmullw);
+OVR(psadbw);
+OVR(pshufb);
+OVR(pshufhw);
+OVR(pshuflw);
+OVR(punpckhbw);
+OVR(punpckhwd);
+OVR(punpcklbw);
+OVR(punpcklwd);
+OVR(slldq);
+OVR(srldq);
+#  endif
+# endif
+
 # undef OVR_VFP
 # undef OVR_SFP
 # undef OVR_INT
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512bw.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -105,6 +106,11 @@ static bool simd_check_avx512bw(void)
 }
 #define simd_check_avx512bw_opmask simd_check_avx512bw
 
+static bool simd_check_avx512bw_vl(void)
+{
+    return cpu_has_avx512bw && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -284,6 +290,18 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512BW s8x64,     avx512bw,      64i1),
+    SIMD(AVX512BW u8x64,     avx512bw,      64u1),
+    SIMD(AVX512BW s16x32,    avx512bw,      64i2),
+    SIMD(AVX512BW u16x32,    avx512bw,      64u2),
+    AVX512VL(BW+VL s8x16,    avx512bw,      16i1),
+    AVX512VL(BW+VL u8x16,    avx512bw,      16u1),
+    AVX512VL(BW+VL s8x32,    avx512bw,      32i1),
+    AVX512VL(BW+VL u8x32,    avx512bw,      32u1),
+    AVX512VL(BW+VL s16x8,    avx512bw,      16i2),
+    AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
+    AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
+    AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 14/42] x86emul: basic AVX512DQ testing
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (12 preceding siblings ...)
  2018-12-06  9:56   ` [PATCH v6 13/42] x86emul: basic AVX512BW testing Jan Beulich
@ 2018-12-06  9:57   ` Jan Beulich
  2018-12-06  9:57   ` [PATCH v6 15/42] x86emul: support AVX512F move high/low insns Jan Beulich
                     ` (27 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:57 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have been implemented already.
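
The broadcast/insert checks added here rely on a simple identity
(sketched using the harness's macro names): broadcasting the low half
of a vector has to match inserting that same low half into the upper
position:

    half_t aux = low_half(src);

    x = broadcast_half(aux);        /* e.g. vbroadcastf32x8 */
    y = insert_half(src, aux, 1);   /* e.g. vinsertf32x8 */
    /* x and y then need to compare equal */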

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base.
v5: Re-base over changes earlier in the series.
v4: Wrap OVR(pmullq) in __AVX512VL__ conditional.
v3: New.
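
The #define aliases added at the top of the simd.h hunk are needed
because gcc's builtin names aren't fully regular: presumably the
harness's B() macro (introduced earlier in the series) pastes together
names of the shape __builtin_ia32_<stem><vector-size><suffix>, so
builtins whose real names lack the size infix (or the _mask suffix)
have to be mapped, e.g.

    #define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask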

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -69,9 +69,12 @@ avx512f-flts := 4 8
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
+avx512dq-vecs := $(avx512f-vecs)
+avx512dq-ints := $(avx512f-ints)
+avx512dq-flts := $(avx512f-flts)
 
 avx512f-opmask-vecs := 2
-avx512dq-opmask-vecs := 1
+avx512dq-opmask-vecs := 1 2
 avx512bw-opmask-vecs := 4 8
 
 # Suppress building by default of the harness if the compiler can't deal
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -121,6 +121,34 @@ typedef int __attribute__((vector_size(E
 typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
 # endif
 
+# define DECL_PAIR(w) \
+typedef w ## _t pair_t; \
+typedef vsi_ ## w ## _t vsi_pair_t; \
+typedef vdi_ ## w ## _t vdi_pair_t
+# define DECL_QUARTET(w) \
+typedef w ## _t quartet_t; \
+typedef vsi_ ## w ## _t vsi_quartet_t; \
+typedef vdi_ ## w ## _t vdi_quartet_t
+# define DECL_OCTET(w) \
+typedef w ## _t octet_t; \
+typedef vsi_ ## w ## _t vsi_octet_t; \
+typedef vdi_ ## w ## _t vdi_octet_t
+
+# if ELEM_COUNT == 4
+DECL_PAIR(half);
+# elif ELEM_COUNT == 8
+DECL_PAIR(quarter);
+DECL_QUARTET(half);
+# elif ELEM_COUNT == 16
+DECL_PAIR(eighth);
+DECL_QUARTET(quarter);
+DECL_OCTET(half);
+# endif
+
+# undef DECL_OCTET
+# undef DECL_QUARTET
+# undef DECL_PAIR
+
 #endif
 
 #if VEC_SIZE == 16
@@ -146,6 +174,14 @@ typedef long long __attribute__((vector_
 #ifdef __AVX512F__
 
 /* Sadly there are a few exceptions to the general naming rules. */
+# define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
+# define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+# define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
+# define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
+# define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
+# define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
+# define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
+# define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -331,6 +367,20 @@ OVR(srldq);
 #  endif
 # endif
 
+# ifdef __AVX512DQ__
+OVR_VFP(and);
+OVR_VFP(andn);
+OVR_VFP(or);
+OVR(pextrd);
+OVR(pextrq);
+OVR(pinsrd);
+OVR(pinsrq);
+#  ifdef __AVX512VL__
+OVR(pmullq);
+#  endif
+OVR_VFP(xor);
+# endif
+
 # undef OVR_VFP
 # undef OVR_SFP
 # undef OVR_INT
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -139,6 +139,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
+     (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if FLOAT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -146,6 +167,17 @@ static inline bool _to_bool(byte_vec_t b
           : "=v" (t_) : "m" (*(float[1]){ x }) ); \
     t_; \
 })
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcastf32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
+#   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
+#  endif
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
@@ -155,6 +187,13 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
 #  else
+#   define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
+#   define insert_pair(x, y, p) \
+    B(insertf32x4_, _mask, x, \
+      /* Cast needed below to work around gcc 7.x quirk. */ \
+      (p) & 1 ? (typeof(y))__builtin_ia32_shufps(y, y, 0b01000100) : (y), \
+      (p) >> 1, x, 3 << ((p) * 2))
+#   define insert_quartet(x, y, p) B(insertf32x4_, _mask, x, y, p, undef(), ~0)
 #   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #   define swap(x) ({ \
@@ -178,6 +217,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) B(broadcastf64x2_, _mask, x, undef(), ~0)
+#   define insert_pair(x, y, p) B(insertf64x2_, _mask, x, y, p, undef(), ~0)
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
+#   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
+#  endif
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
@@ -306,6 +353,16 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 # endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextracti32x4 */ || \
+       (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -318,11 +375,30 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  ifdef __AVX512DQ__
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcasti32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) ((vec_t)B(broadcasti32x8_, _mask, (vsi_octet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_octet(x, y, p) ((vec_t)B(inserti32x8_, _mask, (vsi_t)(x), (vsi_octet_t)(y), p, (vsi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti32x4_, _mask, (vsi_quartet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_pair(x, y, p) \
+    (vec_t)(B(inserti32x4_, _mask, (vsi_t)(x), \
+              /* First cast needed below to work around gcc 7.x quirk. */ \
+              (p) & 1 ? (vsi_pair_t)__builtin_ia32_pshufd((vsi_pair_t)(y), 0b01000100) \
+                      : (vsi_pair_t)(y), \
+              (p) >> 1, (vsi_t)(x), 3 << ((p) * 2)))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti32x4_, _mask, (vsi_t)(x), (vsi_quartet_t)(y), p, (vsi_t)undef(), ~0))
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
@@ -347,6 +423,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ((vec_t)B(broadcasti64x2_, _mask, (vdi_pair_t)(x), (vdi_t)undef(), ~0))
+#   define insert_pair(x, y, p) ((vec_t)B(inserti64x2_, _mask, (vdi_t)(x), (vdi_pair_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti64x4_, , (vdi_quartet_t)(x), (vdi_t)undef(), ~0))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti64x4_, _mask, (vdi_t)(x), (vdi_quartet_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
@@ -898,7 +982,7 @@ static inline eighth_t low_eighth(vec_t
     eighth_t y;
     unsigned int i;
 
-    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+    for ( i = 0; i < ELEM_COUNT / 8; ++i )
         y[i] = x[i];
 
     return y;
@@ -910,6 +994,50 @@ static inline eighth_t low_eighth(vec_t
 
 #endif
 
+#ifdef broadcast_pair
+# if ELEM_COUNT == 4
+#  define broadcast_half broadcast_pair
+# elif ELEM_COUNT == 8
+#  define broadcast_quarter broadcast_pair
+# elif ELEM_COUNT == 16
+#  define broadcast_eighth broadcast_pair
+# endif
+#endif
+
+#ifdef insert_pair
+# if ELEM_COUNT == 4
+#  define insert_half insert_pair
+# elif ELEM_COUNT == 8
+#  define insert_quarter insert_pair
+# elif ELEM_COUNT == 16
+#  define insert_eighth insert_pair
+# endif
+#endif
+
+#ifdef broadcast_quartet
+# if ELEM_COUNT == 8
+#  define broadcast_half broadcast_quartet
+# elif ELEM_COUNT == 16
+#  define broadcast_quarter broadcast_quartet
+# endif
+#endif
+
+#ifdef insert_quartet
+# if ELEM_COUNT == 8
+#  define insert_half insert_quartet
+# elif ELEM_COUNT == 16
+#  define insert_quarter insert_quartet
+# endif
+#endif
+
+#if defined(broadcast_octet) && ELEM_COUNT == 16
+# define broadcast_half broadcast_octet
+#endif
+
+#if defined(insert_octet) && ELEM_COUNT == 16
+# define insert_half insert_octet
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1205,6 +1333,60 @@ int simd_test(void)
     if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#if defined(broadcast_half) && defined(insert_half)
+    {
+        half_t aux = low_half(src);
+
+        touch(aux);
+        x = broadcast_half(aux);
+        touch(aux);
+        y = insert_half(src, aux, 1);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_quarter) && defined(insert_quarter)
+    {
+        quarter_t aux = low_quarter(src);
+
+        touch(aux);
+        x = broadcast_quarter(aux);
+        touch(aux);
+        y = insert_quarter(src, aux, 1);
+        touch(aux);
+        y = insert_quarter(y, aux, 2);
+        touch(aux);
+        y = insert_quarter(y, aux, 3);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_eighth) && defined(insert_eighth) && \
+    /* At least gcc 7.3 "optimizes" away all insert_eighth() calls below. */ \
+    __GNUC__ >= 8
+    {
+        eighth_t aux = low_eighth(src);
+
+        touch(aux);
+        x = broadcast_eighth(aux);
+        touch(aux);
+        y = insert_eighth(src, aux, 1);
+        touch(aux);
+        y = insert_eighth(y, aux, 2);
+        touch(aux);
+        y = insert_eighth(y, aux, 3);
+        touch(aux);
+        y = insert_eighth(y, aux, 4);
+        touch(aux);
+        y = insert_eighth(y, aux, 5);
+        touch(aux);
+        y = insert_eighth(y, aux, 6);
+        touch(aux);
+        y = insert_eighth(y, aux, 7);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -23,6 +23,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
 #include "avx512bw.h"
+#include "avx512dq.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -100,6 +101,11 @@ static bool simd_check_avx512dq(void)
 }
 #define simd_check_avx512dq_opmask simd_check_avx512dq
 
+static bool simd_check_avx512dq_vl(void)
+{
+    return cpu_has_avx512dq && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -267,9 +273,10 @@ static const struct {
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
     SIMD(OPMASK/w,     avx512f_opmask,         2),
-    SIMD(OPMASK/b,    avx512dq_opmask,         1),
-    SIMD(OPMASK/d,    avx512bw_opmask,         4),
-    SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(OPMASK+DQ/b, avx512dq_opmask,         1),
+    SIMD(OPMASK+DQ/w, avx512dq_opmask,         2),
+    SIMD(OPMASK+BW/d, avx512bw_opmask,         4),
+    SIMD(OPMASK+BW/q, avx512bw_opmask,         8),
     SIMD(AVX512F f32 scalar,  avx512f,        f4),
     SIMD(AVX512F f32x16,      avx512f,      64f4),
     SIMD(AVX512F f64 scalar,  avx512f,        f8),
@@ -302,6 +309,24 @@ static const struct {
     AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
     AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
     AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
+    SIMD(AVX512DQ f32x16,    avx512dq,      64f4),
+    SIMD(AVX512DQ f64x8,     avx512dq,      64f8),
+    SIMD(AVX512DQ s32x16,    avx512dq,      64i4),
+    SIMD(AVX512DQ u32x16,    avx512dq,      64u4),
+    SIMD(AVX512DQ s64x8,     avx512dq,      64i8),
+    SIMD(AVX512DQ u64x8,     avx512dq,      64u8),
+    AVX512VL(DQ+VL f32x4,    avx512dq,      16f4),
+    AVX512VL(DQ+VL f64x2,    avx512dq,      16f8),
+    AVX512VL(DQ+VL f32x8,    avx512dq,      32f4),
+    AVX512VL(DQ+VL f64x4,    avx512dq,      32f8),
+    AVX512VL(DQ+VL s32x4,    avx512dq,      16i4),
+    AVX512VL(DQ+VL u32x4,    avx512dq,      16u4),
+    AVX512VL(DQ+VL s32x8,    avx512dq,      32i4),
+    AVX512VL(DQ+VL u32x8,    avx512dq,      32u4),
+    AVX512VL(DQ+VL s64x2,    avx512dq,      16i8),
+    AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
+    AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
+    AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 15/42] x86emul: support AVX512F move high/low insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (13 preceding siblings ...)
  2018-12-06  9:57   ` [PATCH v6 14/42] x86emul: basic AVX512DQ testing Jan Beulich
@ 2018-12-06  9:57   ` Jan Beulich
  2018-12-06  9:58   ` [PATCH v6 16/42] x86emul: support AVX512F move duplicate insns Jan Beulich
                     ` (26 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:57 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

No explicit test harness additions are needed other than the overrides,
as the compiler already makes use of the insns.
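
For reference (standard SDM semantics, not taken from the patch), with
4 x float element numbering:

    /* vmovlps m64,xmm,xmm : dst[0..1] = mem[0..1]; dst[2..3] = src[2..3] */
    /* vmovhps m64,xmm,xmm : dst[0..1] = src[0..1]; dst[2..3] = mem[0..1] */
    /* vmovlps xmm,m64     : mem[0..1] = src[0..1]                        */
    /* vmovhps xmm,m64     : mem[0..1] = src[2..3]                        */
    /* vmovhlps / vmovlhps : like the loads, but sourcing the high / low  */
    /* half of the second register operand instead of reading memory.     */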

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
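
(The twobyte_table adjustment sets .d8s = 3 for these opcodes, which,
going by how the field is used elsewhere, scales EVEX compressed
displacements by 2^3 = 8 bytes, matching the insns' fixed 64-bit memory
operands.)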

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -253,6 +253,16 @@ static const struct test avx512f_128[] =
     INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
+//       movhlps,     ,   0f, 12,        d
+    INSN(movhpd,    66,   0f, 16, el,    q, vl),
+    INSN(movhpd,    66,   0f, 17, el,    q, vl),
+    INSN(movhps,      ,   0f, 16, el_2,  d, vl),
+    INSN(movhps,      ,   0f, 17, el_2,  d, vl),
+//       movlhps,     ,   0f, 16,        d
+    INSN(movlpd,    66,   0f, 12, el,    q, vl),
+    INSN(movlpd,    66,   0f, 13, el,    q, vl),
+    INSN(movlps,      ,   0f, 12, el_2,  d, vl),
+    INSN(movlps,      ,   0f, 13, el_2,  d, vl),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
     INSN(movq,      66,   0f, d6, el,    q, el),
 };
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -266,6 +266,12 @@ OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
 OVR_VFP(mova);
+OVR(movhlps);
+OVR(movhpd);
+OVR(movhps);
+OVR(movlhps);
+OVR(movlpd);
+OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -286,11 +286,11 @@ static const struct twobyte_table {
     [0x0f] = { ModRM|SrcImmByte },
     [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
-    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
     [0x20 ... 0x21] = { DstMem|SrcImplicit|ModRM },
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
@@ -6016,6 +6016,26 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_fp;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x12):   /* vmovlpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x13): /* vmovlp{s,d} xmm,m64 */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x16):   /* vmovhpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x17): /* vmovhp{s,d} xmm,m64 */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x12):      /* vmovlps m64,xmm,xmm */
+                                            /* vmovhlps xmm,xmm,xmm */
+    case X86EMUL_OPC_EVEX(0x0f, 0x16):      /* vmovhps m64,xmm,xmm */
+                                            /* vmovlhps xmm,xmm,xmm */
+        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( (d & DstMask) != DstMem )
+            d &= ~TwoOp;
+        op_bytes = 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     case X86EMUL_OPC_F3(0x0f, 0x12):       /* movsldup xmm/m128,xmm */
     case X86EMUL_OPC_VEX_F3(0x0f, 0x12):   /* vmovsldup {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F2(0x0f, 0x12):       /* movddup xmm/m64,xmm */

* [PATCH v6 16/42] x86emul: support AVX512F move duplicate insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (14 preceding siblings ...)
  2018-12-06  9:57   ` [PATCH v6 15/42] x86emul: support AVX512F move high/low insns Jan Beulich
@ 2018-12-06  9:58   ` Jan Beulich
  2018-12-06  9:59   ` [PATCH v6 17/42] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
                     ` (25 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:58 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Judging from their insn prefixes, these are scalar insns, but their
(memory) operands are vector ones (with the exception of 128-bit
VMOVDDUP). For this, some adjustments to the disp8scale calculation
code are needed.
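
To put numbers on this (an illustrative helper mirroring the decode
change below, not harness or emulator code): the compressed Disp8 is
scaled by the width of the memory access, which for VMOVDDUP is a full
vector except at 128-bit length, where only 64 bits are read:

    /* Sketch: Disp8 scale (log2 of the factor) for VMOVDDUP. */
    static unsigned int movddup_disp8scale(unsigned int lr /* 0/1/2 */)
    {
        return lr ? 4 + lr : 3; /* 8-, 32-, or 64-byte memory operand */
    }

VMOVSLDUP / VMOVSHDUP, by contrast, always read a full vector, i.e.
their scale is uniformly 4 + lr.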

No explicit test harness additions other than the overrides, as the
compiler already makes use of the insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Fix Disp8 test for VMOVDDUP when AVX512VL is unavailable.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -146,6 +146,7 @@ static const struct test avx512f_all[] =
     INSN_SFP(mov,            0f, 11),
     INSN_PFP_NB(mova,        0f, 28),
     INSN_PFP_NB(mova,        0f, 29),
+    INSN(movddup,      f2,   0f, 12,    vl,   q_nb, vl),
     INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
     INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
     INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
@@ -157,6 +158,8 @@ static const struct test avx512f_all[] =
     INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
     INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
     INSN_PFP_NB(movnt,       0f, 2b),
+    INSN(movshdup,     f3,   0f, 16,    vl,   d_nb, vl),
+    INSN(movsldup,     f3,   0f, 12,    vl,   d_nb, vl),
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
@@ -694,6 +697,19 @@ static void test_group(const struct test
 
             switch ( tests[i].esz )
             {
+            case ESZ_q_nb:
+                /* The 128-bit form of VMOVDDUP needs special casing. */
+                if ( vl[j] == VL_128 && tests[i].spc == SPC_0f &&
+                     tests[i].opc == 0x12 && tests[i].pfx == PFX_f2 )
+                {
+                    struct test test = tests[i];
+
+                    test.vsz = VSZ_el;
+                    test.scale = SC_el;
+                    test_one(&test, vl[j], instr, ctxt);
+                    continue;
+                }
+                /* fall through */
             default:
                 test_one(&tests[i], vl[j], instr, ctxt);
                 break;
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -326,8 +326,11 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
+OVR(movshdup);
+OVR(movsldup);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3045,6 +3045,15 @@ x86_decode(
 
             switch ( b )
             {
+            case 0x12: /* vmovsldup / vmovddup */
+                if ( evex.pfx == vex_f2 )
+                    disp8scale = evex.lr ? 4 + evex.lr : 3;
+                /* fall through */
+            case 0x16: /* vmovshdup */
+                if ( evex.pfx == vex_f3 )
+                    disp8scale = 4 + evex.lr;
+                break;
+
             case 0x20: /* mov cr,reg */
             case 0x21: /* mov dr,reg */
             case 0x22: /* mov reg,cr */
@@ -6051,6 +6060,20 @@ x86_emulate(
         host_and_vcpu_must_have(sse3);
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x12):   /* vmovsldup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x12):   /* vmovddup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x16):   /* vmovshdup [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if((evex.br ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = !(evex.pfx & VEX_PREFIX_DOUBLE_MASK) || evex.lr
+                   ? 16 << evex.lr : 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x14): /* vunpcklp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),

* [PATCH v6 17/42] x86emul: support AVX512{F, BW, _VBMI} permute insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (15 preceding siblings ...)
  2018-12-06  9:58   ` [PATCH v6 16/42] x86emul: support AVX512F move duplicate insns Jan Beulich
@ 2018-12-06  9:59   ` Jan Beulich
  2018-12-06  9:59   ` [PATCH v6 18/42] x86emul: support AVX512BW pack insns Jan Beulich
                     ` (24 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:59 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -178,6 +178,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
+    INSN(permilpd,     66, 0f3a, 05,    vl,      q, vl),
+    INSN(permilps,     66, 0f38, 0c,    vl,      d, vl),
+    INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
@@ -278,6 +282,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(perm,           66, 0f38, 36, vl,   dq, vl),
+    INSN(perm,           66, 0f38, 16, vl,   sd, vl),
+    INSN(permpd,         66, 0f3a, 01, vl,    q, vl),
+    INSN(permq,          66, 0f3a, 00, vl,    q, vl),
     INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
     INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
     INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
@@ -316,6 +324,7 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
@@ -412,6 +421,7 @@ static const struct test avx512dq_512[]
 };
 
 static const struct test avx512_vbmi_all[] = {
+    INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -186,6 +186,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#   define swap2(x) B_(vpermilps, _mask, x, 0b00011011, undef(), ~0)
 #  else
 #   define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
 #   define insert_pair(x, y, p) \
@@ -200,6 +201,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
 })
+#   define swap2(x) B(vpermilps, _mask, \
+                       B(shuf_f32x4_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b00011011, undef(), ~0)
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -233,6 +238,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#   define swap2(x) B_(vpermilpd, _mask, x, 0b01, undef(), ~0)
 #  else
 #   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
@@ -240,6 +246,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
 })
+#   define swap2(x) B(vpermilpd, _mask, \
+                       B(shuf_f64x2_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b01010101, undef(), ~0)
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -405,6 +415,7 @@ static inline bool _to_bool(byte_vec_t b
                              B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
                                VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
                              0b00011011, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -442,8 +453,17 @@ static inline bool _to_bool(byte_vec_t b
                              (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
                                       VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
                              0b01001110, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  if VEC_SIZE == 32
+#   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
+#  elif VEC_SIZE == 64
+#   define swap3(x) ({ \
+    vdi_t t_ = B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0); \
+    B(shuf_i64x2_, _mask, t_, t_, 0b01001110, (vdi_t)undef(), ~0); \
+})
+#  endif
 # endif
 # if INT_SIZE == 4
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
@@ -489,6 +509,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+#  ifdef __AVX512VBMI__
+#   define swap2(x) ((vec_t)B(permvarqi, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  endif
 # elif INT_SIZE == 2 || UINT_SIZE == 2
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -517,6 +540,7 @@ static inline bool _to_bool(byte_vec_t b
                               (0b01010101010101010101010101010101 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+#  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
@@ -1325,6 +1349,12 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
+#ifdef swap3
+    touch(src);
+    if ( !eq(swap3(src), inv) ) return __LINE__;
+    touch(src);
+#endif
+
 #ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -275,6 +275,8 @@ OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(perm);
+OVR_VFP(permil);
 OVR_VFP(shuf);
 OVR_INT(sll);
 OVR_DQ(sllv);
@@ -331,6 +333,8 @@ OVR(movntdq);
 OVR(movntdqa);
 OVR(movshdup);
 OVR(movsldup);
+OVR(permd);
+OVR(permq);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -434,7 +434,8 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
-    [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
+    [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -477,6 +478,7 @@ static const struct ext0f38_table {
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
+    [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -522,10 +524,10 @@ static const struct ext0f3a_table {
     uint8_t four_op:1;
     disp8scale_t d8s:4;
 } ext0f3a_table[256] = {
-    [0x00] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
+    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
-    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
@@ -8078,6 +8080,9 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0c): /* vpermilps [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0d): /* vpermilpd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         if ( b == 0xe2 )
             goto avx512f_no_sae;
@@ -8423,6 +8428,12 @@ x86_emulate(
         generate_exception_if(!vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
@@ -8627,6 +8638,7 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         if ( !evex.w )
             host_and_vcpu_must_have(avx512_vbmi);
         else
@@ -9054,6 +9066,12 @@ x86_emulate(
         generate_exception_if(!vex.l || !vex.w, EXC_UD);
         goto simd_0f_imm8_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x00): /* vpermq $imm8,{y,z}mm/mem,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x01): /* vpermpd $imm8,{y,z}mm/mem,{y,z}mm{k} */
+        generate_exception_if(!evex.lr || !evex.w, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x38): /* vinserti128 $imm8,xmm/m128,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x39): /* vextracti128 $imm8,ymm,xmm/m128 */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x46): /* vperm2i128 $imm8,ymm/m256,ymm,ymm */
@@ -9073,6 +9091,12 @@ x86_emulate(
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x04): /* vpermilps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x05): /* vpermilpd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_66(0x0f3a, 0x08): /* roundps $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x09): /* roundpd $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x0a): /* roundss $imm8,xmm/m128,xmm */

* [PATCH v6 18/42] x86emul: support AVX512BW pack insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (16 preceding siblings ...)
  2018-12-06  9:59   ` [PATCH v6 17/42] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
@ 2018-12-06  9:59   ` Jan Beulich
  2018-12-06 10:00   ` [PATCH v6 19/42] x86emul: support AVX512F floating-point conversion insns Jan Beulich
                     ` (23 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06  9:59 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

No further test harness additions - what is there is good enough for
these rather "regular" insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -306,6 +306,10 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
+    INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
+    INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
+    INSN(packuswb,    66,   0f, 67,    vl,    w, vl),
     INSN(paddb,       66,   0f, fc,    vl,    b, vl),
     INSN(paddsb,      66,   0f, ec,    vl,    b, vl),
     INSN(paddsw,      66,   0f, ed,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -361,6 +361,10 @@ OVR(pextrw);
 OVR(pinsrb);
 OVR(pinsrw);
 #  ifdef __AVX512VL__
+OVR(packssdw);
+OVR(packsswb);
+OVR(packusdw);
+OVR(packuswb);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
-    [0x2b] = { .simd_size = simd_packed_int },
+    [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
@@ -6723,6 +6723,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         op_bytes = 16 << evex.lr;
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x63): /* vpacksswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x67): /* vpackuswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6785,6 +6787,12 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6b): /* vpackssdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2b): /* vpackusdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w || evex.br, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;

* [PATCH v6 19/42] x86emul: support AVX512F floating-point conversion insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (17 preceding siblings ...)
  2018-12-06  9:59   ` [PATCH v6 18/42] x86emul: support AVX512BW pack insns Jan Beulich
@ 2018-12-06 10:00   ` Jan Beulich
  2018-12-06 10:00   ` [PATCH v6 20/42] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
                     ` (22 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVTPS2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size change for twobyte_table[0x5a] is benign to pre-existing
code, but allows decode_disp8scale() to work as is here.

Also correct the comment on an AVX counterpart.
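
Concretely, a sketch of what the 0x5a decode adjustment amounts to
(illustrative values only, derived from the d8s_vl table entry):

    #include <stdbool.h>

    /* Sketch: Disp8 scale for VCVTPS2PD (EVEX 0F 5A, no prefix). */
    static unsigned int vcvtps2pd_disp8scale(unsigned int lr, bool br)
    {
        return br ? 2        /* 4-byte element under broadcast */
                  : 3 + lr;  /* m64/m128/m256, i.e. half a vector */
    }

The table-derived value of 4 + lr is one too large because the insn
reads only half a vector, hence the decrement; under embedded
broadcast the scale is the element size, hence the !evex.br guard.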

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base over changes earlier in the series.
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,6 +109,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
+    INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
+    INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -181,7 +181,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -68,6 +68,7 @@ typedef short __attribute__((vector_size
 typedef int __attribute__((vector_size(VEC_SIZE))) vsi_t;
 #if VEC_SIZE >= 8
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
+typedef double __attribute__((vector_size(VEC_SIZE))) vdf_t;
 #endif
 
 #if ELEM_SIZE == 1
@@ -93,6 +94,7 @@ typedef char __attribute__((vector_size(
 typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
 typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+typedef float __attribute__((vector_size(HALF_SIZE))) vsf_half_t;
 # endif
 
 # if ELEM_COUNT >= 4
@@ -328,6 +330,13 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(cvtpd2psx);
+OVR(cvtpd2psy);
+OVR(cvtph2ps);
+OVR(cvtps2pd);
+OVR(cvtps2ph);
+OVR(cvtsd2ss);
+OVR(cvtss2sd);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3842,6 +3842,49 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vcvtph2ps 32(%ecx),%zmm7{%k4}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vcvtph2ps);
+        decl_insn(evex_vcvtps2ph);
+
+        asm volatile ( "vpternlogd $0x81, %%zmm7, %%zmm7, %%zmm7\n\t"
+                       "kmovw %1,%%k4\n"
+                       put_insn(evex_vcvtph2ps, "vcvtph2ps 32(%0), %%zmm7%{%%k4%}")
+                       :: "c" (NULL), "r" (0x3333) );
+
+        set_insn(evex_vcvtph2ps);
+        memset(res, 0xff, 128);
+        res[8] = 0x40003c00; /* (1.0, 2.0) */
+        res[10] = 0x44004200; /* (3.0, 4.0) */
+        res[12] = 0x3400b800; /* (-.5, .25) */
+        res[14] = 0xbc000000; /* (0.0, -1.) */
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovups %%zmm7, %0" : "=m" (res[16]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtph2ps) )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vcvtps2ph $0,%zmm3,128(%edx){%k4}...");
+        asm volatile ( "vmovups %0, %%zmm3\n"
+                       put_insn(evex_vcvtps2ph, "vcvtps2ph $0, %%zmm3, 128(%1)%{%%k4%}")
+                       :: "m" (res[16]), "d" (NULL) );
+
+        set_insn(evex_vcvtps2ph);
+        regs.edx = (unsigned long)res;
+        memset(res + 32, 0xcc, 32);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtps2ph) )
+            goto fail;
+        res[15] = res[13] = res[11] = res[9] = 0xcccccccc;
+        if ( memcmp(res + 8, res + 32, 32) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -310,7 +310,8 @@ static const struct twobyte_table {
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -437,7 +438,7 @@ static const struct ext0f38_table {
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x13] = { .simd_size = simd_other, .two_op = 1 },
+    [0x13] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
@@ -541,7 +542,7 @@ static const struct ext0f3a_table {
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
+    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
@@ -3068,6 +3069,11 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x5a: /* vcvtps2pd needs special casing */
+                if ( disp8scale && !evex.pfx && !evex.br )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -5982,6 +5988,7 @@ x86_emulate(
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5d):    /* vmin{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    avx512f_all_fp:
         generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
                                (ea.type == OP_MEM && evex.br &&
                                 (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
@@ -6539,7 +6546,7 @@ x86_emulate(
         goto simd_zmm;
 
     CASE_SIMD_ALL_FP(, 0x0f, 0x5a):        /* cvt{p,s}{s,d}2{p,s}{s,d} xmm/mem,xmm */
-    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} xmm/mem,xmm */
+    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} {x,y}mm/mem,{x,y}mm */
                                            /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm */
         op_bytes = 4 << (((vex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + vex.l) +
                          !!(vex.pfx & VEX_PREFIX_DOUBLE_MASK));
@@ -6548,6 +6555,12 @@ x86_emulate(
             goto simd_0f_sse2;
         goto simd_0f_avx;
 
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5a):   /* vcvtp{s,d}2p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+                                           /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm{k} */
+        op_bytes = 4 << (((evex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + evex.lr) +
+                         evex.w);
+        goto avx512f_all_fp;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x5b):     /* cvt{ps,dq}2{dq,ps} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x5b): /* vcvt{ps,dq}2{dq,ps} {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F3(0x0f, 0x5b):       /* cvttps2dq xmm/mem,xmm */
@@ -8431,6 +8444,15 @@ x86_emulate(
         op_bytes = 8 << vex.l;
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w || (ea.type == OP_MEM && evex.br), EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.br )
+            avx512_vlen_check(false);
+        op_bytes = 8 << evex.lr;
+        elem_bytes = 2;
+        goto simd_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
         generate_exception_if(!vex.l || vex.w, EXC_UD);
@@ -9255,27 +9277,79 @@ x86_emulate(
         goto avx512f_imm8_no_sae;
 
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,[xyz]mm,{x,y}mm/mem{k} */
     {
         uint32_t mxcsr;
 
-        generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
-        host_and_vcpu_must_have(f16c);
         fail_if(!ops->write);
+        if ( evex_encoded() )
+        {
+            generate_exception_if((evex.w || evex.reg != 0xf || !evex.RX ||
+                                   (ea.type == OP_MEM && (evex.z || evex.br))),
+                                  EXC_UD);
+            host_and_vcpu_must_have(avx512f);
+            avx512_vlen_check(false);
+            opc = init_evex(stub);
+        }
+        else
+        {
+            generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
+            host_and_vcpu_must_have(f16c);
+            opc = init_prefixes(stub);
+        }
+
+        op_bytes = 8 << evex.lr;
 
-        opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
         if ( ea.type == OP_MEM )
         {
             /* Convert memory operand to (%rAX). */
             vex.b = 1;
+            evex.b = 1;
             opc[1] &= 0x38;
         }
         opc[2] = imm1;
-        insn_bytes = PFX_BYTES + 3;
+        if ( evex_encoded() )
+        {
+            unsigned int full = 0;
+
+            insn_bytes = EVEX_PFX_BYTES + 3;
+            copy_EVEX(opc, evex);
+
+            if ( ea.type == OP_MEM && evex.opmsk )
+            {
+                full = 0xffff >> (16 - op_bytes / 2);
+                op_mask &= full;
+                if ( !op_mask )
+                    goto complete_insn;
+
+                first_byte = __builtin_ctz(op_mask);
+                op_mask >>= first_byte;
+                full >>= first_byte;
+                first_byte <<= 1;
+                op_bytes = (32 - __builtin_clz(op_mask)) << 1;
+
+                /*
+                 * We may need to read (parts of) the memory operand for the
+                 * purpose of merging in order to avoid splitting the write
+                 * below into multiple ones.
+                 */
+                if ( op_mask != full &&
+                     (rc = ops->read(ea.mem.seg,
+                                     truncate_ea(ea.mem.off + first_byte),
+                                     (void *)mmvalp + first_byte, op_bytes,
+                                     ctxt)) != X86EMUL_OKAY )
+                    goto done;
+            }
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 3;
+            copy_VEX(opc, vex);
+        }
         opc[3] = 0xc3;
 
-        copy_VEX(opc, vex);
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
                     "=m" (*mmvalp), [mxcsr] "=m" (mxcsr) : "a" (mmvalp));
@@ -9284,7 +9358,8 @@ x86_emulate(
 
         if ( ea.type == OP_MEM )
         {
-            rc = ops->write(ea.mem.seg, ea.mem.off, mmvalp, 8 << vex.l, ctxt);
+            rc = ops->write(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
+                            (void *)mmvalp + first_byte, op_bytes, ctxt);
             if ( rc != X86EMUL_OKAY )
             {
                 asm volatile ( "ldmxcsr %0" :: "m" (mxcsr) );
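
An aside on the opmask handling in the vcvtps2ph hunk above: the write
gets packed into a single contiguous window spanning the first through
last unmasked 2-byte element, with masked-off elements inside the
window merged from a prior read. A standalone sketch of that
arithmetic (illustrative; mirrors the first_byte/op_bytes logic):

    /* Byte window of the destination needing a write; op_mask != 0. */
    static void write_window(unsigned int op_mask,
                             unsigned int *first_byte,
                             unsigned int *op_bytes)
    {
        unsigned int first = __builtin_ctz(op_mask);

        op_mask >>= first;
        *first_byte = first << 1;                  /* 2 bytes/element */
        *op_bytes = (32 - __builtin_clz(op_mask)) << 1;
    }
    /* E.g. op_mask 0x0330: first_byte = 8, op_bytes = 12 (elems 4-9). */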

* [PATCH v6 20/42] x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (18 preceding siblings ...)
  2018-12-06 10:00   ` [PATCH v6 19/42] x86emul: support AVX512F floating-point conversion insns Jan Beulich
@ 2018-12-06 10:00   ` Jan Beulich
  2018-12-06 10:01   ` [PATCH v6 21/42] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
                     ` (21 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

... including the two AVX512DQ forms which share these encodings, just
with EVEX.W set there.

VCVTDQ2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size changes for the twobyte_table[] entries are benign to
pre-existing code, but allow decode_disp8scale() to work as is here.

The placement of the 0xe6 case block, wrong at this point, is once
again in anticipation of further case labels to be added.
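
Analogously to VCVTPS2PD earlier in the series, a sketch of what the
0xe6 decode special case yields (illustrative values only; F3 with
EVEX.W clear is VCVTDQ2PD, the lone half-width source in this group):

    #include <stdbool.h>

    /* Sketch: Disp8 scale for the EVEX 0F E6 conversion group. */
    static unsigned int cvt_0fe6_disp8scale(unsigned int lr, bool w,
                                            bool f3, bool br)
    {
        if ( br )
            return 2 + w;            /* element size under broadcast */
        return f3 && !w ? 3 + lr     /* vcvtdq2pd: half-width source */
                        : 4 + lr;    /* full vector otherwise */
    }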

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,8 +109,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
+    INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
+    INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
@@ -398,6 +402,8 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
+    INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -92,6 +92,13 @@ static inline bool _to_bool(byte_vec_t b
 # define to_int(x) ((vec_t){ (int)(x)[0] })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
+#elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+# elif FLOAT_SIZE == 8
+#  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
 #  define to_int(x) __builtin_ia32_cvtdq2ps(__builtin_ia32_cvtps2dq(x))
@@ -1142,15 +1149,21 @@ int simd_test(void)
     touch(src);
     if ( !eq(x * -alt, -src) ) return __LINE__;
 
-# if defined(recip) && defined(to_int)
+# ifdef to_int
+
+    touch(src);
+    x = to_int(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
 
+#  ifdef recip
     touch(src);
     x = recip(src);
     touch(src);
     touch(x);
     if ( !eq(to_int(recip(x)), src) ) return __LINE__;
 
-#  ifdef rsqrt
+#   ifdef rsqrt
     x = src * src;
     touch(x);
     y = rsqrt(x);
@@ -1158,6 +1171,7 @@ int simd_test(void)
     if ( !eq(to_int(recip(y)), src) ) return __LINE__;
     touch(src);
     if ( !eq(to_int(y), to_int(recip(src))) ) return __LINE__;
+#   endif
 #  endif
 
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -244,6 +244,7 @@ asm ( ".macro override insn    \n\t"
 OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
+OVR_VFP(cvtdq2);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
@@ -330,13 +331,19 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(cvtpd2dqx);
+OVR(cvtpd2dqy);
 OVR(cvtpd2psx);
 OVR(cvtpd2psy);
 OVR(cvtph2ps);
+OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
 OVR(cvtss2sd);
+OVR(cvttpd2dqx);
+OVR(cvttpd2dqy);
+OVR(cvttps2dq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -311,7 +311,7 @@ static const struct twobyte_table {
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -375,7 +375,7 @@ static const struct twobyte_table {
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
@@ -3078,6 +3078,11 @@ x86_decode(
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
                 break;
+
+            case 0xe6: /* vcvtdq2pd needs special casing */
+                if ( disp8scale && evex.pfx == vex_f3 && !evex.w && !evex.br )
+                    --disp8scale;
+                break;
             }
             break;
 
@@ -6569,6 +6574,22 @@ x86_emulate(
         op_bytes = 16 << vex.l;
         goto simd_0f_cvt;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x5b): /* vcvtps2dq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x5b): /* vcvttps2dq [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x60):    /* punpcklbw {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x60): /* vpunpcklbw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0x61):    /* punpcklwd {,x}mm/mem,{,x}mm */
@@ -7227,6 +7248,27 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe6):   /* vcvttpd2dq [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx != vex_f3 )
+            host_and_vcpu_must_have(avx512f);
+        else if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+        {
+            host_and_vcpu_must_have(avx512f);
+            generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        }
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 8 << (evex.w + evex.lr);
+        goto simd_zmm;
+
     case X86EMUL_OPC_F2(0x0f, 0xf0):     /* lddqu m128,xmm */
     case X86EMUL_OPC_VEX_F2(0x0f, 0xf0): /* vlddqu mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);

* [PATCH v6 21/42] x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (19 preceding siblings ...)
  2018-12-06 10:00   ` [PATCH v6 20/42] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
@ 2018-12-06 10:01   ` Jan Beulich
  2018-12-06 10:01   ` [PATCH v6 22/42] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
                     ` (20 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:01 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVT{,T}S{S,D}2SI use EVEX.W to express the size of their destination
(register) operand rather than that of their (possibly memory) source,
and hence need a "manual" override of disp8scale.

Slightly adjust the scalar to_int() in the test harness, to increase the
chances of the operand ending up in memory.
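
Put differently, the memory operand here is always the 4- or 8-byte FP
source selected by the opcode prefix, independent of EVEX.W. A sketch
matching the new decode case (pfx encodings as in the emulator: f3 is
2, f2 is 3, and bit 0 is what VEX_PREFIX_DOUBLE_MASK isolates):

    /* Sketch: Disp8 scale for VCVT{,T}S{S,D}2SI. */
    static unsigned int cvts2si_disp8scale(unsigned int pfx)
    {
        return 2 + (pfx & 1); /* f3: 4-byte source; f2: 8-byte source */
    }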

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -117,8 +117,16 @@ static const struct test avx512f_all[] =
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
+    INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
+    INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -746,8 +754,9 @@ static void test_group(const struct test
                 break;
 
             case ESZ_dq:
-                test_pair(&tests[i], vl[j], ESZ_d, "d", ESZ_q, "q",
-                          instr, ctxt);
+                test_pair(&tests[i], vl[j], ESZ_d,
+                          strncmp(tests[i].mnemonic, "cvt", 3) ? "d" : "l",
+                          ESZ_q, "q", instr, ctxt);
                 break;
 
 #ifdef __i386__
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -89,7 +89,7 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 
 #if VEC_SIZE == FLOAT_SIZE
-# define to_int(x) ((vec_t){ (int)(x)[0] })
+# define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -340,10 +340,28 @@ OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
+OVR(cvtsd2si);
+OVR(cvtsd2sil);
+OVR(cvtsd2siq);
+OVR(cvtsi2sd);
+OVR(cvtsi2sdl);
+OVR(cvtsi2sdq);
+OVR(cvtsi2ss);
+OVR(cvtsi2ssl);
+OVR(cvtsi2ssq);
 OVR(cvtss2sd);
+OVR(cvtss2si);
+OVR(cvtss2sil);
+OVR(cvtss2siq);
 OVR(cvttpd2dqx);
 OVR(cvttpd2dqy);
 OVR(cvttps2dq);
+OVR(cvttsd2si);
+OVR(cvttsd2sil);
+OVR(cvttsd2siq);
+OVR(cvttss2si);
+OVR(cvttss2sil);
+OVR(cvttss2siq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -296,7 +296,7 @@ static const struct twobyte_table {
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
     [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp, simd_none, d8s_dq },
@@ -3069,6 +3069,12 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x2c: /* vcvtts{s,d}2si need special casing */
+            case 0x2d: /* vcvts{s,d}2si need special casing */
+                if ( evex_encoded() )
+                    disp8scale = 2 + (evex.pfx & VEX_PREFIX_DOUBLE_MASK);
+                break;
+
             case 0x5a: /* vcvtps2pd needs special casing */
                 if ( disp8scale && !evex.pfx && !evex.br )
                     --disp8scale;
@@ -6184,6 +6190,26 @@ x86_emulate(
         state->simd_size = simd_none;
         goto simd_0f_rm;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+        generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.br),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        if ( ea.type == OP_MEM )
+        {
+            rc = read_ulong(ea.mem.seg, ea.mem.off, &src.val,
+                            rex_prefix & REX_W ? 8 : 4, ctxt, ops);
+            if ( rc != X86EMUL_OKAY )
+                goto done;
+        }
+        else
+            src.val = rex_prefix & REX_W ? *ea.reg : (uint32_t)*ea.reg;
+
+        state->simd_size = simd_none;
+        goto avx512f_rm;
+
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2c):     /* cvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_VEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2d):     /* cvts{s,d}2si xmm/mem,reg */
@@ -6205,14 +6231,17 @@ x86_emulate(
         }
 
         opc = init_prefixes(stub);
+    cvts_2si:
         opc[0] = b;
         /* Convert GPR destination to %rAX and memory operand to (%rCX). */
         rex_prefix &= ~REX_R;
         vex.r = 1;
+        evex.r = 1;
         if ( ea.type == OP_MEM )
         {
             rex_prefix &= ~REX_B;
             vex.b = 1;
+            evex.b = 1;
             opc[1] = 0x01;
 
             rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp,
@@ -6223,11 +6252,22 @@ x86_emulate(
         else
             opc[1] = modrm & 0xc7;
         if ( !mode_64bit() )
+        {
             vex.w = 0;
-        insn_bytes = PFX_BYTES + 2;
+            evex.w = 0;
+        }
+        if ( evex_encoded() )
+        {
+            insn_bytes = EVEX_PFX_BYTES + 2;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 2;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
         opc[2] = 0xc3;
 
-        copy_REX_VEX(opc, rex_prefix, vex);
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
 
@@ -6235,6 +6275,17 @@ x86_emulate(
         state->simd_size = simd_none;
         break;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+        generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
+                               (ea.type != OP_REG && evex.br)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto cvts_2si;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2e):     /* ucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2f):     /* comis{s,d} xmm/mem,xmm */
@@ -6886,6 +6937,7 @@ x86_emulate(
         host_and_vcpu_must_have(avx512f);
         get_fpu(X86EMUL_FPU_zmm);
 
+    avx512f_rm:
         opc = init_evex(stub);
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */

* [PATCH v6 22/42] x86emul: support AVX512DQ packed quad-int/FP conversion insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (20 preceding siblings ...)
  2018-12-06 10:01   ` [PATCH v6 21/42] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
@ 2018-12-06 10:01   ` Jan Beulich
  2018-12-06 10:02   ` [PATCH v6 23/42] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
                     ` (19 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:01 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVT{,T}PS2QQ, sharing their main opcodes with others, once again need
"manual" overrides of disp8scale.

While not directly related here, also add a scalar variant of to_wint()
to the test harness.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Workaround for gcc 7 quirk.
v5: Re-base over changes earlier in the series.
v4: New.
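
As background, EVEX encodings compress an 8-bit displacement by implicitly
scaling it with the memory operand size N ("disp8*N" in the SDM). A minimal
sketch of the expansion, assuming disp8scale holds log2(N) (the helper name
is made up for illustration; this is not the emulator's exact code):

#include <stdint.h>

/* Expand a compressed 8-bit displacement: effective = disp8 * N. */
static int32_t effective_disp(int8_t disp8, unsigned int disp8scale)
{
    return (int32_t)disp8 << disp8scale;
}

A full 512-bit operand has N = 64 (disp8scale = 6), but VCVT{,T}PS2QQ read
only half a vector from memory (32 bytes at 512-bit length), which is why
the x86_decode() special case below reduces disp8scale by one.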

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -410,8 +410,12 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
+    INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -90,14 +90,35 @@ static inline bool _to_bool(byte_vec_t b
 
 #if VEC_SIZE == FLOAT_SIZE
 # define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
+# ifdef __x86_64__
+#  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) ({ \
+    vsf_half_t t_ = low_half(x); \
+    vdi_t lo_, hi_; \
+    touch(t_); \
+    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    t_ = high_half(x); \
+    touch(t_); \
+    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    touch(lo_); touch(hi_); \
+    insert_half(insert_half(undef(), \
+                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+})
+#  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
@@ -121,6 +142,21 @@ static inline bool _to_bool(byte_vec_t b
 })
 #endif
 
+#if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
+# define low_half(x) (x)
+# define high_half(x) B_(movhlps, , undef(), x)
+/*
+ * GCC 7 (and perhaps earlier) report a bogus type mismatch for the conditional
+ * expression below. All works well with this no-op wrapper.
+ */
+static inline vec_t movlhps(vec_t x, vec_t y) {
+    return __builtin_ia32_movlhps(x, y);
+}
+# define insert_half(x, y, p) \
+    ((p) ? movlhps(x, y) \
+         : ({ vec_t t_ = (x); t_[0] = (y)[0]; t_[1] = (y)[1]; t_; }))
+#endif
+
 #if VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW_A__)
 # define max __builtin_ia32_pfmax
 # define min __builtin_ia32_pfmin
@@ -149,13 +185,16 @@ static inline bool _to_bool(byte_vec_t b
 # if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
      (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
      (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
-#  define low_half(x) ({ \
+#  define _half(x, lh) ({ \
     half_t t_; \
-    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+    asm ( "vextractf%c[w]x%c[n] %[sel], %[s], %[d]" \
           : [d] "=m" (t_) \
-          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+          : [s] "v" (x), [sel] "i" (lh), \
+            [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
     t_; \
 })
+#  define low_half(x)  _half(x, 0)
+#  define high_half(x) _half(x, 1)
 # endif
 # if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
      (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
@@ -1176,6 +1215,13 @@ int simd_test(void)
 
 # endif
 
+# ifdef to_wint
+    touch(src);
+    x = to_wint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
 # ifdef sqrt
     x = src * src;
     touch(x);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -325,6 +325,8 @@ static const struct twobyte_table {
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3080,6 +3082,12 @@ x86_decode(
                     --disp8scale;
                 break;
 
+            case 0x7a: /* vcvttps2qq needs special casing */
+            case 0x7b: /* vcvtps2qq needs special casing */
+                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.br )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -7309,7 +7317,13 @@ x86_emulate(
         if ( evex.pfx != vex_f3 )
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
+        {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2qq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512dq);
+        }
         else
         {
             host_and_vcpu_must_have(avx512f);




* [PATCH v6 23/42] x86emul: support AVX512{F, DQ} uint-to-FP conversion insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (21 preceding siblings ...)
  2018-12-06 10:01   ` [PATCH v6 22/42] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
@ 2018-12-06 10:02   ` Jan Beulich
  2018-12-06 10:02   ` [PATCH v6 24/42] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
                     ` (18 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:02 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Some "manual" overrides of disp8scale are needed here again. In
particular code ends up simpler when using d8s_dq64 in the
twobyte_table[] entry.

Test harness additions will be done once the reverse conversions are
also available.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
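
To illustrate why d8s_dq64 fits the 0x7b entry: VCVTUSI2S{S,D} read a
GPR-sized memory operand, i.e. N is 4 or 8 bytes depending on EVEX.W, and
W gets forced clear outside 64-bit mode (as the emulation code already does
for the sibling VCVTSI2S{S,D}). A hedged sketch with made-up names:

#include <stdbool.h>

/* disp8 scale (log2 of N) for a d8s_dq64-style memory operand. */
static unsigned int dq64_disp8scale(bool evex_w, bool mode_64bit)
{
    if ( !mode_64bit )
        evex_w = false;    /* EVEX.W is ignored outside 64-bit mode */
    return 2 + evex_w;     /* log2(4) or log2(8) */
}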

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -127,6 +127,10 @@ static const struct test avx512f_all[] =
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
+    INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
+    INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
+    INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -416,6 +420,8 @@ static const struct test avx512dq_all[]
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
+    INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -326,7 +326,7 @@ static const struct twobyte_table {
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3082,12 +3082,16 @@ x86_decode(
                     --disp8scale;
                 break;
 
-            case 0x7a: /* vcvttps2qq needs special casing */
-            case 0x7b: /* vcvtps2qq needs special casing */
-                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.br )
+            case 0x7a: /* vcvttps2qq and vcvtudq2pd need special casing */
+                if ( disp8scale && evex.pfx != vex_f2 && !evex.w && !evex.br )
                     --disp8scale;
                 break;
 
+            case 0x7b: /* vcvtp{s,d}2qq need special casing */
+                if ( disp8scale && evex.pfx == vex_66 )
+                    disp8scale = (evex.br ? 2 : 3 + evex.lr) + evex.w;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -6199,6 +6203,7 @@ x86_emulate(
         goto simd_0f_rm;
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x7b): /* vcvtusi2s{s,d} r/m,xmm,xmm */
         generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.br),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
@@ -6639,6 +6644,8 @@ x86_emulate(
         /* fall through */
     case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
                                           /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x7a): /* vcvtudq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtuqq2ps [xyz]mm/mem,{x,y}mm{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
@@ -7312,6 +7319,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
         generate_exception_if(!evex.w, EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7a):   /* vcvtudq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtuqq2pd [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
         if ( evex.pfx != vex_f3 )




* [PATCH v6 24/42] x86emul: support AVX512{F, DQ} FP-to-uint conversion insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (22 preceding siblings ...)
  2018-12-06 10:02   ` [PATCH v6 23/42] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
@ 2018-12-06 10:02   ` Jan Beulich
  2018-12-06 10:04   ` [PATCH v6 25/42] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
                     ` (17 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:02 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Along the lines of prior patches, VCVT{,T}PS2UQQ as well as
VCVT{,T}S{S,D}2USI need "manual" overrides of disp8scale.

The twobyte_table[] entries get altered, with their prior values
now put in place in x86_decode_twobyte().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.
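
The to_u_int() harness addition boils down to a round-trip property: an
integral, in-range value converted to unsigned and back must compare equal
to the original. A standalone model, with plain C casts standing in for
VCVTTSS2USI / VCVTUSI2SS (an assumption: the insns additionally define
out-of-range and rounding behavior which the casts don't):

#include <assert.h>

int main(void)
{
    float src = 3.0f;                    /* integral and in range */
    unsigned int u = (unsigned int)src;  /* ~ vcvttss2usi */
    assert((float)u == src);             /* ~ vcvtusi2ss round trip */
    return 0;
}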

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -112,21 +112,29 @@ static const struct test avx512f_all[] =
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
+    INSN(cvtpd2udq,      ,   0f, 79,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtps2udq,      ,   0f, 79,    vl,      d, vl),
     INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
+    INSN(cvtsd2usi,    f2,   0f, 79,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
     INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
     INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvtss2usi,    f3,   0f, 79,    el,      d, el),
     INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttpd2udq,     ,   0f, 78,    vl,      q, vl),
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttps2udq,     ,   0f, 78,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttsd2usi,   f2,   0f, 78,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvttss2usi,   f3,   0f, 78,    el,      d, el),
     INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
     INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
@@ -415,11 +423,15 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtpd2uqq,      66,   0f, 79,   vl,  q, vl),
     INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
+    INSN(cvtps2uqq,      66,   0f, 79, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttpd2uqq,     66,   0f, 78,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -93,31 +93,65 @@ static inline bool _to_bool(byte_vec_t b
 # ifdef __x86_64__
 #  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
 # endif
+# ifdef __AVX512F__
+/*
+ * Sadly even gcc 9.x, at the time of writing, does not carry out at least
+ * uint -> FP conversions using VCVTUSI2S{S,D}, so we need to use builtins
+ * or inline assembly here. The full-vector parameter types of the builtins
+ * aren't very helpful for our purposes, so use inline assembly.
+ */
+#  if FLOAT_SIZE == 4
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    float __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtss2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2ss%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  elif FLOAT_SIZE == 8
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    double __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtsd2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2sd%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  endif
+#  define to_uint(x) to_u_int(int, x)
+#  ifdef __x86_64__
+#   define to_uwint(x) to_u_int(long, x)
+#  endif
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  define to_uint(x) BR(cvtudq2ps, _mask, BR(cvtps2udq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
-#   define to_wint(x) ({ \
+#   define to_w_int(x, s) ({ \
     vsf_half_t t_ = low_half(x); \
     vdi_t lo_, hi_; \
     touch(t_); \
-    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    lo_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     t_ = high_half(x); \
     touch(t_); \
-    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    hi_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     touch(lo_); touch(hi_); \
     insert_half(insert_half(undef(), \
-                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
-                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+                            BR(cvt ## s ## qq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvt ## s ## qq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
 })
+#   define to_wint(x) to_w_int(x, )
+#   define to_uwint(x) to_w_int(x, u)
 #  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  define to_uint(x) B(cvtudq2pd, _mask, BR(cvtpd2udq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
 #   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#   define to_uwint(x) BR(cvtuqq2pd, _mask, BR(cvtpd2uqq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
 #  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
@@ -1221,6 +1255,20 @@ int simd_test(void)
     touch(src);
     if ( !eq(x, src) ) return __LINE__;
 # endif
+
+# ifdef to_uint
+    touch(src);
+    x = to_uint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
+# ifdef to_uwint
+    touch(src);
+    x = to_uwint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
 
 # ifdef sqrt
     x = src * src;
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -323,8 +323,7 @@ static const struct twobyte_table {
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
-    [0x78] = { ImplicitOps|ModRM },
-    [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x78 ... 0x79] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -2516,6 +2515,8 @@ x86_decode_twobyte(
         break;
 
     case 0x78:
+        state->desc = ImplicitOps;
+        state->simd_size = simd_none;
         switch ( vex.pfx )
         {
         case vex_66: /* extrq $imm8, $imm8, xmm */
@@ -2528,7 +2529,7 @@ x86_decode_twobyte(
     case 0x10 ... 0x18:
     case 0x28 ... 0x2f:
     case 0x50 ... 0x77:
-    case 0x79 ... 0x7d:
+    case 0x7a ... 0x7d:
     case 0x7f:
     case 0xc2 ... 0xc3:
     case 0xc5 ... 0xc6:
@@ -2550,6 +2551,12 @@ x86_decode_twobyte(
         op_bytes = mode_64bit() ? 8 : 4;
         break;
 
+    case 0x79:
+        state->desc = DstReg | SrcMem;
+        state->simd_size = simd_packed_int;
+        ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
+        break;
+
     case 0x7e:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         if ( vex.pfx == vex_f3 ) /* movq xmm/m64,xmm */
@@ -3071,6 +3078,18 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x78:
+            case 0x79:
+                if ( !evex.pfx )
+                    break;
+                /* vcvt{,t}ps2uqq need special casing */
+                if ( evex.pfx == vex_66 )
+                {
+                    if ( !evex.w && !evex.br )
+                        --disp8scale;
+                    break;
+                }
+                /* vcvt{,t}s{s,d}2usi need special casing: fall through */
             case 0x2c: /* vcvtts{s,d}2si need special casing */
             case 0x2d: /* vcvts{s,d}2si need special casing */
                 if ( evex_encoded() )
@@ -6290,6 +6309,8 @@ x86_emulate(
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x78): /* vcvtts{s,d}2usi xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x79): /* vcvts{s,d}2usi xmm/mem,reg */
         generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
                                (ea.type != OP_REG && evex.br)),
                               EXC_UD);
@@ -6649,7 +6670,11 @@ x86_emulate(
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
+        {
+    case X86EMUL_OPC_EVEX(0x0f, 0x78):    /* vcvttp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX(0x0f, 0x79):    /* vcvtp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512f);
+        }
         if ( ea.type == OP_MEM || !evex.br )
             avx512_vlen_check(false);
         d |= TwoOp;
@@ -7327,6 +7352,10 @@ x86_emulate(
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
         {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x78):   /* vcvttps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2uqq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x79):   /* vcvtps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2uqq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */




* [PATCH v6 25/42] x86emul: support remaining AVX512F legacy-equivalent insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (23 preceding siblings ...)
  2018-12-06 10:02   ` [PATCH v6 24/42] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
@ 2018-12-06 10:04   ` Jan Beulich
  2018-12-06 10:04   ` [PATCH v6 26/42] x86emul: support remaining AVX512BW " Jan Beulich
                     ` (16 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:04 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Plus their AVX512BW counterparts.

Take the opportunity and also eliminate a pair of open-coded instances
of scalar_1op().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base over changes earlier in the series.
v5: New.
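
Since the harness' trunc() now relies on VRNDSCALE's immediate, a brief
decode of the $0b1011 used below: bits 1:0 select the rounding mode
(0b11 = toward zero), clear bit 2 means that mode comes from the immediate
rather than MXCSR, bit 3 suppresses the precision exception, and bits 7:4
give the number M of fraction bits to keep (0 here). A rough C model of
the arithmetic, ignoring exception behavior (the function name is made up):

#include <math.h>

/* vrndscale with "toward zero": result = 2^-M * trunc(x * 2^M). */
static double rndscale_trunc(double x, unsigned int m)
{
    return ldexp(trunc(ldexp(x, (int)m)), -(int)m);
}

With M = 0, as in the immediates below, this is plain truncation.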

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -193,6 +193,8 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(pabsd,        66, 0f38, 1e,    vl,      d, vl),
+    INSN(pabsq,        66, 0f38, 1f,    vl,      q, vl),
     INSN(paddd,        66,   0f, fe,    vl,      d, vl),
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
@@ -276,6 +278,10 @@ static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
+    INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
+    INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
+    INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -336,6 +342,8 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(pabsb,       66, 0f38, 1c,    vl,    b, vl),
+    INSN(pabsw,       66, 0f38, 1d,    vl,    w, vl),
     INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
     INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
     INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -211,8 +211,10 @@ static inline vec_t movlhps(vec_t x, vec
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
+#  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
+#  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
@@ -263,6 +265,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
 #  define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
@@ -316,6 +319,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
+#  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
@@ -548,6 +552,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  endif
 # endif
 # if INT_SIZE == 4
+#  define abs(x) B(pabsd, _mask, x, undef(), ~0)
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
@@ -558,6 +563,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
 #  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
+#  define abs(x) ((vec_t)B(pabsq, _mask, (vdi_t)(x), (vdi_t)undef(), ~0))
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 8
@@ -625,6 +631,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
+#  define abs(x) ((vec_t)B(pabsb, _mask, (vqi_t)(x), (vqi_t)undef(), ~0))
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
@@ -637,6 +644,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
 #  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 2
+#  define abs(x) B(pabsw, _mask, x, undef(), ~0)
 #  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
 #  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
@@ -948,19 +956,11 @@ static inline vec_t movlhps(vec_t x, vec
 #if VEC_SIZE == FLOAT_SIZE
 # define max(x, y) ((vec_t){({ typeof(x[0]) x_ = (x)[0], y_ = (y)[0]; x_ > y_ ? x_ : y_; })})
 # define min(x, y) ((vec_t){({ typeof(x[0]) x_ = (x)[0], y_ = (y)[0]; x_ < y_ ? x_ : y_; })})
-# ifdef __SSE4_1__
+# if defined(__SSE4_1__) && !defined(__AVX512F__)
 #  if FLOAT_SIZE == 4
-#   define trunc(x) ({ \
-    float __attribute__((vector_size(16))) r_; \
-    asm ( "roundss $0b1011,%1,%0" : "=x" (r_) : "m" (x) ); \
-    (vec_t){ r_[0] }; \
-})
+#   define trunc(x) scalar_1op(x, "roundss $0b1011, %[in], %[out]")
 #  elif FLOAT_SIZE == 8
-#   define trunc(x) ({ \
-    double __attribute__((vector_size(16))) r_; \
-    asm ( "roundsd $0b1011,%1,%0" : "=x" (r_) : "m" (x) ); \
-    (vec_t){ r_[0] }; \
-})
+#   define trunc(x) scalar_1op(x, "roundsd $0b1011, %[in], %[out]")
 #  endif
 # endif
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -184,6 +184,8 @@ DECL_OCTET(half);
 # define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 # define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 # define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+# define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
+# define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -245,6 +247,7 @@ OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_VFP(cvtdq2);
+OVR_INT(abs);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -446,7 +446,7 @@ static const struct ext0f38_table {
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x1c ... 0x1f] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
     [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
@@ -531,8 +531,8 @@ static const struct ext0f3a_table {
     [0x02] = { .simd_size = simd_packed_int },
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
-    [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
-    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
+    [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc, .d8s = d8s_dq },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
     [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
@@ -6874,6 +6874,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.br, EXC_UD);
         elem_bytes = 1 << (b & 1);
@@ -8257,6 +8259,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1e): /* vpabsd [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1f): /* vpabsq [xyz]mm/mem,[xyz]mm{k} */
         generate_exception_if(evex.w != (b & 1), EXC_UD);
         goto avx512f_no_sae;
 
@@ -9286,6 +9290,17 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_1);
         goto simd_0f3a_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0a): /* vrndscaless $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0b): /* vrndscalesd $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(ea.type != OP_REG && evex.br, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x08): /* vrndscaleps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x09): /* vrndscalepd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        avx512_vlen_check(b & 2);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f3a, 0x0f):    /* palignr $imm8,mm/m64,mm */
     case X86EMUL_OPC_66(0x0f3a, 0x0f): /* palignr $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(ssse3);




* [PATCH v6 26/42] x86emul: support remaining AVX512BW legacy-equivalent insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (24 preceding siblings ...)
  2018-12-06 10:04   ` [PATCH v6 25/42] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
@ 2018-12-06 10:04   ` Jan Beulich
  2018-12-06 10:04   ` [PATCH v6 27/42] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
                     ` (15 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:04 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
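
Of the insns gaining support here, VPMULHRSW is the least self-describing:
per element it yields the rounded high half of the doubled 16x16-bit
product. A scalar model following the SDM's definition (the function name
is made up):

#include <stdint.h>

/* One vpmulhrsw element: ((a * b + 0x4000) >> 15), kept in 16 bits. */
static int16_t mulhrsw(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}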

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -354,6 +354,7 @@ static const struct test avx512bw_all[]
     INSN(paddusb,     66,   0f, dc,    vl,    b, vl),
     INSN(paddusw,     66,   0f, dd,    vl,    w, vl),
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
+    INSN(palignr,     66, 0f3a, 0f,    vl,    b, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
     INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
@@ -369,6 +370,7 @@ static const struct test avx512bw_all[]
     INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
+    INSN(pmaddubsw,   66, 0f38, 04,    vl,    b, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -386,6 +388,7 @@ static const struct test avx512bw_all[]
 //       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
+    INSN(pmulhrsw,    66, 0f38, 0b,    vl,    w, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -587,6 +587,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define rotr(x, n) ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), (n) * 8, (vdi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
 #  elif defined(__AVX512VBMI__)
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
@@ -615,6 +616,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define rotr(x, n) ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), (n) * 16, (vdi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
                              (vsi_t)B(pshufhw, _mask, \
                                       B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -402,9 +402,12 @@ OVR(packssdw);
 OVR(packsswb);
 OVR(packusdw);
 OVR(packuswb);
+OVR(palignr);
+OVR(pmaddubsw);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
+OVR(pmulhrsw);
 OVR(pmulhuw);
 OVR(pmulhw);
 OVR(pmullw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -435,7 +435,10 @@ static const struct ext0f38_table {
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x01 ... 0x03] = { .simd_size = simd_packed_int },
+    [0x04] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x05 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x0b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -534,7 +537,8 @@ static const struct ext0f3a_table {
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc, .d8s = d8s_dq },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
-    [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
+    [0x0e] = { .simd_size = simd_packed_int },
+    [0x0f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
@@ -6856,6 +6860,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x04): /* vpmaddubsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6874,6 +6879,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0b): /* vpmulhrsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
@@ -9329,6 +9335,10 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 4;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0f): /* vpalignr $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        goto avx512bw_imm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x14): /* pextrb $imm8,xmm,r/m */
     case X86EMUL_OPC_66(0x0f3a, 0x15): /* pextrw $imm8,xmm,r/m */
     case X86EMUL_OPC_66(0x0f3a, 0x16): /* pextr{d,q} $imm8,xmm,r/m */




* [PATCH v6 27/42] x86emul: support AVX512{F, ER} reciprocal insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (25 preceding siblings ...)
  2018-12-06 10:04   ` [PATCH v6 26/42] x86emul: support remaining AVX512BW " Jan Beulich
@ 2018-12-06 10:04   ` Jan Beulich
  2018-12-06 10:05   ` [PATCH v6 28/42] x86emul: support AVX512F floating point manipulation insns Jan Beulich
                     ` (14 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:04 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also include the only other AVX512ER insn pair, VEXP2P{D,S}.

Note that despite the replacement of the SHA insns' table slots there's
no need to special-case their decoding: their insn-specific code already
sets op_bytes (as was required due to simd_other), and TwoOp is of no
relevance for legacy-encoded SIMD insns.

The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
to be on the safe side. The SDM does not clarify behavior there, and
it's even more ambiguous here (without AVX512VL in the picture).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base. AVX512ER tests now also successfully run.
v5: New.
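
As a reminder of what the 14 / 28 suffixes express: VRCP14* / VRSQRT14*
guarantee a relative error below 2^-14, while the AVX512ER forms stay
below 2^-28. One Newton-Raphson step roughly squares the accuracy of a
reciprocal estimate, which is how the 14-bit forms are typically used;
a sketch (names are made up):

/* Refine est ~= 1/a, e.g. a vrcp14ss result; error roughly squares. */
static float recip_refine(float a, float est)
{
    return est * (2.0f - a * est);
}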

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -72,6 +72,9 @@ avx512bw-flts :=
 avx512dq-vecs := $(avx512f-vecs)
 avx512dq-ints := $(avx512f-ints)
 avx512dq-flts := $(avx512f-flts)
+avx512er-vecs := 64
+avx512er-ints :=
+avx512er-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -278,10 +278,14 @@ static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rcp14,        66, 0f38, 4c,    vl,     sd, vl),
+    INSN(rcp14,        66, 0f38, 4d,    el,     sd, el),
     INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
     INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
     INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
+    INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
+    INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -477,6 +481,14 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512er_512[] = {
+    INSN(exp2,    66, 0f38, c8, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, ca, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, cb, el, sd, el),
+    INSN(rsqrt28, 66, 0f38, cc, vl, sd, vl),
+    INSN(rsqrt28, 66, 0f38, cd, el, sd, el),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -837,5 +849,6 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -210,9 +210,23 @@ static inline vec_t movlhps(vec_t x, vec
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -263,6 +277,13 @@ static inline vec_t movlhps(vec_t x, vec
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14ps, _mask, x, undef(), ~0)
+#  endif
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
@@ -318,6 +339,13 @@ static inline vec_t movlhps(vec_t x, vec
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14pd, _mask, x, undef(), ~0)
+#  endif
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -178,14 +178,20 @@ DECL_OCTET(half);
 /* Sadly there are a few exceptions to the general naming rules. */
 # define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
 # define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+# define __builtin_ia32_exp2pd512_mask __builtin_ia32_exp2pd_mask
+# define __builtin_ia32_exp2ps512_mask __builtin_ia32_exp2ps_mask
 # define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
 # define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
 # define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
 # define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 # define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 # define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+# define __builtin_ia32_rcp28pd512_mask __builtin_ia32_rcp28pd_mask
+# define __builtin_ia32_rcp28ps512_mask __builtin_ia32_rcp28ps_mask
 # define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
 # define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
+# define __builtin_ia32_rsqrt28pd512_mask __builtin_ia32_rsqrt28pd_mask
+# define __builtin_ia32_rsqrt28ps512_mask __builtin_ia32_rsqrt28ps_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -24,6 +24,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f.h"
 #include "avx512bw.h"
 #include "avx512dq.h"
+#include "avx512er.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -106,6 +107,11 @@ static bool simd_check_avx512dq_vl(void)
     return cpu_has_avx512dq && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx512er(void)
+{
+    return cpu_has_avx512er;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -327,6 +333,10 @@ static const struct {
     AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
     AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
     AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
+    SIMD(AVX512ER f32 scalar,avx512er,        f4),
+    SIMD(AVX512ER f32x16,    avx512er,      64f4),
+    SIMD(AVX512ER f64 scalar,avx512er,        f8),
+    SIMD(AVX512ER f64x8,     avx512er,      64f8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -134,6 +134,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_bmi2       cp.feat.bmi2
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
+#define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -471,6 +471,10 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -510,7 +514,12 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
-    [0xc8 ... 0xcd] = { .simd_size = simd_other },
+    [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xc9] = { .simd_size = simd_other },
+    [0xca] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
     [0xf0] = { .two_op = 1 },
@@ -1873,6 +1882,7 @@ static bool vcpu_has(
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
+#define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
@@ -6153,6 +6163,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -9064,12 +9076,17 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
-        host_and_vcpu_must_have(avx512f);
+    simd_zmm_scalar_sae:
         if ( ea.type == OP_MEM )
         {
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
+            host_and_vcpu_must_have(avx512f);
             generate_exception_if(evex.br, EXC_UD);
             avx512_vlen_check(true);
         }
+        else
+            host_and_vcpu_must_have(avx512f);
         goto simd_zmm;
 
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */
@@ -9082,6 +9099,18 @@ x86_emulate(
         op_bytes = 16;
         goto simd_0f38_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc8): /* vexp2p{s,d} zmm/m512,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xca): /* vrcp28p{s,d} zmm/m512,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcc): /* vrsqrt28p{s,d} zmm/m512,zmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        generate_exception_if(ea.type == OP_MEM && evex.lr != 2, EXC_UD);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcb): /* vrcp28s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcd): /* vrsqrt28s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        goto simd_zmm_scalar_sae;
+
     case X86EMUL_OPC(0x0f38, 0xf0): /* movbe m,r */
     case X86EMUL_OPC(0x0f38, 0xf1): /* movbe r,m */
         vcpu_must_have(movbe);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -103,6 +103,7 @@
 #define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
+#define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)




* [PATCH v6 28/42] x86emul: support AVX512F floating point manipulation insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (26 preceding siblings ...)
  2018-12-06 10:04   ` [PATCH v6 27/42] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
@ 2018-12-06 10:05   ` Jan Beulich
  2018-12-06 10:05   ` [PATCH v6 29/42] x86emul: support AVX512DQ " Jan Beulich
                     ` (13 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:05 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
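
The simd.c test added below relies on a round-trip property of these insns,
which can be sketched with plain libm calls (a scalar model, valid for
positive, finite, normal inputs): vgetexp yields floor(log2(|x|)), vgetmant
with imm8 zero yields the mantissa normalised to [1,2), and vscalef computes
x * 2^floor(y), so scalef(getmant(x), getexp(x)) reproduces x.

#include <assert.h>
#include <math.h>

static double model_getexp(double x)  { return floor(log2(fabs(x))); }
static double model_getmant(double x) { return x / exp2(model_getexp(x)); }
static double model_scalef(double x, double y) { return x * exp2(floor(y)); }

int main(void)
{
    /* 3.5^k remains exactly representable here, so equality is exact. */
    for ( double x = 1; x < 1e6; x *= 3.5 )
        assert(model_scalef(model_getmant(x), model_getexp(x)) == x);
    return 0;
}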

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -140,6 +140,8 @@ static const struct test avx512f_all[] =
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
     INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
+    INSN(fixupimm,     66, 0f3a, 54,    vl,     sd, vl),
+    INSN(fixupimm,     66, 0f3a, 55,    el,     sd, el),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
     INSN(fmadd213,     66, 0f38, a8,    vl,     sd, vl),
@@ -170,6 +172,10 @@ static const struct test avx512f_all[] =
     INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
     INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
     INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
+    INSN(getexp,       66, 0f38, 42,    vl,     sd, vl),
+    INSN(getexp,       66, 0f38, 43,    el,     sd, el),
+    INSN(getmant,      66, 0f3a, 26,    vl,     sd, vl),
+    INSN(getmant,      66, 0f3a, 27,    el,     sd, el),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
@@ -286,6 +292,8 @@ static const struct test avx512f_all[] =
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
     INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
     INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
+    INSN(scalef,       66, 0f38, 2c,    vl,     sd, vl),
+    INSN(scalef,       66, 0f38, 2d,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -174,6 +174,11 @@ static inline bool _to_bool(byte_vec_t b
     asm ( op : [out] "=&x" (r_) : [in] "m" (x) ); \
     (vec_t){ r_[0] }; \
 })
+# define scalar_2op(x, y, op) ({ \
+    typeof((x)[0]) __attribute__((vector_size(16))) r_ = { x[0] }; \
+    asm ( op : [out] "=&x" (r_) : [in1] "[out]" (r_), [in2] "m" (y) ); \
+    (vec_t){ r_[0] }; \
+})
 #endif
 
 #if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
@@ -210,6 +215,8 @@ static inline vec_t movlhps(vec_t x, vec
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  define getexp(x) scalar_1op(x, "vgetexpss %[in], %[out], %[out]")
+#  define getmant(x) scalar_1op(x, "vgetmantss $0, %[in], %[out], %[out]")
 #  ifdef __AVX512ER__
 #   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
@@ -217,9 +224,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
 #  endif
+#  define scale(x, y) scalar_2op(x, y, "vscalefss %[in2], %[in1], %[out]")
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  define getexp(x) scalar_1op(x, "vgetexpsd %[in], %[out], %[out]")
+#  define getmant(x) scalar_1op(x, "vgetmantsd $0, %[in], %[out], %[out]")
 #  ifdef __AVX512ER__
 #   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
@@ -227,6 +237,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
 #  endif
+#  define scale(x, y) scalar_2op(x, y, "vscalefsd %[in2], %[in1], %[out]")
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -274,9 +285,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
 #   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  define getexp(x) BR(getexpps, _mask, x, undef(), ~0)
+#  define getmant(x) BR(getmantps, _mask, x, 0, undef(), ~0)
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
 #   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
@@ -336,9 +350,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
 #   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  define getexp(x) BR(getexppd, _mask, x, undef(), ~0)
+#  define getmant(x) BR(getmantpd, _mask, x, 0, undef(), ~0)
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
 #   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
@@ -1766,6 +1783,28 @@ int simd_test(void)
 # endif
 #endif
 
+#if defined(getexp) && defined(getmant)
+    touch(src);
+    x = getmant(src);
+    touch(src);
+    y = getexp(src);
+    touch(src);
+    for ( j = i = 0; i < ELEM_COUNT; ++i )
+    {
+        if ( y[i] != j ) return __LINE__;
+
+        if ( !((i + 1) & (i + 2)) )
+            ++j;
+
+        if ( !(i & (i + 1)) && x[i] != 1 ) return __LINE__;
+    }
+# ifdef scale
+    touch(y);
+    z = scale(x, y);
+    if ( !eq(src, z) ) return __LINE__;
+# endif
+#endif
+
 #if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
     (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3895,6 +3895,44 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vfixupimmpd $0,8(%edx){1to8},%zmm3,%zmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vfixupimmpd);
+        static const struct {
+            double d[4];
+        }
+        src = { { -1, 0, 1, 2 } },
+        dst = { { 3, 4, 5, 6 } },
+        out = { { .5, -1, 90, 2 } };
+
+        asm volatile ( "vbroadcastf64x4 %1, %%zmm3\n\t"
+                       "vbroadcastf64x4 %2, %%zmm4\n"
+                       put_insn(vfixupimmpd,
+                                "vfixupimmpd $0, 8(%0)%{1to8%}, %%zmm3, %%zmm4")
+                       :: "d" (NULL), "m" (src), "m" (dst) );
+
+        set_insn(vfixupimmpd);
+        /*
+         * Nibble (token) mapping (unused ones simply set to zero):
+         * 2 (ZERO)    ->  -1 (0x9)
+         * 3 (POS_ONE) ->  90 (0xc)
+         * 6 (NEG)     -> 1/2 (0xb)
+         * 7 (POS)     -> src (0x1)
+         */
+        res[2] = 0x1b00c900;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovupd %%zmm4, %0" : "=m" (res[0]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(vfixupimmpd) ||
+             memcmp(res + 0, &out, sizeof(out)) ||
+             memcmp(res + 8, &out, sizeof(out)) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
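
To make the nibble mapping in the comment above concrete: the insn
classifies each source element into a token, picks that token's nibble from
the corresponding 32-bit table element, and maps the nibble to a response.
A scalar sketch, covering only the tokens and responses this test uses
(numbering per the SDM):

#include <assert.h>

enum token { ZERO = 2, POS_ONE = 3, NEG = 6, POS = 7 };

static double fixup1(double src, double dst, unsigned int table)
{
    enum token t = !src ? ZERO : src == 1 ? POS_ONE : src < 0 ? NEG : POS;

    switch ( (table >> (t * 4)) & 0xf )
    {
    case 0x1: return src;   /* pass source through */
    case 0x9: return -1.0;
    case 0xb: return 0.5;
    case 0xc: return 90.0;
    }
    return dst;             /* 0x0: destination unchanged */
}

int main(void)
{
    static const double in[4] = { -1, 0, 1, 2 }, out[4] = { .5, -1, 90, 2 };

    for ( unsigned int i = 0; i < 4; ++i )
        assert(fixup1(in[i], 3 + i, 0x1b00c900) == out[i]);
    return 0;
}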
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -459,7 +459,8 @@ static const struct ext0f38_table {
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
+    [0x2c] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x2d] = { .simd_size = simd_packed_fp, .d8s = d8s_dq },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
@@ -470,6 +471,8 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x42] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -563,6 +566,8 @@ static const struct ext0f3a_table {
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x26] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x27] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
     [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
@@ -577,6 +582,8 @@ static const struct ext0f3a_table {
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4c] = { .simd_size = simd_packed_int, .four_op = 1 },
+    [0x54] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x55] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -2677,6 +2684,10 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x2d): /* vscalefs{d,s} */
+        state->simd_size = simd_scalar_vexw;
+        break;
+
     case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
     case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
     case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
@@ -9041,6 +9052,8 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2c): /* vscalefp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x42): /* vgetexpp{s,d} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -9064,6 +9077,8 @@ x86_emulate(
             avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2d): /* vscalefs{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x43): /* vgetexps{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
@@ -9627,6 +9642,23 @@ x86_emulate(
         op_bytes = 4;
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM || !evex.br )
+            avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type == OP_MEM )
+        {
+            generate_exception_if(evex.br, EXC_UD);
+            avx512_vlen_check(true);
+        }
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 29/42] x86emul: support AVX512DQ floating point manipulation insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (27 preceding siblings ...)
  2018-12-06 10:05   ` [PATCH v6 28/42] x86emul: support AVX512F floating point manipulation insns Jan Beulich
@ 2018-12-06 10:05   ` Jan Beulich
  2018-12-06 10:06   ` [PATCH v6 30/42] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
                     ` (12 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:05 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512DQ in the insn emulator.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
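
An aside on the vfpclass test added below: with the SDM's imm8 class bit
assignment (bit 1: +0, bit 2: -0, bit 6: negative finite), 0x46 selects
exactly the three classes named in the test's comment, and the expected
mask value 0xbdef can be derived like this (a sketch):

#include <assert.h>

int main(void)
{
    unsigned int mask = 0;

    /* Elements 0..14 hold three groups of { +0, -0, -DEN, -FIN, +FIN };
     * element 15 is +0.  Everything but the +FIN entries matches 0x46
     * (a negative denormal matches by way of bit 6). */
    for ( unsigned int i = 0; i < 15; ++i )
        if ( i % 5 != 4 )      /* +FIN sits at group offset 4 */
            mask |= 1u << i;
    mask |= 1u << 15;          /* trailing +0 */

    assert(mask == 0xbdef);
    return 0;
}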

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -457,11 +457,17 @@ static const struct test avx512dq_all[]
     INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
+    INSN(fpclass,        66, 0f3a, 66,   vl, sd, vl),
+    INSN(fpclass,        66, 0f3a, 67,   el, sd, el),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
 //       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
+    INSN(range,          66, 0f3a, 50,   vl, sd, vl),
+    INSN(range,          66, 0f3a, 51,   el, sd, el),
+    INSN(reduce,         66, 0f3a, 56,   vl, sd, vl),
+    INSN(reduce,         66, 0f3a, 57,   el, sd, el),
     INSN_PFP(xor,              0f, 57),
 };
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -285,10 +285,18 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
 #   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  ifdef __AVX512DQ__
+#   define frac(x) B(reduceps, _mask, x, 0b00001011, undef(), ~0)
+#  endif
 #  define getexp(x) BR(getexpps, _mask, x, undef(), ~0)
 #  define getmant(x) BR(getmantps, _mask, x, 0, undef(), ~0)
-#  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
-#  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define max(x, y) BR(rangeps, _mask, x, y, 0b0101, undef(), ~0)
+#   define min(x, y) BR(rangeps, _mask, x, y, 0b0100, undef(), ~0)
+#  else
+#   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  endif
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
 #  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
@@ -350,10 +358,18 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
 #   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  ifdef __AVX512DQ__
+#   define frac(x) B(reducepd, _mask, x, 0b00001011, undef(), ~0)
+#  endif
 #  define getexp(x) BR(getexppd, _mask, x, undef(), ~0)
 #  define getmant(x) BR(getmantpd, _mask, x, 0, undef(), ~0)
-#  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
-#  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define max(x, y) BR(rangepd, _mask, x, y, 0b0101, undef(), ~0)
+#   define min(x, y) BR(rangepd, _mask, x, y, 0b0100, undef(), ~0)
+#  else
+#   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  endif
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
 #  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3933,6 +3933,39 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+
+    printf("%-40s", "Testing vfpclasspsz $0x46,64(%edx),%k2...");
+    if ( stack_exec && cpu_has_avx512dq )
+    {
+        decl_insn(vfpclassps);
+
+        asm volatile ( put_insn(vfpclassps,
+                                /* 0x46: check for +/- 0 and neg. */
+                                "vfpclasspsz $0x46, 64(%0), %%k2")
+                       :: "d" (NULL) );
+
+        set_insn(vfpclassps);
+        for ( i = 0; i < 3; ++i )
+        {
+            res[16 + i * 5 + 0] = 0x00000000; /* +0 */
+            res[16 + i * 5 + 1] = 0x80000000; /* -0 */
+            res[16 + i * 5 + 2] = 0x80000001; /* -DEN */
+            res[16 + i * 5 + 3] = 0xff000000; /* -FIN */
+            res[16 + i * 5 + 4] = 0x7f000000; /* +FIN */
+        }
+        res[31] = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vfpclassps) )
+            goto fail;
+        asm volatile ( "kmovw %%k2, %0" : "=g" (rc) );
+        if ( rc != 0xbdef )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -582,10 +582,16 @@ static const struct ext0f3a_table {
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4c] = { .simd_size = simd_packed_int, .four_op = 1 },
+    [0x50] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x51] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x54] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x55] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x56] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x57] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x66] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x67] = { .simd_size = simd_scalar_vexw, .two_op = 1, .d8s = d8s_dq },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -9642,6 +9648,10 @@ x86_emulate(
         op_bytes = 4;
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x50): /* vrangep{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x56): /* vreducep{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512dq);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512f);
@@ -9649,11 +9659,16 @@ x86_emulate(
             avx512_vlen_check(false);
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x51): /* vranges{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x57): /* vreduces{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512dq);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
         host_and_vcpu_must_have(avx512f);
         if ( ea.type == OP_MEM )
         {
+    simd_imm8_zmm_scalar:
             generate_exception_if(evex.br, EXC_UD);
             avx512_vlen_check(true);
         }
@@ -9806,6 +9821,14 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x66): /* vfpclassp{d,s} $imm8,[xyz]mm/mem,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x67): /* vfpclasss{d,s} $imm8,[xyz]mm/mem,k{k} */
+        host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( b & 1 )
+            goto simd_imm8_zmm_scalar;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC(0x0f3a, 0xcc):     /* sha1rnds4 $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(sha);
         op_bytes = 16;





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 30/42] x86emul: support AVX512{F, _VBMI2} compress/expand insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (28 preceding siblings ...)
  2018-12-06 10:05   ` [PATCH v6 29/42] x86emul: support AVX512DQ " Jan Beulich
@ 2018-12-06 10:06   ` Jan Beulich
  2018-12-06 10:07   ` [PATCH v6 31/42] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
                     ` (11 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:06 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base. Add tests for the byte/word forms.
v5: New.
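
The store half of these insns can be modelled as below (a scalar sketch of
vpcompressd; the other flavors differ only in element type). It also shows
why the op_mask compaction in the x86_emulate.c hunk further down is
legitimate: the bytes actually written always form one dense range of
n = popcount(mask) elements, and elements with a clear mask bit are never
accessed at all, which is what the boundary placement in the tests verifies.

#include <stdint.h>

static unsigned int compress32(uint32_t *dst, const uint32_t *src,
                               unsigned int elems, uint64_t mask)
{
    unsigned int n = 0;

    for ( unsigned int i = 0; i < elems; ++i )
        if ( mask & (1ULL << i) )
            dst[n++] = src[i];   /* stores are dense: no gaps for 0 bits */

    return n;                    /* compacted mask: low n bits all set */
}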

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,6 +109,7 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(compress,     66, 0f38, 8a,    vl,     sd, el),
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
@@ -140,6 +141,7 @@ static const struct test avx512f_all[] =
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
     INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
+    INSN(expand,       66, 0f38, 88,    vl,     sd, el),
     INSN(fixupimm,     66, 0f3a, 54,    vl,     sd, vl),
     INSN(fixupimm,     66, 0f3a, 55,    el,     sd, el),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
@@ -214,6 +216,7 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(pcompress,    66, 0f38, 8b,    vl,     dq, el),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
     INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
@@ -222,6 +225,7 @@ static const struct test avx512f_all[] =
     INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
+    INSN(pexpand,      66, 0f38, 89,    vl,     dq, el),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -509,6 +513,11 @@ static const struct test avx512_vbmi_all
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
 
+static const struct test avx512_vbmi2_all[] = {
+    INSN(pcompress, 66, 0f38, 63, vl, bw, el),
+    INSN(pexpand,   66, 0f38, 62, vl, bw, el),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -865,4 +874,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 512);
     RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
+    RUN(avx512_vbmi2, all);
 }
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3966,6 +3966,227 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    /*
+     * The following compress/expand tests not only make sure the accessed
+     * data is correct, but also verify (by placing operands on the mapping
+     * boundaries) that elements controlled by clear mask bits don't get
+     * accessed.
+     */
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vpcompressd);
+        decl_insn(vpcompressq);
+        decl_insn(vpexpandd);
+        decl_insn(vpexpandq);
+        static const struct {
+            unsigned int d[16];
+        } dsrc = { { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 } };
+        static const struct {
+            unsigned long long q[8];
+        } qsrc = { { 0, 1, 2, 3, 4, 5, 6, 7 } };
+        unsigned int *ptr = res + MMAP_SZ / sizeof(*res) - 32;
+
+        printf("%-40s", "Testing vpcompressd %zmm1,24*4(%ecx){%k2}...");
+        asm volatile ( "kmovw %1, %%k2\n\t"
+                       "vmovdqu32 %2, %%zmm1\n"
+                       put_insn(vpcompressd,
+                                "vpcompressd %%zmm1, 24*4(%0)%{%%k2%}")
+                       :: "c" (NULL), "r" (0x55aa), "m" (dsrc) );
+
+        memset(ptr, 0xdb, 32 * 4);
+        set_insn(vpcompressd);
+        regs.ecx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressd) ||
+             memcmp(ptr, ptr + 8, 16 * 4) )
+            goto fail;
+        for ( i = 0; i < 4; ++i )
+            if ( ptr[24 + i] != 2 * i + 1 )
+                goto fail;
+        for ( ; i < 8; ++i )
+            if ( ptr[24 + i] != 2 * i )
+                goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandd 8*4(%edx),%zmm3{%k2}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm3, %%zmm3, %%zmm3\n"
+                       put_insn(vpexpandd,
+                                "vpexpandd 8*4(%0), %%zmm3%{%%k2%}%{z%}")
+                       :: "d" (NULL) );
+        set_insn(vpexpandd);
+        regs.edx = (unsigned long)(ptr + 16);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandd) )
+            goto fail;
+        asm ( "vmovdqa32 %%zmm1, %%zmm2%{%%k2%}%{z%}\n\t"
+              "vpcmpeqd %%zmm2, %%zmm3, %%k0\n\t"
+              "kmovw %%k0, %0"
+              : "=r" (rc) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpcompressq %zmm4,12*8(%edx){%k3}...");
+        asm volatile ( "kmovw %1, %%k3\n\t"
+                       "vmovdqu64 %2, %%zmm4\n"
+                       put_insn(vpcompressq,
+                                "vpcompressq %%zmm4, 12*8(%0)%{%%k3%}")
+                       :: "d" (NULL), "r" (0x5a), "m" (qsrc) );
+
+        memset(ptr, 0xdb, 16 * 8);
+        set_insn(vpcompressq);
+        regs.edx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressq) ||
+             memcmp(ptr, ptr + 8, 8 * 8) )
+            goto fail;
+        for ( i = 0; i < 2; ++i )
+        {
+            if ( ptr[(12 + i) * 2] != 2 * i + 1 ||
+                 ptr[(12 + i) * 2 + 1] )
+                goto fail;
+        }
+        for ( ; i < 4; ++i )
+        {
+            if ( ptr[(12 + i) * 2] != 2 * i ||
+                 ptr[(12 + i) * 2 + 1] )
+                goto fail;
+        }
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandq 4*8(%ecx),%zmm5{%k3}{z}...");
+        asm volatile ( "vpternlogq $0x81, %%zmm5, %%zmm5, %%zmm5\n"
+                       put_insn(vpexpandq,
+                                "vpexpandq 4*8(%0), %%zmm5%{%%k3%}%{z%}")
+                       :: "c" (NULL) );
+        set_insn(vpexpandq);
+        regs.ecx = (unsigned long)(ptr + 16);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandq) )
+            goto fail;
+        asm ( "vmovdqa64 %%zmm4, %%zmm6%{%%k3%}%{z%}\n\t"
+              "vpcmpeqq %%zmm5, %%zmm6, %%k0\n\t"
+              "kmovw %%k0, %0"
+              : "=r" (rc) );
+        if ( rc != 0xff )
+            goto fail;
+        printf("okay\n");
+    }
+
+#if __GNUC__ > 7 /* can't check for __AVX512VBMI2__ here */
+    if ( stack_exec && cpu_has_avx512_vbmi2 )
+    {
+        decl_insn(vpcompressb);
+        decl_insn(vpcompressw);
+        decl_insn(vpexpandb);
+        decl_insn(vpexpandw);
+        static const struct {
+            unsigned char b[64];
+        } bsrc = { { 0,  1,  2,  3,  4,  5,  6,  7,
+                     8,  9, 10, 11, 12, 13, 14, 15,
+                    16, 17, 18, 19, 20, 21, 22, 23,
+                    24, 25, 26, 27, 28, 29, 30, 31,
+                    32, 33, 34, 35, 36, 37, 38, 39,
+                    40, 41, 42, 43, 44, 45, 46, 47,
+                    48, 49, 50, 51, 52, 53, 54, 55,
+                    56, 57, 58, 59, 60, 61, 62, 63 } };
+        static const struct {
+            unsigned short w[32];
+        } wsrc = { { 0,  1,  2,  3,  4,  5,  6,  7,
+                     8,  9, 10, 11, 12, 13, 14, 15,
+                    16, 17, 18, 19, 20, 21, 22, 23,
+                    24, 25, 26, 27, 28, 29, 30, 31 } };
+        unsigned char *ptr = (void *)res + MMAP_SZ - 128;
+        unsigned long long w = 0x55555555aaaaaaaaULL;
+
+        printf("%-40s", "Testing vpcompressb %zmm1,96*1(%ecx){%k2}...");
+        asm volatile ( "kmovq %1, %%k2\n\t"
+                       "vmovdqu8 %2, %%zmm1\n"
+                       put_insn(vpcompressb,
+                                "vpcompressb %%zmm1, 96*1(%0)%{%%k2%}")
+                       :: "c" (NULL), "m" (w), "m" (bsrc) );
+
+        memset(ptr, 0xdb, 128 * 1);
+        set_insn(vpcompressb);
+        regs.ecx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressb) ||
+             memcmp(ptr, ptr + 32, 64 * 1) )
+            goto fail;
+        for ( i = 0; i < 16; ++i )
+            if ( ptr[96 + i] != 2 * i + 1 )
+                goto fail;
+        for ( ; i < 32; ++i )
+            if ( ptr[96 + i] != 2 * i )
+                goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandb 32*1(%edx),%zmm3{%k2}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm3, %%zmm3, %%zmm3\n"
+                       put_insn(vpexpandb,
+                                "vpexpandb 32*1(%0), %%zmm3%{%%k2%}%{z%}")
+                       :: "d" (NULL) );
+        set_insn(vpexpandb);
+        regs.edx = (unsigned long)(ptr + 64);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandb) )
+            goto fail;
+        asm ( "vmovdqu8 %%zmm1, %%zmm2%{%%k2%}%{z%}\n\t"
+              "vpcmpeqb %%zmm2, %%zmm3, %%k0\n\t"
+              "kmovq %%k0, %0"
+              : "=m" (w) );
+        if ( w != 0xffffffffffffffffULL )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpcompressw %zmm4,48*2(%edx){%k3}...");
+        asm volatile ( "kmovd %1, %%k3\n\t"
+                       "vmovdqu16 %2, %%zmm4\n"
+                       put_insn(vpcompressw,
+                                "vpcompressw %%zmm4, 48*2(%0)%{%%k3%}")
+                       :: "d" (NULL), "r" (0x5555aaaa), "m" (wsrc) );
+
+        memset(ptr, 0xdb, 64 * 2);
+        set_insn(vpcompressw);
+        regs.edx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressw) ||
+             memcmp(ptr, ptr + 32, 32 * 2) )
+            goto fail;
+        for ( i = 0; i < 8; ++i )
+        {
+            if ( ptr[(48 + i) * 2] != 2 * i + 1 ||
+                 ptr[(48 + i) * 2 + 1] )
+                goto fail;
+        }
+        for ( ; i < 16; ++i )
+        {
+            if ( ptr[(48 + i) * 2] != 2 * i ||
+                 ptr[(48 + i) * 2 + 1] )
+                goto fail;
+        }
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandw 16*2(%ecx),%zmm5{%k3}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm5, %%zmm5, %%zmm5\n"
+                       put_insn(vpexpandw,
+                                "vpexpandw 16*2(%0), %%zmm5%{%%k3%}%{z%}")
+                       :: "c" (NULL) );
+        set_insn(vpexpandw);
+        regs.ecx = (unsigned long)(ptr + 64);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandw) )
+            goto fail;
+        asm ( "vmovdqu16 %%zmm4, %%zmm6%{%%k3%}%{z%}\n\t"
+              "vpcmpeqw %%zmm5, %%zmm6, %%k0\n\t"
+              "kmovq %%k0, %0"
+              : "=m" (w) );
+        if ( w != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+#endif
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -59,6 +59,9 @@
     (type *)((char *)mptr__ - offsetof(type, member)); \
 })
 
+#define hweight32 __builtin_popcount
+#define hweight64 __builtin_popcountll
+
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
 extern uint32_t mxcsr_mask;
@@ -138,6 +141,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
+#define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -482,6 +482,8 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
+    [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -489,6 +491,10 @@ static const struct ext0f38_table {
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x88] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_dq },
+    [0x89] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_dq },
+    [0x8a] = { .simd_size = simd_packed_fp, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
+    [0x8b] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
@@ -1900,6 +1906,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
+#define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -8851,6 +8858,36 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x62): /* vpexpand{b,w} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x63): /* vpcompress{b,w} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        elem_bytes = 1 << evex.w;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x88): /* vexpandp{d,s} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x89): /* vpexpand{d,q} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8a): /* vcompressp{d,s} [xyz]mm,[xyz]mm/mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8b): /* vpcompress{d,q} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.br, EXC_UD);
+        avx512_vlen_check(false);
+        /*
+         * For the respective code below the main switch() to work we need to
+         * compact op_mask here: Memory accesses are non-sparse even if the
+         * mask register has sparsely set bits.
+         */
+        if ( likely(fault_suppression) )
+        {
+            n = 1 << ((b & 8 ? 2 : 4) + evex.lr - evex.w);
+            EXPECT(elem_bytes > 0);
+            ASSERT(op_bytes == n * elem_bytes);
+            op_mask &= ~0ULL >> (64 - n);
+            n = hweight64(op_mask);
+            op_bytes = n * elem_bytes;
+            if ( n )
+                op_mask = ~0ULL >> (64 - n);
+        }
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -110,6 +110,7 @@
 
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
+#define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -228,6 +228,7 @@ XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
+XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
 
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -266,10 +266,10 @@ def crunch_numbers(state):
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
                   AVX512_VPOPCNTDQ],
 
-        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask
         # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_VBMI],
+        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 31/42] x86emul: support remaining misc AVX512{F, BW} insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (29 preceding siblings ...)
  2018-12-06 10:06   ` [PATCH v6 30/42] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
@ 2018-12-06 10:07   ` Jan Beulich
  2018-12-06 10:07   ` [PATCH v6 32/42] x86emul: support AVX512F gather insns Jan Beulich
                     ` (10 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512BW in the insn emulator, and leaves just
the scatter/gather insns open in the AVX512F set.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
---
TBD: The *blendm* inline functions don't reliably produce the intended
     insns, as the respective masked moves are just as good a fit for the
     compiler when it looks for a match for the intended operation. We'd
     need to switch to inline assembly if we wanted to guarantee that
     those insns actually get tested. Thoughts?
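
For illustration, such an inline assembly wrapper might look as below.
This is a sketch only - it assumes a compiler understanding the "v" vector
and "Yk" mask register constraints, and it hasn't been put through the
harness:

#include <immintrin.h>

/* Force vblendmps rather than leaving instruction selection to the
 * compiler: a set mask bit selects the element from y, a clear one the
 * element from x (merging form). */
static inline __m512 blendm_ps(__m512 x, __m512 y, __mmask16 m)
{
    __m512 r;

    asm ( "vblendmps %[y], %[x], %[r]%{%[m]%}"
          : [r] "=v" (r)
          : [x] "v" (x), [y] "v" (y), [m] "Yk" (m) );
    return r;
}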

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -105,6 +105,8 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN(align,        66, 0f3a, 03,    vl,     dq, vl),
+    INSN(blendm,       66, 0f38, 65,    vl,     sd, vl),
     INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
@@ -207,6 +209,7 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(pblendm,      66, 0f38, 64,    vl,     dq, vl),
 //       pbroadcast,   66, 0f38, 7c,          dq64
     INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
     INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
@@ -354,6 +357,7 @@ static const struct test avx512f_512[] =
 };
 
 static const struct test avx512bw_all[] = {
+    INSN(dbpsadbw,    66, 0f3a, 42,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
@@ -373,6 +377,7 @@ static const struct test avx512bw_all[]
     INSN(palignr,     66, 0f3a, 0f,    vl,    b, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pblendm,     66, 0f38, 66,    vl,   bw, vl),
     INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
 //       pbroadcastb, 66, 0f38, 7a,           b
     INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -297,7 +297,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  endif
-#  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define mix(x, y) B(blendmps_, _mask, x, y, (0b1010101010101010 & ALL_TRUE))
 #  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
@@ -370,7 +370,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  endif
-#  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define mix(x, y) B(blendmpd_, _mask, x, y, 0b10101010)
 #  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
@@ -564,8 +564,9 @@ static inline vec_t movlhps(vec_t x, vec
                              0b00011011, (vsi_t)undef(), ~0))
 #   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
-                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define mix(x, y) ((vec_t)B(blendmd_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b1010101010101010 & ((1 << ELEM_COUNT) - 1))))
+#  define rotr(x, n) ((vec_t)B(alignd, _mask, (vsi_t)(x), (vsi_t)(x), n, (vsi_t)undef(), ~0))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
@@ -602,7 +603,8 @@ static inline vec_t movlhps(vec_t x, vec
                              0b01001110, (vsi_t)undef(), ~0))
 #   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  define mix(x, y) ((vec_t)B(blendmq_, _mask, (vdi_t)(x), (vdi_t)(y), 0b10101010))
+#  define rotr(x, n) ((vec_t)B(alignq, _mask, (vdi_t)(x), (vdi_t)(x), n, (vdi_t)undef(), ~0))
 #  if VEC_SIZE == 32
 #   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
 #  elif VEC_SIZE == 64
@@ -654,8 +656,8 @@ static inline vec_t movlhps(vec_t x, vec
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
-                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define mix(x, y) ((vec_t)B(blendmb_, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b1010101010101010101010101010101010101010101010101010101010101010LL & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
@@ -687,8 +689,8 @@ static inline vec_t movlhps(vec_t x, vec
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
-                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define mix(x, y) ((vec_t)B(blendmw_, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b10101010101010101010101010101010 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
 #  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -484,6 +484,7 @@ static const struct ext0f38_table {
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
     [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
+    [0x64 ... 0x66] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -550,6 +551,7 @@ static const struct ext0f3a_table {
     [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
+    [0x03] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -581,8 +583,7 @@ static const struct ext0f3a_table {
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
-    [0x42] = { .simd_size = simd_packed_int },
-    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x42 ... 0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6189,6 +6190,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x64): /* vpblendm{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x65): /* vblendmp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
@@ -6918,6 +6921,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x0b): /* vpmulhrsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x66): /* vpblendm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.br, EXC_UD);
         elem_bytes = 1 << (b & 1);
@@ -8084,10 +8088,12 @@ x86_emulate(
         goto simd_0f_to_gpr;
 
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        fault_suppression = false;
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x03): /* valign{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_imm8_no_sae:
         host_and_vcpu_must_have(avx512f);
@@ -9422,6 +9428,9 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 4;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x42): /* vdbpsadbw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0f): /* vpalignr $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         goto avx512bw_imm;





^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 32/42] x86emul: support AVX512F gather insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (30 preceding siblings ...)
  2018-12-06 10:07   ` [PATCH v6 31/42] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
@ 2018-12-06 10:07   ` Jan Beulich
  2018-12-06 10:08   ` [PATCH v6 33/42] x86emul: add high register S/G test cases Jan Beulich
                     ` (9 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This requires getting modrm_reg and sib_index set correctly in the EVEX
case, to account for the high 16 [XYZ]MM registers. Extend the
adjustments to modrm_rm as well, such that x86_insn_modrm() would
correctly report register numbers (this was a latent issue only as we
don't currently have callers of that function which would care about an
EVEX case). The adjustment in turn requires dropping the assertion from
decode_gpr(), as we now need to actively mask off the high bit when a
GPR is meant. All other uses of modrm_reg and modrm_rm already get
suitably masked where necessary.

There was also an encoding mistake in the EVEX Disp8 test code, which
was benign (due to %rdx getting set to zero) to all non-vSIB tests as it
mistakenly encoded <disp8>(%rdx,%rdx) instead of <disp8>(%rdx,%riz). In
the vSIB case this meant <disp8>(%rdx,%zmm2) instead of the intended
<disp8>(%rdx,%zmm4).

Likewise the access count check wasn't entirely correct for the S/G
case: In the quad-word-index but dword-data case only half the number
of full vector elements get accessed.

As an unrelated change in the main test harness source file, distinguish
the "n/a" messages by bitness.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.
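
To make the SIB fix concrete, here is the byte split into its
scale/index/base fields (index 4 means "no index", i.e. %riz, only in the
GPR case; under vSIB it selects vector register 4):

#define SIB(scale, index, base) (((scale) << 6) | ((index) << 3) | (base))

/* Old, mistaken encoding: index 2, base 2, i.e. (%rdx,%rdx) for GPR
 * addressing but (%rdx,%zmm2) under vSIB. */
_Static_assert(SIB(0, 2, 2) == 0x12, "old encoding");
/* Intended encoding: index 4, base 2, i.e. (%rdx,%riz) for GPR addressing
 * and (%rdx,%zmm4) under vSIB. */
_Static_assert(SIB(0, 4, 2) == 0x22, "intended encoding");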

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -18,7 +18,7 @@ CFLAGS += $(CFLAGS_xeninclude)
 
 SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
 FMA := fma4 fma
-SG := avx2-sg
+SG := avx2-sg avx512f-sg avx512vl-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
 
 OPMASK := avx512f avx512dq avx512bw
@@ -66,6 +66,14 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512f-sg-vecs := 64
+avx512f-sg-idxs := 4 8
+avx512f-sg-ints := $(avx512f-ints)
+avx512f-sg-flts := $(avx512f-flts)
+avx512vl-sg-vecs := 16 32
+avx512vl-sg-idxs := $(avx512f-sg-idxs)
+avx512vl-sg-ints := $(avx512f-ints)
+avx512vl-sg-flts := $(avx512f-flts)
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -176,6 +176,8 @@ static const struct test avx512f_all[] =
     INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
     INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
     INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
+    INSN(gatherd,      66, 0f38, 92,    vl,     sd, el),
+    INSN(gatherq,      66, 0f38, 93,    vl,     sd, el),
     INSN(getexp,       66, 0f38, 42,    vl,     sd, vl),
     INSN(getexp,       66, 0f38, 43,    el,     sd, el),
     INSN(getmant,      66, 0f3a, 26,    vl,     sd, vl),
@@ -229,6 +231,8 @@ static const struct test avx512f_all[] =
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pexpand,      66, 0f38, 89,    vl,     dq, el),
+    INSN(pgatherd,     66, 0f38, 90,    vl,     dq, el),
+    INSN(pgatherq,     66, 0f38, 91,    vl,     dq, el),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -698,7 +702,7 @@ static void test_one(const struct test *
     instr[3] = evex.raw[2];
     instr[4] = test->opc;
     instr[5] = 0x44 | (test->ext << 3); /* ModR/M */
-    instr[6] = 0x12; /* SIB: base rDX, index none / xMM4 */
+    instr[6] = 0x22; /* SIB: base rDX, index none / xMM4 */
     instr[7] = 1; /* Disp8 */
     instr[8] = 0; /* immediate, if any */
 
@@ -718,7 +722,8 @@ static void test_one(const struct test *
          if ( accessed[i] )
              goto fail;
     for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
-         if ( accessed[i] != (sg ? vsz / esz : 1) )
+         if ( accessed[i] != (sg ? (vsz / esz) >> (test->opc & 1 & !evex.w)
+                                 : 1) )
              goto fail;
     for ( ; i < ARRAY_SIZE(accessed); ++i )
          if ( accessed[i] )
--- a/tools/tests/x86_emulator/simd-sg.c
+++ b/tools/tests/x86_emulator/simd-sg.c
@@ -35,13 +35,78 @@ typedef long long __attribute__((vector_
 #define ITEM_COUNT (VEC_SIZE / ELEM_SIZE < IVEC_SIZE / IDX_SIZE ? \
                     VEC_SIZE / ELEM_SIZE : IVEC_SIZE / IDX_SIZE)
 
-#if VEC_SIZE == 16
-# define to_bool(cmp) __builtin_ia32_ptestc128(cmp, (vec_t){} == 0)
-#else
-# define to_bool(cmp) __builtin_ia32_ptestc256(cmp, (vec_t){} == 0)
-#endif
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if ELEM_SIZE == 4
+#  if IDX_SIZE == 4 || defined(__AVX512VL__)
+#   define to_mask(msk) B(ptestmd, , (vsi_t)(msk), (vsi_t)(msk), ~0)
+#   define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+#  else
+#   define widen(x) __builtin_ia32_pmovzxdq512_mask((vsi_t)(x), (idi_t){}, ~0)
+#   define to_mask(msk) __builtin_ia32_ptestmq512(widen(msk), widen(msk), ~0)
+#   define eq(x, y) (__builtin_ia32_pcmpeqq512_mask(widen(x), widen(y), ~0) == ALL_TRUE)
+#  endif
+#  define BG_(dt, it, reg, mem, idx, msk, scl) \
+    __builtin_ia32_gather##it##dt(reg, mem, idx, to_mask(msk), scl)
+# else
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+#  define BG_(dt, it, reg, mem, idx, msk, scl) \
+    __builtin_ia32_gather##it##dt(reg, mem, idx, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), scl)
+# endif
+/*
+ * Instead of replicating the main IDX_SIZE conditional below three times, use
+ * a double layer of macro invocations, allowing for substitution of the
+ * respective relevant macro argument tokens.
+ */
+# define BG(dt, it, reg, mem, idx, msk, scl) BG_(dt, it, reg, mem, idx, msk, scl)
+# if VEC_MAX < 64
+/*
+ * The sub-512-bit built-ins have an extra "3" infix, presumably because the
+ * 512-bit names were chosen without the AVX512VL extension in mind (and hence
+ * making the latter collide with the AVX2 ones).
+ */
+#  define si 3si
+#  define di 3di
+# endif
+# if VEC_MAX == 16
+#  define v8df v2df
+#  define v8di v2di
+#  define v16sf v4sf
+#  define v16si v4si
+# elif VEC_MAX == 32
+#  define v8df v4df
+#  define v8di v4di
+#  define v16sf v8sf
+#  define v16si v8si
+# endif
+# if IDX_SIZE == 4
+#  if INT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16si, si, reg, mem, idx, msk, scl)
+#  elif INT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, si, (vdi_t)(reg), mem, idx, msk, scl))
+#  elif FLOAT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16sf, si, reg, mem, idx, msk, scl)
+#  elif FLOAT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) BG(v8df, si, reg, mem, idx, msk, scl)
+#  endif
+# elif IDX_SIZE == 8
+#  if INT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16si, di, reg, mem, (idi_t)(idx), msk, scl)
+#  elif INT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, di, (vdi_t)(reg), mem, (idi_t)(idx), msk, scl))
+#  elif FLOAT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16sf, di, reg, mem, (idi_t)(idx), msk, scl)
+#  elif FLOAT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) BG(v8df, di, reg, mem, (idi_t)(idx), msk, scl)
+#  endif
+# endif
+#elif defined(__AVX2__)
+# if VEC_SIZE == 16
+#  define to_bool(cmp) __builtin_ia32_ptestc128(cmp, (vec_t){} == 0)
+# else
+#  define to_bool(cmp) __builtin_ia32_ptestc256(cmp, (vec_t){} == 0)
+# endif
 
-#if defined(__AVX2__)
 # if VEC_MAX == 16
 #  if IDX_SIZE == 4
 #   if INT_SIZE == 4
@@ -111,6 +176,10 @@ typedef long long __attribute__((vector_
 # endif
 #endif
 
+#ifndef eq
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
 #define GLUE_(x, y) x ## y
 #define GLUE(x, y) GLUE_(x, y)
 
@@ -119,6 +188,7 @@ typedef long long __attribute__((vector_
 #define PUT8(n)  PUT4(n),   PUT4((n) +  4)
 #define PUT16(n) PUT8(n),   PUT8((n) +  8)
 #define PUT32(n) PUT16(n), PUT16((n) + 16)
+#define PUT64(n) PUT32(n), PUT32((n) + 32)
 
 const typeof((vec_t){}[0]) array[] = {
     GLUE(PUT, VEC_MAX)(1),
@@ -174,7 +244,7 @@ int sg_test(void)
 
     y = gather(full, array + ITEM_COUNT, -idx, full, ELEM_SIZE);
 #if ITEM_COUNT == ELEM_COUNT
-    if ( !to_bool(y == x - 1) )
+    if ( !eq(y, x - 1) )
         return __LINE__;
 #else
     for ( i = 0; i < ITEM_COUNT; ++i )
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,8 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512f-sg.h"
+#include "avx512vl-sg.h"
 #include "avx512bw.h"
 #include "avx512dq.h"
 #include "avx512er.h"
@@ -90,11 +92,13 @@ static bool simd_check_avx512f(void)
     return cpu_has_avx512f;
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
+#define simd_check_avx512f_sg simd_check_avx512f
 
 static bool simd_check_avx512f_vl(void)
 {
     return cpu_has_avx512f && cpu_has_avx512vl;
 }
+#define simd_check_avx512vl_sg simd_check_avx512f_vl
 
 static bool simd_check_avx512dq(void)
 {
@@ -291,6 +295,14 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    SIMD(AVX512F S/G f32[16x32], avx512f_sg, 64x4f4),
+    SIMD(AVX512F S/G f64[ 8x32], avx512f_sg, 64x4f8),
+    SIMD(AVX512F S/G f32[ 8x64], avx512f_sg, 64x8f4),
+    SIMD(AVX512F S/G f64[ 8x64], avx512f_sg, 64x8f8),
+    SIMD(AVX512F S/G i32[16x32], avx512f_sg, 64x4i4),
+    SIMD(AVX512F S/G i64[ 8x32], avx512f_sg, 64x4i8),
+    SIMD(AVX512F S/G i32[ 8x64], avx512f_sg, 64x8i4),
+    SIMD(AVX512F S/G i64[ 8x64], avx512f_sg, 64x8i8),
     AVX512VL(VL f32x4,        avx512f,      16f4),
     AVX512VL(VL f64x2,        avx512f,      16f8),
     AVX512VL(VL f32x8,        avx512f,      32f4),
@@ -303,6 +315,22 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512VL S/G f32[4x32], avx512vl_sg, 16x4f4),
+    SIMD(AVX512VL S/G f64[2x32], avx512vl_sg, 16x4f8),
+    SIMD(AVX512VL S/G f32[2x64], avx512vl_sg, 16x8f4),
+    SIMD(AVX512VL S/G f64[2x64], avx512vl_sg, 16x8f8),
+    SIMD(AVX512VL S/G f32[8x32], avx512vl_sg, 32x4f4),
+    SIMD(AVX512VL S/G f64[4x32], avx512vl_sg, 32x4f8),
+    SIMD(AVX512VL S/G f32[4x64], avx512vl_sg, 32x8f4),
+    SIMD(AVX512VL S/G f64[4x64], avx512vl_sg, 32x8f8),
+    SIMD(AVX512VL S/G i32[4x32], avx512vl_sg, 16x4i4),
+    SIMD(AVX512VL S/G i64[2x32], avx512vl_sg, 16x4i8),
+    SIMD(AVX512VL S/G i32[2x64], avx512vl_sg, 16x8i4),
+    SIMD(AVX512VL S/G i64[2x64], avx512vl_sg, 16x8i8),
+    SIMD(AVX512VL S/G i32[8x32], avx512vl_sg, 32x4i4),
+    SIMD(AVX512VL S/G i64[4x32], avx512vl_sg, 32x4i8),
+    SIMD(AVX512VL S/G i32[4x64], avx512vl_sg, 32x8i4),
+    SIMD(AVX512VL S/G i64[4x64], avx512vl_sg, 32x8i8),
     SIMD(AVX512BW s8x64,     avx512bw,      64i1),
     SIMD(AVX512BW u8x64,     avx512bw,      64u1),
     SIMD(AVX512BW s16x32,    avx512bw,      64i2),
@@ -4231,7 +4259,7 @@ int main(int argc, char **argv)
 
         if ( !blobs[j].size )
         {
-            printf("%-39s n/a\n", blobs[j].name);
+            printf("%-39s n/a (%u-bit)\n", blobs[j].name, blobs[j].bitness);
             continue;
         }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -499,7 +499,7 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
-    [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
+    [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x9a] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -3051,7 +3051,8 @@ x86_decode(
 
         d &= ~ModRM;
 #undef ModRM /* Only its aliases are valid to use from here on. */
-        modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3);
+        modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3) |
+                    ((evex_encoded() && !evex.R) << 4);
         modrm_rm  = modrm & 0x07;
 
         /*
@@ -3221,7 +3222,8 @@ x86_decode(
         if ( modrm_mod == 3 )
         {
             generate_exception_if(d & vSIB, EXC_UD);
-            modrm_rm |= (rex_prefix & 1) << 3;
+            modrm_rm |= ((rex_prefix & 1) << 3) |
+                        (evex_encoded() && !evex.x) << 4;
             ea.type = OP_REG;
         }
         else if ( ad_bytes == 2 )
@@ -3286,7 +3288,10 @@ x86_decode(
 
                 state->sib_index = ((sib >> 3) & 7) | ((rex_prefix << 2) & 8);
                 state->sib_scale = (sib >> 6) & 3;
-                if ( state->sib_index != 4 && !(d & vSIB) )
+                if ( unlikely(d & vSIB) )
+                    state->sib_index |= (mode_64bit() && evex_encoded() &&
+                                         !evex.RX) << 4;
+                else if ( state->sib_index != 4 )
                 {
                     ea.mem.off = *decode_gpr(state->regs, state->sib_index);
                     ea.mem.off <<= state->sib_scale;
@@ -9065,6 +9070,130 @@ x86_emulate(
         put_stub(stub);
 
         state->simd_size = simd_none;
+        break;
+    }
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x90): /* vpgatherd{d,q} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x91): /* vpgatherq{d,q} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x92): /* vgatherdp{s,d} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x93): /* vgatherqp{s,d} mem,[xyz]mm{k} */
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+        bool done = false;
+
+        ASSERT(ea.type == OP_MEM);
+        generate_exception_if((!evex.opmsk || evex.br || evex.z ||
+                               evex.reg != 0xf ||
+                               modrm_reg == state->sib_index),
+                              EXC_UD);
+        avx512_vlen_check(false);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read destination and index registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x7f; /* vmovdqa{32,64} */
+        /*
+         * The register writeback below has to retain masked-off elements, but
+         * needs to clear upper portions in the index-wider-than-data cases.
+         * Therefore read (and write below) the full register. The alternative
+         * would have been to fiddle with the mask register used.
+         */
+        pevex->opmsk = 0;
+        /* Use (%rax) as destination and modrm_reg as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
+
+        pevex->pfx = vex_f3; /* vmovdqu{32,64} */
+        pevex->w = b & 1;
+        /* Switch to sib_index as source. */
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        opc[1] = (state->sib_index & 7) << 3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the destination and mask values. */
+        n = 1 << (2 + evex.lr - ((b & 1) | evex.w));
+        op_bytes = 4 << evex.w;
+        memset((void *)mmvalp + n * op_bytes, 0, 64 - n * op_bytes);
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = ops->read(ea.mem.seg,
+                           truncate_ea(ea.mem.off + (idx << state->sib_scale)),
+                           (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
+            if ( rc != X86EMUL_OKAY )
+            {
+                /*
+                 * If we've made some progress and the access did not fault,
+                 * force a retry instead. This is for example necessary to
+                 * cope with the limited capacity of HVM's MMIO cache.
+                 */
+                if ( rc != X86EMUL_EXCEPTION && done )
+                    rc = X86EMUL_RETRY;
+                break;
+            }
+
+            op_mask &= ~(1 << i);
+            done = true;
+
+#ifdef __XEN__
+            if ( op_mask && local_events_need_delivery() )
+            {
+                rc = X86EMUL_RETRY;
+                break;
+            }
+#endif
+        }
+
+        /* Write destination and mask registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x6f; /* vmovdqa{32,64} */
+        pevex->opmsk = 0;
+        /* Use modrm_reg as destination and (%rax) as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
+
+        /*
+         * kmovw: This is VEX-encoded, so we can't use pevex. Avoid copy_VEX() etc
+         * as well, since we can easily use the 2-byte VEX form here.
+         */
+        opc -= EVEX_PFX_BYTES;
+        opc[0] = 0xc5;
+        opc[1] = 0xf8;
+        opc[2] = 0x90;
+        /* Use (%rax) as source. */
+        opc[3] = evex.opmsk << 3;
+        opc[4] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+        put_stub(stub);
+
+        state->simd_size = simd_none;
         break;
     }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -656,9 +656,6 @@ static inline unsigned long *decode_gpr(
     BUILD_BUG_ON(ARRAY_SIZE(cpu_user_regs_gpr_offsets) &
                  (ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1));
 
-    ASSERT(modrm < ARRAY_SIZE(cpu_user_regs_gpr_offsets));
-
-    /* For safety in release builds.  Debug builds will hit the ASSERT() */
     modrm &= ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1;
 
     return (void *)regs + cpu_user_regs_gpr_offsets[modrm];




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 33/42] x86emul: add high register S/G test cases
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (31 preceding siblings ...)
  2018-12-06 10:07   ` [PATCH v6 32/42] x86emul: support AVX512F gather insns Jan Beulich
@ 2018-12-06 10:08   ` Jan Beulich
  2018-12-06 10:08   ` [PATCH v6 34/42] x86emul: support AVX512F scatter insns Jan Beulich
                     ` (8 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

In order to verify that in particular the index register decoding works
correctly in the S/G emulation paths, add dedicated (64-bit only) cases
preventing the compiler from using the lower registers. Unlike in the
generic SIMD case, where occasional uses of %xmm or %ymm registers in
generated code cause various internal compiler errors when disallowing
use of all of the lower 16 registers (apparently due to insn templates
trying to use AVX2 encodings), doing so here in the AVX512F case looks
to be fine.

While the main goal here is the AVX512F case, add an AVX2 variant as
well.
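
For reference (my summary, not part of the patch): gcc's -ffixed-<reg>
removes <reg> from the register allocator's pool, so e.g.

    gcc ... -ffixed-zmm1 ... -ffixed-zmm7 -c simd-sg.c    # sketch only

leaves (besides %zmm0) only %zmm8 and up available, and fixing
%zmm8...%zmm15 on top of that yields the %zmm16+ variant.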

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -147,6 +147,12 @@ $(foreach flavor,$(SIMD) $(FMA),$(eval $
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
+first-string = $(shell for s in $(1); do echo "$$s"; break; done)
+
+avx2-sg-cflags-x86_64    := "-D_high $(foreach n,7 6 5 4 3 2 1,-ffixed-ymm$(n)) $(call first-string,$(avx2-sg-cflags))"
+avx512f-sg-cflags-x86_64 := "-D_higher $(foreach n,7 6 5 4 3 2 1,-ffixed-zmm$(n)) $(call first-string,$(avx512f-sg-cflags))"
+avx512f-sg-cflags-x86_64 += "-D_highest $(foreach n,15 14 13 12 11 10 9 8,-ffixed-zmm$(n)) $(call first-string,$(avx512f-sg-cflags-x86_64))"
+
 $(addsuffix .h,$(TESTCASES)): %.h: %.c testcase.mk Makefile
 	rm -f $@.new $*.bin
 	$(foreach arch,$(filter-out $(XEN_COMPILE_ARCH),x86_32) $(XEN_COMPILE_ARCH), \
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -266,6 +266,9 @@ static const struct {
     SIMD(AVX2 S/G i64[4x32],  avx2_sg,    32x4i8),
     SIMD(AVX2 S/G i32[4x64],  avx2_sg,    32x8i4),
     SIMD(AVX2 S/G i64[4x64],  avx2_sg,    32x8i8),
+#ifdef __x86_64__
+    SIMD_(64, AVX2 S/G %ymm8+, avx2_sg,     high),
+#endif
     SIMD(XOP 128bit single,       xop,      16f4),
     SIMD(XOP 256bit single,       xop,      32f4),
     SIMD(XOP 128bit double,       xop,      16f8),
@@ -303,6 +306,10 @@ static const struct {
     SIMD(AVX512F S/G i64[ 8x32], avx512f_sg, 64x4i8),
     SIMD(AVX512F S/G i32[ 8x64], avx512f_sg, 64x8i4),
     SIMD(AVX512F S/G i64[ 8x64], avx512f_sg, 64x8i8),
+#ifdef __x86_64__
+    SIMD_(64, AVX512F S/G %zmm8+, avx512f_sg, higher),
+    SIMD_(64, AVX512F S/G %zmm16+, avx512f_sg, highest),
+#endif
     AVX512VL(VL f32x4,        avx512f,      16f4),
     AVX512VL(VL f64x2,        avx512f,      16f8),
     AVX512VL(VL f32x8,        avx512f,      32f4),





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 34/42] x86emul: support AVX512F scatter insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (32 preceding siblings ...)
  2018-12-06 10:08   ` [PATCH v6 33/42] x86emul: add high register S/G test cases Jan Beulich
@ 2018-12-06 10:08   ` Jan Beulich
  2018-12-06 10:09   ` [PATCH v6 35/42] x86emul: support AVX512PF insns Jan Beulich
                     ` (7 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512F in the insn emulator.

Note that in the test harness there's a little bit of trickery needed to
work around the not fully consistent naming of AVX512VL gather and
scatter built-ins. To suppress expansion of the "di" and "si" tokens
they get constructed by token concatenation in BS(), which is different
from BG().
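
To make this concrete (my sketch of the expansions, for a sub-512-bit
build where "si" is redefined to "3si" and v16sf remapped to v8sf):

    BG(v16sf, si, ...) -> BG_(v8sf, 3si, ...) -> __builtin_ia32_gather3siv8sf(...)
    BS(v16sf, s,  ...) -> BS_(v8sf, si,  ...) -> __builtin_ia32_scattersiv8sf(...)

i.e. pasting s ## i inside BS() keeps "si" out of reach of the macro
expander, preventing the "3" infix from leaking into the scatter
built-in names.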

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: I couldn't really decide whether to duplicate code or merge scatter
     into gather emulation.
---
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -270,6 +270,8 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pscatterd,    66, 0f38, a0,    vl,     dq, el),
+    INSN(pscatterq,    66, 0f38, a1,    vl,     dq, el),
     INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
@@ -305,6 +307,8 @@ static const struct test avx512f_all[] =
     INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
     INSN(scalef,       66, 0f38, 2c,    vl,     sd, vl),
     INSN(scalef,       66, 0f38, 2d,    el,     sd, el),
+    INSN(scatterd,     66, 0f38, a2,    vl,     sd, el),
+    INSN(scatterq,     66, 0f38, a3,    vl,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/tools/tests/x86_emulator/simd-sg.c
+++ b/tools/tests/x86_emulator/simd-sg.c
@@ -48,10 +48,14 @@ typedef long long __attribute__((vector_
 #  endif
 #  define BG_(dt, it, reg, mem, idx, msk, scl) \
     __builtin_ia32_gather##it##dt(reg, mem, idx, to_mask(msk), scl)
+#  define BS_(dt, it, mem, idx, reg, msk, scl) \
+    __builtin_ia32_scatter##it##dt(mem, to_mask(msk), idx, reg, scl)
 # else
 #  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
 #  define BG_(dt, it, reg, mem, idx, msk, scl) \
     __builtin_ia32_gather##it##dt(reg, mem, idx, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), scl)
+#  define BS_(dt, it, mem, idx, reg, msk, scl) \
+    __builtin_ia32_scatter##it##dt(mem, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), idx, reg, scl)
 # endif
 /*
  * Instead of replicating the main IDX_SIZE conditional below three times, use
@@ -59,6 +63,7 @@ typedef long long __attribute__((vector_
  * respective relevant macro argument tokens.
  */
 # define BG(dt, it, reg, mem, idx, msk, scl) BG_(dt, it, reg, mem, idx, msk, scl)
+# define BS(dt, it, mem, idx, reg, msk, scl) BS_(dt, it##i, mem, idx, reg, msk, scl)
 # if VEC_MAX < 64
 /*
  * The sub-512-bit built-ins have an extra "3" infix, presumably because the
@@ -82,22 +87,30 @@ typedef long long __attribute__((vector_
 # if IDX_SIZE == 4
 #  if INT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16si, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16si, s, mem, idx, reg, msk, scl)
 #  elif INT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, si, (vdi_t)(reg), mem, idx, msk, scl))
+#   define scatter(mem, idx, reg, msk, scl) BS(v8di, s, mem, idx, (vdi_t)(reg), msk, scl)
 #  elif FLOAT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16sf, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16sf, s, mem, idx, reg, msk, scl)
 #  elif FLOAT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) BG(v8df, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v8df, s, mem, idx, reg, msk, scl)
 #  endif
 # elif IDX_SIZE == 8
 #  if INT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16si, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16si, d, mem, (idi_t)(idx), reg, msk, scl)
 #  elif INT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, di, (vdi_t)(reg), mem, (idi_t)(idx), msk, scl))
+#   define scatter(mem, idx, reg, msk, scl) BS(v8di, d, mem, (idi_t)(idx), (vdi_t)(reg), msk, scl)
 #  elif FLOAT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16sf, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16sf, d, mem, (idi_t)(idx), reg, msk, scl)
 #  elif FLOAT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) BG(v8df, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v8df, d, mem, (idi_t)(idx), reg, msk, scl)
 #  endif
 # endif
 #elif defined(__AVX2__)
@@ -195,6 +208,8 @@ const typeof((vec_t){}[0]) array[] = {
     GLUE(PUT, VEC_MAX)(VEC_MAX + 1)
 };
 
+typeof((vec_t){}[0]) out[VEC_MAX * 2];
+
 int sg_test(void)
 {
     unsigned int i;
@@ -275,5 +290,41 @@ int sg_test(void)
 # endif
 #endif
 
+#ifdef scatter
+
+    for ( i = 0; i < sizeof(out) / sizeof(*out); ++i )
+        out[i] = 0;
+
+    for ( i = 0; i < ITEM_COUNT; ++i )
+        x[i] = i + 1;
+
+    touch(x);
+
+    scatter(out, (idx_t){}, x, (vec_t){ 1 } != 0, 1);
+    if ( out[0] != 1 )
+        return __LINE__;
+    for ( i = 1; i < ITEM_COUNT; ++i )
+        if ( out[i] )
+            return __LINE__;
+
+    scatter(out, (idx_t){}, x, full, 1);
+    if ( out[0] != ITEM_COUNT )
+        return __LINE__;
+    for ( i = 1; i < ITEM_COUNT; ++i )
+        if ( out[i] )
+            return __LINE__;
+
+    scatter(out, idx, x, full, ELEM_SIZE);
+    for ( i = 1; i <= ITEM_COUNT; ++i )
+        if ( out[i] != i )
+            return __LINE__;
+
+    scatter(out, inv, x, full, ELEM_SIZE);
+    for ( i = 1; i <= ITEM_COUNT; ++i )
+        if ( out[i] != ITEM_COUNT + 1 - i )
+            return __LINE__;
+
+#endif
+
     return 0;
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -508,6 +508,7 @@ static const struct ext0f38_table {
     [0x9d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x9e] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x9f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xa0 ... 0xa3] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xa9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xaa] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -9282,6 +9283,102 @@ x86_emulate(
             host_and_vcpu_must_have(avx512f);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa0): /* vpscatterd{d,q} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa1): /* vpscatterq{d,q} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa2): /* vscatterdp{s,d} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa3): /* vscatterqp{s,d} [xyz]mm,mem{k} */
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+        bool done = false;
+
+        ASSERT(ea.type == OP_MEM);
+        fail_if(!ops->write);
+        generate_exception_if((!evex.opmsk || evex.br || evex.z ||
+                               evex.reg != 0xf ||
+                               modrm_reg == state->sib_index),
+                              EXC_UD);
+        avx512_vlen_check(false);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read source and index registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x7f; /* vmovdqa{32,64} */
+        /* Use (%rax) as destination and modrm_reg as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
+
+        pevex->pfx = vex_f3; /* vmovdqu{32,64} */
+        pevex->w = b & 1;
+        /* Switch to sib_index as source. */
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        opc[1] = (state->sib_index & 7) << 3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the mask value. */
+        n = 1 << (2 + evex.lr - ((b & 1) | evex.w));
+        op_bytes = 4 << evex.w;
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = ops->write(ea.mem.seg,
+                            truncate_ea(ea.mem.off + (idx << state->sib_scale)),
+                            (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
+            if ( rc != X86EMUL_OKAY )
+            {
+                /* See comment in gather emulation. */
+                if ( rc != X86EMUL_EXCEPTION && done )
+                    rc = X86EMUL_RETRY;
+                break;
+            }
+
+            op_mask &= ~(1 << i);
+            done = true;
+
+#ifdef __XEN__
+            if ( op_mask && local_events_need_delivery() )
+            {
+                rc = X86EMUL_RETRY;
+                break;
+            }
+#endif
+        }
+
+        /* Write mask register. See comment in gather emulation. */
+        opc = get_stub(stub);
+        opc[0] = 0xc5;
+        opc[1] = 0xf8;
+        opc[2] = 0x90;
+        /* Use (%rax) as source. */
+        opc[3] = evex.opmsk << 3;
+        opc[4] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+        put_stub(stub);
+
+        state->simd_size = simd_none;
+        break;
+    }
+
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xc9):     /* sha1msg1 xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xca):     /* sha1msg2 xmm/m128,xmm */




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 35/42] x86emul: support AVX512PF insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (33 preceding siblings ...)
  2018-12-06 10:08   ` [PATCH v6 34/42] x86emul: support AVX512F scatter insns Jan Beulich
@ 2018-12-06 10:09   ` Jan Beulich
  2018-12-06 10:09   ` [PATCH v6 36/42] x86emul: support AVX512CD insns Jan Beulich
                     ` (6 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:09 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Some adjustments are necessary to the EVEX Disp8 scaling test code to
account for the zero-byte reads/writes. I have to admit, though, that I'm
not fully convinced the SDM describes the faulting behavior correctly:
Other prefetch insns, including the Xeon Phi Coprocessor S/G ones, don't
produce #GP/#SS. Until proven otherwise, this gets implemented as
specified, not least because the respective exception specification
table, besides listing #GP and #SS, also explicitly says "EVEX-encoded
prefetch instructions that do not cause #PF follow exception class ...".
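
Concretely (my reading of the hunks below): the prefetches reach the
->read() / ->write() hooks with a byte count of zero, for which
record_access() would log nothing; passing bytes + !bytes records them
as one-byte accesses, and test_one() then widens the expected access
window by a single byte (the "++n" case) instead of a full element.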

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -520,6 +520,17 @@ static const struct test avx512er_512[]
     INSN(rsqrt28, 66, 0f38, cd, el, sd, el),
 };
 
+static const struct test avx512pf_512[] = {
+    INSNX(gatherpf0d,  66, 0f38, c6, 1, vl, sd, el),
+    INSNX(gatherpf0q,  66, 0f38, c7, 1, vl, sd, el),
+    INSNX(gatherpf1d,  66, 0f38, c6, 2, vl, sd, el),
+    INSNX(gatherpf1q,  66, 0f38, c7, 2, vl, sd, el),
+    INSNX(scatterpf0d, 66, 0f38, c6, 5, vl, sd, el),
+    INSNX(scatterpf0q, 66, 0f38, c7, 5, vl, sd, el),
+    INSNX(scatterpf1d, 66, 0f38, c6, 6, vl, sd, el),
+    INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -580,7 +591,7 @@ static bool record_access(enum x86_segme
 static int read(enum x86_segment seg, unsigned long offset, void *p_data,
                 unsigned int bytes, struct x86_emulate_ctxt *ctxt)
 {
-    if ( !record_access(seg, offset, bytes) )
+    if ( !record_access(seg, offset, bytes + !bytes) )
         return X86EMUL_UNHANDLEABLE;
     memset(p_data, 0, bytes);
     return X86EMUL_OKAY;
@@ -589,7 +600,7 @@ static int read(enum x86_segment seg, un
 static int write(enum x86_segment seg, unsigned long offset, void *p_data,
                  unsigned int bytes, struct x86_emulate_ctxt *ctxt)
 {
-    if ( !record_access(seg, offset, bytes) )
+    if ( !record_access(seg, offset, bytes + !bytes) )
         return X86EMUL_UNHANDLEABLE;
     return X86EMUL_OKAY;
 }
@@ -597,7 +608,7 @@ static int write(enum x86_segment seg, u
 static void test_one(const struct test *test, enum vl vl,
                      unsigned char *instr, struct x86_emulate_ctxt *ctxt)
 {
-    unsigned int vsz, esz, i;
+    unsigned int vsz, esz, i, n;
     int rc;
     bool sg = strstr(test->mnemonic, "gather") ||
               strstr(test->mnemonic, "scatter");
@@ -725,10 +736,20 @@ static void test_one(const struct test *
     for ( i = 0; i < (test->scale == SC_vl ? vsz : esz); ++i )
          if ( accessed[i] )
              goto fail;
-    for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
+
+    n = test->scale == SC_vl ? vsz : esz;
+    if ( !sg )
+        n += vsz;
+    else if ( !strstr(test->mnemonic, "pf") )
+        n += esz;
+    else
+        ++n;
+
+    for ( ; i < n; ++i )
          if ( accessed[i] != (sg ? (vsz / esz) >> (test->opc & 1 & !evex.w)
                                  : 1) )
              goto fail;
+
     for ( ; i < ARRAY_SIZE(accessed); ++i )
          if ( accessed[i] )
              goto fail;
@@ -887,6 +908,8 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
     RUN(avx512er, 512);
+#define cpu_has_avx512pf cpu_has_avx512f
+    RUN(avx512pf, 512);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
 }
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -135,12 +135,12 @@ int emul_test_cpuid(
         res->c |= 1U << 22;
 
     /*
-     * The emulator doesn't itself use ADCX/ADOX/RDPID, so we can always run
-     * the respective tests.
+     * The emulator doesn't itself use ADCX/ADOX/RDPID nor the S/G prefetch
+     * insns, so we can always run the respective tests.
      */
     if ( leaf == 7 && subleaf == 0 )
     {
-        res->b |= 1U << 19;
+        res->b |= (1U << 19) | (1U << 26);
         res->c |= 1U << 22;
     }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -525,6 +525,7 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xc6 ... 0xc7] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xc9] = { .simd_size = simd_other },
     [0xca] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -1903,6 +1904,7 @@ static bool vcpu_has(
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
+#define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
 #define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
@@ -9377,6 +9379,80 @@ x86_emulate(
 
         state->simd_size = simd_none;
         break;
+    }
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc6):
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc7):
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+
+        ASSERT(ea.type == OP_MEM);
+        generate_exception_if((!cpu_has_avx512f || !evex.opmsk || evex.br ||
+                               evex.z || evex.reg != 0xf || evex.lr != 2),
+                              EXC_UD);
+        vcpu_must_have(avx512pf);
+
+        switch ( modrm_reg & 7 )
+        {
+        case 1: /* vgatherpf0{d,q}p{d,s} mem{k} */
+        case 2: /* vgatherpf1{d,q}p{d,s} mem{k} */
+            break;
+        case 5: /* vscatterpf0{d,q}p{d,s} mem{k} */
+        case 6: /* vscatterpf1{d,q}p{d,s} mem{k} */
+            fail_if(!ops->write);
+            break;
+        default:
+            generate_exception(EXC_UD);
+        }
+
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read index register. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        /* vmovdqu{32,64} */
+        opc[0] = 0x7f;
+        pevex->pfx = vex_f3;
+        pevex->w = b & 1;
+        /* Use (%rax) as destination and sib_index as source. */
+        pevex->b = 1;
+        opc[1] = (state->sib_index & 7) << 3;
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the mask value. */
+        n = 1 << (4 - ((b & 1) | evex.w));
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; rc == X86EMUL_OKAY && op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = (modrm_reg & 4
+                  ? ops->write
+                  : ops->read)(ea.mem.seg,
+                               truncate_ea(ea.mem.off +
+                                           (idx << state->sib_scale)),
+                               NULL, 0, ctxt);
+
+            op_mask &= ~(1 << i);
+        }
+
+        state->simd_size = simd_none;
+        break;
     }
 
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 36/42] x86emul: support AVX512CD insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (34 preceding siblings ...)
  2018-12-06 10:09   ` [PATCH v6 35/42] x86emul: support AVX512PF insns Jan Beulich
@ 2018-12-06 10:09   ` Jan Beulich
  2018-12-06 10:10   ` [PATCH v6 37/42] x86emul: complete support of AVX512_VBMI insns Jan Beulich
                     ` (5 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:09 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Since the insns here and in particular their memory access patterns
follow the usual scheme, I didn't think it was necessary to add
contrived tests specifically for them, beyond the Disp8 scaling ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -458,6 +458,13 @@ static const struct test avx512bw_128[]
     INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
+static const struct test avx512cd_all[] = {
+//       pbroadcastmb2q, f3, 0f38, 2a,      q
+//       pbroadcastmw2d, f3, 0f38, 3a,      d
+    INSN(pconflict,      66, 0f38, c4, vl, dq, vl),
+    INSN(plzcnt,         66, 0f38, 44, vl, dq, vl),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -903,6 +910,7 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, 512);
     RUN(avx512bw, all);
     RUN(avx512bw, 128);
+    RUN(avx512cd, all);
     RUN(avx512dq, all);
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -138,6 +138,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
 #define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
+#define cpu_has_avx512cd  (cp.feat.avx512cd && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -473,6 +473,7 @@ static const struct ext0f38_table {
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x42] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x44] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -525,6 +526,7 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xc4] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0xc6 ... 0xc7] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xc9] = { .simd_size = simd_other },
@@ -1906,6 +1908,7 @@ static bool vcpu_has(
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
 #define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
+#define vcpu_has_avx512cd()    vcpu_has(         7, EBX, 28, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
@@ -8769,6 +8772,20 @@ x86_emulate(
         evex.opcx = vex_0f;
         goto vmovdqa;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x2a): /* vpbroadcastmb2q k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x3a): /* vpbroadcastmw2d k,[xyz]mm */
+        generate_exception_if((ea.type != OP_REG || evex.opmsk ||
+                               evex.w == ((b >> 4) & 1)),
+                              EXC_UD);
+        d |= TwoOp;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc4): /* vpconflict{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x44): /* vplzcnt{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512cd);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2c): /* vmaskmovps mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2d): /* vmaskmovpd mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2e): /* vmaskmovps {x,y}mm,{x,y}mm,mem */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -104,6 +104,7 @@
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
+#define cpu_has_avx512cd        boot_cpu_has(X86_FEATURE_AVX512CD)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 37/42] x86emul: complete support of AVX512_VBMI insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (35 preceding siblings ...)
  2018-12-06 10:09   ` [PATCH v6 36/42] x86emul: support AVX512CD insns Jan Beulich
@ 2018-12-06 10:10   ` Jan Beulich
  2018-12-06 10:10   ` [PATCH v6 38/42] x86emul: support of AVX512* population count insns Jan Beulich
                     ` (4 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:10 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also add testing of the insns whose support was added earlier. Sadly gcc's
command-line option naming is not in line with Intel's naming of the
feature, which makes it necessary to mis-name things in the test harness.

Since the only new insn here and in particular its memory access pattern
follows the usual scheme, I didn't think it was necessary to add a
contrived test specifically for it, beyond the Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er avx512vbmi
 FMA := fma4 fma
 SG := avx2-sg avx512f-sg avx512vl-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -83,6 +83,9 @@ avx512dq-flts := $(avx512f-flts)
 avx512er-vecs := 64
 avx512er-ints :=
 avx512er-flts := 4 8
+avx512vbmi-vecs := $(avx512bw-vecs)
+avx512vbmi-ints := $(avx512bw-ints)
+avx512vbmi-flts := $(avx512bw-flts)
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -542,6 +542,7 @@ static const struct test avx512_vbmi_all
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+    INSN(pmultishiftqb, 66, 0f38, 83, vl, q, vl),
 };
 
 static const struct test avx512_vbmi2_all[] = {
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -27,6 +27,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw.h"
 #include "avx512dq.h"
 #include "avx512er.h"
+#include "avx512vbmi.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -127,6 +128,16 @@ static bool simd_check_avx512bw_vl(void)
     return cpu_has_avx512bw && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx512vbmi(void)
+{
+    return cpu_has_avx512_vbmi;
+}
+
+static bool simd_check_avx512vbmi_vl(void)
+{
+    return cpu_has_avx512_vbmi && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -372,6 +383,18 @@ static const struct {
     SIMD(AVX512ER f32x16,    avx512er,      64f4),
     SIMD(AVX512ER f64 scalar,avx512er,        f8),
     SIMD(AVX512ER f64x8,     avx512er,      64f8),
+    SIMD(AVX512_VBMI s8x64,  avx512vbmi,    64i1),
+    SIMD(AVX512_VBMI u8x64,  avx512vbmi,    64u1),
+    SIMD(AVX512_VBMI s16x32, avx512vbmi,    64i2),
+    SIMD(AVX512_VBMI u16x32, avx512vbmi,    64u2),
+    AVX512VL(_VBMI+VL s8x16, avx512vbmi,    16i1),
+    AVX512VL(_VBMI+VL u8x16, avx512vbmi,    16u1),
+    AVX512VL(_VBMI+VL s8x32, avx512vbmi,    32i1),
+    AVX512VL(_VBMI+VL u8x32, avx512vbmi,    32u1),
+    AVX512VL(_VBMI+VL s16x8, avx512vbmi,    16i2),
+    AVX512VL(_VBMI+VL u16x8, avx512vbmi,    16u2),
+    AVX512VL(_VBMI+VL s16x16, avx512vbmi,   32i2),
+    AVX512VL(_VBMI+VL u16x16, avx512vbmi,   32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -493,6 +493,7 @@ static const struct ext0f38_table {
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x83] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x88] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_dq },
     [0x89] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_dq },
     [0x8a] = { .simd_size = simd_packed_fp, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
@@ -8969,6 +8970,12 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x83): /* vpmultishiftqb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        host_and_vcpu_must_have(avx512_vbmi);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 38/42] x86emul: support of AVX512* population count insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (36 preceding siblings ...)
  2018-12-06 10:10   ` [PATCH v6 37/42] x86emul: complete support of AVX512_VBMI insns Jan Beulich
@ 2018-12-06 10:10   ` Jan Beulich
  2018-12-06 10:11   ` [PATCH v6 39/42] x86emul: support of AVX512_IFMA insns Jan Beulich
                     ` (3 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:10 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Plus the only other AVX512_BITALG insn (VPSHUFBITQMB).

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -538,6 +538,11 @@ static const struct test avx512pf_512[]
     INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
 };
 
+static const struct test avx512_bitalg_all[] = {
+    INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
+    INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -550,6 +555,10 @@ static const struct test avx512_vbmi2_al
     INSN(pexpand,   66, 0f38, 62, vl, bw, el),
 };
 
+static const struct test avx512_vpopcntdq_all[] = {
+    INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -919,6 +928,8 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512er, 512);
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
+    RUN(avx512_bitalg, all);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
+    RUN(avx512_vpopcntdq, all);
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -143,6 +143,8 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
+#define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,6 +479,7 @@ static const struct ext0f38_table {
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -501,6 +502,7 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
+    [0x8f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -1915,6 +1917,8 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
+#define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -8876,6 +8880,19 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8f): /* vpshufbitqmb [xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(evex.w || !evex.r || !evex.R || evex.z, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x54): /* vpopcnt{b,w} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_bitalg);
+        generate_exception_if(evex.br, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x55): /* vpopcnt{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vpopcntdq);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,{x,y}mm */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -112,6 +112,8 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
+#define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
 
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -269,7 +269,7 @@ def crunch_numbers(state):
         # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask
         # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
+        AVX512BW: [AVX512_VBMI, AVX512_BITALG, AVX512_VBMI2],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v6 39/42] x86emul: support of AVX512_IFMA insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (37 preceding siblings ...)
  2018-12-06 10:10   ` [PATCH v6 38/42] x86emul: support of AVX512* population count insns Jan Beulich
@ 2018-12-06 10:11   ` Jan Beulich
  2018-12-06 10:11   ` [PATCH v6 40/42] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
                     ` (2 subsequent siblings)
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:11 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Once again take the liberty of also correcting the (public interface)
name of the AVX512_IFMA feature flag, on the assumption that no
external consumer has actually been using that flag so far.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.
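
Purely illustrative aside, not part of the patch: for reviewers
unfamiliar with IFMA, one 64-bit lane of the two insns can be sketched
in plain C as below (helper names made up, and assuming a compiler
providing unsigned __int128):

#include <stdint.h>

#define MASK52 ((UINT64_C(1) << 52) - 1)

/* vpmadd52luq: accumulate the low 52 bits of the 104-bit product of
 * the low 52 bits of each source. */
static uint64_t madd52lo(uint64_t acc, uint64_t a, uint64_t b)
{
    unsigned __int128 prod = (unsigned __int128)(a & MASK52) * (b & MASK52);

    return acc + ((uint64_t)prod & MASK52);
}

/* vpmadd52huq: accumulate bits 103:52 of the same product. */
static uint64_t madd52hi(uint64_t acc, uint64_t a, uint64_t b)
{
    unsigned __int128 prod = (unsigned __int128)(a & MASK52) * (b & MASK52);

    return acc + (uint64_t)((prod >> 52) & MASK52);
}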

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -543,6 +543,11 @@ static const struct test avx512_bitalg_a
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
 };
 
+static const struct test avx512_ifma_all[] = {
+    INSN(pmadd52huq, 66, 0f38, b5, vl, q, vl),
+    INSN(pmadd52luq, 66, 0f38, b4, vl, q, vl),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -929,6 +934,7 @@ void evex_disp8_test(void *instr, struct
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
     RUN(avx512_bitalg, all);
+    RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
     RUN(avx512_vpopcntdq, all);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -137,6 +137,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_bmi2       cp.feat.bmi2
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
+#define cpu_has_avx512_ifma (cp.feat.avx512_ifma && xcr0_mask(0xe6))
 #define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512cd  (cp.feat.avx512cd && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -521,6 +521,7 @@ static const struct ext0f38_table {
     [0xad] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xae] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xaf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xb4 ... 0xb5] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xb9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xba] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -1907,6 +1908,7 @@ static bool vcpu_has(
 #define vcpu_has_rdseed()      vcpu_has(         7, EBX, 18, ctxt, ops)
 #define vcpu_has_adx()         vcpu_has(         7, EBX, 19, ctxt, ops)
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
+#define vcpu_has_avx512_ifma() vcpu_has(         7, EBX, 21, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
@@ -9422,6 +9424,11 @@ x86_emulate(
         break;
     }
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb4): /* vpmadd52luq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb5): /* vpmadd52huq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_ifma);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xc6):
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xc7):
     {
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -103,6 +103,7 @@
 #define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
+#define cpu_has_avx512_ifma     boot_cpu_has(X86_FEATURE_AVX512_IFMA)
 #define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_avx512cd        boot_cpu_has(X86_FEATURE_AVX512CD)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -212,7 +212,7 @@ XEN_CPUFEATURE(AVX512DQ,      5*32+17) /
 XEN_CPUFEATURE(RDSEED,        5*32+18) /*A  RDSEED instruction */
 XEN_CPUFEATURE(ADX,           5*32+19) /*A  ADCX, ADOX instructions */
 XEN_CPUFEATURE(SMAP,          5*32+20) /*S  Supervisor Mode Access Prevention */
-XEN_CPUFEATURE(AVX512IFMA,    5*32+21) /*A  AVX-512 Integer Fused Multiply Add */
+XEN_CPUFEATURE(AVX512_IFMA,   5*32+21) /*A  AVX-512 Integer Fused Multiply Add */
 XEN_CPUFEATURE(CLFLUSHOPT,    5*32+23) /*A  CLFLUSHOPT instruction */
 XEN_CPUFEATURE(CLWB,          5*32+24) /*A  CLWB instruction */
 XEN_CPUFEATURE(AVX512PF,      5*32+26) /*A  AVX-512 Prefetch Instructions */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -262,7 +262,7 @@ def crunch_numbers(state):
         # (which in practice depends on the EVEX prefix to encode) as well
         # as mask registers, and the instructions themselves. All further
         # AVX512 features are built on top of AVX512F
-        AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
+        AVX512F: [AVX512DQ, AVX512_IFMA, AVX512PF, AVX512ER, AVX512CD,
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
                   AVX512_VPOPCNTDQ],
 





* [PATCH v6 40/42] x86emul: support remaining AVX512_VBMI2 insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (38 preceding siblings ...)
  2018-12-06 10:11   ` [PATCH v6 39/42] x86emul: support of AVX512_IFMA insns Jan Beulich
@ 2018-12-06 10:11   ` Jan Beulich
  2018-12-06 10:12   ` [PATCH v6 41/42] x86emul: support AVX512_4FMAPS insns Jan Beulich
  2018-12-06 10:12   ` [PATCH v6 42/42] x86emul: support AVX512_4VNNIW insns Jan Beulich
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:11 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.
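
Purely illustrative aside, not part of the patch: per my reading of the
SDM pseudo code, the double-shift semantics amount to the following on
one 16-bit lane (the d/q forms are analogous on wider lanes, and the
0f38-space forms merely take the count from a vector register instead
of imm8):

#include <stdint.h>

/* vpshld*: shift the concatenation hi:lo left, keep the upper half. */
static uint16_t shld16(uint16_t hi, uint16_t lo, unsigned int count)
{
    uint32_t tmp = ((uint32_t)hi << 16) | lo;

    return tmp << (count & 15) >> 16;
}

/* vpshrd*: shift the concatenation hi:lo right, keep the lower half. */
static uint16_t shrd16(uint16_t hi, uint16_t lo, unsigned int count)
{
    uint32_t tmp = ((uint32_t)hi << 16) | lo;

    return tmp >> (count & 15);
}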

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -558,6 +558,14 @@ static const struct test avx512_vbmi_all
 static const struct test avx512_vbmi2_all[] = {
     INSN(pcompress, 66, 0f38, 63, vl, bw, el),
     INSN(pexpand,   66, 0f38, 62, vl, bw, el),
+    INSN(pshld,     66, 0f3a, 71, vl, dq, vl),
+    INSN(pshldv,    66, 0f38, 71, vl, dq, vl),
+    INSN(pshldvw,   66, 0f38, 70, vl,  w, vl),
+    INSN(pshldw,    66, 0f3a, 70, vl,  w, vl),
+    INSN(pshrd,     66, 0f3a, 73, vl, dq, vl),
+    INSN(pshrdv,    66, 0f38, 73, vl, dq, vl),
+    INSN(pshrdvw,   66, 0f38, 72, vl,  w, vl),
+    INSN(pshrdw,    66, 0f3a, 72, vl,  w, vl),
 };
 
 static const struct test avx512_vpopcntdq_all[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -487,6 +487,7 @@ static const struct ext0f38_table {
     [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
     [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
     [0x64 ... 0x66] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x70 ... 0x73] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -611,6 +612,7 @@ static const struct ext0f3a_table {
     [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x6e ... 0x6f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
+    [0x70 ... 0x73] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x78 ... 0x79] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x7a ... 0x7b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x7c ... 0x7d] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -8939,6 +8941,16 @@ x86_emulate(
         }
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x70): /* vpshldvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x72): /* vpshrdvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        elem_bytes = 2;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x71): /* vpshldv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x73): /* vpshrdv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -10225,6 +10237,16 @@ x86_emulate(
             goto simd_imm8_zmm_scalar;
         goto avx512f_imm8_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x70): /* vpshldw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x72): /* vpshrdw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        elem_bytes = 2;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x71): /* vpshld{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x73): /* vpshrd{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC(0x0f3a, 0xcc):     /* sha1rnds4 $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(sha);
         op_bytes = 16;






* [PATCH v6 41/42] x86emul: support AVX512_4FMAPS insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (39 preceding siblings ...)
  2018-12-06 10:11   ` [PATCH v6 40/42] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
@ 2018-12-06 10:12   ` Jan Beulich
  2018-12-06 10:12   ` [PATCH v6 42/42] x86emul: support AVX512_4VNNIW insns Jan Beulich
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:12 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.
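
Purely illustrative aside, not part of the patch: one masked dword lane
of v4fmaddps, as also exercised by the test below, amounts to roughly
the following C - except that the real insn performs four sequential
fused multiply-adds, i.e. without the separate rounding of multiply and
add seen here:

/* v4fmaddps m128,zmm+3,zmm{k}: for dword lane j, accumulate the four
 * products of lane j of registers zmmN..zmmN+3 with the four floats of
 * the memory operand.  v4fnmaddps subtracts the products instead. */
static float v4fmaddps_lane(float acc, const float reg_lane_j[4],
                            const float m128[4])
{
    for ( unsigned int n = 0; n < 4; ++n )
        acc += reg_lane_j[n] * m128[n];

    return acc;
}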

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -538,6 +538,13 @@ static const struct test avx512pf_512[]
     INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
 };
 
+static const struct test avx512_4fmaps_512[] = {
+    INSN(4fmaddps,  f2, 0f38, 9a, el_4, d, vl),
+    INSN(4fmaddss,  f2, 0f38, 9b, el_4, d, vl),
+    INSN(4fnmaddps, f2, 0f38, aa, el_4, d, vl),
+    INSN(4fnmaddss, f2, 0f38, ab, el_4, d, vl),
+};
+
 static const struct test avx512_bitalg_all[] = {
     INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
@@ -941,6 +948,7 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512er, 512);
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
+    RUN(avx512_4fmaps, 512);
     RUN(avx512_bitalg, all);
     RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -4245,6 +4245,81 @@ int main(int argc, char **argv)
     }
 #endif
 
+    printf("%-40s", "Testing v4fmaddps 32(%ecx),%zmm4,%zmm4{%k5}...");
+    if ( stack_exec && cpu_has_avx512_4fmaps )
+    {
+        decl_insn(v4fmaddps);
+        static const struct {
+            float f[16];
+        } in = {{
+            1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
+        }}, out = {{
+            1 + 1 * 9 + 2 * 10 + 3 * 11 + 4 * 12,
+            2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+            16 + 16 * 9 + 17 * 10 + 18 * 11 + 19 * 12
+        }};
+
+        asm volatile ( "vmovups %1, %%zmm4\n\t"
+                       "vbroadcastss %%xmm4, %%zmm7\n\t"
+                       "vaddps %%zmm4, %%zmm7, %%zmm5\n\t"
+                       "vaddps %%zmm5, %%zmm7, %%zmm6\n\t"
+                       "vaddps %%zmm6, %%zmm7, %%zmm7\n\t"
+                       "kmovw %2, %%k5\n"
+                       put_insn(v4fmaddps,
+                                "v4fmaddps 32(%0), %%zmm4, %%zmm4%{%%k5%}")
+                       :: "c" (NULL), "m" (in), "rmk" (0x8001) );
+
+        set_insn(v4fmaddps);
+        regs.ecx = (unsigned long)&in;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(v4fmaddps) )
+            goto fail;
+
+        asm ( "vcmpeqps %1, %%zmm4, %%k0\n\t"
+              "kmovw %%k0, %0" : "=g" (rc) : "m" (out) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing v4fnmaddss 16(%edx),%zmm4,%zmm4{%k3}...");
+    if ( stack_exec && cpu_has_avx512_4fmaps )
+    {
+        decl_insn(v4fnmaddss);
+        static const struct {
+            float f[16];
+        } in = {{
+            1, 2, 3, 4, 5, 6, 7, 8
+        }}, out = {{
+            1 - 1 * 5 - 2 * 6 - 3 * 7 - 4 * 8, 2, 3, 4
+        }};
+
+        asm volatile ( "vmovups %1, %%xmm4\n\t"
+                       "vaddss %%xmm4, %%xmm4, %%xmm5\n\t"
+                       "vaddss %%xmm5, %%xmm4, %%xmm6\n\t"
+                       "vaddss %%xmm6, %%xmm4, %%xmm7\n\t"
+                       "kmovw %2, %%k3\n"
+                       put_insn(v4fnmaddss,
+                                "v4fnmaddss 16(%0), %%xmm4, %%xmm4%{%%k3%}")
+                       :: "d" (NULL), "m" (in), "rmk" (1) );
+
+        set_insn(v4fnmaddss);
+        regs.edx = (unsigned long)&in;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(v4fnmaddss) )
+            goto fail;
+
+        asm ( "vcmpeqps %1, %%zmm4, %%k0\n\t"
+              "kmovw %%k0, %0" : "=g" (rc) : "m" (out) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -146,6 +146,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
+#define cpu_has_avx512_4fmaps (cp.feat.avx512_4fmaps && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1923,6 +1923,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
+#define vcpu_has_avx512_4fmaps() vcpu_has(       7, EDX,  3, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -3202,6 +3203,18 @@ x86_decode(
                                                    state);
                     state->simd_size = simd_other;
                 }
+
+                switch ( b )
+                {
+                /* v4f{,n}madd{p,s}s need special casing */
+                case 0x9a: case 0x9b: case 0xaa: case 0xab:
+                    if ( evex.pfx == vex_f2 )
+                    {
+                        disp8scale = 4;
+                        state->simd_size = simd_128;
+                    }
+                    break;
+                }
             }
             break;
 
@@ -9340,6 +9353,24 @@ x86_emulate(
             host_and_vcpu_must_have(avx512f);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9a): /* v4fmaddps m128,zmm+3,zmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0xaa): /* v4fnmaddps m128,zmm+3,zmm{k} */
+        host_and_vcpu_must_have(avx512_4fmaps);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.br ||
+                               evex.lr != 2),
+                              EXC_UD);
+        op_mask = op_mask & 0xffff ? 0xf : 0;
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9b): /* v4fmaddss m128,xmm+3,xmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0xab): /* v4fnmaddss m128,xmm+3,xmm{k} */
+        host_and_vcpu_must_have(avx512_4fmaps);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.br ||
+                               evex.lr == 3),
+                              EXC_UD);
+        op_mask = op_mask & 1 ? 0xf : 0;
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa0): /* vpscatterd{d,q} [xyz]mm,mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa1): /* vpscatterq{d,q} [xyz]mm,mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa2): /* vscatterdp{s,d} [xyz]mm,mem{k} */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -116,6 +116,9 @@
 #define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 
+/* CPUID level 0x00000007:0.edx */
+#define cpu_has_avx512_4fmaps   boot_cpu_has(X86_FEATURE_AVX512_4FMAPS)
+
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
 





* [PATCH v6 42/42] x86emul: support AVX512_4VNNIW insns
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
                     ` (40 preceding siblings ...)
  2018-12-06 10:12   ` [PATCH v6 41/42] x86emul: support AVX512_4FMAPS insns Jan Beulich
@ 2018-12-06 10:12   ` Jan Beulich
  41 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-06 10:12 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As in a few cases before, since the insns here and in particular their
memory access patterns follow the AVX512_4FMAPS scheme, I didn't think
it was necessary to add contrived tests specifically for them, beyond
the Disp8 scaling ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.
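
Purely illustrative aside, not part of the patch: per my reading of the
SDM, one dword lane of vp4dpwssds amounts to the following (vp4dpwssd
is the same without the saturation):

#include <stdint.h>

/* Signed saturation to 32 bits. */
static int32_t sat32(int64_t v)
{
    if ( v > INT32_MAX )
        return INT32_MAX;
    if ( v < INT32_MIN )
        return INT32_MIN;
    return v;
}

/* vp4dpwssds m128,zmm+3,zmm{k}: for dword lane j, four word-pair dot
 * products, one per source register, each accumulated with signed
 * saturation.  reg_pair[n] holds the two words of lane j of zmmN+n. */
static int32_t vp4dpwssds_lane(int32_t acc, const int16_t reg_pair[4][2],
                               const int16_t m128[8])
{
    for ( unsigned int n = 0; n < 4; ++n )
        acc = sat32((int64_t)acc +
                    (int32_t)reg_pair[n][0] * m128[2 * n] +
                    (int32_t)reg_pair[n][1] * m128[2 * n + 1]);

    return acc;
}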

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -545,6 +545,11 @@ static const struct test avx512_4fmaps_5
     INSN(4fnmaddss, f2, 0f38, ab, el_4, d, vl),
 };
 
+static const struct test avx512_4vnniw_512[] = {
+    INSN(p4dpwssd,  f2, 0f38, 52, el_4, d, vl),
+    INSN(p4dpwssds, f2, 0f38, 53, el_4, d, vl),
+};
+
 static const struct test avx512_bitalg_all[] = {
     INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
@@ -949,6 +954,7 @@ void evex_disp8_test(void *instr, struct
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
     RUN(avx512_4fmaps, 512);
+    RUN(avx512_4vnniw, 512);
     RUN(avx512_bitalg, all);
     RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -146,6 +146,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
+#define cpu_has_avx512_4vnniw (cp.feat.avx512_4vnniw && xcr0_mask(0xe6))
 #define cpu_has_avx512_4fmaps (cp.feat.avx512_4fmaps && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,6 +479,7 @@ static const struct ext0f38_table {
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x52 ... 0x53] = { .simd_size = simd_128, .d8s = 4 },
     [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
@@ -1923,6 +1924,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
+#define vcpu_has_avx512_4vnniw() vcpu_has(       7, EDX,  2, ctxt, ops)
 #define vcpu_has_avx512_4fmaps() vcpu_has(       7, EDX,  3, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
@@ -8897,6 +8899,15 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x52): /* vp4dpwssd m128,zmm+3,zmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x53): /* vp4dpwssds m128,zmm+3,zmm{k} */
+        host_and_vcpu_must_have(avx512_4vnniw);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.br ||
+                               evex.lr != 2),
+                              EXC_UD);
+        op_mask = op_mask & 0xffff ? 0xf : 0;
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8f): /* vpshufbitqmb [xyz]mm/mem,[xyz]mm,k{k} */
         generate_exception_if(evex.w || !evex.r || !evex.R || evex.z, EXC_UD);
         /* fall through */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -117,6 +117,7 @@
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 
 /* CPUID level 0x00000007:0.edx */
+#define cpu_has_avx512_4vnniw   boot_cpu_has(X86_FEATURE_AVX512_4VNNIW)
 #define cpu_has_avx512_4fmaps   boot_cpu_has(X86_FEATURE_AVX512_4FMAPS)
 
 /* CPUID level 0x80000007.edx */






* [PATCH v7 00/49] x86emul: remaining AVX512 support
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (10 preceding siblings ...)
  2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
@ 2018-12-19 14:17 ` Jan Beulich
  2018-12-19 14:34   ` Jan Beulich
                     ` (49 more replies)
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
  12 siblings, 50 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:17 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

01: rename evex.br to evex.brs
02: support AVX512{F,BW} shift/rotate insns
03: support AVX512{F,BW,DQ} extract insns
04: support AVX512{F,BW,DQ} insert insns
05: basic AVX512F testing
06: support AVX512{F,BW,DQ} integer broadcast insns
07: basic AVX512VL testing
08: support AVX512{F,BW} zero- and sign-extending moves
09: support AVX512{F,BW} down conversion moves
10: support AVX512{F,BW} integer unpack insns
11: support AVX512{F,BW,_VBMI} full permute insns
12: support AVX512{F,BW} integer shuffle insns
13: support AVX512{BW,DQ} mask move insns
14: basic AVX512BW testing
15: basic AVX512DQ testing
16: support AVX512F move high/low insns
17: support AVX512F move duplicate insns
18: support AVX512{F,BW,VBMI} permute insns
19: support AVX512BW pack insns
20: support AVX512F floating-point conversion insns
21: support AVX512F legacy-equivalent packed int/FP conversion insns
22: support AVX512F legacy-equivalent scalar int/FP conversion insns
23: support AVX512DQ packed quad-int/FP conversion insns
24: support AVX512{F,DQ} uint-to-FP conversion insns
25: support AVX512{F,DQ} FP-to-uint conversion insns
26: support remaining AVX512F legacy-equivalent insns
27: support remaining AVX512BW legacy-equivalent insns
28: support AVX512{F,ER} reciprocal insns
29: support AVX512F floating point manipulation insns
30: support AVX512DQ floating point manipulation insns
31: support AVX512{F,_VBMI2} compress/expand insns
32: support remaining misc AVX512{F,BW} insns
33: support AVX512F gather insns
34: add high register S/G test cases
35: support AVX512F scatter insns
36: support AVX512PF insns
37: support AVX512CD insns
38: complete support of AVX512_VBMI insns
39: support of AVX512* population count insns
40: support of AVX512_IFMA insns
41: support remaining AVX512_VBMI2 insns
42: support AVX512_4FMAPS insns
43: support AVX512_4VNNIW insns
44: support AVX512_VNNI insns
45: support VPCLMULQDQ insns
46: support VAES insns
47: support GFNI insns
48: restore ordering within main switch statement
49: tools: re-sync CPUID leaf 7 tables

This adds support for all AVX512* insns in SDM rev 068 as well as
for those from ISA extensions rev 035. Besides a few added patches,
the main changes from v6 are fixes, mostly turned up by a few days
of fuzzing with afl.

Jan




* Re: [PATCH v7 00/49] x86emul: remaining AVX512 support
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
@ 2018-12-19 14:34   ` Jan Beulich
  2018-12-19 14:35   ` [PATCH v7 01/49] x86emul: rename evex.br to evex.brs Jan Beulich
                     ` (48 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:34 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

>>> On 19.12.18 at 15:17, <JBeulich@suse.com> wrote:
> [...]

Oh, btw - if in doubt, this version of the series may (textually)
depend on "x86emul: fix test harness and fuzzer build dependencies",
which is still lacking a tool stack maintainer ack.

Jan




* [PATCH v7 01/49] x86emul: rename evex.br to evex.brs
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
  2018-12-19 14:34   ` Jan Beulich
@ 2018-12-19 14:35   ` Jan Beulich
  2018-12-19 15:14     ` Andrew Cooper
  2018-12-19 14:36   ` [PATCH v7 02/49] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
                     ` (47 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This is to better reflect that it's an abbreviation for "broadcast,
rounding, or SAE" rather than just "broadcast".

Take the opportunity and also add SDM naming comments to both union vex
and union evex.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: New.
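
Purely illustrative aside, not part of the patch: the third EVEX
payload byte packs, from bit 0 upwards, aaa (opmsk), V' (RX), b (brs),
L'L (lr), and z. A minimal sketch of extracting the renamed field and
its neighbour without the bitfield union:

#include <stdint.h>

/* EVEX P2 bit 4 is b: broadcast, rounding, or SAE. */
static inline unsigned int evex_brs(uint8_t p2)
{
    return (p2 >> 4) & 1;
}

/* EVEX P2 bits 6:5 are L'L: vector length, or rounding control when b
 * is set on a register-only insn. */
static inline unsigned int evex_lr(uint8_t p2)
{
    return (p2 >> 5) & 3;
}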

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -615,15 +615,15 @@ static const uint8_t sse_prefix[] = { 0x
 
 union vex {
     uint8_t raw[2];
-    struct {
-        uint8_t opcx:5;
-        uint8_t b:1;
-        uint8_t x:1;
-        uint8_t r:1;
-        uint8_t pfx:2;
-        uint8_t l:1;
-        uint8_t reg:4;
-        uint8_t w:1;
+    struct {             /* SDM names */
+        uint8_t opcx:5;  /* mmmmm */
+        uint8_t b:1;     /* B */
+        uint8_t x:1;     /* X */
+        uint8_t r:1;     /* R */
+        uint8_t pfx:2;   /* pp */
+        uint8_t l:1;     /* L */
+        uint8_t reg:4;   /* vvvv */
+        uint8_t w:1;     /* W */
     };
 };
 
@@ -668,22 +668,22 @@ union vex {
 
 union evex {
     uint8_t raw[3];
-    struct {
-        uint8_t opcx:2;
+    struct {             /* SDM names */
+        uint8_t opcx:2;  /* mm */
         uint8_t mbz:2;
-        uint8_t R:1;
-        uint8_t b:1;
-        uint8_t x:1;
-        uint8_t r:1;
-        uint8_t pfx:2;
+        uint8_t R:1;     /* R' */
+        uint8_t b:1;     /* B */
+        uint8_t x:1;     /* X */
+        uint8_t r:1;     /* R */
+        uint8_t pfx:2;   /* pp */
         uint8_t mbs:1;
-        uint8_t reg:4;
-        uint8_t w:1;
-        uint8_t opmsk:3;
-        uint8_t RX:1;
-        uint8_t br:1;
-        uint8_t lr:2;
-        uint8_t z:1;
+        uint8_t reg:4;   /* vvvv */
+        uint8_t w:1;     /* W */
+        uint8_t opmsk:3; /* aaa */
+        uint8_t RX:1;    /* V' */
+        uint8_t brs:1;   /* b */
+        uint8_t lr:2;    /* L'L */
+        uint8_t z:1;     /* z */
     };
 };
 
@@ -2231,7 +2231,7 @@ static unsigned int decode_disp8scale(en
     default:
         if ( scale < d8s_vl )
             return scale;
-        if ( state->evex.br )
+        if ( state->evex.brs )
         {
     case d8s_dq:
             return 2 + state->evex.w;
@@ -5913,7 +5913,7 @@ x86_emulate(
         /* vmovs{s,d} to/from memory have only two operands. */
         if ( (b & ~1) == 0x10 && ea.type == OP_MEM )
             d |= TwoOp;
-        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
         /* fall through */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x51):    /* vsqrtp{s,d} [xyz]mm/mem,[xyz]mm{k} */
                                             /* vsqrts{s,d} xmm/m32,xmm,xmm{k} */
@@ -5924,11 +5924,11 @@ x86_emulate(
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
-                               (ea.type != OP_REG && evex.br &&
+                               (ea.type != OP_REG && evex.brs &&
                                 (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
-        if ( ea.type != OP_REG || !evex.br )
+        if ( ea.type != OP_REG || !evex.brs )
             avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
     simd_zmm:
         get_fpu(X86EMUL_FPU_zmm);
@@ -5999,7 +5999,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
-        generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
         avx512_vlen_check(false);
         goto simd_zmm;
 
@@ -6183,11 +6183,11 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2f): /* vcomis{s,d} xmm/mem,xmm */
         generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
-                               (ea.type != OP_REG && evex.br) ||
+                               (ea.type != OP_REG && evex.brs) ||
                                evex.w != evex.pfx),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
-        if ( !evex.br )
+        if ( !evex.brs )
             avx512_vlen_check(true);
         get_fpu(X86EMUL_FPU_zmm);
 
@@ -6432,7 +6432,7 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x56): /* vorp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x57): /* vxorp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
-                               (ea.type != OP_MEM && evex.br)),
+                               (ea.type != OP_MEM && evex.brs)),
                               EXC_UD);
         host_and_vcpu_must_have(avx512dq);
         avx512_vlen_check(false);
@@ -6638,7 +6638,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
-        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = 1 << (b & 1);
         goto avx512f_no_sae;
 
@@ -6663,7 +6663,7 @@ x86_emulate(
             goto avx512f_no_sae;
         }
         host_and_vcpu_must_have(avx512bw);
-        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = 1 << (ext == ext_0f ? b & 1 : evex.w);
         avx512_vlen_check(false);
         goto simd_zmm;
@@ -6717,7 +6717,7 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6e): /* vmov{d,q} r/m,xmm */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7e): /* vmov{d,q} xmm,r/m */
-        generate_exception_if((evex.lr || evex.opmsk || evex.br ||
+        generate_exception_if((evex.lr || evex.opmsk || evex.brs ||
                                evex.reg != 0xf || !evex.RX),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
@@ -6743,7 +6743,7 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_F3(0x0f, 0x7e): /* vmovq xmm/m64,xmm */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
-        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.br,
+        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.brs,
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
         d |= TwoOp;
@@ -6781,7 +6781,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7f): /* vmovdqa{32,64} [xyz]mm,[xyz]mm/mem{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f, 0x7f): /* vmovdqu{32,64} [xyz]mm,[xyz]mm/mem{k} */
     vmovdqa:
-        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
         d |= TwoOp;
         op_bytes = 16 << evex.lr;
         goto avx512f_no_sae;
@@ -7626,12 +7626,12 @@ x86_emulate(
 
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0xc2): /* vcmp{p,s}{s,d} $imm8,[xyz]mm/mem,[xyz]mm,k{k} */
         generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
-                               (ea.type != OP_REG && evex.br &&
+                               (ea.type != OP_REG && evex.brs &&
                                 (evex.pfx & VEX_PREFIX_SCALAR_MASK)) ||
                                !evex.r || !evex.R || evex.z),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
-        if ( ea.type != OP_REG || !evex.br )
+        if ( ea.type != OP_REG || !evex.brs )
             avx512_vlen_check(evex.pfx & VEX_PREFIX_SCALAR_MASK);
     simd_imm8_zmm:
         if ( (d & SrcMask) == SrcImmByte )
@@ -7687,7 +7687,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_imm8_no_sae:
         host_and_vcpu_must_have(avx512f);
-        generate_exception_if(ea.type != OP_MEM && evex.br, EXC_UD);
+        generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
         avx512_vlen_check(false);
         goto simd_imm8_zmm;
 
@@ -7921,7 +7921,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xea): /* vpminsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xee): /* vpmaxsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
-        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = b & 0x10 ? 1 : 2;
         goto avx512f_no_sae;
 
@@ -8122,7 +8122,7 @@ x86_emulate(
         break;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
-        generate_exception_if(evex.w || evex.br, EXC_UD);
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
     avx512_broadcast:
         /*
          * For the respective code below the main switch() to work we need to
@@ -8145,14 +8145,14 @@ x86_emulate(
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
                                             /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr || evex.br, EXC_UD);
+        generate_exception_if(!evex.lr || evex.brs, EXC_UD);
         if ( !evex.w )
             host_and_vcpu_must_have(avx512dq);
         goto avx512_broadcast;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
                                             /* vbroadcastf64x2 m128,{y,z}mm{k} */
-        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.br,
+        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.brs,
                               EXC_UD);
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
@@ -8304,7 +8304,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3c): /* vpmaxsb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3e): /* vpmaxuw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
-        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = b & 2 ?: 1;
         goto avx512f_no_sae;
 
@@ -8521,7 +8521,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512f);
-        if ( ea.type != OP_REG || !evex.br )
+        if ( ea.type != OP_REG || !evex.brs )
             avx512_vlen_check(false);
         goto simd_zmm;
 
@@ -8538,8 +8538,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
         host_and_vcpu_must_have(avx512f);
-        generate_exception_if(ea.type != OP_REG && evex.br, EXC_UD);
-        if ( !evex.br )
+        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
+        if ( !evex.brs )
             avx512_vlen_check(true);
         goto simd_zmm;
 
@@ -8883,7 +8883,7 @@ x86_emulate(
         if ( !(b & 0x20) )
             goto avx512f_imm8_no_sae;
         host_and_vcpu_must_have(avx512bw);
-        generate_exception_if(evex.br, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = 1 << evex.w;
         avx512_vlen_check(false);
         goto simd_imm8_zmm;
@@ -9350,7 +9350,7 @@ x86_emulate(
                                   EXC_GP, 0);
 
             EXPECT(elem_bytes > 0);
-            if ( evex.br )
+            if ( evex.brs )
             {
                 ASSERT((d & DstMask) != DstMem);
                 op_bytes = elem_bytes;
@@ -9365,7 +9365,7 @@ x86_emulate(
             {
                 if ( !op_mask )
                     goto simd_no_mem;
-                if ( !evex.br )
+                if ( !evex.brs )
                 {
                     first_byte = __builtin_ctzll(op_mask);
                     op_mask >>= first_byte;





* [PATCH v7 02/49] x86emul: support AVX512{F, BW} shift/rotate insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
  2018-12-19 14:34   ` Jan Beulich
  2018-12-19 14:35   ` [PATCH v7 01/49] x86emul: rename evex.br to evex.brs Jan Beulich
@ 2018-12-19 14:36   ` Jan Beulich
  2018-12-19 16:00     ` Andrew Cooper
  2018-12-19 14:36   ` [PATCH v7 03/49] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
                     ` (46 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that simd_packed_fp for opcode space 0f38 major opcodes 14 and 15
is not really correct, but it is sufficient for the purposes here.
Further adjustments may later be needed for the down-conversion
unsigned saturating VPMOV* insns, first and foremost for the different
Disp8 scaling those use.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Raise #UD for VPS{LL,RA,RL}VW with EVEX.W clear. Re-base.
v3: New.
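
Purely illustrative aside, not part of the patch: two sketches of what
the table changes feed into. First, the d8s values control EVEX
compressed displacement handling, where an 8-bit displacement is
implicitly scaled by the tuple size (scale below being log2 of that
size in bytes). Second, one dword lane of the newly supported vprold,
as an example of the rotate semantics:

#include <stdint.h>

/* EVEX compressed displacement: e.g. with a full zmm tuple (scale 6),
 * a disp8 of -2 addresses offset -128. */
static int32_t evex_disp8(int8_t disp8, unsigned int scale)
{
    return disp8 * (1 << scale);
}

/* vprold: rotate each dword element left by imm (mod 32). */
static uint32_t prold(uint32_t v, unsigned int imm)
{
    unsigned int c = imm & 31;

    return c ? (v << c) | (v >> (32 - c)) : v;
}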

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -178,6 +178,24 @@ static const struct test avx512f_all[] =
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
     INSN(por,          66,   0f, eb,    vl,     dq, vl),
+    INSNX(prol,        66,   0f, 72, 1, vl,     dq, vl),
+    INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
+    INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
+    INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
+    INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
+    INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
+    INSNX(psllq,       66,   0f, 73, 6, vl,      q, vl),
+    INSN(psllv,        66, 0f38, 47,    vl,     dq, vl),
+    INSNX(psra,        66,   0f, 72, 4, vl,     dq, vl),
+    INSN(psrad,        66,   0f, e2,    el_4,    d, vl),
+    INSN(psraq,        66,   0f, e2,    el_2,    q, vl),
+    INSN(psrav,        66, 0f38, 46,    vl,     dq, vl),
+    INSN(psrld,        66,   0f, d2,    el_4,    d, vl),
+    INSNX(psrld,       66,   0f, 72, 2, vl,      d, vl),
+    INSN(psrlq,        66,   0f, d3,    el_2,    q, vl),
+    INSNX(psrlq,       66,   0f, 73, 2, vl,      q, vl),
+    INSN(psrlv,        66, 0f38, 45,    vl,     dq, vl),
     INSN(psubd,        66,   0f, fa,    vl,      d, vl),
     INSN(psubq,        66,   0f, fb,    vl,      q, vl),
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
@@ -241,6 +259,17 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
+    INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
+    INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
+    INSNX(psllw,      66,   0f, 71, 6, vl,    w, vl),
+    INSN(psravw,      66, 0f38, 11,    vl,    w, vl),
+    INSN(psraw,       66,   0f, e1,    el_8,  w, vl),
+    INSNX(psraw,      66,   0f, 71, 4, vl,    w, vl),
+    INSNX(psrldq,     66,   0f, 73, 3, vl,    b, vl),
+    INSN(psrlvw,      66, 0f38, 10,    vl,    w, vl),
+    INSN(psrlw,       66,   0f, d1,    el_8,  w, vl),
+    INSNX(psrlw,      66,   0f, 71, 2, vl,    w, vl),
     INSN(psubb,       66,   0f, f8,    vl,    b, vl),
     INSN(psubsb,      66,   0f, e8,    vl,    b, vl),
     INSN(psubsw,      66,   0f, e9,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -319,7 +319,7 @@ static const struct twobyte_table {
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
-    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM },
+    [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
@@ -366,19 +366,19 @@ static const struct twobyte_table {
     [0xc7] = { ImplicitOps|ModRM },
     [0xc8 ... 0xcf] = { ImplicitOps },
     [0xd0] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xd1 ... 0xd3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xd4 ... 0xd5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xd6] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0xd7] = { DstReg|SrcImplicit|ModRM|Mov },
     [0xd8 ... 0xdf] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128 },
+    [0xf1 ... 0xf3] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xf4 ... 0xf6] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf7] = { DstMem|SrcMem|ModRM|Mov, simd_packed_int },
     [0xf8 ... 0xfe] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -434,9 +434,9 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
-    [0x10] = { .simd_size = simd_packed_int },
+    [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
-    [0x14 ... 0x16] = { .simd_size = simd_packed_fp },
+    [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x45 ... 0x47] = { .simd_size = simd_packed_int },
+    [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1 },
     [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
@@ -5993,10 +5993,15 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdf): /* vpandn{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xeb): /* vpor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xef): /* vpxor{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x14): /* vprorv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x15): /* vprolv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x39): /* vpmins{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3b): /* vpminu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3d): /* vpmaxs{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x3f): /* vpmaxu{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
@@ -6617,6 +6622,9 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
@@ -6916,6 +6924,37 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x71): /* Grp12 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsraw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512bw_shift_imm:
+            fault_suppression = false;
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512bw_imm;
+        }
+        goto unrecognized_insn;
+
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x72): /* Grp13 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpslld $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(evex.w, EXC_UD);
+            /* fall through */
+        case 0: /* vpror{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 1: /* vprol{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 4: /* vpsra{d,q} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        avx512f_shift_imm:
+            op_bytes = 16 << evex.lr;
+            state->simd_size = simd_packed_int;
+            goto avx512f_imm8_no_sae;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x73):        /* Grp14 */
         switch ( modrm_reg & 7 )
         {
@@ -6941,6 +6980,19 @@ x86_emulate(
         }
         goto unrecognized_insn;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x73): /* Grp14 */
+        switch ( modrm_reg & 7 )
+        {
+        case 2: /* vpsrlq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        case 6: /* vpsllq $imm8,[xyz]mm/mem,[xyz]mm{k} */
+            generate_exception_if(!evex.w, EXC_UD);
+            goto avx512f_shift_imm;
+        case 3: /* vpsrldq $imm8,{x,y}mm,{x,y}mm */
+        case 7: /* vpslldq $imm8,{x,y}mm,{x,y}mm */
+            goto avx512bw_shift_imm;
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC(0x0f, 0x77):        /* emms */
     case X86EMUL_OPC_VEX(0x0f, 0x77):    /* vzero{all,upper} */
         if ( vex.opcx != vex_none )
@@ -7881,6 +7933,16 @@ x86_emulate(
         }
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd2): /* vpsrld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd3): /* vpsrlq xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe2): /* vpsra{d,q} xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.brs, EXC_UD);
+        fault_suppression = false;
+        if ( b == 0xe2 )
+            goto avx512f_no_sae;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8121,6 +8183,14 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(!evex.w || evex.brs, EXC_UD);
+        elem_bytes = 2;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
         generate_exception_if(evex.w || evex.brs, EXC_UD);
     avx512_broadcast:
@@ -8882,6 +8952,7 @@ x86_emulate(
         generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
         if ( !(b & 0x20) )
             goto avx512f_imm8_no_sae;
+    avx512bw_imm:
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = 1 << evex.w;





* [PATCH v7 03/49] x86emul: support AVX512{F, BW, DQ} extract insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (2 preceding siblings ...)
  2018-12-19 14:36   ` [PATCH v7 02/49] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
@ 2018-12-19 14:36   ` Jan Beulich
  2018-12-19 18:20     ` Andrew Cooper
  2018-12-19 14:37   ` [PATCH v7 04/49] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
                     ` (45 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: Make use of d8s_dq64.
v3: New.
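
The new .d8s table values feed the EVEX disp8*N compressed displacement
logic: for a fixed value, the encoded 8-bit displacement is scaled by
1 << d8s (the d8s_dq64 and d8s_vl_by_2 encodings are instead resolved
dynamically from EVEX.W resp. the vector length). A minimal sketch of
the fixed-value case, with a hypothetical helper name that is not part
of the patch:

    #include <stdint.h>

    /* disp8 * N scaling: vextract{f,i}32x4 use .d8s = 4, so an encoded
     * displacement of 1 addresses byte offset 16. */
    static int32_t effective_disp8(int8_t disp8, unsigned int d8s)
    {
        return disp8 * (1 << d8s);
    }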

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -212,6 +212,7 @@ static const struct test avx512f_all[] =
 };
 
 static const struct test avx512f_128[] = {
+    INSN(extractps, 66, 0f3a, 17, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -221,10 +222,14 @@ static const struct test avx512f_128[] =
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+    INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
+    INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
+    INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -280,6 +285,12 @@ static const struct test avx512bw_all[]
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
 
+static const struct test avx512bw_128[] = {
+    INSN(pextrb, 66, 0f3a, 14, el, b, el),
+//       pextrw, 66,   0f, c5,     w
+    INSN(pextrw, 66, 0f3a, 15, el, w, el),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -288,13 +299,21 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_128[] = {
+    INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+};
+
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
+    INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
+    INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
@@ -632,7 +651,9 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, no128);
     RUN(avx512f, 512);
     RUN(avx512bw, all);
+    RUN(avx512bw, 128);
     RUN(avx512dq, all);
+    RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -512,9 +512,13 @@ static const struct ext0f3a_table {
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
-    [0x14 ... 0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1 },
+    [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
+    [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
+    [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
+    [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
     [0x18] = { .simd_size = simd_128 },
-    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
@@ -523,7 +527,8 @@ static const struct ext0f3a_table {
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
-    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
@@ -2676,6 +2681,8 @@ x86_decode_0f3a(
      ... X86EMUL_OPC_66(0, 0x17):     /* pextr*, extractps */
     case X86EMUL_OPC_VEX_66(0, 0x14)
      ... X86EMUL_OPC_VEX_66(0, 0x17): /* vpextr*, vextractps */
+    case X86EMUL_OPC_EVEX_66(0, 0x14)
+     ... X86EMUL_OPC_EVEX_66(0, 0x17): /* vpextr*, vextractps */
     case X86EMUL_OPC_VEX_F2(0, 0xf0): /* rorx */
         break;
 
@@ -8866,9 +8873,9 @@ x86_emulate(
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */
         rex_prefix &= ~REX_B;
-        vex.b = 1;
+        evex.b = vex.b = 1;
         if ( !mode_64bit() )
-            vex.w = 0;
+            evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
         opc[3] = 0xc3;
@@ -8878,7 +8885,10 @@ x86_emulate(
             --opc;
         }
 
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+            copy_EVEX(opc, evex);
+        else
+            copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=m" (dst.val) : "a" (&dst.val));
         put_stub(stub);
 
@@ -8903,6 +8913,52 @@ x86_emulate(
         opc = init_prefixes(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
+        generate_exception_if(ea.type != OP_REG, EXC_UD);
+        /* Convert to alternative encoding: We want to use a memory operand. */
+        evex.opcx = ext_0f3a;
+        b = 0x15;
+        modrm <<= 3;
+        evex.r = evex.b;
+        evex.R = evex.x;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x14): /* vpextrb $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x15): /* vpextrw $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x16): /* vpextr{d,q} $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x17): /* vextractps $imm8,xmm,r/m */
+        generate_exception_if((evex.lr || evex.reg != 0xf || !evex.RX ||
+                               evex.opmsk || evex.brs),
+                              EXC_UD);
+        if ( !(b & 2) )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( !(b & 1) )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto pextr;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.lr || evex.brs, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(evex.lr != 2 || evex.brs, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
     {
         uint32_t mxcsr;





* [PATCH v7 04/49] x86emul: support AVX512{F, BW, DQ} insert insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (3 preceding siblings ...)
  2018-12-19 14:36   ` [PATCH v7 03/49] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
@ 2018-12-19 14:37   ` Jan Beulich
  2019-03-14 15:35     ` Andrew Cooper
  2018-12-19 14:37   ` [PATCH v7 05/49] x86emul: basic AVX512F testing Jan Beulich
                     ` (44 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:37 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also correct the comment for the AVX form of VINSERTPS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Don't refuse to emulate VINSERTPS without AVX512VL.
v4: Make use of d8s_dq64.
v3: New.
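
The vpinsr* handling below copies the scalar source into mmvalp and then
reuses the common SIMD memory path. Semantically, each insert yields the
second source vector with a single element replaced; a sketch (helper
name hypothetical, not the emulator's actual code path):

    #include <stdint.h>

    /* vpinsrw $imm8,src,xmm2,xmm1: copy xmm2, replacing word imm8 & 7. */
    static void vpinsrw_sketch(uint16_t dst[8], const uint16_t src2[8],
                               uint16_t src, unsigned int imm8)
    {
        for ( unsigned int i = 0; i < 8; i++ )
            dst[i] = (i == (imm8 & 7)) ? src : src2[i];
    }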

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -213,6 +213,7 @@ static const struct test avx512f_all[] =
 
 static const struct test avx512f_128[] = {
     INSN(extractps, 66, 0f3a, 17, el,    d, el),
+    INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -224,12 +225,16 @@ static const struct test avx512f_no128[]
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
+    INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
+    INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
+    INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
+    INSN(inserti64x4,    66, 0f3a, 3a, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -289,6 +294,8 @@ static const struct test avx512bw_128[]
     INSN(pextrb, 66, 0f3a, 14, el, b, el),
 //       pextrw, 66,   0f, c5,     w
     INSN(pextrw, 66, 0f3a, 15, el, w, el),
+    INSN(pinsrb, 66, 0f3a, 20, el, b, el),
+    INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
 static const struct test avx512dq_all[] = {
@@ -301,6 +308,7 @@ static const struct test avx512dq_all[]
 
 static const struct test avx512dq_128[] = {
     INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+    INSN(pinsr, 66, 0f3a, 22, el, dq64, el),
 };
 
 static const struct test avx512dq_no128[] = {
@@ -308,12 +316,16 @@ static const struct test avx512dq_no128[
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
+    INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
+    INSN(inserti64x2,    66, 0f3a, 38, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
+    INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
+    INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -360,7 +360,7 @@ static const struct twobyte_table {
     [0xc1] = { DstMem|SrcReg|ModRM },
     [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
-    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
+    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int, 1 },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
     [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
@@ -516,17 +516,19 @@ static const struct ext0f3a_table {
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
     [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
-    [0x18] = { .simd_size = simd_128 },
+    [0x18] = { .simd_size = simd_128, .d8s = 4 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x20] = { .simd_size = simd_none },
-    [0x21] = { .simd_size = simd_other },
-    [0x22] = { .simd_size = simd_none },
+    [0x20] = { .simd_size = simd_none, .d8s = 0 },
+    [0x21] = { .simd_size = simd_other, .d8s = 2 },
+    [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
-    [0x38] = { .simd_size = simd_128 },
+    [0x38] = { .simd_size = simd_128, .d8s = 4 },
+    [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -2586,6 +2588,7 @@ x86_decode_twobyte(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
     case X86EMUL_OPC_VEX_66(0, 0xc4): /* vpinsrw */
+    case X86EMUL_OPC_EVEX_66(0, 0xc4): /* vpinsrw */
         state->desc = DstReg | SrcMem16;
         break;
 
@@ -2688,6 +2691,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x20):     /* pinsrb */
     case X86EMUL_OPC_VEX_66(0, 0x20): /* vpinsrb */
+    case X86EMUL_OPC_EVEX_66(0, 0x20): /* vpinsrb */
         state->desc = DstImplicit | SrcMem;
         if ( modrm_mod != 3 )
             state->desc |= ByteOp;
@@ -2695,6 +2699,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x22):     /* pinsr{d,q} */
     case X86EMUL_OPC_VEX_66(0, 0x22): /* vpinsr{d,q} */
+    case X86EMUL_OPC_EVEX_66(0, 0x22): /* vpinsr{d,q} */
         state->desc = DstImplicit | SrcMem;
         break;
 
@@ -7723,6 +7728,23 @@ x86_emulate(
         ea.type = OP_MEM;
         goto simd_0f_int_imm8;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc4):   /* vpinsrw $imm8,r32/m16,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x20): /* vpinsrb $imm8,r32/m8,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x22): /* vpinsr{d,q} $imm8,r/m,xmm,xmm */
+        generate_exception_if(evex.lr || evex.opmsk || evex.brs, EXC_UD);
+        if ( b & 2 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        if ( !mode_64bit() )
+            evex.w = 0;
+        memcpy(mmvalp, &src.val, op_bytes);
+        ea.type = OP_MEM;
+        op_bytes = src.bytes;
+        d = SrcMem16; /* Fake for the common SIMD code below. */
+        state->simd_size = simd_other;
+        goto avx512f_imm8_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0xc5):      /* pextrw $imm8,{,x}mm,reg */
     case X86EMUL_OPC_VEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
         generate_exception_if(vex.l, EXC_UD);
@@ -8939,8 +8961,12 @@ x86_emulate(
         opc = init_evex(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x18): /* vinsertf32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinsertf64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x38): /* vinserti32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinserti64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
@@ -8949,8 +8975,12 @@ x86_emulate(
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1a): /* vinsertf32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinsertf64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3a): /* vinserti32x8 $imm8,ymm/m256,zmm{k} */
+                                            /* vinserti64x4 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
         if ( !evex.w )
@@ -9043,13 +9073,20 @@ x86_emulate(
         op_bytes = 4;
         goto simd_0f3a_common;
 
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
         op_bytes = 4;
         /* fall through */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x41): /* vdppd $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.lr || evex.w || evex.opmsk || evex.brs,
+                              EXC_UD);
+        op_bytes = 4;
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )





* [PATCH v7 05/49] x86emul: basic AVX512F testing
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (4 preceding siblings ...)
  2018-12-19 14:37   ` [PATCH v7 04/49] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
@ 2018-12-19 14:37   ` Jan Beulich
  2019-03-14 15:42     ` Andrew Cooper
  2018-12-19 14:38   ` [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
                     ` (43 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:37 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have been implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Fix formatting in simd.h.
v5: Add VSQRT* tests.
v4: Make eq() also work for 4- and 8-byte integer element sizes.
v3: New.
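
Unlike their SSE/AVX counterparts, AVX512 compares produce an opmask
rather than a vector, so the new eq() variants check the resulting mask
against ALL_TRUE (one bit per element). A rough stand-alone equivalent
of the 64-byte single-precision case, written without the B()/BR()
wrapper macros (compile with -mavx512f; this mirrors what the patch's
BR(cmpps, _mask, x, y, 0, -1) expands to):

    typedef float __attribute__((vector_size(64))) v16sf_t;

    static inline _Bool eq_f32x16(v16sf_t x, v16sf_t y)
    {
        /* Predicate 0 == EQ; -1 == all-elements mask; 4 == current
         * rounding mode. */
        unsigned short m = __builtin_ia32_cmpps512_mask(x, y, 0, -1, 4);
        return m == 0xffff; /* all 16 elements compared equal */
    }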

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -19,7 +19,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -66,6 +66,9 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
+avx512f-vecs := 64
+avx512f-ints := 4 8
+avx512f-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
@@ -173,7 +176,7 @@ $(addsuffix .c,$(SG)):
 
 $(addsuffix .h,$(SIMD) $(FMA) $(SG)): simd.h
 
-xop.h: simd-fma.c
+xop.h avx512f.h: simd-fma.c
 
 endif # 32-bit override
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -2,7 +2,41 @@
 
 ENTRY(simd_test);
 
-#if VEC_SIZE == 8 && defined(__SSE__)
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if VEC_SIZE == 4
+#  define eq(x, y) ({ \
+    float x_ = (x)[0]; \
+    float __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpss $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif VEC_SIZE == 8
+#  define eq(x, y) ({ \
+    double x_ = (x)[0]; \
+    double __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpsd $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif FLOAT_SIZE == 4
+/*
+ * gcc's (up to at least 8.2) __builtin_ia32_cmpps256_mask() has an anomaly in
+ * that its return type is QI rather than UQI, and hence the value would get
+ * sign-extended before comparing to ALL_TRUE. The same oddity does not matter
+ * for __builtin_ia32_cmppd256_mask(), as there only 4 bits are significant.
+ * Hence the extra " & ALL_TRUE".
+ */
+#  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
+# elif FLOAT_SIZE == 8
+#  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif INT_SIZE == 4 || UINT_SIZE == 4
+#  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+# endif
+#elif VEC_SIZE == 8 && defined(__SSE__)
 # define to_bool(cmp) (__builtin_ia32_pmovmskb(cmp) == 0xff)
 #elif VEC_SIZE == 16
 # if defined(__AVX__) && defined(FLOAT_SIZE)
@@ -93,6 +127,56 @@ static inline bool _to_bool(byte_vec_t b
     touch(x); \
     __builtin_ia32_pfrcpit2(__builtin_ia32_pfrsqit1(__builtin_ia32_pfmul(t_, t_), x), t_); \
 })
+#elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
+# if FLOAT_SIZE == 4
+#  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
+# elif FLOAT_SIZE == 8
+#  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
+# endif
+#elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastss %1, %0" \
+          : "=v" (t_) : "m" (*(float[1]){ x }) ); \
+    t_; \
+})
+#  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
+#   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
+#   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#  endif
+# elif FLOAT_SIZE == 8
+#  if VEC_SIZE >= 32
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastsd %1, %0" : "=v" (t_) \
+          : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#  else
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#  endif
+#  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
+#   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
+#   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#  endif
+# endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
 # if VEC_SIZE == 32 && defined(__AVX__)
 #  if defined(__AVX2__)
@@ -191,7 +275,30 @@ static inline bool _to_bool(byte_vec_t b
 #  define sqrt(x) scalar_1op(x, "sqrtsd %[in], %[out]")
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE2__)
+#if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
+     defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 4 || UINT_SIZE == 4
+#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+# endif
+# if INT_SIZE == 4
+#  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
+#  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 4
+#  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# endif
+#elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
 #  define interleave_lo(x, y) ((vec_t)__builtin_ia32_punpcklbw128((vqi_t)(x), (vqi_t)(y)))
@@ -587,6 +694,10 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if defined(__AVX512F__) && defined(FLOAT_SIZE)
+# include "simd-fma.c"
+#endif
+
 int simd_test(void)
 {
     unsigned int i, j;
@@ -1034,7 +1145,8 @@ int simd_test(void)
 # endif
 #endif
 
-#if defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)
+#if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
+    (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
 #endif
 
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,9 +70,111 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE == 16
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
+# define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
+#elif VEC_SIZE == 32
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 256 ## s(a)
+#elif VEC_SIZE == 64
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 512 ## s(a)
+# define BR(n, s, a...)  __builtin_ia32_ ## n ## 512 ## s(a, 4)
+#endif
+#ifndef B_
+# define B_ B
+#endif
+#ifndef BR
+# define BR B
+# define BR_ B_
+#endif
+#ifndef BR_
+# define BR_ BR
+#endif
+
+#ifdef __AVX512F__
+
+/*
+ * The original plan was to effect use of EVEX encodings for scalar as well as
+ * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
+ * only of course) XMM16-XMM31 only. All sorts of compiler errors result when
+ * doing this with gcc 8.2. Therefore resort to injecting {evex} prefixes,
+ * which has the benefit of also working for 32-bit. Granted, there is a lot of
+ * escaping to get right here.
+ */
+asm ( ".macro override insn    \n\t"
+      ".macro $\\insn o:vararg \n\t"
+      ".purgem \\insn          \n\t"
+      "{evex} \\insn \\(\\)o   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\(\\))o     \n\t"
+      ".endm                   \n\t"
+      ".endm                   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\)o         \n\t"
+      ".endm                   \n\t"
+      ".endm" );
+
+# define OVR(n) asm ( "override v" #n )
+# define OVR_SFP(n) OVR(n ## sd); OVR(n ## ss)
+
+# ifdef __AVX512VL__
+#  ifdef __AVX512BW__
+#   define OVR_BW(n) OVR(p ## n ## b); OVR(p ## n ## w)
+#  else
+#   define OVR_BW(n)
+#  endif
+#  define OVR_DQ(n) OVR(p ## n ## d); OVR(p ## n ## q)
+#  define OVR_VFP(n) OVR(n ## pd); OVR(n ## ps)
+# else
+#  define OVR_BW(n)
+#  define OVR_DQ(n)
+#  define OVR_VFP(n)
+# endif
+
+# define OVR_FMA(n, w) OVR_ ## w(n ## 132); OVR_ ## w(n ## 213); \
+                       OVR_ ## w(n ## 231)
+# define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
+# define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
+
+OVR_SFP(broadcast);
+OVR_SFP(comi);
+OVR_FP(add);
+OVR_FP(div);
+OVR(extractps);
+OVR_FMA(fmadd, FP);
+OVR_FMA(fmsub, FP);
+OVR_FMA(fnmadd, FP);
+OVR_FMA(fnmsub, FP);
+OVR(insertps);
+OVR_FP(max);
+OVR_FP(min);
+OVR(movd);
+OVR(movq);
+OVR_SFP(mov);
+OVR_FP(mul);
+OVR_FP(sqrt);
+OVR_FP(sub);
+OVR_SFP(ucomi);
+
+# undef OVR_VFP
+# undef OVR_SFP
+# undef OVR_INT
+# undef OVR_FP
+# undef OVR_FMA
+# undef OVR_DQ
+# undef OVR_BW
+# undef OVR
+
+#endif /* __AVX512F__ */
+
 /*
  * Suppress value propagation by the compiler, preventing unwanted
  * optimization. This at once makes the compiler use memory operands
  * more often, which for our purposes is the more interesting case.
  */
 #define touch(var) asm volatile ( "" : "+m" (var) )
+
+static inline vec_t undef(void)
+{
+    vec_t v = v;
+    return v;
+}
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -1,10 +1,9 @@
+#if !defined(__XOP__) && !defined(__AVX512F__)
 #include "simd.h"
-
-#ifndef __XOP__
 ENTRY(fma_test);
 #endif
 
-#if VEC_SIZE < 16
+#if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
 #elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
@@ -24,7 +23,13 @@ ENTRY(fma_test);
 # define eq(x, y) to_bool((x) == (y))
 #endif
 
-#if VEC_SIZE == 16
+#if defined(__AVX512F__) && VEC_SIZE > FLOAT_SIZE
+# if FLOAT_SIZE == 4
+#  define fmaddsub(x, y, z) BR(vfmaddsubps, _mask, x, y, z, ~0)
+# elif FLOAT_SIZE == 8
+#  define fmaddsub(x, y, z) BR(vfmaddsubpd, _mask, x, y, z, ~0)
+# endif
+#elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
 #  define addsub(x, y) __builtin_ia32_addsubps(x, y)
 #  if defined(__FMA4__) || defined(__FMA__)
@@ -50,6 +55,10 @@ ENTRY(fma_test);
 # endif
 #endif
 
+#if defined(fmaddsub) && !defined(addsub)
+# define addsub(x, y) fmaddsub(x, broadcast(1), y)
+#endif
+
 int fma_test(void)
 {
     unsigned int i;
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -21,6 +21,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-opmask.h"
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
+#include "avx512f.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -248,6 +249,14 @@ static const struct {
     SIMD(OPMASK/b,    avx512dq_opmask,         1),
     SIMD(OPMASK/d,    avx512bw_opmask,         4),
     SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(AVX512F f32 scalar,  avx512f,        f4),
+    SIMD(AVX512F f32x16,      avx512f,      64f4),
+    SIMD(AVX512F f64 scalar,  avx512f,        f8),
+    SIMD(AVX512F f64x8,       avx512f,      64f8),
+    SIMD(AVX512F s32x16,      avx512f,      64i4),
+    SIMD(AVX512F u32x16,      avx512f,      64u4),
+    SIMD(AVX512F s64x8,       avx512f,      64i8),
+    SIMD(AVX512F u64x8,       avx512f,      64u8),
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (5 preceding siblings ...)
  2018-12-19 14:37   ` [PATCH v7 05/49] x86emul: basic AVX512F testing Jan Beulich
@ 2018-12-19 14:38   ` Jan Beulich
  2019-03-14 16:38     ` Andrew Cooper
  2018-12-19 14:38   ` [PATCH v7 07/49] x86emul: basic AVX512VL testing Jan Beulich
                     ` (42 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the pbroadcastw table entry in evex-disp8.c is slightly
different from what one would expect, due to it requiring EVEX.W to be
zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Use dummy output in invoke_stub(). Re-base.
v3: New.
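
The register-source forms (opcodes 7a...7c) get executed through a stub
with the GPR converted to %rAX; their semantics are plain element
replication. An illustrative sketch (helper name hypothetical, not the
emulator's actual code path):

    #include <stdint.h>

    /* vpbroadcastd r32,[xyz]mm: replicate a 32-bit GPR value. */
    static void vpbroadcastd_sketch(uint32_t *dst, unsigned int elems,
                                    uint32_t src)
    {
        while ( elems-- )
            dst[elems] = src;
    }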

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -164,6 +164,9 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+//       pbroadcast,   66, 0f38, 7c,          dq64
+    INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
+    INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
     INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
     INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
@@ -222,6 +225,7 @@ static const struct test avx512f_128[] =
 
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcasti32x4, 66, 0f38, 5a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
@@ -231,6 +235,7 @@ static const struct test avx512f_no128[]
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(broadcasti64x4, 66, 0f38, 5b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
     INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
@@ -250,6 +255,10 @@ static const struct test avx512bw_all[]
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
+//       pbroadcastb, 66, 0f38, 7a,           b
+    INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
+//       pbroadcastw, 66, 0f38, 7b,           w
     INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
@@ -301,6 +310,7 @@ static const struct test avx512bw_128[]
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
+    INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
@@ -314,6 +324,7 @@ static const struct test avx512dq_128[]
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(broadcasti64x2, 66, 0f38, 5a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
     INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
@@ -322,6 +333,7 @@ static const struct test avx512dq_no128[
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(broadcasti32x8, 66, 0f38, 5b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
     INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -278,9 +278,33 @@ static inline bool _to_bool(byte_vec_t b
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if INT_SIZE == 4 || UINT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastd %1, %0" \
+          : "=v" (t_) : "m" (*(int[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(long long[1]){ x }) ); \
+    t_; \
+})
+#  ifdef __x86_64__
+#   define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastq %1, %0" : "=v" (t_) : "r" ((x) + 0ULL) ); \
+    t_; \
+})
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
@@ -977,10 +1001,14 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
-#if defined(broadcast)
+#ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#ifdef broadcast2
+    if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -454,9 +454,13 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
-    [0x5a] = { .simd_size = simd_128, .two_op = 1 },
-    [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
+    [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
+    [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
+    [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x78] = { .simd_size = simd_other, .two_op = 1 },
+    [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
+    [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -2636,6 +2640,11 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
+    case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
+    case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
+        break;
+
     case 0xf0: /* movbe / crc32 */
         state->desc |= repne_prefix() ? ByteOp : Mov;
         if ( rep_prefix() )
@@ -8221,6 +8230,8 @@ x86_emulate(
         goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
+        op_bytes = elem_bytes;
         generate_exception_if(evex.w || evex.brs, EXC_UD);
     avx512_broadcast:
         /*
@@ -8240,17 +8251,27 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
                                             /* vbroadcastf64x4 m256,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
+                                            /* vbroadcasti64x4 m256,zmm{k} */
         generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
                                             /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr || evex.brs, EXC_UD);
+        generate_exception_if(!evex.lr, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
+                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
+        if ( b == 0x59 )
+            op_bytes = 8;
+        generate_exception_if(evex.brs, EXC_UD);
         if ( !evex.w )
             host_and_vcpu_must_have(avx512dq);
         goto avx512_broadcast;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
                                             /* vbroadcastf64x2 m128,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
         generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.brs,
                               EXC_UD);
         if ( evex.w )
@@ -8444,6 +8465,45 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
+        op_bytes = elem_bytes = 1 << (b & 1);
+        /* See the comment at the avx512_broadcast label. */
+        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
+        generate_exception_if((ea.type != OP_REG || evex.brs ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0xf8;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "=g" (dummy) : "a" (src.val));
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v7 07/49] x86emul: basic AVX512VL testing
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (6 preceding siblings ...)
  2018-12-19 14:38   ` [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
@ 2018-12-19 14:38   ` Jan Beulich
  2019-03-14 16:39     ` Andrew Cooper
  2018-12-19 14:41   ` [PATCH v7 08/49] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
                     ` (41 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test the 128- and 256-bit variants of the insns which have been
implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Don't enable AVX512VL for scalar tests, nor for S/G ones with index
    wider than data. Re-base over changes earlier in the series.
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.
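
The REN() macro added to simd.h renames a mnemonic at the assembler
level, so that code written against the pre-AVX512 spellings picks up
the suffixed AVX512VL encodings. Hand-expanding REN(pand, , q), purely
for illustration, yields:

    asm ( ".macro vpand o:vararg \n\t"
          "vpandq \\o            \n\t"
          ".endm" );

After this, every vpand the compiler emits is assembled as vpandq, i.e.
the EVEX-encoded quadword-element form.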

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -66,7 +66,7 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
-avx512f-vecs := 64
+avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
 
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -5,13 +5,13 @@ ENTRY(fma_test);
 
 #if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
-#elif VEC_SIZE == 16
+#elif VEC_SIZE == 16 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
 #  define to_bool(cmp) __builtin_ia32_vtestcpd(cmp, (vec_t){} == 0)
 # endif
-#elif VEC_SIZE == 32
+#elif VEC_SIZE == 32 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps256(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -539,7 +539,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define rotr(x, n) ((vec_t)__builtin_ia32_palignr128((vdi_t)(x), (vdi_t)(x), (n) * 64))
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE4_1__)
+#if VEC_SIZE == 16 && defined(__SSE4_1__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)__builtin_ia32_pmaxsb128((vqi_t)(x), (vqi_t)(y)))
 #  define min(x, y) ((vec_t)__builtin_ia32_pminsb128((vqi_t)(x), (vqi_t)(y)))
@@ -593,7 +593,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define mix(x, y) __builtin_ia32_blendpd(x, y, 0b10)
 # endif
 #endif
-#if VEC_SIZE == 32 && defined(__AVX__)
+#if VEC_SIZE == 32 && defined(__AVX__) && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define dot_product(x, y) ({ \
     vec_t t_ = __builtin_ia32_dpps256(x, y, 0b11110001); \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -92,6 +92,15 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+# if VEC_SIZE > ELEM_SIZE && (defined(VEC_MAX) ? VEC_MAX : VEC_SIZE) < 64
+#  pragma GCC target ( "avx512vl" )
+# endif
+
+# define REN(insn, old, new)                     \
+    asm ( ".macro v" #insn #old " o:vararg \n\t" \
+          "v" #insn #new " \\o             \n\t" \
+          ".endm" )
+
 /*
  * The original plan was to effect use of EVEX encodings for scalar as well as
  * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
@@ -135,25 +144,88 @@ asm ( ".macro override insn    \n\t"
 # define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
 # define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
 
+OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
+OVR_INT(add);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
+OVR_FMA(fmaddsub, VFP);
 OVR_FMA(fmsub, FP);
+OVR_FMA(fmsubadd, VFP);
 OVR_FMA(fnmadd, FP);
 OVR_FMA(fnmsub, FP);
 OVR(insertps);
 OVR_FP(max);
+OVR_INT(maxs);
+OVR_INT(maxu);
 OVR_FP(min);
+OVR_INT(mins);
+OVR_INT(minu);
 OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
+OVR_VFP(mova);
+OVR_VFP(movnt);
+OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(shuf);
+OVR_INT(sll);
+OVR_DQ(sllv);
 OVR_FP(sqrt);
+OVR_INT(sra);
+OVR_DQ(srav);
+OVR_INT(srl);
+OVR_DQ(srlv);
 OVR_FP(sub);
+OVR_INT(sub);
 OVR_SFP(ucomi);
+OVR_VFP(unpckh);
+OVR_VFP(unpckl);
+
+# ifdef __AVX512VL__
+#  if ELEM_SIZE == 8 && defined(__AVX512DQ__)
+REN(extract, f128, f64x2);
+REN(extract, i128, i64x2);
+REN(insert, f128, f64x2);
+REN(insert, i128, i64x2);
+#  else
+REN(extract, f128, f32x4);
+REN(extract, i128, i32x4);
+REN(insert, f128, f32x4);
+REN(insert, i128, i32x4);
+#  endif
+#  if ELEM_SIZE == 8
+REN(movdqa, , 64);
+REN(movdqu, , 64);
+REN(pand, , q);
+REN(pandn, , q);
+REN(por, , q);
+REN(pxor, , q);
+#  else
+#   if ELEM_SIZE == 1 && defined(__AVX512BW__)
+REN(movdq, a, u8);
+REN(movdqu, , 8);
+#   elif ELEM_SIZE == 2 && defined(__AVX512BW__)
+REN(movdq, a, u16);
+REN(movdqu, , 16);
+#   else
+REN(movdqa, , 32);
+REN(movdqu, , 32);
+#   endif
+REN(pand, , d);
+REN(pandn, , d);
+REN(por, , d);
+REN(pxor, , d);
+#  endif
+OVR(movntdq);
+OVR(movntdqa);
+OVR(pmulld);
+OVR(pmuldq);
+OVR(pmuludq);
+# endif
 
 # undef OVR_VFP
 # undef OVR_SFP
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -88,6 +88,11 @@ static bool simd_check_avx512f(void)
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
 
+static bool simd_check_avx512f_vl(void)
+{
+    return cpu_has_avx512f && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512dq(void)
 {
     return cpu_has_avx512dq;
@@ -142,11 +147,21 @@ static const struct {
       .check_cpu = simd_check_ ## feat,                             \
       .set_regs = simd_set_regs,                                    \
       .check_regs = simd_check_regs }
+#define AVX512VL_(bits, desc, feat, form)                          \
+    { .code = feat ## _x86_ ## bits ## _D ## _ ## form,            \
+      .size = sizeof(feat ## _x86_ ## bits ## _D ## _ ## form),    \
+      .bitness = bits, .name = "AVX512" #desc,                     \
+      .check_cpu = simd_check_ ## feat ## _vl,                     \
+      .set_regs = simd_set_regs,                                   \
+      .check_regs = simd_check_regs }
 #ifdef __x86_64__
 # define SIMD(desc, feat, form) SIMD_(64, desc, feat, form), \
                                 SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(64, desc, feat, form), \
+                                    AVX512VL_(32, desc, feat, form)
 #else
 # define SIMD(desc, feat, form) SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(32, desc, feat, form)
 #endif
     SIMD(3DNow! single,          _3dnow,     8f4),
     SIMD(SSE scalar single,      sse,         f4),
@@ -257,6 +272,20 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    AVX512VL(VL f32x4,        avx512f,      16f4),
+    AVX512VL(VL f64x2,        avx512f,      16f8),
+    AVX512VL(VL f32x8,        avx512f,      32f4),
+    AVX512VL(VL f64x4,        avx512f,      32f8),
+    AVX512VL(VL s32x4,        avx512f,      16i4),
+    AVX512VL(VL u32x4,        avx512f,      16u4),
+    AVX512VL(VL s32x8,        avx512f,      32i4),
+    AVX512VL(VL u32x8,        avx512f,      32u4),
+    AVX512VL(VL s64x2,        avx512f,      16i8),
+    AVX512VL(VL u64x2,        avx512f,      16u8),
+    AVX512VL(VL s64x4,        avx512f,      32i8),
+    AVX512VL(VL u64x4,        avx512f,      32u8),
+#undef AVX512VL_
+#undef AVX512VL
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v7 08/49] x86emul: support AVX512{F, BW} zero- and sign-extending moves
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (7 preceding siblings ...)
  2018-12-19 14:38   ` [PATCH v7 07/49] x86emul: basic AVX512VL testing Jan Beulich
@ 2018-12-19 14:41   ` Jan Beulich
  2018-12-19 14:41   ` [PATCH v7 09/49] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
                     ` (40 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:41 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the testing in simd.c doesn't really follow the ISA extension
pattern: to fit the scheme, extensions from byte- and word-granular
vectors can (currently) sensibly happen only in the AVX512BW case (and
hence the respective abstraction macros will be added there rather than
here).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Raise #UD when EVEX.b is set. Re-base.
v3: New.
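
The widening moves simply sign- or zero-extend each source element; the
op_bytes computation below only needs to derive how much memory the
narrower source occupies. Semantics sketch (helper name hypothetical,
not the emulator's actual code path):

    #include <stdint.h>

    /* vpmovsxbd: sign-extend each byte element to a dword. */
    static void pmovsxbd_sketch(int32_t *dst, const int8_t *src,
                                unsigned int elems)
    {
        for ( unsigned int i = 0; i < elems; i++ )
            dst[i] = src[i]; /* implicit sign extension to 32 bits */
    }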

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,6 +177,16 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
+    INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
+    INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
+    INSN(pmovzxwq,     66, 0f38, 34,    vl_4,    w, vl),
+    INSN(pmovzxdq,     66, 0f38, 35,    vl_2, d_nb, vl),
     INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
@@ -274,6 +284,8 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -443,13 +443,23 @@ static const struct ext0f38_table {
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
+    [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x23] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x24] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
-    [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
+    [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x32] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x33] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x34] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x35] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
@@ -8337,6 +8347,25 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x25): /* vpmovsxdq {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.brs || (evex.w && (b & 7) == 5), EXC_UD);
+        op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
+        elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -311,10 +311,12 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxdq, _mask, x, (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 4
 #  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -222,6 +222,16 @@ REN(pxor, , d);
 #  endif
 OVR(movntdq);
 OVR(movntdqa);
+OVR(pmovsxbd);
+OVR(pmovsxbq);
+OVR(pmovsxdq);
+OVR(pmovsxwd);
+OVR(pmovsxwq);
+OVR(pmovzxbd);
+OVR(pmovzxbq);
+OVR(pmovzxdq);
+OVR(pmovzxwd);
+OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
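
For reference, a worked instance of the op_bytes calculation in the
vpmov hunk above, assuming pmov_convert_delta[] holds { 1, 2, 3, 1, 2, 1 }
for opcodes 0x20-0x25 (the array isn't visible in this hunk; the values
have to match the VEX path, where e.g. vpmovsxbw reads half a vector):

    vpmovsxbw ymm/mem -> zmm (evex.lr = 2): 32 >> (1 + 1 - 2) = 32 bytes
    vpmovsxbd xmm/mem -> zmm (evex.lr = 2): 32 >> (2 + 1 - 2) = 16 bytes
    vpmovsxbq xmm/mem -> xmm (evex.lr = 0): 32 >> (3 + 1 - 0) =  2 bytes

In each case the memory operand is the destination vector width divided
by the insn's widening factor.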





* [PATCH v7 09/49] x86emul: support AVX512{F, BW} down conversion moves
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (8 preceding siblings ...)
  2018-12-19 14:41   ` [PATCH v7 08/49] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
@ 2018-12-19 14:41   ` Jan Beulich
  2018-12-19 14:42   ` [PATCH v7 10/49] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
                     ` (39 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:41 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the vpmov{,s,us}{d,q}w table entries in evex-disp8.c are
slightly different from what one would expect, because these insns
require EVEX.W to be zero.
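
As a reminder of the semantics being emulated here (a minimal C sketch
of mine, not patch code): the plain forms truncate, while the s/us
forms saturate. For the dword-to-word case:

    #include <stdint.h>

    static uint16_t pmovdw_elem(uint32_t v)    /* vpmovdw: truncate */
    {
        return v;
    }

    static uint16_t pmovsdw_elem(int32_t v)    /* vpmovsdw: signed saturation */
    {
        return v < INT16_MIN ? INT16_MIN : v > INT16_MAX ? INT16_MAX : v;
    }

    static uint16_t pmovusdw_elem(uint32_t v)  /* vpmovusdw: unsigned saturation */
    {
        return v > UINT16_MAX ? UINT16_MAX : v;
    }

The result occupies only a half, quarter, or eighth of a register or
memory operand, which is what the vl_2 / vl_4 / vl_8 columns below
express.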

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: ea.type == OP_* -> ea.type != OP_*. Re-base over change in previous
    patch. Re-base.
v5: Also adjust x86_insn_is_mem_write().
v4: Also #UD when evex.z is set with a memory operand.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,11 +177,26 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovdb,       f3, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovdw,       f3, 0f38, 33,    vl_2,    b, vl),
+    INSN(pmovqb,       f3, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovqd,       f3, 0f38, 35,    vl_2, d_nb, vl),
+    INSN(pmovqw,       f3, 0f38, 34,    vl_4,    b, vl),
+    INSN(pmovsdb,      f3, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsdw,      f3, 0f38, 23,    vl_2,    b, vl),
+    INSN(pmovsqb,      f3, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsqd,      f3, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovsqw,      f3, 0f38, 24,    vl_4,    b, vl),
     INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
     INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
     INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
     INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
     INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovusdb,     f3, 0f38, 11,    vl_4,    b, vl),
+    INSN(pmovusdw,     f3, 0f38, 13,    vl_2,    b, vl),
+    INSN(pmovusqb,     f3, 0f38, 12,    vl_8,    b, vl),
+    INSN(pmovusqd,     f3, 0f38, 15,    vl_2, d_nb, vl),
+    INSN(pmovusqw,     f3, 0f38, 14,    vl_4,    b, vl),
     INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
     INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
     INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
@@ -284,7 +299,10 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+    INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -277,6 +277,17 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextracti{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextracti32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -291,6 +302,7 @@ static inline bool _to_bool(byte_vec_t b
 })
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -720,6 +732,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if VEC_SIZE >= 16
+
+# if !defined(low_half) && defined(HALF_SIZE)
+static inline half_t low_half(vec_t x)
+{
+#  if HALF_SIZE < VEC_SIZE
+    half_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 2; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1087,6 +1120,21 @@ int simd_test(void)
 
 #endif
 
+#if defined(widen1) && defined(shrink1)
+    {
+        half_t aux1 = low_half(src), aux2;
+
+        touch(aux1);
+        x = widen1(aux1);
+        touch(x);
+        aux2 = shrink1(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 2; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
 #ifdef dup_lo
     touch(src);
     x = dup_lo(src);
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,23 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE >= 16
+
+# if ELEM_COUNT >= 2
+#  if VEC_SIZE > 32
+#   define HALF_SIZE (VEC_SIZE / 2)
+#  else
+#   define HALF_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(HALF_SIZE))) half_t;
+typedef char __attribute__((vector_size(HALF_SIZE))) vqi_half_t;
+typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
+typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
+typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+# endif
+
+#endif
+
 #if VEC_SIZE == 16
 # define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
 # define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3068,7 +3068,22 @@ x86_decode(
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
             if ( evex_encoded() )
-                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+            {
+                /*
+                 * VPMOVUS* are identical to VPMOVS* Disp8-scaling-wise, but
+                 * their attributes don't match those of the vex_66 encoded
+                 * insns with the same base opcodes. Rather than adding new
+                 * columns to the table, handle this here for now.
+                 */
+                if ( evex.pfx != vex_f3 || (b & 0xf8) != 0x10 )
+                    disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+                else
+                {
+                    disp8scale = decode_disp8scale(ext0f38_table[b + 0x10].d8s,
+                                                   state);
+                    state->simd_size = simd_other;
+                }
+            }
             break;
 
         case ext_0f3a:
@@ -8347,10 +8362,14 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10): /* vpmovuswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20): /* vpmovswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30): /* vpmovwb [xyz]mm,{x,y}mm/mem{k} */
         host_and_vcpu_must_have(avx512bw);
-        /* fall through */
+        if ( evex.pfx != vex_f3 )
+        {
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
@@ -8361,7 +8380,29 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
-        generate_exception_if(evex.brs || (evex.w && (b & 7) == 5), EXC_UD);
+            generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        }
+        else
+        {
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x11): /* vpmovusdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x12): /* vpmovusqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x13): /* vpmovusdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x14): /* vpmovusqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* vpmovusqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x21): /* vpmovsdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x22): /* vpmovsqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x23): /* vpmovsdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x24): /* vpmovsqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* vpmovsqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x31): /* vpmovdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x32): /* vpmovqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x33): /* vpmovdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x34): /* vpmovqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* vpmovqd [xyz]mm,{x,y}mm/mem{k} */
+            generate_exception_if(evex.w || (ea.type != OP_REG && evex.z), EXC_UD);
+            d = DstMem | SrcReg | TwoOp;
+        }
+        generate_exception_if(evex.brs, EXC_UD);
         op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
@@ -10200,6 +10241,12 @@ x86_insn_is_mem_write(const struct x86_e
     case X86EMUL_OPC(0x0f, 0xab):        /* BTS */
     case X86EMUL_OPC(0x0f, 0xb3):        /* BTR */
     case X86EMUL_OPC(0x0f, 0xbb):        /* BTC */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* VPMOVUS* */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* VPMOVS* */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* VPMOV{D,Q,W}* */
         return true;
 
     case 0xd9:
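
A side note on the x86_emulate() changes above: they rely on C allowing
case labels inside both arms of an if/else within a switch, so the
F3-prefixed (down-converting store) and 66-prefixed (extending load)
groups each get their own setup while sharing the surrounding code. A
standalone sketch of the construct (not emulator code):

    static int handle(unsigned int op, int f3)
    {
        int bytes = 0;

        switch ( op )
        {
        case 0x20:
            if ( !f3 )
            {
        case 0x21:
                bytes = 1;   /* reached for 0x21, or for 0x20 with !f3 */
            }
            else
            {
        case 0x11:
                bytes = 2;   /* reached for 0x11, or for 0x20 with f3 */
            }
            return bytes;    /* shared tail */
        }
        return -1;
    }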





* [PATCH v7 10/49] x86emul: support AVX512{F, BW} integer unpack insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (9 preceding siblings ...)
  2018-12-19 14:41   ` [PATCH v7 09/49] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
@ 2018-12-19 14:42   ` Jan Beulich
  2018-12-19 14:42   ` [PATCH v7 11/49] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
                     ` (38 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:42 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Once again there's one extra twobyte_table[] entry which gets its Disp8
shift value set right away, without support for the insn itself being
implemented just yet; this again avoids needlessly splitting groups of
entries.
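
For background on those Disp8 shift values (my summary, not part of the
patch): EVEX-encoded memory operands use compressed displacements,
where a one-byte displacement is implicitly multiplied by the insn's
natural access size N, and the d8s table fields record log2(N). A
sketch of the architectural rule:

    #include <stdint.h>

    static int32_t effective_disp(int8_t disp8, unsigned int disp8scale)
    {
        /* EVEX Disp8*N: scale the stored byte by N = 1 << disp8scale. */
        return disp8 * (1 << disp8scale);
    }

A full-width zmm access thus has disp8scale = 6, making a stored
displacement of 1 stand for 64 bytes.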

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base over changes earlier in the series.
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -229,6 +229,10 @@ static const struct test avx512f_all[] =
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
+    INSN(punpckhdq,    66,   0f, 6a,    vl,      d, vl),
+    INSN(punpckhqdq,   66,   0f, 6d,    vl,      q, vl),
+    INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
+    INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -327,6 +331,10 @@ static const struct test avx512bw_all[]
     INSN(psubw,       66,   0f, f9,    vl,    w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
+    INSN(punpckhbw,   66,   0f, 68,    vl,    b, vl),
+    INSN(punpckhwd,   66,   0f, 69,    vl,    w, vl),
+    INSN(punpcklbw,   66,   0f, 60,    vl,    b, vl),
+    INSN(punpcklwd,   66,   0f, 61,    vl,    w, vl),
 };
 
 static const struct test avx512bw_128[] = {
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -300,6 +300,10 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
@@ -317,6 +321,10 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -252,6 +252,10 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(punpckhdq);
+OVR(punpckhqdq);
+OVR(punpckldq);
+OVR(punpcklqdq);
 # endif
 
 # undef OVR_VFP
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -312,10 +312,10 @@ static const struct twobyte_table {
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
+    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
@@ -6668,6 +6668,12 @@ x86_emulate(
         get_fpu(X86EMUL_FPU_mmx);
         goto simd_0f_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6696,6 +6702,13 @@ x86_emulate(
         elem_bytes = 1 << (b & 1);
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x62): /* vpunpckldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6a): /* vpunpckhdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        fault_suppression = false;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
         op_bytes = 16 << evex.lr;
@@ -6722,6 +6735,10 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
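
For completeness, a C sketch of what the (v)punpckldq forms handled
above do within one 128-bit lane (illustrative only, not emulator
code); the wider EVEX forms simply repeat this per lane:

    #include <stdint.h>

    static void punpckldq_lane(uint32_t dst[4],
                               const uint32_t s1[4], const uint32_t s2[4])
    {
        /* Interleave the low two dwords of the two sources. */
        dst[0] = s1[0];
        dst[1] = s2[0];
        dst[2] = s1[1];
        dst[3] = s2[1];
    }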





* [PATCH v7 11/49] x86emul: support AVX512{F, BW, _VBMI} full permute insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (10 preceding siblings ...)
  2018-12-19 14:42   ` [PATCH v7 10/49] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
@ 2018-12-19 14:42   ` Jan Beulich
  2018-12-19 14:43   ` [PATCH v7 12/49] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
                     ` (37 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:42 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Take the liberty of also correcting the (public interface) name of the
AVX512_VBMI feature flag, on the assumption that no external consumer
has actually been using that flag so far. Furthermore make it have
AVX512BW instead of AVX512F as a prerequisite, since it requires full
64-bit mask registers (the upper 48 bits of which can't be accessed
other than through XSAVE/XRSTOR without AVX512BW support).
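
To illustrate the mask register aspect (my sketch, not patch code,
assuming a compiler and assembler with AVX512BW support): a
byte-granular operation on a zmm register consumes one opmask bit per
byte, i.e. all 64 bits, and the 64-bit opmask accessors are themselves
AVX512BW insns:

    static unsigned long long all_mask_bits(void)
    {
        unsigned long long m;

        asm ( "kxnorq %%k1, %%k1, %%k1\n\t" /* set all 64 bits (AVX512BW) */
              "kmovq  %%k1, %0"             /* read them back (AVX512BW)  */
              : "=r" (m) :: "k1" );
        return m;
    }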

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v5: Re-base.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -173,6 +173,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
+    INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -294,6 +298,8 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
+    INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -378,6 +384,11 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512_vbmi_all[] = {
+    INSN(permi2b,       66, 0f38, 75, vl, b, vl),
+    INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -718,4 +729,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -150,6 +150,9 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#  else
+#   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
+#   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -175,6 +178,9 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#  else
+#   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
+#   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -303,6 +309,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -324,6 +333,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
@@ -769,6 +781,7 @@ int simd_test(void)
 {
     unsigned int i, j;
     vec_t x, y, z, src, inv, alt, sh;
+    vint_t interleave_lo, interleave_hi;
 
     for ( i = 0, j = ELEM_SIZE << 3; i < ELEM_COUNT; ++i )
     {
@@ -782,6 +795,9 @@ int simd_test(void)
         if ( !(i & (i + 1)) )
             --j;
         sh[i] = j;
+
+        interleave_lo[i] = ((i & 1) * ELEM_COUNT) | (i >> 1);
+        interleave_hi[i] = interleave_lo[i] + (ELEM_COUNT / 2);
     }
 
     touch(src);
@@ -1075,7 +1091,7 @@ int simd_test(void)
     x = src * alt;
     y = interleave_lo(x, alt < 0);
     touch(x);
-    z = widen1(x);
+    z = widen1(low_half(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1107,7 +1123,7 @@ int simd_test(void)
 
 # ifdef widen1
     touch(src);
-    x = widen1(src);
+    x = widen1(low_half(src));
     touch(src);
     if ( !eq(x, y) ) return __LINE__;
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,16 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if ELEM_SIZE == 1
+typedef vqi_t vint_t;
+#elif ELEM_SIZE == 2
+typedef vhi_t vint_t;
+#elif ELEM_SIZE == 4
+typedef vsi_t vint_t;
+#elif ELEM_SIZE == 8
+typedef vdi_t vint_t;
+#endif
+
 #if VEC_SIZE >= 16
 
 # if ELEM_COUNT >= 2
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -136,6 +136,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
+#define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -468,9 +468,13 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
     [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
+    [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -1861,6 +1865,7 @@ static bool vcpu_has(
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
+#define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -6033,6 +6038,11 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x76): /* vpermi2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x77): /* vpermi2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7e): /* vpermt2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7f): /* vpermt2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8552,6 +8562,16 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512_vbmi);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.brs, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -108,6 +108,7 @@
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
 /* CPUID level 0x00000007:0.ecx */
+#define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
 /* CPUID level 0x80000007.edx */
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -224,7 +224,7 @@ XEN_CPUFEATURE(AVX512VL,      5*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.ecx, word 6 */
 XEN_CPUFEATURE(PREFETCHWT1,   6*32+ 0) /*A  PREFETCHWT1 instruction */
-XEN_CPUFEATURE(AVX512VBMI,    6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -259,12 +259,17 @@ def crunch_numbers(state):
         AVX2: [AVX512F],
 
         # AVX512F is taken to mean hardware support for 512bit registers
-        # (which in practice depends on the EVEX prefix to encode), and the
-        # instructions themselves. All further AVX512 features are built on
-        # top of AVX512F
+        # (which in practice depends on the EVEX prefix to encode) as well
+        # as mask registers, and the instructions themselves. All further
+        # AVX512 features are built on top of AVX512F
         AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
-                  AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
-                  AVX512_4FMAPS, AVX512_VPOPCNTDQ],
+                  AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
+                  AVX512_VPOPCNTDQ],
+
+        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # dependents of AVX512BW (as they require wider than 16-bit mask
+        # registers), despite the SDM not formally making this connection.
+        AVX512BW: [AVX512_VBMI],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors
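
A C model of what vpermi2d computes, per 32-bit element (illustrative,
not emulator code): the destination initially holds the indices, and
each result element comes from the concatenation of the two sources,
with the index's top used bit selecting between them:

    #include <stdint.h>

    static void permi2d(uint32_t dst[], const uint32_t s1[],
                        const uint32_t s2[], unsigned int n /* power of 2 */)
    {
        unsigned int i;

        for ( i = 0; i < n; ++i )
        {
            uint32_t idx = dst[i] & (2 * n - 1);

            dst[i] = idx < n ? s1[idx] : s2[idx - n];
        }
    }

This is also why the index setup in simd_test() above uses
((i & 1) * ELEM_COUNT) | (i >> 1): even result elements come from one
source and odd ones from the other, yielding an interleave.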





* [PATCH v7 12/49] x86emul: support AVX512{F, BW} integer shuffle insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (11 preceding siblings ...)
  2018-12-19 14:42   ` [PATCH v7 11/49] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
@ 2018-12-19 14:43   ` Jan Beulich
  2018-12-19 14:43   ` [PATCH v7 13/49] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
                     ` (36 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:43 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also include vshuff{32x4,64x2}, as they're very similar to
vshufi{32x4,64x2}.
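
To recap the semantics (a C model of mine, not emulator code):
vshufi32x4 selects whole 128-bit lanes via imm8, with the low half of
the result coming from the first source and the high half from the
second. For the 512-bit form:

    #include <stdint.h>

    typedef struct { uint32_t d[4]; } lane128_t;

    static void shufi32x4_512(lane128_t dst[4], const lane128_t s1[4],
                              const lane128_t s2[4], uint8_t imm)
    {
        dst[0] = s1[imm & 3];
        dst[1] = s1[(imm >> 2) & 3];
        dst[2] = s2[(imm >> 4) & 3];
        dst[3] = s2[(imm >> 6) & 3];
    }

With both sources being the same register and imm8 = 0b00011011 this
reverses the lane order, which is what the new swap() test macros below
rely on.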

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Disable fault suppression for VPSHUF{D,{H,L}W}. Re-base.
v6: Re-base over changes earlier in the series.
v5: Re-base over changes earlier in the series.
v4: Move OVR() addition into __AVX512VL__ conditional. Correct comments.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -214,6 +214,7 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
     INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
@@ -264,6 +265,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
+    INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
+    INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
+    INSN(shufi64x2,      66, 0f3a, 43, vl,    q, vl),
 };
 
 static const struct test avx512f_512[] = {
@@ -318,6 +323,9 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSN(pshufb,      66, 0f38, 00,    vl,    b, vl),
+    INSN(pshufhw,     f3,   0f, 70,    vl,    w, vl),
+    INSN(pshuflw,     f2,   0f, 70,    vl,    w, vl),
     INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
     INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
     INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -153,6 +153,10 @@ static inline bool _to_bool(byte_vec_t b
 #  else
 #   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
+#   define swap(x) ({ \
+    vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
+})
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -181,6 +185,10 @@ static inline bool _to_bool(byte_vec_t b
 #  else
 #   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
+#   define swap(x) ({ \
+    vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
+})
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -309,9 +317,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
+                               VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
+                             0b00011011, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -333,9 +346,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b01001110, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
+                                      VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -119,6 +119,12 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+/* Sadly there are a few exceptions to the general naming rules. */
+# define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
+# define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
+# define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
+# define __builtin_ia32_shuf_i64x2_512_mask __builtin_ia32_shuf_i64x2_mask
+
 # if VEC_SIZE > ELEM_SIZE && (defined(VEC_MAX) ? VEC_MAX : VEC_SIZE) < 64
 #  pragma GCC target ( "avx512vl" )
 # endif
@@ -262,6 +268,7 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(pshufd);
 OVR(punpckhdq);
 OVR(punpckhqdq);
 OVR(punpckldq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -318,7 +318,7 @@ static const struct twobyte_table {
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
+    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other, d8s_vl },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
@@ -432,7 +432,8 @@ static const struct ext0f38_table {
     uint8_t vsib:1;
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
-    [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
@@ -543,6 +544,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
+    [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
@@ -552,6 +554,7 @@ static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
+    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6689,6 +6692,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6944,6 +6948,21 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x70): /* vpshufd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x70): /* vpshufhw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x70): /* vpshuflw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx == vex_66 )
+            generate_exception_if(evex.w, EXC_UD);
+        else
+        {
+            host_and_vcpu_must_have(avx512bw);
+            generate_exception_if(evex.brs, EXC_UD);
+        }
+        d = (d & ~SrcMask) | SrcMem | TwoOp;
+        op_bytes = 16 << evex.lr;
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x71):    /* Grp12 */
     case X86EMUL_OPC_VEX_66(0x0f, 0x71):
     CASE_SIMD_PACKED_INT(0x0f, 0x72):    /* Grp13 */
@@ -9138,7 +9157,13 @@ x86_emulate(
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
-        generate_exception_if(!evex.lr || evex.brs, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x23): /* vshuff32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshuff64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x43): /* vshufi32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshufi64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 





* [PATCH v7 13/49] x86emul: support AVX512{BW, DQ} mask move insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (12 preceding siblings ...)
  2018-12-19 14:43   ` [PATCH v7 12/49] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
@ 2018-12-19 14:43   ` Jan Beulich
  2018-12-19 14:44   ` [PATCH v7 14/49] x86emul: basic AVX512BW testing Jan Beulich
                     ` (35 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:43 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Entries are added to the tables in evex-disp8.c despite these insns not
allowing for memory operands; the goal is for the tables to eventually
give a complete picture of the supported EVEX-encoded insns.
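
As a reminder of the semantics (C models of mine, not emulator code):
vpmovm2b expands each mask bit into an all-ones or all-zeroes byte,
while vpmovb2m gathers the bytes' sign bits back into a mask, hence
the round-trip check added to opmask.S below:

    static void pmovm2b(unsigned char *vec, unsigned long long mask,
                        unsigned int n /* bytes */)
    {
        unsigned int i;

        for ( i = 0; i < n; ++i )
            vec[i] = (mask >> i) & 1 ? 0xff : 0;
    }

    static unsigned long long pmovb2m(const unsigned char *vec,
                                      unsigned int n /* bytes */)
    {
        unsigned long long mask = 0;
        unsigned int i;

        for ( i = 0; i < n; ++i )
            mask |= (unsigned long long)(vec[i] >> 7) << i;

        return mask;
    }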

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -314,9 +314,12 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+//       pmovb2m,     f3, 0f38, 29,           b
+//       pmovm2,      f3, 0f38, 28,          bw
     INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+//       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
@@ -364,6 +367,9 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
+//       pmovd2m,        f3, 0f38, 39,        d
+//       pmovm2,         f3, 0f38, 38,       dq
+//       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
--- a/tools/tests/x86_emulator/opmask.S
+++ b/tools/tests/x86_emulator/opmask.S
@@ -12,17 +12,23 @@
 
 #if SIZE == 1
 # define _(x) x##b
+# define _v(x, t) _v_(x##q, t)
 #elif SIZE == 2
 # define _(x) x##w
+# define _v(x, t) _v_(x##d, t)
 # define WIDEN(x) x##bw
 #elif SIZE == 4
 # define _(x) x##d
+# define _v(x, t) _v_(x##w, t)
 # define WIDEN(x) x##wd
 #elif SIZE == 8
 # define _(x) x##q
+# define _v(x, t) _v_(x##b, t)
 # define WIDEN(x) x##dq
 #endif
 
+#define _v_(x, t) v##x##t
+
     .macro check res1:req, res2:req, line:req
     _(kmov)       %\res1, DATA(out)
 #if SIZE < 8 || !defined(__i386__)
@@ -131,6 +137,15 @@ _start:
 
 #endif
 
+#if SIZE > 2 ? defined(__AVX512BW__) : defined(__AVX512DQ__)
+
+    _(kmov)       DATA(in1), %k0
+    _v(pmovm2,)   %k0, %zmm7
+    _v(pmov,2m)   %zmm7, %k3
+    check         k0, k3, __LINE__
+
+#endif
+
     xor           %eax, %eax
     ret
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -8453,6 +8453,21 @@ x86_emulate(
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x29): /* vpmov{b,w}2m [xyz]mm,k */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x39): /* vpmov{d,q}2m [xyz]mm,k */
+        generate_exception_if(!evex.r || !evex.R, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x28): /* vpmovm2{b,w} k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x38): /* vpmovm2{d,q} k,[xyz]mm */
+        if ( b & 0x10 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.opmsk || ea.type != OP_REG, EXC_UD);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);






* [PATCH v7 14/49] x86emul: basic AVX512BW testing
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (13 preceding siblings ...)
  2018-12-19 14:43   ` [PATCH v7 13/49] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
@ 2018-12-19 14:44   ` Jan Beulich
  2018-12-19 14:46   ` [PATCH v7 15/49] x86emul: basic AVX512DQ testing Jan Beulich
                     ` (34 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have already been implemented.
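
The new checks all follow one round-trip shape (a scalar C sketch of
the vector test logic, not harness code): widen the low part of a
vector with a sign/zero-extending move, narrow it back with the
matching down conversion move, and expect the original values:

    #include <stdint.h>

    static int roundtrip_ok(const uint8_t *src, unsigned int n)
    {
        unsigned int i;

        for ( i = 0; i < n; ++i )
        {
            uint32_t wide = src[i];          /* widen2(): vpmovzxbd */
            uint8_t narrow = wide;           /* shrink2(): vpmovdb  */

            if ( narrow != src[i] )
                return 0;
        }

        return 1;
    }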

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base over changes earlier in the series.
v4: Add __AVX512VL__ conditional around majority of OVR() additions.
    Correct eq() for 1- and 2-byte cases.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -19,7 +19,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -69,6 +69,9 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512bw-vecs := $(avx512f-vecs)
+avx512bw-ints := 1 2
+avx512bw-flts :=
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -31,6 +31,10 @@ ENTRY(simd_test);
 #  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
 # elif FLOAT_SIZE == 8
 #  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif (INT_SIZE == 1 || UINT_SIZE == 1) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# elif (INT_SIZE == 2 || UINT_SIZE == 2) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqw, _mask, (vhi_t)(x), (vhi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 4 || UINT_SIZE == 4
 #  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 8 || UINT_SIZE == 8
@@ -374,6 +378,87 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # endif
+#elif (INT_SIZE == 1 || UINT_SIZE == 1 || INT_SIZE == 2 || UINT_SIZE == 2) && \
+      defined(__AVX512BW__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 1 || UINT_SIZE == 1
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastb %1, %0" \
+          : "=v" (t_) : "m" (*(char[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastb %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  elif defined(__AVX512VBMI__)
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
+#  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+# elif INT_SIZE == 2 || UINT_SIZE == 2
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastw %1, %0" \
+          : "=v" (t_) : "m" (*(short[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastw %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(pshufhw, _mask, \
+                                      B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
+                                      0b00011011, (vhi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+# endif
+# if INT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovsxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 2
+#  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
+#  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
+#  define widen1(x) ((vec_t)B(pmovsxwd, _mask, x, (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxwq, _mask, x, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 2
+#  define max(x, y) ((vec_t)B(pmaxuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define mul_hi(x, y) ((vec_t)B(pmulhuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxwd, _mask, (vhi_half_t)(x), (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxwq, _mask, (vhi_quarter_t)(x), (vdi_t)undef(), ~0))
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
@@ -565,7 +650,7 @@ static inline bool _to_bool(byte_vec_t b
 #  endif
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSSE3__)
+#if VEC_SIZE == 16 && defined(__SSSE3__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define abs(x) ((vec_t)__builtin_ia32_pabsb128((vqi_t)(x)))
 # elif INT_SIZE == 2
@@ -789,6 +874,40 @@ static inline half_t low_half(vec_t x)
 }
 # endif
 
+# if !defined(low_quarter) && defined(QUARTER_SIZE)
+static inline quarter_t low_quarter(vec_t x)
+{
+#  if QUARTER_SIZE < VEC_SIZE
+    quarter_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+# if !defined(low_eighth) && defined(EIGHTH_SIZE)
+static inline eighth_t low_eighth(vec_t x)
+{
+#  if EIGHTH_SIZE < VEC_SIZE
+    eighth_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 8; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
 #endif
 
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
@@ -1117,7 +1236,7 @@ int simd_test(void)
     y = interleave_lo(alt < 0, alt < 0);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen2(x);
+    z = widen2(low_quarter(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1126,7 +1245,7 @@ int simd_test(void)
     y = interleave_lo(y, y);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen3(x);
+    z = widen3(low_eighth(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 #  endif
@@ -1148,14 +1267,14 @@ int simd_test(void)
 
 # ifdef widen2
     touch(src);
-    x = widen2(src);
+    x = widen2(low_quarter(src));
     touch(src);
     if ( !eq(x, z) ) return __LINE__;
 # endif
 
 # ifdef widen3
     touch(src);
-    x = widen3(src);
+    x = widen3(low_eighth(src));
     touch(src);
     if ( !eq(x, interleave_lo(z, (vec_t){})) ) return __LINE__;
 # endif
@@ -1175,6 +1294,36 @@ int simd_test(void)
             if ( aux2[i] != src[i] )
                 return __LINE__;
     }
+#endif
+
+#if defined(widen2) && defined(shrink2)
+    {
+        quarter_t aux1 = low_quarter(src), aux2;
+
+        touch(aux1);
+        x = widen2(aux1);
+        touch(x);
+        aux2 = shrink2(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 4; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
+#if defined(widen3) && defined(shrink3)
+    {
+        eighth_t aux1 = low_eighth(src), aux2;
+
+        touch(aux1);
+        x = widen3(aux1);
+        touch(x);
+        aux2 = shrink3(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 8; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
 #endif
 
 #ifdef dup_lo
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -95,6 +95,32 @@ typedef int __attribute__((vector_size(H
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
 # endif
 
+# if ELEM_COUNT >= 4
+#  if VEC_SIZE > 64
+#   define QUARTER_SIZE (VEC_SIZE / 4)
+#  else
+#   define QUARTER_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(QUARTER_SIZE))) quarter_t;
+typedef char __attribute__((vector_size(QUARTER_SIZE))) vqi_quarter_t;
+typedef short __attribute__((vector_size(QUARTER_SIZE))) vhi_quarter_t;
+typedef int __attribute__((vector_size(QUARTER_SIZE))) vsi_quarter_t;
+typedef long long __attribute__((vector_size(QUARTER_SIZE))) vdi_quarter_t;
+# endif
+
+# if ELEM_COUNT >= 8
+#  if VEC_SIZE > 128
+#   define EIGHTH_SIZE (VEC_SIZE / 8)
+#  else
+#   define EIGHTH_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(EIGHTH_SIZE))) eighth_t;
+typedef char __attribute__((vector_size(EIGHTH_SIZE))) vqi_eighth_t;
+typedef short __attribute__((vector_size(EIGHTH_SIZE))) vhi_eighth_t;
+typedef int __attribute__((vector_size(EIGHTH_SIZE))) vsi_eighth_t;
+typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
+# endif
+
 #endif
 
 #if VEC_SIZE == 16
@@ -182,6 +208,9 @@ OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
 OVR_INT(add);
+OVR_BW(adds);
+OVR_BW(addus);
+OVR_BW(avg);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
@@ -214,6 +243,8 @@ OVR_INT(srl);
 OVR_DQ(srlv);
 OVR_FP(sub);
 OVR_INT(sub);
+OVR_BW(subs);
+OVR_BW(subus);
 OVR_SFP(ucomi);
 OVR_VFP(unpckh);
 OVR_VFP(unpckl);
@@ -275,6 +306,31 @@ OVR(punpckldq);
 OVR(punpcklqdq);
 # endif
 
+# ifdef __AVX512BW__
+OVR(pextrb);
+OVR(pextrw);
+OVR(pinsrb);
+OVR(pinsrw);
+#  ifdef __AVX512VL__
+OVR(pmaddwd);
+OVR(pmovsxbw);
+OVR(pmovzxbw);
+OVR(pmulhuw);
+OVR(pmulhw);
+OVR(pmullw);
+OVR(psadbw);
+OVR(pshufb);
+OVR(pshufhw);
+OVR(pshuflw);
+OVR(punpckhbw);
+OVR(punpckhwd);
+OVR(punpcklbw);
+OVR(punpcklwd);
+OVR(slldq);
+OVR(srldq);
+#  endif
+# endif
+
 # undef OVR_VFP
 # undef OVR_SFP
 # undef OVR_INT
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512bw.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -105,6 +106,11 @@ static bool simd_check_avx512bw(void)
 }
 #define simd_check_avx512bw_opmask simd_check_avx512bw
 
+static bool simd_check_avx512bw_vl(void)
+{
+    return cpu_has_avx512bw && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -284,6 +290,18 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512BW s8x64,     avx512bw,      64i1),
+    SIMD(AVX512BW u8x64,     avx512bw,      64u1),
+    SIMD(AVX512BW s16x32,    avx512bw,      64i2),
+    SIMD(AVX512BW u16x32,    avx512bw,      64u2),
+    AVX512VL(BW+VL s8x16,    avx512bw,      16i1),
+    AVX512VL(BW+VL u8x16,    avx512bw,      16u1),
+    AVX512VL(BW+VL s8x32,    avx512bw,      32i1),
+    AVX512VL(BW+VL u8x32,    avx512bw,      32u1),
+    AVX512VL(BW+VL s16x8,    avx512bw,      16i2),
+    AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
+    AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
+    AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_


* [PATCH v7 15/49] x86emul: basic AVX512DQ testing
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (14 preceding siblings ...)
  2018-12-19 14:44   ` [PATCH v7 14/49] x86emul: basic AVX512BW testing Jan Beulich
@ 2018-12-19 14:46   ` Jan Beulich
  2018-12-19 14:46   ` [PATCH v7 16/49] x86emul: support AVX512F move high/low insns Jan Beulich
                     ` (33 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:46 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test various insns which have already been implemented.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base.
v5: Re-base over changes earlier in the series.
v4: Wrap OVR(pmullq) in __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -19,7 +19,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -72,9 +72,12 @@ avx512f-flts := 4 8
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
+avx512dq-vecs := $(avx512f-vecs)
+avx512dq-ints := $(avx512f-ints)
+avx512dq-flts := $(avx512f-flts)
 
 avx512f-opmask-vecs := 2
-avx512dq-opmask-vecs := 1
+avx512dq-opmask-vecs := 1 2
 avx512bw-opmask-vecs := 4 8
 
 # Suppress building by default of the harness if the compiler can't deal
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -121,6 +121,34 @@ typedef int __attribute__((vector_size(E
 typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
 # endif
 
+# define DECL_PAIR(w) \
+typedef w ## _t pair_t; \
+typedef vsi_ ## w ## _t vsi_pair_t; \
+typedef vdi_ ## w ## _t vdi_pair_t
+# define DECL_QUARTET(w) \
+typedef w ## _t quartet_t; \
+typedef vsi_ ## w ## _t vsi_quartet_t; \
+typedef vdi_ ## w ## _t vdi_quartet_t
+# define DECL_OCTET(w) \
+typedef w ## _t octet_t; \
+typedef vsi_ ## w ## _t vsi_octet_t; \
+typedef vdi_ ## w ## _t vdi_octet_t
+
+# if ELEM_COUNT == 4
+DECL_PAIR(half);
+# elif ELEM_COUNT == 8
+DECL_PAIR(quarter);
+DECL_QUARTET(half);
+# elif ELEM_COUNT == 16
+DECL_PAIR(eighth);
+DECL_QUARTET(quarter);
+DECL_OCTET(half);
+# endif
+
+# undef DECL_OCTET
+# undef DECL_QUARTET
+# undef DECL_PAIR
+
 #endif
 
 #if VEC_SIZE == 16
@@ -146,6 +174,14 @@ typedef long long __attribute__((vector_
 #ifdef __AVX512F__
 
 /* Sadly there are a few exceptions to the general naming rules. */
+# define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
+# define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+# define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
+# define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
+# define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
+# define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
+# define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
+# define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -331,6 +367,20 @@ OVR(srldq);
 #  endif
 # endif
 
+# ifdef __AVX512DQ__
+OVR_VFP(and);
+OVR_VFP(andn);
+OVR_VFP(or);
+OVR(pextrd);
+OVR(pextrq);
+OVR(pinsrd);
+OVR(pinsrq);
+#  ifdef __AVX512VL__
+OVR(pmullq);
+#  endif
+OVR_VFP(xor);
+# endif
+
 # undef OVR_VFP
 # undef OVR_SFP
 # undef OVR_INT
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -139,6 +139,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
+     (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if FLOAT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -146,6 +167,17 @@ static inline bool _to_bool(byte_vec_t b
           : "=v" (t_) : "m" (*(float[1]){ x }) ); \
     t_; \
 })
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcastf32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
+#   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
+#  endif
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
@@ -155,6 +187,13 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
 #  else
+#   define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
+#   define insert_pair(x, y, p) \
+    B(insertf32x4_, _mask, x, \
+      /* Cast needed below to work around gcc 7.x quirk. */ \
+      (p) & 1 ? (typeof(y))__builtin_ia32_shufps(y, y, 0b01000100) : (y), \
+      (p) >> 1, x, 3 << ((p) * 2))
+#   define insert_quartet(x, y, p) B(insertf32x4_, _mask, x, y, p, undef(), ~0)
 #   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #   define swap(x) ({ \
@@ -178,6 +217,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) B(broadcastf64x2_, _mask, x, undef(), ~0)
+#   define insert_pair(x, y, p) B(insertf64x2_, _mask, x, y, p, undef(), ~0)
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
+#   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
+#  endif
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
@@ -306,6 +353,16 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 # endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextracti32x4 */ || \
+       (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -318,11 +375,30 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  ifdef __AVX512DQ__
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcasti32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) ((vec_t)B(broadcasti32x8_, _mask, (vsi_octet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_octet(x, y, p) ((vec_t)B(inserti32x8_, _mask, (vsi_t)(x), (vsi_octet_t)(y), p, (vsi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti32x4_, _mask, (vsi_quartet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_pair(x, y, p) \
+    (vec_t)(B(inserti32x4_, _mask, (vsi_t)(x), \
+              /* First cast needed below to work around gcc 7.x quirk. */ \
+              (p) & 1 ? (vsi_pair_t)__builtin_ia32_pshufd((vsi_pair_t)(y), 0b01000100) \
+                      : (vsi_pair_t)(y), \
+              (p) >> 1, (vsi_t)(x), 3 << ((p) * 2)))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti32x4_, _mask, (vsi_t)(x), (vsi_quartet_t)(y), p, (vsi_t)undef(), ~0))
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
@@ -347,6 +423,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ((vec_t)B(broadcasti64x2_, _mask, (vdi_pair_t)(x), (vdi_t)undef(), ~0))
+#   define insert_pair(x, y, p) ((vec_t)B(inserti64x2_, _mask, (vdi_t)(x), (vdi_pair_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti64x4_, , (vdi_quartet_t)(x), (vdi_t)undef(), ~0))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti64x4_, _mask, (vdi_t)(x), (vdi_quartet_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
@@ -898,7 +982,7 @@ static inline eighth_t low_eighth(vec_t
     eighth_t y;
     unsigned int i;
 
-    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+    for ( i = 0; i < ELEM_COUNT / 8; ++i )
         y[i] = x[i];
 
     return y;
@@ -910,6 +994,50 @@ static inline eighth_t low_eighth(vec_t
 
 #endif
 
+#ifdef broadcast_pair
+# if ELEM_COUNT == 4
+#  define broadcast_half broadcast_pair
+# elif ELEM_COUNT == 8
+#  define broadcast_quarter broadcast_pair
+# elif ELEM_COUNT == 16
+#  define broadcast_eighth broadcast_pair
+# endif
+#endif
+
+#ifdef insert_pair
+# if ELEM_COUNT == 4
+#  define insert_half insert_pair
+# elif ELEM_COUNT == 8
+#  define insert_quarter insert_pair
+# elif ELEM_COUNT == 16
+#  define insert_eighth insert_pair
+# endif
+#endif
+
+#ifdef broadcast_quartet
+# if ELEM_COUNT == 8
+#  define broadcast_half broadcast_quartet
+# elif ELEM_COUNT == 16
+#  define broadcast_quarter broadcast_quartet
+# endif
+#endif
+
+#ifdef insert_quartet
+# if ELEM_COUNT == 8
+#  define insert_half insert_quartet
+# elif ELEM_COUNT == 16
+#  define insert_quarter insert_quartet
+# endif
+#endif
+
+#if defined(broadcast_octet) && ELEM_COUNT == 16
+# define broadcast_half broadcast_octet
+#endif
+
+#if defined(insert_octet) && ELEM_COUNT == 16
+# define insert_half insert_octet
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1205,6 +1333,60 @@ int simd_test(void)
     if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#if defined(broadcast_half) && defined(insert_half)
+    {
+        half_t aux = low_half(src);
+
+        touch(aux);
+        x = broadcast_half(aux);
+        touch(aux);
+        y = insert_half(src, aux, 1);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_quarter) && defined(insert_quarter)
+    {
+        quarter_t aux = low_quarter(src);
+
+        touch(aux);
+        x = broadcast_quarter(aux);
+        touch(aux);
+        y = insert_quarter(src, aux, 1);
+        touch(aux);
+        y = insert_quarter(y, aux, 2);
+        touch(aux);
+        y = insert_quarter(y, aux, 3);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_eighth) && defined(insert_eighth) && \
+    /* At least gcc 7.3 "optimizes" away all insert_eighth() calls below. */ \
+    __GNUC__ >= 8
+    {
+        eighth_t aux = low_eighth(src);
+
+        touch(aux);
+        x = broadcast_eighth(aux);
+        touch(aux);
+        y = insert_eighth(src, aux, 1);
+        touch(aux);
+        y = insert_eighth(y, aux, 2);
+        touch(aux);
+        y = insert_eighth(y, aux, 3);
+        touch(aux);
+        y = insert_eighth(y, aux, 4);
+        touch(aux);
+        y = insert_eighth(y, aux, 5);
+        touch(aux);
+        y = insert_eighth(y, aux, 6);
+        touch(aux);
+        y = insert_eighth(y, aux, 7);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -23,6 +23,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
 #include "avx512bw.h"
+#include "avx512dq.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -100,6 +101,11 @@ static bool simd_check_avx512dq(void)
 }
 #define simd_check_avx512dq_opmask simd_check_avx512dq
 
+static bool simd_check_avx512dq_vl(void)
+{
+    return cpu_has_avx512dq && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -267,9 +273,10 @@ static const struct {
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
     SIMD(OPMASK/w,     avx512f_opmask,         2),
-    SIMD(OPMASK/b,    avx512dq_opmask,         1),
-    SIMD(OPMASK/d,    avx512bw_opmask,         4),
-    SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(OPMASK+DQ/b, avx512dq_opmask,         1),
+    SIMD(OPMASK+DQ/w, avx512dq_opmask,         2),
+    SIMD(OPMASK+BW/d, avx512bw_opmask,         4),
+    SIMD(OPMASK+BW/q, avx512bw_opmask,         8),
     SIMD(AVX512F f32 scalar,  avx512f,        f4),
     SIMD(AVX512F f32x16,      avx512f,      64f4),
     SIMD(AVX512F f64 scalar,  avx512f,        f8),
@@ -302,6 +309,24 @@ static const struct {
     AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
     AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
     AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
+    SIMD(AVX512DQ f32x16,    avx512dq,      64f4),
+    SIMD(AVX512DQ f64x8,     avx512dq,      64f8),
+    SIMD(AVX512DQ s32x16,    avx512dq,      64i4),
+    SIMD(AVX512DQ u32x16,    avx512dq,      64u4),
+    SIMD(AVX512DQ s64x8,     avx512dq,      64i8),
+    SIMD(AVX512DQ u64x8,     avx512dq,      64u8),
+    AVX512VL(DQ+VL f32x4,    avx512dq,      16f4),
+    AVX512VL(DQ+VL f64x2,    avx512dq,      16f8),
+    AVX512VL(DQ+VL f32x8,    avx512dq,      32f4),
+    AVX512VL(DQ+VL f64x4,    avx512dq,      32f8),
+    AVX512VL(DQ+VL s32x4,    avx512dq,      16i4),
+    AVX512VL(DQ+VL u32x4,    avx512dq,      16u4),
+    AVX512VL(DQ+VL s32x8,    avx512dq,      32i4),
+    AVX512VL(DQ+VL u32x8,    avx512dq,      32u4),
+    AVX512VL(DQ+VL s64x2,    avx512dq,      16i8),
+    AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
+    AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
+    AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_


* [PATCH v7 16/49] x86emul: support AVX512F move high/low insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (15 preceding siblings ...)
  2018-12-19 14:46   ` [PATCH v7 15/49] x86emul: basic AVX512DQ testing Jan Beulich
@ 2018-12-19 14:46   ` Jan Beulich
  2018-12-19 14:47   ` [PATCH v7 17/49] x86emul: support AVX512F move duplicate insns Jan Beulich
                     ` (32 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:46 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

No explicit test harness additions other than the overrides, as the
compiler already makes use of the insns.
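
As an illustration only (this snippet is not part of the patch, and
the helper name is made up): with AVX512VL enabled, the compiler may
already lower partial-vector loads like the one below to the
EVEX-encoded VMOVHPS, so the pre-existing SIMD tests exercise the new
emulator paths without dedicated harness additions.

    #include <immintrin.h>

    /* Illustration only, not part of the patch: _mm_loadh_pi() maps to
     * (v)movhps, and with AVX512VL enabled the compiler may pick the
     * EVEX encoding of it. */
    static __m128 load_high_pair(__m128 lo, const float *p)
    {
        return _mm_loadh_pi(lo, (const __m64 *)p);
    }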

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -253,6 +253,16 @@ static const struct test avx512f_128[] =
     INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
+//       movhlps,     ,   0f, 12,        d
+    INSN(movhpd,    66,   0f, 16, el,    q, vl),
+    INSN(movhpd,    66,   0f, 17, el,    q, vl),
+    INSN(movhps,      ,   0f, 16, el_2,  d, vl),
+    INSN(movhps,      ,   0f, 17, el_2,  d, vl),
+//       movlhps,     ,   0f, 16,        d
+    INSN(movlpd,    66,   0f, 12, el,    q, vl),
+    INSN(movlpd,    66,   0f, 13, el,    q, vl),
+    INSN(movlps,      ,   0f, 12, el_2,  d, vl),
+    INSN(movlps,      ,   0f, 13, el_2,  d, vl),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
     INSN(movq,      66,   0f, d6, el,    q, el),
 };
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -266,6 +266,12 @@ OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
 OVR_VFP(mova);
+OVR(movhlps);
+OVR(movhpd);
+OVR(movhps);
+OVR(movlhps);
+OVR(movlpd);
+OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -286,11 +286,11 @@ static const struct twobyte_table {
     [0x0f] = { ModRM|SrcImmByte },
     [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
-    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
     [0x20 ... 0x21] = { DstMem|SrcImplicit|ModRM },
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
@@ -6022,6 +6022,26 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_fp;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x12):   /* vmovlpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x13): /* vmovlp{s,d} xmm,m64 */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x16):   /* vmovhpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x17): /* vmovhp{s,d} xmm,m64 */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x12):      /* vmovlps m64,xmm,xmm */
+                                            /* vmovhlps xmm,xmm,xmm */
+    case X86EMUL_OPC_EVEX(0x0f, 0x16):      /* vmovhps m64,xmm,xmm */
+                                            /* vmovlhps xmm,xmm,xmm */
+        generate_exception_if((evex.lr || evex.opmsk || evex.brs ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( (d & DstMask) != DstMem )
+            d &= ~TwoOp;
+        op_bytes = 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     case X86EMUL_OPC_F3(0x0f, 0x12):       /* movsldup xmm/m128,xmm */
     case X86EMUL_OPC_VEX_F3(0x0f, 0x12):   /* vmovsldup {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F2(0x0f, 0x12):       /* movddup xmm/m64,xmm */


* [PATCH v7 17/49] x86emul: support AVX512F move duplicate insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (16 preceding siblings ...)
  2018-12-19 14:46   ` [PATCH v7 16/49] x86emul: support AVX512F move high/low insns Jan Beulich
@ 2018-12-19 14:47   ` Jan Beulich
  2018-12-19 14:47   ` [PATCH v7 18/49] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
                     ` (31 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:47 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Judging from their insn prefixes, these are scalar insns, but their
(memory) operands are vector ones (with the exception of 128-bit
VMOVDDUP). This requires some adjustments to the disp8scale
calculation code.
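
For reference, a minimal sketch (made-up names, not part of the
patch) of what disp8scale expresses, and why VMOVDDUP is special:

    #include <stdint.h>

    /* A minimal sketch with made-up names: an EVEX Disp8*N
     * displacement is the signed 8-bit value scaled by 2^disp8scale,
     * i.e. by the memory operand size.  VMOVDDUP's operand is 8 bytes
     * (scale 3) at 128-bit vector length but a full 32-/64-byte
     * vector (scale 5/6) at 256/512 bits, which is what
     * "evex.lr ? 4 + evex.lr : 3" below expresses. */
    static int32_t evex_effective_disp(int8_t disp8, unsigned int disp8scale)
    {
        return (int32_t)disp8 * (1 << disp8scale);
    }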

No explicit test harness additions other than the overrides, as the
compiler already makes use of the insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Fix Disp8 test for VMOVDDUP when AVX512VL is unavailable.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -146,6 +146,7 @@ static const struct test avx512f_all[] =
     INSN_SFP(mov,            0f, 11),
     INSN_PFP_NB(mova,        0f, 28),
     INSN_PFP_NB(mova,        0f, 29),
+    INSN(movddup,      f2,   0f, 12,    vl,   q_nb, vl),
     INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
     INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
     INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
@@ -157,6 +158,8 @@ static const struct test avx512f_all[] =
     INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
     INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
     INSN_PFP_NB(movnt,       0f, 2b),
+    INSN(movshdup,     f3,   0f, 16,    vl,   d_nb, vl),
+    INSN(movsldup,     f3,   0f, 12,    vl,   d_nb, vl),
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
@@ -694,6 +697,19 @@ static void test_group(const struct test
 
             switch ( tests[i].esz )
             {
+            case ESZ_q_nb:
+                /* The 128-bit form of VMOVDDUP needs special casing. */
+                if ( vl[j] == VL_128 && tests[i].spc == SPC_0f &&
+                     tests[i].opc == 0x12 && tests[i].pfx == PFX_f2 )
+                {
+                    struct test test = tests[i];
+
+                    test.vsz = VSZ_el;
+                    test.scale = SC_el;
+                    test_one(&test, vl[j], instr, ctxt);
+                    continue;
+                }
+                /* fall through */
             default:
                 test_one(&tests[i], vl[j], instr, ctxt);
                 break;
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -326,8 +326,11 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
+OVR(movshdup);
+OVR(movsldup);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3048,6 +3048,15 @@ x86_decode(
 
             switch ( b )
             {
+            case 0x12: /* vmovsldup / vmovddup */
+                if ( evex.pfx == vex_f2 )
+                    disp8scale = evex.lr ? 4 + evex.lr : 3;
+                /* fall through */
+            case 0x16: /* vmovshdup */
+                if ( evex.pfx == vex_f3 )
+                    disp8scale = 4 + evex.lr;
+                break;
+
             case 0x20: /* mov cr,reg */
             case 0x21: /* mov dr,reg */
             case 0x22: /* mov reg,cr */
@@ -6057,6 +6066,20 @@ x86_emulate(
         host_and_vcpu_must_have(sse3);
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x12):   /* vmovsldup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x12):   /* vmovddup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x16):   /* vmovshdup [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if((evex.brs ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = !(evex.pfx & VEX_PREFIX_DOUBLE_MASK) || evex.lr
+                   ? 16 << evex.lr : 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x14): /* vunpcklp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),


* [PATCH v7 18/49] x86emul: support AVX512{F, BW, _VBMI} permute insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (17 preceding siblings ...)
  2018-12-19 14:47   ` [PATCH v7 17/49] x86emul: support AVX512F move duplicate insns Jan Beulich
@ 2018-12-19 14:47   ` Jan Beulich
  2018-12-19 14:48   ` [PATCH v7 19/49] x86emul: support AVX512BW pack insns Jan Beulich
                     ` (30 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:47 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -178,6 +178,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
+    INSN(permilpd,     66, 0f3a, 05,    vl,      q, vl),
+    INSN(permilps,     66, 0f38, 0c,    vl,      d, vl),
+    INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
@@ -278,6 +282,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(perm,           66, 0f38, 36, vl,   dq, vl),
+    INSN(perm,           66, 0f38, 16, vl,   sd, vl),
+    INSN(permpd,         66, 0f3a, 01, vl,    q, vl),
+    INSN(permq,          66, 0f3a, 00, vl,    q, vl),
     INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
     INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
     INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
@@ -316,6 +324,7 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
@@ -412,6 +421,7 @@ static const struct test avx512dq_512[]
 };
 
 static const struct test avx512_vbmi_all[] = {
+    INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -186,6 +186,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#   define swap2(x) B_(vpermilps, _mask, x, 0b00011011, undef(), ~0)
 #  else
 #   define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
 #   define insert_pair(x, y, p) \
@@ -200,6 +201,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
 })
+#   define swap2(x) B(vpermilps, _mask, \
+                       B(shuf_f32x4_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b00011011, undef(), ~0)
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -233,6 +238,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#   define swap2(x) B_(vpermilpd, _mask, x, 0b01, undef(), ~0)
 #  else
 #   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
@@ -240,6 +246,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
 })
+#   define swap2(x) B(vpermilpd, _mask, \
+                       B(shuf_f64x2_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b01010101, undef(), ~0)
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -405,6 +415,7 @@ static inline bool _to_bool(byte_vec_t b
                              B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
                                VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
                              0b00011011, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -442,8 +453,17 @@ static inline bool _to_bool(byte_vec_t b
                              (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
                                       VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
                              0b01001110, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  if VEC_SIZE == 32
+#   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
+#  elif VEC_SIZE == 64
+#   define swap3(x) ({ \
+    vdi_t t_ = B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0); \
+    B(shuf_i64x2_, _mask, t_, t_, 0b01001110, (vdi_t)undef(), ~0); \
+})
+#  endif
 # endif
 # if INT_SIZE == 4
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
@@ -489,6 +509,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+#  ifdef __AVX512VBMI__
+#   define swap2(x) ((vec_t)B(permvarqi, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  endif
 # elif INT_SIZE == 2 || UINT_SIZE == 2
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -517,6 +540,7 @@ static inline bool _to_bool(byte_vec_t b
                               (0b01010101010101010101010101010101 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+#  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
@@ -1325,6 +1349,12 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
+#ifdef swap3
+    touch(src);
+    if ( !eq(swap3(src), inv) ) return __LINE__;
+    touch(src);
+#endif
+
 #ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -275,6 +275,8 @@ OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(perm);
+OVR_VFP(permil);
 OVR_VFP(shuf);
 OVR_INT(sll);
 OVR_DQ(sllv);
@@ -331,6 +333,8 @@ OVR(movntdq);
 OVR(movntdqa);
 OVR(movshdup);
 OVR(movsldup);
+OVR(permd);
+OVR(permq);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -434,7 +434,8 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
-    [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
+    [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -477,6 +478,7 @@ static const struct ext0f38_table {
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
+    [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -522,10 +524,10 @@ static const struct ext0f3a_table {
     uint8_t four_op:1;
     disp8scale_t d8s:4;
 } ext0f3a_table[256] = {
-    [0x00] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
+    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
-    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
@@ -8091,6 +8093,9 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.brs, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0c): /* vpermilps [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0d): /* vpermilpd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         if ( b == 0xe2 )
             goto avx512f_no_sae;
@@ -8436,6 +8441,12 @@ x86_emulate(
         generate_exception_if(!vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
@@ -8641,6 +8652,7 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         if ( !evex.w )
             host_and_vcpu_must_have(avx512_vbmi);
         else
@@ -9066,6 +9078,12 @@ x86_emulate(
         generate_exception_if(!vex.l || !vex.w, EXC_UD);
         goto simd_0f_imm8_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x00): /* vpermq $imm8,{y,z}mm/mem,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x01): /* vpermpd $imm8,{y,z}mm/mem,{y,z}mm{k} */
+        generate_exception_if(!evex.lr || !evex.w, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x38): /* vinserti128 $imm8,xmm/m128,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x39): /* vextracti128 $imm8,ymm,xmm/m128 */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x46): /* vperm2i128 $imm8,ymm/m256,ymm,ymm */
@@ -9085,6 +9103,12 @@ x86_emulate(
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x04): /* vpermilps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x05): /* vpermilpd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_66(0x0f3a, 0x08): /* roundps $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x09): /* roundpd $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x0a): /* roundss $imm8,xmm/m128,xmm */


* [PATCH v7 19/49] x86emul: support AVX512BW pack insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (18 preceding siblings ...)
  2018-12-19 14:47   ` [PATCH v7 18/49] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
@ 2018-12-19 14:48   ` Jan Beulich
  2018-12-19 14:48   ` [PATCH v7 20/49] x86emul: support AVX512F floating-point conversion insns Jan Beulich
                     ` (29 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:48 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

No further test harness additions; what is already there is good
enough for these rather "regular" insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -306,6 +306,10 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
+    INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
+    INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
+    INSN(packuswb,    66,   0f, 67,    vl,    w, vl),
     INSN(paddb,       66,   0f, fc,    vl,    b, vl),
     INSN(paddsb,      66,   0f, ec,    vl,    b, vl),
     INSN(paddsw,      66,   0f, ed,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -361,6 +361,10 @@ OVR(pextrw);
 OVR(pinsrb);
 OVR(pinsrw);
 #  ifdef __AVX512VL__
+OVR(packssdw);
+OVR(packsswb);
+OVR(packusdw);
+OVR(packuswb);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
-    [0x2b] = { .simd_size = simd_packed_int },
+    [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
@@ -6732,6 +6732,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         op_bytes = 16 << evex.lr;
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x63): /* vpacksswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x67): /* vpackuswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6794,6 +6796,12 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6b): /* vpackssdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2b): /* vpackusdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;


* [PATCH v7 20/49] x86emul: support AVX512F floating-point conversion insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (19 preceding siblings ...)
  2018-12-19 14:48   ` [PATCH v7 19/49] x86emul: support AVX512BW pack insns Jan Beulich
@ 2018-12-19 14:48   ` Jan Beulich
  2018-12-19 14:48   ` [PATCH v7 21/49] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
                     ` (28 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:48 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVTPS2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size change for twobyte_table[0x5a] is benign for
pre-existing code, but allows decode_disp8scale() to work as-is here.

Also correct the comment on an AVX counterpart.
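
The memory operand size rule for the 0x5a conversion group can be
summarized as follows (a sketch with made-up names, restating the
op_bytes computation the patch adds):

    /* Restates op_bytes = 4 << (((evex.pfx & SCALAR) ? 0 : 1 + evex.lr)
     * + evex.w) from the patch: scalar forms access a single source
     * element, packed forms a vector sized by the *source* element
     * width, so e.g. vcvtps2pd on a zmm destination (lr = 2, w = 0)
     * reads only 32 bytes of floats - the same fact behind the
     * disp8scale decrement. */
    static unsigned int cvt_group_op_bytes(int scalar, unsigned int lr,
                                           unsigned int w)
    {
        return 4u << ((scalar ? 0 : 1 + lr) + w);
    }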

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: ea.type == OP_* -> ea.type != OP_*. Re-base.
v6: Re-base over changes earlier in the series.
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,6 +109,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
+    INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
+    INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -181,7 +181,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -68,6 +68,7 @@ typedef short __attribute__((vector_size
 typedef int __attribute__((vector_size(VEC_SIZE))) vsi_t;
 #if VEC_SIZE >= 8
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
+typedef double __attribute__((vector_size(VEC_SIZE))) vdf_t;
 #endif
 
 #if ELEM_SIZE == 1
@@ -93,6 +94,7 @@ typedef char __attribute__((vector_size(
 typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
 typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+typedef float __attribute__((vector_size(HALF_SIZE))) vsf_half_t;
 # endif
 
 # if ELEM_COUNT >= 4
@@ -328,6 +330,13 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(cvtpd2psx);
+OVR(cvtpd2psy);
+OVR(cvtph2ps);
+OVR(cvtps2pd);
+OVR(cvtps2ph);
+OVR(cvtsd2ss);
+OVR(cvtss2sd);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3871,6 +3871,49 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vcvtph2ps 32(%ecx),%zmm7{%k4}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vcvtph2ps);
+        decl_insn(evex_vcvtps2ph);
+
+        asm volatile ( "vpternlogd $0x81, %%zmm7, %%zmm7, %%zmm7\n\t"
+                       "kmovw %1,%%k4\n"
+                       put_insn(evex_vcvtph2ps, "vcvtph2ps 32(%0), %%zmm7%{%%k4%}")
+                       :: "c" (NULL), "r" (0x3333) );
+
+        set_insn(evex_vcvtph2ps);
+        memset(res, 0xff, 128);
+        res[8] = 0x40003c00; /* (1.0, 2.0) */
+        res[10] = 0x44004200; /* (3.0, 4.0) */
+        res[12] = 0x3400b800; /* (-.5, .25) */
+        res[14] = 0xbc000000; /* (0.0, -1.) */
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovups %%zmm7, %0" : "=m" (res[16]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtph2ps) )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vcvtps2ph $0,%zmm3,128(%edx){%k4}...");
+        asm volatile ( "vmovups %0, %%zmm3\n"
+                       put_insn(evex_vcvtps2ph, "vcvtps2ph $0, %%zmm3, 128(%1)%{%%k4%}")
+                       :: "m" (res[16]), "d" (NULL) );
+
+        set_insn(evex_vcvtps2ph);
+        regs.edx = (unsigned long)res;
+        memset(res + 32, 0xcc, 32);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtps2ph) )
+            goto fail;
+        res[15] = res[13] = res[11] = res[9] = 0xcccccccc;
+        if ( memcmp(res + 8, res + 32, 32) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -310,7 +310,8 @@ static const struct twobyte_table {
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -437,7 +438,7 @@ static const struct ext0f38_table {
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x13] = { .simd_size = simd_other, .two_op = 1 },
+    [0x13] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
@@ -541,7 +542,7 @@ static const struct ext0f3a_table {
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
+    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
@@ -3071,6 +3072,11 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x5a: /* vcvtps2pd needs special casing */
+                if ( disp8scale && !evex.pfx && !evex.brs )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -5988,6 +5994,7 @@ x86_emulate(
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5d):    /* vmin{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    avx512f_all_fp:
         generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
                                (ea.type != OP_REG && evex.brs &&
                                 (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
@@ -6548,7 +6555,7 @@ x86_emulate(
         goto simd_zmm;
 
     CASE_SIMD_ALL_FP(, 0x0f, 0x5a):        /* cvt{p,s}{s,d}2{p,s}{s,d} xmm/mem,xmm */
-    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} xmm/mem,xmm */
+    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} {x,y}mm/mem,{x,y}mm */
                                            /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm */
         op_bytes = 4 << (((vex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + vex.l) +
                          !!(vex.pfx & VEX_PREFIX_DOUBLE_MASK));
@@ -6557,6 +6564,12 @@ x86_emulate(
             goto simd_0f_sse2;
         goto simd_0f_avx;
 
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5a):   /* vcvtp{s,d}2p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+                                           /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm{k} */
+        op_bytes = 4 << (((evex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + evex.lr) +
+                         evex.w);
+        goto avx512f_all_fp;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x5b):     /* cvt{ps,dq}2{dq,ps} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x5b): /* vcvt{ps,dq}2{dq,ps} {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F3(0x0f, 0x5b):       /* cvttps2dq xmm/mem,xmm */
@@ -8444,6 +8457,15 @@ x86_emulate(
         op_bytes = 8 << vex.l;
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w || (ea.type != OP_REG && evex.brs), EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.brs )
+            avx512_vlen_check(false);
+        op_bytes = 8 << evex.lr;
+        elem_bytes = 2;
+        goto simd_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
         generate_exception_if(!vex.l || vex.w, EXC_UD);
@@ -9272,27 +9294,79 @@ x86_emulate(
         goto avx512f_imm8_no_sae;
 
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,[xyz]mm,{x,y}mm/mem{k} */
     {
         uint32_t mxcsr;
 
-        generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
-        host_and_vcpu_must_have(f16c);
         fail_if(!ops->write);
+        if ( evex_encoded() )
+        {
+            generate_exception_if((evex.w || evex.reg != 0xf || !evex.RX ||
+                                   (ea.type != OP_REG && (evex.z || evex.brs))),
+                                  EXC_UD);
+            host_and_vcpu_must_have(avx512f);
+            avx512_vlen_check(false);
+            opc = init_evex(stub);
+        }
+        else
+        {
+            generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
+            host_and_vcpu_must_have(f16c);
+            opc = init_prefixes(stub);
+        }
+
+        op_bytes = 8 << evex.lr;
 
-        opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
         if ( ea.type == OP_MEM )
         {
             /* Convert memory operand to (%rAX). */
             vex.b = 1;
+            evex.b = 1;
             opc[1] &= 0x38;
         }
         opc[2] = imm1;
-        insn_bytes = PFX_BYTES + 3;
+        if ( evex_encoded() )
+        {
+            unsigned int full = 0;
+
+            insn_bytes = EVEX_PFX_BYTES + 3;
+            copy_EVEX(opc, evex);
+
+            if ( ea.type == OP_MEM && evex.opmsk )
+            {
+                full = 0xffff >> (16 - op_bytes / 2);
+                op_mask &= full;
+                if ( !op_mask )
+                    goto complete_insn;
+
+                first_byte = __builtin_ctz(op_mask);
+                op_mask >>= first_byte;
+                full >>= first_byte;
+                first_byte <<= 1;
+                op_bytes = (32 - __builtin_clz(op_mask)) << 1;
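+
+                /*
+                 * Editorial illustration, not part of the original
+                 * patch: with op_bytes = 32 and op_mask = 0x0078,
+                 * first_byte ends up as 6, op_mask is trimmed to 0xf,
+                 * and op_bytes becomes 8, i.e. only bytes 6..13 of the
+                 * memory destination get accessed below.
+                 */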
+
+                /*
+                 * We may need to read (parts of) the memory operand for the
+                 * purpose of merging in order to avoid splitting the write
+                 * below into multiple ones.
+                 */
+                if ( op_mask != full &&
+                     (rc = ops->read(ea.mem.seg,
+                                     truncate_ea(ea.mem.off + first_byte),
+                                     (void *)mmvalp + first_byte, op_bytes,
+                                     ctxt)) != X86EMUL_OKAY )
+                    goto done;
+            }
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 3;
+            copy_VEX(opc, vex);
+        }
         opc[3] = 0xc3;
 
-        copy_VEX(opc, vex);
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
                     "=m" (*mmvalp), [mxcsr] "=m" (mxcsr) : "a" (mmvalp));
@@ -9301,7 +9375,8 @@ x86_emulate(
 
         if ( ea.type == OP_MEM )
         {
-            rc = ops->write(ea.mem.seg, ea.mem.off, mmvalp, 8 << vex.l, ctxt);
+            rc = ops->write(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
+                            (void *)mmvalp + first_byte, op_bytes, ctxt);
             if ( rc != X86EMUL_OKAY )
             {
                 asm volatile ( "ldmxcsr %0" :: "m" (mxcsr) );





* [PATCH v7 21/49] x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (20 preceding siblings ...)
  2018-12-19 14:48   ` [PATCH v7 20/49] x86emul: support AVX512F floating-point conversion insns Jan Beulich
@ 2018-12-19 14:48   ` Jan Beulich
  2018-12-19 14:51   ` [PATCH v7 22/49] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
                     ` (27 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:48 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

... including the two AVX512DQ forms which share these encodings, just
with EVEX.W set there.

VCVTDQ2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.
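
As a worked illustration of the disp8*N compression involved (an
editorial sketch, not code from this series, with a made-up helper
name): the full-width forms of VCVTDQ2PD read only half a vector of
32-bit source elements, so the encoded 8-bit displacement scales by
half the vector length, which is what the decode-time decrement
achieves.

    #include <stdint.h>

    /* lr is EVEX.L'L: 0 = xmm, 1 = ymm, 2 = zmm. */
    static int32_t vcvtdq2pd_mem_disp(int8_t disp8, unsigned int lr)
    {
        /*
         * A (16 << lr)-byte double-precision result is produced from
         * (16 << lr) / 2 bytes of int32 source, so e.g. the zmm form
         * scales disp8 by 32 rather than 64.  (With embedded
         * broadcast the scale would be the element size instead.)
         */
        return (int32_t)disp8 * ((16 << lr) / 2);
    }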

The simd_size changes for the twobyte_table[] entries are benign to
pre-existing code, but allow decode_disp8scale() to work as is here.
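
For reference, here is a condensed model of what decode_disp8scale()
derives from a d8s_vl ("full" tuple) entry, as an editorial paraphrase
of the EVEX rules rather than a verbatim copy of the function:

    #include <stdbool.h>

    /* Returns log2 of the disp8 scale factor N. */
    static unsigned int d8s_vl_scale(bool brs, unsigned int lr,
                                     unsigned int elem_log2)
    {
        return brs ? elem_log2 /* embedded broadcast: one element */
                   : 4 + lr;   /* full vector: 16 << lr bytes */
    }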

The placement of the 0xe6 case block, while wrong at this point, is
once again in anticipation of further case labels getting added.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: ea.type == OP_* -> ea.type != OP_*. Re-base.
v6: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,8 +109,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
+    INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
+    INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
@@ -398,6 +402,8 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
+    INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -92,6 +92,13 @@ static inline bool _to_bool(byte_vec_t b
 # define to_int(x) ((vec_t){ (int)(x)[0] })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
+#elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+# elif FLOAT_SIZE == 8
+#  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
 #  define to_int(x) __builtin_ia32_cvtdq2ps(__builtin_ia32_cvtps2dq(x))
@@ -1142,15 +1149,21 @@ int simd_test(void)
     touch(src);
     if ( !eq(x * -alt, -src) ) return __LINE__;
 
-# if defined(recip) && defined(to_int)
+# ifdef to_int
+
+    touch(src);
+    x = to_int(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
 
+#  ifdef recip
     touch(src);
     x = recip(src);
     touch(src);
     touch(x);
     if ( !eq(to_int(recip(x)), src) ) return __LINE__;
 
-#  ifdef rsqrt
+#   ifdef rsqrt
     x = src * src;
     touch(x);
     y = rsqrt(x);
@@ -1158,6 +1171,7 @@ int simd_test(void)
     if ( !eq(to_int(recip(y)), src) ) return __LINE__;
     touch(src);
     if ( !eq(to_int(y), to_int(recip(src))) ) return __LINE__;
+#   endif
 #  endif
 
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -244,6 +244,7 @@ asm ( ".macro override insn    \n\t"
 OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
+OVR_VFP(cvtdq2);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
@@ -330,13 +331,19 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(cvtpd2dqx);
+OVR(cvtpd2dqy);
 OVR(cvtpd2psx);
 OVR(cvtpd2psy);
 OVR(cvtph2ps);
+OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
 OVR(cvtss2sd);
+OVR(cvttpd2dqx);
+OVR(cvttpd2dqy);
+OVR(cvttps2dq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -311,7 +311,7 @@ static const struct twobyte_table {
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -375,7 +375,7 @@ static const struct twobyte_table {
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
@@ -3081,6 +3081,11 @@ x86_decode(
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
                 break;
+
+            case 0xe6: /* vcvtdq2pd needs special casing */
+                if ( disp8scale && evex.pfx == vex_f3 && !evex.w && !evex.brs )
+                    --disp8scale;
+                break;
             }
             break;
 
@@ -6578,6 +6583,22 @@ x86_emulate(
         op_bytes = 16 << vex.l;
         goto simd_0f_cvt;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x5b): /* vcvtps2dq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x5b): /* vcvttps2dq [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        if ( ea.type != OP_REG || !evex.brs )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x60):    /* punpcklbw {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x60): /* vpunpcklbw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0x61):    /* punpcklwd {,x}mm/mem,{,x}mm */
@@ -7240,6 +7261,27 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe6):   /* vcvttpd2dq [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx != vex_f3 )
+            host_and_vcpu_must_have(avx512f);
+        else if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+        {
+            host_and_vcpu_must_have(avx512f);
+            generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
+        }
+        if ( ea.type != OP_REG || !evex.brs )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 8 << (evex.w + evex.lr);
+        goto simd_zmm;
+
     case X86EMUL_OPC_F2(0x0f, 0xf0):     /* lddqu m128,xmm */
     case X86EMUL_OPC_VEX_F2(0x0f, 0xf0): /* vlddqu mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v7 22/49] x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (21 preceding siblings ...)
  2018-12-19 14:48   ` [PATCH v7 21/49] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
@ 2018-12-19 14:51   ` Jan Beulich
  2018-12-19 14:51   ` [PATCH v7 23/49] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
                     ` (26 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:51 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVT{,T}S{S,D}2SI use EVEX.W for their destination (register) size
rather than for their (possibly memory) source operand size, and hence
need a "manual" override of disp8scale.

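Concretely (an editorial sketch mirroring the new special case in
x86_decode(), not extra patch code): the memory operand is always a
single source element, 4 bytes for the SS forms and 8 for the SD ones,
independent of EVEX.W; e.g. vcvttsd2si 8(%rax),%rcx can encode its
displacement as disp8 = 1.

    /* VEX_PREFIX_DOUBLE_MASK selects 1 for F2 (sd), 0 for F3 (ss). */
    static unsigned int cvts2si_disp8scale(unsigned int pfx_double)
    {
        return 2 + pfx_double; /* log2 of N: 1 << 2 = 4, 1 << 3 = 8 */
    }
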
While the SDM claims that EVEX.L'L needs to be zero for the 32-bit forms
of VCVT{,U}SI2SD (exception type E10NF), observations on my test system
do not confirm this (and I've got informal confirmation that this is a
doc mistake). Nevertheless, to be on the safe side, force evex.lr to
zero in this case when constructing the stub.

Slightly adjust the scalar to_int() in the test harness, to increase the
chances of the operand ending up in memory.
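
For context, since the helper's definition is outside the hunks below:
touch() is essentially a bare compiler barrier along the lines of the
sketch here (an approximation, not quoted from the harness), pinning
its argument in memory and hence making a memory source operand for
the subsequent conversion more likely.

    /* Approximate shape of the test harness' touch(). */
    #define touch(var) asm volatile ( "" : "+m" (var) )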

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix VCVTSI2SS - cannot re-use VMOV{D,Q} code here, as the register
    form can't be converted to a memory one when embedded rounding is in
    effect. Force evex.lr to zero for 32-bit VCVTSI2SD. Permit embedded
    rounding for VCVT{,T}S{S,D}2SI. Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -117,8 +117,16 @@ static const struct test avx512f_all[] =
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
+    INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
+    INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -746,8 +754,9 @@ static void test_group(const struct test
                 break;
 
             case ESZ_dq:
-                test_pair(&tests[i], vl[j], ESZ_d, "d", ESZ_q, "q",
-                          instr, ctxt);
+                test_pair(&tests[i], vl[j], ESZ_d,
+                          strncmp(tests[i].mnemonic, "cvt", 3) ? "d" : "l",
+                          ESZ_q, "q", instr, ctxt);
                 break;
 
 #ifdef __i386__
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -89,7 +89,7 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 
 #if VEC_SIZE == FLOAT_SIZE
-# define to_int(x) ((vec_t){ (int)(x)[0] })
+# define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -340,10 +340,28 @@ OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
+OVR(cvtsd2si);
+OVR(cvtsd2sil);
+OVR(cvtsd2siq);
+OVR(cvtsi2sd);
+OVR(cvtsi2sdl);
+OVR(cvtsi2sdq);
+OVR(cvtsi2ss);
+OVR(cvtsi2ssl);
+OVR(cvtsi2ssq);
 OVR(cvtss2sd);
+OVR(cvtss2si);
+OVR(cvtss2sil);
+OVR(cvtss2siq);
 OVR(cvttpd2dqx);
 OVR(cvttpd2dqy);
 OVR(cvttps2dq);
+OVR(cvttsd2si);
+OVR(cvttsd2sil);
+OVR(cvttsd2siq);
+OVR(cvttss2si);
+OVR(cvttss2sil);
+OVR(cvttss2siq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -296,7 +296,7 @@ static const struct twobyte_table {
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
     [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp, simd_none, d8s_dq },
@@ -3072,6 +3072,12 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x2c: /* vcvtts{s,d}2si need special casing */
+            case 0x2d: /* vcvts{s,d}2si need special casing */
+                if ( evex_encoded() )
+                    disp8scale = 2 + (evex.pfx & VEX_PREFIX_DOUBLE_MASK);
+                break;
+
             case 0x5a: /* vcvtps2pd needs special casing */
                 if ( disp8scale && !evex.pfx && !evex.brs )
                     --disp8scale;
@@ -6190,6 +6196,48 @@ x86_emulate(
         state->simd_size = simd_none;
         goto simd_0f_rm;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+        generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.brs),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.brs )
+            avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        if ( ea.type == OP_MEM )
+        {
+            rc = read_ulong(ea.mem.seg, ea.mem.off, &src.val,
+                            rex_prefix & REX_W ? 8 : 4, ctxt, ops);
+            if ( rc != X86EMUL_OKAY )
+                goto done;
+        }
+        else
+            src.val = *ea.reg;
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert memory/GPR source to %rAX. */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        /*
+         * SDM version 067 claims that exception type E10NF implies #UD when
+         * EVEX.L'L is non-zero for 32-bit VCVT{,U}SI2SD. Experimentally this
+         * cannot be confirmed, but be on the safe side for the stub.
+         */
+        if ( !evex.w && evex.pfx == vex_f2 )
+            evex.lr = 0;
+        opc[1] = (modrm & 0x38) | 0xc0;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "=g" (dummy) : "a" (src.val));
+
+        put_stub(stub);
+        state->simd_size = simd_none;
+        break;
+
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2c):     /* cvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_VEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2d):     /* cvts{s,d}2si xmm/mem,reg */
@@ -6213,14 +6261,17 @@ x86_emulate(
         }
 
         opc = init_prefixes(stub);
+    cvts_2si:
         opc[0] = b;
         /* Convert GPR destination to %rAX and memory operand to (%rCX). */
         rex_prefix &= ~REX_R;
         vex.r = 1;
+        evex.r = 1;
         if ( ea.type == OP_MEM )
         {
             rex_prefix &= ~REX_B;
             vex.b = 1;
+            evex.b = 1;
             opc[1] = 0x01;
 
             rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp,
@@ -6231,11 +6282,22 @@ x86_emulate(
         else
             opc[1] = modrm & 0xc7;
         if ( !mode_64bit() )
+        {
             vex.w = 0;
-        insn_bytes = PFX_BYTES + 2;
+            evex.w = 0;
+        }
+        if ( evex_encoded() )
+        {
+            insn_bytes = EVEX_PFX_BYTES + 2;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 2;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
         opc[2] = 0xc3;
 
-        copy_REX_VEX(opc, rex_prefix, vex);
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
 
@@ -6243,6 +6305,18 @@ x86_emulate(
         state->simd_size = simd_none;
         break;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+        generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
+                               (ea.type != OP_REG && evex.brs)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.brs )
+            avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto cvts_2si;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2e):     /* ucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2f):     /* comis{s,d} xmm/mem,xmm */





* [PATCH v7 23/49] x86emul: support AVX512DQ packed quad-int/FP conversion insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (22 preceding siblings ...)
  2018-12-19 14:51   ` [PATCH v7 22/49] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
@ 2018-12-19 14:51   ` Jan Beulich
  2018-12-19 14:52   ` [PATCH v7 24/49] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
                     ` (25 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:51 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVT{,T}PS2QQ, sharing their main opcodes with others, once again need
"manual" overrides of disp8scale.

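The arithmetic matches the VCVTDQ2PD case earlier in the series (again
an editorial sketch, not patch code): the quad-int result is produced
from half a vector of floats, so for instance the EVEX.512 form scales
disp8 by 32.

    /* Assumes EVEX.W = 0 (the PS2QQ forms) and no embedded broadcast. */
    static int vcvtps2qq_mem_disp(signed char disp8, unsigned int lr)
    {
        return disp8 * ((16 << lr) / 2); /* lr = 2 (zmm): N = 32 */
    }
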
While not directly related here, also add a scalar variant of to_wint()
to the test harness.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Workaround for gcc 7 quirk.
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -410,8 +410,12 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
+    INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -90,14 +90,35 @@ static inline bool _to_bool(byte_vec_t b
 
 #if VEC_SIZE == FLOAT_SIZE
 # define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
+# ifdef __x86_64__
+#  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) ({ \
+    vsf_half_t t_ = low_half(x); \
+    vdi_t lo_, hi_; \
+    touch(t_); \
+    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    t_ = high_half(x); \
+    touch(t_); \
+    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    touch(lo_); touch(hi_); \
+    insert_half(insert_half(undef(), \
+                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+})
+#  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
@@ -121,6 +142,21 @@ static inline bool _to_bool(byte_vec_t b
 })
 #endif
 
+#if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
+# define low_half(x) (x)
+# define high_half(x) B_(movhlps, , undef(), x)
+/*
+ * GCC 7 (and perhaps earlier) reports a bogus type mismatch for the conditional
+ * expression below. All works well with this no-op wrapper.
+ */
+static inline vec_t movlhps(vec_t x, vec_t y) {
+    return __builtin_ia32_movlhps(x, y);
+}
+# define insert_pair(x, y, p) \
+    ((p) ? movlhps(x, y) \
+         : ({ vec_t t_ = (x); t_[0] = (y)[0]; t_[1] = (y)[1]; t_; }))
+#endif
+
 #if VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW_A__)
 # define max __builtin_ia32_pfmax
 # define min __builtin_ia32_pfmin
@@ -149,13 +185,16 @@ static inline bool _to_bool(byte_vec_t b
 # if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
      (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
      (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
-#  define low_half(x) ({ \
+#  define _half(x, lh) ({ \
     half_t t_; \
-    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+    asm ( "vextractf%c[w]x%c[n] %[sel], %[s], %[d]" \
           : [d] "=m" (t_) \
-          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+          : [s] "v" (x), [sel] "i" (lh), \
+            [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
     t_; \
 })
+#  define low_half(x)  _half(x, 0)
+#  define high_half(x) _half(x, 1)
 # endif
 # if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
      (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
@@ -1176,6 +1215,13 @@ int simd_test(void)
 
 # endif
 
+# ifdef to_wint
+    touch(src);
+    x = to_wint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
 # ifdef sqrt
     x = src * src;
     touch(x);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -325,6 +325,8 @@ static const struct twobyte_table {
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3083,6 +3085,12 @@ x86_decode(
                     --disp8scale;
                 break;
 
+            case 0x7a: /* vcvttps2qq needs special casing */
+            case 0x7b: /* vcvtps2qq needs special casing */
+                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.brs )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -7344,7 +7352,13 @@ x86_emulate(
         if ( evex.pfx != vex_f3 )
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
+        {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2qq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512dq);
+        }
         else
         {
             host_and_vcpu_must_have(avx512f);





* [PATCH v7 24/49] x86emul: support AVX512{F, DQ} uint-to-FP conversion insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (23 preceding siblings ...)
  2018-12-19 14:51   ` [PATCH v7 23/49] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
@ 2018-12-19 14:52   ` Jan Beulich
  2018-12-19 14:52   ` [PATCH v7 25/49] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
                     ` (24 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:52 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Some "manual" overrides of disp8scale are needed here again. In
particular, the code ends up simpler when using d8s_dq64 in the
twobyte_table[] entry.
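
For reference, d8s_dq64 means the memory operand is a GPR-sized
integer, i.e. disp8 scales by the effective operand size (an editorial
paraphrase of the tuple rule, not a verbatim copy):

    #include <stdbool.h>

    /* Returns log2 of N for a d8s_dq64-style operand. */
    static unsigned int d8s_dq64_scale(bool op_size_64)
    {
        /* op_size_64: REX.W/EVEX.W honored in 64-bit mode only. */
        return 2 + op_size_64; /* N = 4 or 8 */
    }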

Test harness additions will be done once the reverse conversions are
also available.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -127,6 +127,10 @@ static const struct test avx512f_all[] =
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
+    INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
+    INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
+    INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -416,6 +420,8 @@ static const struct test avx512dq_all[]
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
+    INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -326,7 +326,7 @@ static const struct twobyte_table {
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3085,12 +3085,16 @@ x86_decode(
                     --disp8scale;
                 break;
 
-            case 0x7a: /* vcvttps2qq needs special casing */
-            case 0x7b: /* vcvtps2qq needs special casing */
-                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.brs )
+            case 0x7a: /* vcvttps2qq and vcvtudq2pd need special casing */
+                if ( disp8scale && evex.pfx != vex_f2 && !evex.w && !evex.brs )
                     --disp8scale;
                 break;
 
+            case 0x7b: /* vcvtp{s,d}2qq need special casing */
+                if ( disp8scale && evex.pfx == vex_66 )
+                    disp8scale = (evex.brs ? 2 : 3 + evex.lr) + evex.w;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -6205,6 +6209,7 @@ x86_emulate(
         goto simd_0f_rm;
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x7b): /* vcvtusi2s{s,d} r/m,xmm,xmm */
         generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.brs),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
@@ -6671,6 +6676,8 @@ x86_emulate(
         /* fall through */
     case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
                                           /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x7a): /* vcvtudq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtuqq2ps [xyz]mm/mem,{x,y}mm{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
@@ -7347,6 +7354,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
         generate_exception_if(!evex.w, EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7a):   /* vcvtudq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtuqq2pd [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
         if ( evex.pfx != vex_f3 )




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 25/49] x86emul: support AVX512{F, DQ} FP-to-uint conversion insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (24 preceding siblings ...)
  2018-12-19 14:52   ` [PATCH v7 24/49] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
@ 2018-12-19 14:52   ` Jan Beulich
  2018-12-19 14:53   ` [PATCH v7 26/49] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
                     ` (23 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:52 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Along the lines of prior patches, VCVT{,T}PS2UQQ as well as
VCVT{,T}S{S,D}2USI need "manual" overrides of disp8scale.

The twobyte_table[] entries get altered, with their prior values now
established in x86_decode_twobyte() instead.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -112,21 +112,29 @@ static const struct test avx512f_all[] =
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
+    INSN(cvtpd2udq,      ,   0f, 79,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtps2udq,      ,   0f, 79,    vl,      d, vl),
     INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
+    INSN(cvtsd2usi,    f2,   0f, 79,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
     INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
     INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvtss2usi,    f3,   0f, 79,    el,      d, el),
     INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttpd2udq,     ,   0f, 78,    vl,      q, vl),
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttps2udq,     ,   0f, 78,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttsd2usi,   f2,   0f, 78,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvttss2usi,   f3,   0f, 78,    el,      d, el),
     INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
     INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
@@ -415,11 +423,15 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtpd2uqq,      66,   0f, 79,   vl,  q, vl),
     INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
+    INSN(cvtps2uqq,      66,   0f, 79, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttpd2uqq,     66,   0f, 78,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -93,31 +93,65 @@ static inline bool _to_bool(byte_vec_t b
 # ifdef __x86_64__
 #  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
 # endif
+# ifdef __AVX512F__
+/*
+ * Sadly even gcc 9.x, at the time of writing, does not carry out at least
+ * uint -> FP conversions using VCVTUSI2S{S,D}, so we need to use builtins
+ * or inline assembly here. The full-vector parameter types of the builtins
+ * aren't very helpful for our purposes, so use inline assembly.
+ */
+#  if FLOAT_SIZE == 4
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    float __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtss2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2ss%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  elif FLOAT_SIZE == 8
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    double __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtsd2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2sd%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  endif
+#  define to_uint(x) to_u_int(int, x)
+#  ifdef __x86_64__
+#   define to_uwint(x) to_u_int(long, x)
+#  endif
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  define to_uint(x) BR(cvtudq2ps, _mask, BR(cvtps2udq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
-#   define to_wint(x) ({ \
+#   define to_w_int(x, s) ({ \
     vsf_half_t t_ = low_half(x); \
     vdi_t lo_, hi_; \
     touch(t_); \
-    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    lo_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     t_ = high_half(x); \
     touch(t_); \
-    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    hi_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     touch(lo_); touch(hi_); \
     insert_half(insert_half(undef(), \
-                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
-                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+                            BR(cvt ## s ## qq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvt ## s ## qq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
 })
+#   define to_wint(x) to_w_int(x, )
+#   define to_uwint(x) to_w_int(x, u)
 #  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  define to_uint(x) B(cvtudq2pd, _mask, BR(cvtpd2udq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
 #   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#   define to_uwint(x) BR(cvtuqq2pd, _mask, BR(cvtpd2uqq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
 #  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
@@ -1221,6 +1255,20 @@ int simd_test(void)
     touch(src);
     if ( !eq(x, src) ) return __LINE__;
 # endif
+
+# ifdef to_uint
+    touch(src);
+    x = to_uint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
+# ifdef to_uwint
+    touch(src);
+    x = to_uwint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
 
 # ifdef sqrt
     x = src * src;
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -323,8 +323,7 @@ static const struct twobyte_table {
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
-    [0x78] = { ImplicitOps|ModRM },
-    [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x78 ... 0x79] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -2523,6 +2522,8 @@ x86_decode_twobyte(
         break;
 
     case 0x78:
+        state->desc = ImplicitOps;
+        state->simd_size = simd_none;
         switch ( vex.pfx )
         {
         case vex_66: /* extrq $imm8, $imm8, xmm */
@@ -2535,7 +2536,7 @@ x86_decode_twobyte(
     case 0x10 ... 0x18:
     case 0x28 ... 0x2f:
     case 0x50 ... 0x77:
-    case 0x79 ... 0x7d:
+    case 0x7a ... 0x7d:
     case 0x7f:
     case 0xc2 ... 0xc3:
     case 0xc5 ... 0xc6:
@@ -2557,6 +2558,12 @@ x86_decode_twobyte(
         op_bytes = mode_64bit() ? 8 : 4;
         break;
 
+    case 0x79:
+        state->desc = DstReg | SrcMem;
+        state->simd_size = simd_packed_int;
+        ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
+        break;
+
     case 0x7e:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         if ( vex.pfx == vex_f3 ) /* movq xmm/m64,xmm */
@@ -3074,6 +3081,18 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x78:
+            case 0x79:
+                if ( !evex.pfx )
+                    break;
+                /* vcvt{,t}ps2uqq need special casing */
+                if ( evex.pfx == vex_66 )
+                {
+                    if ( !evex.w && !evex.brs )
+                        --disp8scale;
+                    break;
+                }
+                /* vcvt{,t}s{s,d}2usi need special casing: fall through */
             case 0x2c: /* vcvtts{s,d}2si need special casing */
             case 0x2d: /* vcvts{s,d}2si need special casing */
                 if ( evex_encoded() )
@@ -6320,6 +6339,8 @@ x86_emulate(
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x78): /* vcvtts{s,d}2usi xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x79): /* vcvts{s,d}2usi xmm/mem,reg */
         generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
                                (ea.type != OP_REG && evex.brs)),
                               EXC_UD);
@@ -6681,7 +6702,11 @@ x86_emulate(
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
+        {
+    case X86EMUL_OPC_EVEX(0x0f, 0x78):    /* vcvttp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX(0x0f, 0x79):    /* vcvtp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512f);
+        }
         if ( ea.type != OP_REG || !evex.brs )
             avx512_vlen_check(false);
         d |= TwoOp;
@@ -7362,6 +7387,10 @@ x86_emulate(
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
         {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x78):   /* vcvttps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2uqq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x79):   /* vcvtps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2uqq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */





* [PATCH v7 26/49] x86emul: support remaining AVX512F legacy-equivalent insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (25 preceding siblings ...)
  2018-12-19 14:52   ` [PATCH v7 25/49] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
@ 2018-12-19 14:53   ` Jan Beulich
  2018-12-19 14:53   ` [PATCH v7 27/49] x86emul: support remaining AVX512BW " Jan Beulich
                     ` (22 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Plus their AVX512BW counterparts.

Take the opportunity to also eliminate a pair of open-coded instances
of scalar_1op().
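
For readers without the harness at hand: scalar_1op() is, roughly, the
generalization of the open-coded asm() constructs removed below, taking
the instruction template as an argument with named %[in]/%[out]
operands (an approximate reconstruction, not quoted from simd.c):

    #define scalar_1op(x, insn) ({ \
        typeof((x)[0]) __attribute__((vector_size(16))) r_; \
        asm ( insn : [out] "=&x" (r_) : [in] "m" (x) ); \
        (vec_t){ r_[0] }; \
    })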

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Re-base over changes earlier in the series.
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -193,6 +193,8 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(pabsd,        66, 0f38, 1e,    vl,      d, vl),
+    INSN(pabsq,        66, 0f38, 1f,    vl,      q, vl),
     INSN(paddd,        66,   0f, fe,    vl,      d, vl),
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
@@ -276,6 +278,10 @@ static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
+    INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
+    INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
+    INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -336,6 +342,8 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(pabsb,       66, 0f38, 1c,    vl,    b, vl),
+    INSN(pabsw,       66, 0f38, 1d,    vl,    w, vl),
     INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
     INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
     INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -211,8 +211,10 @@ static inline vec_t movlhps(vec_t x, vec
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
+#  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
+#  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
@@ -263,6 +265,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
 #  define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
@@ -316,6 +319,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
+#  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
@@ -548,6 +552,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  endif
 # endif
 # if INT_SIZE == 4
+#  define abs(x) B(pabsd, _mask, x, undef(), ~0)
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
@@ -558,6 +563,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
 #  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
+#  define abs(x) ((vec_t)B(pabsq, _mask, (vdi_t)(x), (vdi_t)undef(), ~0))
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 8
@@ -625,6 +631,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
+#  define abs(x) ((vec_t)B(pabsb, _mask, (vqi_t)(x), (vqi_t)undef(), ~0))
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
@@ -637,6 +644,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
 #  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 2
+#  define abs(x) B(pabsw, _mask, x, undef(), ~0)
 #  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
 #  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
@@ -948,19 +956,11 @@ static inline vec_t movlhps(vec_t x, vec
 #if VEC_SIZE == FLOAT_SIZE
 # define max(x, y) ((vec_t){({ typeof(x[0]) x_ = (x)[0], y_ = (y)[0]; x_ > y_ ? x_ : y_; })})
 # define min(x, y) ((vec_t){({ typeof(x[0]) x_ = (x)[0], y_ = (y)[0]; x_ < y_ ? x_ : y_; })})
-# ifdef __SSE4_1__
+# if defined(__SSE4_1__) && !defined(__AVX512F__)
 #  if FLOAT_SIZE == 4
-#   define trunc(x) ({ \
-    float __attribute__((vector_size(16))) r_; \
-    asm ( "roundss $0b1011,%1,%0" : "=x" (r_) : "m" (x) ); \
-    (vec_t){ r_[0] }; \
-})
+#   define trunc(x) scalar_1op(x, "roundss $0b1011, %[in], %[out]")
 #  elif FLOAT_SIZE == 8
-#   define trunc(x) ({ \
-    double __attribute__((vector_size(16))) r_; \
-    asm ( "roundsd $0b1011,%1,%0" : "=x" (r_) : "m" (x) ); \
-    (vec_t){ r_[0] }; \
-})
+#   define trunc(x) scalar_1op(x, "roundsd $0b1011, %[in], %[out]")
 #  endif
 # endif
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -184,6 +184,8 @@ DECL_OCTET(half);
 # define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 # define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 # define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+# define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
+# define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -245,6 +247,7 @@ OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_VFP(cvtdq2);
+OVR_INT(abs);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -446,7 +446,7 @@ static const struct ext0f38_table {
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x1c ... 0x1f] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
     [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
@@ -531,8 +531,8 @@ static const struct ext0f3a_table {
     [0x02] = { .simd_size = simd_packed_int },
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
-    [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
-    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
+    [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc, .d8s = d8s_dq },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
     [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
@@ -6906,6 +6906,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = 1 << (b & 1);
@@ -8292,6 +8294,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1e): /* vpabsd [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1f): /* vpabsq [xyz]mm/mem,[xyz]mm{k} */
         generate_exception_if(evex.w != (b & 1), EXC_UD);
         goto avx512f_no_sae;
 
@@ -9320,6 +9324,17 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_1);
         goto simd_0f3a_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0a): /* vrndscaless $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0b): /* vrndscalesd $imm8,xmm/mem,xmm,xmm{k} */
+        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x08): /* vrndscaleps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x09): /* vrndscalepd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        avx512_vlen_check(b & 2);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f3a, 0x0f):    /* palignr $imm8,mm/m64,mm */
     case X86EMUL_OPC_66(0x0f3a, 0x0f): /* palignr $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(ssse3);
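
The $0b1011 immediate used by the round/rndscale macros above decodes
(by my reading of the encoding, so treat this as an assumption) as:
bits 1:0 = 3 select round-toward-zero, bit 2 = 0 takes the rounding
mode from the immediate rather than from MXCSR, and bit 3 = 1
suppresses precision exceptions; hence the macros implement trunc().
A minimal reference model under those assumptions (rndscale_ref() is a
made-up helper name):

#include <assert.h>
#include <math.h>

static double rndscale_ref(double x, unsigned int imm)
{
    double scale = exp2(imm >> 4);   /* M = imm[7:4]: 2^-M granularity */

    assert((imm & 7) == 3);          /* only imm-RC 3 (truncate) modeled */
    return trunc(x * scale) / scale; /* rndscale_ref(x, 0b1011) == trunc(x) */
}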

* [PATCH v7 27/49] x86emul: support remaining AVX512BW legacy-equivalent insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (26 preceding siblings ...)
  2018-12-19 14:53   ` [PATCH v7 26/49] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
@ 2018-12-19 14:53   ` Jan Beulich
  2018-12-19 14:54   ` [PATCH v7 28/49] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
                     ` (21 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -354,6 +354,7 @@ static const struct test avx512bw_all[]
     INSN(paddusb,     66,   0f, dc,    vl,    b, vl),
     INSN(paddusw,     66,   0f, dd,    vl,    w, vl),
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
+    INSN(palignr,     66, 0f3a, 0f,    vl,    b, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
     INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
@@ -369,6 +370,7 @@ static const struct test avx512bw_all[]
     INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
+    INSN(pmaddubsw,   66, 0f38, 04,    vl,    b, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -386,6 +388,7 @@ static const struct test avx512bw_all[]
 //       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
+    INSN(pmulhrsw,    66, 0f38, 0b,    vl,    w, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -587,6 +587,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define rotr(x, n) ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), (n) * 8, (vdi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
 #  elif defined(__AVX512VBMI__)
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
@@ -615,6 +616,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define rotr(x, n) ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), (n) * 16, (vdi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
                              (vsi_t)B(pshufhw, _mask, \
                                       B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -402,9 +402,12 @@ OVR(packssdw);
 OVR(packsswb);
 OVR(packusdw);
 OVR(packuswb);
+OVR(palignr);
+OVR(pmaddubsw);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
+OVR(pmulhrsw);
 OVR(pmulhuw);
 OVR(pmulhw);
 OVR(pmullw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -435,7 +435,10 @@ static const struct ext0f38_table {
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x01 ... 0x03] = { .simd_size = simd_packed_int },
+    [0x04] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x05 ... 0x0a] = { .simd_size = simd_packed_int },
+    [0x0b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -534,7 +537,8 @@ static const struct ext0f3a_table {
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc, .d8s = d8s_dq },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
-    [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
+    [0x0e] = { .simd_size = simd_packed_int },
+    [0x0f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
@@ -6888,6 +6892,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x04): /* vpmaddubsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6906,6 +6911,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0b): /* vpmulhrsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
@@ -9363,6 +9369,10 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 4;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0f): /* vpalignr $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        goto avx512bw_imm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x14): /* pextrb $imm8,xmm,r/m */
     case X86EMUL_OPC_66(0x0f3a, 0x15): /* pextrw $imm8,xmm,r/m */
     case X86EMUL_OPC_66(0x0f3a, 0x16): /* pextr{d,q} $imm8,xmm,r/m */
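
The rotr() definitions added above rely on VPALIGNR extracting a
byte-aligned window from the concatenation of its two sources, so that
passing the same register twice yields a right rotate. A simplified
single-lane model (an illustration of the assumed semantics, not
emulator code; palignr_lane() is a made-up helper name):

#include <stdint.h>
#include <string.h>

static void palignr_lane(uint8_t dst[16], const uint8_t a[16],
                         const uint8_t b[16], unsigned int shift)
{
    uint8_t concat[32];
    unsigned int i;

    memcpy(concat, b, 16);       /* low 16 bytes: second source */
    memcpy(concat + 16, a, 16);  /* high 16 bytes: first source */
    for ( i = 0; i < 16; ++i )
        dst[i] = shift + i < 32 ? concat[shift + i] : 0;
}

With a == b == x and shift <= 16 this rotates x right by 'shift'
bytes, which is what rotr(x, n) builds on for byte and word elements.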

* [PATCH v7 28/49] x86emul: support AVX512{F, ER} reciprocal insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (27 preceding siblings ...)
  2018-12-19 14:53   ` [PATCH v7 27/49] x86emul: support remaining AVX512BW " Jan Beulich
@ 2018-12-19 14:54   ` Jan Beulich
  2018-12-19 14:55   ` [PATCH v7 29/49] x86emul: support AVX512F floating point manipulation insns Jan Beulich
                     ` (20 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:54 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also include the only other AVX512ER insn pair, VEXP2P{D,S}.

Note that despite the replacement of the SHA insns' table slots there's
no need to special-case their decoding: their insn-specific code already
sets op_bytes (as was required due to simd_other), and TwoOp is of no
relevance for legacy-encoded SIMD insns.

Raising #UD when EVEX.L'L is 3 for the AVX512ER scalar insns errs on
the safe side: the SDM does not clarify behavior there, and the case is
even more ambiguous here, with AVX512VL not in the picture.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix vector length check for AVX512ER insns. ea.type == OP_* ->
    ea.type != OP_*. Re-base.
v6: Re-base. AVX512ER tests now also successfully run.
v5: New.
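
A minimal sketch of the L'L policy described above (an illustration of
the assumed rules, not the emulator's avx512_vlen_check() logic;
evex_length_ok() is a made-up helper name):

#include <stdbool.h>

static bool evex_length_ok(unsigned int lr, bool scalar, bool have_vl)
{
    if ( lr >= 3 )
        return false;          /* reserved L'L encoding -> #UD */
    if ( scalar )
        return true;           /* LIG: length otherwise ignored */
    return lr == 2 || have_vl; /* 128/256-bit packed forms need AVX512VL */
}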

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -19,7 +19,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -75,6 +75,9 @@ avx512bw-flts :=
 avx512dq-vecs := $(avx512f-vecs)
 avx512dq-ints := $(avx512f-ints)
 avx512dq-flts := $(avx512f-flts)
+avx512er-vecs := 64
+avx512er-ints :=
+avx512er-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -278,10 +278,14 @@ static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rcp14,        66, 0f38, 4c,    vl,     sd, vl),
+    INSN(rcp14,        66, 0f38, 4d,    el,     sd, el),
     INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
     INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
     INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
+    INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
+    INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -477,6 +481,14 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512er_512[] = {
+    INSN(exp2,    66, 0f38, c8, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, ca, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, cb, el, sd, el),
+    INSN(rsqrt28, 66, 0f38, cc, vl, sd, vl),
+    INSN(rsqrt28, 66, 0f38, cd, el, sd, el),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -837,5 +849,6 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -210,9 +210,23 @@ static inline vec_t movlhps(vec_t x, vec
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -263,6 +277,13 @@ static inline vec_t movlhps(vec_t x, vec
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14ps, _mask, x, undef(), ~0)
+#  endif
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
@@ -318,6 +339,13 @@ static inline vec_t movlhps(vec_t x, vec
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14pd, _mask, x, undef(), ~0)
+#  endif
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -178,14 +178,20 @@ DECL_OCTET(half);
 /* Sadly there are a few exceptions to the general naming rules. */
 # define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
 # define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+# define __builtin_ia32_exp2pd512_mask __builtin_ia32_exp2pd_mask
+# define __builtin_ia32_exp2ps512_mask __builtin_ia32_exp2ps_mask
 # define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
 # define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
 # define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
 # define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 # define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 # define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+# define __builtin_ia32_rcp28pd512_mask __builtin_ia32_rcp28pd_mask
+# define __builtin_ia32_rcp28ps512_mask __builtin_ia32_rcp28ps_mask
 # define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
 # define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
+# define __builtin_ia32_rsqrt28pd512_mask __builtin_ia32_rsqrt28pd_mask
+# define __builtin_ia32_rsqrt28ps512_mask __builtin_ia32_rsqrt28ps_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -24,6 +24,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f.h"
 #include "avx512bw.h"
 #include "avx512dq.h"
+#include "avx512er.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -106,6 +107,11 @@ static bool simd_check_avx512dq_vl(void)
     return cpu_has_avx512dq && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx512er(void)
+{
+    return cpu_has_avx512er;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -327,6 +333,10 @@ static const struct {
     AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
     AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
     AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
+    SIMD(AVX512ER f32 scalar,avx512er,        f4),
+    SIMD(AVX512ER f32x16,    avx512er,      64f4),
+    SIMD(AVX512ER f64 scalar,avx512er,        f8),
+    SIMD(AVX512ER f64x8,     avx512er,      64f8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -134,6 +134,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_bmi2       cp.feat.bmi2
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
+#define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -471,6 +471,10 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -510,7 +514,12 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
-    [0xc8 ... 0xcd] = { .simd_size = simd_other },
+    [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xc9] = { .simd_size = simd_other },
+    [0xca] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
     [0xf0] = { .two_op = 1 },
@@ -1873,6 +1882,7 @@ static bool vcpu_has(
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
+#define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
@@ -6159,6 +6169,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
@@ -8854,6 +8866,13 @@ x86_emulate(
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(true);
+        goto simd_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x5a): /* vbroadcasti128 m128,ymm */
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
@@ -9101,6 +9120,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
         host_and_vcpu_must_have(avx512f);
+    simd_zmm_scalar_sae:
         generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
         if ( !evex.brs )
             avx512_vlen_check(true);
@@ -9116,6 +9136,19 @@ x86_emulate(
         op_bytes = 16;
         goto simd_0f38_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc8): /* vexp2p{s,d} zmm/mem,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xca): /* vrcp28p{s,d} zmm/mem,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcc): /* vrsqrt28p{s,d} zmm/mem,zmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        generate_exception_if((ea.type != OP_REG || !evex.brs) && evex.lr != 2,
+                              EXC_UD);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcb): /* vrcp28s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcd): /* vrsqrt28s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        goto simd_zmm_scalar_sae;
+
     case X86EMUL_OPC(0x0f38, 0xf0): /* movbe m,r */
     case X86EMUL_OPC(0x0f38, 0xf1): /* movbe r,m */
         vcpu_must_have(movbe);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -103,6 +103,7 @@
 #define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
+#define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)

* [PATCH v7 29/49] x86emul: support AVX512F floating point manipulation insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (28 preceding siblings ...)
  2018-12-19 14:54   ` [PATCH v7 28/49] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
@ 2018-12-19 14:55   ` Jan Beulich
  2018-12-19 14:55   ` [PATCH v7 30/49] x86emul: support AVX512DQ " Jan Beulich
                     ` (19 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix vector length check for scalar insns. ea.type == OP_* ->
    ea.type != OP_*. Re-base.
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -140,6 +140,8 @@ static const struct test avx512f_all[] =
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
     INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
+    INSN(fixupimm,     66, 0f3a, 54,    vl,     sd, vl),
+    INSN(fixupimm,     66, 0f3a, 55,    el,     sd, el),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
     INSN(fmadd213,     66, 0f38, a8,    vl,     sd, vl),
@@ -170,6 +172,10 @@ static const struct test avx512f_all[] =
     INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
     INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
     INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
+    INSN(getexp,       66, 0f38, 42,    vl,     sd, vl),
+    INSN(getexp,       66, 0f38, 43,    el,     sd, el),
+    INSN(getmant,      66, 0f3a, 26,    vl,     sd, vl),
+    INSN(getmant,      66, 0f3a, 27,    el,     sd, el),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
@@ -286,6 +292,8 @@ static const struct test avx512f_all[] =
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
     INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
     INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
+    INSN(scalef,       66, 0f38, 2c,    vl,     sd, vl),
+    INSN(scalef,       66, 0f38, 2d,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -174,6 +174,11 @@ static inline bool _to_bool(byte_vec_t b
     asm ( op : [out] "=&x" (r_) : [in] "m" (x) ); \
     (vec_t){ r_[0] }; \
 })
+# define scalar_2op(x, y, op) ({ \
+    typeof((x)[0]) __attribute__((vector_size(16))) r_ = { x[0] }; \
+    asm ( op : [out] "=&x" (r_) : [in1] "[out]" (r_), [in2] "m" (y) ); \
+    (vec_t){ r_[0] }; \
+})
 #endif
 
 #if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
@@ -210,6 +215,8 @@ static inline vec_t movlhps(vec_t x, vec
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  define getexp(x) scalar_1op(x, "vgetexpss %[in], %[out], %[out]")
+#  define getmant(x) scalar_1op(x, "vgetmantss $0, %[in], %[out], %[out]")
 #  ifdef __AVX512ER__
 #   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
@@ -217,9 +224,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
 #  endif
+#  define scale(x, y) scalar_2op(x, y, "vscalefss %[in2], %[in1], %[out]")
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  define getexp(x) scalar_1op(x, "vgetexpsd %[in], %[out], %[out]")
+#  define getmant(x) scalar_1op(x, "vgetmantsd $0, %[in], %[out], %[out]")
 #  ifdef __AVX512ER__
 #   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
@@ -227,6 +237,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
 #  endif
+#  define scale(x, y) scalar_2op(x, y, "vscalefsd %[in2], %[in1], %[out]")
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -274,9 +285,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
 #   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  define getexp(x) BR(getexpps, _mask, x, undef(), ~0)
+#  define getmant(x) BR(getmantps, _mask, x, 0, undef(), ~0)
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
 #   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
@@ -336,9 +350,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
 #   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  define getexp(x) BR(getexppd, _mask, x, undef(), ~0)
+#  define getmant(x) BR(getmantpd, _mask, x, 0, undef(), ~0)
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
 #   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
@@ -1766,6 +1783,28 @@ int simd_test(void)
 # endif
 #endif
 
+#if defined(getexp) && defined(getmant)
+    touch(src);
+    x = getmant(src);
+    touch(src);
+    y = getexp(src);
+    touch(src);
+    for ( j = i = 0; i < ELEM_COUNT; ++i )
+    {
+        if ( y[i] != j ) return __LINE__;
+
+        if ( !((i + 1) & (i + 2)) )
+            ++j;
+
+        if ( !(i & (i + 1)) && x[i] != 1 ) return __LINE__;
+    }
+# ifdef scale
+    touch(y);
+    z = scale(x, y);
+    if ( !eq(src, z) ) return __LINE__;
+# endif
+#endif
+
 #if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
     (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3924,6 +3924,44 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vfixupimmpd $0,8(%edx){1to8},%zmm3,%zmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vfixupimmpd);
+        static const struct {
+            double d[4];
+        }
+        src = { { -1, 0, 1, 2 } },
+        dst = { { 3, 4, 5, 6 } },
+        out = { { .5, -1, 90, 2 } };
+
+        asm volatile ( "vbroadcastf64x4 %1, %%zmm3\n\t"
+                       "vbroadcastf64x4 %2, %%zmm4\n"
+                       put_insn(vfixupimmpd,
+                                "vfixupimmpd $0, 8(%0)%{1to8%}, %%zmm3, %%zmm4")
+                       :: "d" (NULL), "m" (src), "m" (dst) );
+
+        set_insn(vfixupimmpd);
+        /*
+         * Nibble (token) mapping (unused ones simply set to zero):
+         * 2 (ZERO)    ->  -1 (0x9)
+         * 3 (POS_ONE) ->  90 (0xc)
+         * 6 (NEG)     -> 1/2 (0xb)
+         * 7 (POS)     -> src (0x1)
+         */
+        res[2] = 0x1b00c900;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovupd %%zmm4, %0" : "=m" (res[0]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(vfixupimmpd) ||
+             memcmp(res + 0, &out, sizeof(out)) ||
+             memcmp(res + 8, &out, sizeof(out)) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -459,7 +459,8 @@ static const struct ext0f38_table {
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
+    [0x2c] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x2d] = { .simd_size = simd_packed_fp, .d8s = d8s_dq },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
@@ -470,6 +471,8 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x42] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -563,6 +566,8 @@ static const struct ext0f3a_table {
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x26] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x27] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
     [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
@@ -577,6 +582,8 @@ static const struct ext0f3a_table {
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4c] = { .simd_size = simd_packed_int, .four_op = 1 },
+    [0x54] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x55] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -2684,6 +2691,10 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x2d): /* vscalefs{s,d} */
+        state->simd_size = simd_scalar_vexw;
+        break;
+
     case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
     case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
     case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
@@ -9084,6 +9095,8 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2c): /* vscalefp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x42): /* vgetexpp{s,d} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -9107,6 +9120,8 @@ x86_emulate(
             avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2d): /* vscalefs{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x43): /* vgetexps{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
@@ -9670,6 +9685,21 @@ x86_emulate(
         op_bytes = 4;
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type != OP_REG || !evex.brs )
+            avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
+        if ( !evex.brs )
+            avx512_vlen_check(true);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )
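
The getexp/getmant/scale test added to simd.c above relies on the
identity x == getmant(x) * 2^getexp(x) for finite, non-zero x. A
hedged reference model in terms of C library calls (assuming imm8 0
for VGETMANT, i.e. a sign-preserving mantissa normalized to [1,2);
the *_ref() names are made up):

#include <math.h>

static double getmant_ref(double x) { int e; return frexp(x, &e) * 2; }
static double getexp_ref(double x)  { int e; frexp(x, &e); return e - 1; }
static double scalef_ref(double x, double y) { return x * exp2(floor(y)); }

frexp() yields m in [0.5,1) with x == m * 2^e, so the [1,2)
normalization is m * 2 at exponent e - 1, and VSCALEF's x * 2^floor(y)
recombines the two: scalef_ref(getmant_ref(x), getexp_ref(x)) == x.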

* [PATCH v7 30/49] x86emul: support AVX512DQ floating point manipulation insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (29 preceding siblings ...)
  2018-12-19 14:55   ` [PATCH v7 29/49] x86emul: support AVX512F floating point manipulation insns Jan Beulich
@ 2018-12-19 14:55   ` Jan Beulich
  2018-12-19 14:56   ` [PATCH v7 31/49] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
                     ` (18 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512DQ in the insn emulator.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix vector length check for scalar insns. Re-base.
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -457,11 +457,17 @@ static const struct test avx512dq_all[]
     INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
+    INSN(fpclass,        66, 0f3a, 66,   vl, sd, vl),
+    INSN(fpclass,        66, 0f3a, 67,   el, sd, el),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
 //       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
+    INSN(range,          66, 0f3a, 50,   vl, sd, vl),
+    INSN(range,          66, 0f3a, 51,   el, sd, el),
+    INSN(reduce,         66, 0f3a, 56,   vl, sd, vl),
+    INSN(reduce,         66, 0f3a, 57,   el, sd, el),
     INSN_PFP(xor,              0f, 57),
 };
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -285,10 +285,18 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
 #   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  ifdef __AVX512DQ__
+#   define frac(x) B(reduceps, _mask, x, 0b00001011, undef(), ~0)
+#  endif
 #  define getexp(x) BR(getexpps, _mask, x, undef(), ~0)
 #  define getmant(x) BR(getmantps, _mask, x, 0, undef(), ~0)
-#  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
-#  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define max(x, y) BR(rangeps, _mask, x, y, 0b0101, undef(), ~0)
+#   define min(x, y) BR(rangeps, _mask, x, y, 0b0100, undef(), ~0)
+#  else
+#   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  endif
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
 #  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
@@ -350,10 +358,18 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
 #   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  ifdef __AVX512DQ__
+#   define frac(x) B(reducepd, _mask, x, 0b00001011, undef(), ~0)
+#  endif
 #  define getexp(x) BR(getexppd, _mask, x, undef(), ~0)
 #  define getmant(x) BR(getmantpd, _mask, x, 0, undef(), ~0)
-#  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
-#  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define max(x, y) BR(rangepd, _mask, x, y, 0b0101, undef(), ~0)
+#   define min(x, y) BR(rangepd, _mask, x, y, 0b0100, undef(), ~0)
+#  else
+#   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  endif
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
 #  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3962,6 +3962,39 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+
+    printf("%-40s", "Testing vfpclasspsz $0x46,64(%edx),%k2...");
+    if ( stack_exec && cpu_has_avx512dq )
+    {
+        decl_insn(vfpclassps);
+
+        asm volatile ( put_insn(vfpclassps,
+                                /* 0x46: check for +/- 0 and neg. */
+                                "vfpclasspsz $0x46, 64(%0), %%k2")
+                       :: "d" (NULL) );
+
+        set_insn(vfpclassps);
+        for ( i = 0; i < 3; ++i )
+        {
+            res[16 + i * 5 + 0] = 0x00000000; /* +0 */
+            res[16 + i * 5 + 1] = 0x80000000; /* -0 */
+            res[16 + i * 5 + 2] = 0x80000001; /* -DEN */
+            res[16 + i * 5 + 3] = 0xff000000; /* -FIN */
+            res[16 + i * 5 + 4] = 0x7f000000; /* +FIN */
+        }
+        res[31] = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vfpclassps) )
+            goto fail;
+        asm volatile ( "kmovw %%k2, %0" : "=g" (rc) );
+        if ( rc != 0xbdef )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -582,10 +582,16 @@ static const struct ext0f3a_table {
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4c] = { .simd_size = simd_packed_int, .four_op = 1 },
+    [0x50] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x51] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x54] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x55] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x56] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x57] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x66] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x67] = { .simd_size = simd_scalar_vexw, .two_op = 1, .d8s = d8s_dq },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -9685,6 +9691,10 @@ x86_emulate(
         op_bytes = 4;
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x50): /* vrangep{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x56): /* vreducep{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512dq);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512f);
@@ -9692,6 +9702,10 @@ x86_emulate(
             avx512_vlen_check(false);
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x51): /* vranges{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x57): /* vreduces{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512dq);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
         host_and_vcpu_must_have(avx512f);
@@ -9847,6 +9861,16 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x66): /* vfpclassp{s,d} $imm8,[xyz]mm/mem,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x67): /* vfpclasss{s,d} $imm8,[xyz]mm/mem,k{k} */
+        host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( !(b & 1) )
+            goto avx512f_imm8_no_sae;
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(true);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f3a, 0xcc):     /* sha1rnds4 $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(sha);
         op_bytes = 16;
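
The expected opmask 0xbdef in the vfpclassps test above can be
recomputed from the immediate: 0x46 selects +0 (bit 1), -0 (bit 2) and
negative finite values (bit 6, which I take to include negative
denormals). A reference computation under those assumptions (not
emulator code; fpclass_0x46() is a made-up helper name):

#include <stdint.h>
#include <stdio.h>

static int fpclass_0x46(uint32_t bits)
{
    uint32_t exp = (bits >> 23) & 0xff, frac = bits & 0x7fffff;

    if ( !exp && !frac )
        return 1;              /* +0 / -0: bits 1 and 2 */
    if ( exp == 0xff )
        return 0;              /* Inf / NaN: not selected by 0x46 */
    return bits >> 31;         /* negative finite (incl. denormals) */
}

int main(void)
{
    static const uint32_t in[16] = {
        0x00000000, 0x80000000, 0x80000001, 0xff000000, 0x7f000000,
        0x00000000, 0x80000000, 0x80000001, 0xff000000, 0x7f000000,
        0x00000000, 0x80000000, 0x80000001, 0xff000000, 0x7f000000,
        0x00000000,
    };
    unsigned int i, mask = 0;

    for ( i = 0; i < 16; ++i )
        mask |= fpclass_0x46(in[i]) << i;
    printf("%#x\n", mask);     /* prints 0xbdef */
    return 0;
}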

* [PATCH v7 31/49] x86emul: support AVX512{F, _VBMI2} compress/expand insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (30 preceding siblings ...)
  2018-12-19 14:55   ` [PATCH v7 30/49] x86emul: support AVX512DQ " Jan Beulich
@ 2018-12-19 14:56   ` Jan Beulich
  2018-12-19 14:56   ` [PATCH v7 32/49] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
                     ` (17 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Re-base. Add tests for the byte/word forms.
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,6 +109,7 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(compress,     66, 0f38, 8a,    vl,     sd, el),
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
@@ -140,6 +141,7 @@ static const struct test avx512f_all[] =
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
     INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
+    INSN(expand,       66, 0f38, 88,    vl,     sd, el),
     INSN(fixupimm,     66, 0f3a, 54,    vl,     sd, vl),
     INSN(fixupimm,     66, 0f3a, 55,    el,     sd, el),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
@@ -214,6 +216,7 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(pcompress,    66, 0f38, 8b,    vl,     dq, el),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
     INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
@@ -222,6 +225,7 @@ static const struct test avx512f_all[] =
     INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
+    INSN(pexpand,      66, 0f38, 89,    vl,     dq, el),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -509,6 +513,11 @@ static const struct test avx512_vbmi_all
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
 
+static const struct test avx512_vbmi2_all[] = {
+    INSN(pcompress, 66, 0f38, 63, vl, bw, el),
+    INSN(pexpand,   66, 0f38, 62, vl, bw, el),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -865,4 +874,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 512);
     RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
+    RUN(avx512_vbmi2, all);
 }
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3995,6 +3995,227 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    /*
+     * The following compress/expand tests not only make sure the
+     * accessed data is correct, but also verify (by placing operands
+     * on mapping boundaries) that elements controlled by clear mask
+     * bits don't get accessed.
+     */
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vpcompressd);
+        decl_insn(vpcompressq);
+        decl_insn(vpexpandd);
+        decl_insn(vpexpandq);
+        static const struct {
+            unsigned int d[16];
+        } dsrc = { { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 } };
+        static const struct {
+            unsigned long long q[8];
+        } qsrc = { { 0, 1, 2, 3, 4, 5, 6, 7 } };
+        unsigned int *ptr = res + MMAP_SZ / sizeof(*res) - 32;
+
+        printf("%-40s", "Testing vpcompressd %zmm1,24*4(%ecx){%k2}...");
+        asm volatile ( "kmovw %1, %%k2\n\t"
+                       "vmovdqu32 %2, %%zmm1\n"
+                       put_insn(vpcompressd,
+                                "vpcompressd %%zmm1, 24*4(%0)%{%%k2%}")
+                       :: "c" (NULL), "r" (0x55aa), "m" (dsrc) );
+
+        memset(ptr, 0xdb, 32 * 4);
+        set_insn(vpcompressd);
+        regs.ecx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressd) ||
+             memcmp(ptr, ptr + 8, 16 * 4) )
+            goto fail;
+        for ( i = 0; i < 4; ++i )
+            if ( ptr[24 + i] != 2 * i + 1 )
+                goto fail;
+        for ( ; i < 8; ++i )
+            if ( ptr[24 + i] != 2 * i )
+                goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandd 8*4(%edx),%zmm3{%k2}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm3, %%zmm3, %%zmm3\n"
+                       put_insn(vpexpandd,
+                                "vpexpandd 8*4(%0), %%zmm3%{%%k2%}%{z%}")
+                       :: "d" (NULL) );
+        set_insn(vpexpandd);
+        regs.edx = (unsigned long)(ptr + 16);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandd) )
+            goto fail;
+        asm ( "vmovdqa32 %%zmm1, %%zmm2%{%%k2%}%{z%}\n\t"
+              "vpcmpeqd %%zmm2, %%zmm3, %%k0\n\t"
+              "kmovw %%k0, %0"
+              : "=r" (rc) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpcompressq %zmm4,12*8(%edx){%k3}...");
+        asm volatile ( "kmovw %1, %%k3\n\t"
+                       "vmovdqu64 %2, %%zmm4\n"
+                       put_insn(vpcompressq,
+                                "vpcompressq %%zmm4, 12*8(%0)%{%%k3%}")
+                       :: "d" (NULL), "r" (0x5a), "m" (qsrc) );
+
+        memset(ptr, 0xdb, 16 * 8);
+        set_insn(vpcompressq);
+        regs.edx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressq) ||
+             memcmp(ptr, ptr + 8, 8 * 8) )
+            goto fail;
+        for ( i = 0; i < 2; ++i )
+        {
+            if ( ptr[(12 + i) * 2] != 2 * i + 1 ||
+                 ptr[(12 + i) * 2 + 1] )
+                goto fail;
+        }
+        for ( ; i < 4; ++i )
+        {
+            if ( ptr[(12 + i) * 2] != 2 * i ||
+                 ptr[(12 + i) * 2 + 1] )
+                goto fail;
+        }
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandq 4*8(%ecx),%zmm5{%k3}{z}...");
+        asm volatile ( "vpternlogq $0x81, %%zmm5, %%zmm5, %%zmm5\n"
+                       put_insn(vpexpandq,
+                                "vpexpandq 4*8(%0), %%zmm5%{%%k3%}%{z%}")
+                       :: "c" (NULL) );
+        set_insn(vpexpandq);
+        regs.ecx = (unsigned long)(ptr + 16);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandq) )
+            goto fail;
+        asm ( "vmovdqa64 %%zmm4, %%zmm6%{%%k3%}%{z%}\n\t"
+              "vpcmpeqq %%zmm5, %%zmm6, %%k0\n\t"
+              "kmovw %%k0, %0"
+              : "=r" (rc) );
+        if ( rc != 0xff )
+            goto fail;
+        printf("okay\n");
+    }
+
+#if __GNUC__ > 7 /* can't check for __AVX512VBMI2__ here */
+    if ( stack_exec && cpu_has_avx512_vbmi2 )
+    {
+        decl_insn(vpcompressb);
+        decl_insn(vpcompressw);
+        decl_insn(vpexpandb);
+        decl_insn(vpexpandw);
+        static const struct {
+            unsigned char b[64];
+        } bsrc = { { 0,  1,  2,  3,  4,  5,  6,  7,
+                     8,  9, 10, 11, 12, 13, 14, 15,
+                    16, 17, 18, 19, 20, 21, 22, 23,
+                    24, 25, 26, 27, 28, 29, 30, 31,
+                    32, 33, 34, 35, 36, 37, 38, 39,
+                    40, 41, 42, 43, 44, 45, 46, 47,
+                    48, 49, 50, 51, 52, 53, 54, 55,
+                    56, 57, 58, 59, 60, 61, 62, 63 } };
+        static const struct {
+            unsigned short w[32];
+        } wsrc = { { 0,  1,  2,  3,  4,  5,  6,  7,
+                     8,  9, 10, 11, 12, 13, 14, 15,
+                    16, 17, 18, 19, 20, 21, 22, 23,
+                    24, 25, 26, 27, 28, 29, 30, 31 } };
+        unsigned char *ptr = (void *)res + MMAP_SZ - 128;
+        unsigned long long w = 0x55555555aaaaaaaaULL;
+
+        printf("%-40s", "Testing vpcompressb %zmm1,96*1(%ecx){%k2}...");
+        asm volatile ( "kmovq %1, %%k2\n\t"
+                       "vmovdqu8 %2, %%zmm1\n"
+                       put_insn(vpcompressb,
+                                "vpcompressb %%zmm1, 96*1(%0)%{%%k2%}")
+                       :: "c" (NULL), "m" (w), "m" (bsrc) );
+
+        memset(ptr, 0xdb, 128 * 1);
+        set_insn(vpcompressb);
+        regs.ecx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressb) ||
+             memcmp(ptr, ptr + 32, 64 * 1) )
+            goto fail;
+        for ( i = 0; i < 16; ++i )
+            if ( ptr[96 + i] != 2 * i + 1 )
+                goto fail;
+        for ( ; i < 32; ++i )
+            if ( ptr[96 + i] != 2 * i )
+                goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandb 32*1(%edx),%zmm3{%k2}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm3, %%zmm3, %%zmm3\n"
+                       put_insn(vpexpandb,
+                                "vpexpandb 32*1(%0), %%zmm3%{%%k2%}%{z%}")
+                       :: "d" (NULL) );
+        set_insn(vpexpandb);
+        regs.edx = (unsigned long)(ptr + 64);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandb) )
+            goto fail;
+        asm ( "vmovdqu8 %%zmm1, %%zmm2%{%%k2%}%{z%}\n\t"
+              "vpcmpeqb %%zmm2, %%zmm3, %%k0\n\t"
+              "kmovq %%k0, %0"
+              : "=m" (w) );
+        if ( w != 0xffffffffffffffffULL )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpcompressw %zmm4,48*2(%edx){%k3}...");
+        asm volatile ( "kmovd %1, %%k3\n\t"
+                       "vmovdqu16 %2, %%zmm4\n"
+                       put_insn(vpcompressw,
+                                "vpcompressw %%zmm4, 48*2(%0)%{%%k3%}")
+                       :: "d" (NULL), "r" (0x5555aaaa), "m" (wsrc) );
+
+        memset(ptr, 0xdb, 64 * 2);
+        set_insn(vpcompressw);
+        regs.edx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressw) ||
+             memcmp(ptr, ptr + 32, 32 * 2) )
+            goto fail;
+        for ( i = 0; i < 8; ++i )
+        {
+            if ( ptr[(48 + i) * 2] != 2 * i + 1 ||
+                 ptr[(48 + i) * 2 + 1] )
+                goto fail;
+        }
+        for ( ; i < 16; ++i )
+        {
+            if ( ptr[(48 + i) * 2] != 2 * i ||
+                 ptr[(48 + i) * 2 + 1] )
+                goto fail;
+        }
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandw 16*2(%ecx),%zmm5{%k3}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm5, %%zmm5, %%zmm5\n"
+                       put_insn(vpexpandw,
+                                "vpexpandw 16*2(%0), %%zmm5%{%%k3%}%{z%}")
+                       :: "c" (NULL) );
+        set_insn(vpexpandw);
+        regs.ecx = (unsigned long)(ptr + 64);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandw) )
+            goto fail;
+        asm ( "vmovdqu16 %%zmm4, %%zmm6%{%%k3%}%{z%}\n\t"
+              "vpcmpeqw %%zmm5, %%zmm6, %%k0\n\t"
+              "kmovq %%k0, %0"
+              : "=m" (w) );
+        if ( w != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+#endif
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -59,6 +59,9 @@
     (type *)((char *)mptr__ - offsetof(type, member)); \
 })
 
+#define hweight32 __builtin_popcount
+#define hweight64 __builtin_popcountll
+
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
 extern uint32_t mxcsr_mask;
@@ -138,6 +141,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
+#define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -482,6 +482,8 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
+    [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -489,6 +491,10 @@ static const struct ext0f38_table {
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x88] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_dq },
+    [0x89] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_dq },
+    [0x8a] = { .simd_size = simd_packed_fp, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
+    [0x8b] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
@@ -1900,6 +1906,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
+#define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -8894,6 +8901,36 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x62): /* vpexpand{b,w} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x63): /* vpcompress{b,w} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        elem_bytes = 1 << evex.w;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x88): /* vexpandp{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x89): /* vpexpand{d,q} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8a): /* vcompressp{s,d} [xyz]mm,[xyz]mm/mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8b): /* vpcompress{d,q} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(false);
+        /*
+         * For the respective code below the main switch() to work we need to
+         * compact op_mask here: Memory accesses are non-sparse even if the
+         * mask register has sparsely set bits.
+         */
+        if ( likely(fault_suppression) )
+        {
+            n = 1 << ((b & 8 ? 2 : 4) + evex.lr - evex.w);
+            EXPECT(elem_bytes > 0);
+            ASSERT(op_bytes == n * elem_bytes);
+            op_mask &= ~0ULL >> (64 - n);
+            n = hweight64(op_mask);
+            op_bytes = n * elem_bytes;
+            if ( n )
+                op_mask = ~0ULL >> (64 - n);
+        }
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -110,6 +110,7 @@
 
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
+#define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
 /* CPUID level 0x80000007.edx */
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -228,6 +228,7 @@ XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
+XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
 
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -266,10 +266,10 @@ def crunch_numbers(state):
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
                   AVX512_VPOPCNTDQ],
 
-        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask
         # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_VBMI],
+        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors
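
As a self-contained sketch (not part of the patch) of the op_mask
compaction done in the emulator hunk above: a sparse opmask selects n
elements, but the compress/expand memory operand is a dense run of n
elements, so the mask used for the memory access becomes the low n bits
(hweight64 maps to __builtin_popcountll per the harness hunk further up).

#include <stdint.h>

uint64_t compact_op_mask(uint64_t op_mask, unsigned int n_elems)
{
    unsigned int n;

    op_mask &= ~0ULL >> (64 - n_elems); /* clip to the vector width */
    n = __builtin_popcountll(op_mask);  /* elements actually moved */

    return n ? ~0ULL >> (64 - n) : 0;   /* dense low-order mask */
}

With the vpcompressq test's mask 0x5a and 8 elements this yields 0xf,
i.e. four contiguous quad-words at the memory operand.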





* [PATCH v7 32/49] x86emul: support remaining misc AVX512{F, BW} insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (31 preceding siblings ...)
  2018-12-19 14:56   ` [PATCH v7 31/49] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
@ 2018-12-19 14:56   ` Jan Beulich
  2018-12-19 14:57   ` [PATCH v7 33/49] x86emul: support AVX512F gather insns Jan Beulich
                     ` (16 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512BW in the insn emulator, and leaves just
the scatter/gather insns open in the AVX512F set.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
---
TBD: The *blendm* inline functions don't reliably produce the intended
     insns, as the respective masked moves are just as good a fit for the
     compiler when it looks for a match for the intended operation. We'd
     need to switch to inline assembly if we wanted to guarantee that
     those insns actually get tested. Thoughts?
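
As an illustrative aside (not part of the patch): with the plain
intrinsics the ambiguity is easy to see, since both functions below
denote the same element-wise select, leaving the compiler free to emit
either vpblendmd or a masked vmovdqa32 for either of them.

#include <immintrin.h>

/* result[i] = k[i] ? b[i] : a[i] in both cases. */
__m512i select_via_blend(__m512i a, __m512i b, __mmask16 k)
{
    return _mm512_mask_blend_epi32(k, a, b);
}

__m512i select_via_mov(__m512i a, __m512i b, __mmask16 k)
{
    return _mm512_mask_mov_epi32(a, k, b);
}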

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -105,6 +105,8 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN(align,        66, 0f3a, 03,    vl,     dq, vl),
+    INSN(blendm,       66, 0f38, 65,    vl,     sd, vl),
     INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
@@ -207,6 +209,7 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(pblendm,      66, 0f38, 64,    vl,     dq, vl),
 //       pbroadcast,   66, 0f38, 7c,          dq64
     INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
     INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
@@ -354,6 +357,7 @@ static const struct test avx512f_512[] =
 };
 
 static const struct test avx512bw_all[] = {
+    INSN(dbpsadbw,    66, 0f3a, 42,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
@@ -373,6 +377,7 @@ static const struct test avx512bw_all[]
     INSN(palignr,     66, 0f3a, 0f,    vl,    b, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pblendm,     66, 0f38, 66,    vl,   bw, vl),
     INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
 //       pbroadcastb, 66, 0f38, 7a,           b
     INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -297,7 +297,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  endif
-#  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define mix(x, y) B(blendmps_, _mask, x, y, (0b1010101010101010 & ALL_TRUE))
 #  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
@@ -370,7 +370,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  endif
-#  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define mix(x, y) B(blendmpd_, _mask, x, y, 0b10101010)
 #  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
@@ -564,8 +564,9 @@ static inline vec_t movlhps(vec_t x, vec
                              0b00011011, (vsi_t)undef(), ~0))
 #   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
-                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define mix(x, y) ((vec_t)B(blendmd_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b1010101010101010 & ((1 << ELEM_COUNT) - 1))))
+#  define rotr(x, n) ((vec_t)B(alignd, _mask, (vsi_t)(x), (vsi_t)(x), n, (vsi_t)undef(), ~0))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
@@ -602,7 +603,8 @@ static inline vec_t movlhps(vec_t x, vec
                              0b01001110, (vsi_t)undef(), ~0))
 #   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  define mix(x, y) ((vec_t)B(blendmq_, _mask, (vdi_t)(x), (vdi_t)(y), 0b10101010))
+#  define rotr(x, n) ((vec_t)B(alignq, _mask, (vdi_t)(x), (vdi_t)(x), n, (vdi_t)undef(), ~0))
 #  if VEC_SIZE == 32
 #   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
 #  elif VEC_SIZE == 64
@@ -654,8 +656,8 @@ static inline vec_t movlhps(vec_t x, vec
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
-                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define mix(x, y) ((vec_t)B(blendmb_, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b1010101010101010101010101010101010101010101010101010101010101010LL & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
@@ -687,8 +689,8 @@ static inline vec_t movlhps(vec_t x, vec
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
-                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define mix(x, y) ((vec_t)B(blendmw_, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b10101010101010101010101010101010 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
 #  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -484,6 +484,7 @@ static const struct ext0f38_table {
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
     [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
+    [0x64 ... 0x66] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -550,6 +551,7 @@ static const struct ext0f3a_table {
     [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
+    [0x03] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -581,8 +583,7 @@ static const struct ext0f3a_table {
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
-    [0x42] = { .simd_size = simd_packed_int },
-    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x42 ... 0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6195,6 +6196,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x64): /* vpblendm{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x65): /* vblendmp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
@@ -6950,6 +6953,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x0b): /* vpmulhrsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x66): /* vpblendm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = 1 << (b & 1);
@@ -8119,10 +8123,12 @@ x86_emulate(
         goto simd_0f_to_gpr;
 
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        fault_suppression = false;
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x03): /* valign{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_imm8_no_sae:
         host_and_vcpu_must_have(avx512f);
@@ -9460,6 +9466,9 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 4;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x42): /* vdbpsadbw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0f): /* vpalignr $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         goto avx512bw_imm;





* [PATCH v7 33/49] x86emul: support AVX512F gather insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (32 preceding siblings ...)
  2018-12-19 14:56   ` [PATCH v7 32/49] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
@ 2018-12-19 14:57   ` Jan Beulich
  2018-12-19 14:57   ` [PATCH v7 34/49] x86emul: add high register S/G test cases Jan Beulich
                     ` (15 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:57 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This requires getting modrm_reg and sib_index set correctly in the EVEX
case, to account for the high 16 [XYZ]MM registers. Extend the
adjustments to modrm_rm as well, such that x86_insn_modrm() correctly
reports register numbers (this was only a latent issue, as we don't
currently have any callers of that function which would care about the
EVEX case). The adjustment in turn requires dropping the assertion from
decode_gpr(), as we now need to actively mask off the high bit when a
GPR is meant. _decode_gpr() invocations also need slight adjustments
when invoked in generic code ahead of the main switch(). All other uses
of modrm_reg and modrm_rm already get suitably masked where necessary.
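
For illustration only (a sketch with hypothetical names, not the
patch's code): the 5-bit register number is assembled from the ModR/M
reg field plus the EVEX extension bits, which are stored inverted in
the encoding.

/* Register numbers 0...31, given the raw ModR/M byte and the two
 * (inverted) EVEX bits extending the reg field. */
unsigned int evex_modrm_reg(unsigned int modrm, unsigned int evex_r,
                            unsigned int evex_R)
{
    unsigned int reg = (modrm >> 3) & 7;

    reg |= (!evex_r) << 3; /* EVEX.R */
    reg |= (!evex_R) << 4; /* EVEX.R' */

    return reg;
}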

There was also an encoding mistake in the EVEX Disp8 test code, which
was benign for all non-vSIB tests (due to %rdx getting set to zero): it
mistakenly encoded <disp8>(%rdx,%rdx) instead of <disp8>(%rdx,%riz). In
the vSIB case this meant <disp8>(%rdx,%zmm2) instead of the intended
<disp8>(%rdx,%zmm4).

Likewise the access count check wasn't entirely correct for the S/G
case: In the quad-word-index but dword-data case only half the number
of full-vector elements gets accessed (e.g. for 512-bit operands an
8-element qword index vector drives just 8 of the 16 possible dword
accesses).

As an unrelated change, distinguish the "n/a" messages in the main test
harness source file by bitness.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix ByteOp register decode. Re-base.
v6: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -21,7 +21,7 @@ CFLAGS += $(CFLAGS_xeninclude)
 
 SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
 FMA := fma4 fma
-SG := avx2-sg
+SG := avx2-sg avx512f-sg avx512vl-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
 
 OPMASK := avx512f avx512dq avx512bw
@@ -69,6 +69,14 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512f-sg-vecs := 64
+avx512f-sg-idxs := 4 8
+avx512f-sg-ints := $(avx512f-ints)
+avx512f-sg-flts := $(avx512f-flts)
+avx512vl-sg-vecs := 16 32
+avx512vl-sg-idxs := $(avx512f-sg-idxs)
+avx512vl-sg-ints := $(avx512f-ints)
+avx512vl-sg-flts := $(avx512f-flts)
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -176,6 +176,8 @@ static const struct test avx512f_all[] =
     INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
     INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
     INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
+    INSN(gatherd,      66, 0f38, 92,    vl,     sd, el),
+    INSN(gatherq,      66, 0f38, 93,    vl,     sd, el),
     INSN(getexp,       66, 0f38, 42,    vl,     sd, vl),
     INSN(getexp,       66, 0f38, 43,    el,     sd, el),
     INSN(getmant,      66, 0f3a, 26,    vl,     sd, vl),
@@ -229,6 +231,8 @@ static const struct test avx512f_all[] =
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pexpand,      66, 0f38, 89,    vl,     dq, el),
+    INSN(pgatherd,     66, 0f38, 90,    vl,     dq, el),
+    INSN(pgatherq,     66, 0f38, 91,    vl,     dq, el),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -698,7 +702,7 @@ static void test_one(const struct test *
     instr[3] = evex.raw[2];
     instr[4] = test->opc;
     instr[5] = 0x44 | (test->ext << 3); /* ModR/M */
-    instr[6] = 0x12; /* SIB: base rDX, index none / xMM4 */
+    instr[6] = 0x22; /* SIB: base rDX, index none / xMM4 */
     instr[7] = 1; /* Disp8 */
     instr[8] = 0; /* immediate, if any */
 
@@ -718,7 +722,8 @@ static void test_one(const struct test *
          if ( accessed[i] )
              goto fail;
     for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
-         if ( accessed[i] != (sg ? vsz / esz : 1) )
+         if ( accessed[i] != (sg ? (vsz / esz) >> (test->opc & 1 & !evex.w)
+                                 : 1) )
              goto fail;
     for ( ; i < ARRAY_SIZE(accessed); ++i )
          if ( accessed[i] )
--- a/tools/tests/x86_emulator/simd-sg.c
+++ b/tools/tests/x86_emulator/simd-sg.c
@@ -35,13 +35,78 @@ typedef long long __attribute__((vector_
 #define ITEM_COUNT (VEC_SIZE / ELEM_SIZE < IVEC_SIZE / IDX_SIZE ? \
                     VEC_SIZE / ELEM_SIZE : IVEC_SIZE / IDX_SIZE)
 
-#if VEC_SIZE == 16
-# define to_bool(cmp) __builtin_ia32_ptestc128(cmp, (vec_t){} == 0)
-#else
-# define to_bool(cmp) __builtin_ia32_ptestc256(cmp, (vec_t){} == 0)
-#endif
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if ELEM_SIZE == 4
+#  if IDX_SIZE == 4 || defined(__AVX512VL__)
+#   define to_mask(msk) B(ptestmd, , (vsi_t)(msk), (vsi_t)(msk), ~0)
+#   define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+#  else
+#   define widen(x) __builtin_ia32_pmovzxdq512_mask((vsi_t)(x), (idi_t){}, ~0)
+#   define to_mask(msk) __builtin_ia32_ptestmq512(widen(msk), widen(msk), ~0)
+#   define eq(x, y) (__builtin_ia32_pcmpeqq512_mask(widen(x), widen(y), ~0) == ALL_TRUE)
+#  endif
+#  define BG_(dt, it, reg, mem, idx, msk, scl) \
+    __builtin_ia32_gather##it##dt(reg, mem, idx, to_mask(msk), scl)
+# else
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+#  define BG_(dt, it, reg, mem, idx, msk, scl) \
+    __builtin_ia32_gather##it##dt(reg, mem, idx, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), scl)
+# endif
+/*
+ * Instead of replicating the main IDX_SIZE conditional below three times, use
+ * a double layer of macro invocations, allowing for substitution of the
+ * respective relevant macro argument tokens.
+ */
+# define BG(dt, it, reg, mem, idx, msk, scl) BG_(dt, it, reg, mem, idx, msk, scl)
+# if VEC_MAX < 64
+/*
+ * The sub-512-bit built-ins have an extra "3" infix, presumably because the
+ * 512-bit names were chosen without the AVX512VL extension in mind (and hence
+ * making the latter collide with the AVX2 ones).
+ */
+#  define si 3si
+#  define di 3di
+# endif
+# if VEC_MAX == 16
+#  define v8df v2df
+#  define v8di v2di
+#  define v16sf v4sf
+#  define v16si v4si
+# elif VEC_MAX == 32
+#  define v8df v4df
+#  define v8di v4di
+#  define v16sf v8sf
+#  define v16si v8si
+# endif
+# if IDX_SIZE == 4
+#  if INT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16si, si, reg, mem, idx, msk, scl)
+#  elif INT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, si, (vdi_t)(reg), mem, idx, msk, scl))
+#  elif FLOAT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16sf, si, reg, mem, idx, msk, scl)
+#  elif FLOAT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) BG(v8df, si, reg, mem, idx, msk, scl)
+#  endif
+# elif IDX_SIZE == 8
+#  if INT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16si, di, reg, mem, (idi_t)(idx), msk, scl)
+#  elif INT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, di, (vdi_t)(reg), mem, (idi_t)(idx), msk, scl))
+#  elif FLOAT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16sf, di, reg, mem, (idi_t)(idx), msk, scl)
+#  elif FLOAT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) BG(v8df, di, reg, mem, (idi_t)(idx), msk, scl)
+#  endif
+# endif
+#elif defined(__AVX2__)
+# if VEC_SIZE == 16
+#  define to_bool(cmp) __builtin_ia32_ptestc128(cmp, (vec_t){} == 0)
+# else
+#  define to_bool(cmp) __builtin_ia32_ptestc256(cmp, (vec_t){} == 0)
+# endif
 
-#if defined(__AVX2__)
 # if VEC_MAX == 16
 #  if IDX_SIZE == 4
 #   if INT_SIZE == 4
@@ -111,6 +176,10 @@ typedef long long __attribute__((vector_
 # endif
 #endif
 
+#ifndef eq
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
 #define GLUE_(x, y) x ## y
 #define GLUE(x, y) GLUE_(x, y)
 
@@ -119,6 +188,7 @@ typedef long long __attribute__((vector_
 #define PUT8(n)  PUT4(n),   PUT4((n) +  4)
 #define PUT16(n) PUT8(n),   PUT8((n) +  8)
 #define PUT32(n) PUT16(n), PUT16((n) + 16)
+#define PUT64(n) PUT32(n), PUT32((n) + 32)
 
 const typeof((vec_t){}[0]) array[] = {
     GLUE(PUT, VEC_MAX)(1),
@@ -174,7 +244,7 @@ int sg_test(void)
 
     y = gather(full, array + ITEM_COUNT, -idx, full, ELEM_SIZE);
 #if ITEM_COUNT == ELEM_COUNT
-    if ( !to_bool(y == x - 1) )
+    if ( !eq(y, x - 1) )
         return __LINE__;
 #else
     for ( i = 0; i < ITEM_COUNT; ++i )
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,8 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512f-sg.h"
+#include "avx512vl-sg.h"
 #include "avx512bw.h"
 #include "avx512dq.h"
 #include "avx512er.h"
@@ -90,11 +92,13 @@ static bool simd_check_avx512f(void)
     return cpu_has_avx512f;
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
+#define simd_check_avx512f_sg simd_check_avx512f
 
 static bool simd_check_avx512f_vl(void)
 {
     return cpu_has_avx512f && cpu_has_avx512vl;
 }
+#define simd_check_avx512vl_sg simd_check_avx512f_vl
 
 static bool simd_check_avx512dq(void)
 {
@@ -291,6 +295,14 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    SIMD(AVX512F S/G f32[16x32], avx512f_sg, 64x4f4),
+    SIMD(AVX512F S/G f64[ 8x32], avx512f_sg, 64x4f8),
+    SIMD(AVX512F S/G f32[ 8x64], avx512f_sg, 64x8f4),
+    SIMD(AVX512F S/G f64[ 8x64], avx512f_sg, 64x8f8),
+    SIMD(AVX512F S/G i32[16x32], avx512f_sg, 64x4i4),
+    SIMD(AVX512F S/G i64[ 8x32], avx512f_sg, 64x4i8),
+    SIMD(AVX512F S/G i32[ 8x64], avx512f_sg, 64x8i4),
+    SIMD(AVX512F S/G i64[ 8x64], avx512f_sg, 64x8i8),
     AVX512VL(VL f32x4,        avx512f,      16f4),
     AVX512VL(VL f64x2,        avx512f,      16f8),
     AVX512VL(VL f32x8,        avx512f,      32f4),
@@ -303,6 +315,22 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512VL S/G f32[4x32], avx512vl_sg, 16x4f4),
+    SIMD(AVX512VL S/G f64[2x32], avx512vl_sg, 16x4f8),
+    SIMD(AVX512VL S/G f32[2x64], avx512vl_sg, 16x8f4),
+    SIMD(AVX512VL S/G f64[2x64], avx512vl_sg, 16x8f8),
+    SIMD(AVX512VL S/G f32[8x32], avx512vl_sg, 32x4f4),
+    SIMD(AVX512VL S/G f64[4x32], avx512vl_sg, 32x4f8),
+    SIMD(AVX512VL S/G f32[4x64], avx512vl_sg, 32x8f4),
+    SIMD(AVX512VL S/G f64[4x64], avx512vl_sg, 32x8f8),
+    SIMD(AVX512VL S/G i32[4x32], avx512vl_sg, 16x4i4),
+    SIMD(AVX512VL S/G i64[2x32], avx512vl_sg, 16x4i8),
+    SIMD(AVX512VL S/G i32[2x64], avx512vl_sg, 16x8i4),
+    SIMD(AVX512VL S/G i64[2x64], avx512vl_sg, 16x8i8),
+    SIMD(AVX512VL S/G i32[8x32], avx512vl_sg, 32x4i4),
+    SIMD(AVX512VL S/G i64[4x32], avx512vl_sg, 32x4i8),
+    SIMD(AVX512VL S/G i32[4x64], avx512vl_sg, 32x8i4),
+    SIMD(AVX512VL S/G i64[4x64], avx512vl_sg, 32x8i8),
     SIMD(AVX512BW s8x64,     avx512bw,      64i1),
     SIMD(AVX512BW u8x64,     avx512bw,      64u1),
     SIMD(AVX512BW s16x32,    avx512bw,      64i2),
@@ -4260,7 +4288,7 @@ int main(int argc, char **argv)
 
         if ( !blobs[j].size )
         {
-            printf("%-39s n/a\n", blobs[j].name);
+            printf("%-39s n/a (%u-bit)\n", blobs[j].name, blobs[j].bitness);
             continue;
         }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -499,7 +499,7 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
-    [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
+    [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x9a] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -3054,7 +3054,8 @@ x86_decode(
 
         d &= ~ModRM;
 #undef ModRM /* Only its aliases are valid to use from here on. */
-        modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3);
+        modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3) |
+                    ((evex_encoded() && !evex.R) << 4);
         modrm_rm  = modrm & 0x07;
 
         /*
@@ -3224,7 +3225,8 @@ x86_decode(
         if ( modrm_mod == 3 )
         {
             generate_exception_if(d & vSIB, EXC_UD);
-            modrm_rm |= (rex_prefix & 1) << 3;
+            modrm_rm |= ((rex_prefix & 1) << 3) |
+                        (evex_encoded() && !evex.x) << 4;
             ea.type = OP_REG;
         }
         else if ( ad_bytes == 2 )
@@ -3289,7 +3291,10 @@ x86_decode(
 
                 state->sib_index = ((sib >> 3) & 7) | ((rex_prefix << 2) & 8);
                 state->sib_scale = (sib >> 6) & 3;
-                if ( state->sib_index != 4 && !(d & vSIB) )
+                if ( unlikely(d & vSIB) )
+                    state->sib_index |= (mode_64bit() && evex_encoded() &&
+                                         !evex.RX) << 4;
+                else if ( state->sib_index != 4 )
                 {
                     ea.mem.off = *decode_gpr(state->regs, state->sib_index);
                     ea.mem.off <<= state->sib_scale;
@@ -3592,7 +3597,7 @@ x86_emulate(
     generate_exception_if(state->not_64bit && mode_64bit(), EXC_UD);
 
     if ( ea.type == OP_REG )
-        ea.reg = _decode_gpr(&_regs, modrm_rm, (d & ByteOp) && !rex_prefix);
+        ea.reg = _decode_gpr(&_regs, modrm_rm, (d & ByteOp) && !rex_prefix && !vex.opcx);
 
     memset(mmvalp, 0xaa /* arbitrary */, sizeof(*mmvalp));
 
@@ -3606,7 +3611,7 @@ x86_emulate(
         src.type = OP_REG;
         if ( d & ByteOp )
         {
-            src.reg = _decode_gpr(&_regs, modrm_reg, !rex_prefix);
+            src.reg = _decode_gpr(&_regs, modrm_reg, !rex_prefix && !vex.opcx);
             src.val = *(uint8_t *)src.reg;
             src.bytes = 1;
         }
@@ -3704,7 +3709,7 @@ x86_emulate(
         dst.type = OP_REG;
         if ( d & ByteOp )
         {
-            dst.reg = _decode_gpr(&_regs, modrm_reg, !rex_prefix);
+            dst.reg = _decode_gpr(&_regs, modrm_reg, !rex_prefix && !vex.opcx);
             dst.val = *(uint8_t *)dst.reg;
             dst.bytes = 1;
         }
@@ -9108,6 +9113,130 @@ x86_emulate(
         put_stub(stub);
 
         state->simd_size = simd_none;
+        break;
+    }
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x90): /* vpgatherd{d,q} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x91): /* vpgatherq{d,q} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x92): /* vgatherdp{s,d} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x93): /* vgatherqp{s,d} mem,[xyz]mm{k} */
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+        bool done = false;
+
+        ASSERT(ea.type == OP_MEM);
+        generate_exception_if((!evex.opmsk || evex.brs || evex.z ||
+                               evex.reg != 0xf ||
+                               modrm_reg == state->sib_index),
+                              EXC_UD);
+        avx512_vlen_check(false);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read destination and index registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x7f; /* vmovdqa{32,64} */
+        /*
+         * The register writeback below has to retain masked-off elements, but
+         * needs to clear upper portions in the index-wider-than-data cases.
+         * Therefore read (and write below) the full register. The alternative
+         * would have been to fiddle with the mask register used.
+         */
+        pevex->opmsk = 0;
+        /* Use (%rax) as destination and modrm_reg as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
+
+        pevex->pfx = vex_f3; /* vmovdqu{32,64} */
+        pevex->w = b & 1;
+        /* Switch to sib_index as source. */
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        opc[1] = (state->sib_index & 7) << 3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the destination and mask values. */
+        n = 1 << (2 + evex.lr - ((b & 1) | evex.w));
+        op_bytes = 4 << evex.w;
+        memset((void *)mmvalp + n * op_bytes, 0, 64 - n * op_bytes);
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = ops->read(ea.mem.seg,
+                           truncate_ea(ea.mem.off + (idx << state->sib_scale)),
+                           (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
+            if ( rc != X86EMUL_OKAY )
+            {
+                /*
+                 * If we've made some progress and the access did not fault,
+                 * force a retry instead. This is for example necessary to
+                 * cope with the limited capacity of HVM's MMIO cache.
+                 */
+                if ( rc != X86EMUL_EXCEPTION && done )
+                    rc = X86EMUL_RETRY;
+                break;
+            }
+
+            op_mask &= ~(1 << i);
+            done = true;
+
+#ifdef __XEN__
+            if ( op_mask && local_events_need_delivery() )
+            {
+                rc = X86EMUL_RETRY;
+                break;
+            }
+#endif
+        }
+
+        /* Write destination and mask registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x6f; /* vmovdqa{32,64} */
+        pevex->opmsk = 0;
+        /* Use modrm_reg as destination and (%rax) as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
+
+        /*
+         * kmovw: This is VEX-encoded, so we can't use pevex. Avoid copy_VEX() etc
+         * as well, since we can easily use the 2-byte VEX form here.
+         */
+        opc -= EVEX_PFX_BYTES;
+        opc[0] = 0xc5;
+        opc[1] = 0xf8;
+        opc[2] = 0x90;
+        /* Use (%rax) as source. */
+        opc[3] = evex.opmsk << 3;
+        opc[4] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+        put_stub(stub);
+
+        state->simd_size = simd_none;
         break;
     }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -656,9 +656,6 @@ static inline unsigned long *decode_gpr(
     BUILD_BUG_ON(ARRAY_SIZE(cpu_user_regs_gpr_offsets) &
                  (ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1));
 
-    ASSERT(modrm < ARRAY_SIZE(cpu_user_regs_gpr_offsets));
-
-    /* For safety in release builds.  Debug builds will hit the ASSERT() */
     modrm &= ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1;
 
     return (void *)regs + cpu_user_regs_gpr_offsets[modrm];





* [PATCH v7 34/49] x86emul: add high register S/G test cases
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (33 preceding siblings ...)
  2018-12-19 14:57   ` [PATCH v7 33/49] x86emul: support AVX512F gather insns Jan Beulich
@ 2018-12-19 14:57   ` Jan Beulich
  2018-12-19 14:58   ` [PATCH v7 35/49] x86emul: support AVX512F scatter insns Jan Beulich
                     ` (14 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:57 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

In order to verify that in particular the index register decoding works
correctly in the S/G emulation paths, add dedicated (64-bit only) cases
preventing the compiler from using the lower registers: the "higher"
variant pins %zmm1-%zmm7 (forcing %zmm8 and up to be used), while
"highest" additionally pins %zmm8-%zmm15 (forcing the EVEX-only %zmm16
and up). Unlike in the generic SIMD case, where occasional uses of %xmm
or %ymm registers in generated code cause various internal compiler
errors when use of all of the lower 16 registers is disallowed
(apparently due to insn templates trying to use AVX2 encodings), doing
so here in the AVX512F case looks to be fine.

While the main goal here is the AVX512F case, add an AVX2 variant as
well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -150,6 +150,12 @@ $(foreach flavor,$(SIMD) $(FMA),$(eval $
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
+first-string = $(shell for s in $(1); do echo "$$s"; break; done)
+
+avx2-sg-cflags-x86_64    := "-D_high $(foreach n,7 6 5 4 3 2 1,-ffixed-ymm$(n)) $(call first-string,$(avx2-sg-cflags))"
+avx512f-sg-cflags-x86_64 := "-D_higher $(foreach n,7 6 5 4 3 2 1,-ffixed-zmm$(n)) $(call first-string,$(avx512f-sg-cflags))"
+avx512f-sg-cflags-x86_64 += "-D_highest $(foreach n,15 14 13 12 11 10 9 8,-ffixed-zmm$(n)) $(call first-string,$(avx512f-sg-cflags-x86_64))"
+
 $(addsuffix .h,$(TESTCASES)): %.h: %.c testcase.mk Makefile
 	rm -f $@.new $*.bin
 	$(foreach arch,$(filter-out $(XEN_COMPILE_ARCH),x86_32) $(XEN_COMPILE_ARCH), \
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -266,6 +266,9 @@ static const struct {
     SIMD(AVX2 S/G i64[4x32],  avx2_sg,    32x4i8),
     SIMD(AVX2 S/G i32[4x64],  avx2_sg,    32x8i4),
     SIMD(AVX2 S/G i64[4x64],  avx2_sg,    32x8i8),
+#ifdef __x86_64__
+    SIMD_(64, AVX2 S/G %ymm8+, avx2_sg,     high),
+#endif
     SIMD(XOP 128bit single,       xop,      16f4),
     SIMD(XOP 256bit single,       xop,      32f4),
     SIMD(XOP 128bit double,       xop,      16f8),
@@ -303,6 +306,10 @@ static const struct {
     SIMD(AVX512F S/G i64[ 8x32], avx512f_sg, 64x4i8),
     SIMD(AVX512F S/G i32[ 8x64], avx512f_sg, 64x8i4),
     SIMD(AVX512F S/G i64[ 8x64], avx512f_sg, 64x8i8),
+#ifdef __x86_64__
+    SIMD_(64, AVX512F S/G %zmm8+, avx512f_sg, higher),
+    SIMD_(64, AVX512F S/G %zmm16+, avx512f_sg, highest),
+#endif
     AVX512VL(VL f32x4,        avx512f,      16f4),
     AVX512VL(VL f64x2,        avx512f,      16f8),
     AVX512VL(VL f32x8,        avx512f,      32f4),






* [PATCH v7 35/49] x86emul: support AVX512F scatter insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (34 preceding siblings ...)
  2018-12-19 14:57   ` [PATCH v7 34/49] x86emul: add high register S/G test cases Jan Beulich
@ 2018-12-19 14:58   ` Jan Beulich
  2018-12-19 14:59   ` [PATCH v7 36/49] x86emul: support AVX512PF insns Jan Beulich
                     ` (13 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:58 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512F in the insn emulator.

Note that in the test harness a little bit of trickery is needed to get
around the not fully consistent naming of the AVX512VL gather and
scatter built-ins. To suppress expansion of the "di" and "si" tokens,
they get constructed by token concatenation in BS(), unlike in BG().
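
To illustrate the trick (made-up macro names, not the harness's): an
argument used only as a ##-operand isn't macro-expanded, which is what
keeps the "3" infix out of the scatter built-in names.

#define si 3si                           /* sub-512-bit gather infix */
#define CAT_(a, b) a ## b                /* paste; operands not expanded */
#define GATHER(it)  CAT_(gather, it)     /* "it" expands first */
#define SCATTER(it) CAT_(scatter, it##i) /* pasted, hence not expanded */

GATHER(si)  /* -> gather3si */
SCATTER(s)  /* -> scattersi */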

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: I couldn't really decide whether to duplicate code or merge scatter
     into gather emulation.
---
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -270,6 +270,8 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pscatterd,    66, 0f38, a0,    vl,     dq, el),
+    INSN(pscatterq,    66, 0f38, a1,    vl,     dq, el),
     INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
@@ -305,6 +307,8 @@ static const struct test avx512f_all[] =
     INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
     INSN(scalef,       66, 0f38, 2c,    vl,     sd, vl),
     INSN(scalef,       66, 0f38, 2d,    el,     sd, el),
+    INSN(scatterd,     66, 0f38, a2,    vl,     sd, el),
+    INSN(scatterq,     66, 0f38, a3,    vl,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/tools/tests/x86_emulator/simd-sg.c
+++ b/tools/tests/x86_emulator/simd-sg.c
@@ -48,10 +48,14 @@ typedef long long __attribute__((vector_
 #  endif
 #  define BG_(dt, it, reg, mem, idx, msk, scl) \
     __builtin_ia32_gather##it##dt(reg, mem, idx, to_mask(msk), scl)
+#  define BS_(dt, it, mem, idx, reg, msk, scl) \
+    __builtin_ia32_scatter##it##dt(mem, to_mask(msk), idx, reg, scl)
 # else
 #  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
 #  define BG_(dt, it, reg, mem, idx, msk, scl) \
     __builtin_ia32_gather##it##dt(reg, mem, idx, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), scl)
+#  define BS_(dt, it, mem, idx, reg, msk, scl) \
+    __builtin_ia32_scatter##it##dt(mem, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), idx, reg, scl)
 # endif
 /*
  * Instead of replicating the main IDX_SIZE conditional below three times, use
@@ -59,6 +63,7 @@ typedef long long __attribute__((vector_
  * respective relevant macro argument tokens.
  */
 # define BG(dt, it, reg, mem, idx, msk, scl) BG_(dt, it, reg, mem, idx, msk, scl)
+# define BS(dt, it, mem, idx, reg, msk, scl) BS_(dt, it##i, mem, idx, reg, msk, scl)
 # if VEC_MAX < 64
 /*
  * The sub-512-bit built-ins have an extra "3" infix, presumably because the
@@ -82,22 +87,30 @@ typedef long long __attribute__((vector_
 # if IDX_SIZE == 4
 #  if INT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16si, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16si, s, mem, idx, reg, msk, scl)
 #  elif INT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, si, (vdi_t)(reg), mem, idx, msk, scl))
+#   define scatter(mem, idx, reg, msk, scl) BS(v8di, s, mem, idx, (vdi_t)(reg), msk, scl)
 #  elif FLOAT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16sf, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16sf, s, mem, idx, reg, msk, scl)
 #  elif FLOAT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) BG(v8df, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v8df, s, mem, idx, reg, msk, scl)
 #  endif
 # elif IDX_SIZE == 8
 #  if INT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16si, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16si, d, mem, (idi_t)(idx), reg, msk, scl)
 #  elif INT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, di, (vdi_t)(reg), mem, (idi_t)(idx), msk, scl))
+#   define scatter(mem, idx, reg, msk, scl) BS(v8di, d, mem, (idi_t)(idx), (vdi_t)(reg), msk, scl)
 #  elif FLOAT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16sf, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16sf, d, mem, (idi_t)(idx), reg, msk, scl)
 #  elif FLOAT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) BG(v8df, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v8df, d, mem, (idi_t)(idx), reg, msk, scl)
 #  endif
 # endif
 #elif defined(__AVX2__)
@@ -195,6 +208,8 @@ const typeof((vec_t){}[0]) array[] = {
     GLUE(PUT, VEC_MAX)(VEC_MAX + 1)
 };
 
+typeof((vec_t){}[0]) out[VEC_MAX * 2];
+
 int sg_test(void)
 {
     unsigned int i;
@@ -275,5 +290,41 @@ int sg_test(void)
 # endif
 #endif
 
+#ifdef scatter
+
+    for ( i = 0; i < sizeof(out) / sizeof(*out); ++i )
+        out[i] = 0;
+
+    for ( i = 0; i < ITEM_COUNT; ++i )
+        x[i] = i + 1;
+
+    touch(x);
+
+    scatter(out, (idx_t){}, x, (vec_t){ 1 } != 0, 1);
+    if ( out[0] != 1 )
+        return __LINE__;
+    for ( i = 1; i < ITEM_COUNT; ++i )
+        if ( out[i] )
+            return __LINE__;
+
+    scatter(out, (idx_t){}, x, full, 1);
+    if ( out[0] != ITEM_COUNT )
+        return __LINE__;
+    for ( i = 1; i < ITEM_COUNT; ++i )
+        if ( out[i] )
+            return __LINE__;
+
+    scatter(out, idx, x, full, ELEM_SIZE);
+    for ( i = 1; i <= ITEM_COUNT; ++i )
+        if ( out[i] != i )
+            return __LINE__;
+
+    scatter(out, inv, x, full, ELEM_SIZE);
+    for ( i = 1; i <= ITEM_COUNT; ++i )
+        if ( out[i] != ITEM_COUNT + 1 - i )
+            return __LINE__;
+
+#endif
+
     return 0;
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -508,6 +508,7 @@ static const struct ext0f38_table {
     [0x9d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x9e] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x9f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xa0 ... 0xa3] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xa9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xaa] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -9319,6 +9320,102 @@ x86_emulate(
             avx512_vlen_check(true);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa0): /* vpscatterd{d,q} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa1): /* vpscatterq{d,q} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa2): /* vscatterdp{s,d} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa3): /* vscatterqp{s,d} [xyz]mm,mem{k} */
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+        bool done = false;
+
+        ASSERT(ea.type == OP_MEM);
+        fail_if(!ops->write);
+        generate_exception_if((!evex.opmsk || evex.brs || evex.z ||
+                               evex.reg != 0xf ||
+                               modrm_reg == state->sib_index),
+                              EXC_UD);
+        avx512_vlen_check(false);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read source and index registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x7f; /* vmovdqa{32,64} */
+        /* Use (%rax) as destination and modrm_reg as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
+
+        pevex->pfx = vex_f3; /* vmovdqu{32,64} */
+        pevex->w = b & 1;
+        /* Switch to sib_index as source. */
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        opc[1] = (state->sib_index & 7) << 3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the mask value. */
+        n = 1 << (2 + evex.lr - ((b & 1) | evex.w));
+        op_bytes = 4 << evex.w;
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = ops->write(ea.mem.seg,
+                            truncate_ea(ea.mem.off + (idx << state->sib_scale)),
+                            (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
+            if ( rc != X86EMUL_OKAY )
+            {
+                /* See comment in gather emulation. */
+                if ( rc != X86EMUL_EXCEPTION && done )
+                    rc = X86EMUL_RETRY;
+                break;
+            }
+
+            op_mask &= ~(1 << i);
+            done = true;
+
+#ifdef __XEN__
+            if ( op_mask && local_events_need_delivery() )
+            {
+                rc = X86EMUL_RETRY;
+                break;
+            }
+#endif
+        }
+
+        /* Write mask register. See comment in gather emulation. */
+        opc = get_stub(stub);
+        opc[0] = 0xc5;
+        opc[1] = 0xf8;
+        opc[2] = 0x90;
+        /* Use (%rax) as source. */
+        opc[3] = evex.opmsk << 3;
+        opc[4] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+        put_stub(stub);
+
+        state->simd_size = simd_none;
+        break;
+    }
+
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xc9):     /* sha1msg1 xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xca):     /* sha1msg2 xmm/m128,xmm */
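
An aside on a property the all-zero-index scatter tests in the
simd-sg.c hunk above rely on (a sketch, not part of the patch):
overlapping scatter stores are performed in ascending element order,
so the highest-numbered element wins. In scalar form:

/* With x[i] = i + 1 this leaves ITEM_COUNT in out[0], matching the
 * test's expectation. */
unsigned int scatter_overlap(unsigned int *out, const unsigned int *x,
                             unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
        out[0] = x[i]; /* every index is zero */

    return out[0];     /* == x[n - 1] */
}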





* [PATCH v7 36/49] x86emul: support AVX512PF insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (35 preceding siblings ...)
  2018-12-19 14:58   ` [PATCH v7 35/49] x86emul: support AVX512F scatter insns Jan Beulich
@ 2018-12-19 14:59   ` Jan Beulich
  2018-12-19 14:59   ` [PATCH v7 37/49] x86emul: support AVX512CD insns Jan Beulich
                     ` (12 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:59 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Some adjustments are necessary to the EVEX Disp8 scaling test code to
account for the zero-byte reads/writes. I have to admit, though, that
I'm not fully convinced the SDM describes the faulting behavior
correctly: Other prefetch insns, including the Xeon Phi Coprocessor S/G
ones, don't produce #GP/#SS. Until proven otherwise, this gets
implemented as specified, not least because the respective exception
specification table, besides listing #GP and #SS, also explicitly says
"EVEX-encoded prefetch instructions that do not cause #PF follow
exception class ...".
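
To illustrate why the test harness hunks below widen the recorded
size, here is a minimal standalone sketch (hypothetical code, not part
of the patch) of the "bytes + !bytes" idiom: a zero-length prefetch
probe would otherwise never mark anything in the harness' accessed[]
map.

    #include <stdio.h>

    /* Hypothetical stand-in for the harness' access bookkeeping. */
    static unsigned char accessed[16];

    static void record(unsigned long off, unsigned int bytes)
    {
        /* Widen zero-length (prefetch) accesses to 1 byte for tracking. */
        for ( unsigned int i = 0; i < bytes + !bytes; ++i )
            ++accessed[off + i];
    }

    int main(void)
    {
        record(4, 0);                /* a zero-byte prefetch probe */
        printf("%u\n", accessed[4]); /* prints 1, not 0 */
        return 0;
    }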

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -520,6 +520,17 @@ static const struct test avx512er_512[]
     INSN(rsqrt28, 66, 0f38, cd, el, sd, el),
 };
 
+static const struct test avx512pf_512[] = {
+    INSNX(gatherpf0d,  66, 0f38, c6, 1, vl, sd, el),
+    INSNX(gatherpf0q,  66, 0f38, c7, 1, vl, sd, el),
+    INSNX(gatherpf1d,  66, 0f38, c6, 2, vl, sd, el),
+    INSNX(gatherpf1q,  66, 0f38, c7, 2, vl, sd, el),
+    INSNX(scatterpf0d, 66, 0f38, c6, 5, vl, sd, el),
+    INSNX(scatterpf0q, 66, 0f38, c7, 5, vl, sd, el),
+    INSNX(scatterpf1d, 66, 0f38, c6, 6, vl, sd, el),
+    INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -580,7 +591,7 @@ static bool record_access(enum x86_segme
 static int read(enum x86_segment seg, unsigned long offset, void *p_data,
                 unsigned int bytes, struct x86_emulate_ctxt *ctxt)
 {
-    if ( !record_access(seg, offset, bytes) )
+    if ( !record_access(seg, offset, bytes + !bytes) )
         return X86EMUL_UNHANDLEABLE;
     memset(p_data, 0, bytes);
     return X86EMUL_OKAY;
@@ -589,7 +600,7 @@ static int read(enum x86_segment seg, un
 static int write(enum x86_segment seg, unsigned long offset, void *p_data,
                  unsigned int bytes, struct x86_emulate_ctxt *ctxt)
 {
-    if ( !record_access(seg, offset, bytes) )
+    if ( !record_access(seg, offset, bytes + !bytes) )
         return X86EMUL_UNHANDLEABLE;
     return X86EMUL_OKAY;
 }
@@ -597,7 +608,7 @@ static int write(enum x86_segment seg, u
 static void test_one(const struct test *test, enum vl vl,
                      unsigned char *instr, struct x86_emulate_ctxt *ctxt)
 {
-    unsigned int vsz, esz, i;
+    unsigned int vsz, esz, i, n;
     int rc;
     bool sg = strstr(test->mnemonic, "gather") ||
               strstr(test->mnemonic, "scatter");
@@ -725,10 +736,20 @@ static void test_one(const struct test *
     for ( i = 0; i < (test->scale == SC_vl ? vsz : esz); ++i )
          if ( accessed[i] )
              goto fail;
-    for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
+
+    n = test->scale == SC_vl ? vsz : esz;
+    if ( !sg )
+        n += vsz;
+    else if ( !strstr(test->mnemonic, "pf") )
+        n += esz;
+    else
+        ++n;
+
+    for ( ; i < n; ++i )
          if ( accessed[i] != (sg ? (vsz / esz) >> (test->opc & 1 & !evex.w)
                                  : 1) )
              goto fail;
+
     for ( ; i < ARRAY_SIZE(accessed); ++i )
          if ( accessed[i] )
              goto fail;
@@ -887,6 +908,8 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
     RUN(avx512er, 512);
+#define cpu_has_avx512pf cpu_has_avx512f
+    RUN(avx512pf, 512);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
 }
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -135,12 +135,12 @@ int emul_test_cpuid(
         res->c |= 1U << 22;
 
     /*
-     * The emulator doesn't itself use ADCX/ADOX/RDPID, so we can always run
-     * the respective tests.
+     * The emulator doesn't itself use ADCX/ADOX/RDPID nor the S/G prefetch
+     * insns, so we can always run the respective tests.
      */
     if ( leaf == 7 && subleaf == 0 )
     {
-        res->b |= 1U << 19;
+        res->b |= (1U << 19) | (1U << 26);
         res->c |= 1U << 22;
     }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -525,6 +525,7 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xc6 ... 0xc7] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xc9] = { .simd_size = simd_other },
     [0xca] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -1903,6 +1904,7 @@ static bool vcpu_has(
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
+#define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
 #define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
@@ -9414,6 +9416,80 @@ x86_emulate(
 
         state->simd_size = simd_none;
         break;
+    }
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc6):
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc7):
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+
+        ASSERT(ea.type == OP_MEM);
+        generate_exception_if((!cpu_has_avx512f || !evex.opmsk || evex.brs ||
+                               evex.z || evex.reg != 0xf || evex.lr != 2),
+                              EXC_UD);
+        vcpu_must_have(avx512pf);
+
+        switch ( modrm_reg & 7 )
+        {
+        case 1: /* vgatherpf0{d,q}p{s,d} mem{k} */
+        case 2: /* vgatherpf1{d,q}p{s,d} mem{k} */
+            break;
+        case 5: /* vscatterpf0{d,q}p{s,d} mem{k} */
+        case 6: /* vscatterpf1{d,q}p{s,d} mem{k} */
+            fail_if(!ops->write);
+            break;
+        default:
+            generate_exception(EXC_UD);
+        }
+
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read index register. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        /* vmovdqu{32,64} */
+        opc[0] = 0x7f;
+        pevex->pfx = vex_f3;
+        pevex->w = b & 1;
+        /* Use (%rax) as destination and sib_index as source. */
+        pevex->b = 1;
+        opc[1] = (state->sib_index & 7) << 3;
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the mask value. */
+        n = 1 << (4 - ((b & 1) | evex.w));
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; rc == X86EMUL_OKAY && op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = (modrm_reg & 4
+                  ? ops->write
+                  : ops->read)(ea.mem.seg,
+                               truncate_ea(ea.mem.off +
+                                           (idx << state->sib_scale)),
+                               NULL, 0, ctxt);
+
+            op_mask &= ~(1 << i);
+        }
+
+        state->simd_size = simd_none;
+        break;
     }
 
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 37/49] x86emul: support AVX512CD insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (36 preceding siblings ...)
  2018-12-19 14:59   ` [PATCH v7 36/49] x86emul: support AVX512PF insns Jan Beulich
@ 2018-12-19 14:59   ` Jan Beulich
  2018-12-19 15:00   ` [PATCH v7 38/49] x86emul: complete support of AVX512_VBMI insns Jan Beulich
                     ` (11 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 14:59 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Since the insns here and in particular their memory access patterns
follow the usual scheme, I didn't think it was necessary to add
contrived tests specifically for them, beyond the Disp8 scaling ones.
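
For reviewers less familiar with these insns, a rough scalar model of
the conflict detection and leading zero count operations, per dword
element (my reading of the SDM; illustrative only, not code from this
patch):

    #include <stdint.h>

    /* vplzcntd: leading zero count of each dword element. */
    static uint32_t lzcnt32(uint32_t v)
    {
        unsigned int n = 32;

        for ( ; v; v >>= 1 )
            --n;
        return n;
    }

    /* vpconflictd: per element, a bit mask of earlier equal elements. */
    static void conflict32(const uint32_t *src, uint32_t *dst,
                           unsigned int nelem)
    {
        for ( unsigned int i = 0; i < nelem; ++i )
        {
            dst[i] = 0;
            for ( unsigned int j = 0; j < i; ++j )
                if ( src[j] == src[i] )
                    dst[i] |= 1u << j;
        }
    }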

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -458,6 +458,13 @@ static const struct test avx512bw_128[]
     INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
+static const struct test avx512cd_all[] = {
+//       pbroadcastmb2q, f3, 0f38, 2a,      q
+//       pbroadcastmw2d, f3, 0f38, 3a,      d
+    INSN(pconflict,      66, 0f38, c4, vl, dq, vl),
+    INSN(plzcnt,         66, 0f38, 44, vl, dq, vl),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -903,6 +910,7 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, 512);
     RUN(avx512bw, all);
     RUN(avx512bw, 128);
+    RUN(avx512cd, all);
     RUN(avx512dq, all);
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -138,6 +138,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
 #define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
+#define cpu_has_avx512cd  (cp.feat.avx512cd && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -473,6 +473,7 @@ static const struct ext0f38_table {
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x42] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x44] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -525,6 +526,7 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xc4] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0xc6 ... 0xc7] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xc9] = { .simd_size = simd_other },
@@ -1906,6 +1908,7 @@ static bool vcpu_has(
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
 #define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
+#define vcpu_has_avx512cd()    vcpu_has(         7, EBX, 28, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
@@ -8805,6 +8808,20 @@ x86_emulate(
         evex.opcx = vex_0f;
         goto vmovdqa;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x2a): /* vpbroadcastmb2q k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x3a): /* vpbroadcastmw2d k,[xyz]mm */
+        generate_exception_if((ea.type != OP_REG || evex.opmsk ||
+                               evex.w == ((b >> 4) & 1)),
+                              EXC_UD);
+        d |= TwoOp;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc4): /* vpconflict{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x44): /* vplzcnt{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512cd);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2c): /* vmaskmovps mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2d): /* vmaskmovpd mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2e): /* vmaskmovps {x,y}mm,{x,y}mm,mem */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -104,6 +104,7 @@
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
+#define cpu_has_avx512cd        boot_cpu_has(X86_FEATURE_AVX512CD)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 38/49] x86emul: complete support of AVX512_VBMI insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (37 preceding siblings ...)
  2018-12-19 14:59   ` [PATCH v7 37/49] x86emul: support AVX512CD insns Jan Beulich
@ 2018-12-19 15:00   ` Jan Beulich
  2018-12-19 15:00   ` [PATCH v7 39/49] x86emul: support of AVX512* population count insns Jan Beulich
                     ` (10 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also add testing for the insns whose support was added earlier. Sadly
gcc's command line option naming is not in line with Intel's naming of
the feature, which makes it necessary to mis-name things in the test
harness.

Since the only new insn here and in particular its memory access pattern
follows the usual scheme, I didn't think it was necessary to add a
contrived test specifically for it, beyond the Disp8 scaling one.
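
For reference, a scalar sketch of the one new insn, modeled on a
single qword lane (my reading of the SDM; illustrative only): each
destination byte is an unaligned 8-bit field of the source qword,
taken at the bit offset named by the corresponding control byte, with
wraparound.

    #include <stdint.h>

    static uint64_t multishiftqb(uint64_t ctrl, uint64_t src)
    {
        uint64_t res = 0;

        for ( unsigned int i = 0; i < 8; ++i )
        {
            unsigned int off = (ctrl >> (i * 8)) & 63;
            /* Rotate right by the selected offset, keep the low byte. */
            uint64_t rot = (src >> off) | (off ? src << (64 - off) : 0);

            res |= (rot & 0xff) << (i * 8);
        }
        return res;
    }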

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -19,7 +19,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er avx512vbmi
 FMA := fma4 fma
 SG := avx2-sg avx512f-sg avx512vl-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -86,6 +86,9 @@ avx512dq-flts := $(avx512f-flts)
 avx512er-vecs := 64
 avx512er-ints :=
 avx512er-flts := 4 8
+avx512vbmi-vecs := $(avx512bw-vecs)
+avx512vbmi-ints := $(avx512bw-ints)
+avx512vbmi-flts := $(avx512bw-flts)
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -542,6 +542,7 @@ static const struct test avx512_vbmi_all
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+    INSN(pmultishiftqb, 66, 0f38, 83, vl, q, vl),
 };
 
 static const struct test avx512_vbmi2_all[] = {
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -27,6 +27,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw.h"
 #include "avx512dq.h"
 #include "avx512er.h"
+#include "avx512vbmi.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -127,6 +128,16 @@ static bool simd_check_avx512bw_vl(void)
     return cpu_has_avx512bw && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx512vbmi(void)
+{
+    return cpu_has_avx512_vbmi;
+}
+
+static bool simd_check_avx512vbmi_vl(void)
+{
+    return cpu_has_avx512_vbmi && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -372,6 +383,18 @@ static const struct {
     SIMD(AVX512ER f32x16,    avx512er,      64f4),
     SIMD(AVX512ER f64 scalar,avx512er,        f8),
     SIMD(AVX512ER f64x8,     avx512er,      64f8),
+    SIMD(AVX512_VBMI s8x64,  avx512vbmi,    64i1),
+    SIMD(AVX512_VBMI u8x64,  avx512vbmi,    64u1),
+    SIMD(AVX512_VBMI s16x32, avx512vbmi,    64i2),
+    SIMD(AVX512_VBMI u16x32, avx512vbmi,    64u2),
+    AVX512VL(_VBMI+VL s8x16, avx512vbmi,    16i1),
+    AVX512VL(_VBMI+VL u8x16, avx512vbmi,    16u1),
+    AVX512VL(_VBMI+VL s8x32, avx512vbmi,    32i1),
+    AVX512VL(_VBMI+VL u8x32, avx512vbmi,    32u1),
+    AVX512VL(_VBMI+VL s16x8, avx512vbmi,    16i2),
+    AVX512VL(_VBMI+VL u16x8, avx512vbmi,    16u2),
+    AVX512VL(_VBMI+VL s16x16, avx512vbmi,   32i2),
+    AVX512VL(_VBMI+VL u16x16, avx512vbmi,   32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -493,6 +493,7 @@ static const struct ext0f38_table {
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x83] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x88] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_dq },
     [0x89] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_dq },
     [0x8a] = { .simd_size = simd_packed_fp, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
@@ -9012,6 +9013,12 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x83): /* vpmultishiftqb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        host_and_vcpu_must_have(avx512_vbmi);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 39/49] x86emul: support of AVX512* population count insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (38 preceding siblings ...)
  2018-12-19 15:00   ` [PATCH v7 38/49] x86emul: complete support of AVX512_VBMI insns Jan Beulich
@ 2018-12-19 15:00   ` Jan Beulich
  2018-12-19 15:01   ` [PATCH v7 40/49] x86emul: support of AVX512_IFMA insns Jan Beulich
                     ` (9 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Plus the only other AVX512_BITALG one.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
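
A scalar model of the per-element operations involved (illustrative
only; the vpshufbitqmb part in particular is just my reading of the
SDM):

    #include <stdint.h>

    /* vpopcnt{b,w,d,q}: population count of each element. */
    static unsigned int popcnt64(uint64_t v)
    {
        unsigned int n = 0;

        for ( ; v; v &= v - 1 ) /* clear one set bit per iteration */
            ++n;
        return n;
    }

    /*
     * vpshufbitqmb: mask bit i is bit src2.byte[i] & 63 of qword
     * lane i / 8 of src1.
     */
    static uint64_t shufbitqmb(const uint64_t *src1, const uint8_t *src2,
                               unsigned int nbytes)
    {
        uint64_t k = 0;

        for ( unsigned int i = 0; i < nbytes; ++i )
            if ( (src1[i / 8] >> (src2[i] & 63)) & 1 )
                k |= 1ULL << i;
        return k;
    }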

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -538,6 +538,11 @@ static const struct test avx512pf_512[]
     INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
 };
 
+static const struct test avx512_bitalg_all[] = {
+    INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
+    INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -550,6 +555,10 @@ static const struct test avx512_vbmi2_al
     INSN(pexpand,   66, 0f38, 62, vl, bw, el),
 };
 
+static const struct test avx512_vpopcntdq_all[] = {
+    INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -919,6 +928,8 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512er, 512);
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
+    RUN(avx512_bitalg, all);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
+    RUN(avx512_vpopcntdq, all);
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -143,6 +143,8 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
+#define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,6 +479,7 @@ static const struct ext0f38_table {
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -501,6 +502,7 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
+    [0x8f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -1915,6 +1917,8 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
+#define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -8912,6 +8916,19 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8f): /* vpshufbitqmb [xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(evex.w || !evex.r || !evex.R || evex.z, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x54): /* vpopcnt{b,w} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_bitalg);
+        generate_exception_if(evex.brs, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x55): /* vpopcnt{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vpopcntdq);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,{x,y}mm */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -112,6 +112,8 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
+#define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
 /* CPUID level 0x80000007.edx */
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
 
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -269,7 +269,7 @@ def crunch_numbers(state):
         # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask
         # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
+        AVX512BW: [AVX512_VBMI, AVX512_BITALG, AVX512_VBMI2],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 40/49] x86emul: support of AVX512_IFMA insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (39 preceding siblings ...)
  2018-12-19 15:00   ` [PATCH v7 39/49] x86emul: support of AVX512* population count insns Jan Beulich
@ 2018-12-19 15:01   ` Jan Beulich
  2018-12-19 15:01   ` [PATCH v7 41/49] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
                     ` (8 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:01 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Once again take the liberty of also correcting the (public interface)
name of the AVX512_IFMA feature flag to match the SDM, on the
assumption that no external consumer has actually been using that flag
so far.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
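
Since the 52-bit arithmetic is easy to get wrong when reviewing, a
scalar model of one qword lane (illustrative sketch only; it assumes a
128-bit integer type is available):

    #include <stdint.h>

    #define MASK52 ((1ULL << 52) - 1)

    /* vpmadd52luq: accumulate the low 52 bits of the 104-bit product. */
    static uint64_t madd52lo(uint64_t acc, uint64_t a, uint64_t b)
    {
        unsigned __int128 prod = (unsigned __int128)(a & MASK52) *
                                 (b & MASK52);

        return acc + (uint64_t)(prod & MASK52);
    }

    /* vpmadd52huq: accumulate bits 52...103 of the same product. */
    static uint64_t madd52hi(uint64_t acc, uint64_t a, uint64_t b)
    {
        unsigned __int128 prod = (unsigned __int128)(a & MASK52) *
                                 (b & MASK52);

        return acc + (uint64_t)(prod >> 52);
    }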

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Reject EVEX.W=0.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -543,6 +543,11 @@ static const struct test avx512_bitalg_a
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
 };
 
+static const struct test avx512_ifma_all[] = {
+    INSN(pmadd52huq, 66, 0f38, b5, vl, q, vl),
+    INSN(pmadd52luq, 66, 0f38, b4, vl, q, vl),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -929,6 +934,7 @@ void evex_disp8_test(void *instr, struct
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
     RUN(avx512_bitalg, all);
+    RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
     RUN(avx512_vpopcntdq, all);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -137,6 +137,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_bmi2       cp.feat.bmi2
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
+#define cpu_has_avx512_ifma (cp.feat.avx512_ifma && xcr0_mask(0xe6))
 #define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512cd  (cp.feat.avx512cd && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -521,6 +521,7 @@ static const struct ext0f38_table {
     [0xad] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xae] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xaf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xb4 ... 0xb5] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xb9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xba] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -1907,6 +1908,7 @@ static bool vcpu_has(
 #define vcpu_has_rdseed()      vcpu_has(         7, EBX, 18, ctxt, ops)
 #define vcpu_has_adx()         vcpu_has(         7, EBX, 19, ctxt, ops)
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
+#define vcpu_has_avx512_ifma() vcpu_has(         7, EBX, 21, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
@@ -9459,6 +9461,12 @@ x86_emulate(
         break;
     }
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb4): /* vpmadd52luq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb5): /* vpmadd52huq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_ifma);
+        generate_exception_if(!evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xc6):
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xc7):
     {
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -103,6 +103,7 @@
 #define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
+#define cpu_has_avx512_ifma     boot_cpu_has(X86_FEATURE_AVX512_IFMA)
 #define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_avx512cd        boot_cpu_has(X86_FEATURE_AVX512CD)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -212,7 +212,7 @@ XEN_CPUFEATURE(AVX512DQ,      5*32+17) /
 XEN_CPUFEATURE(RDSEED,        5*32+18) /*A  RDSEED instruction */
 XEN_CPUFEATURE(ADX,           5*32+19) /*A  ADCX, ADOX instructions */
 XEN_CPUFEATURE(SMAP,          5*32+20) /*S  Supervisor Mode Access Prevention */
-XEN_CPUFEATURE(AVX512IFMA,    5*32+21) /*A  AVX-512 Integer Fused Multiply Add */
+XEN_CPUFEATURE(AVX512_IFMA,   5*32+21) /*A  AVX-512 Integer Fused Multiply Add */
 XEN_CPUFEATURE(CLFLUSHOPT,    5*32+23) /*A  CLFLUSHOPT instruction */
 XEN_CPUFEATURE(CLWB,          5*32+24) /*A  CLWB instruction */
 XEN_CPUFEATURE(AVX512PF,      5*32+26) /*A  AVX-512 Prefetch Instructions */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -262,7 +262,7 @@ def crunch_numbers(state):
         # (which in practice depends on the EVEX prefix to encode) as well
         # as mask registers, and the instructions themselves. All further
         # AVX512 features are built on top of AVX512F
-        AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
+        AVX512F: [AVX512DQ, AVX512_IFMA, AVX512PF, AVX512ER, AVX512CD,
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
                   AVX512_VPOPCNTDQ],
 




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 41/49] x86emul: support remaining AVX512_VBMI2 insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (40 preceding siblings ...)
  2018-12-19 15:01   ` [PATCH v7 40/49] x86emul: support of AVX512_IFMA insns Jan Beulich
@ 2018-12-19 15:01   ` Jan Beulich
  2018-12-19 15:02   ` [PATCH v7 42/49] x86emul: support AVX512_4FMAPS insns Jan Beulich
                     ` (7 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:01 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
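
A scalar model of the concatenate-and-shift operation, shown for the
dword flavors (my reading of the SDM; illustrative only, as the word
and qword flavors differ merely in element width and count mask):

    #include <stdint.h>

    /* vpshldd: left shift of the src1:src2 concatenation, upper half. */
    static uint32_t shld32(uint32_t src1, uint32_t src2, unsigned int imm)
    {
        uint64_t tmp = ((uint64_t)src1 << 32) | src2;

        return (uint32_t)((tmp << (imm & 31)) >> 32);
    }

    /* vpshrdd: right shift of the src2:src1 concatenation, lower half. */
    static uint32_t shrd32(uint32_t src1, uint32_t src2, unsigned int imm)
    {
        uint64_t tmp = ((uint64_t)src2 << 32) | src1;

        return (uint32_t)(tmp >> (imm & 31));
    }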

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base over change earlier in the series.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -558,6 +558,14 @@ static const struct test avx512_vbmi_all
 static const struct test avx512_vbmi2_all[] = {
     INSN(pcompress, 66, 0f38, 63, vl, bw, el),
     INSN(pexpand,   66, 0f38, 62, vl, bw, el),
+    INSN(pshld,     66, 0f3a, 71, vl, dq, vl),
+    INSN(pshldv,    66, 0f38, 71, vl, dq, vl),
+    INSN(pshldvw,   66, 0f38, 70, vl,  w, vl),
+    INSN(pshldw,    66, 0f3a, 70, vl,  w, vl),
+    INSN(pshrd,     66, 0f3a, 73, vl, dq, vl),
+    INSN(pshrdv,    66, 0f38, 73, vl, dq, vl),
+    INSN(pshrdvw,   66, 0f38, 72, vl,  w, vl),
+    INSN(pshrdw,    66, 0f3a, 72, vl,  w, vl),
 };
 
 static const struct test avx512_vpopcntdq_all[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -487,6 +487,7 @@ static const struct ext0f38_table {
     [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
     [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
     [0x64 ... 0x66] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x70 ... 0x73] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -611,6 +612,7 @@ static const struct ext0f3a_table {
     [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x6e ... 0x6f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
+    [0x70 ... 0x73] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x78 ... 0x79] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x7a ... 0x7b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x7c ... 0x7d] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -8982,6 +8984,16 @@ x86_emulate(
         }
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x70): /* vpshldvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x72): /* vpshrdvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        elem_bytes = 2;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x71): /* vpshldv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x73): /* vpshrdv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -10268,6 +10280,16 @@ x86_emulate(
         avx512_vlen_check(true);
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x70): /* vpshldw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x72): /* vpshrdw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        elem_bytes = 2;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x71): /* vpshld{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x73): /* vpshrd{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC(0x0f3a, 0xcc):     /* sha1rnds4 $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(sha);
         op_bytes = 16;





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 42/49] x86emul: support AVX512_4FMAPS insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (41 preceding siblings ...)
  2018-12-19 15:01   ` [PATCH v7 41/49] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
@ 2018-12-19 15:02   ` Jan Beulich
  2018-12-19 15:05   ` [PATCH v7 43/49] x86emul: support AVX512_4VNNIW insns Jan Beulich
                     ` (6 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:02 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -538,6 +538,13 @@ static const struct test avx512pf_512[]
     INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
 };
 
+static const struct test avx512_4fmaps_512[] = {
+    INSN(4fmaddps,  f2, 0f38, 9a, el_4, d, vl),
+    INSN(4fmaddss,  f2, 0f38, 9b, el_4, d, vl),
+    INSN(4fnmaddps, f2, 0f38, aa, el_4, d, vl),
+    INSN(4fnmaddss, f2, 0f38, ab, el_4, d, vl),
+};
+
 static const struct test avx512_bitalg_all[] = {
     INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
@@ -941,6 +948,7 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512er, 512);
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
+    RUN(avx512_4fmaps, 512);
     RUN(avx512_bitalg, all);
     RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -4274,6 +4274,81 @@ int main(int argc, char **argv)
     }
 #endif
 
+    printf("%-40s", "Testing v4fmaddps 32(%ecx),%zmm4,%zmm4{%k5}...");
+    if ( stack_exec && cpu_has_avx512_4fmaps )
+    {
+        decl_insn(v4fmaddps);
+        static const struct {
+            float f[16];
+        } in = {{
+            1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
+        }}, out = {{
+            1 + 1 * 9 + 2 * 10 + 3 * 11 + 4 * 12,
+            2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+            16 + 16 * 9 + 17 * 10 + 18 * 11 + 19 * 12
+        }};
+
+        asm volatile ( "vmovups %1, %%zmm4\n\t"
+                       "vbroadcastss %%xmm4, %%zmm7\n\t"
+                       "vaddps %%zmm4, %%zmm7, %%zmm5\n\t"
+                       "vaddps %%zmm5, %%zmm7, %%zmm6\n\t"
+                       "vaddps %%zmm6, %%zmm7, %%zmm7\n\t"
+                       "kmovw %2, %%k5\n"
+                       put_insn(v4fmaddps,
+                                "v4fmaddps 32(%0), %%zmm4, %%zmm4%{%%k5%}")
+                       :: "c" (NULL), "m" (in), "rmk" (0x8001) );
+
+        set_insn(v4fmaddps);
+        regs.ecx = (unsigned long)&in;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(v4fmaddps) )
+            goto fail;
+
+        asm ( "vcmpeqps %1, %%zmm4, %%k0\n\t"
+              "kmovw %%k0, %0" : "=g" (rc) : "m" (out) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing v4fnmaddss 16(%edx),%zmm4,%zmm4{%k3}...");
+    if ( stack_exec && cpu_has_avx512_4fmaps )
+    {
+        decl_insn(v4fnmaddss);
+        static const struct {
+            float f[16];
+        } in = {{
+            1, 2, 3, 4, 5, 6, 7, 8
+        }}, out = {{
+            1 - 1 * 5 - 2 * 6 - 3 * 7 - 4 * 8, 2, 3, 4
+        }};
+
+        asm volatile ( "vmovups %1, %%xmm4\n\t"
+                       "vaddss %%xmm4, %%xmm4, %%xmm5\n\t"
+                       "vaddss %%xmm5, %%xmm4, %%xmm6\n\t"
+                       "vaddss %%xmm6, %%xmm4, %%xmm7\n\t"
+                       "kmovw %2, %%k3\n"
+                       put_insn(v4fnmaddss,
+                                "v4fnmaddss 16(%0), %%xmm4, %%xmm4%{%%k3%}")
+                       :: "d" (NULL), "m" (in), "rmk" (1) );
+
+        set_insn(v4fnmaddss);
+        regs.edx = (unsigned long)&in;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(v4fnmaddss) )
+            goto fail;
+
+        asm ( "vcmpeqps %1, %%zmm4, %%k0\n\t"
+              "kmovw %%k0, %0" : "=g" (rc) : "m" (out) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -146,6 +146,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
+#define cpu_has_avx512_4fmaps (cp.feat.avx512_4fmaps && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1923,6 +1923,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
+#define vcpu_has_avx512_4fmaps() vcpu_has(       7, EDX,  3, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -3205,6 +3206,18 @@ x86_decode(
                                                    state);
                     state->simd_size = simd_other;
                 }
+
+                switch ( b )
+                {
+                /* v4f{,n}madd{p,s}s need special casing */
+                case 0x9a: case 0x9b: case 0xaa: case 0xab:
+                    if ( evex.pfx == vex_f2 )
+                    {
+                        disp8scale = 4;
+                        state->simd_size = simd_128;
+                    }
+                    break;
+                }
             }
             break;
 
@@ -9377,6 +9390,24 @@ x86_emulate(
             avx512_vlen_check(true);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9a): /* v4fmaddps m128,zmm+3,zmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0xaa): /* v4fnmaddps m128,zmm+3,zmm{k} */
+        host_and_vcpu_must_have(avx512_4fmaps);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.brs ||
+                               evex.lr != 2),
+                              EXC_UD);
+        op_mask = op_mask & 0xffff ? 0xf : 0;
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9b): /* v4fmaddss m128,xmm+3,xmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0xab): /* v4fnmaddss m128,xmm+3,xmm{k} */
+        host_and_vcpu_must_have(avx512_4fmaps);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.brs ||
+                               evex.lr == 3),
+                              EXC_UD);
+        op_mask = op_mask & 1 ? 0xf : 0;
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa0): /* vpscatterd{d,q} [xyz]mm,mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa1): /* vpscatterq{d,q} [xyz]mm,mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa2): /* vscatterdp{s,d} [xyz]mm,mem{k} */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -117,6 +117,9 @@
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
+/* CPUID level 0x00000007:0.edx */
+#define cpu_has_avx512_4fmaps   boot_cpu_has(X86_FEATURE_AVX512_4FMAPS)
+
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
 




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 43/49] x86emul: support AVX512_4VNNIW insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (42 preceding siblings ...)
  2018-12-19 15:02   ` [PATCH v7 42/49] x86emul: support AVX512_4FMAPS insns Jan Beulich
@ 2018-12-19 15:05   ` Jan Beulich
  2018-12-19 15:06   ` [PATCH v7 44/49] x86emul: support AVX512_VNNI insns Jan Beulich
                     ` (5 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:05 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As in a few cases before, since the insns here and in particular their
memory access patterns follow the AVX512_4FMAPS scheme, I didn't think
it was necessary to add contrived tests specifically for them, beyond
the Disp8 scaling ones.
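
For reference, the shared pattern: all of these read an m128 of four
scalars and a block of four consecutive registers. A scalar model of
v4fmaddps (matching the expectations in the test added by the
AVX512_4FMAPS patch; illustrative only, with the insns here
substituting signed word dot products for the FP multiply-accumulates):

    /* v4fmaddps zmm{k},zmm+3,m128, modeled per dword lane. */
    static void v4fmaddps(float *dst, const float (*regs)[16],
                          const float mem[4], unsigned int nelem)
    {
        for ( unsigned int i = 0; i < nelem; ++i )
            for ( unsigned int j = 0; j < 4; ++j )
                dst[i] += regs[j][i] * mem[j];
    }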

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -545,6 +545,11 @@ static const struct test avx512_4fmaps_5
     INSN(4fnmaddss, f2, 0f38, ab, el_4, d, vl),
 };
 
+static const struct test avx512_4vnniw_512[] = {
+    INSN(p4dpwssd,  f2, 0f38, 52, el_4, d, vl),
+    INSN(p4dpwssds, f2, 0f38, 53, el_4, d, vl),
+};
+
 static const struct test avx512_bitalg_all[] = {
     INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
@@ -949,6 +954,7 @@ void evex_disp8_test(void *instr, struct
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
     RUN(avx512_4fmaps, 512);
+    RUN(avx512_4vnniw, 512);
     RUN(avx512_bitalg, all);
     RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -146,6 +146,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
+#define cpu_has_avx512_4vnniw (cp.feat.avx512_4vnniw && xcr0_mask(0xe6))
 #define cpu_has_avx512_4fmaps (cp.feat.avx512_4fmaps && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,6 +479,7 @@ static const struct ext0f38_table {
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x52 ... 0x53] = { .simd_size = simd_128, .d8s = 4 },
     [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
@@ -1923,6 +1924,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
+#define vcpu_has_avx512_4vnniw() vcpu_has(       7, EDX,  2, ctxt, ops)
 #define vcpu_has_avx512_4fmaps() vcpu_has(       7, EDX,  3, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
@@ -8933,6 +8935,15 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x52): /* vp4dpwssd m128,zmm+3,zmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x53): /* vp4dpwssds m128,zmm+3,zmm{k} */
+        host_and_vcpu_must_have(avx512_4vnniw);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.brs ||
+                               evex.lr != 2),
+                              EXC_UD);
+        op_mask = op_mask & 0xffff ? 0xf : 0;
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8f): /* vpshufbitqmb [xyz]mm/mem,[xyz]mm,k{k} */
         generate_exception_if(evex.w || !evex.r || !evex.R || evex.z, EXC_UD);
         /* fall through */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -118,6 +118,7 @@
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
 /* CPUID level 0x00000007:0.edx */
+#define cpu_has_avx512_4vnniw   boot_cpu_has(X86_FEATURE_AVX512_4VNNIW)
 #define cpu_has_avx512_4fmaps   boot_cpu_has(X86_FEATURE_AVX512_4FMAPS)
 
 /* CPUID level 0x80000007.edx */





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* [PATCH v7 44/49] x86emul: support AVX512_VNNI insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (43 preceding siblings ...)
  2018-12-19 15:05   ` [PATCH v7 43/49] x86emul: support AVX512_4VNNIW insns Jan Beulich
@ 2018-12-19 15:06   ` Jan Beulich
  2018-12-19 15:06   ` [PATCH v7 45/49] x86emul: support VPCLMULQDQ insns Jan Beulich
                     ` (4 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
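
A scalar model of the dot products involved, per dword lane (my
reading of the SDM; illustrative only, with the "s"-suffixed forms
additionally saturating the accumulation to the signed 32-bit range):

    #include <stdint.h>

    /* vpdpbusd: 4 unsigned(src1) x signed(src2) byte products. */
    static int32_t dpbusd(int32_t acc, const uint8_t u[4], const int8_t s[4])
    {
        for ( unsigned int k = 0; k < 4; ++k )
            acc += (int32_t)u[k] * s[k];
        return acc;
    }

    /* vpdpwssd: 2 signed x signed word products. */
    static int32_t dpwssd(int32_t acc, const int16_t a[2], const int16_t b[2])
    {
        return acc + (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    }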

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -580,6 +580,13 @@ static const struct test avx512_vbmi2_al
     INSN(pshrdw,    66, 0f3a, 72, vl,  w, vl),
 };
 
+static const struct test avx512_vnni_all[] = {
+    INSN(pdpbusd,  66, 0f38, 50, vl, d, vl),
+    INSN(pdpbusds, 66, 0f38, 51, vl, d, vl),
+    INSN(pdpwssd,  66, 0f38, 52, vl, d, vl),
+    INSN(pdpwssds, 66, 0f38, 53, vl, d, vl),
+};
+
 static const struct test avx512_vpopcntdq_all[] = {
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
@@ -959,5 +966,6 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
+    RUN(avx512_vnni, all);
     RUN(avx512_vpopcntdq, all);
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
 #define cpu_has_avx512_4vnniw (cp.feat.avx512_4vnniw && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,7 +479,7 @@ static const struct ext0f38_table {
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
-    [0x52 ... 0x53] = { .simd_size = simd_128, .d8s = 4 },
+    [0x50 ... 0x53] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
@@ -1922,6 +1922,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
 #define vcpu_has_avx512_4vnniw() vcpu_has(       7, EDX,  2, ctxt, ops)
@@ -3211,6 +3212,8 @@ x86_decode(
 
                 switch ( b )
                 {
+                /* vp4dpwssd{,s} need special casing */
+                case 0x52: case 0x53:
                 /* v4f{,n}madd{p,s}s need special casing */
                 case 0x9a: case 0x9b: case 0xaa: case 0xab:
                     if ( evex.pfx == vex_f2 )
@@ -9401,6 +9404,14 @@ x86_emulate(
             avx512_vlen_check(true);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x50): /* vpdpbusd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x51): /* vpdpbusds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x52): /* vpdpwssd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x53): /* vpdpwssds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vnni);
+        generate_exception_if(evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9a): /* v4fmaddps m128,zmm+3,zmm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0xaa): /* v4fnmaddps m128,zmm+3,zmm{k} */
         host_and_vcpu_must_have(avx512_4fmaps);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -113,6 +113,7 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
 #define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
 XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -264,7 +264,7 @@ def crunch_numbers(state):
         # AVX512 features are built on top of AVX512F
         AVX512F: [AVX512DQ, AVX512_IFMA, AVX512PF, AVX512ER, AVX512CD,
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
-                  AVX512_VPOPCNTDQ],
+                  AVX512_VNNI, AVX512_VPOPCNTDQ],
 
         # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask





* [PATCH v7 45/49] x86emul: support VPCLMULQDQ insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (44 preceding siblings ...)
  2018-12-19 15:06   ` [PATCH v7 44/49] x86emul: support AVX512_VNNI insns Jan Beulich
@ 2018-12-19 15:06   ` Jan Beulich
  2018-12-19 15:07   ` [PATCH v7 46/49] x86emul: support VAES insns Jan Beulich
                     ` (3 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:06 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As to the feature dependency adjustment: while strictly speaking AVX is
a sufficient prereq (to have YMM registers), 256-bit integer vectors
were fully introduced only with AVX2. Sadly gcc can't be used as a
reference here: it doesn't provide any AVX512-independent built-in at
all.

As with PCLMULQDQ, since the insns here, and in particular their
memory access patterns, follow the usual scheme, I didn't think it
necessary to add a contrived test specifically for them beyond the
Disp8 scaling one.
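
For illustration, the VEX.256 form gated here corresponds to the
following intrinsic (a minimal sketch, not part of the patch; it
assumes a compiler supporting -mvpclmulqdq -mavx2):

#include <immintrin.h>

/*
 * Carry-less multiply of the low qword of each 128-bit lane. With
 * plain PCLMULQDQ only the 128-bit (vex.l clear) form is permitted;
 * the 256-bit form requires the VPCLMULQDQ feature checked by this
 * patch.
 */
__m256i clmul_lo_lanes(__m256i x, __m256i y)
{
    return _mm256_clmulepi64_epi128(x, y, 0x00);
}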

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -591,6 +591,10 @@ static const struct test avx512_vpopcntd
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
 
+static const struct test vpclmulqdq_all[] = {
+    INSN(pclmulqdq, 66, 0f3a, 44, vl, q_nb, vl)
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -968,4 +972,9 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512_vbmi2, all);
     RUN(avx512_vnni, all);
     RUN(avx512_vpopcntdq, all);
+
+    if ( cpu_has_avx512f )
+    {
+        RUN(vpclmulqdq, all);
+    }
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_vpclmulqdq (cp.feat.vpclmulqdq && xcr0_mask(6))
 #define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -594,7 +594,7 @@ static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42 ... 0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x44] = { .simd_size = simd_packed_int },
+    [0x44] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -1922,6 +1922,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_vpclmulqdq()  vcpu_has(         7, ECX, 10, ctxt, ops)
 #define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
@@ -10194,13 +10195,20 @@ x86_emulate(
         goto opmask_shift_imm;
 
     case X86EMUL_OPC_66(0x0f3a, 0x44):     /* pclmulqdq $imm8,xmm/m128,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         host_and_vcpu_must_have(pclmulqdq);
         if ( vex.opcx == vex_none )
             goto simd_0f3a_common;
-        generate_exception_if(vex.l, EXC_UD);
+        if ( vex.l )
+            host_and_vcpu_must_have(vpclmulqdq);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm */
+        host_and_vcpu_must_have(vpclmulqdq);
+        generate_exception_if(evex.brs || evex.opmsk, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x4a): /* vblendvps {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x4b): /* vblendvpd {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.w, EXC_UD);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -113,6 +113,7 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_vpclmulqdq      boot_cpu_has(X86_FEATURE_VPCLMULQDQ)
 #define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
 #define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -121,7 +121,7 @@ XEN_CPUFEATURE(PBE,           0*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000001.ecx, word 1 */
 XEN_CPUFEATURE(SSE3,          1*32+ 0) /*A  Streaming SIMD Extensions-3 */
-XEN_CPUFEATURE(PCLMULQDQ,     1*32+ 1) /*A  Carry-less mulitplication */
+XEN_CPUFEATURE(PCLMULQDQ,     1*32+ 1) /*A  Carry-less multiplication */
 XEN_CPUFEATURE(DTES64,        1*32+ 2) /*   64-bit Debug Store */
 XEN_CPUFEATURE(MONITOR,       1*32+ 3) /*   Monitor/Mwait support */
 XEN_CPUFEATURE(DSCPL,         1*32+ 4) /*   CPL Qualified Debug Store */
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(VPCLMULQDQ,    6*32+10) /*A  Vector Carry-less Multiplication Instrs */
 XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
 XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -255,8 +255,9 @@ def crunch_numbers(state):
 
         # This is just the dependency between AVX512 and AVX2 of XSTATE
         # feature flags.  If want to use AVX512, AVX2 must be supported and
-        # enabled.
-        AVX2: [AVX512F],
+        # enabled.  Certain later extensions, acting on 256-bit vectors of
+        # integers, better depend on AVX2 than AVX.
+        AVX2: [AVX512F, VPCLMULQDQ],
 
         # AVX512F is taken to mean hardware support for 512bit registers
         # (which in practice depends on the EVEX prefix to encode) as well





* [PATCH v7 46/49] x86emul: support VAES insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (45 preceding siblings ...)
  2018-12-19 15:06   ` [PATCH v7 45/49] x86emul: support VPCLMULQDQ insns Jan Beulich
@ 2018-12-19 15:07   ` Jan Beulich
  2018-12-19 15:07   ` [PATCH v7 47/49] x86emul: support GFNI insns Jan Beulich
                     ` (2 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As to the feature dependency adjustment: just like for VPCLMULQDQ,
while strictly speaking AVX is a sufficient prereq (to have YMM
registers), 256-bit integer vectors were fully introduced only with
AVX2.

As with AESNI, since the insns here, and in particular their memory
access patterns, follow the usual scheme, I didn't think it necessary
to add a contrived test specifically for them beyond the Disp8 scaling
one. I have, however, been carrying a todo item to add both AES and
SHA test cases.
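
For illustration, the VEX.256 form enabled here corresponds to the
following intrinsic (a minimal sketch, not part of the patch; it
assumes a compiler supporting -mvaes -mavx2):

#include <immintrin.h>

/*
 * One AES encryption round on two independent 128-bit states at once.
 * With plain AESNI only the 128-bit (vex.l clear) form is permitted;
 * the 256-bit form requires the VAES feature checked by this patch.
 */
__m256i aes_round_x2(__m256i state, __m256i round_key)
{
    return _mm256_aesenc_epi128(state, round_key);
}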

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -591,6 +591,18 @@ static const struct test avx512_vpopcntd
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
 
+/*
+ * The uses of b in this table are simply (one of) the shortest form(s) of
+ * saying "no broadcast" without introducing a 128-bit granularity enumerator.
+ * Due to all of the insns being WIG, w, d_nb, and q_nb would all also fit.
+ */
+static const struct test vaes_all[] = {
+    INSN(aesdec,     66, 0f38, de, vl, b, vl),
+    INSN(aesdeclast, 66, 0f38, df, vl, b, vl),
+    INSN(aesenc,     66, 0f38, dc, vl, b, vl),
+    INSN(aesenclast, 66, 0f38, dd, vl, b, vl),
+};
+
 static const struct test vpclmulqdq_all[] = {
     INSN(pclmulqdq, 66, 0f3a, 44, vl, q_nb, vl)
 };
@@ -975,6 +987,7 @@ void evex_disp8_test(void *instr, struct
 
     if ( cpu_has_avx512f )
     {
+        RUN(vaes, all);
         RUN(vpclmulqdq, all);
     }
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_vaes      (cp.feat.vaes && xcr0_mask(6))
 #define cpu_has_vpclmulqdq (cp.feat.vpclmulqdq && xcr0_mask(6))
 #define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -541,7 +541,7 @@ static const struct ext0f38_table {
     [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
+    [0xdc ... 0xdf] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xf0] = { .two_op = 1 },
     [0xf1] = { .to_mem = 1, .two_op = 1 },
     [0xf2 ... 0xf3] = {},
@@ -1922,6 +1922,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_vaes()        vcpu_has(         7, ECX,  9, ctxt, ops)
 #define vcpu_has_vpclmulqdq()  vcpu_has(         7, ECX, 10, ctxt, ops)
 #define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
@@ -8924,13 +8925,9 @@ x86_emulate(
     case X86EMUL_OPC_66(0x0f38, 0xdb):     /* aesimc xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdb): /* vaesimc xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdc):     /* aesenc xmm/m128,xmm,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xdc): /* vaesenc xmm/m128,xmm,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdd):     /* aesenclast xmm/m128,xmm,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xdd): /* vaesenclast xmm/m128,xmm,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xde):     /* aesdec xmm/m128,xmm,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xde): /* vaesdec xmm/m128,xmm,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdf):     /* aesdeclast xmm/m128,xmm,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xdf): /* vaesdeclast xmm/m128,xmm,xmm */
         host_and_vcpu_must_have(aesni);
         if ( vex.opcx == vex_none )
             goto simd_0f38_common;
@@ -9630,6 +9627,25 @@ x86_emulate(
         host_and_vcpu_must_have(avx512er);
         goto simd_zmm_scalar_sae;
 
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xdc):  /* vaesenc {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xdd):  /* vaesenclast {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xde):  /* vaesdec {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xdf):  /* vaesdeclast {x,y}mm/mem,{x,y}mm,{x,y}mm */
+        if ( !vex.l )
+            host_and_vcpu_must_have(aesni);
+        else
+            host_and_vcpu_must_have(vaes);
+        goto simd_0f_avx;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xdc): /* vaesenc [xyz]mm/mem,[xyz]mm,[xyz]mm */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xdd): /* vaesenclast [xyz]mm/mem,[xyz]mm,[xyz]mm */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xde): /* vaesdec [xyz]mm/mem,[xyz]mm,[xyz]mm */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xdf): /* vaesdeclast [xyz]mm/mem,[xyz]mm,[xyz]mm */
+        host_and_vcpu_must_have(vaes);
+        generate_exception_if(evex.brs || evex.opmsk, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC(0x0f38, 0xf0): /* movbe m,r */
     case X86EMUL_OPC(0x0f38, 0xf1): /* movbe r,m */
         vcpu_must_have(movbe);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -113,6 +113,7 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_vaes            boot_cpu_has(X86_FEATURE_VAES)
 #define cpu_has_vpclmulqdq      boot_cpu_has(X86_FEATURE_VPCLMULQDQ)
 #define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
 #define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(VAES,          6*32+ 9) /*A  Vector AES Instrs */
 XEN_CPUFEATURE(VPCLMULQDQ,    6*32+10) /*A  Vector Carry-less Multiplication Instrs */
 XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
 XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -257,7 +257,7 @@ def crunch_numbers(state):
         # feature flags.  If want to use AVX512, AVX2 must be supported and
         # enabled.  Certain later extensions, acting on 256-bit vectors of
         # integers, better depend on AVX2 than AVX.
-        AVX2: [AVX512F, VPCLMULQDQ],
+        AVX2: [AVX512F, VAES, VPCLMULQDQ],
 
         # AVX512F is taken to mean hardware support for 512bit registers
         # (which in practice depends on the EVEX prefix to encode) as well





* [PATCH v7 47/49] x86emul: support GFNI insns
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (46 preceding siblings ...)
  2018-12-19 15:07   ` [PATCH v7 46/49] x86emul: support VAES insns Jan Beulich
@ 2018-12-19 15:07   ` Jan Beulich
  2018-12-19 15:07   ` [PATCH v7 48/49] x86emul: restore ordering within main switch statement Jan Beulich
  2018-12-19 15:08   ` [PATCH v7 49/49] tools: re-sync CPUID leaf 7 tables Jan Beulich
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the ISA extensions document revision 035 is ambiguous
regarding fault suppression for VGF2P8MULB: the text says it's
supported, while the exception class listed is E4NF, which implies no
fault suppression. Given the wording here and for the other two insns
I'm inclined to trust the text over the exception reference; this
reading was also confirmed informally.

As to the feature dependency adjustment: while strictly speaking SSE
is a sufficient prereq (to have XMM registers), vectors of bytes and
qwords were introduced only with SSE2. gcc, for example, uses a
similar connection in its respective intrinsics header.
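
For reference, GF2P8MULB multiplies bytes in GF(2^8) with the
reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), per the SDM. A
minimal scalar sketch (not part of the patch) of the per-byte
operation the new test case exercises:

#include <stdint.h>

/* Multiply two GF(2^8) elements, reducing modulo 0x11B. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;

    while ( b )
    {
        if ( b & 1 )
            r ^= a;
        b >>= 1;
        /* Multiply a by x, folding a dropped x^8 term back in. */
        a = (a << 1) ^ ((a & 0x80) ? 0x1b : 0);
    }

    return r;
}

The simd-gf.c test added below builds on this: for each byte value it
obtains the multiplicative inverse via vgf2p8affineinvqb with the
identity matrix, then checks that gf2p8mulb of value and inverse
yields 1 (0 for the zero element, which has no inverse).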

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -22,7 +22,8 @@ CFLAGS += $(CFLAGS_xeninclude)
 SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er avx512vbmi
 FMA := fma4 fma
 SG := avx2-sg avx512f-sg avx512vl-sg
-TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
+GF := sse2-gf avx2-gf avx512bw-gf
+TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(GF)
 
 OPMASK := avx512f avx512dq avx512bw
 
@@ -145,12 +146,17 @@ $(1)-cflags := \
 	   $(foreach flt,$($(1)-flts), \
 	     "-D_$(vec)x$(idx)f$(flt) -m$(1:-sg=) $(call non-sse,$(1)) -Os -DVEC_MAX=$(vec) -DIDX_SIZE=$(idx) -DFLOAT_SIZE=$(flt)")))
 endef
+define simd-gf-defs
+$(1)-cflags := $(foreach vec,$($(1:-gf=)-vecs), \
+	         "-D_$(vec) -mgfni -m$(1:-gf=) $(call non-sse,$(1)) -Os -DVEC_SIZE=$(vec)")
+endef
 define opmask-defs
 $(1)-opmask-cflags := $(foreach vec,$($(1)-opmask-vecs), "-D_$(vec) -m$(1) -Os -DSIZE=$(vec)")
 endef
 
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
+$(foreach flavor,$(GF),$(eval $(call simd-gf-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
 first-string = $(shell for s in $(1); do echo "$$s"; break; done)
@@ -200,6 +206,9 @@ $(addsuffix .c,$(FMA)):
 $(addsuffix .c,$(SG)):
 	ln -sf simd-sg.c $@
 
+$(addsuffix .c,$(GF)):
+	ln -sf simd-gf.c $@
+
 $(addsuffix .h,$(SIMD) $(FMA) $(SG)): simd.h
 
 xop.h avx512f.h: simd-fma.c
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -591,6 +591,12 @@ static const struct test avx512_vpopcntd
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
 
+static const struct test gfni_all[] = {
+    INSN(gf2p8affineinvqb, 66, 0f3a, cf, vl, q, vl),
+    INSN(gf2p8affineqb,    66, 0f3a, ce, vl, q, vl),
+    INSN(gf2p8mulb,        66, 0f38, cf, vl, b, vl),
+};
+
 /*
  * The uses of b in this table are simply (one of) the shortest form(s) of
  * saying "no broadcast" without introducing a 128-bit granularity enumerator.
@@ -987,6 +993,7 @@ void evex_disp8_test(void *instr, struct
 
     if ( cpu_has_avx512f )
     {
+        RUN(gfni, all);
         RUN(vaes, all);
         RUN(vpclmulqdq, all);
     }
--- /dev/null
+++ b/tools/tests/x86_emulator/simd-gf.c
@@ -0,0 +1,80 @@
+#define UINT_SIZE 1
+
+#include "simd.h"
+ENTRY(gf_test);
+
+#if VEC_SIZE == 16
+# define GF(op, s, a...) __builtin_ia32_vgf2p8 ## op ## _v16qi ## s(a)
+#elif VEC_SIZE == 32
+# define GF(op, s, a...) __builtin_ia32_vgf2p8 ## op ## _v32qi ## s(a)
+#elif VEC_SIZE == 64
+# define GF(op, s, a...) __builtin_ia32_vgf2p8 ## op ## _v64qi ## s(a)
+#endif
+
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# define mul(x, y) GF(mulb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0)
+# define transform(m, dir, x, c) ({ \
+    vec_t t_; \
+    asm ( "vgf2p8affine" #dir "qb %[imm], %[matrix]%{1to%c[n]%}, %[src], %[dst]" \
+          : [dst] "=v" (t_) \
+          : [matrix] "m" (m), [src] "v" (x), [imm] "i" (c), [n] "i" (VEC_SIZE / 8) ); \
+    t_; \
+})
+#else
+# if defined(__AVX2__)
+#  define bcstq(x) ({ \
+    vdi_t t_; \
+    asm ( "vpbroadcastq %1, %0" : "=x" (t_) : "m" (x) ); \
+    t_; \
+})
+#  define to_bool(cmp) B(ptestc, , cmp, (vdi_t){} == 0)
+# else
+#  define bcstq(x) ((vdi_t){x, x})
+#  define to_bool(cmp) (__builtin_ia32_pmovmskb128(cmp) == 0xffff)
+# endif
+# define eq(x, y) to_bool((x) == (y))
+# define mul(x, y) GF(mulb, , (vqi_t)(x), (vqi_t)(y))
+# define transform(m, dir, x, c) ({ \
+    vdi_t m_ = bcstq(m); \
+    touch(m_); \
+    ((vec_t)GF(affine ## dir ## qb, , (vqi_t)(x), (vqi_t)m_, c)); \
+})
+#endif
+
+const unsigned __attribute__((mode(DI))) ident = 0x0102040810204080ULL;
+
+int gf_test(void)
+{
+    unsigned int i;
+    vec_t src, one;
+
+    for ( i = 0; i < ELEM_COUNT; ++i )
+    {
+        src[i] = i;
+        one[i] = 1;
+    }
+
+    /* Special case for first iteration. */
+    one[0] = 0;
+
+    do {
+        vec_t inv = transform(ident, inv, src, 0);
+
+        touch(src);
+        touch(inv);
+        if ( !eq(mul(src, inv), one) ) return __LINE__;
+
+        touch(src);
+        touch(inv);
+        if ( !eq(mul(inv, src), one) ) return __LINE__;
+
+        one[0] = 1;
+
+        src += ELEM_COUNT;
+        i += ELEM_COUNT;
+    } while ( i < 256 );
+
+    return 0;
+}
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -11,12 +11,14 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "3dnow.h"
 #include "sse.h"
 #include "sse2.h"
+#include "sse2-gf.h"
 #include "sse4.h"
 #include "avx.h"
 #include "fma4.h"
 #include "fma.h"
 #include "avx2.h"
 #include "avx2-sg.h"
+#include "avx2-gf.h"
 #include "xop.h"
 #include "avx512f-opmask.h"
 #include "avx512dq-opmask.h"
@@ -25,6 +27,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-sg.h"
 #include "avx512vl-sg.h"
 #include "avx512bw.h"
+#include "avx512bw-gf.h"
 #include "avx512dq.h"
 #include "avx512er.h"
 #include "avx512vbmi.h"
@@ -138,6 +141,26 @@ static bool simd_check_avx512vbmi_vl(voi
     return cpu_has_avx512_vbmi && cpu_has_avx512vl;
 }
 
+static bool simd_check_sse2_gf(void)
+{
+    return cpu_has_gfni && cpu_has_sse2;
+}
+
+static bool simd_check_avx2_gf(void)
+{
+    return cpu_has_gfni && cpu_has_avx2;
+}
+
+static bool simd_check_avx512bw_gf(void)
+{
+    return cpu_has_gfni && cpu_has_avx512bw;
+}
+
+static bool simd_check_avx512bw_gf_vl(void)
+{
+    return cpu_has_gfni && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -395,6 +418,12 @@ static const struct {
     AVX512VL(_VBMI+VL u16x8, avx512vbmi,    16u2),
     AVX512VL(_VBMI+VL s16x16, avx512vbmi,   32i2),
     AVX512VL(_VBMI+VL u16x16, avx512vbmi,   32u2),
+    SIMD(GFNI (legacy),      sse2_gf,         16),
+    SIMD(GFNI (VEX/x16),     avx2_gf,         16),
+    SIMD(GFNI (VEX/x32),     avx2_gf,         32),
+    SIMD(GFNI (EVEX/x64),    avx512bw_gf,     64),
+    AVX512VL(VL+GFNI (x16),  avx512bw_gf,     16),
+    AVX512VL(VL+GFNI (x32),  avx512bw_gf,     32),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_gfni       cp.feat.gfni
 #define cpu_has_vaes      (cp.feat.vaes && xcr0_mask(6))
 #define cpu_has_vpclmulqdq (cp.feat.vpclmulqdq && xcr0_mask(6))
 #define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -540,6 +540,7 @@ static const struct ext0f38_table {
     [0xcb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xcf] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xf0] = { .two_op = 1 },
@@ -619,6 +620,7 @@ static const struct ext0f3a_table {
     [0x7c ... 0x7d] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x7e ... 0x7f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0xcc] = { .simd_size = simd_other },
+    [0xce ... 0xcf] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xdf] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xf0] = {},
 };
@@ -1922,6 +1924,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_gfni()        vcpu_has(         7, ECX,  8, ctxt, ops)
 #define vcpu_has_vaes()        vcpu_has(         7, ECX,  9, ctxt, ops)
 #define vcpu_has_vpclmulqdq()  vcpu_has(         7, ECX, 10, ctxt, ops)
 #define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
@@ -9627,6 +9630,21 @@ x86_emulate(
         host_and_vcpu_must_have(avx512er);
         goto simd_zmm_scalar_sae;
 
+    case X86EMUL_OPC_66(0x0f38, 0xcf):      /* gf2p8mulb xmm/m128,xmm */
+        host_and_vcpu_must_have(gfni);
+        goto simd_0f38_common;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xcf):  /* vgf2p8mulb {x,y}mm/mem,{x,y}mm,{x,y}mm */
+        host_and_vcpu_must_have(gfni);
+        generate_exception_if(vex.w, EXC_UD);
+        goto simd_0f_avx;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcf): /* vgf2p8mulb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(gfni);
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
+        elem_bytes = 1;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdc):  /* vaesenc {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdd):  /* vaesenclast {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xde):  /* vaesdec {x,y}mm/mem,{x,y}mm,{x,y}mm */
@@ -10372,6 +10390,24 @@ x86_emulate(
         op_bytes = 16;
         goto simd_0f3a_common;
 
+    case X86EMUL_OPC_66(0x0f3a, 0xce):      /* gf2p8affineqb $imm8,xmm/m128,xmm */
+    case X86EMUL_OPC_66(0x0f3a, 0xcf):      /* gf2p8affineinvqb $imm8,xmm/m128,xmm */
+        host_and_vcpu_must_have(gfni);
+        goto simd_0f3a_common;
+
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0xce):  /* vgf2p8affineqb $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0xcf):  /* vgf2p8affineinvqb $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
+        host_and_vcpu_must_have(gfni);
+        generate_exception_if(!vex.w, EXC_UD);
+        goto simd_0f_imm8_avx;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0xce): /* vgf2p8affineqb $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0xcf): /* vgf2p8affineinvqb $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(gfni);
+        generate_exception_if(!evex.w, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_66(0x0f3a, 0xdf):     /* aeskeygenassist $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0xdf): /* vaeskeygenassist $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(aesni);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -113,6 +113,7 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_gfni            boot_cpu_has(X86_FEATURE_GFNI)
 #define cpu_has_vaes            boot_cpu_has(X86_FEATURE_VAES)
 #define cpu_has_vpclmulqdq      boot_cpu_has(X86_FEATURE_VPCLMULQDQ)
 #define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(GFNI,          6*32+ 8) /*A  Galois Field Instrs */
 XEN_CPUFEATURE(VAES,          6*32+ 9) /*A  Vector AES Instrs */
 XEN_CPUFEATURE(VPCLMULQDQ,    6*32+10) /*A  Vector Carry-less Multiplication Instrs */
 XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -200,7 +200,7 @@ def crunch_numbers(state):
               AESNI, SHA],
 
         # SSE2 was re-specified as core instructions for 64bit.
-        SSE2: [LM],
+        SSE2: [LM, GFNI],
 
         # SSE4.1 explicitly depends on SSE3 and SSSE3
         SSE3: [SSE4_1],





* [PATCH v7 48/49] x86emul: restore ordering within main switch statement
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (47 preceding siblings ...)
  2018-12-19 15:07   ` [PATCH v7 47/49] x86emul: support GFNI insns Jan Beulich
@ 2018-12-19 15:07   ` Jan Beulich
  2018-12-19 15:08   ` [PATCH v7 49/49] tools: re-sync CPUID leaf 7 tables Jan Beulich
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Incremental additions and/or mistakes have led to some code blocks
sitting in "unexpected" places. Re-sort the case blocks (by opcode
space, then major opcode, then 66/F3/F2 prefix, then legacy/VEX/EVEX
encoding).

As an exception, the opcode space 0x0f EVEX-encoded VPEXTRW is left in
its current place, to keep it close to the "pextr" label.

Pure code movement.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: New.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -7118,15 +7118,6 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
-    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7e): /* vmovq xmm/m64,xmm */
-    case X86EMUL_OPC_EVEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
-        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.brs,
-                              EXC_UD);
-        host_and_vcpu_must_have(avx512f);
-        d |= TwoOp;
-        op_bytes = 8;
-        goto simd_zmm;
-
     case X86EMUL_OPC_66(0x0f, 0xe7):     /* movntdq xmm,m128 */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe7): /* vmovntdq {x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
@@ -7524,6 +7515,15 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7e): /* vmovq xmm/m64,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
+        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.brs,
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        d |= TwoOp;
+        op_bytes = 8;
+        goto simd_zmm;
+
     case X86EMUL_OPC(0x0f, 0x80) ... X86EMUL_OPC(0x0f, 0x8f): /* jcc (near) */
         if ( test_cc(b, _regs.eflags) )
             jmp_rel((int32_t)src.val);
@@ -8624,63 +8624,6 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        host_and_vcpu_must_have(avx512bw);
-        generate_exception_if(!evex.w || evex.brs, EXC_UD);
-        elem_bytes = 2;
-        goto avx512f_no_sae;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
-        op_bytes = elem_bytes;
-        generate_exception_if(evex.w || evex.brs, EXC_UD);
-    avx512_broadcast:
-        /*
-         * For the respective code below the main switch() to work we need to
-         * fold op_mask here: A source element gets read whenever any of its
-         * respective destination elements' mask bits is set.
-         */
-        if ( fault_suppression )
-        {
-            n = 1 << ((b & 3) - evex.w);
-            EXPECT(elem_bytes > 0);
-            ASSERT(op_bytes == n * elem_bytes);
-            for ( i = n; i < (16 << evex.lr) / elem_bytes; i += n )
-                op_mask |= (op_mask >> i) & ((1 << n) - 1);
-        }
-        goto avx512f_no_sae;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
-                                            /* vbroadcastf64x4 m256,zmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
-                                            /* vbroadcasti64x4 m256,zmm{k} */
-        generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
-        /* fall through */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
-                                            /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr, EXC_UD);
-        /* fall through */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
-                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
-        if ( b == 0x59 )
-            op_bytes = 8;
-        generate_exception_if(evex.brs, EXC_UD);
-        if ( !evex.w )
-            host_and_vcpu_must_have(avx512dq);
-        goto avx512_broadcast;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
-                                            /* vbroadcastf64x2 m128,{y,z}mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
-                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
-        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.brs,
-                              EXC_UD);
-        if ( evex.w )
-            host_and_vcpu_must_have(avx512dq);
-        goto avx512_broadcast;
-
     case X86EMUL_OPC_66(0x0f38, 0x20): /* pmovsxbw xmm/m64,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x21): /* pmovsxbd xmm/m32,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x22): /* pmovsxbq xmm/m16,xmm */
@@ -8714,47 +8657,14 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_1);
         goto simd_0f38_common;
 
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x13): /* vcvtph2ps xmm/mem,{x,y}mm */
-        generate_exception_if(vex.w, EXC_UD);
-        host_and_vcpu_must_have(f16c);
-        op_bytes = 8 << vex.l;
-        goto simd_0f_ymm;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
-        generate_exception_if(evex.w || (ea.type != OP_REG && evex.brs), EXC_UD);
-        host_and_vcpu_must_have(avx512f);
-        if ( !evex.brs )
-            avx512_vlen_check(false);
-        op_bytes = 8 << evex.lr;
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(!evex.w || evex.brs, EXC_UD);
         elem_bytes = 2;
-        goto simd_zmm;
-
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
-        generate_exception_if(!vex.l || vex.w, EXC_UD);
-        goto simd_0f_avx2;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
-        generate_exception_if(!evex.lr, EXC_UD);
-        fault_suppression = false;
         goto avx512f_no_sae;
 
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x23): /* vpmovsxwd xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x25): /* vpmovsxdq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x30): /* vpmovzxbw xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x33): /* vpmovzxwd xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x35): /* vpmovzxdq xmm/mem,{x,y}mm */
-        op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
-        goto simd_0f_int;
-
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10): /* vpmovuswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20): /* vpmovswb [xyz]mm,{x,y}mm/mem{k} */
@@ -8800,6 +8710,96 @@ x86_emulate(
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x13): /* vcvtph2ps xmm/mem,{x,y}mm */
+        generate_exception_if(vex.w, EXC_UD);
+        host_and_vcpu_must_have(f16c);
+        op_bytes = 8 << vex.l;
+        goto simd_0f_ymm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w || (ea.type != OP_REG && evex.brs), EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.brs )
+            avx512_vlen_check(false);
+        op_bytes = 8 << evex.lr;
+        elem_bytes = 2;
+        goto simd_zmm;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
+        generate_exception_if(!vex.l || vex.w, EXC_UD);
+        goto simd_0f_avx2;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
+        op_bytes = elem_bytes;
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
+    avx512_broadcast:
+        /*
+         * For the respective code below the main switch() to work we need to
+         * fold op_mask here: A source element gets read whenever any of its
+         * respective destination elements' mask bits is set.
+         */
+        if ( fault_suppression )
+        {
+            n = 1 << ((b & 3) - evex.w);
+            EXPECT(elem_bytes > 0);
+            ASSERT(op_bytes == n * elem_bytes);
+            for ( i = n; i < (16 << evex.lr) / elem_bytes; i += n )
+                op_mask |= (op_mask >> i) & ((1 << n) - 1);
+        }
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
+                                            /* vbroadcastf64x4 m256,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
+                                            /* vbroadcasti64x4 m256,zmm{k} */
+        generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
+                                            /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
+                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
+        if ( b == 0x59 )
+            op_bytes = 8;
+        generate_exception_if(evex.brs, EXC_UD);
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcastf64x2 m128,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
+        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.brs,
+                              EXC_UD);
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x23): /* vpmovsxwd xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x25): /* vpmovsxdq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x30): /* vpmovzxbw xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x33): /* vpmovzxwd xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x35): /* vpmovzxdq xmm/mem,{x,y}mm */
+        op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
+        goto simd_0f_int;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x29): /* vpmov{b,w}2m [xyz]mm,k */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x39): /* vpmov{d,q}2m [xyz]mm,k */
         generate_exception_if(!evex.r || !evex.R, EXC_UD);
@@ -8907,6 +8907,52 @@ x86_emulate(
         break;
     }
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2c): /* vscalefp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x42): /* vgetexpp{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type != OP_REG || !evex.brs )
+            avx512_vlen_check(false);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2d): /* vscalefs{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x43): /* vgetexps{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+    simd_zmm_scalar_sae:
+        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
+        if ( !evex.brs )
+            avx512_vlen_check(true);
+        goto simd_zmm;
+
     case X86EMUL_OPC_66(0x0f38, 0x37): /* pcmpgtq xmm/m128,xmm */
         host_and_vcpu_must_have(sse4_2);
         goto simd_0f38_common;
@@ -8939,6 +8985,31 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x50): /* vpdpbusd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x51): /* vpdpbusds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x52): /* vpdpwssd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x53): /* vpdpwssds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vnni);
+        generate_exception_if(evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,{x,y}mm */
+        op_bytes = 1 << ((!(b & 0x20) * 2) + (b & 1));
+        /* fall through */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x46): /* vpsravd {x,y}mm/mem,{x,y}mm,{x,y}mm */
+        generate_exception_if(vex.w, EXC_UD);
+        goto simd_0f_avx2;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(true);
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0x52): /* vp4dpwssd m128,zmm+3,zmm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0x53): /* vp4dpwssds m128,zmm+3,zmm{k} */
         host_and_vcpu_must_have(avx512_4vnniw);
@@ -8961,23 +9032,6 @@ x86_emulate(
         host_and_vcpu_must_have(avx512_vpopcntdq);
         goto avx512f_no_sae;
 
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,{x,y}mm */
-        op_bytes = 1 << ((!(b & 0x20) * 2) + (b & 1));
-        /* fall through */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x46): /* vpsravd {x,y}mm/mem,{x,y}mm,{x,y}mm */
-        generate_exception_if(vex.w, EXC_UD);
-        goto simd_0f_avx2;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
-        host_and_vcpu_must_have(avx512f);
-        generate_exception_if(evex.brs, EXC_UD);
-        avx512_vlen_check(true);
-        goto simd_zmm;
-
     case X86EMUL_OPC_VEX_66(0x0f38, 0x5a): /* vbroadcasti128 m128,ymm */
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
@@ -9359,60 +9413,6 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2c): /* vscalefp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x42): /* vgetexpp{s,d} [xyz]mm/mem,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        host_and_vcpu_must_have(avx512f);
-        if ( ea.type != OP_REG || !evex.brs )
-            avx512_vlen_check(false);
-        goto simd_zmm;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2d): /* vscalefs{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x43): /* vgetexps{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
-        host_and_vcpu_must_have(avx512f);
-    simd_zmm_scalar_sae:
-        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
-        if ( !evex.brs )
-            avx512_vlen_check(true);
-        goto simd_zmm;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x50): /* vpdpbusd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x51): /* vpdpbusds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x52): /* vpdpwssd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x53): /* vpdpwssds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        host_and_vcpu_must_have(avx512_vnni);
-        generate_exception_if(evex.w, EXC_UD);
-        goto avx512f_no_sae;
-
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9a): /* v4fmaddps m128,zmm+3,zmm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0xaa): /* v4fnmaddps m128,zmm+3,zmm{k} */
         host_and_vcpu_must_have(avx512_4fmaps);
@@ -10243,11 +10243,6 @@ x86_emulate(
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x4a): /* vblendvps {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x4b): /* vblendvpd {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
-        generate_exception_if(vex.w, EXC_UD);
-        goto simd_0f_imm8_avx;
-
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x48): /* vpermil2ps $imm,{x,y}mm/mem,{x,y}mm,{x,y}mm,{x,y}mm */
                                            /* vpermil2ps $imm,{x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x49): /* vpermil2pd $imm,{x,y}mm/mem,{x,y}mm,{x,y}mm,{x,y}mm */
@@ -10255,6 +10250,11 @@ x86_emulate(
         host_and_vcpu_must_have(xop);
         goto simd_0f_imm8_ymm;
 
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x4a): /* vblendvps {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x4b): /* vblendvpd {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
+        generate_exception_if(vex.w, EXC_UD);
+        goto simd_0f_imm8_avx;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x4c): /* vpblendvb {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_int_imm8;





* [PATCH v7 49/49] tools: re-sync CPUID leaf 7 tables
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
                     ` (48 preceding siblings ...)
  2018-12-19 15:07   ` [PATCH v7 48/49] x86emul: restore ordering within main switch statement Jan Beulich
@ 2018-12-19 15:08   ` Jan Beulich
  2019-03-14 11:07     ` Andrew Cooper
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2018-12-19 15:08 UTC (permalink / raw)
  To: xen-devel
  Cc: George Dunlap, Andrew Cooper, Wei Liu, Ian Jackson, Roger Pau Monne

Bring libxl's table in line with the public header, and update
xen-cpuid's to the latest information available in Intel's
documentation (SDM ver 068 and ISA extensions ver 035), with (as
before) the exception of MAWAU.

Some pre-existing strings get changed to match SDM naming. This should
be benign in xen-cpuid, and I hope it's also acceptable in libxl, where
people actually using the slightly wrong names would have to update
their guest config files.
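
For example, a guest config which previously used the old libxl name
(a hypothetical snippet, shown only to illustrate the renaming)

    cpuid = [ "host,avx512vbmi=0" ]

would now need to spell the flag "avx512-vbmi=0" instead.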

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/tools/libxl/libxl_cpuid.c
+++ b/tools/libxl/libxl_cpuid.c
@@ -185,7 +185,7 @@ int libxl_cpuid_parse_config(libxl_cpuid
         {"rdseed",       0x00000007,  0, CPUID_REG_EBX, 18,  1},
         {"adx",          0x00000007,  0, CPUID_REG_EBX, 19,  1},
         {"smap",         0x00000007,  0, CPUID_REG_EBX, 20,  1},
-        {"avx512ifma",   0x00000007,  0, CPUID_REG_EBX, 21,  1},
+        {"avx512-ifma",  0x00000007,  0, CPUID_REG_EBX, 21,  1},
         {"clflushopt",   0x00000007,  0, CPUID_REG_EBX, 23,  1},
         {"clwb",         0x00000007,  0, CPUID_REG_EBX, 24,  1},
         {"avx512pf",     0x00000007,  0, CPUID_REG_EBX, 26,  1},
@@ -195,10 +195,19 @@ int libxl_cpuid_parse_config(libxl_cpuid
         {"avx512bw",     0x00000007,  0, CPUID_REG_EBX, 30,  1},
         {"avx512vl",     0x00000007,  0, CPUID_REG_EBX, 31,  1},
 
-        {"avx512vbmi",   0x00000007,  0, CPUID_REG_ECX,  1,  1},
+        {"prefetchwt1",  0x00000007,  0, CPUID_REG_ECX,  0,  1},
+        {"avx512-vbmi",  0x00000007,  0, CPUID_REG_ECX,  1,  1},
         {"umip",         0x00000007,  0, CPUID_REG_ECX,  2,  1},
         {"pku",          0x00000007,  0, CPUID_REG_ECX,  3,  1},
         {"ospke",        0x00000007,  0, CPUID_REG_ECX,  4,  1},
+        {"avx512-vbmi2", 0x00000007,  0, CPUID_REG_ECX,  6,  1},
+        {"gfni",         0x00000007,  0, CPUID_REG_ECX,  8,  1},
+        {"vaes",         0x00000007,  0, CPUID_REG_ECX,  9,  1},
+        {"vpclmulqdq",   0x00000007,  0, CPUID_REG_ECX, 10,  1},
+        {"avx512-vnni",  0x00000007,  0, CPUID_REG_ECX, 11,  1},
+        {"avx512-bitalg",0x00000007,  0, CPUID_REG_ECX, 12,  1},
+        {"avx512-vpopcntdq",0x00000007,0,CPUID_REG_ECX, 14,  1},
+        {"rdpid",        0x00000007,  0, CPUID_REG_ECX, 22,  1},
 
         {"avx512-4vnniw",0x00000007,  0, CPUID_REG_EDX,  2,  1},
         {"avx512-4fmaps",0x00000007,  0, CPUID_REG_EDX,  3,  1},
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -104,8 +104,8 @@ static const char *str_7b0[32] =
     [14] = "mpx",      [15] = "pqe",
     [16] = "avx512f",  [17] = "avx512dq",
     [18] = "rdseed",   [19] = "adx",
-    [20] = "smap",     [21] = "avx512ifma",
-    [22] = "pcomit",   [23] = "clflushopt",
+    [20] = "smap",     [21] = "avx512-ifma",
+    [22] = "pcommit",  [23] = "clflushopt",
     [24] = "clwb",     [25] = "pt",
     [26] = "avx512pf", [27] = "avx512er",
     [28] = "avx512cd", [29] = "sha",
@@ -120,13 +120,20 @@ static const char *str_Da1[32] =
 
 static const char *str_7c0[32] =
 {
-    [ 0] = "prechwt1", [ 1] = "avx512vbmi",
-    [ 2] = "umip",     [ 3] = "pku",
-    [ 4] = "ospke",
-
+    [ 0] = "prefetchwt1",      [ 1] = "avx512_vbmi",
+    [ 2] = "umip",             [ 3] = "pku",
+    [ 4] = "ospke",            [ 5] = "waitpkg",
+    [ 6] = "avx512_vbmi2",
+    [ 8] = "gfni",             [ 9] = "vaes",
+    [10] = "vpclmulqdq",       [11] = "avx512_vnni",
+    [12] = "avx512_bitalg",
     [14] = "avx512_vpopcntdq",
 
     [22] = "rdpid",
+    /* 24 */                   [25] = "cldemote",
+    /* 26 */                   [27] = "movdiri",
+    [28] = "movdir64b",
+    [30] = "sgx_lc",
 };
 
 static const char *str_e7d[32] =
@@ -145,6 +152,9 @@ static const char *str_e8b[32] =
 static const char *str_7d0[32] =
 {
     [ 2] = "avx512_4vnniw", [ 3] = "avx512_4fmaps",
+    [ 4] = "fsrm",
+
+    [18] = "pconfig",
 
     [26] = "ibrsb",         [27] = "stibp",
     [28] = "l1d_flush",     [29] = "arch_caps",






* Re: [PATCH v7 01/49] x86emul: rename evex.br to evex.brs
  2018-12-19 14:35   ` [PATCH v7 01/49] x86emul: rename evex.br to evex.brs Jan Beulich
@ 2018-12-19 15:14     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-12-19 15:14 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/12/2018 14:35, Jan Beulich wrote:
> This is to better reflect that it's an abbreviation for "broadcast,
> rounding, or SAE" rather than just "broadcast".
>
> Take the opportunity and also add SDM naming comments to both union vex
> and union evex.
>
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
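
For reference, a minimal sketch of the third EVEX byte (P2), whose bit 4
the renamed field models. The bit layout is per the SDM; the names follow
the fields used elsewhere in this series, but the struct itself is only
my illustration, not the actual union in x86_emulate.c:

    #include <stdint.h>

    /* Bit order assumes GCC on x86. */
    struct evex_p2 {
        uint8_t opmsk:3; /* aaa: opmask register */
        uint8_t RX:1;    /* V': 5th bit of the vvvv register specifier */
        uint8_t brs:1;   /* b: broadcast, rounding, or SAE */
        uint8_t lr:2;    /* L'L: vector length (or rounding control) */
        uint8_t z:1;     /* zeroing- vs merging-masking */
    };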

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.


* Re: [PATCH v7 02/49] x86emul: support AVX512{F, BW} shift/rotate insns
  2018-12-19 14:36   ` [PATCH v7 02/49] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
@ 2018-12-19 16:00     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2018-12-19 16:00 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/12/2018 14:36, Jan Beulich wrote:
> Note that simd_packed_fp for the opcode space 0f38 major opcodes 14 and
> 15 is not really correct, but sufficient for the purposes here. Further
> adjustments may later be needed for the down conversion unsigned
> saturating VPMOV* insns, first and foremost for the different Disp8
> scaling those ones use.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
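
For context, a worked Disp8 example (mine, not from the patch): EVEX
compresses 8-bit displacements by scaling them with the memory operand
size N, so on a full 64-byte zmm operand a disp8 of 1 stands for a byte
offset of 64. Down-converting insns like vpmovdb store fewer bytes than
the source vector is wide, and hence use a smaller N.

    static void disp8_demo(const void *buf)
    {
        /* 0x40 is a multiple of 64, hence encodable as disp8 == 1. */
        asm volatile ( "vpaddd 0x40(%0), %%zmm1, %%zmm2"
                       :: "r" (buf) : "xmm2", "memory" );
        /* 0x44 is not, forcing the assembler to emit a full disp32. */
        asm volatile ( "vpaddd 0x44(%0), %%zmm1, %%zmm2"
                       :: "r" (buf) : "xmm2", "memory" );
    }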

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v7 03/49] x86emul: support AVX512{F, BW, DQ} extract insns
  2018-12-19 14:36   ` [PATCH v7 03/49] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
@ 2018-12-19 18:20     ` Andrew Cooper
  2018-12-20  7:49       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2018-12-19 18:20 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/12/2018 14:36, Jan Beulich wrote:
> @@ -280,6 +285,12 @@ static const struct test avx512bw_all[]
>      INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
>  };
>  
> +static const struct test avx512bw_128[] = {
> +    INSN(pextrb, 66, 0f3a, 14, el, b, el),
> +//       pextrw, 66,   0f, c5,     w

I presume this isn't tested, due to its lack of a memory operand?

It does appear to be a particularly odd encoding.

~Andrew

> +    INSN(pextrw, 66, 0f3a, 15, el, w, el),
> +};
> +
>  static const struct test avx512dq_all[] = {
>      INSN_PFP(and,              0f, 54),
>      INSN_PFP(andn,             0f, 55),
>



* Re: [PATCH v7 03/49] x86emul: support AVX512{F, BW, DQ} extract insns
  2018-12-19 18:20     ` Andrew Cooper
@ 2018-12-20  7:49       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2018-12-20  7:49 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 19.12.18 at 19:20, <andrew.cooper3@citrix.com> wrote:
> On 19/12/2018 14:36, Jan Beulich wrote:
>> @@ -280,6 +285,12 @@ static const struct test avx512bw_all[]
>>      INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
>>  };
>>  
>> +static const struct test avx512bw_128[] = {
>> +    INSN(pextrb, 66, 0f3a, 14, el, b, el),
>> +//       pextrw, 66,   0f, c5,     w
> 
> I presume this isn't tested, due to its lack of a memory operand?

Correct - without a memory operand there's no Disp8 either, and
validating Disp8 handling is what this file is for. Nevertheless I
wanted to add all instructions to their tables, to have a way of
validating complete insn coverage at the end of the series.

> It does appear to be a particularly odd encoding.

Not sure what you find odd about it. When its original, legacy
form was first introduced, a memory operand was apparently
not deemed useful, and hence the (register-only) encoding put
the source operand where a memory operand would normally
be encoded. That's the common (but not exclusive) principle -
unless a memory write is intended, it's preferably the source
which gets encoded such that it can be a memory operand.
Later, when they extended the insn to have byte/dword/qword
equivalents, they allowed for memory operands, yet deemed it
more useful for those to be the destination (i.e. alongside
the GPR).
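
For concreteness (my illustration in legacy SSE terms, AT&T syntax; the
assembler picks the encoding from the operand kind):

    #include <stdint.h>

    static unsigned int pextrw_demo(uint16_t *mem)
    {
        unsigned int reg;

        /* Original SSE2 form, 66 0f c5: destination can only be a GPR. */
        asm ( "pextrw $0, %%xmm1, %0" : "=r" (reg) );
        /* SSE4.1 form, 66 0f 3a 15: destination may also be memory. */
        asm ( "pextrw $0, %%xmm1, %0" : "=m" (*mem) );

        return reg;
    }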

Jan




* Re: [PATCH v7 49/49] tools: re-sync CPUID leaf 7 tables
  2018-12-19 15:08   ` [PATCH v7 49/49] tools: re-sync CPUID leaf 7 tables Jan Beulich
@ 2019-03-14 11:07     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-03-14 11:07 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: George Dunlap, Ian Jackson, Wei Liu, Roger Pau Monne

On 19/12/2018 15:08, Jan Beulich wrote:
> Bring libxl's feature table in line with the public header, and update
> xen-cpuid's to the latest information available in Intel's documentation
> (SDM ver 068 and ISA extensions ver 035), with (as before) the exception
> of MAWAU.
>
> Some pre-existing strings get changed to match SDM naming. This should
> be benign in xen-cpuid, and I hope it's also acceptable in libxl, where
> people actually using the slightly wrong names would have to update
> their guest config files.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v7 04/49] x86emul: support AVX512{F, BW, DQ} insert insns
  2018-12-19 14:37   ` [PATCH v7 04/49] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
@ 2019-03-14 15:35     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-03-14 15:35 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/12/2018 14:37, Jan Beulich wrote:
> Also correct the comment of the AVX form of VINSERTPS.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v7 05/49] x86emul: basic AVX512F testing
  2018-12-19 14:37   ` [PATCH v7 05/49] x86emul: basic AVX512F testing Jan Beulich
@ 2019-03-14 15:42     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-03-14 15:42 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/12/2018 14:37, Jan Beulich wrote:
> Test various of the insns which have been implemented already.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2018-12-19 14:38   ` [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
@ 2019-03-14 16:38     ` Andrew Cooper
  2019-03-14 17:15       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-03-14 16:38 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/12/2018 14:38, Jan Beulich wrote:
> @@ -8444,6 +8465,45 @@ x86_emulate(
>          generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
>          goto simd_0f_avx2;
>  
> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
> +        host_and_vcpu_must_have(avx512bw);
> +        generate_exception_if(evex.w || evex.brs, EXC_UD);
> +        op_bytes = elem_bytes = 1 << (b & 1);
> +        /* See the comment at the avx512_broadcast label. */
> +        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
> +        goto avx512f_no_sae;
> +
> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
> +        host_and_vcpu_must_have(avx512bw);
> +        generate_exception_if(evex.w, EXC_UD);
> +        /* fall through */
> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
> +        generate_exception_if((ea.type != OP_REG || evex.brs ||
> +                               evex.reg != 0xf || !evex.RX),
> +                              EXC_UD);

generate_exception_if(ea.type != OP_REG || evex.brs ||
                      evex.reg != 0xf || !evex.RX, EXC_UD);

?

Otherwise, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v7 07/49] x86emul: basic AVX512VL testing
  2018-12-19 14:38   ` [PATCH v7 07/49] x86emul: basic AVX512VL testing Jan Beulich
@ 2019-03-14 16:39     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-03-14 16:39 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 19/12/2018 14:38, Jan Beulich wrote:
> Test the 128- and 256-bit variants of the insns which have been
> implemented already.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2019-03-14 16:38     ` Andrew Cooper
@ 2019-03-14 17:15       ` Jan Beulich
  2019-03-15 16:39         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-14 17:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 14.03.19 at 17:38, <andrew.cooper3@citrix.com> wrote:
> On 19/12/2018 14:38, Jan Beulich wrote:
>> @@ -8444,6 +8465,45 @@ x86_emulate(
>>          generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
>>          goto simd_0f_avx2;
>>  
>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
>> +        host_and_vcpu_must_have(avx512bw);
>> +        generate_exception_if(evex.w || evex.brs, EXC_UD);
>> +        op_bytes = elem_bytes = 1 << (b & 1);
>> +        /* See the comment at the avx512_broadcast label. */
>> +        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
>> +        goto avx512f_no_sae;
>> +
>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
>> +        host_and_vcpu_must_have(avx512bw);
>> +        generate_exception_if(evex.w, EXC_UD);
>> +        /* fall through */
>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
>> +        generate_exception_if((ea.type != OP_REG || evex.brs ||
>> +                               evex.reg != 0xf || !evex.RX),
>> +                              EXC_UD);
> 
> generate_exception_if(ea.type != OP_REG || evex.brs ||
>                       evex.reg != 0xf || !evex.RX, EXC_UD);
> 
> ?

Well, no - I don't really want the second argument on a continuation
line of the first one. Multiple full arguments on one line are fine
with me. If you want me to drop just the inner parentheses, that
would be fine (as mentioned in another context on this series, I
mainly added them because of what I understood your editor's
behavior to be when splitting lines like this, and I may easily have
misunderstood).
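
I.e. (my rendering of the preference just stated) something like

    generate_exception_if(ea.type != OP_REG || evex.brs ||
                          evex.reg != 0xf || !evex.RX,
                          EXC_UD);

would be fine, while having EXC_UD share a continuation line of the
first argument wouldn't be.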

> Otherwise, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Let me know on what variant(s) of the above this holds.

Jan




* [PATCH v8 00/50] x86emul: remaining AVX512 support
  2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
                   ` (11 preceding siblings ...)
  2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
@ 2019-03-15 10:30 ` Jan Beulich
  2019-03-15 10:36   ` [PATCH v8 01/50] x86emul: no need to set fault_suppression to false for VMOVNT* Jan Beulich
                     ` (49 more replies)
  12 siblings, 50 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:30 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This adds support for all AVX512* insns in SDM rev 069 as well as
for those from ISA extensions rev 035. Besides a few added patches
there are only minor changes from v7, perhaps apart from the
AVX512PF change to match the recently changed insn page. See
individual patches for details.

This goes on top of "x86emul: avoid speculative out of bounds
accesses", or else there's a conflict in at least the "gather" patch
here.

01: no need to set fault_suppression to false for VMOVNT*
02: support AVX512{F,BW,DQ} extract insns
03: support AVX512{F,BW,DQ} insert insns
04: basic AVX512F testing
05: support AVX512{F,BW,DQ} integer broadcast insns
06: basic AVX512VL testing
07: support AVX512{F,BW} zero- and sign-extending moves
08: support AVX512{F,BW} down conversion moves
09: support AVX512{F,BW} integer unpack insns
10: support AVX512{F,BW,_VBMI} full permute insns
11: support AVX512{F,BW} integer shuffle insns
12: support AVX512{BW,DQ} mask move insns
13: basic AVX512BW testing
14: basic AVX512DQ testing
15: support AVX512F move high/low insns
16: support AVX512F move duplicate insns
17: support AVX512{F,BW,VBMI} permute insns
18: support AVX512BW pack insns
19: support AVX512F floating-point conversion insns
20: support AVX512F legacy-equivalent packed int/FP conversion insns
21: support AVX512F legacy-equivalent scalar int/FP conversion insns
22: support AVX512DQ packed quad-int/FP conversion insns
23: support AVX512{F,DQ} uint-to-FP conversion insns
24: support AVX512{F,DQ} FP-to-uint conversion insns
25: support remaining AVX512F legacy-equivalent insns
26: support remaining AVX512BW legacy-equivalent insns
27: support AVX512{F,ER} reciprocal insns
28: support AVX512F floating point manipulation insns
29: support AVX512DQ floating point manipulation insns
30: support AVX512{F,_VBMI2} compress/expand insns
31: support remaining misc AVX512{F,BW} insns
32: support AVX512F gather insns
33: add high register S/G test cases
34: support AVX512F scatter insns
35: support AVX512PF insns
36: support AVX512CD insns
37: complete support of AVX512_VBMI insns
38: support of AVX512* population count insns
39: support of AVX512_IFMA insns
40: support remaining AVX512_VBMI2 insns
41: support AVX512_4FMAPS insns
42: support AVX512_4VNNIW insns
43: support AVX512_VNNI insns
44: support VPCLMULQDQ insns
45: support VAES insns
46: support GFNI insns
47: restore ordering within main switch statement
48: add an AES/VAES test case to the harness
49: add a SHA test case to the harness
50: add a PCLMUL/VPCLMUL test case to the harness

Jan




* [PATCH v8 01/50] x86emul: no need to set fault_suppression to false for VMOVNT*
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
@ 2019-03-15 10:36   ` Jan Beulich
  2019-03-15 16:52     ` Andrew Cooper
  2019-03-15 10:36   ` [PATCH v8 02/50] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
                     ` (48 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

When evex.opmsk is required to be zero there's no need to clear
fault_suppression, as it won't have been set to true in the first
place.
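
A minimal sketch of the invariant this relies on (my paraphrase, not
the verbatim x86_emulate.c code):

    #include <stdbool.h>

    static bool fault_suppression_sketch(unsigned int opmsk)
    {
        bool fault_suppression = false;

        if ( opmsk )          /* decode: only enabled with an opmask */
            fault_suppression = true;

        if ( opmsk )          /* these insns: #UD for opmsk != 0 ... */
            return false;

        return fault_suppression; /* ... so necessarily false here */
    }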

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Add this previously standalone patch into the series.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -5911,7 +5911,6 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x2b): /* vmovntp{s,d} [xyz]mm,mem */
         generate_exception_if(ea.type != OP_MEM || evex.opmsk, EXC_UD);
         sfence = true;
-        fault_suppression = false;
         /* fall through */
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x10): /* vmovup{s,d} [xyz]mm/mem,[xyz]mm{k} */
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x10): /* vmovs{s,d} mem,xmm{k} */
@@ -6795,7 +6794,6 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || evex.opmsk || evex.w,
                               EXC_UD);
         sfence = true;
-        fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6f): /* vmovdqa{32,64} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f, 0x6f): /* vmovdqu{32,64} [xyz]mm/mem,[xyz]mm{k} */




* [PATCH v8 02/50] x86emul: support AVX512{F, BW, DQ} extract insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
  2019-03-15 10:36   ` [PATCH v8 01/50] x86emul: no need to set fault_suppression to false for VMOVNT* Jan Beulich
@ 2019-03-15 10:36   ` Jan Beulich
  2019-03-15 17:51     ` Andrew Cooper
  2019-03-15 10:37   ` [PATCH v8 03/50] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
                     ` (47 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: Make use of d8s_dq64.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -212,6 +212,7 @@ static const struct test avx512f_all[] =
 };
 
 static const struct test avx512f_128[] = {
+    INSN(extractps, 66, 0f3a, 17, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -221,10 +222,14 @@ static const struct test avx512f_128[] =
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
+    INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
+    INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
+    INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -280,6 +285,12 @@ static const struct test avx512bw_all[]
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
 };
 
+static const struct test avx512bw_128[] = {
+    INSN(pextrb, 66, 0f3a, 14, el, b, el),
+//       pextrw, 66,   0f, c5,     w
+    INSN(pextrw, 66, 0f3a, 15, el, w, el),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -288,13 +299,21 @@ static const struct test avx512dq_all[]
     INSN_PFP(xor,              0f, 57),
 };
 
+static const struct test avx512dq_128[] = {
+    INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+};
+
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
+    INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
+    INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
@@ -632,7 +651,9 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, no128);
     RUN(avx512f, 512);
     RUN(avx512bw, all);
+    RUN(avx512bw, 128);
     RUN(avx512dq, all);
+    RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -512,9 +512,13 @@ static const struct ext0f3a_table {
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
-    [0x14 ... 0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1 },
+    [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
+    [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
+    [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
+    [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
     [0x18] = { .simd_size = simd_128 },
-    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none },
@@ -523,7 +527,8 @@ static const struct ext0f3a_table {
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128 },
-    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1 },
+    [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
@@ -2676,6 +2681,8 @@ x86_decode_0f3a(
      ... X86EMUL_OPC_66(0, 0x17):     /* pextr*, extractps */
     case X86EMUL_OPC_VEX_66(0, 0x14)
      ... X86EMUL_OPC_VEX_66(0, 0x17): /* vpextr*, vextractps */
+    case X86EMUL_OPC_EVEX_66(0, 0x14)
+     ... X86EMUL_OPC_EVEX_66(0, 0x17): /* vpextr*, vextractps */
     case X86EMUL_OPC_VEX_F2(0, 0xf0): /* rorx */
         break;
 
@@ -8878,9 +8885,9 @@ x86_emulate(
         opc[0] = b;
         /* Convert memory/GPR operand to (%rAX). */
         rex_prefix &= ~REX_B;
-        vex.b = 1;
+        evex.b = vex.b = 1;
         if ( !mode_64bit() )
-            vex.w = 0;
+            evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
         opc[3] = 0xc3;
@@ -8890,7 +8897,10 @@ x86_emulate(
             --opc;
         }
 
-        copy_REX_VEX(opc, rex_prefix, vex);
+        if ( evex_encoded() )
+            copy_EVEX(opc, evex);
+        else
+            copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=m" (dst.val) : "a" (&dst.val));
         put_stub(stub);
 
@@ -8915,6 +8925,52 @@ x86_emulate(
         opc = init_prefixes(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
+        generate_exception_if(ea.type != OP_REG, EXC_UD);
+        /* Convert to alternative encoding: We want to use a memory operand. */
+        evex.opcx = ext_0f3a;
+        b = 0x15;
+        modrm <<= 3;
+        evex.r = evex.b;
+        evex.R = evex.x;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x14): /* vpextrb $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x15): /* vpextrw $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x16): /* vpextr{d,q} $imm8,xmm,r/m */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x17): /* vextractps $imm8,xmm,r/m */
+        generate_exception_if((evex.lr || evex.reg != 0xf || !evex.RX ||
+                               evex.opmsk || evex.brs),
+                              EXC_UD);
+        if ( !(b & 2) )
+            host_and_vcpu_must_have(avx512bw);
+        else if ( !(b & 1) )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto pextr;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
+                                            /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.lr || evex.brs, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
+                                            /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(evex.lr != 2 || evex.brs, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
     {
         uint32_t mxcsr;





* [PATCH v8 03/50] x86emul: support AVX512{F, BW, DQ} insert insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
  2019-03-15 10:36   ` [PATCH v8 01/50] x86emul: no need to set fault_suppression to false for VMOVNT* Jan Beulich
  2019-03-15 10:36   ` [PATCH v8 02/50] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
@ 2019-03-15 10:37   ` Jan Beulich
  2019-03-15 10:38   ` [PATCH v8 04/50] x86emul: basic AVX512F testing Jan Beulich
                     ` (46 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:37 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also correct the comment of the AVX form of VINSERTPS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v7: Re-base.
v6: Don't refuse to emulate VINSERTPS without AVX512VL.
v4: Make use of d8s_dq64.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -213,6 +213,7 @@ static const struct test avx512f_all[] =
 
 static const struct test avx512f_128[] = {
     INSN(extractps, 66, 0f3a, 17, el,    d, el),
+    INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
@@ -224,12 +225,16 @@ static const struct test avx512f_no128[]
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
+    INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
+    INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
 };
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
+    INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
+    INSN(inserti64x4,    66, 0f3a, 3a, el_4, q, vl),
 };
 
 static const struct test avx512bw_all[] = {
@@ -289,6 +294,8 @@ static const struct test avx512bw_128[]
     INSN(pextrb, 66, 0f3a, 14, el, b, el),
 //       pextrw, 66,   0f, c5,     w
     INSN(pextrw, 66, 0f3a, 15, el, w, el),
+    INSN(pinsrb, 66, 0f3a, 20, el, b, el),
+    INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
 static const struct test avx512dq_all[] = {
@@ -301,6 +308,7 @@ static const struct test avx512dq_all[]
 
 static const struct test avx512dq_128[] = {
     INSN(pextr, 66, 0f3a, 16, el, dq64, el),
+    INSN(pinsr, 66, 0f3a, 22, el, dq64, el),
 };
 
 static const struct test avx512dq_no128[] = {
@@ -308,12 +316,16 @@ static const struct test avx512dq_no128[
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
+    INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
+    INSN(inserti64x2,    66, 0f3a, 38, el_2, q, vl),
 };
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
+    INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
+    INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -360,7 +360,7 @@ static const struct twobyte_table {
     [0xc1] = { DstMem|SrcReg|ModRM },
     [0xc2] = { DstImplicit|SrcImmByte|ModRM, simd_any_fp, d8s_vl },
     [0xc3] = { DstMem|SrcReg|ModRM|Mov },
-    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int },
+    [0xc4] = { DstReg|SrcImmByte|ModRM, simd_packed_int, 1 },
     [0xc5] = { DstReg|SrcImmByte|ModRM|Mov },
     [0xc6] = { DstImplicit|SrcImmByte|ModRM, simd_packed_fp, d8s_vl },
     [0xc7] = { ImplicitOps|ModRM },
@@ -516,17 +516,19 @@ static const struct ext0f3a_table {
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
     [0x17] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 2 },
-    [0x18] = { .simd_size = simd_128 },
+    [0x18] = { .simd_size = simd_128, .d8s = 4 },
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
+    [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x20] = { .simd_size = simd_none },
-    [0x21] = { .simd_size = simd_other },
-    [0x22] = { .simd_size = simd_none },
+    [0x20] = { .simd_size = simd_none, .d8s = 0 },
+    [0x21] = { .simd_size = simd_other, .d8s = 2 },
+    [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
-    [0x38] = { .simd_size = simd_128 },
+    [0x38] = { .simd_size = simd_128, .d8s = 4 },
+    [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x39] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -2586,6 +2588,7 @@ x86_decode_twobyte(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         /* fall through */
     case X86EMUL_OPC_VEX_66(0, 0xc4): /* vpinsrw */
+    case X86EMUL_OPC_EVEX_66(0, 0xc4): /* vpinsrw */
         state->desc = DstReg | SrcMem16;
         break;
 
@@ -2688,6 +2691,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x20):     /* pinsrb */
     case X86EMUL_OPC_VEX_66(0, 0x20): /* vpinsrb */
+    case X86EMUL_OPC_EVEX_66(0, 0x20): /* vpinsrb */
         state->desc = DstImplicit | SrcMem;
         if ( modrm_mod != 3 )
             state->desc |= ByteOp;
@@ -2695,6 +2699,7 @@ x86_decode_0f3a(
 
     case X86EMUL_OPC_66(0, 0x22):     /* pinsr{d,q} */
     case X86EMUL_OPC_VEX_66(0, 0x22): /* vpinsr{d,q} */
+    case X86EMUL_OPC_EVEX_66(0, 0x22): /* vpinsr{d,q} */
         state->desc = DstImplicit | SrcMem;
         break;
 
@@ -7735,6 +7740,23 @@ x86_emulate(
         ea.type = OP_MEM;
         goto simd_0f_int_imm8;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xc4):   /* vpinsrw $imm8,r32/m16,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x20): /* vpinsrb $imm8,r32/m8,xmm,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x22): /* vpinsr{d,q} $imm8,r/m,xmm,xmm */
+        generate_exception_if(evex.lr || evex.opmsk || evex.brs, EXC_UD);
+        if ( b & 2 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        if ( !mode_64bit() )
+            evex.w = 0;
+        memcpy(mmvalp, &src.val, op_bytes);
+        ea.type = OP_MEM;
+        op_bytes = src.bytes;
+        d = SrcMem16; /* Fake for the common SIMD code below. */
+        state->simd_size = simd_other;
+        goto avx512f_imm8_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0xc5):      /* pextrw $imm8,{,x}mm,reg */
     case X86EMUL_OPC_VEX_66(0x0f, 0xc5):   /* vpextrw $imm8,xmm,reg */
         generate_exception_if(vex.l, EXC_UD);
@@ -8951,8 +8973,12 @@ x86_emulate(
         opc = init_evex(stub);
         goto pextr;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x18): /* vinsertf32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinsertf64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x19): /* vextractf32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextractf64x2 $imm8,{y,z}mm,xmm/m128{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x38): /* vinserti32x4 $imm8,xmm/m128,{y,z}mm{k} */
+                                            /* vinserti64x2 $imm8,xmm/m128,{y,z}mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x39): /* vextracti32x4 $imm8,{y,z}mm,xmm/m128{k} */
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
@@ -8961,8 +8987,12 @@ x86_emulate(
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1a): /* vinsertf32x4 $imm8,ymm/m256,zmm{k} */
+                                            /* vinsertf64x2 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1b): /* vextractf32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextractf64x4 $imm8,zmm,ymm/m256{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3a): /* vinserti32x4 $imm8,ymm/m256,zmm{k} */
+                                            /* vinserti64x2 $imm8,ymm/m256,zmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x3b): /* vextracti32x8 $imm8,zmm,ymm/m256{k} */
                                             /* vextracti64x4 $imm8,zmm,ymm/m256{k} */
         if ( !evex.w )
@@ -9055,13 +9085,20 @@ x86_emulate(
         op_bytes = 4;
         goto simd_0f3a_common;
 
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
         op_bytes = 4;
         /* fall through */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x41): /* vdppd $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x21): /* vinsertps $imm8,xmm/m32,xmm,xmm */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.lr || evex.w || evex.opmsk || evex.brs,
+                              EXC_UD);
+        op_bytes = 4;
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )





* [PATCH v8 04/50] x86emul: basic AVX512F testing
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (2 preceding siblings ...)
  2019-03-15 10:37   ` [PATCH v8 03/50] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
@ 2019-03-15 10:38   ` Jan Beulich
  2019-03-15 10:39   ` [PATCH v8 05/50] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
                     ` (45 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test various of the insns which have been implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v6: Fix formatting in simd.h.
v5: Add VSQRT* tests.
v4: Make eq() also work for 4- and 8-byte integer element sizes.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -63,6 +63,9 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
+avx512f-vecs := 64
+avx512f-ints := 4 8
+avx512f-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
@@ -170,7 +173,7 @@ $(addsuffix .c,$(SG)):
 
 $(addsuffix .h,$(SIMD) $(FMA) $(SG)): simd.h
 
-xop.h: simd-fma.c
+xop.h avx512f.h: simd-fma.c
 
 endif # 32-bit override
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -2,7 +2,41 @@
 
 ENTRY(simd_test);
 
-#if VEC_SIZE == 8 && defined(__SSE__)
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if VEC_SIZE == 4
+#  define eq(x, y) ({ \
+    float x_ = (x)[0]; \
+    float __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpss $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif VEC_SIZE == 8
+#  define eq(x, y) ({ \
+    double x_ = (x)[0]; \
+    double __attribute__((vector_size(16))) y_ = { (y)[0] }; \
+    unsigned short r_; \
+    asm ( "vcmpsd $0, %1, %2, %0"  : "=k" (r_) : "m" (x_), "v" (y_) ); \
+    r_ == 1; \
+})
+# elif FLOAT_SIZE == 4
+/*
+ * gcc's (up to at least 8.2) __builtin_ia32_cmpps256_mask() has an anomaly in
+ * that its return type is QI rather than UQI, and hence the value would get
+ * sign-extended before comparing to ALL_TRUE. The same oddity does not matter
+ * for __builtin_ia32_cmppd256_mask(), as only 4 bits are significant there.
+ * Hence the extra " & ALL_TRUE".
+ */
+#  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
+# elif FLOAT_SIZE == 8
+#  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif INT_SIZE == 4 || UINT_SIZE == 4
+#  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+# endif
+#elif VEC_SIZE == 8 && defined(__SSE__)
 # define to_bool(cmp) (__builtin_ia32_pmovmskb(cmp) == 0xff)
 #elif VEC_SIZE == 16
 # if defined(__AVX__) && defined(FLOAT_SIZE)
@@ -93,6 +127,56 @@ static inline bool _to_bool(byte_vec_t b
     touch(x); \
     __builtin_ia32_pfrcpit2(__builtin_ia32_pfrsqit1(__builtin_ia32_pfmul(t_, t_), x), t_); \
 })
+#elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
+# if FLOAT_SIZE == 4
+#  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
+# elif FLOAT_SIZE == 8
+#  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
+# endif
+#elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastss %1, %0" \
+          : "=v" (t_) : "m" (*(float[1]){ x }) ); \
+    t_; \
+})
+#  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
+#   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
+#   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#  endif
+# elif FLOAT_SIZE == 8
+#  if VEC_SIZE >= 32
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vbroadcastsd %1, %0" : "=v" (t_) \
+          : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#  else
+#   define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(double[1]){ x }) ); \
+    t_; \
+})
+#  endif
+#  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
+#   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
+#   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#  endif
+# endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
 # if VEC_SIZE == 32 && defined(__AVX__)
 #  if defined(__AVX2__)
@@ -191,7 +275,30 @@ static inline bool _to_bool(byte_vec_t b
 #  define sqrt(x) scalar_1op(x, "sqrtsd %[in], %[out]")
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE2__)
+#if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
+     defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 4 || UINT_SIZE == 4
+#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+# elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+# endif
+# if INT_SIZE == 4
+#  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
+#  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 4
+#  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 8
+#  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+# endif
+#elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
 #  define interleave_lo(x, y) ((vec_t)__builtin_ia32_punpcklbw128((vqi_t)(x), (vqi_t)(y)))
@@ -587,6 +694,10 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if defined(__AVX512F__) && defined(FLOAT_SIZE)
+# include "simd-fma.c"
+#endif
+
 int simd_test(void)
 {
     unsigned int i, j;
@@ -1034,7 +1145,8 @@ int simd_test(void)
 # endif
 #endif
 
-#if defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)
+#if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
+    (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
 #endif
 
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,9 +70,111 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE == 16
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
+# define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
+#elif VEC_SIZE == 32
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 256 ## s(a)
+#elif VEC_SIZE == 64
+# define B(n, s, a...)   __builtin_ia32_ ## n ## 512 ## s(a)
+# define BR(n, s, a...)  __builtin_ia32_ ## n ## 512 ## s(a, 4)
+#endif
+#ifndef B_
+# define B_ B
+#endif
+#ifndef BR
+# define BR B
+# define BR_ B_
+#endif
+#ifndef BR_
+# define BR_ BR
+#endif
+
+#ifdef __AVX512F__
+
+/*
+ * The original plan was to effect use of EVEX encodings for scalar as well as
+ * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
+ * only of course) XMM16-XMM31 only. All sorts of compiler errors result when
+ * doing this with gcc 8.2. Therefore resort to injecting {evex} prefixes,
+ * which has the benefit of also working for 32-bit. Granted, there is a lot of
+ * escaping to get right here.
+ */
+asm ( ".macro override insn    \n\t"
+      ".macro $\\insn o:vararg \n\t"
+      ".purgem \\insn          \n\t"
+      "{evex} \\insn \\(\\)o   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\(\\))o     \n\t"
+      ".endm                   \n\t"
+      ".endm                   \n\t"
+      ".macro \\insn o:vararg  \n\t"
+      "$\\insn \\(\\)o         \n\t"
+      ".endm                   \n\t"
+      ".endm" );
+
+# define OVR(n) asm ( "override v" #n )
+# define OVR_SFP(n) OVR(n ## sd); OVR(n ## ss)
+
+# ifdef __AVX512VL__
+#  ifdef __AVX512BW__
+#   define OVR_BW(n) OVR(p ## n ## b); OVR(p ## n ## w)
+#  else
+#   define OVR_BW(n)
+#  endif
+#  define OVR_DQ(n) OVR(p ## n ## d); OVR(p ## n ## q)
+#  define OVR_VFP(n) OVR(n ## pd); OVR(n ## ps)
+# else
+#  define OVR_BW(n)
+#  define OVR_DQ(n)
+#  define OVR_VFP(n)
+# endif
+
+# define OVR_FMA(n, w) OVR_ ## w(n ## 132); OVR_ ## w(n ## 213); \
+                       OVR_ ## w(n ## 231)
+# define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
+# define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
+
+OVR_SFP(broadcast);
+OVR_SFP(comi);
+OVR_FP(add);
+OVR_FP(div);
+OVR(extractps);
+OVR_FMA(fmadd, FP);
+OVR_FMA(fmsub, FP);
+OVR_FMA(fnmadd, FP);
+OVR_FMA(fnmsub, FP);
+OVR(insertps);
+OVR_FP(max);
+OVR_FP(min);
+OVR(movd);
+OVR(movq);
+OVR_SFP(mov);
+OVR_FP(mul);
+OVR_FP(sqrt);
+OVR_FP(sub);
+OVR_SFP(ucomi);
+
+# undef OVR_VFP
+# undef OVR_SFP
+# undef OVR_INT
+# undef OVR_FP
+# undef OVR_FMA
+# undef OVR_DQ
+# undef OVR_BW
+# undef OVR
+
+#endif /* __AVX512F__ */
+
 /*
  * Suppress value propagation by the compiler, preventing unwanted
  * optimization. This at once makes the compiler use memory operands
  * more often, which for our purposes is the more interesting case.
  */
 #define touch(var) asm volatile ( "" : "+m" (var) )
+
+static inline vec_t undef(void)
+{
+    vec_t v = v;
+    return v;
+}
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -1,10 +1,9 @@
+#if !defined(__XOP__) && !defined(__AVX512F__)
 #include "simd.h"
-
-#ifndef __XOP__
 ENTRY(fma_test);
 #endif
 
-#if VEC_SIZE < 16
+#if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
 #elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
@@ -24,7 +23,13 @@ ENTRY(fma_test);
 # define eq(x, y) to_bool((x) == (y))
 #endif
 
-#if VEC_SIZE == 16
+#if defined(__AVX512F__) && VEC_SIZE > FLOAT_SIZE
+# if FLOAT_SIZE == 4
+#  define fmaddsub(x, y, z) BR(vfmaddsubps, _mask, x, y, z, ~0)
+# elif FLOAT_SIZE == 8
+#  define fmaddsub(x, y, z) BR(vfmaddsubpd, _mask, x, y, z, ~0)
+# endif
+#elif VEC_SIZE == 16
 # if FLOAT_SIZE == 4
 #  define addsub(x, y) __builtin_ia32_addsubps(x, y)
 #  if defined(__FMA4__) || defined(__FMA__)
@@ -50,6 +55,10 @@ ENTRY(fma_test);
 # endif
 #endif
 
+#if defined(fmaddsub) && !defined(addsub)
+# define addsub(x, y) fmaddsub(x, broadcast(1), y)
+#endif
+
 int fma_test(void)
 {
     unsigned int i;
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -21,6 +21,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-opmask.h"
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
+#include "avx512f.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -248,6 +249,14 @@ static const struct {
     SIMD(OPMASK/b,    avx512dq_opmask,         1),
     SIMD(OPMASK/d,    avx512bw_opmask,         4),
     SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(AVX512F f32 scalar,  avx512f,        f4),
+    SIMD(AVX512F f32x16,      avx512f,      64f4),
+    SIMD(AVX512F f64 scalar,  avx512f,        f8),
+    SIMD(AVX512F f64x8,       avx512f,      64f8),
+    SIMD(AVX512F s32x16,      avx512f,      64i4),
+    SIMD(AVX512F u32x16,      avx512f,      64u4),
+    SIMD(AVX512F s64x8,       avx512f,      64i8),
+    SIMD(AVX512F u64x8,       avx512f,      64u8),
 #undef SIMD_
 #undef SIMD
 };





* [PATCH v8 05/50] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (3 preceding siblings ...)
  2019-03-15 10:38   ` [PATCH v8 04/50] x86emul: basic AVX512F testing Jan Beulich
@ 2019-03-15 10:39   ` Jan Beulich
  2019-03-15 10:39   ` [PATCH v8 06/50] x86emul: basic AVX512VL testing Jan Beulich
                     ` (44 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:39 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the pbroadcastw table entry in evex-disp8.c is slightly
different from what one would expect, due to it requiring EVEX.W to be
zero.
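
For reference, the three source forms involved (my illustration;
requires AVX512BW):

    #include <stdint.h>

    static void bcastw_demo(const uint16_t *p)
    {
        /* xmm and m16 sources use opcode 0f38 79 (EVEX.W0 only): */
        asm volatile ( "vpbroadcastw %%xmm1, %%zmm0" ::: "xmm0" );
        asm volatile ( "vpbroadcastw %0, %%zmm0" :: "m" (*p) : "xmm0" );
        /* The GPR source form is opcode 0f38 7b: */
        asm volatile ( "vpbroadcastw %0, %%zmm0" :: "r" (0x1234u) : "xmm0" );
    }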

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Use dummy output in invoke_stub(). Re-base.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -164,6 +164,9 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+//       pbroadcast,   66, 0f38, 7c,          dq64
+    INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
+    INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
     INSN(pcmp,         66, 0f3a, 1f,    vl,     dq, vl),
     INSN(pcmpeqd,      66,   0f, 76,    vl,      d, vl),
     INSN(pcmpeqq,      66, 0f38, 29,    vl,      q, vl),
@@ -222,6 +225,7 @@ static const struct test avx512f_128[] =
 
 static const struct test avx512f_no128[] = {
     INSN(broadcastf32x4, 66, 0f38, 1a, el_4,  d, vl),
+    INSN(broadcasti32x4, 66, 0f38, 5a, el_4,  d, vl),
     INSN(broadcastsd,    66, 0f38, 19, el,    q, el),
     INSN(extractf32x4,   66, 0f3a, 19, el_4,  d, vl),
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
@@ -231,6 +235,7 @@ static const struct test avx512f_no128[]
 
 static const struct test avx512f_512[] = {
     INSN(broadcastf64x4, 66, 0f38, 1b, el_4, q, vl),
+    INSN(broadcasti64x4, 66, 0f38, 5b, el_4, q, vl),
     INSN(extractf64x4,   66, 0f3a, 1b, el_4, q, vl),
     INSN(extracti64x4,   66, 0f3a, 3b, el_4, q, vl),
     INSN(insertf64x4,    66, 0f3a, 1a, el_4, q, vl),
@@ -250,6 +255,10 @@ static const struct test avx512bw_all[]
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
+//       pbroadcastb, 66, 0f38, 7a,           b
+    INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
+//       pbroadcastw, 66, 0f38, 7b,           b
     INSN(pcmp,        66, 0f3a, 3f,    vl,   bw, vl),
     INSN(pcmpeqb,     66,   0f, 74,    vl,    b, vl),
     INSN(pcmpeqw,     66,   0f, 75,    vl,    w, vl),
@@ -301,6 +310,7 @@ static const struct test avx512bw_128[]
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
+    INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
@@ -314,6 +324,7 @@ static const struct test avx512dq_128[]
 static const struct test avx512dq_no128[] = {
     INSN(broadcastf32x2, 66, 0f38, 19, el_2, d, vl),
     INSN(broadcastf64x2, 66, 0f38, 1a, el_2, q, vl),
+    INSN(broadcasti64x2, 66, 0f38, 5a, el_2, q, vl),
     INSN(extractf64x2,   66, 0f3a, 19, el_2, q, vl),
     INSN(extracti64x2,   66, 0f3a, 39, el_2, q, vl),
     INSN(insertf64x2,    66, 0f3a, 18, el_2, q, vl),
@@ -322,6 +333,7 @@ static const struct test avx512dq_no128[
 
 static const struct test avx512dq_512[] = {
     INSN(broadcastf32x8, 66, 0f38, 1b, el_8, d, vl),
+    INSN(broadcasti32x8, 66, 0f38, 5b, el_8, d, vl),
     INSN(extractf32x8,   66, 0f3a, 1b, el_8, d, vl),
     INSN(extracti32x8,   66, 0f3a, 3b, el_8, d, vl),
     INSN(insertf32x8,    66, 0f3a, 1a, el_8, d, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -278,9 +278,33 @@ static inline bool _to_bool(byte_vec_t b
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if INT_SIZE == 4 || UINT_SIZE == 4
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastd %1, %0" \
+          : "=v" (t_) : "m" (*(int[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastq %1, %0" \
+          : "=v" (t_) : "m" (*(long long[1]){ x }) ); \
+    t_; \
+})
+#  ifdef __x86_64__
+#   define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastq %1, %0" : "=v" (t_) : "r" ((x) + 0ULL) ); \
+    t_; \
+})
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
@@ -977,10 +1001,14 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
-#if defined(broadcast)
+#ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#ifdef broadcast2
+    if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -454,9 +454,13 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x58 ... 0x59] = { .simd_size = simd_other, .two_op = 1 },
-    [0x5a] = { .simd_size = simd_128, .two_op = 1 },
-    [0x78 ... 0x79] = { .simd_size = simd_other, .two_op = 1 },
+    [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
+    [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
+    [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
+    [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x78] = { .simd_size = simd_other, .two_op = 1 },
+    [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
+    [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -2636,6 +2640,11 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
+    case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
+    case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
+        break;
+
     case 0xf0: /* movbe / crc32 */
         state->desc |= repne_prefix() ? ByteOp : Mov;
         if ( rep_prefix() )
@@ -8233,6 +8242,8 @@ x86_emulate(
         goto avx512f_no_sae;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
+        op_bytes = elem_bytes;
         generate_exception_if(evex.w || evex.brs, EXC_UD);
     avx512_broadcast:
         /*
@@ -8252,17 +8263,27 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
                                             /* vbroadcastf64x4 m256,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
+                                            /* vbroadcasti64x4 m256,zmm{k} */
         generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
                                             /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr || evex.brs, EXC_UD);
+        generate_exception_if(!evex.lr, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
+                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
+        if ( b == 0x59 )
+            op_bytes = 8;
+        generate_exception_if(evex.brs, EXC_UD);
         if ( !evex.w )
             host_and_vcpu_must_have(avx512dq);
         goto avx512_broadcast;
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
                                             /* vbroadcastf64x2 m128,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
         generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.brs,
                               EXC_UD);
         if ( evex.w )
@@ -8456,6 +8477,45 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
+        op_bytes = elem_bytes = 1 << (b & 1);
+        /* See the comment at the avx512_broadcast label. */
+        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
+        generate_exception_if((ea.type != OP_REG || evex.brs ||
+                               evex.reg != 0xf || !evex.RX),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert GPR source to %rAX. */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        opc[1] = modrm & 0xf8;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "=g" (dummy) : "a" (src.val));
+
+        put_stub(stub);
+        ASSERT(!state->simd_size);
+        break;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
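
For reference, the register-source broadcast forms handled via the stub
above can be exercised from C much like the harness's broadcast2()
macro does. A minimal sketch, assuming an AVX512F-capable toolchain and
CPU (the %k modifier selects the 32-bit view of the GPR):

    typedef int zmm_i32_t __attribute__((vector_size(64)));

    static inline zmm_i32_t bcast_gpr32(int x)
    {
        zmm_i32_t v;

        asm ( "vpbroadcastd %k1, %0" : "=v" (v) : "r" (x) );
        return v;
    }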

* [PATCH v8 06/50] x86emul: basic AVX512VL testing
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (4 preceding siblings ...)
  2019-03-15 10:39   ` [PATCH v8 05/50] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
@ 2019-03-15 10:39   ` Jan Beulich
  2019-03-15 10:40   ` [PATCH v8 07/50] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
                     ` (43 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:39 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test the 128- and 256-bit variants of the insns which have been
implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v6: Don't enable AVX512VL for scalar tests, nor for S/G ones with index
    wider than data. Re-base over changes earlier in the series.
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -63,7 +63,7 @@ avx2-sg-flts := 4 8
 xop-vecs := $(avx-vecs)
 xop-ints := 1 2 4 8
 xop-flts := $(avx-flts)
-avx512f-vecs := 64
+avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
 
--- a/tools/tests/x86_emulator/simd-fma.c
+++ b/tools/tests/x86_emulator/simd-fma.c
@@ -5,13 +5,13 @@ ENTRY(fma_test);
 
 #if VEC_SIZE < 16 && !defined(to_bool)
 # define to_bool(cmp) (!~(cmp)[0])
-#elif VEC_SIZE == 16
+#elif VEC_SIZE == 16 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
 #  define to_bool(cmp) __builtin_ia32_vtestcpd(cmp, (vec_t){} == 0)
 # endif
-#elif VEC_SIZE == 32
+#elif VEC_SIZE == 32 && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define to_bool(cmp) __builtin_ia32_vtestcps256(cmp, (vec_t){} == 0)
 # elif FLOAT_SIZE == 8
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -539,7 +539,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define rotr(x, n) ((vec_t)__builtin_ia32_palignr128((vdi_t)(x), (vdi_t)(x), (n) * 64))
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSE4_1__)
+#if VEC_SIZE == 16 && defined(__SSE4_1__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)__builtin_ia32_pmaxsb128((vqi_t)(x), (vqi_t)(y)))
 #  define min(x, y) ((vec_t)__builtin_ia32_pminsb128((vqi_t)(x), (vqi_t)(y)))
@@ -593,7 +593,7 @@ static inline bool _to_bool(byte_vec_t b
 #  define mix(x, y) __builtin_ia32_blendpd(x, y, 0b10)
 # endif
 #endif
-#if VEC_SIZE == 32 && defined(__AVX__)
+#if VEC_SIZE == 32 && defined(__AVX__) && !defined(__AVX512VL__)
 # if FLOAT_SIZE == 4
 #  define dot_product(x, y) ({ \
     vec_t t_ = __builtin_ia32_dpps256(x, y, 0b11110001); \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -92,6 +92,15 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+# if VEC_SIZE > ELEM_SIZE && (defined(VEC_MAX) ? VEC_MAX : VEC_SIZE) < 64
+#  pragma GCC target ( "avx512vl" )
+# endif
+
+# define REN(insn, old, new)                     \
+    asm ( ".macro v" #insn #old " o:vararg \n\t" \
+          "v" #insn #new " \\o             \n\t" \
+          ".endm" )
+
 /*
  * The original plan was to effect use of EVEX encodings for scalar as well as
  * 128- and 256-bit insn variants by restricting the compiler to use (on 64-bit
@@ -135,25 +144,88 @@ asm ( ".macro override insn    \n\t"
 # define OVR_FP(n) OVR_VFP(n); OVR_SFP(n)
 # define OVR_INT(n) OVR_BW(n); OVR_DQ(n)
 
+OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
+OVR_INT(add);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
+OVR_FMA(fmaddsub, VFP);
 OVR_FMA(fmsub, FP);
+OVR_FMA(fmsubadd, VFP);
 OVR_FMA(fnmadd, FP);
 OVR_FMA(fnmsub, FP);
 OVR(insertps);
 OVR_FP(max);
+OVR_INT(maxs);
+OVR_INT(maxu);
 OVR_FP(min);
+OVR_INT(mins);
+OVR_INT(minu);
 OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
+OVR_VFP(mova);
+OVR_VFP(movnt);
+OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(shuf);
+OVR_INT(sll);
+OVR_DQ(sllv);
 OVR_FP(sqrt);
+OVR_INT(sra);
+OVR_DQ(srav);
+OVR_INT(srl);
+OVR_DQ(srlv);
 OVR_FP(sub);
+OVR_INT(sub);
 OVR_SFP(ucomi);
+OVR_VFP(unpckh);
+OVR_VFP(unpckl);
+
+# ifdef __AVX512VL__
+#  if ELEM_SIZE == 8 && defined(__AVX512DQ__)
+REN(extract, f128, f64x2);
+REN(extract, i128, i64x2);
+REN(insert, f128, f64x2);
+REN(insert, i128, i64x2);
+#  else
+REN(extract, f128, f32x4);
+REN(extract, i128, i32x4);
+REN(insert, f128, f32x4);
+REN(insert, i128, i32x4);
+#  endif
+#  if ELEM_SIZE == 8
+REN(movdqa, , 64);
+REN(movdqu, , 64);
+REN(pand, , q);
+REN(pandn, , q);
+REN(por, , q);
+REN(pxor, , q);
+#  else
+#   if ELEM_SIZE == 1 && defined(__AVX512BW__)
+REN(movdq, a, u8);
+REN(movdqu, , 8);
+#   elif ELEM_SIZE == 2 && defined(__AVX512BW__)
+REN(movdq, a, u16);
+REN(movdqu, , 16);
+#   else
+REN(movdqa, , 32);
+REN(movdqu, , 32);
+#   endif
+REN(pand, , d);
+REN(pandn, , d);
+REN(por, , d);
+REN(pxor, , d);
+#  endif
+OVR(movntdq);
+OVR(movntdqa);
+OVR(pmulld);
+OVR(pmuldq);
+OVR(pmuludq);
+# endif
 
 # undef OVR_VFP
 # undef OVR_SFP
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -88,6 +88,11 @@ static bool simd_check_avx512f(void)
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
 
+static bool simd_check_avx512f_vl(void)
+{
+    return cpu_has_avx512f && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512dq(void)
 {
     return cpu_has_avx512dq;
@@ -142,11 +147,21 @@ static const struct {
       .check_cpu = simd_check_ ## feat,                             \
       .set_regs = simd_set_regs,                                    \
       .check_regs = simd_check_regs }
+#define AVX512VL_(bits, desc, feat, form)                          \
+    { .code = feat ## _x86_ ## bits ## _D ## _ ## form,            \
+      .size = sizeof(feat ## _x86_ ## bits ## _D ## _ ## form),    \
+      .bitness = bits, .name = "AVX512" #desc,                     \
+      .check_cpu = simd_check_ ## feat ## _vl,                     \
+      .set_regs = simd_set_regs,                                   \
+      .check_regs = simd_check_regs }
 #ifdef __x86_64__
 # define SIMD(desc, feat, form) SIMD_(64, desc, feat, form), \
                                 SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(64, desc, feat, form), \
+                                    AVX512VL_(32, desc, feat, form)
 #else
 # define SIMD(desc, feat, form) SIMD_(32, desc, feat, form)
+# define AVX512VL(desc, feat, form) AVX512VL_(32, desc, feat, form)
 #endif
     SIMD(3DNow! single,          _3dnow,     8f4),
     SIMD(SSE scalar single,      sse,         f4),
@@ -257,6 +272,20 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    AVX512VL(VL f32x4,        avx512f,      16f4),
+    AVX512VL(VL f64x2,        avx512f,      16f8),
+    AVX512VL(VL f32x8,        avx512f,      32f4),
+    AVX512VL(VL f64x4,        avx512f,      32f8),
+    AVX512VL(VL s32x4,        avx512f,      16i4),
+    AVX512VL(VL u32x4,        avx512f,      16u4),
+    AVX512VL(VL s32x8,        avx512f,      32i4),
+    AVX512VL(VL u32x8,        avx512f,      32u4),
+    AVX512VL(VL s64x2,        avx512f,      16i8),
+    AVX512VL(VL u64x2,        avx512f,      16u8),
+    AVX512VL(VL s64x4,        avx512f,      32i8),
+    AVX512VL(VL u64x4,        avx512f,      32u8),
+#undef AVX512VL_
+#undef AVX512VL
 #undef SIMD_
 #undef SIMD
 };
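
For illustration, REN(movdqa, , 64) from the simd.h hunk above emits

    asm ( ".macro vmovdqa o:vararg \n\t"
          "vmovdqa64 \\o           \n\t"
          ".endm" );

so every later vmovdqa use in the test code assembles as the EVEX-only
vmovdqa64 form, ensuring the 128- and 256-bit AVX512VL encodings
actually get exercised.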

* [PATCH v8 07/50] x86emul: support AVX512{F, BW} zero- and sign-extending moves
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (5 preceding siblings ...)
  2019-03-15 10:39   ` [PATCH v8 06/50] x86emul: basic AVX512VL testing Jan Beulich
@ 2019-03-15 10:40   ` Jan Beulich
  2019-03-15 18:02     ` Andrew Cooper
  2019-03-15 10:40   ` [PATCH v8 08/50] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
                     ` (42 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:40 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the testing in simd.c doesn't really follow the ISA extension
pattern: to fit the scheme, widening from byte- and word-granular
vectors can (currently) sensibly be done only in the AVX512BW case (and
hence the respective abstraction macros will be added there rather than
here).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Raise #UD when EVEX.b is set. Re-base.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,6 +177,16 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
+    INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
+    INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
+    INSN(pmovzxwq,     66, 0f38, 34,    vl_4,    w, vl),
+    INSN(pmovzxdq,     66, 0f38, 35,    vl_2, d_nb, vl),
     INSN(pmuldq,       66, 0f38, 28,    vl,      q, vl),
     INSN(pmulld,       66, 0f38, 40,    vl,      d, vl),
     INSN(pmuludq,      66,   0f, f4,    vl,      q, vl),
@@ -274,6 +284,8 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -443,13 +443,23 @@ static const struct ext0f38_table {
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x20 ... 0x25] = { .simd_size = simd_other, .two_op = 1 },
+    [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x23] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x24] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
-    [0x30 ... 0x35] = { .simd_size = simd_other, .two_op = 1 },
+    [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x32] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
+    [0x33] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x34] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
+    [0x35] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
@@ -8349,6 +8359,25 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x25): /* vpmovsxdq {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.brs || (evex.w && (b & 7) == 5), EXC_UD);
+        op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
+        elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -311,10 +311,12 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxdq, _mask, x, (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 4
 #  define max(x, y) ((vec_t)B(pmaxud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminud, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -222,6 +222,16 @@ REN(pxor, , d);
 #  endif
 OVR(movntdq);
 OVR(movntdqa);
+OVR(pmovsxbd);
+OVR(pmovsxbq);
+OVR(pmovsxdq);
+OVR(pmovsxwd);
+OVR(pmovsxwq);
+OVR(pmovzxbd);
+OVR(pmovzxbq);
+OVR(pmovzxdq);
+OVR(pmovzxwd);
+OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
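
The d8s_vl_by_N table values encode EVEX compressed displacements: an
8-bit displacement is implicitly scaled by the size of the memory
operand, which for these widening moves is the vector length divided by
the conversion factor. A hedged sketch of the arithmetic (names here
are illustrative, not the emulator's):

    #include <stdint.h>

    /* E.g. a zmm-destination vpmovsxbw (evex_lr = 2, by = 2) reads
     * 32 bytes, so disp8 gets scaled by 32. */
    static int32_t evex_disp(int8_t disp8, unsigned int evex_lr,
                             unsigned int by)
    {
        unsigned int op_bytes = (16u << evex_lr) / by;

        return (int32_t)disp8 * op_bytes;
    }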

* [PATCH v8 08/50] x86emul: support AVX512{F, BW} down conversion moves
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (6 preceding siblings ...)
  2019-03-15 10:40   ` [PATCH v8 07/50] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
@ 2019-03-15 10:40   ` Jan Beulich
  2019-03-15 18:10     ` Andrew Cooper
  2019-03-15 10:41   ` [PATCH v8 09/50] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
                     ` (41 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:40 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that the vpmov{,s,us}{d,q}w table entries in evex-disp8.c differ
slightly from what one would expect, because these insns require EVEX.W
to be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Adjustment for XSA-289: Use XOR instead of ADD when fiddling with b
    as an array index.
v7: ea.type == OP_* -> ea.type != OP_*. Re-base over change in previous
    patch. Re-base.
v5: Also adjust x86_insn_is_mem_write().
v4: Also #UD when evex.z is set with a memory operand.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -177,11 +177,26 @@ static const struct test avx512f_all[] =
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
     INSN(pminu,        66, 0f38, 3b,    vl,     dq, vl),
+    INSN(pmovdb,       f3, 0f38, 31,    vl_4,    b, vl),
+    INSN(pmovdw,       f3, 0f38, 33,    vl_2,    b, vl),
+    INSN(pmovqb,       f3, 0f38, 32,    vl_8,    b, vl),
+    INSN(pmovqd,       f3, 0f38, 35,    vl_2, d_nb, vl),
+    INSN(pmovqw,       f3, 0f38, 34,    vl_4,    b, vl),
+    INSN(pmovsdb,      f3, 0f38, 21,    vl_4,    b, vl),
+    INSN(pmovsdw,      f3, 0f38, 23,    vl_2,    b, vl),
+    INSN(pmovsqb,      f3, 0f38, 22,    vl_8,    b, vl),
+    INSN(pmovsqd,      f3, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovsqw,      f3, 0f38, 24,    vl_4,    b, vl),
     INSN(pmovsxbd,     66, 0f38, 21,    vl_4,    b, vl),
     INSN(pmovsxbq,     66, 0f38, 22,    vl_8,    b, vl),
     INSN(pmovsxwd,     66, 0f38, 23,    vl_2,    w, vl),
     INSN(pmovsxwq,     66, 0f38, 24,    vl_4,    w, vl),
     INSN(pmovsxdq,     66, 0f38, 25,    vl_2, d_nb, vl),
+    INSN(pmovusdb,     f3, 0f38, 11,    vl_4,    b, vl),
+    INSN(pmovusdw,     f3, 0f38, 13,    vl_2,    b, vl),
+    INSN(pmovusqb,     f3, 0f38, 12,    vl_8,    b, vl),
+    INSN(pmovusqd,     f3, 0f38, 15,    vl_2, d_nb, vl),
+    INSN(pmovusqw,     f3, 0f38, 14,    vl_4,    b, vl),
     INSN(pmovzxbd,     66, 0f38, 31,    vl_4,    b, vl),
     INSN(pmovzxbq,     66, 0f38, 32,    vl_8,    b, vl),
     INSN(pmovzxwd,     66, 0f38, 33,    vl_2,    w, vl),
@@ -284,7 +299,10 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+    INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
+    INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+    INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -277,6 +277,17 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 #if (INT_SIZE == 4 || UINT_SIZE == 4 || INT_SIZE == 8 || UINT_SIZE == 8) && \
      defined(__AVX512F__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextracti{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextracti32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -291,6 +302,7 @@ static inline bool _to_bool(byte_vec_t b
 })
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -720,6 +732,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #endif
 
+#if VEC_SIZE >= 16
+
+# if !defined(low_half) && defined(HALF_SIZE)
+static inline half_t low_half(vec_t x)
+{
+#  if HALF_SIZE < VEC_SIZE
+    half_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 2; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1087,6 +1120,21 @@ int simd_test(void)
 
 #endif
 
+#if defined(widen1) && defined(shrink1)
+    {
+        half_t aux1 = low_half(src), aux2;
+
+        touch(aux1);
+        x = widen1(aux1);
+        touch(x);
+        aux2 = shrink1(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 2; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
 #ifdef dup_lo
     touch(src);
     x = dup_lo(src);
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,23 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if VEC_SIZE >= 16
+
+# if ELEM_COUNT >= 2
+#  if VEC_SIZE > 32
+#   define HALF_SIZE (VEC_SIZE / 2)
+#  else
+#   define HALF_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(HALF_SIZE))) half_t;
+typedef char __attribute__((vector_size(HALF_SIZE))) vqi_half_t;
+typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
+typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
+typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+# endif
+
+#endif
+
 #if VEC_SIZE == 16
 # define B(n, s, a...)   __builtin_ia32_ ## n ## 128 ## s(a)
 # define B_(n, s, a...)  __builtin_ia32_ ## n ##        s(a)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3068,7 +3068,22 @@ x86_decode(
                 d |= vSIB;
             state->simd_size = ext0f38_table[b].simd_size;
             if ( evex_encoded() )
-                disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+            {
+                /*
+                 * VPMOVUS* are identical to VPMOVS* Disp8-scaling-wise, but
+                 * their attributes don't match those of the vex_66 encoded
+                 * insns with the same base opcodes. Rather than adding new
+                 * columns to the table, handle this here for now.
+                 */
+                if ( evex.pfx != vex_f3 || (b & 0xf8) != 0x10 )
+                    disp8scale = decode_disp8scale(ext0f38_table[b].d8s, state);
+                else
+                {
+                    disp8scale = decode_disp8scale(ext0f38_table[b ^ 0x30].d8s,
+                                                   state);
+                    state->simd_size = simd_other;
+                }
+            }
             break;
 
         case ext_0f3a:
@@ -8359,10 +8374,14 @@ x86_emulate(
         op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10): /* vpmovuswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20): /* vpmovswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x30): /* vpmovzxbw {x,y}mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30): /* vpmovwb [xyz]mm,{x,y}mm/mem{k} */
         host_and_vcpu_must_have(avx512bw);
-        /* fall through */
+        if ( evex.pfx != vex_f3 )
+        {
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x23): /* vpmovsxwd {x,y}mm/mem,[xyz]mm{k} */
@@ -8373,7 +8392,29 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x33): /* vpmovzxwd {x,y}mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x35): /* vpmovzxdq {x,y}mm/mem,[xyz]mm{k} */
-        generate_exception_if(evex.brs || (evex.w && (b & 7) == 5), EXC_UD);
+            generate_exception_if(evex.w && (b & 7) == 5, EXC_UD);
+        }
+        else
+        {
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x11): /* vpmovusdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x12): /* vpmovusqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x13): /* vpmovusdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x14): /* vpmovusqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* vpmovusqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x21): /* vpmovsdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x22): /* vpmovsqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x23): /* vpmovsdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x24): /* vpmovsqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* vpmovsqd [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x31): /* vpmovdb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x32): /* vpmovqb [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x33): /* vpmovdw [xyz]mm,{x,y}mm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x34): /* vpmovqw [xyz]mm,xmm/mem{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* vpmovqd [xyz]mm,{x,y}mm/mem{k} */
+            generate_exception_if(evex.w || (ea.type != OP_REG && evex.z), EXC_UD);
+            d = DstMem | SrcReg | TwoOp;
+        }
+        generate_exception_if(evex.brs, EXC_UD);
         op_bytes = 32 >> (pmov_convert_delta[b & 7] + 1 - evex.lr);
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
@@ -10212,6 +10253,12 @@ x86_insn_is_mem_write(const struct x86_e
     case X86EMUL_OPC(0x0f, 0xab):        /* BTS */
     case X86EMUL_OPC(0x0f, 0xb3):        /* BTR */
     case X86EMUL_OPC(0x0f, 0xbb):        /* BTC */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x15): /* VPMOVUS* */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x25): /* VPMOVS* */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x30) ...
+         X86EMUL_OPC_EVEX_F3(0x0f38, 0x35): /* VPMOV{D,Q,W}* */
         return true;
 
     case 0xd9:
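
A scalar reference for the widen/shrink round trip added to simd_test()
above, for the INT_SIZE == 4 case (a sketch, not harness code):

    #include <stdint.h>

    static int widen_shrink_roundtrip_ok(const int32_t *src, unsigned int n)
    {
        for ( unsigned int i = 0; i < n; ++i )
        {
            int64_t wide = src[i];         /* vpmovsxdq: sign-extend */
            int32_t back = (int32_t)wide;  /* vpmovqd: truncate */

            if ( back != src[i] )
                return 0;
        }
        return 1;
    }

Since sign-extension followed by truncation is lossless, the harness's
comparison against the original low half must succeed.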

* [PATCH v8 09/50] x86emul: support AVX512{F, BW} integer unpack insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (7 preceding siblings ...)
  2019-03-15 10:40   ` [PATCH v8 08/50] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
@ 2019-03-15 10:41   ` Jan Beulich
  2019-03-15 18:21     ` Andrew Cooper
  2019-03-15 10:41   ` [PATCH v8 10/50] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
                     ` (40 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:41 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Once again there's one extra twobyte_table[] entry which gets its Disp8
shift value set right away without support actually being implemented
just yet, so as to avoid needlessly splitting groups of entries.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Re-base.
v6: Re-base over changes earlier in the series.
v4: Move OVR() additions into __AVX512VL__ conditional.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -229,6 +229,10 @@ static const struct test avx512f_all[] =
     INSN(pternlog,     66, 0f3a, 25,    vl,     dq, vl),
     INSN(ptestm,       66, 0f38, 27,    vl,     dq, vl),
     INSN(ptestnm,      f3, 0f38, 27,    vl,     dq, vl),
+    INSN(punpckhdq,    66,   0f, 6a,    vl,      d, vl),
+    INSN(punpckhqdq,   66,   0f, 6d,    vl,      q, vl),
+    INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
+    INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
@@ -327,6 +331,10 @@ static const struct test avx512bw_all[]
     INSN(psubw,       66,   0f, f9,    vl,    w, vl),
     INSN(ptestm,      66, 0f38, 26,    vl,   bw, vl),
     INSN(ptestnm,     f3, 0f38, 26,    vl,   bw, vl),
+    INSN(punpckhbw,   66,   0f, 68,    vl,    b, vl),
+    INSN(punpckhwd,   66,   0f, 69,    vl,    w, vl),
+    INSN(punpcklbw,   66,   0f, 60,    vl,    b, vl),
+    INSN(punpcklwd,   66,   0f, 61,    vl,    w, vl),
 };
 
 static const struct test avx512bw_128[] = {
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -300,6 +300,10 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
@@ -317,6 +321,10 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
 # if INT_SIZE == 4
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -252,6 +252,10 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(punpckhdq);
+OVR(punpckhqdq);
+OVR(punpckldq);
+OVR(punpcklqdq);
 # endif
 
 # undef OVR_VFP
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -312,10 +312,10 @@ static const struct twobyte_table {
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other },
+    [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other },
-    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int },
+    [0x68 ... 0x6a] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
+    [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
     [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
@@ -6681,6 +6681,12 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm */
         generate_exception_if(evex.opmsk, EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        op_bytes = 16 << evex.lr;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6708,6 +6714,13 @@ x86_emulate(
         elem_bytes = 1 << (b & 1);
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x62): /* vpunpckldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6a): /* vpunpckhdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        fault_suppression = false;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x26): /* vptestnm{b,w} [xyz]mm/mem,[xyz]mm,k{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x27): /* vptestnm{d,q} [xyz]mm/mem,[xyz]mm,k{k} */
         op_bytes = 16 << evex.lr;
@@ -6734,6 +6747,10 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd4): /* vpaddq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf4): /* vpmuludq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x28): /* vpmuldq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
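
The "op_bytes = 16 << evex.lr" above reflects that the punpck* insns
operate on whole 128-bit lanes. A scalar model of vpunpckldq's per-lane
behaviour (a sketch, not the emulator's code):

    #include <stdint.h>

    static void punpckldq_ref(uint32_t *dst, const uint32_t *x,
                              const uint32_t *y, unsigned int lanes)
    {
        for ( unsigned int l = 0; l < lanes; ++l )
        {
            /* Interleave the low two dwords of each 16-byte lane. */
            dst[l * 4 + 0] = x[l * 4 + 0];
            dst[l * 4 + 1] = y[l * 4 + 0];
            dst[l * 4 + 2] = x[l * 4 + 1];
            dst[l * 4 + 3] = y[l * 4 + 1];
        }
    }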

* [PATCH v8 10/50] x86emul: support AVX512{F, BW, _VBMI} full permute insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (8 preceding siblings ...)
  2019-03-15 10:41   ` [PATCH v8 09/50] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
@ 2019-03-15 10:41   ` Jan Beulich
  2019-05-17 16:50     ` Andrew Cooper
  2019-03-15 10:43   ` [PATCH v8 11/50] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
                     ` (39 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:41 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Take the liberty of also correcting the (public interface) name of the
AVX512_VBMI feature flag, on the assumption that no external consumer
has actually been using that flag so far. Furthermore, make AVX512BW
rather than AVX512F its prerequisite, since the insns require full
64-bit mask registers (the upper 48 bits of which can't be accessed
other than through XSAVE/XRSTOR without AVX512BW support).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v5: Re-base.
v3: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -173,6 +173,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
+    INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
+    INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -294,6 +298,8 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
+    INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -378,6 +384,11 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512_vbmi_all[] = {
+    INSN(permi2b,       66, 0f38, 75, vl, b, vl),
+    INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -718,4 +729,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -150,6 +150,9 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#  else
+#   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
+#   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -175,6 +178,9 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#  else
+#   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
+#   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -303,6 +309,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -324,6 +333,9 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
@@ -769,6 +781,7 @@ int simd_test(void)
 {
     unsigned int i, j;
     vec_t x, y, z, src, inv, alt, sh;
+    vint_t interleave_lo, interleave_hi;
 
     for ( i = 0, j = ELEM_SIZE << 3; i < ELEM_COUNT; ++i )
     {
@@ -782,6 +795,9 @@ int simd_test(void)
         if ( !(i & (i + 1)) )
             --j;
         sh[i] = j;
+
+        interleave_lo[i] = ((i & 1) * ELEM_COUNT) | (i >> 1);
+        interleave_hi[i] = interleave_lo[i] + (ELEM_COUNT / 2);
     }
 
     touch(src);
@@ -1075,7 +1091,7 @@ int simd_test(void)
     x = src * alt;
     y = interleave_lo(x, alt < 0);
     touch(x);
-    z = widen1(x);
+    z = widen1(low_half(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1107,7 +1123,7 @@ int simd_test(void)
 
 # ifdef widen1
     touch(src);
-    x = widen1(src);
+    x = widen1(low_half(src));
     touch(src);
     if ( !eq(x, y) ) return __LINE__;
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -70,6 +70,16 @@ typedef int __attribute__((vector_size(V
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
 #endif
 
+#if ELEM_SIZE == 1
+typedef vqi_t vint_t;
+#elif ELEM_SIZE == 2
+typedef vhi_t vint_t;
+#elif ELEM_SIZE == 4
+typedef vsi_t vint_t;
+#elif ELEM_SIZE == 8
+typedef vdi_t vint_t;
+#endif
+
 #if VEC_SIZE >= 16
 
 # if ELEM_COUNT >= 2
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -136,6 +136,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
+#define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -468,9 +468,13 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
     [0x79] = { .simd_size = simd_other, .two_op = 1, .d8s = 1 },
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
+    [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
@@ -1861,6 +1865,7 @@ static bool vcpu_has(
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
+#define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -6043,6 +6048,11 @@ x86_emulate(
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x76): /* vpermi2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x77): /* vpermi2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7e): /* vpermt2{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7f): /* vpermt2p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xdb): /* vpand{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -8564,6 +8574,16 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512_vbmi);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.brs, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -107,6 +107,7 @@
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
 
 /* CPUID level 0x00000007:0.ecx */
+#define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
 /* CPUID level 0x80000007.edx */
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -224,7 +224,7 @@ XEN_CPUFEATURE(AVX512VL,      5*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.ecx, word 6 */
 XEN_CPUFEATURE(PREFETCHWT1,   6*32+ 0) /*A  PREFETCHWT1 instruction */
-XEN_CPUFEATURE(AVX512VBMI,    6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /*A  AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -259,12 +259,17 @@ def crunch_numbers(state):
         AVX2: [AVX512F],
 
         # AVX512F is taken to mean hardware support for 512bit registers
-        # (which in practice depends on the EVEX prefix to encode), and the
-        # instructions themselves. All further AVX512 features are built on
-        # top of AVX512F
+        # (which in practice depends on the EVEX prefix to encode) as well
+        # as mask registers, and the instructions themselves. All further
+        # AVX512 features are built on top of AVX512F
         AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
-                  AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
-                  AVX512_4FMAPS, AVX512_VPOPCNTDQ],
+                  AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
+                  AVX512_VPOPCNTDQ],
+
+        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # dependents of AVX512BW (as to requiring wider than 16-bit mask
+        # registers), despite the SDM not formally making this connection.
+        AVX512BW: [AVX512_VBMI],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors
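
For a concrete picture of the index vectors set up in simd_test() above,
take ELEM_COUNT == 8:

    interleave_lo = { 0, 8, 1, 9, 2, 10, 3, 11 }
    interleave_hi = { 4, 12, 5, 13, 6, 14, 7, 15 }

Indices below ELEM_COUNT select elements of one source and the remaining
ones elements of the other, so these tables reproduce the classical
unpack-low/unpack-high element ordering across the full vector width
(rather than per 128-bit lane, as the legacy punpck* insns do).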

* [PATCH v8 11/50] x86emul: support AVX512{F, BW} integer shuffle insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (9 preceding siblings ...)
  2019-03-15 10:41   ` [PATCH v8 10/50] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
@ 2019-03-15 10:43   ` Jan Beulich
  2019-05-17 17:01     ` Andrew Cooper
  2019-03-15 10:43   ` [PATCH v8 12/50] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
                     ` (38 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:43 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also include vshuff{32x4,64x2}, which are very similar to vshufi{32x4,64x2}.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Re-base.
v7: Disable fault suppression for VPSHUF{D,{H,L}W}. Re-base.
v6: Re-base over changes earlier in the series.
v5: Re-base over changes earlier in the series.
v4: Move OVR() addition into __AVX512VL__ conditional. Correct comments.
v3: New.
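
For reference, the semantics being emulated here involve two shuffle
granularities. A scalar C sketch (illustrative only, not the emulator's
implementation; 512-bit forms shown, function names made up):

    #include <stdint.h>
    #include <string.h>

    /* VPSHUFD: dword selection happens independently per 128-bit lane. */
    static void pshufd(uint32_t *dst, const uint32_t *src,
                       unsigned int nlanes, uint8_t imm)
    {
        for ( unsigned int l = 0; l < nlanes; ++l )
            for ( unsigned int i = 0; i < 4; ++i )
                dst[l * 4 + i] = src[l * 4 + ((imm >> (2 * i)) & 3)];
    }

    /* VSHUF[FI]32X4 (512-bit form): whole 128-bit lanes get selected,
     * the low two destination lanes from src1, the high two from src2.
     */
    static void shufi32x4(uint32_t dst[16], const uint32_t s1[16],
                          const uint32_t s2[16], uint8_t imm)
    {
        for ( unsigned int l = 0; l < 4; ++l )
        {
            const uint32_t *src = l < 2 ? s1 : s2;
            unsigned int sel = (imm >> (2 * l)) & 3;

            memcpy(&dst[l * 4], &src[sel * 4], 4 * sizeof(*dst));
        }
    }

The lane-local nature of VPSHUF* is also why the test harness combines
them with vshuf[fi]{32x4,64x2} in its swap() constructs for vectors wider
than 128 bits.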

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -214,6 +214,7 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
     INSN(psllq,        66,   0f, f3,    el_2,    q, vl),
@@ -264,6 +265,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
+    INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
+    INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
+    INSN(shufi64x2,      66, 0f3a, 43, vl,    q, vl),
 };
 
 static const struct test avx512f_512[] = {
@@ -318,6 +323,9 @@ static const struct test avx512bw_all[]
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
     INSN(psadbw,      66,   0f, f6,    vl,    b, vl),
+    INSN(pshufb,      66, 0f38, 00,    vl,    b, vl),
+    INSN(pshufhw,     f3,   0f, 70,    vl,    w, vl),
+    INSN(pshuflw,     f2,   0f, 70,    vl,    w, vl),
     INSNX(pslldq,     66,   0f, 73, 7, vl,    b, vl),
     INSN(psllvw,      66, 0f38, 12,    vl,    w, vl),
     INSN(psllw,       66,   0f, f1,    el_8,  w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -153,6 +153,10 @@ static inline bool _to_bool(byte_vec_t b
 #  else
 #   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
+#   define swap(x) ({ \
+    vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
+})
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -181,6 +185,10 @@ static inline bool _to_bool(byte_vec_t b
 #  else
 #   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
+#   define swap(x) ({ \
+    vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
+    B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
+})
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -309,9 +317,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
+                               VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
+                             0b00011011, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -333,9 +346,14 @@ static inline bool _to_bool(byte_vec_t b
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b01001110, (vsi_t)undef(), ~0))
 #  else
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varq, _mask, (vdi_t)(x), interleave_hi, (vdi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varq, _mask, interleave_lo, (vdi_t)(x), (vdi_t)(y), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
+                                      VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -119,6 +119,12 @@ typedef long long __attribute__((vector_
 
 #ifdef __AVX512F__
 
+/* Sadly there are a few exceptions to the general naming rules. */
+# define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
+# define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
+# define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
+# define __builtin_ia32_shuf_i64x2_512_mask __builtin_ia32_shuf_i64x2_mask
+
 # if VEC_SIZE > ELEM_SIZE && (defined(VEC_MAX) ? VEC_MAX : VEC_SIZE) < 64
 #  pragma GCC target ( "avx512vl" )
 # endif
@@ -262,6 +268,7 @@ OVR(pmovzxwq);
 OVR(pmulld);
 OVR(pmuldq);
 OVR(pmuludq);
+OVR(pshufd);
 OVR(punpckhdq);
 OVR(punpckhqdq);
 OVR(punpckldq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -318,7 +318,7 @@ static const struct twobyte_table {
     [0x6b ... 0x6d] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x6e] = { DstImplicit|SrcMem|ModRM|Mov, simd_none, d8s_dq64 },
     [0x6f] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_int, d8s_vl },
-    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other },
+    [0x70] = { SrcImmByte|ModRM|TwoOp, simd_other, d8s_vl },
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
@@ -432,7 +432,8 @@ static const struct ext0f38_table {
     uint8_t vsib:1;
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
-    [0x00 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
     [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
@@ -543,6 +544,7 @@ static const struct ext0f3a_table {
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
+    [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
@@ -552,6 +554,7 @@ static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42] = { .simd_size = simd_packed_int },
+    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6701,6 +6704,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6955,6 +6959,21 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 3;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x70): /* vpshufd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x70): /* vpshufhw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x70): /* vpshuflw $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx == vex_66 )
+            generate_exception_if(evex.w, EXC_UD);
+        else
+        {
+            host_and_vcpu_must_have(avx512bw);
+            generate_exception_if(evex.brs, EXC_UD);
+        }
+        d = (d & ~SrcMask) | SrcMem | TwoOp;
+        op_bytes = 16 << evex.lr;
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x71):    /* Grp12 */
     case X86EMUL_OPC_VEX_66(0x0f, 0x71):
     CASE_SIMD_PACKED_INT(0x0f, 0x72):    /* Grp13 */
@@ -9150,7 +9169,13 @@ x86_emulate(
                                             /* vextracti64x2 $imm8,{y,z}mm,xmm/m128{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
-        generate_exception_if(!evex.lr || evex.brs, EXC_UD);
+        generate_exception_if(evex.brs, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x23): /* vshuff32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshuff64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x43): /* vshufi32x4 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+                                            /* vshufi64x2 $imm8,{y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
         fault_suppression = false;
         goto avx512f_imm8_no_sae;
 





* [PATCH v8 12/50] x86emul: support AVX512{BW, DQ} mask move insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (10 preceding siblings ...)
  2019-03-15 10:43   ` [PATCH v8 11/50] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
@ 2019-03-15 10:43   ` Jan Beulich
  2019-05-17 17:02     ` Andrew Cooper
  2019-03-15 10:43   ` [PATCH v8 13/50] x86emul: basic AVX512BW testing Jan Beulich
                     ` (37 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:43 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Entries are added to the tables in evex-disp8.c despite these insns not
allowing for memory operands, so that in the end the tables give a
complete picture of the supported EVEX-encoded insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
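
As a reminder of the (SDM) semantics, a scalar C sketch of the byte
flavors (illustrative only):

    #include <stdint.h>

    /* vpmovm2b: each mask bit becomes an all-ones / all-zeroes byte. */
    static void pmovm2b(uint8_t *dst, uint64_t k, unsigned int nelem)
    {
        for ( unsigned int i = 0; i < nelem; ++i )
            dst[i] = (k >> i) & 1 ? 0xff : 0;
    }

    /* vpmovb2m: each byte's most significant bit goes back into the mask. */
    static uint64_t pmovb2m(const uint8_t *src, unsigned int nelem)
    {
        uint64_t k = 0;

        for ( unsigned int i = 0; i < nelem; ++i )
            k |= (uint64_t)(src[i] >> 7) << i;

        return k;
    }

Since the two operations invert one another, the opmask.S addition below
simply round-trips a mask value through them and compares. Note also the
inverted size mapping there: an N-bit mask pairs with the element width
yielding N elements per ZMM register, and the b/w flavors need AVX512BW
while the d/q ones need AVX512DQ.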

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -314,9 +314,12 @@ static const struct test avx512bw_all[]
     INSN(pminsw,      66,   0f, ea,    vl,    w, vl),
     INSN(pminub,      66,   0f, da,    vl,    b, vl),
     INSN(pminuw,      66, 0f38, 3a,    vl,    w, vl),
+//       pmovb2m,     f3, 0f38, 29,           b
+//       pmovm2,      f3, 0f38, 28,          bw
     INSN(pmovswb,     f3, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovsxbw,    66, 0f38, 20,    vl_2,  b, vl),
     INSN(pmovuswb,    f3, 0f38, 10,    vl_2,  b, vl),
+//       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
@@ -364,6 +367,9 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN_PFP(or,               0f, 56),
+//       pmovd2m,        f3, 0f38, 39,        d
+//       pmovm2,         f3, 0f38, 38,       dq
+//       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
     INSN_PFP(xor,              0f, 57),
 };
--- a/tools/tests/x86_emulator/opmask.S
+++ b/tools/tests/x86_emulator/opmask.S
@@ -12,17 +12,23 @@
 
 #if SIZE == 1
 # define _(x) x##b
+# define _v(x, t) _v_(x##q, t)
 #elif SIZE == 2
 # define _(x) x##w
+# define _v(x, t) _v_(x##d, t)
 # define WIDEN(x) x##bw
 #elif SIZE == 4
 # define _(x) x##d
+# define _v(x, t) _v_(x##w, t)
 # define WIDEN(x) x##wd
 #elif SIZE == 8
 # define _(x) x##q
+# define _v(x, t) _v_(x##b, t)
 # define WIDEN(x) x##dq
 #endif
 
+#define _v_(x, t) v##x##t
+
     .macro check res1:req, res2:req, line:req
     _(kmov)       %\res1, DATA(out)
 #if SIZE < 8 || !defined(__i386__)
@@ -131,6 +137,15 @@ _start:
 
 #endif
 
+#if SIZE > 2 ? defined(__AVX512BW__) : defined(__AVX512DQ__)
+
+    _(kmov)       DATA(in1), %k0
+    _v(pmovm2,)   %k0, %zmm7
+    _v(pmov,2m)   %zmm7, %k3
+    check         k0, k3, __LINE__
+
+#endif
+
     xor           %eax, %eax
     ret
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -8465,6 +8465,21 @@ x86_emulate(
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x29): /* vpmov{b,w}2m [xyz]mm,k */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x39): /* vpmov{d,q}2m [xyz]mm,k */
+        generate_exception_if(!evex.r || !evex.R, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x28): /* vpmovm2{b,w} k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x38): /* vpmovm2{d,q} k,[xyz]mm */
+        if ( b & 0x10 )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(evex.opmsk || ea.type != OP_REG, EXC_UD);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_66(0x0f38, 0x2a):     /* movntdqa m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2a): /* vmovntdqa mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);






* [PATCH v8 13/50] x86emul: basic AVX512BW testing
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (11 preceding siblings ...)
  2019-03-15 10:43   ` [PATCH v8 12/50] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
@ 2019-03-15 10:43   ` Jan Beulich
  2019-05-17 17:03     ` Andrew Cooper
  2019-03-15 10:44   ` [PATCH v8 14/50] x86emul: basic AVX512DQ testing Jan Beulich
                     ` (36 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:43 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have been implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Correct PS{R,L}LDQ overrides.
v6: Re-base over changes earlier in the series.
v4: Add __AVX512VL__ conditional around majority of OVR() additions.
    Correct eq() for 1- and 2-byte cases.
v3: New.
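
The new widen2()/shrink2() (and widen3()/shrink3()) pairs exercised below
correspond to the sign/zero extending moves and their truncating
counterparts. A scalar C sketch of the byte/dword case (illustrative
only):

    #include <stdint.h>

    /* widen2 for signed bytes maps to vpmovsxbd: each of the low
     * quarter's bytes becomes a sign-extended dword.
     */
    static void widen2(int32_t *dst, const int8_t *src, unsigned int n)
    {
        for ( unsigned int i = 0; i < n; ++i )
            dst[i] = src[i];
    }

    /* shrink2 maps to vpmovdb: each dword gets truncated to a byte. */
    static void shrink2(int8_t *dst, const int32_t *src, unsigned int n)
    {
        for ( unsigned int i = 0; i < n; ++i )
            dst[i] = (int8_t)src[i];
    }

The round trip is lossless for the low ELEM_COUNT / 4 elements, which is
exactly what the new test code checks; this is also why the widen2() /
widen3() invocations get switched to act on low_quarter() / low_eighth()
of their inputs.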

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -66,6 +66,9 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512bw-vecs := $(avx512f-vecs)
+avx512bw-ints := 1 2
+avx512bw-flts :=
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -31,6 +31,10 @@ ENTRY(simd_test);
 #  define eq(x, y) ((BR(cmpps, _mask, x, y, 0, -1) & ALL_TRUE) == ALL_TRUE)
 # elif FLOAT_SIZE == 8
 #  define eq(x, y) (BR(cmppd, _mask, x, y, 0, -1) == ALL_TRUE)
+# elif (INT_SIZE == 1 || UINT_SIZE == 1) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# elif (INT_SIZE == 2 || UINT_SIZE == 2) && defined(__AVX512BW__)
+#  define eq(x, y) (B(pcmpeqw, _mask, (vhi_t)(x), (vhi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 4 || UINT_SIZE == 4
 #  define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
 # elif INT_SIZE == 8 || UINT_SIZE == 8
@@ -374,6 +378,87 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) ((vec_t)B(pmaxuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminuq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # endif
+#elif (INT_SIZE == 1 || UINT_SIZE == 1 || INT_SIZE == 2 || UINT_SIZE == 2) && \
+      defined(__AVX512BW__) && (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if INT_SIZE == 1 || UINT_SIZE == 1
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastb %1, %0" \
+          : "=v" (t_) : "m" (*(char[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastb %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  elif defined(__AVX512VBMI__)
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
+#  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+# elif INT_SIZE == 2 || UINT_SIZE == 2
+#  define broadcast(x) ({ \
+    vec_t t_; \
+    asm ( "%{evex%} vpbroadcastw %1, %0" \
+          : "=v" (t_) : "m" (*(short[1]){ x }) ); \
+    t_; \
+})
+#  define broadcast2(x) ({ \
+    vec_t t_; \
+    asm ( "vpbroadcastw %k1, %0" : "=v" (t_) : "r" (x) ); \
+    t_; \
+})
+#  if VEC_SIZE == 16
+#   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define swap(x) ((vec_t)B(pshufd, _mask, \
+                             (vsi_t)B(pshufhw, _mask, \
+                                      B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
+                                      0b00011011, (vhi_t)undef(), ~0), \
+                             0b01001110, (vsi_t)undef(), ~0))
+#  else
+#   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
+#   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
+#  endif
+#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
+#  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+# endif
+# if INT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovsxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 1
+#  define max(x, y) ((vec_t)B(pmaxub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminub, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
+#  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
+# elif INT_SIZE == 2
+#  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
+#  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
+#  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
+#  define widen1(x) ((vec_t)B(pmovsxwd, _mask, x, (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovsxwq, _mask, x, (vdi_t)undef(), ~0))
+# elif UINT_SIZE == 2
+#  define max(x, y) ((vec_t)B(pmaxuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define min(x, y) ((vec_t)B(pminuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define mul_hi(x, y) ((vec_t)B(pmulhuw, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#  define widen1(x) ((vec_t)B(pmovzxwd, _mask, (vhi_half_t)(x), (vsi_t)undef(), ~0))
+#  define widen2(x) ((vec_t)B(pmovzxwq, _mask, (vhi_quarter_t)(x), (vdi_t)undef(), ~0))
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if INT_SIZE == 1 || UINT_SIZE == 1
 #  define interleave_hi(x, y) ((vec_t)__builtin_ia32_punpckhbw128((vqi_t)(x), (vqi_t)(y)))
@@ -565,7 +650,7 @@ static inline bool _to_bool(byte_vec_t b
 #  endif
 # endif
 #endif
-#if VEC_SIZE == 16 && defined(__SSSE3__)
+#if VEC_SIZE == 16 && defined(__SSSE3__) && !defined(__AVX512VL__)
 # if INT_SIZE == 1
 #  define abs(x) ((vec_t)__builtin_ia32_pabsb128((vqi_t)(x)))
 # elif INT_SIZE == 2
@@ -789,6 +874,40 @@ static inline half_t low_half(vec_t x)
 }
 # endif
 
+# if !defined(low_quarter) && defined(QUARTER_SIZE)
+static inline quarter_t low_quarter(vec_t x)
+{
+#  if QUARTER_SIZE < VEC_SIZE
+    quarter_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
+# if !defined(low_eighth) && defined(EIGHTH_SIZE)
+static inline eighth_t low_eighth(vec_t x)
+{
+#  if EIGHTH_SIZE < VEC_SIZE
+    eighth_t y;
+    unsigned int i;
+
+    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+        y[i] = x[i];
+
+    return y;
+#  else
+    return x;
+#  endif
+}
+# endif
+
 #endif
 
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
@@ -1117,7 +1236,7 @@ int simd_test(void)
     y = interleave_lo(alt < 0, alt < 0);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen2(x);
+    z = widen2(low_quarter(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 
@@ -1126,7 +1245,7 @@ int simd_test(void)
     y = interleave_lo(y, y);
     y = interleave_lo(z, y);
     touch(x);
-    z = widen3(x);
+    z = widen3(low_eighth(x));
     touch(x);
     if ( !eq(z, y) ) return __LINE__;
 #  endif
@@ -1148,14 +1267,14 @@ int simd_test(void)
 
 # ifdef widen2
     touch(src);
-    x = widen2(src);
+    x = widen2(low_quarter(src));
     touch(src);
     if ( !eq(x, z) ) return __LINE__;
 # endif
 
 # ifdef widen3
     touch(src);
-    x = widen3(src);
+    x = widen3(low_eighth(src));
     touch(src);
     if ( !eq(x, interleave_lo(z, (vec_t){})) ) return __LINE__;
 # endif
@@ -1175,6 +1294,36 @@ int simd_test(void)
             if ( aux2[i] != src[i] )
                 return __LINE__;
     }
+#endif
+
+#if defined(widen2) && defined(shrink2)
+    {
+        quarter_t aux1 = low_quarter(src), aux2;
+
+        touch(aux1);
+        x = widen2(aux1);
+        touch(x);
+        aux2 = shrink2(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 4; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
+#endif
+
+#if defined(widen3) && defined(shrink3)
+    {
+        eighth_t aux1 = low_eighth(src), aux2;
+
+        touch(aux1);
+        x = widen3(aux1);
+        touch(x);
+        aux2 = shrink3(x);
+        touch(aux2);
+        for ( i = 0; i < ELEM_COUNT / 8; ++i )
+            if ( aux2[i] != src[i] )
+                return __LINE__;
+    }
 #endif
 
 #ifdef dup_lo
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -95,6 +95,32 @@ typedef int __attribute__((vector_size(H
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
 # endif
 
+# if ELEM_COUNT >= 4
+#  if VEC_SIZE > 64
+#   define QUARTER_SIZE (VEC_SIZE / 4)
+#  else
+#   define QUARTER_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(QUARTER_SIZE))) quarter_t;
+typedef char __attribute__((vector_size(QUARTER_SIZE))) vqi_quarter_t;
+typedef short __attribute__((vector_size(QUARTER_SIZE))) vhi_quarter_t;
+typedef int __attribute__((vector_size(QUARTER_SIZE))) vsi_quarter_t;
+typedef long long __attribute__((vector_size(QUARTER_SIZE))) vdi_quarter_t;
+# endif
+
+# if ELEM_COUNT >= 8
+#  if VEC_SIZE > 128
+#   define EIGHTH_SIZE (VEC_SIZE / 8)
+#  else
+#   define EIGHTH_SIZE 16
+#  endif
+typedef typeof((vec_t){}[0]) __attribute__((vector_size(EIGHTH_SIZE))) eighth_t;
+typedef char __attribute__((vector_size(EIGHTH_SIZE))) vqi_eighth_t;
+typedef short __attribute__((vector_size(EIGHTH_SIZE))) vhi_eighth_t;
+typedef int __attribute__((vector_size(EIGHTH_SIZE))) vsi_eighth_t;
+typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
+# endif
+
 #endif
 
 #if VEC_SIZE == 16
@@ -182,6 +208,9 @@ OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_FP(add);
 OVR_INT(add);
+OVR_BW(adds);
+OVR_BW(addus);
+OVR_BW(avg);
 OVR_FP(div);
 OVR(extractps);
 OVR_FMA(fmadd, FP);
@@ -214,6 +243,8 @@ OVR_INT(srl);
 OVR_DQ(srlv);
 OVR_FP(sub);
 OVR_INT(sub);
+OVR_BW(subs);
+OVR_BW(subus);
 OVR_SFP(ucomi);
 OVR_VFP(unpckh);
 OVR_VFP(unpckl);
@@ -275,6 +306,31 @@ OVR(punpckldq);
 OVR(punpcklqdq);
 # endif
 
+# ifdef __AVX512BW__
+OVR(pextrb);
+OVR(pextrw);
+OVR(pinsrb);
+OVR(pinsrw);
+#  ifdef __AVX512VL__
+OVR(pmaddwd);
+OVR(pmovsxbw);
+OVR(pmovzxbw);
+OVR(pmulhuw);
+OVR(pmulhw);
+OVR(pmullw);
+OVR(psadbw);
+OVR(pshufb);
+OVR(pshufhw);
+OVR(pshuflw);
+OVR(pslldq);
+OVR(psrldq);
+OVR(punpckhbw);
+OVR(punpckhwd);
+OVR(punpcklbw);
+OVR(punpcklwd);
+#  endif
+# endif
+
 # undef OVR_VFP
 # undef OVR_SFP
 # undef OVR_INT
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512bw.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -105,6 +106,11 @@ static bool simd_check_avx512bw(void)
 }
 #define simd_check_avx512bw_opmask simd_check_avx512bw
 
+static bool simd_check_avx512bw_vl(void)
+{
+    return cpu_has_avx512bw && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -284,6 +290,18 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512BW s8x64,     avx512bw,      64i1),
+    SIMD(AVX512BW u8x64,     avx512bw,      64u1),
+    SIMD(AVX512BW s16x32,    avx512bw,      64i2),
+    SIMD(AVX512BW u16x32,    avx512bw,      64u2),
+    AVX512VL(BW+VL s8x16,    avx512bw,      16i1),
+    AVX512VL(BW+VL u8x16,    avx512bw,      16u1),
+    AVX512VL(BW+VL s8x32,    avx512bw,      32i1),
+    AVX512VL(BW+VL u8x32,    avx512bw,      32u1),
+    AVX512VL(BW+VL s16x8,    avx512bw,      16i2),
+    AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
+    AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
+    AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_





* [PATCH v8 14/50] x86emul: basic AVX512DQ testing
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (12 preceding siblings ...)
  2019-03-15 10:43   ` [PATCH v8 13/50] x86emul: basic AVX512BW testing Jan Beulich
@ 2019-03-15 10:44   ` Jan Beulich
  2019-05-17 17:03     ` Andrew Cooper
  2019-03-15 10:44   ` [PATCH v8 15/50] x86emul: support AVX512F move high/low insns Jan Beulich
                     ` (35 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Test a variety of the insns which have been implemented already.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Re-base.
v5: Re-base over changes earlier in the series.
v4: Wrap OVR(pmullq) in __AVX512VL__ conditional.
v3: New.
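
The broadcast/insert tests added below rely on a simple identity:
broadcasting the low quarter of src must equal taking src itself (whose
low quarter already matches) and inserting that quarter at the three
higher positions. A scalar C sketch (illustrative only; qelem is the
element count per quarter):

    #include <stdint.h>

    static void broadcast_quarter(uint32_t *dst, const uint32_t *q,
                                  unsigned int qelem)
    {
        for ( unsigned int p = 0; p < 4; ++p )
            for ( unsigned int i = 0; i < qelem; ++i )
                dst[p * qelem + i] = q[i];
    }

    /* Only the quarter at 'pos' changes; all other elements are kept. */
    static void insert_quarter(uint32_t *vec, const uint32_t *q,
                               unsigned int qelem, unsigned int pos)
    {
        for ( unsigned int i = 0; i < qelem; ++i )
            vec[pos * qelem + i] = q[i];
    }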

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -69,9 +69,12 @@ avx512f-flts := 4 8
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
+avx512dq-vecs := $(avx512f-vecs)
+avx512dq-ints := $(avx512f-ints)
+avx512dq-flts := $(avx512f-flts)
 
 avx512f-opmask-vecs := 2
-avx512dq-opmask-vecs := 1
+avx512dq-opmask-vecs := 1 2
 avx512bw-opmask-vecs := 4 8
 
 # Suppress building by default of the harness if the compiler can't deal
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -121,6 +121,34 @@ typedef int __attribute__((vector_size(E
 typedef long long __attribute__((vector_size(EIGHTH_SIZE))) vdi_eighth_t;
 # endif
 
+# define DECL_PAIR(w) \
+typedef w ## _t pair_t; \
+typedef vsi_ ## w ## _t vsi_pair_t; \
+typedef vdi_ ## w ## _t vdi_pair_t
+# define DECL_QUARTET(w) \
+typedef w ## _t quartet_t; \
+typedef vsi_ ## w ## _t vsi_quartet_t; \
+typedef vdi_ ## w ## _t vdi_quartet_t
+# define DECL_OCTET(w) \
+typedef w ## _t octet_t; \
+typedef vsi_ ## w ## _t vsi_octet_t; \
+typedef vdi_ ## w ## _t vdi_octet_t
+
+# if ELEM_COUNT == 4
+DECL_PAIR(half);
+# elif ELEM_COUNT == 8
+DECL_PAIR(quarter);
+DECL_QUARTET(half);
+# elif ELEM_COUNT == 16
+DECL_PAIR(eighth);
+DECL_QUARTET(quarter);
+DECL_OCTET(half);
+# endif
+
+# undef DECL_OCTET
+# undef DECL_QUARTET
+# undef DECL_PAIR
+
 #endif
 
 #if VEC_SIZE == 16
@@ -146,6 +174,14 @@ typedef long long __attribute__((vector_
 #ifdef __AVX512F__
 
 /* Sadly there are a few exceptions to the general naming rules. */
+# define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
+# define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+# define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
+# define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
+# define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
+# define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
+# define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
+# define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -331,6 +367,20 @@ OVR(punpcklwd);
 #  endif
 # endif
 
+# ifdef __AVX512DQ__
+OVR_VFP(and);
+OVR_VFP(andn);
+OVR_VFP(or);
+OVR(pextrd);
+OVR(pextrq);
+OVR(pinsrd);
+OVR(pinsrq);
+#  ifdef __AVX512VL__
+OVR(pmullq);
+#  endif
+OVR_VFP(xor);
+# endif
+
 # undef OVR_VFP
 # undef OVR_SFP
 # undef OVR_INT
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -139,6 +139,27 @@ static inline bool _to_bool(byte_vec_t b
 # endif
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
+     (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
+     (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#  define low_half(x) ({ \
+    half_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+    t_; \
+})
+# endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
+     (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if FLOAT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -146,6 +167,17 @@ static inline bool _to_bool(byte_vec_t b
           : "=v" (t_) : "m" (*(float[1]){ x }) ); \
     t_; \
 })
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcastf32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
+#   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
+#  endif
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
@@ -155,6 +187,13 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
 #  else
+#   define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
+#   define insert_pair(x, y, p) \
+    B(insertf32x4_, _mask, x, \
+      /* Cast needed below to work around gcc 7.x quirk. */ \
+      (p) & 1 ? (typeof(y))__builtin_ia32_shufps(y, y, 0b01000100) : (y), \
+      (p) >> 1, x, 3 << ((p) * 2))
+#   define insert_quartet(x, y, p) B(insertf32x4_, _mask, x, y, p, undef(), ~0)
 #   define interleave_hi(x, y) B(vpermi2varps, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varps, _mask, interleave_lo, x, y, ~0)
 #   define swap(x) ({ \
@@ -178,6 +217,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) B(broadcastf64x2_, _mask, x, undef(), ~0)
+#   define insert_pair(x, y, p) B(insertf64x2_, _mask, x, y, p, undef(), ~0)
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
+#   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
+#  endif
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
@@ -306,6 +353,16 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 # endif
+# if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextracti32x4 */ || \
+       (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextracti64x2 */
+#  define low_quarter(x) ({ \
+    quarter_t t_; \
+    asm ( "vextracti%c[w]x%c[n] $0, %[s], %[d]" \
+          : [d] "=m" (t_) \
+          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 4) ); \
+    t_; \
+})
+# endif
 # if INT_SIZE == 4 || UINT_SIZE == 4
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -318,11 +375,30 @@ static inline bool _to_bool(byte_vec_t b
     asm ( "vpbroadcastd %k1, %0" : "=v" (t_) : "r" (x) ); \
     t_; \
 })
+#  ifdef __AVX512DQ__
+#   define broadcast_pair(x) ({ \
+    vec_t t_; \
+    asm ( "vbroadcasti32x2 %1, %0" : "=v" (t_) : "m" (x) ); \
+    t_; \
+})
+#  endif
+#  if VEC_SIZE == 64 && defined(__AVX512DQ__)
+#   define broadcast_octet(x) ((vec_t)B(broadcasti32x8_, _mask, (vsi_octet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_octet(x, y, p) ((vec_t)B(inserti32x8_, _mask, (vsi_t)(x), (vsi_octet_t)(y), p, (vsi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhdq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpckldq, _mask, (vsi_t)(x), (vsi_t)(y), (vsi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, (vsi_t)(x), 0b00011011, (vsi_t)undef(), ~0))
 #  else
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti32x4_, _mask, (vsi_quartet_t)(x), (vsi_t)undef(), ~0))
+#   define insert_pair(x, y, p) \
+    (vec_t)(B(inserti32x4_, _mask, (vsi_t)(x), \
+              /* First cast needed below to work around gcc 7.x quirk. */ \
+              (p) & 1 ? (vsi_pair_t)__builtin_ia32_pshufd((vsi_pair_t)(y), 0b01000100) \
+                      : (vsi_pair_t)(y), \
+              (p) >> 1, (vsi_t)(x), 3 << ((p) * 2)))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti32x4_, _mask, (vsi_t)(x), (vsi_quartet_t)(y), p, (vsi_t)undef(), ~0))
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2vard, _mask, (vsi_t)(x), interleave_hi, (vsi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2vard, _mask, interleave_lo, (vsi_t)(x), (vsi_t)(y), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
@@ -347,6 +423,14 @@ static inline bool _to_bool(byte_vec_t b
     t_; \
 })
 #  endif
+#  if VEC_SIZE >= 32 && defined(__AVX512DQ__)
+#   define broadcast_pair(x) ((vec_t)B(broadcasti64x2_, _mask, (vdi_pair_t)(x), (vdi_t)undef(), ~0))
+#   define insert_pair(x, y, p) ((vec_t)B(inserti64x2_, _mask, (vdi_t)(x), (vdi_pair_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
+#  if VEC_SIZE == 64
+#   define broadcast_quartet(x) ((vec_t)B(broadcasti64x4_, , (vdi_quartet_t)(x), (vdi_t)undef(), ~0))
+#   define insert_quartet(x, y, p) ((vec_t)B(inserti64x4_, _mask, (vdi_t)(x), (vdi_quartet_t)(y), p, (vdi_t)undef(), ~0))
+#  endif
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklqdq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
@@ -898,7 +982,7 @@ static inline eighth_t low_eighth(vec_t
     eighth_t y;
     unsigned int i;
 
-    for ( i = 0; i < ELEM_COUNT / 4; ++i )
+    for ( i = 0; i < ELEM_COUNT / 8; ++i )
         y[i] = x[i];
 
     return y;
@@ -910,6 +994,50 @@ static inline eighth_t low_eighth(vec_t
 
 #endif
 
+#ifdef broadcast_pair
+# if ELEM_COUNT == 4
+#  define broadcast_half broadcast_pair
+# elif ELEM_COUNT == 8
+#  define broadcast_quarter broadcast_pair
+# elif ELEM_COUNT == 16
+#  define broadcast_eighth broadcast_pair
+# endif
+#endif
+
+#ifdef insert_pair
+# if ELEM_COUNT == 4
+#  define insert_half insert_pair
+# elif ELEM_COUNT == 8
+#  define insert_quarter insert_pair
+# elif ELEM_COUNT == 16
+#  define insert_eighth insert_pair
+# endif
+#endif
+
+#ifdef broadcast_quartet
+# if ELEM_COUNT == 8
+#  define broadcast_half broadcast_quartet
+# elif ELEM_COUNT == 16
+#  define broadcast_quarter broadcast_quartet
+# endif
+#endif
+
+#ifdef insert_quartet
+# if ELEM_COUNT == 8
+#  define insert_half insert_quartet
+# elif ELEM_COUNT == 16
+#  define insert_quarter insert_quartet
+# endif
+#endif
+
+#if defined(broadcast_octet) && ELEM_COUNT == 16
+# define broadcast_half broadcast_octet
+#endif
+
+#if defined(insert_octet) && ELEM_COUNT == 16
+# define insert_half insert_octet
+#endif
+
 #if defined(__AVX512F__) && defined(FLOAT_SIZE)
 # include "simd-fma.c"
 #endif
@@ -1205,6 +1333,60 @@ int simd_test(void)
     if ( !eq(broadcast2(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
 
+#if defined(broadcast_half) && defined(insert_half)
+    {
+        half_t aux = low_half(src);
+
+        touch(aux);
+        x = broadcast_half(aux);
+        touch(aux);
+        y = insert_half(src, aux, 1);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_quarter) && defined(insert_quarter)
+    {
+        quarter_t aux = low_quarter(src);
+
+        touch(aux);
+        x = broadcast_quarter(aux);
+        touch(aux);
+        y = insert_quarter(src, aux, 1);
+        touch(aux);
+        y = insert_quarter(y, aux, 2);
+        touch(aux);
+        y = insert_quarter(y, aux, 3);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
+#if defined(broadcast_eighth) && defined(insert_eighth) && \
+    /* At least gcc 7.3 "optimizes" away all insert_eighth() calls below. */ \
+    __GNUC__ >= 8
+    {
+        eighth_t aux = low_eighth(src);
+
+        touch(aux);
+        x = broadcast_eighth(aux);
+        touch(aux);
+        y = insert_eighth(src, aux, 1);
+        touch(aux);
+        y = insert_eighth(y, aux, 2);
+        touch(aux);
+        y = insert_eighth(y, aux, 3);
+        touch(aux);
+        y = insert_eighth(y, aux, 4);
+        touch(aux);
+        y = insert_eighth(y, aux, 5);
+        touch(aux);
+        y = insert_eighth(y, aux, 6);
+        touch(aux);
+        y = insert_eighth(y, aux, 7);
+        if ( !eq(x, y) ) return __LINE__;
+    }
+#endif
+
 #if defined(interleave_lo) && defined(interleave_hi)
     touch(src);
     x = interleave_lo(inv, src);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -23,6 +23,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
 #include "avx512bw.h"
+#include "avx512dq.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -100,6 +101,11 @@ static bool simd_check_avx512dq(void)
 }
 #define simd_check_avx512dq_opmask simd_check_avx512dq
 
+static bool simd_check_avx512dq_vl(void)
+{
+    return cpu_has_avx512dq && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -267,9 +273,10 @@ static const struct {
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
     SIMD(OPMASK/w,     avx512f_opmask,         2),
-    SIMD(OPMASK/b,    avx512dq_opmask,         1),
-    SIMD(OPMASK/d,    avx512bw_opmask,         4),
-    SIMD(OPMASK/q,    avx512bw_opmask,         8),
+    SIMD(OPMASK+DQ/b, avx512dq_opmask,         1),
+    SIMD(OPMASK+DQ/w, avx512dq_opmask,         2),
+    SIMD(OPMASK+BW/d, avx512bw_opmask,         4),
+    SIMD(OPMASK+BW/q, avx512bw_opmask,         8),
     SIMD(AVX512F f32 scalar,  avx512f,        f4),
     SIMD(AVX512F f32x16,      avx512f,      64f4),
     SIMD(AVX512F f64 scalar,  avx512f,        f8),
@@ -302,6 +309,24 @@ static const struct {
     AVX512VL(BW+VL u16x8,    avx512bw,      16u2),
     AVX512VL(BW+VL s16x16,   avx512bw,      32i2),
     AVX512VL(BW+VL u16x16,   avx512bw,      32u2),
+    SIMD(AVX512DQ f32x16,    avx512dq,      64f4),
+    SIMD(AVX512DQ f64x8,     avx512dq,      64f8),
+    SIMD(AVX512DQ s32x16,    avx512dq,      64i4),
+    SIMD(AVX512DQ u32x16,    avx512dq,      64u4),
+    SIMD(AVX512DQ s64x8,     avx512dq,      64i8),
+    SIMD(AVX512DQ u64x8,     avx512dq,      64u8),
+    AVX512VL(DQ+VL f32x4,    avx512dq,      16f4),
+    AVX512VL(DQ+VL f64x2,    avx512dq,      16f8),
+    AVX512VL(DQ+VL f32x8,    avx512dq,      32f4),
+    AVX512VL(DQ+VL f64x4,    avx512dq,      32f8),
+    AVX512VL(DQ+VL s32x4,    avx512dq,      16i4),
+    AVX512VL(DQ+VL u32x4,    avx512dq,      16u4),
+    AVX512VL(DQ+VL s32x8,    avx512dq,      32i4),
+    AVX512VL(DQ+VL u32x8,    avx512dq,      32u4),
+    AVX512VL(DQ+VL s64x2,    avx512dq,      16i8),
+    AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
+    AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
+    AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_





* [PATCH v8 15/50] x86emul: support AVX512F move high/low insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (13 preceding siblings ...)
  2019-03-15 10:44   ` [PATCH v8 14/50] x86emul: basic AVX512DQ testing Jan Beulich
@ 2019-03-15 10:44   ` Jan Beulich
  2019-05-21 10:59     ` Andrew Cooper
  2019-03-15 10:45   ` [PATCH v8 16/50] x86emul: support AVX512F move duplicate insns Jan Beulich
                     ` (34 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

No explicit test harness additions are needed other than the overrides,
as the compiler already makes use of these insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: No need to set fault_suppression to false.
v4: New.
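
A scalar C sketch of the semantics (illustrative only): each of these
insns moves just 64 bits into or out of one half of an XMM register.

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } xmm_t;

    static void movlps(xmm_t *x, const uint64_t *m64) { x->lo = *m64; }
    static void movhps(xmm_t *x, const uint64_t *m64) { x->hi = *m64; }
    /* Register-only companions sharing the 0F 12 / 0F 16 opcodes: */
    static void movhlps(xmm_t *d, const xmm_t *s) { d->lo = s->hi; }
    static void movlhps(xmm_t *d, const xmm_t *s) { d->hi = s->lo; }

The 66-prefixed (double precision) forms have no register encoding, which
is what the ea.type != OP_MEM check below enforces; the no-prefix register
encodings are VMOVHLPS / VMOVLHPS instead.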

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -253,6 +253,16 @@ static const struct test avx512f_128[] =
     INSN(insertps,  66, 0f3a, 21, el,    d, el),
     INSN(mov,       66,   0f, 6e, el, dq64, el),
     INSN(mov,       66,   0f, 7e, el, dq64, el),
+//       movhlps,     ,   0f, 12,        d
+    INSN(movhpd,    66,   0f, 16, el,    q, vl),
+    INSN(movhpd,    66,   0f, 17, el,    q, vl),
+    INSN(movhps,      ,   0f, 16, el_2,  d, vl),
+    INSN(movhps,      ,   0f, 17, el_2,  d, vl),
+//       movlhps,     ,   0f, 16,        d
+    INSN(movlpd,    66,   0f, 12, el,    q, vl),
+    INSN(movlpd,    66,   0f, 13, el,    q, vl),
+    INSN(movlps,      ,   0f, 12, el_2,  d, vl),
+    INSN(movlps,      ,   0f, 13, el_2,  d, vl),
     INSN(movq,      f3,   0f, 7e, el,    q, el),
     INSN(movq,      66,   0f, d6, el,    q, el),
 };
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -266,6 +266,12 @@ OVR(movd);
 OVR(movq);
 OVR_SFP(mov);
 OVR_VFP(mova);
+OVR(movhlps);
+OVR(movhpd);
+OVR(movhps);
+OVR(movlhps);
+OVR(movlpd);
+OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -286,11 +286,11 @@ static const struct twobyte_table {
     [0x0f] = { ModRM|SrcImmByte },
     [0x10] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x11] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x12] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x13] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x14 ... 0x15] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
-    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
-    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other },
+    [0x16] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, 3 },
+    [0x17] = { DstMem|SrcImplicit|ModRM|Mov, simd_other, 3 },
     [0x18 ... 0x1f] = { ImplicitOps|ModRM },
     [0x20 ... 0x21] = { DstMem|SrcImplicit|ModRM },
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
@@ -6032,6 +6032,25 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_fp;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x12):   /* vmovlpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x13): /* vmovlp{s,d} xmm,m64 */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x16):   /* vmovhpd m64,xmm,xmm */
+    CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x17): /* vmovhp{s,d} xmm,m64 */
+        generate_exception_if(ea.type != OP_MEM, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x12):      /* vmovlps m64,xmm,xmm */
+                                            /* vmovhlps xmm,xmm,xmm */
+    case X86EMUL_OPC_EVEX(0x0f, 0x16):      /* vmovhps m64,xmm,xmm */
+                                            /* vmovlhps xmm,xmm,xmm */
+        generate_exception_if((evex.lr || evex.opmsk || evex.brs ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( (d & DstMask) != DstMem )
+            d &= ~TwoOp;
+        op_bytes = 8;
+        goto simd_zmm;
+
     case X86EMUL_OPC_F3(0x0f, 0x12):       /* movsldup xmm/m128,xmm */
     case X86EMUL_OPC_VEX_F3(0x0f, 0x12):   /* vmovsldup {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F2(0x0f, 0x12):       /* movddup xmm/m64,xmm */






* [PATCH v8 16/50] x86emul: support AVX512F move duplicate insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (14 preceding siblings ...)
  2019-03-15 10:44   ` [PATCH v8 15/50] x86emul: support AVX512F move high/low insns Jan Beulich
@ 2019-03-15 10:45   ` Jan Beulich
  2019-05-21 11:09     ` Andrew Cooper
  2019-03-15 10:46   ` [PATCH v8 17/50] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
                     ` (33 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:45 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Judging from insn prefixes, these are scalar insns, but their (memory)
operands are vector ones (with the exception of 128-bit VMOVDDUP). This
requires some adjustments to the disp8scale calculation code.

No explicit test harness additions are needed other than the overrides,
as the compiler already makes use of these insns.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Fix Disp8 test for VMOVDDUP when AVX512VL is unavailable.
v4: New.
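
A scalar C sketch of the duplicate semantics (illustrative only):

    /* vmovsldup / vmovshdup duplicate even / odd indexed singles. */
    static void movsldup(float *dst, const float *src, unsigned int n)
    {
        for ( unsigned int i = 0; i < n; i += 2 )
            dst[i] = dst[i + 1] = src[i];
    }

    static void movshdup(float *dst, const float *src, unsigned int n)
    {
        for ( unsigned int i = 0; i < n; i += 2 )
            dst[i] = dst[i + 1] = src[i + 1];
    }

    /* vmovddup duplicates even indexed doubles. */
    static void movddup(double *dst, const double *src, unsigned int n)
    {
        for ( unsigned int i = 0; i < n; i += 2 )
            dst[i] = dst[i + 1] = src[i];
    }

The memory operand is a full vector in all cases except the 128-bit
VMOVDDUP, which reads only 8 bytes; hence the decode change below yields
disp8scale = evex.lr ? 4 + evex.lr : 3 for the F2-prefixed form.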

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -146,6 +146,7 @@ static const struct test avx512f_all[] =
     INSN_SFP(mov,            0f, 11),
     INSN_PFP_NB(mova,        0f, 28),
     INSN_PFP_NB(mova,        0f, 29),
+    INSN(movddup,      f2,   0f, 12,    vl,   q_nb, vl),
     INSN(movdqa32,     66,   0f, 6f,    vl,   d_nb, vl),
     INSN(movdqa32,     66,   0f, 7f,    vl,   d_nb, vl),
     INSN(movdqa64,     66,   0f, 6f,    vl,   q_nb, vl),
@@ -157,6 +158,8 @@ static const struct test avx512f_all[] =
     INSN(movntdq,      66,   0f, e7,    vl,   d_nb, vl),
     INSN(movntdqa,     66, 0f38, 2a,    vl,   d_nb, vl),
     INSN_PFP_NB(movnt,       0f, 2b),
+    INSN(movshdup,     f3,   0f, 16,    vl,   d_nb, vl),
+    INSN(movsldup,     f3,   0f, 12,    vl,   d_nb, vl),
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
@@ -694,6 +697,19 @@ static void test_group(const struct test
 
             switch ( tests[i].esz )
             {
+            case ESZ_q_nb:
+                /* The 128-bit form of VMOVDDUP needs special casing. */
+                if ( vl[j] == VL_128 && tests[i].spc == SPC_0f &&
+                     tests[i].opc == 0x12 && tests[i].pfx == PFX_f2 )
+                {
+                    struct test test = tests[i];
+
+                    test.vsz = VSZ_el;
+                    test.scale = SC_el;
+                    test_one(&test, vl[j], instr, ctxt);
+                    continue;
+                }
+                /* fall through */
             default:
                 test_one(&tests[i], vl[j], instr, ctxt);
                 break;
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -326,8 +326,11 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
+OVR(movshdup);
+OVR(movsldup);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3048,6 +3048,15 @@ x86_decode(
 
             switch ( b )
             {
+            case 0x12: /* vmovsldup / vmovddup */
+                if ( evex.pfx == vex_f2 )
+                    disp8scale = evex.lr ? 4 + evex.lr : 3;
+                /* fall through */
+            case 0x16: /* vmovshdup */
+                if ( evex.pfx == vex_f3 )
+                    disp8scale = 4 + evex.lr;
+                break;
+
             case 0x20: /* mov cr,reg */
             case 0x21: /* mov dr,reg */
             case 0x22: /* mov reg,cr */
@@ -6066,6 +6075,20 @@ x86_emulate(
         host_and_vcpu_must_have(sse3);
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x12):   /* vmovsldup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x12):   /* vmovddup [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x16):   /* vmovshdup [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if((evex.brs ||
+                               evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = !(evex.pfx & VEX_PREFIX_DOUBLE_MASK) || evex.lr
+                   ? 16 << evex.lr : 8;
+        fault_suppression = false;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x14): /* vunpcklp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0x15): /* vunpckhp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),





* [PATCH v8 17/50] x86emul: support AVX512{F, BW, _VBMI} permute insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (15 preceding siblings ...)
  2019-03-15 10:45   ` [PATCH v8 16/50] x86emul: support AVX512F move duplicate insns Jan Beulich
@ 2019-03-15 10:46   ` Jan Beulich
  2019-05-21 11:24     ` Andrew Cooper
  2019-03-15 10:46   ` [PATCH v8 18/50] x86emul: support AVX512BW pack insns Jan Beulich
                     ` (32 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:46 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v5: Re-base over changes earlier in the series.
v4: New.
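
For reference, a scalar C sketch of the immediate form of VPERMILPS
(illustrative only): selection happens independently within each 128-bit
lane, which is why the swap2() construct added below has to combine it
with vshuf[fi]{32x4,64x2} for vectors wider than 128 bits.

    #include <stdint.h>

    static void vpermilps_imm(uint32_t *dst, const uint32_t *src,
                              unsigned int nelem, uint8_t imm)
    {
        for ( unsigned int i = 0; i < nelem; ++i )
            dst[i] = src[(i & ~3u) | ((imm >> (2 * (i & 3))) & 3)];
    }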

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -178,6 +178,10 @@ static const struct test avx512f_all[] =
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
+    INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
+    INSN(permilpd,     66, 0f3a, 05,    vl,      q, vl),
+    INSN(permilps,     66, 0f38, 0c,    vl,      d, vl),
+    INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
@@ -278,6 +282,10 @@ static const struct test avx512f_no128[]
     INSN(extracti32x4,   66, 0f3a, 39, el_4,  d, vl),
     INSN(insertf32x4,    66, 0f3a, 18, el_4,  d, vl),
     INSN(inserti32x4,    66, 0f3a, 38, el_4,  d, vl),
+    INSN(perm,           66, 0f38, 36, vl,   dq, vl),
+    INSN(perm,           66, 0f38, 16, vl,   sd, vl),
+    INSN(permpd,         66, 0f3a, 01, vl,    q, vl),
+    INSN(permq,          66, 0f3a, 00, vl,    q, vl),
     INSN(shuff32x4,      66, 0f3a, 23, vl,    d, vl),
     INSN(shuff64x2,      66, 0f3a, 23, vl,    q, vl),
     INSN(shufi32x4,      66, 0f3a, 43, vl,    d, vl),
@@ -316,6 +324,7 @@ static const struct test avx512bw_all[]
     INSN(pcmpgtb,     66,   0f, 64,    vl,    b, vl),
     INSN(pcmpgtw,     66,   0f, 65,    vl,    w, vl),
     INSN(pcmpu,       66, 0f3a, 3e,    vl,   bw, vl),
+    INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
@@ -412,6 +421,7 @@ static const struct test avx512dq_512[]
 };
 
 static const struct test avx512_vbmi_all[] = {
+    INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -186,6 +186,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufps, _mask, x, x, 0b00011011, undef(), ~0)
+#   define swap2(x) B_(vpermilps, _mask, x, 0b00011011, undef(), ~0)
 #  else
 #   define broadcast_quartet(x) B(broadcastf32x4_, _mask, x, undef(), ~0)
 #   define insert_pair(x, y, p) \
@@ -200,6 +201,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f32x4_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufps, _mask, t_, t_, 0b00011011, undef(), ~0); \
 })
+#   define swap2(x) B(vpermilps, _mask, \
+                       B(shuf_f32x4_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b00011011, undef(), ~0)
 #  endif
 # elif FLOAT_SIZE == 8
 #  if VEC_SIZE >= 32
@@ -233,6 +238,7 @@ static inline bool _to_bool(byte_vec_t b
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
 #   define swap(x) B(shufpd, _mask, x, x, 0b01, undef(), ~0)
+#   define swap2(x) B_(vpermilpd, _mask, x, 0b01, undef(), ~0)
 #  else
 #   define interleave_hi(x, y) B(vpermi2varpd, _mask, x, interleave_hi, y, ~0)
 #   define interleave_lo(x, y) B(vpermt2varpd, _mask, interleave_lo, x, y, ~0)
@@ -240,6 +246,10 @@ static inline bool _to_bool(byte_vec_t b
     vec_t t_ = B(shuf_f64x2_, _mask, x, x, VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0); \
     B(shufpd, _mask, t_, t_, 0b01010101, undef(), ~0); \
 })
+#   define swap2(x) B(vpermilpd, _mask, \
+                       B(shuf_f64x2_, _mask, x, x, \
+                         VEC_SIZE == 32 ? 0b01 : 0b00011011, undef(), ~0), \
+                       0b01010101, undef(), ~0)
 #  endif
 # endif
 #elif FLOAT_SIZE == 4 && defined(__SSE__)
@@ -405,6 +415,7 @@ static inline bool _to_bool(byte_vec_t b
                              B(shuf_i32x4_, _mask, (vsi_t)(x), (vsi_t)(x), \
                                VEC_SIZE == 32 ? 0b01 : 0b00011011, (vsi_t)undef(), ~0), \
                              0b00011011, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
                               (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
@@ -442,8 +453,17 @@ static inline bool _to_bool(byte_vec_t b
                              (vsi_t)B(shuf_i64x2_, _mask, (vdi_t)(x), (vdi_t)(x), \
                                       VEC_SIZE == 32 ? 0b01 : 0b00011011, (vdi_t)undef(), ~0), \
                              0b01001110, (vsi_t)undef(), ~0))
+#   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
 #  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  if VEC_SIZE == 32
+#   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
+#  elif VEC_SIZE == 64
+#   define swap3(x) ({ \
+    vdi_t t_ = B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0); \
+    B(shuf_i64x2_, _mask, t_, t_, 0b01001110, (vdi_t)undef(), ~0); \
+})
+#  endif
 # endif
 # if INT_SIZE == 4
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
@@ -489,6 +509,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
+#  ifdef __AVX512VBMI__
+#   define swap2(x) ((vec_t)B(permvarqi, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
+#  endif
 # elif INT_SIZE == 2 || UINT_SIZE == 2
 #  define broadcast(x) ({ \
     vec_t t_; \
@@ -517,6 +540,7 @@ static inline bool _to_bool(byte_vec_t b
                               (0b01010101010101010101010101010101 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
+#  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
@@ -1325,6 +1349,12 @@ int simd_test(void)
     if ( !eq(swap2(src), inv) ) return __LINE__;
 #endif
 
+#ifdef swap3
+    touch(src);
+    if ( !eq(swap3(src), inv) ) return __LINE__;
+    touch(src);
+#endif
+
 #ifdef broadcast
     if ( !eq(broadcast(ELEM_COUNT + 1), src + inv) ) return __LINE__;
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -275,6 +275,8 @@ OVR(movlps);
 OVR_VFP(movnt);
 OVR_VFP(movu);
 OVR_FP(mul);
+OVR_VFP(perm);
+OVR_VFP(permil);
 OVR_VFP(shuf);
 OVR_INT(sll);
 OVR_DQ(sllv);
@@ -331,6 +333,8 @@ OVR(movntdq);
 OVR(movntdqa);
 OVR(movshdup);
 OVR(movsldup);
+OVR(permd);
+OVR(permq);
 OVR(pmovsxbd);
 OVR(pmovsxbq);
 OVR(pmovsxdq);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -434,7 +434,8 @@ static const struct ext0f38_table {
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
-    [0x0c ... 0x0f] = { .simd_size = simd_packed_fp },
+    [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x13] = { .simd_size = simd_other, .two_op = 1 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -477,6 +478,7 @@ static const struct ext0f38_table {
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x8c] = { .simd_size = simd_packed_int },
+    [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -522,10 +524,10 @@ static const struct ext0f3a_table {
     uint8_t four_op:1;
     disp8scale_t d8s:4;
 } ext0f3a_table[256] = {
-    [0x00] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
+    [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
-    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1 },
+    [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
@@ -8102,6 +8104,9 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf2): /* vpslld xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf3): /* vpsllq xmm/m128,[xyz]mm,[xyz]mm{k} */
         generate_exception_if(evex.brs, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0c): /* vpermilps [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0d): /* vpermilpd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         if ( b == 0xe2 )
             goto avx512f_no_sae;
@@ -8447,6 +8452,12 @@ x86_emulate(
         generate_exception_if(!vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
@@ -8652,6 +8663,7 @@ x86_emulate(
 
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         if ( !evex.w )
             host_and_vcpu_must_have(avx512_vbmi);
         else
@@ -9077,6 +9089,12 @@ x86_emulate(
         generate_exception_if(!vex.l || !vex.w, EXC_UD);
         goto simd_0f_imm8_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x00): /* vpermq $imm8,{y,z}mm/mem,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x01): /* vpermpd $imm8,{y,z}mm/mem,{y,z}mm{k} */
+        generate_exception_if(!evex.lr || !evex.w, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x38): /* vinserti128 $imm8,xmm/m128,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x39): /* vextracti128 $imm8,ymm,xmm/m128 */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x46): /* vperm2i128 $imm8,ymm/m256,ymm,ymm */
@@ -9096,6 +9114,12 @@ x86_emulate(
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x04): /* vpermilps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x05): /* vpermilpd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_66(0x0f3a, 0x08): /* roundps $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x09): /* roundpd $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f3a, 0x0a): /* roundss $imm8,xmm/m128,xmm */





* [PATCH v8 18/50] x86emul: support AVX512BW pack insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (16 preceding siblings ...)
  2019-03-15 10:46   ` [PATCH v8 17/50] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
@ 2019-03-15 10:46   ` Jan Beulich
  2019-05-21 11:26     ` Andrew Cooper
  2019-03-15 10:47   ` [PATCH v8 19/50] x86emul: support AVX512F floating-point conversion insns Jan Beulich
                     ` (31 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:46 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

No further test harness additions are needed - what is already there
is good enough for these rather "regular" insns.
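
For reference, what these pack insns compute, in intrinsic terms (a
minimal sketch assuming AVX512BW compiler support; not taken from the
patch):

    #include <immintrin.h>

    /* VPACKSSDW: narrow signed dwords to signed-saturated words,
     * interleaving the two sources lane by lane. */
    static __m512i pack_dw(__m512i a, __m512i b)
    {
        return _mm512_packs_epi32(a, b);
    }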

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -306,6 +306,10 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
+    INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
+    INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
+    INSN(packuswb,    66,   0f, 67,    vl,    w, vl),
     INSN(paddb,       66,   0f, fc,    vl,    b, vl),
     INSN(paddsb,      66,   0f, ec,    vl,    b, vl),
     INSN(paddsw,      66,   0f, ed,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -361,6 +361,10 @@ OVR(pextrw);
 OVR(pinsrb);
 OVR(pinsrw);
 #  ifdef __AVX512VL__
+OVR(packssdw);
+OVR(packsswb);
+OVR(packusdw);
+OVR(packuswb);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -453,7 +453,7 @@ static const struct ext0f38_table {
     [0x25] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
-    [0x2b] = { .simd_size = simd_packed_int },
+    [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
@@ -6744,6 +6744,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         op_bytes = 16 << evex.lr;
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x63): /* vpacksswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x67): /* vpackuswb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
@@ -6805,6 +6807,12 @@ x86_emulate(
         avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x6b): /* vpackssdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2b): /* vpackusdw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6c): /* vpunpcklqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x6d): /* vpunpckhqdq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;






* [PATCH v8 19/50] x86emul: support AVX512F floating-point conversion insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (17 preceding siblings ...)
  2019-03-15 10:46   ` [PATCH v8 18/50] x86emul: support AVX512BW pack insns Jan Beulich
@ 2019-03-15 10:47   ` Jan Beulich
  2019-05-21 11:33     ` Andrew Cooper
  2019-03-15 10:47   ` [PATCH v8 20/50] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
                     ` (30 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:47 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVTPS2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size change for twobyte_table[0x5a] is benign to pre-existing
code, but allows decode_disp8scale() to work as is here.

Also correct the comment on an AVX counterpart.
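
To make the override concrete: EVEX compressed displacements multiply
an 8-bit displacement by the memory operand size, and vcvtps2pd
consumes only half a vector's worth of memory. A minimal sketch of the
resulting effective-address math (illustrative only; modrm_disp8 is a
made-up name for the raw displacement byte):

    unsigned int disp8scale = 4 + evex.lr;  /* d8s_vl: 16/32/64 bytes */

    if ( !evex.pfx && !evex.brs )           /* vcvtps2pd, no broadcast */
        --disp8scale;                       /* memory operand is VL/2 */

    int32_t disp = (int8_t)modrm_disp8 * (1 << disp8scale);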

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: ea.type == OP_* -> ea.type != OP_*. Re-base.
v6: Re-base over changes earlier in the series.
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,6 +109,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
+    INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
+    INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -181,7 +181,9 @@ static inline bool _to_bool(byte_vec_t b
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklps, _mask, x, y, undef(), ~0)
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -68,6 +68,7 @@ typedef short __attribute__((vector_size
 typedef int __attribute__((vector_size(VEC_SIZE))) vsi_t;
 #if VEC_SIZE >= 8
 typedef long long __attribute__((vector_size(VEC_SIZE))) vdi_t;
+typedef double __attribute__((vector_size(VEC_SIZE))) vdf_t;
 #endif
 
 #if ELEM_SIZE == 1
@@ -93,6 +94,7 @@ typedef char __attribute__((vector_size(
 typedef short __attribute__((vector_size(HALF_SIZE))) vhi_half_t;
 typedef int __attribute__((vector_size(HALF_SIZE))) vsi_half_t;
 typedef long long __attribute__((vector_size(HALF_SIZE))) vdi_half_t;
+typedef float __attribute__((vector_size(HALF_SIZE))) vsf_half_t;
 # endif
 
 # if ELEM_COUNT >= 4
@@ -328,6 +330,13 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(cvtpd2psx);
+OVR(cvtpd2psy);
+OVR(cvtph2ps);
+OVR(cvtps2pd);
+OVR(cvtps2ph);
+OVR(cvtsd2ss);
+OVR(cvtss2sd);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3871,6 +3871,49 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vcvtph2ps 32(%ecx),%zmm7{%k4}...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(evex_vcvtph2ps);
+        decl_insn(evex_vcvtps2ph);
+
+        asm volatile ( "vpternlogd $0x81, %%zmm7, %%zmm7, %%zmm7\n\t"
+                       "kmovw %1,%%k4\n"
+                       put_insn(evex_vcvtph2ps, "vcvtph2ps 32(%0), %%zmm7%{%%k4%}")
+                       :: "c" (NULL), "r" (0x3333) );
+
+        set_insn(evex_vcvtph2ps);
+        memset(res, 0xff, 128);
+        res[8] = 0x40003c00; /* (1.0, 2.0) */
+        res[10] = 0x44004200; /* (3.0, 4.0) */
+        res[12] = 0x3400b800; /* (-.5, .25) */
+        res[14] = 0xbc000000; /* (0.0, -1.) */
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovups %%zmm7, %0" : "=m" (res[16]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtph2ps) )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vcvtps2ph $0,%zmm3,128(%edx){%k4}...");
+        asm volatile ( "vmovups %0, %%zmm3\n"
+                       put_insn(evex_vcvtps2ph, "vcvtps2ph $0, %%zmm3, 128(%1)%{%%k4%}")
+                       :: "m" (res[16]), "d" (NULL) );
+
+        set_insn(evex_vcvtps2ph);
+        regs.edx = (unsigned long)res;
+        memset(res + 32, 0xcc, 32);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(evex_vcvtps2ph) )
+            goto fail;
+        res[15] = res[13] = res[11] = res[9] = 0xcccccccc;
+        if ( memcmp(res + 8, res + 32, 32) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -310,7 +310,8 @@ static const struct twobyte_table {
     [0x52 ... 0x53] = { DstImplicit|SrcMem|ModRM|TwoOp, simd_single_fp },
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
-    [0x5a ... 0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -437,7 +438,7 @@ static const struct ext0f38_table {
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x13] = { .simd_size = simd_other, .two_op = 1 },
+    [0x13] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x14 ... 0x16] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x17] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x18] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 2 },
@@ -541,7 +542,7 @@ static const struct ext0f3a_table {
     [0x19] = { .simd_size = simd_128, .to_mem = 1, .two_op = 1, .d8s = 4 },
     [0x1a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
     [0x1b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1 },
+    [0x1d] = { .simd_size = simd_other, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x1e ... 0x1f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_none, .d8s = 0 },
     [0x21] = { .simd_size = simd_other, .d8s = 2 },
@@ -3071,6 +3072,11 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x5a: /* vcvtps2pd needs special casing */
+                if ( disp8scale && !evex.pfx && !evex.brs )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -5998,6 +6004,7 @@ x86_emulate(
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5d):    /* vmin{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5e):    /* vdiv{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5f):    /* vmax{p,s}{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    avx512f_all_fp:
         generate_exception_if((evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK) ||
                                (ea.type != OP_REG && evex.brs &&
                                 (evex.pfx & VEX_PREFIX_SCALAR_MASK))),
@@ -6557,7 +6564,7 @@ x86_emulate(
         goto simd_zmm;
 
     CASE_SIMD_ALL_FP(, 0x0f, 0x5a):        /* cvt{p,s}{s,d}2{p,s}{s,d} xmm/mem,xmm */
-    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} xmm/mem,xmm */
+    CASE_SIMD_ALL_FP(_VEX, 0x0f, 0x5a):    /* vcvtp{s,d}2p{s,d} {x,y}mm/mem,{x,y}mm */
                                            /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm */
         op_bytes = 4 << (((vex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + vex.l) +
                          !!(vex.pfx & VEX_PREFIX_DOUBLE_MASK));
@@ -6566,6 +6573,12 @@ x86_emulate(
             goto simd_0f_sse2;
         goto simd_0f_avx;
 
+    CASE_SIMD_ALL_FP(_EVEX, 0x0f, 0x5a):   /* vcvtp{s,d}2p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+                                           /* vcvts{s,d}2s{s,d} xmm/mem,xmm,xmm{k} */
+        op_bytes = 4 << (((evex.pfx & VEX_PREFIX_SCALAR_MASK) ? 0 : 1 + evex.lr) +
+                         evex.w);
+        goto avx512f_all_fp;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x5b):     /* cvt{ps,dq}2{dq,ps} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x5b): /* vcvt{ps,dq}2{dq,ps} {x,y}mm/mem,{x,y}mm */
     case X86EMUL_OPC_F3(0x0f, 0x5b):       /* cvttps2dq xmm/mem,xmm */
@@ -8455,6 +8468,15 @@ x86_emulate(
         op_bytes = 8 << vex.l;
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w || (ea.type != OP_REG && evex.brs), EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.brs )
+            avx512_vlen_check(false);
+        op_bytes = 8 << evex.lr;
+        elem_bytes = 2;
+        goto simd_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
         generate_exception_if(!vex.l || vex.w, EXC_UD);
@@ -9283,27 +9305,79 @@ x86_emulate(
         goto avx512f_imm8_no_sae;
 
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,{x,y}mm,xmm/mem */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x1d): /* vcvtps2ph $imm8,[xyz]mm,{x,y}mm/mem{k} */
     {
         uint32_t mxcsr;
 
-        generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
-        host_and_vcpu_must_have(f16c);
         fail_if(!ops->write);
+        if ( evex_encoded() )
+        {
+            generate_exception_if((evex.w || evex.reg != 0xf || !evex.RX ||
+                                   (ea.type != OP_REG && (evex.z || evex.brs))),
+                                  EXC_UD);
+            host_and_vcpu_must_have(avx512f);
+            avx512_vlen_check(false);
+            opc = init_evex(stub);
+        }
+        else
+        {
+            generate_exception_if(vex.w || vex.reg != 0xf, EXC_UD);
+            host_and_vcpu_must_have(f16c);
+            opc = init_prefixes(stub);
+        }
+
+        op_bytes = 8 << evex.lr;
 
-        opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
         if ( ea.type == OP_MEM )
         {
             /* Convert memory operand to (%rAX). */
             vex.b = 1;
+            evex.b = 1;
             opc[1] &= 0x38;
         }
         opc[2] = imm1;
-        insn_bytes = PFX_BYTES + 3;
+        if ( evex_encoded() )
+        {
+            unsigned int full = 0;
+
+            insn_bytes = EVEX_PFX_BYTES + 3;
+            copy_EVEX(opc, evex);
+
+            if ( ea.type == OP_MEM && evex.opmsk )
+            {
+                full = 0xffff >> (16 - op_bytes / 2);
+                op_mask &= full;
+                if ( !op_mask )
+                    goto complete_insn;
+
+                first_byte = __builtin_ctz(op_mask);
+                op_mask >>= first_byte;
+                full >>= first_byte;
+                first_byte <<= 1;
+                op_bytes = (32 - __builtin_clz(op_mask)) << 1;
+
+                /*
+                 * We may need to read (parts of) the memory operand for the
+                 * purpose of merging in order to avoid splitting the write
+                 * below into multiple ones.
+                 */
+                if ( op_mask != full &&
+                     (rc = ops->read(ea.mem.seg,
+                                     truncate_ea(ea.mem.off + first_byte),
+                                     (void *)mmvalp + first_byte, op_bytes,
+                                     ctxt)) != X86EMUL_OKAY )
+                    goto done;
+            }
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 3;
+            copy_VEX(opc, vex);
+        }
         opc[3] = 0xc3;
 
-        copy_VEX(opc, vex);
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
                     "=m" (*mmvalp), [mxcsr] "=m" (mxcsr) : "a" (mmvalp));
@@ -9312,7 +9386,8 @@ x86_emulate(
 
         if ( ea.type == OP_MEM )
         {
-            rc = ops->write(ea.mem.seg, ea.mem.off, mmvalp, 8 << vex.l, ctxt);
+            rc = ops->write(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
+                            (void *)mmvalp + first_byte, op_bytes, ctxt);
             if ( rc != X86EMUL_OKAY )
             {
                 asm volatile ( "ldmxcsr %0" :: "m" (mxcsr) );





* [PATCH v8 20/50] x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (18 preceding siblings ...)
  2019-03-15 10:47   ` [PATCH v8 19/50] x86emul: support AVX512F floating-point conversion insns Jan Beulich
@ 2019-03-15 10:47   ` Jan Beulich
  2019-05-21 11:37     ` Andrew Cooper
  2019-03-15 10:52   ` [PATCH v8 21/50] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
                     ` (29 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:47 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

... including the two AVX512DQ forms, which share these encodings,
just with EVEX.W set there.

VCVTDQ2PD, sharing its main opcode with others, needs a "manual"
override of disp8scale.

The simd_size changes for the twobyte_table[] entries are benign to
pre-existing code, but allow decode_disp8scale() to work as is here.

The placement of the 0xe6 case block, while wrong at this point, once
again anticipates further case labels being added.
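
The to_int() harness additions check a conversion round trip; a
standalone analogue (minimal sketch, assuming AVX512F intrinsics and
512-bit vectors; not the harness code itself):

    #include <immintrin.h>

    /* Integral float inputs must survive float -> int32 -> float. */
    static __m512 to_int(__m512 x)
    {
        return _mm512_cvtepi32_ps(_mm512_cvtps_epi32(x));
    }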

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: ea.type == OP_* -> ea.type != OP_*. Re-base.
v6: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,8 +109,12 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
+    INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
+    INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
+    INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
@@ -398,6 +402,8 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
+    INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -92,6 +92,13 @@ static inline bool _to_bool(byte_vec_t b
 # define to_int(x) ((vec_t){ (int)(x)[0] })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
+#elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
+      (VEC_SIZE == 64 || defined(__AVX512VL__))
+# if FLOAT_SIZE == 4
+#  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+# elif FLOAT_SIZE == 8
+#  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+# endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
 #  define to_int(x) __builtin_ia32_cvtdq2ps(__builtin_ia32_cvtps2dq(x))
@@ -1142,15 +1149,21 @@ int simd_test(void)
     touch(src);
     if ( !eq(x * -alt, -src) ) return __LINE__;
 
-# if defined(recip) && defined(to_int)
+# ifdef to_int
+
+    touch(src);
+    x = to_int(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
 
+#  ifdef recip
     touch(src);
     x = recip(src);
     touch(src);
     touch(x);
     if ( !eq(to_int(recip(x)), src) ) return __LINE__;
 
-#  ifdef rsqrt
+#   ifdef rsqrt
     x = src * src;
     touch(x);
     y = rsqrt(x);
@@ -1158,6 +1171,7 @@ int simd_test(void)
     if ( !eq(to_int(recip(y)), src) ) return __LINE__;
     touch(src);
     if ( !eq(to_int(y), to_int(recip(src))) ) return __LINE__;
+#   endif
 #  endif
 
 # endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -244,6 +244,7 @@ asm ( ".macro override insn    \n\t"
 OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
+OVR_VFP(cvtdq2);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
@@ -330,13 +331,19 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(cvtpd2dqx);
+OVR(cvtpd2dqy);
 OVR(cvtpd2psx);
 OVR(cvtpd2psy);
 OVR(cvtph2ps);
+OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
 OVR(cvtss2sd);
+OVR(cvttpd2dqx);
+OVR(cvttpd2dqy);
+OVR(cvttps2dq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -311,7 +311,7 @@ static const struct twobyte_table {
     [0x54 ... 0x57] = { DstImplicit|SrcMem|ModRM, simd_packed_fp, d8s_vl },
     [0x58 ... 0x59] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x5a] = { DstImplicit|SrcMem|ModRM|Mov, simd_any_fp, d8s_vl },
-    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x5b] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x5c ... 0x5f] = { DstImplicit|SrcMem|ModRM, simd_any_fp, d8s_vl },
     [0x60 ... 0x62] = { DstImplicit|SrcMem|ModRM, simd_other, d8s_vl },
     [0x63 ... 0x67] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
@@ -375,7 +375,7 @@ static const struct twobyte_table {
     [0xe0] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xe1 ... 0xe2] = { DstImplicit|SrcMem|ModRM, simd_128, 4 },
     [0xe3 ... 0xe5] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
-    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0xe6] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0xe7] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
     [0xe8 ... 0xef] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0xf0] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
@@ -3081,6 +3081,11 @@ x86_decode(
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
                 break;
+
+            case 0xe6: /* vcvtdq2pd needs special casing */
+                if ( disp8scale && evex.pfx == vex_f3 && !evex.w && !evex.brs )
+                    --disp8scale;
+                break;
             }
             break;
 
@@ -6587,6 +6592,22 @@ x86_emulate(
         op_bytes = 16 << vex.l;
         goto simd_0f_cvt;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x5b): /* vcvtps2dq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x5b): /* vcvttps2dq [xyz]mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+            host_and_vcpu_must_have(avx512f);
+        if ( ea.type != OP_REG || !evex.brs )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 16 << evex.lr;
+        goto simd_zmm;
+
     CASE_SIMD_PACKED_INT(0x0f, 0x60):    /* punpcklbw {,x}mm/mem,{,x}mm */
     case X86EMUL_OPC_VEX_66(0x0f, 0x60): /* vpunpcklbw {x,y}mm/mem,{x,y}mm,{x,y}mm */
     CASE_SIMD_PACKED_INT(0x0f, 0x61):    /* punpcklwd {,x}mm/mem,{,x}mm */
@@ -7251,6 +7272,27 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_xmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xe6):   /* vcvttpd2dq [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
+        if ( evex.pfx != vex_f3 )
+            host_and_vcpu_must_have(avx512f);
+        else if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        else
+        {
+            host_and_vcpu_must_have(avx512f);
+            generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
+        }
+        if ( ea.type != OP_REG || !evex.brs )
+            avx512_vlen_check(false);
+        d |= TwoOp;
+        op_bytes = 8 << (evex.w + evex.lr);
+        goto simd_zmm;
+
     case X86EMUL_OPC_F2(0x0f, 0xf0):     /* lddqu m128,xmm */
     case X86EMUL_OPC_VEX_F2(0x0f, 0xf0): /* vlddqu mem,{x,y}mm */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v8 21/50] x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (19 preceding siblings ...)
  2019-03-15 10:47   ` [PATCH v8 20/50] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
@ 2019-03-15 10:52   ` Jan Beulich
  2019-05-21 11:44     ` Andrew Cooper
  2019-03-15 10:52   ` [PATCH v8 22/50] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
                     ` (28 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:52 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVT{,T}S{S,D}2SI use EVEX.W for their destination (register) operand
size rather than for that of their (possibly memory) source, and hence
need a "manual" override of disp8scale.

While the SDM claims that EVEX.L'L needs to be zero for the 32-bit forms
of VCVT{,U}SI2SD (exception type E10NF), observations on my test system
do not confirm this (and I've got informal confirmation that this is a
doc mistake). Nevertheless, to be on the safe side, force evex.lr to
be zero in this case when constructing the stub.

Slightly adjust the scalar to_int() in the test harness, to increase the
chances of the operand ending up in memory.
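
To illustrate the EVEX.W point with the legacy-equivalent intrinsics
(sketch only; the memory/source element of cvtsd2si is 8 bytes either
way, only the GPR destination width differs):

    #include <immintrin.h>

    /* EVEX.W=0 form: 32-bit destination. */
    int cvt32(__m128d x) { return _mm_cvtsd_si32(x); }

    /* EVEX.W=1 form: 64-bit destination (x86-64 only). */
    long long cvt64(__m128d x) { return _mm_cvtsd_si64(x); }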

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix VCVTSI2SS - cannot re-use VMOV{D,Q} code here, as the register
    form can't be converted to a memory one when embedded rounding is in
    effect. Force evex.lr to zero for 32-bit VCVTSI2SD. Permit embedded
    rounding for VCVT{,T}S{S,D}2SI. Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -117,8 +117,16 @@ static const struct test avx512f_all[] =
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
+    INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
+    INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
+    INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -746,8 +754,9 @@ static void test_group(const struct test
                 break;
 
             case ESZ_dq:
-                test_pair(&tests[i], vl[j], ESZ_d, "d", ESZ_q, "q",
-                          instr, ctxt);
+                test_pair(&tests[i], vl[j], ESZ_d,
+                          strncmp(tests[i].mnemonic, "cvt", 3) ? "d" : "l",
+                          ESZ_q, "q", instr, ctxt);
                 break;
 
 #ifdef __i386__
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -89,7 +89,7 @@ static inline bool _to_bool(byte_vec_t b
 #endif
 
 #if VEC_SIZE == FLOAT_SIZE
-# define to_int(x) ((vec_t){ (int)(x)[0] })
+# define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -340,10 +340,28 @@ OVR(cvtps2dq);
 OVR(cvtps2pd);
 OVR(cvtps2ph);
 OVR(cvtsd2ss);
+OVR(cvtsd2si);
+OVR(cvtsd2sil);
+OVR(cvtsd2siq);
+OVR(cvtsi2sd);
+OVR(cvtsi2sdl);
+OVR(cvtsi2sdq);
+OVR(cvtsi2ss);
+OVR(cvtsi2ssl);
+OVR(cvtsi2ssq);
 OVR(cvtss2sd);
+OVR(cvtss2si);
+OVR(cvtss2sil);
+OVR(cvtss2siq);
 OVR(cvttpd2dqx);
 OVR(cvttpd2dqy);
 OVR(cvttps2dq);
+OVR(cvttsd2si);
+OVR(cvttsd2sil);
+OVR(cvttsd2siq);
+OVR(cvttss2si);
+OVR(cvttss2sil);
+OVR(cvttss2siq);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -296,7 +296,7 @@ static const struct twobyte_table {
     [0x22 ... 0x23] = { DstImplicit|SrcMem|ModRM },
     [0x28] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x29] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
+    [0x2a] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x2b] = { DstMem|SrcImplicit|ModRM|Mov, simd_any_fp, d8s_vl },
     [0x2c ... 0x2d] = { DstImplicit|SrcMem|ModRM|Mov, simd_other },
     [0x2e ... 0x2f] = { ImplicitOps|ModRM|TwoOp, simd_none, d8s_dq },
@@ -3072,6 +3072,12 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x2c: /* vcvtts{s,d}2si need special casing */
+            case 0x2d: /* vcvts{s,d}2si need special casing */
+                if ( evex_encoded() )
+                    disp8scale = 2 + (evex.pfx & VEX_PREFIX_DOUBLE_MASK);
+                break;
+
             case 0x5a: /* vcvtps2pd needs special casing */
                 if ( disp8scale && !evex.pfx && !evex.brs )
                     --disp8scale;
@@ -6199,6 +6205,48 @@ x86_emulate(
         state->simd_size = simd_none;
         goto simd_0f_rm;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+        generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.brs),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.brs )
+            avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        if ( ea.type == OP_MEM )
+        {
+            rc = read_ulong(ea.mem.seg, ea.mem.off, &src.val,
+                            rex_prefix & REX_W ? 8 : 4, ctxt, ops);
+            if ( rc != X86EMUL_OKAY )
+                goto done;
+        }
+        else
+            src.val = *ea.reg;
+
+        opc = init_evex(stub);
+        opc[0] = b;
+        /* Convert memory/GPR source to %rAX. */
+        evex.b = 1;
+        if ( !mode_64bit() )
+            evex.w = 0;
+        /*
+         * SDM version 067 claims that exception type E10NF implies #UD when
+         * EVEX.L'L is non-zero for 32-bit VCVT{,U}SI2SD. Experimentally this
+         * cannot be confirmed, but be on the safe side for the stub.
+         */
+        if ( !evex.w && evex.pfx == vex_f2 )
+            evex.lr = 0;
+        opc[1] = (modrm & 0x38) | 0xc0;
+        insn_bytes = EVEX_PFX_BYTES + 2;
+        opc[2] = 0xc3;
+
+        copy_EVEX(opc, evex);
+        invoke_stub("", "", "=g" (dummy) : "a" (src.val));
+
+        put_stub(stub);
+        state->simd_size = simd_none;
+        break;
+
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2c):     /* cvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_VEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(, 0x0f, 0x2d):     /* cvts{s,d}2si xmm/mem,reg */
@@ -6222,14 +6270,17 @@ x86_emulate(
         }
 
         opc = init_prefixes(stub);
+    cvts_2si:
         opc[0] = b;
         /* Convert GPR destination to %rAX and memory operand to (%rCX). */
         rex_prefix &= ~REX_R;
         vex.r = 1;
+        evex.r = 1;
         if ( ea.type == OP_MEM )
         {
             rex_prefix &= ~REX_B;
             vex.b = 1;
+            evex.b = 1;
             opc[1] = 0x01;
 
             rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp,
@@ -6240,11 +6291,22 @@ x86_emulate(
         else
             opc[1] = modrm & 0xc7;
         if ( !mode_64bit() )
+        {
             vex.w = 0;
-        insn_bytes = PFX_BYTES + 2;
+            evex.w = 0;
+        }
+        if ( evex_encoded() )
+        {
+            insn_bytes = EVEX_PFX_BYTES + 2;
+            copy_EVEX(opc, evex);
+        }
+        else
+        {
+            insn_bytes = PFX_BYTES + 2;
+            copy_REX_VEX(opc, rex_prefix, vex);
+        }
         opc[2] = 0xc3;
 
-        copy_REX_VEX(opc, rex_prefix, vex);
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
 
@@ -6252,6 +6314,18 @@ x86_emulate(
         state->simd_size = simd_none;
         break;
 
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+        generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
+                               (ea.type != OP_REG && evex.brs)),
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.brs )
+            avx512_vlen_check(true);
+        get_fpu(X86EMUL_FPU_zmm);
+        opc = init_evex(stub);
+        goto cvts_2si;
+
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2e):     /* ucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x2e): /* vucomis{s,d} xmm/mem,xmm */
     CASE_SIMD_PACKED_FP(, 0x0f, 0x2f):     /* comis{s,d} xmm/mem,xmm */





* [PATCH v8 22/50] x86emul: support AVX512DQ packed quad-int/FP conversion insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (20 preceding siblings ...)
  2019-03-15 10:52   ` [PATCH v8 21/50] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
@ 2019-03-15 10:52   ` Jan Beulich
  2019-05-21 11:53     ` Andrew Cooper
  2019-03-15 10:53   ` [PATCH v8 23/50] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
                     ` (27 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:52 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

VCVT{,T}PS2QQ, sharing their main opcodes with others, once again need
"manual" overrides of disp8scale.

While not directly related here, also add a scalar variant of to_wint()
to the test harness.
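
The new to_wint() similarly round-trips through signed 64-bit
elements; a standalone analogue (minimal sketch, assuming AVX512DQ
intrinsics; not the harness code itself):

    #include <immintrin.h>

    /* Integral double inputs must survive double -> int64 -> double. */
    static __m512d to_wint(__m512d x)
    {
        return _mm512_cvtepi64_pd(_mm512_cvtpd_epi64(x));
    }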

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Workaround for gcc 7 quirk.
v5: Re-base over changes earlier in the series.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -410,8 +410,12 @@ static const struct test avx512dq_all[]
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
+    INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
+    INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -90,14 +90,35 @@ static inline bool _to_bool(byte_vec_t b
 
 #if VEC_SIZE == FLOAT_SIZE
 # define to_int(x) ({ int i_ = (x)[0]; touch(i_); ((vec_t){ i_ }); })
+# ifdef __x86_64__
+#  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) ({ \
+    vsf_half_t t_ = low_half(x); \
+    vdi_t lo_, hi_; \
+    touch(t_); \
+    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    t_ = high_half(x); \
+    touch(t_); \
+    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    touch(lo_); touch(hi_); \
+    insert_half(insert_half(undef(), \
+                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+})
+#  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
 # if FLOAT_SIZE == 4
@@ -121,6 +142,21 @@ static inline bool _to_bool(byte_vec_t b
 })
 #endif
 
+#if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
+# define low_half(x) (x)
+# define high_half(x) B_(movhlps, , undef(), x)
+/*
+ * GCC 7 (and perhaps earlier) reports a bogus type mismatch for the conditional
+ * expression below. All works well with this no-op wrapper.
+ */
+static inline vec_t movlhps(vec_t x, vec_t y) {
+    return __builtin_ia32_movlhps(x, y);
+}
+# define insert_pair(x, y, p) \
+    ((p) ? movlhps(x, y) \
+         : ({ vec_t t_ = (x); t_[0] = (y)[0]; t_[1] = (y)[1]; t_; }))
+#endif
+
 #if VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW_A__)
 # define max __builtin_ia32_pfmax
 # define min __builtin_ia32_pfmin
@@ -149,13 +185,16 @@ static inline bool _to_bool(byte_vec_t b
 # if ELEM_COUNT == 8 /* vextractf{32,64}x4 */ || \
      (ELEM_COUNT == 16 && ELEM_SIZE == 4 && defined(__AVX512DQ__)) /* vextractf32x8 */ || \
      (ELEM_COUNT == 4 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
-#  define low_half(x) ({ \
+#  define _half(x, lh) ({ \
     half_t t_; \
-    asm ( "vextractf%c[w]x%c[n] $0, %[s], %[d]" \
+    asm ( "vextractf%c[w]x%c[n] %[sel], %[s], %[d]" \
           : [d] "=m" (t_) \
-          : [s] "v" (x), [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
+          : [s] "v" (x), [sel] "i" (lh), \
+            [w] "i" (ELEM_SIZE * 8), [n] "i" (ELEM_COUNT / 2) ); \
     t_; \
 })
+#  define low_half(x)  _half(x, 0)
+#  define high_half(x) _half(x, 1)
 # endif
 # if (ELEM_COUNT == 16 && ELEM_SIZE == 4) /* vextractf32x4 */ || \
      (ELEM_COUNT == 8 && ELEM_SIZE == 8 && defined(__AVX512DQ__)) /* vextractf64x2 */
@@ -1176,6 +1215,13 @@ int simd_test(void)
 
 # endif
 
+# ifdef to_wint
+    touch(src);
+    x = to_wint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
 # ifdef sqrt
     x = src * src;
     touch(x);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -325,6 +325,8 @@ static const struct twobyte_table {
     [0x77] = { DstImplicit|SrcNone },
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3083,6 +3085,12 @@ x86_decode(
                     --disp8scale;
                 break;
 
+            case 0x7a: /* vcvttps2qq needs special casing */
+            case 0x7b: /* vcvtps2qq needs special casing */
+                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.brs )
+                    --disp8scale;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -7355,7 +7363,13 @@ x86_emulate(
         if ( evex.pfx != vex_f3 )
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
+        {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2qq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512dq);
+        }
         else
         {
             host_and_vcpu_must_have(avx512f);





* [PATCH v8 23/50] x86emul: support AVX512{F, DQ} uint-to-FP conversion insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (21 preceding siblings ...)
  2019-03-15 10:52   ` [PATCH v8 22/50] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
@ 2019-03-15 10:53   ` Jan Beulich
  2019-05-21 11:58     ` Andrew Cooper
  2019-03-15 10:54   ` [PATCH v8 24/50] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
                     ` (26 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:53 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Some "manual" overrides of disp8scale are needed here again. In
particular code ends up simpler when using d8s_dq64 in the
twobyte_table[] entry.

Test harness additions will be done once the reverse conversions are
also available.
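
For reference, the unsigned conversions being added here, in intrinsic
terms (minimal sketch, assuming AVX512F/AVX512DQ compiler support; not
part of the patch):

    #include <immintrin.h>

    /* AVX512F: unsigned dword -> single. */
    static __m512 u32_to_ps(__m512i u) { return _mm512_cvtepu32_ps(u); }

    /* AVX512DQ: unsigned qword -> double. */
    static __m512d u64_to_pd(__m512i u) { return _mm512_cvtepu64_pd(u); }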

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -127,6 +127,10 @@ static const struct test avx512f_all[] =
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
+    INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
+    INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
+    INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
@@ -416,6 +420,8 @@ static const struct test avx512dq_all[]
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
+    INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -326,7 +326,7 @@ static const struct twobyte_table {
     [0x78] = { ImplicitOps|ModRM },
     [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
-    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
+    [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
     [0x7e] = { DstMem|SrcImplicit|ModRM|Mov, simd_none, d8s_dq64 },
     [0x7f] = { DstMem|SrcImplicit|ModRM|Mov, simd_packed_int, d8s_vl },
@@ -3085,12 +3085,16 @@ x86_decode(
                     --disp8scale;
                 break;
 
-            case 0x7a: /* vcvttps2qq needs special casing */
-            case 0x7b: /* vcvtps2qq needs special casing */
-                if ( disp8scale && evex.pfx == vex_66 && !evex.w && !evex.brs )
+            case 0x7a: /* vcvttps2qq and vcvtudq2pd need special casing */
+                if ( disp8scale && evex.pfx != vex_f2 && !evex.w && !evex.brs )
                     --disp8scale;
                 break;
 
+            case 0x7b: /* vcvtp{s,d}2qq need special casing */
+                if ( disp8scale && evex.pfx == vex_66 )
+                    disp8scale = (evex.brs ? 2 : 3 + evex.lr) + evex.w;
+                break;
+
             case 0x7e: /* vmovq xmm/m64,xmm needs special casing */
                 if ( disp8scale == 2 && evex.pfx == vex_f3 )
                     disp8scale = 3;
@@ -6214,6 +6218,7 @@ x86_emulate(
         goto simd_0f_rm;
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2a): /* vcvtsi2s{s,d} r/m,xmm,xmm */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x7b): /* vcvtusi2s{s,d} r/m,xmm,xmm */
         generate_exception_if(evex.opmsk || (ea.type != OP_REG && evex.brs),
                               EXC_UD);
         host_and_vcpu_must_have(avx512f);
@@ -6680,6 +6685,8 @@ x86_emulate(
         /* fall through */
     case X86EMUL_OPC_EVEX(0x0f, 0x5b):    /* vcvtdq2ps [xyz]mm/mem,[xyz]mm{k} */
                                           /* vcvtqq2ps [xyz]mm/mem,{x,y}mm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f, 0x7a): /* vcvtudq2ps [xyz]mm/mem,[xyz]mm{k} */
+                                          /* vcvtuqq2ps [xyz]mm/mem,{x,y}mm{k} */
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
@@ -7358,6 +7365,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_F2(0x0f, 0xe6):   /* vcvtpd2dq [xyz]mm/mem,{x,y}mm{k} */
         generate_exception_if(!evex.w, EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7a):   /* vcvtudq2pd {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtuqq2pd [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f, 0xe6):   /* vcvtdq2pd {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvtqq2pd [xyz]mm/mem,[xyz]mm{k} */
         if ( evex.pfx != vex_f3 )





* [PATCH v8 24/50] x86emul: support AVX512{F, DQ} FP-to-uint conversion insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (22 preceding siblings ...)
  2019-03-15 10:53   ` [PATCH v8 23/50] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
@ 2019-03-15 10:54   ` Jan Beulich
  2019-05-21 12:09     ` Andrew Cooper
  2019-03-15 10:54   ` [PATCH v8 25/50] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
                     ` (25 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:54 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Along the lines of prior patches, VCVT{,T}PS2UQQ as well as
VCVT{,T}S{S,D}2USI need "manual" overrides of disp8scale.

The twobyte_table[] entries get altered, with their prior values
now put in place by x86_decode_twobyte() instead.
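
As a hedged aside (illustration, not patch code): with EVEX.W clear and
no broadcast, VCVT{,T}PS2UQQ read only half a vector's worth of memory,
since every 32-bit source element widens to a 64-bit result. That is why
the decode special case drops disp8scale by one in exactly that case:

    #include <stdbool.h>

    /* vl_scale is the plain vector-length based scale, e.g. 6 for zmm. */
    static unsigned int ps2uqq_disp8scale(unsigned int vl_scale,
                                          bool w, bool brs)
    {
        return (!w && !brs) ? vl_scale - 1 : vl_scale;
    }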

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v4: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -112,21 +112,29 @@ static const struct test avx512f_all[] =
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
+    INSN(cvtpd2udq,      ,   0f, 79,    vl,      q, vl),
     INSN(cvtpd2ps,     66,   0f, 5a,    vl,      q, vl),
     INSN(cvtph2ps,     66, 0f38, 13,    vl_2, d_nb, vl),
     INSN(cvtps2dq,     66,   0f, 5b,    vl,      d, vl),
     INSN(cvtps2pd,       ,   0f, 5a,    vl_2,    d, vl),
     INSN(cvtps2ph,     66, 0f3a, 1d,    vl_2, d_nb, vl),
+    INSN(cvtps2udq,      ,   0f, 79,    vl,      d, vl),
     INSN(cvtsd2si,     f2,   0f, 2d,    el,      q, el),
+    INSN(cvtsd2usi,    f2,   0f, 79,    el,      q, el),
     INSN(cvtsd2ss,     f2,   0f, 5a,    el,      q, el),
     INSN(cvtsi2sd,     f2,   0f, 2a,    el,   dq64, el),
     INSN(cvtsi2ss,     f3,   0f, 2a,    el,   dq64, el),
     INSN(cvtss2sd,     f3,   0f, 5a,    el,      d, el),
     INSN(cvtss2si,     f3,   0f, 2d,    el,      d, el),
+    INSN(cvtss2usi,    f3,   0f, 79,    el,      d, el),
     INSN(cvttpd2dq,    66,   0f, e6,    vl,      q, vl),
+    INSN(cvttpd2udq,     ,   0f, 78,    vl,      q, vl),
     INSN(cvttps2dq,    f3,   0f, 5b,    vl,      d, vl),
+    INSN(cvttps2udq,     ,   0f, 78,    vl,      d, vl),
     INSN(cvttsd2si,    f2,   0f, 2c,    el,      q, el),
+    INSN(cvttsd2usi,   f2,   0f, 78,    el,      q, el),
     INSN(cvttss2si,    f3,   0f, 2c,    el,      d, el),
+    INSN(cvttss2usi,   f3,   0f, 78,    el,      d, el),
     INSN(cvtudq2pd,    f3,   0f, 7a,    vl_2,    d, vl),
     INSN(cvtudq2ps,    f2,   0f, 7a,    vl,      d, vl),
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
@@ -415,11 +423,15 @@ static const struct test avx512dq_all[]
     INSN_PFP(andn,             0f, 55),
     INSN(broadcasti32x2, 66, 0f38, 59, el_2,  d, vl),
     INSN(cvtpd2qq,       66,   0f, 7b,   vl,  q, vl),
+    INSN(cvtpd2uqq,      66,   0f, 79,   vl,  q, vl),
     INSN(cvtps2qq,       66,   0f, 7b, vl_2,  d, vl),
+    INSN(cvtps2uqq,      66,   0f, 79, vl_2,  d, vl),
     INSN(cvtqq2pd,       f3,   0f, e6,   vl,  q, vl),
     INSN(cvtqq2ps,         ,   0f, 5b,   vl,  q, vl),
     INSN(cvttpd2qq,      66,   0f, 7a,   vl,  q, vl),
+    INSN(cvttpd2uqq,     66,   0f, 78,   vl,  q, vl),
     INSN(cvttps2qq,      66,   0f, 7a, vl_2,  d, vl),
+    INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
     INSN_PFP(or,               0f, 56),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -93,31 +93,65 @@ static inline bool _to_bool(byte_vec_t b
 # ifdef __x86_64__
 #  define to_wint(x) ({ long l_ = (x)[0]; touch(l_); ((vec_t){ l_ }); })
 # endif
+# ifdef __AVX512F__
+/*
+ * Sadly even gcc 9.x, at the time of writing, does not use VCVTUSI2S{S,D}
+ * for (at least) uint -> FP conversions, so we need to resort to builtins
+ * or inline assembly here. The full-vector parameter types of the builtins
+ * aren't very helpful for our purposes, so use inline assembly.
+ */
+#  if FLOAT_SIZE == 4
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    float __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtss2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2ss%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  elif FLOAT_SIZE == 8
+#   define to_u_int(type, x) ({ \
+    unsigned type u_; \
+    double __attribute__((vector_size(16))) t_; \
+    asm ( "vcvtsd2usi %1, %0" : "=r" (u_) : "m" ((x)[0]) ); \
+    asm ( "vcvtusi2sd%z1 %1, %0, %0" : "=v" (t_) : "m" (u_) ); \
+    (vec_t){ t_[0] }; \
+})
+#  endif
+#  define to_uint(x) to_u_int(int, x)
+#  ifdef __x86_64__
+#   define to_uwint(x) to_u_int(long, x)
+#  endif
+# endif
 #elif VEC_SIZE == 8 && FLOAT_SIZE == 4 && defined(__3dNOW__)
 # define to_int(x) __builtin_ia32_pi2fd(__builtin_ia32_pf2id(x))
 #elif defined(FLOAT_SIZE) && VEC_SIZE > FLOAT_SIZE && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
 # if FLOAT_SIZE == 4
 #  define to_int(x) BR(cvtdq2ps, _mask, BR(cvtps2dq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
+#  define to_uint(x) BR(cvtudq2ps, _mask, BR(cvtps2udq, _mask, x, (vsi_t)undef(), ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
-#   define to_wint(x) ({ \
+#   define to_w_int(x, s) ({ \
     vsf_half_t t_ = low_half(x); \
     vdi_t lo_, hi_; \
     touch(t_); \
-    lo_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    lo_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     t_ = high_half(x); \
     touch(t_); \
-    hi_ = BR(cvtps2qq, _mask, t_, (vdi_t)undef(), ~0); \
+    hi_ = BR(cvtps2 ## s ## qq, _mask, t_, (vdi_t)undef(), ~0); \
     touch(lo_); touch(hi_); \
     insert_half(insert_half(undef(), \
-                            BR(cvtqq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
-                BR(cvtqq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
+                            BR(cvt ## s ## qq2ps, _mask, lo_, (vsf_half_t){}, ~0), 0), \
+                BR(cvt ## s ## qq2ps, _mask, hi_, (vsf_half_t){}, ~0), 1); \
 })
+#   define to_wint(x) to_w_int(x, )
+#   define to_uwint(x) to_w_int(x, u)
 #  endif
 # elif FLOAT_SIZE == 8
 #  define to_int(x) B(cvtdq2pd, _mask, BR(cvtpd2dq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
+#  define to_uint(x) B(cvtudq2pd, _mask, BR(cvtpd2udq, _mask, x, (vsi_half_t){}, ~0), undef(), ~0)
 #  ifdef __AVX512DQ__
 #   define to_wint(x) BR(cvtqq2pd, _mask, BR(cvtpd2qq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
+#   define to_uwint(x) BR(cvtuqq2pd, _mask, BR(cvtpd2uqq, _mask, x, (vdi_t)undef(), ~0), undef(), ~0)
 #  endif
 # endif
 #elif VEC_SIZE == 16 && defined(__SSE2__)
@@ -1221,6 +1255,20 @@ int simd_test(void)
     touch(src);
     if ( !eq(x, src) ) return __LINE__;
 # endif
+
+# ifdef to_uint
+    touch(src);
+    x = to_uint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
+
+# ifdef to_uwint
+    touch(src);
+    x = to_uwint(src);
+    touch(src);
+    if ( !eq(x, src) ) return __LINE__;
+# endif
 
 # ifdef sqrt
     x = src * src;
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -323,8 +323,7 @@ static const struct twobyte_table {
     [0x71 ... 0x73] = { DstImplicit|SrcImmByte|ModRM, simd_none, d8s_vl },
     [0x74 ... 0x76] = { DstImplicit|SrcMem|ModRM, simd_packed_int, d8s_vl },
     [0x77] = { DstImplicit|SrcNone },
-    [0x78] = { ImplicitOps|ModRM },
-    [0x79] = { DstReg|SrcMem|ModRM, simd_packed_int },
+    [0x78 ... 0x79] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_vl },
     [0x7a] = { DstImplicit|SrcMem|ModRM|Mov, simd_packed_fp, d8s_vl },
     [0x7b] = { DstImplicit|SrcMem|ModRM|Mov, simd_other, d8s_dq64 },
     [0x7c ... 0x7d] = { DstImplicit|SrcMem|ModRM, simd_other },
@@ -2523,6 +2522,8 @@ x86_decode_twobyte(
         break;
 
     case 0x78:
+        state->desc = ImplicitOps;
+        state->simd_size = simd_none;
         switch ( vex.pfx )
         {
         case vex_66: /* extrq $imm8, $imm8, xmm */
@@ -2535,7 +2536,7 @@ x86_decode_twobyte(
     case 0x10 ... 0x18:
     case 0x28 ... 0x2f:
     case 0x50 ... 0x77:
-    case 0x79 ... 0x7d:
+    case 0x7a ... 0x7d:
     case 0x7f:
     case 0xc2 ... 0xc3:
     case 0xc5 ... 0xc6:
@@ -2557,6 +2558,12 @@ x86_decode_twobyte(
         op_bytes = mode_64bit() ? 8 : 4;
         break;
 
+    case 0x79:
+        state->desc = DstReg | SrcMem;
+        state->simd_size = simd_packed_int;
+        ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
+        break;
+
     case 0x7e:
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         if ( vex.pfx == vex_f3 ) /* movq xmm/m64,xmm */
@@ -3074,6 +3081,18 @@ x86_decode(
                 modrm_mod = 3;
                 break;
 
+            case 0x78:
+            case 0x79:
+                if ( !evex.pfx )
+                    break;
+                /* vcvt{,t}ps2uqq need special casing */
+                if ( evex.pfx == vex_66 )
+                {
+                    if ( !evex.w && !evex.brs )
+                        --disp8scale;
+                    break;
+                }
+                /* vcvt{,t}s{s,d}2usi need special casing: fall through */
             case 0x2c: /* vcvtts{s,d}2si need special casing */
             case 0x2d: /* vcvts{s,d}2si need special casing */
                 if ( evex_encoded() )
@@ -6329,6 +6348,8 @@ x86_emulate(
 
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2c): /* vcvtts{s,d}2si xmm/mem,reg */
     CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x2d): /* vcvts{s,d}2si xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x78): /* vcvtts{s,d}2usi xmm/mem,reg */
+    CASE_SIMD_SCALAR_FP(_EVEX, 0x0f, 0x79): /* vcvts{s,d}2usi xmm/mem,reg */
         generate_exception_if((evex.reg != 0xf || !evex.RX || evex.opmsk ||
                                (ea.type != OP_REG && evex.brs)),
                               EXC_UD);
@@ -6690,7 +6711,11 @@ x86_emulate(
         if ( evex.w )
             host_and_vcpu_must_have(avx512dq);
         else
+        {
+    case X86EMUL_OPC_EVEX(0x0f, 0x78):    /* vcvttp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX(0x0f, 0x79):    /* vcvtp{s,d}2udq [xyz]mm/mem,[xyz]mm{k} */
             host_and_vcpu_must_have(avx512f);
+        }
         if ( ea.type != OP_REG || !evex.brs )
             avx512_vlen_check(false);
         d |= TwoOp;
@@ -7373,6 +7398,10 @@ x86_emulate(
             host_and_vcpu_must_have(avx512f);
         else if ( evex.w )
         {
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x78):   /* vcvttps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvttpd2uqq [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0x79):   /* vcvtps2uqq {x,y}mm/mem,[xyz]mm{k} */
+                                            /* vcvtpd2uqq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7a):   /* vcvttps2qq {x,y}mm/mem,[xyz]mm{k} */
                                             /* vcvttpd2qq [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0x7b):   /* vcvtps2qq {x,y}mm/mem,[xyz]mm{k} */





* [PATCH v8 25/50] x86emul: support remaining AVX512F legacy-equivalent insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (23 preceding siblings ...)
  2019-03-15 10:54   ` [PATCH v8 24/50] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
@ 2019-03-15 10:54   ` Jan Beulich
  2019-05-21 13:06     ` Andrew Cooper
  2019-03-15 10:54   ` [PATCH v8 26/50] x86emul: support remaining AVX512BW " Jan Beulich
                     ` (24 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:54 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Plus their AVX512BW counterparts.

Take the opportunity to also eliminate a pair of open-coded instances
of scalar_1op().
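
For context (paraphrased from the test harness; illustration, not new
code in this patch): scalar_1op() runs a single scalar insn on element 0
of a vector, which is exactly the pattern the removed open-coded asm
blocks duplicated:

    # define scalar_1op(x, op) ({ \
        typeof((x)[0]) __attribute__((vector_size(16))) r_; \
        asm ( op : [out] "=&x" (r_) : [in] "m" (x) ); \
        (vec_t){ r_[0] }; \
    })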

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Re-base over changes earlier in the series.
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -193,6 +193,8 @@ static const struct test avx512f_all[] =
     INSN_PFP_NB(movu,        0f, 10),
     INSN_PFP_NB(movu,        0f, 11),
     INSN_FP(mul,             0f, 59),
+    INSN(pabsd,        66, 0f38, 1e,    vl,      d, vl),
+    INSN(pabsq,        66, 0f38, 1f,    vl,      q, vl),
     INSN(paddd,        66,   0f, fe,    vl,      d, vl),
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
@@ -276,6 +278,10 @@ static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
+    INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
+    INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
+    INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -336,6 +342,8 @@ static const struct test avx512bw_all[]
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
     INSN(movdqu16,    f2,   0f, 7f,    vl,    w, vl),
+    INSN(pabsb,       66, 0f38, 1c,    vl,    b, vl),
+    INSN(pabsw,       66, 0f38, 1d,    vl,    w, vl),
     INSN(packssdw,    66,   0f, 6b,    vl, d_nb, vl),
     INSN(packsswb,    66,   0f, 63,    vl,    w, vl),
     INSN(packusdw,    66, 0f38, 2b,    vl, d_nb, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -211,8 +211,10 @@ static inline vec_t movlhps(vec_t x, vec
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
+#  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
+#  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
 #elif defined(FLOAT_SIZE) && defined(__AVX512F__) && \
       (VEC_SIZE == 64 || defined(__AVX512VL__))
@@ -263,6 +265,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
+#  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
 #  define widen1(x) ((vec_t)BR(cvtps2pd, _mask, x, (vdf_t)undef(), ~0))
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhps, _mask, x, y, undef(), ~0)
@@ -316,6 +319,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
+#  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) B(unpckhpd, _mask, x, y, undef(), ~0)
 #   define interleave_lo(x, y) B(unpcklpd, _mask, x, y, undef(), ~0)
@@ -548,6 +552,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  endif
 # endif
 # if INT_SIZE == 4
+#  define abs(x) B(pabsd, _mask, x, undef(), ~0)
 #  define max(x, y) B(pmaxsd, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsd, _mask, x, y, undef(), ~0)
 #  define mul_full(x, y) ((vec_t)B(pmuldq, _mask, x, y, (vdi_t)undef(), ~0))
@@ -558,6 +563,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define mul_full(x, y) ((vec_t)B(pmuludq, _mask, (vsi_t)(x), (vsi_t)(y), (vdi_t)undef(), ~0))
 #  define widen1(x) ((vec_t)B(pmovzxdq, _mask, (vsi_half_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 8
+#  define abs(x) ((vec_t)B(pabsq, _mask, (vdi_t)(x), (vdi_t)undef(), ~0))
 #  define max(x, y) ((vec_t)B(pmaxsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsq, _mask, (vdi_t)(x), (vdi_t)(y), (vdi_t)undef(), ~0))
 # elif UINT_SIZE == 8
@@ -625,6 +631,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
 # endif
 # if INT_SIZE == 1
+#  define abs(x) ((vec_t)B(pabsb, _mask, (vqi_t)(x), (vqi_t)undef(), ~0))
 #  define max(x, y) ((vec_t)B(pmaxsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #  define min(x, y) ((vec_t)B(pminsb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #  define widen1(x) ((vec_t)B(pmovsxbw, _mask, (vqi_half_t)(x), (vhi_t)undef(), ~0))
@@ -637,6 +644,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  define widen2(x) ((vec_t)B(pmovzxbd, _mask, (vqi_quarter_t)(x), (vsi_t)undef(), ~0))
 #  define widen3(x) ((vec_t)B(pmovzxbq, _mask, (vqi_eighth_t)(x), (vdi_t)undef(), ~0))
 # elif INT_SIZE == 2
+#  define abs(x) B(pabsw, _mask, x, undef(), ~0)
 #  define max(x, y) B(pmaxsw, _mask, x, y, undef(), ~0)
 #  define min(x, y) B(pminsw, _mask, x, y, undef(), ~0)
 #  define mul_hi(x, y) B(pmulhw, _mask, x, y, undef(), ~0)
@@ -948,19 +956,11 @@ static inline vec_t movlhps(vec_t x, vec
 #if VEC_SIZE == FLOAT_SIZE
 # define max(x, y) ((vec_t){({ typeof(x[0]) x_ = (x)[0], y_ = (y)[0]; x_ > y_ ? x_ : y_; })})
 # define min(x, y) ((vec_t){({ typeof(x[0]) x_ = (x)[0], y_ = (y)[0]; x_ < y_ ? x_ : y_; })})
-# ifdef __SSE4_1__
+# if defined(__SSE4_1__) && !defined(__AVX512F__)
 #  if FLOAT_SIZE == 4
-#   define trunc(x) ({ \
-    float __attribute__((vector_size(16))) r_; \
-    asm ( "roundss $0b1011,%1,%0" : "=x" (r_) : "m" (x) ); \
-    (vec_t){ r_[0] }; \
-})
+#   define trunc(x) scalar_1op(x, "roundss $0b1011, %[in], %[out]")
 #  elif FLOAT_SIZE == 8
-#   define trunc(x) ({ \
-    double __attribute__((vector_size(16))) r_; \
-    asm ( "roundsd $0b1011,%1,%0" : "=x" (r_) : "m" (x) ); \
-    (vec_t){ r_[0] }; \
-})
+#   define trunc(x) scalar_1op(x, "roundsd $0b1011, %[in], %[out]")
 #  endif
 # endif
 #endif
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -184,6 +184,8 @@ DECL_OCTET(half);
 # define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 # define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 # define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+# define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
+# define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
@@ -245,6 +247,7 @@ OVR_INT(broadcast);
 OVR_SFP(broadcast);
 OVR_SFP(comi);
 OVR_VFP(cvtdq2);
+OVR_INT(abs);
 OVR_FP(add);
 OVR_INT(add);
 OVR_BW(adds);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -446,7 +446,7 @@ static const struct ext0f38_table {
     [0x19] = { .simd_size = simd_scalar_opc, .two_op = 1, .d8s = 3 },
     [0x1a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x1b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
-    [0x1c ... 0x1e] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x1c ... 0x1f] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x20] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x21] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
     [0x22] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_8 },
@@ -531,8 +531,8 @@ static const struct ext0f3a_table {
     [0x02] = { .simd_size = simd_packed_int },
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
-    [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1 },
-    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc },
+    [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc, .d8s = d8s_dq },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
     [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
@@ -6917,6 +6917,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = 1 << (b & 1);
@@ -8303,6 +8305,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfa): /* vpsubd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfb): /* vpsubq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfe): /* vpaddd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1e): /* vpabsd [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1f): /* vpabsq [xyz]mm/mem,[xyz]mm{k} */
         generate_exception_if(evex.w != (b & 1), EXC_UD);
         goto avx512f_no_sae;
 
@@ -9331,6 +9335,17 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_1);
         goto simd_0f3a_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0a): /* vrndscaless $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0b): /* vrndscalesd $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x08): /* vrndscaleps $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x09): /* vrndscalepd $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.w != (b & 1), EXC_UD);
+        avx512_vlen_check(b & 2);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f3a, 0x0f):    /* palignr $imm8,mm/m64,mm */
     case X86EMUL_OPC_66(0x0f3a, 0x0f): /* palignr $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(ssse3);





* [PATCH v8 26/50] x86emul: support remaining AVX512BW legacy-equivalent insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (24 preceding siblings ...)
  2019-03-15 10:54   ` [PATCH v8 25/50] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
@ 2019-03-15 10:54   ` Jan Beulich
  2019-05-21 13:08     ` Andrew Cooper
  2019-05-23 16:10     ` Andrew Cooper
  2019-03-15 10:55   ` [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
                     ` (23 subsequent siblings)
  49 siblings, 2 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:54 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Re-base.
v5: New.
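
As an aside on the rotr() additions to simd.c below: VPALIGNR
concatenates its two sources and shifts the pair right by imm8 bytes
(per 16-byte lane in the EVEX forms), so with both sources equal it acts
as a byte-granular rotate. A scalar model, for illustration only:

    #include <stdint.h>
    #include <string.h>

    /* dst = low 16 bytes of the 32-byte value hi:lo shifted right by
     * imm8 bytes; bytes shifted in from beyond hi are zero. */
    static void palignr_model(uint8_t dst[16], const uint8_t hi[16],
                              const uint8_t lo[16], unsigned int imm8)
    {
        uint8_t tmp[48] = { 0 };

        memcpy(tmp, lo, 16);
        memcpy(tmp + 16, hi, 16);
        memcpy(dst, tmp + (imm8 < 32 ? imm8 : 32), 16);
    }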

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -354,6 +354,7 @@ static const struct test avx512bw_all[]
     INSN(paddusb,     66,   0f, dc,    vl,    b, vl),
     INSN(paddusw,     66,   0f, dd,    vl,    w, vl),
     INSN(paddw,       66,   0f, fd,    vl,    w, vl),
+    INSN(palignr,     66, 0f3a, 0f,    vl,    b, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
     INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
@@ -369,6 +370,7 @@ static const struct test avx512bw_all[]
     INSN(permw,       66, 0f38, 8d,    vl,    w, vl),
     INSN(permi2w,     66, 0f38, 75,    vl,    w, vl),
     INSN(permt2w,     66, 0f38, 7d,    vl,    w, vl),
+    INSN(pmaddubsw,   66, 0f38, 04,    vl,    b, vl),
     INSN(pmaddwd,     66,   0f, f5,    vl,    w, vl),
     INSN(pmaxsb,      66, 0f38, 3c,    vl,    b, vl),
     INSN(pmaxsw,      66,   0f, ee,    vl,    w, vl),
@@ -386,6 +388,7 @@ static const struct test avx512bw_all[]
 //       pmovw2m,     f3, 0f38, 29,           w
     INSN(pmovwb,      f3, 0f38, 30,    vl_2,  b, vl),
     INSN(pmovzxbw,    66, 0f38, 30,    vl_2,  b, vl),
+    INSN(pmulhrsw,    66, 0f38, 0b,    vl,    w, vl),
     INSN(pmulhuw,     66,   0f, e4,    vl,    w, vl),
     INSN(pmulhw,      66,   0f, e5,    vl,    w, vl),
     INSN(pmullw,      66,   0f, d5,    vl,    w, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -587,6 +587,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklbw, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0))
+#   define rotr(x, n) ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), (n) * 8, (vdi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufb, _mask, (vqi_t)(x), (vqi_t)(inv - 1), (vqi_t)undef(), ~0))
 #  elif defined(__AVX512VBMI__)
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
@@ -615,6 +616,7 @@ static inline vec_t movlhps(vec_t x, vec
 #  if VEC_SIZE == 16
 #   define interleave_hi(x, y) ((vec_t)B(punpckhwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(punpcklwd, _mask, (vhi_t)(x), (vhi_t)(y), (vhi_t)undef(), ~0))
+#   define rotr(x, n) ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), (n) * 16, (vdi_t)undef(), ~0))
 #   define swap(x) ((vec_t)B(pshufd, _mask, \
                              (vsi_t)B(pshufhw, _mask, \
                                       B(pshuflw, _mask, (vhi_t)(x), 0b00011011, (vhi_t)undef(), ~0), \
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -402,9 +402,12 @@ OVR(packssdw);
 OVR(packsswb);
 OVR(packusdw);
 OVR(packuswb);
+OVR(palignr);
+OVR(pmaddubsw);
 OVR(pmaddwd);
 OVR(pmovsxbw);
 OVR(pmovzxbw);
+OVR(pmulhrsw);
 OVR(pmulhuw);
 OVR(pmulhw);
 OVR(pmullw);
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -435,7 +435,10 @@ static const struct ext0f38_table {
     disp8scale_t d8s:4;
 } ext0f38_table[256] = {
     [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
+    [0x01 ... 0x03] = { .simd_size = simd_packed_int },
+    [0x04] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x05 ... 0x0a] = { .simd_size = simd_packed_int },
+    [0x0b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x0e ... 0x0f] = { .simd_size = simd_packed_fp },
     [0x10 ... 0x12] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -534,7 +537,8 @@ static const struct ext0f3a_table {
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x0a ... 0x0b] = { .simd_size = simd_scalar_opc, .d8s = d8s_dq },
     [0x0c ... 0x0d] = { .simd_size = simd_packed_fp },
-    [0x0e ... 0x0f] = { .simd_size = simd_packed_int },
+    [0x0e] = { .simd_size = simd_packed_int },
+    [0x0f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x14] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 0 },
     [0x15] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = 1 },
     [0x16] = { .simd_size = simd_none, .to_mem = 1, .two_op = 1, .d8s = d8s_dq64 },
@@ -6899,6 +6903,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf5): /* vpmaddwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x00): /* vpshufb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x04): /* vpmaddubsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xd5): /* vpmullw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -6917,6 +6922,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f, 0xf9): /* vpsubw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfc): /* vpaddb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f, 0xfd): /* vpaddw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x0b): /* vpmulhrsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
@@ -9374,6 +9380,10 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 4;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0f): /* vpalignr $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        goto avx512bw_imm;
+
     case X86EMUL_OPC_66(0x0f3a, 0x14): /* pextrb $imm8,xmm,r/m */
     case X86EMUL_OPC_66(0x0f3a, 0x15): /* pextrw $imm8,xmm,r/m */
     case X86EMUL_OPC_66(0x0f3a, 0x16): /* pextr{d,q} $imm8,xmm,r/m */





* [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (25 preceding siblings ...)
  2019-03-15 10:54   ` [PATCH v8 26/50] x86emul: support remaining AVX512BW " Jan Beulich
@ 2019-03-15 10:55   ` Jan Beulich
  2019-05-23 16:15     ` Andrew Cooper
  2019-03-15 10:56   ` [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns Jan Beulich
                     ` (22 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also include the only other AVX512ER insn pair, VEXP2P{D,S}.

Note that, despite the replacement of the SHA insns' table slots, there's
no need to special-case their decoding: their insn-specific code already
sets op_bytes (as was required due to simd_other), and TwoOp is of no
relevance for legacy-encoded SIMD insns.

The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
to be on the safe side. The SDM does not clarify behavior there, and
it's even more ambiguous here (without AVX512VL in the picture).
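
For reference (an illustrative check, not patch code): the AVX512F
VRCP14* / VRSQRT14* approximations guarantee a relative error below
2^-14, while the AVX512ER 28-bit forms tighten that to below 2^-28,
which is why the harness prefers the ER variants when available:

    #include <assert.h>
    #include <math.h>

    /* approx is presumed to come from a vrcp14/vrcp28 style insn. */
    static void check_recip(double x, double approx, int bits)
    {
        assert(fabs(approx * x - 1.0) < ldexp(1.0, -bits));
    }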

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix vector length check for AVX512ER insns. ea.type == OP_* ->
    ea.type != OP_*. Re-base.
v6: Re-base. AVX512ER tests now also successfully run.
v5: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -72,6 +72,9 @@ avx512bw-flts :=
 avx512dq-vecs := $(avx512f-vecs)
 avx512dq-ints := $(avx512f-ints)
 avx512dq-flts := $(avx512f-flts)
+avx512er-vecs := 64
+avx512er-ints :=
+avx512er-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -278,10 +278,14 @@ static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rcp14,        66, 0f38, 4c,    vl,     sd, vl),
+    INSN(rcp14,        66, 0f38, 4d,    el,     sd, el),
     INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
     INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
     INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
+    INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
+    INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -477,6 +481,14 @@ static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512er_512[] = {
+    INSN(exp2,    66, 0f38, c8, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, ca, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, cb, el, sd, el),
+    INSN(rsqrt28, 66, 0f38, cc, vl, sd, vl),
+    INSN(rsqrt28, 66, 0f38, cd, el, sd, el),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -837,5 +849,6 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -210,9 +210,23 @@ static inline vec_t movlhps(vec_t x, vec
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -263,6 +277,13 @@ static inline vec_t movlhps(vec_t x, vec
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14ps, _mask, x, undef(), ~0)
+#  endif
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
@@ -318,6 +339,13 @@ static inline vec_t movlhps(vec_t x, vec
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14pd, _mask, x, undef(), ~0)
+#  endif
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -178,14 +178,20 @@ DECL_OCTET(half);
 /* Sadly there are a few exceptions to the general naming rules. */
 # define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
 # define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+# define __builtin_ia32_exp2pd512_mask __builtin_ia32_exp2pd_mask
+# define __builtin_ia32_exp2ps512_mask __builtin_ia32_exp2ps_mask
 # define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
 # define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
 # define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
 # define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 # define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 # define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+# define __builtin_ia32_rcp28pd512_mask __builtin_ia32_rcp28pd_mask
+# define __builtin_ia32_rcp28ps512_mask __builtin_ia32_rcp28ps_mask
 # define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
 # define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
+# define __builtin_ia32_rsqrt28pd512_mask __builtin_ia32_rsqrt28pd_mask
+# define __builtin_ia32_rsqrt28ps512_mask __builtin_ia32_rsqrt28ps_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -24,6 +24,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f.h"
 #include "avx512bw.h"
 #include "avx512dq.h"
+#include "avx512er.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -106,6 +107,11 @@ static bool simd_check_avx512dq_vl(void)
     return cpu_has_avx512dq && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx512er(void)
+{
+    return cpu_has_avx512er;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -327,6 +333,10 @@ static const struct {
     AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
     AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
     AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
+    SIMD(AVX512ER f32 scalar,avx512er,        f4),
+    SIMD(AVX512ER f32x16,    avx512er,      64f4),
+    SIMD(AVX512ER f64 scalar,avx512er,        f8),
+    SIMD(AVX512ER f64x8,     avx512er,      64f8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -134,6 +134,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_bmi2       cp.feat.bmi2
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
+#define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -471,6 +471,10 @@ static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -510,7 +514,12 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
-    [0xc8 ... 0xcd] = { .simd_size = simd_other },
+    [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xc9] = { .simd_size = simd_other },
+    [0xca] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
     [0xf0] = { .two_op = 1 },
@@ -1873,6 +1882,7 @@ static bool vcpu_has(
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
+#define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
@@ -6168,6 +6178,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
@@ -8865,6 +8877,13 @@ x86_emulate(
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(true);
+        goto simd_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x5a): /* vbroadcasti128 m128,ymm */
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
@@ -9112,6 +9131,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
         host_and_vcpu_must_have(avx512f);
+    simd_zmm_scalar_sae:
         generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
         if ( !evex.brs )
             avx512_vlen_check(true);
@@ -9127,6 +9147,19 @@ x86_emulate(
         op_bytes = 16;
         goto simd_0f38_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc8): /* vexp2p{s,d} zmm/mem,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xca): /* vrcp28p{s,d} zmm/mem,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcc): /* vrsqrt28p{s,d} zmm/mem,zmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        generate_exception_if((ea.type != OP_REG || !evex.brs) && evex.lr != 2,
+                              EXC_UD);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcb): /* vrcp28s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcd): /* vrsqrt28s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        goto simd_zmm_scalar_sae;
+
     case X86EMUL_OPC(0x0f38, 0xf0): /* movbe m,r */
     case X86EMUL_OPC(0x0f38, 0xf1): /* movbe r,m */
         vcpu_must_have(movbe);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -102,6 +102,7 @@
 #define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
+#define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)





* [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (26 preceding siblings ...)
  2019-03-15 10:55   ` [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
@ 2019-03-15 10:56   ` Jan Beulich
  2019-05-29 12:51     ` Andrew Cooper
  2019-06-10 14:03     ` Andrew Cooper
  2019-03-15 10:56   ` [PATCH v8 29/50] x86emul: support AVX512DQ " Jan Beulich
                     ` (21 subsequent siblings)
  49 siblings, 2 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix vector length check for scalar insns. ea.type == OP_* ->
    ea.type != OP_*. Re-base.
v5: New.
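
The new simd.c test relies on the identity scale(getmant(x), getexp(x))
== x: VGETMANT with imm8 0 yields the mantissa in [1, 2), VGETEXP yields
floor(log2(|x|)), and VSCALEF multiplies by a power of two. A scalar
model using the analogous C library pair, for illustration only:

    #include <assert.h>
    #include <math.h>

    int main(void)
    {
        double x = 6.0, m;
        int e;

        m = frexp(x, &e);         /* m in [0.5, 1), x == m * 2^e */
        m *= 2; e -= 1;           /* renormalize to [1, 2), as VGETMANT does */
        assert(ldexp(m, e) == x); /* mirrors scale(getmant(x), getexp(x)) */
        return 0;
    }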

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -140,6 +140,8 @@ static const struct test avx512f_all[] =
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
     INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
+    INSN(fixupimm,     66, 0f3a, 54,    vl,     sd, vl),
+    INSN(fixupimm,     66, 0f3a, 55,    el,     sd, el),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
     INSN(fmadd132,     66, 0f38, 99,    el,     sd, el),
     INSN(fmadd213,     66, 0f38, a8,    vl,     sd, vl),
@@ -170,6 +172,10 @@ static const struct test avx512f_all[] =
     INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
     INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
     INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
+    INSN(getexp,       66, 0f38, 42,    vl,     sd, vl),
+    INSN(getexp,       66, 0f38, 43,    el,     sd, el),
+    INSN(getmant,      66, 0f3a, 26,    vl,     sd, vl),
+    INSN(getmant,      66, 0f3a, 27,    el,     sd, el),
     INSN_FP(max,             0f, 5f),
     INSN_FP(min,             0f, 5d),
     INSN_SFP(mov,            0f, 10),
@@ -286,6 +292,8 @@ static const struct test avx512f_all[] =
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
     INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
     INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
+    INSN(scalef,       66, 0f38, 2c,    vl,     sd, vl),
+    INSN(scalef,       66, 0f38, 2d,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -174,6 +174,11 @@ static inline bool _to_bool(byte_vec_t b
     asm ( op : [out] "=&x" (r_) : [in] "m" (x) ); \
     (vec_t){ r_[0] }; \
 })
+# define scalar_2op(x, y, op) ({ \
+    typeof((x)[0]) __attribute__((vector_size(16))) r_ = { x[0] }; \
+    asm ( op : [out] "=&x" (r_) : [in1] "[out]" (r_), [in2] "m" (y) ); \
+    (vec_t){ r_[0] }; \
+})
 #endif
 
 #if VEC_SIZE == 16 && FLOAT_SIZE == 4 && defined(__SSE__)
@@ -210,6 +215,8 @@ static inline vec_t movlhps(vec_t x, vec
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  define getexp(x) scalar_1op(x, "vgetexpss %[in], %[out], %[out]")
+#  define getmant(x) scalar_1op(x, "vgetmantss $0, %[in], %[out], %[out]")
 #  ifdef __AVX512ER__
 #   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
@@ -217,9 +224,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
 #  endif
+#  define scale(x, y) scalar_2op(x, y, "vscalefss %[in2], %[in1], %[out]")
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  define getexp(x) scalar_1op(x, "vgetexpsd %[in], %[out], %[out]")
+#  define getmant(x) scalar_1op(x, "vgetmantsd $0, %[in], %[out], %[out]")
 #  ifdef __AVX512ER__
 #   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
@@ -227,6 +237,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
 #   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
 #  endif
+#  define scale(x, y) scalar_2op(x, y, "vscalefsd %[in2], %[in1], %[out]")
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -274,9 +285,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
 #   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  define getexp(x) BR(getexpps, _mask, x, undef(), ~0)
+#  define getmant(x) BR(getmantps, _mask, x, 0, undef(), ~0)
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
 #   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
@@ -336,9 +350,12 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
 #   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  define getexp(x) BR(getexppd, _mask, x, undef(), ~0)
+#  define getmant(x) BR(getmantpd, _mask, x, 0, undef(), ~0)
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
 #   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
@@ -1766,6 +1783,28 @@ int simd_test(void)
 # endif
 #endif
 
+#if defined(getexp) && defined(getmant)
+    touch(src);
+    x = getmant(src);
+    touch(src);
+    y = getexp(src);
+    touch(src);
+    for ( j = i = 0; i < ELEM_COUNT; ++i )
+    {
+        if ( y[i] != j ) return __LINE__;
+
+        if ( !((i + 1) & (i + 2)) )
+            ++j;
+
+        if ( !(i & (i + 1)) && x[i] != 1 ) return __LINE__;
+    }
+# ifdef scale
+    touch(y);
+    z = scale(x, y);
+    if ( !eq(src, z) ) return __LINE__;
+# endif
+#endif
+
 #if (defined(__XOP__) && VEC_SIZE == 16 && (INT_SIZE == 2 || INT_SIZE == 4)) || \
     (defined(__AVX512F__) && defined(FLOAT_SIZE))
     return -fma_test();
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3924,6 +3924,44 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    printf("%-40s", "Testing vfixupimmpd $0,8(%edx){1to8},%zmm3,%zmm4...");
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vfixupimmpd);
+        static const struct {
+            double d[4];
+        }
+        src = { { -1, 0, 1, 2 } },
+        dst = { { 3, 4, 5, 6 } },
+        out = { { .5, -1, 90, 2 } };
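+        /*
+         * The src elements -1, 0, 1, and 2 classify as NEG, ZERO,
+         * POS_ONE, and POS respectively, mapping (see the token table
+         * below) to 1/2, -1, 90, and the source value itself.
+         */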
+
+        asm volatile ( "vbroadcastf64x4 %1, %%zmm3\n\t"
+                       "vbroadcastf64x4 %2, %%zmm4\n"
+                       put_insn(vfixupimmpd,
+                                "vfixupimmpd $0, 8(%0)%{1to8%}, %%zmm3, %%zmm4")
+                       :: "d" (NULL), "m" (src), "m" (dst) );
+
+        set_insn(vfixupimmpd);
+        /*
+         * Nibble (token) mapping (unused ones simply set to zero):
+         * 2 (ZERO)    ->  -1 (0x9)
+         * 3 (POS_ONE) ->  90 (0xc)
+         * 6 (NEG)     -> 1/2 (0xb)
+         * 7 (POS)     -> src (0x1)
+         */
+        res[2] = 0x1b00c900;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        asm volatile ( "vmovupd %%zmm4, %0" : "=m" (res[0]) );
+        if ( rc != X86EMUL_OKAY || !check_eip(vfixupimmpd) ||
+             memcmp(res + 0, &out, sizeof(out)) ||
+             memcmp(res + 8, &out, sizeof(out)) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -459,7 +459,8 @@ static const struct ext0f38_table {
     [0x26 ... 0x29] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x2a] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x2b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x2c ... 0x2d] = { .simd_size = simd_packed_fp },
+    [0x2c] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x2d] = { .simd_size = simd_packed_fp, .d8s = d8s_dq },
     [0x2e ... 0x2f] = { .simd_size = simd_packed_fp, .to_mem = 1 },
     [0x30] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x31] = { .simd_size = simd_other, .two_op = 1, .d8s = d8s_vl_by_4 },
@@ -470,6 +471,8 @@ static const struct ext0f38_table {
     [0x36 ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x42] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -563,6 +566,8 @@ static const struct ext0f3a_table {
     [0x22] = { .simd_size = simd_none, .d8s = d8s_dq64 },
     [0x23] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x25] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x26] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x27] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x30 ... 0x33] = { .simd_size = simd_other, .two_op = 1 },
     [0x38] = { .simd_size = simd_128, .d8s = 4 },
     [0x3a] = { .simd_size = simd_256, .d8s = d8s_vl_by_2 },
@@ -577,6 +582,8 @@ static const struct ext0f3a_table {
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4c] = { .simd_size = simd_packed_int, .four_op = 1 },
+    [0x54] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x55] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -2684,6 +2691,10 @@ x86_decode_0f38(
         ctxt->opcode |= MASK_INSR(vex.pfx, X86EMUL_OPC_PFX_MASK);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0, 0x2d): /* vscalefs{s,d} */
+        state->simd_size = simd_scalar_vexw;
+        break;
+
     case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
     case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
     case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
@@ -9095,6 +9106,8 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2c): /* vscalefp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x42): /* vgetexpp{s,d} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -9118,6 +9131,8 @@ x86_emulate(
             avx512_vlen_check(false);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2d): /* vscalefs{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x43): /* vgetexps{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
@@ -9681,6 +9696,21 @@ x86_emulate(
         op_bytes = 4;
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type != OP_REG || !evex.brs )
+            avx512_vlen_check(false);
+        goto simd_imm8_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
+        if ( !evex.brs )
+            avx512_vlen_check(true);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
         if ( !vex.w )





* [PATCH v8 29/50] x86emul: support AVX512DQ floating point manipulation insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (27 preceding siblings ...)
  2019-03-15 10:56   ` [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns Jan Beulich
@ 2019-03-15 10:56   ` Jan Beulich
  2019-06-10 14:06     ` [Xen-devel] " Andrew Cooper
  2019-03-15 10:56   ` [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
                     ` (20 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512DQ in the insn emulator.
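
For illustration (a scalar sketch, not code from this patch): the
harness below re-uses vreducep{s,d} with imm8 0b00001011 as a frac()
helper, leaving just the fractional part of each element:

    #include <math.h>

    /*
     * Scalar model of vreduce{ss,sd} $0x0b: imm8[7:4] = 0 (scale 2^0),
     * imm8[3] = 1 (suppress precision exceptions), imm8[1:0] = 3 (round
     * toward zero), i.e. the result is x - trunc(x).
     */
    static double reduce_0x0b(double x)
    {
        return x - trunc(x);
    }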

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix vector length check for scalar insns. Re-base.
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -457,11 +457,17 @@ static const struct test avx512dq_all[]
     INSN(cvttps2uqq,     66,   0f, 78, vl_2,  d, vl),
     INSN(cvtuqq2pd,      f3,   0f, 7a,   vl,  q, vl),
     INSN(cvtuqq2ps,      f2,   0f, 7a,   vl,  q, vl),
+    INSN(fpclass,        66, 0f3a, 66,   vl, sd, vl),
+    INSN(fpclass,        66, 0f3a, 67,   el, sd, el),
     INSN_PFP(or,               0f, 56),
 //       pmovd2m,        f3, 0f38, 39,        d
 //       pmovm2,         f3, 0f38, 38,       dq
 //       pmovq2m,        f3, 0f38, 39,        q
     INSN(pmullq,         66, 0f38, 40,   vl,  q, vl),
+    INSN(range,          66, 0f3a, 50,   vl, sd, vl),
+    INSN(range,          66, 0f3a, 51,   el, sd, el),
+    INSN(reduce,         66, 0f3a, 56,   vl, sd, vl),
+    INSN(reduce,         66, 0f3a, 57,   el, sd, el),
     INSN_PFP(xor,              0f, 57),
 };
 
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -285,10 +285,18 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_octet(x) B(broadcastf32x8_, _mask, x, undef(), ~0)
 #   define insert_octet(x, y, p) B(insertf32x8_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  ifdef __AVX512DQ__
+#   define frac(x) B(reduceps, _mask, x, 0b00001011, undef(), ~0)
+#  endif
 #  define getexp(x) BR(getexpps, _mask, x, undef(), ~0)
 #  define getmant(x) BR(getmantps, _mask, x, 0, undef(), ~0)
-#  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
-#  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define max(x, y) BR(rangeps, _mask, x, y, 0b0101, undef(), ~0)
+#   define min(x, y) BR(rangeps, _mask, x, y, 0b0100, undef(), ~0)
+#  else
+#   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
+#  endif
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
 #  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
@@ -350,10 +358,18 @@ static inline vec_t movlhps(vec_t x, vec
 #   define broadcast_quartet(x) B(broadcastf64x4_, , x, undef(), ~0)
 #   define insert_quartet(x, y, p) B(insertf64x4_, _mask, x, y, p, undef(), ~0)
 #  endif
+#  ifdef __AVX512DQ__
+#   define frac(x) B(reducepd, _mask, x, 0b00001011, undef(), ~0)
+#  endif
 #  define getexp(x) BR(getexppd, _mask, x, undef(), ~0)
 #  define getmant(x) BR(getmantpd, _mask, x, 0, undef(), ~0)
-#  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
-#  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  ifdef __AVX512DQ__
+#   define max(x, y) BR(rangepd, _mask, x, y, 0b0101, undef(), ~0)
+#   define min(x, y) BR(rangepd, _mask, x, y, 0b0100, undef(), ~0)
+#  else
+#   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
+#   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
+#  endif
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
 #  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3962,6 +3962,39 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+
+    printf("%-40s", "Testing vfpclasspsz $0x46,64(%edx),%k2...");
+    if ( stack_exec && cpu_has_avx512dq )
+    {
+        decl_insn(vfpclassps);
+
+        asm volatile ( put_insn(vfpclassps,
+                                /* 0x46: check for +/- 0 and neg. */
+                                "vfpclasspsz $0x46, 64(%0), %%k2")
+                       :: "d" (NULL) );
+
+        set_insn(vfpclassps);
+        for ( i = 0; i < 3; ++i )
+        {
+            res[16 + i * 5 + 0] = 0x00000000; /* +0 */
+            res[16 + i * 5 + 1] = 0x80000000; /* -0 */
+            res[16 + i * 5 + 2] = 0x80000001; /* -DEN */
+            res[16 + i * 5 + 3] = 0xff000000; /* -FIN */
+            res[16 + i * 5 + 4] = 0x7f000000; /* +FIN */
+        }
+        res[31] = 0;
+        regs.edx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vfpclassps) )
+            goto fail;
+        asm volatile ( "kmovw %%k2, %0" : "=g" (rc) );
+        if ( rc != 0xbdef )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -582,10 +582,16 @@ static const struct ext0f3a_table {
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4c] = { .simd_size = simd_packed_int, .four_op = 1 },
+    [0x50] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x51] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x54] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x55] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x56] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x57] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x5c ... 0x5f] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x60 ... 0x63] = { .simd_size = simd_packed_int, .two_op = 1 },
+    [0x66] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x67] = { .simd_size = simd_scalar_vexw, .two_op = 1, .d8s = d8s_dq },
     [0x68 ... 0x69] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -9696,6 +9702,10 @@ x86_emulate(
         op_bytes = 4;
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x50): /* vrangep{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x56): /* vreducep{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512dq);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512f);
@@ -9703,6 +9713,10 @@ x86_emulate(
             avx512_vlen_check(false);
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x51): /* vranges{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x57): /* vreduces{s,d} $imm8,xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512dq);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
         host_and_vcpu_must_have(avx512f);
@@ -9858,6 +9872,16 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x66): /* vfpclassp{s,d} $imm8,[xyz]mm/mem,k{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x67): /* vfpclasss{s,d} $imm8,[xyz]mm/mem,k{k} */
+        host_and_vcpu_must_have(avx512dq);
+        generate_exception_if(!evex.r || !evex.R || evex.z, EXC_UD);
+        if ( !(b & 1) )
+            goto avx512f_imm8_no_sae;
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(true);
+        goto simd_imm8_zmm;
+
     case X86EMUL_OPC(0x0f3a, 0xcc):     /* sha1rnds4 $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(sha);
         op_bytes = 16;





* [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (28 preceding siblings ...)
  2019-03-15 10:56   ` [PATCH v8 29/50] x86emul: support AVX512DQ " Jan Beulich
@ 2019-03-15 10:56   ` Jan Beulich
  2019-06-10 14:51     ` [Xen-devel] " Andrew Cooper
  2019-03-15 10:58   ` [PATCH v8 31/50] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
                     ` (19 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: Re-base. Add tests for the byte/word forms.
v5: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -109,6 +109,7 @@ static const struct test avx512f_all[] =
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
     INSN(comiss,         ,   0f, 2f,    el,      d, el),
+    INSN(compress,     66, 0f38, 8a,    vl,     sd, el),
     INSN(cvtdq2pd,     f3,   0f, e6,    vl_2,    d, vl),
     INSN(cvtdq2ps,       ,   0f, 5b,    vl,      d, vl),
     INSN(cvtpd2dq,     f2,   0f, e6,    vl,      q, vl),
@@ -140,6 +141,7 @@ static const struct test avx512f_all[] =
     INSN(cvtusi2sd,    f2,   0f, 7b,    el,   dq64, el),
     INSN(cvtusi2ss,    f3,   0f, 7b,    el,   dq64, el),
     INSN_FP(div,             0f, 5e),
+    INSN(expand,       66, 0f38, 88,    vl,     sd, el),
     INSN(fixupimm,     66, 0f3a, 54,    vl,     sd, vl),
     INSN(fixupimm,     66, 0f3a, 55,    el,     sd, el),
     INSN(fmadd132,     66, 0f38, 98,    vl,     sd, vl),
@@ -214,6 +216,7 @@ static const struct test avx512f_all[] =
     INSN(pcmpgtd,      66,   0f, 66,    vl,      d, vl),
     INSN(pcmpgtq,      66, 0f38, 37,    vl,      q, vl),
     INSN(pcmpu,        66, 0f3a, 1e,    vl,     dq, vl),
+    INSN(pcompress,    66, 0f38, 8b,    vl,     dq, el),
     INSN(permi2,       66, 0f38, 76,    vl,     dq, vl),
     INSN(permi2,       66, 0f38, 77,    vl,     sd, vl),
     INSN(permilpd,     66, 0f38, 0d,    vl,      q, vl),
@@ -222,6 +225,7 @@ static const struct test avx512f_all[] =
     INSN(permilps,     66, 0f3a, 04,    vl,      d, vl),
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
+    INSN(pexpand,      66, 0f38, 89,    vl,     dq, el),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -509,6 +513,11 @@ static const struct test avx512_vbmi_all
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
 };
 
+static const struct test avx512_vbmi2_all[] = {
+    INSN(pcompress, 66, 0f38, 63, vl, bw, el),
+    INSN(pexpand,   66, 0f38, 62, vl, bw, el),
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -865,4 +874,5 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 512);
     RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
+    RUN(avx512_vbmi2, all);
 }
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -3995,6 +3995,227 @@ int main(int argc, char **argv)
     else
         printf("skipped\n");
 
+    /*
+     * The following compress/expand tests not only make sure the accessed
+     * data is correct; by placing operands on the mapping boundaries they
+     * also verify that elements controlled by clear mask bits don't get
+     * accessed.
+     */
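+    /*
+     * E.g. the first test below uses opmask 0x55aa: the set bits select
+     * source elements 1, 3, 5, 7 (low byte 0xaa) and 8, 10, 12, 14
+     * (high byte 0x55), which is what the ptr[24 + i] checks further
+     * down expect to find in compressed form.
+     */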
+    if ( stack_exec && cpu_has_avx512f )
+    {
+        decl_insn(vpcompressd);
+        decl_insn(vpcompressq);
+        decl_insn(vpexpandd);
+        decl_insn(vpexpandq);
+        static const struct {
+            unsigned int d[16];
+        } dsrc = { { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 } };
+        static const struct {
+            unsigned long long q[8];
+        } qsrc = { { 0, 1, 2, 3, 4, 5, 6, 7 } };
+        unsigned int *ptr = res + MMAP_SZ / sizeof(*res) - 32;
+
+        printf("%-40s", "Testing vpcompressd %zmm1,24*4(%ecx){%k2}...");
+        asm volatile ( "kmovw %1, %%k2\n\t"
+                       "vmovdqu32 %2, %%zmm1\n"
+                       put_insn(vpcompressd,
+                                "vpcompressd %%zmm1, 24*4(%0)%{%%k2%}")
+                       :: "c" (NULL), "r" (0x55aa), "m" (dsrc) );
+
+        memset(ptr, 0xdb, 32 * 4);
+        set_insn(vpcompressd);
+        regs.ecx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressd) ||
+             memcmp(ptr, ptr + 8, 16 * 4) )
+            goto fail;
+        for ( i = 0; i < 4; ++i )
+            if ( ptr[24 + i] != 2 * i + 1 )
+                goto fail;
+        for ( ; i < 8; ++i )
+            if ( ptr[24 + i] != 2 * i )
+                goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandd 8*4(%edx),%zmm3{%k2}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm3, %%zmm3, %%zmm3\n"
+                       put_insn(vpexpandd,
+                                "vpexpandd 8*4(%0), %%zmm3%{%%k2%}%{z%}")
+                       :: "d" (NULL) );
+        set_insn(vpexpandd);
+        regs.edx = (unsigned long)(ptr + 16);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandd) )
+            goto fail;
+        asm ( "vmovdqa32 %%zmm1, %%zmm2%{%%k2%}%{z%}\n\t"
+              "vpcmpeqd %%zmm2, %%zmm3, %%k0\n\t"
+              "kmovw %%k0, %0"
+              : "=r" (rc) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpcompressq %zmm4,12*8(%edx){%k3}...");
+        asm volatile ( "kmovw %1, %%k3\n\t"
+                       "vmovdqu64 %2, %%zmm4\n"
+                       put_insn(vpcompressq,
+                                "vpcompressq %%zmm4, 12*8(%0)%{%%k3%}")
+                       :: "d" (NULL), "r" (0x5a), "m" (qsrc) );
+
+        memset(ptr, 0xdb, 16 * 8);
+        set_insn(vpcompressq);
+        regs.edx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressq) ||
+             memcmp(ptr, ptr + 8, 8 * 8) )
+            goto fail;
+        for ( i = 0; i < 2; ++i )
+        {
+            if ( ptr[(12 + i) * 2] != 2 * i + 1 ||
+                 ptr[(12 + i) * 2 + 1] )
+                goto fail;
+        }
+        for ( ; i < 4; ++i )
+        {
+            if ( ptr[(12 + i) * 2] != 2 * i ||
+                 ptr[(12 + i) * 2 + 1] )
+                goto fail;
+        }
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandq 4*8(%ecx),%zmm5{%k3}{z}...");
+        asm volatile ( "vpternlogq $0x81, %%zmm5, %%zmm5, %%zmm5\n"
+                       put_insn(vpexpandq,
+                                "vpexpandq 4*8(%0), %%zmm5%{%%k3%}%{z%}")
+                       :: "c" (NULL) );
+        set_insn(vpexpandq);
+        regs.ecx = (unsigned long)(ptr + 16);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandq) )
+            goto fail;
+        asm ( "vmovdqa64 %%zmm4, %%zmm6%{%%k3%}%{z%}\n\t"
+              "vpcmpeqq %%zmm5, %%zmm6, %%k0\n\t"
+              "kmovw %%k0, %0"
+              : "=r" (rc) );
+        if ( rc != 0xff )
+            goto fail;
+        printf("okay\n");
+    }
+
+#if __GNUC__ > 7 /* can't check for __AVX512VBMI2__ here */
+    if ( stack_exec && cpu_has_avx512_vbmi2 )
+    {
+        decl_insn(vpcompressb);
+        decl_insn(vpcompressw);
+        decl_insn(vpexpandb);
+        decl_insn(vpexpandw);
+        static const struct {
+            unsigned char b[64];
+        } bsrc = { { 0,  1,  2,  3,  4,  5,  6,  7,
+                     8,  9, 10, 11, 12, 13, 14, 15,
+                    16, 17, 18, 19, 20, 21, 22, 23,
+                    24, 25, 26, 27, 28, 29, 30, 31,
+                    32, 33, 34, 35, 36, 37, 38, 39,
+                    40, 41, 42, 43, 44, 45, 46, 47,
+                    48, 49, 50, 51, 52, 53, 54, 55,
+                    56, 57, 58, 59, 60, 61, 62, 63 } };
+        static const struct {
+            unsigned short w[32];
+        } wsrc = { { 0,  1,  2,  3,  4,  5,  6,  7,
+                     8,  9, 10, 11, 12, 13, 14, 15,
+                    16, 17, 18, 19, 20, 21, 22, 23,
+                    24, 25, 26, 27, 28, 29, 30, 31 } };
+        unsigned char *ptr = (void *)res + MMAP_SZ - 128;
+        unsigned long long w = 0x55555555aaaaaaaaULL;
+
+        printf("%-40s", "Testing vpcompressb %zmm1,96*1(%ecx){%k2}...");
+        asm volatile ( "kmovq %1, %%k2\n\t"
+                       "vmovdqu8 %2, %%zmm1\n"
+                       put_insn(vpcompressb,
+                                "vpcompressb %%zmm1, 96*1(%0)%{%%k2%}")
+                       :: "c" (NULL), "m" (w), "m" (bsrc) );
+
+        memset(ptr, 0xdb, 128 * 1);
+        set_insn(vpcompressb);
+        regs.ecx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressb) ||
+             memcmp(ptr, ptr + 32, 64 * 1) )
+            goto fail;
+        for ( i = 0; i < 16; ++i )
+            if ( ptr[96 + i] != 2 * i + 1 )
+                goto fail;
+        for ( ; i < 32; ++i )
+            if ( ptr[96 + i] != 2 * i )
+                goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandb 32*1(%edx),%zmm3{%k2}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm3, %%zmm3, %%zmm3\n"
+                       put_insn(vpexpandb,
+                                "vpexpandb 32*1(%0), %%zmm3%{%%k2%}%{z%}")
+                       :: "d" (NULL) );
+        set_insn(vpexpandb);
+        regs.edx = (unsigned long)(ptr + 64);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandb) )
+            goto fail;
+        asm ( "vmovdqu8 %%zmm1, %%zmm2%{%%k2%}%{z%}\n\t"
+              "vpcmpeqb %%zmm2, %%zmm3, %%k0\n\t"
+              "kmovq %%k0, %0"
+              : "=m" (w) );
+        if ( w != 0xffffffffffffffffULL )
+            goto fail;
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpcompressw %zmm4,48*2(%edx){%k3}...");
+        asm volatile ( "kmovd %1, %%k3\n\t"
+                       "vmovdqu16 %2, %%zmm4\n"
+                       put_insn(vpcompressw,
+                                "vpcompressw %%zmm4, 48*2(%0)%{%%k3%}")
+                       :: "d" (NULL), "r" (0x5555aaaa), "m" (wsrc) );
+
+        memset(ptr, 0xdb, 64 * 2);
+        set_insn(vpcompressw);
+        regs.edx = (unsigned long)ptr;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpcompressw) ||
+             memcmp(ptr, ptr + 32, 32 * 2) )
+            goto fail;
+        for ( i = 0; i < 8; ++i )
+        {
+            if ( ptr[(48 + i) * 2] != 2 * i + 1 ||
+                 ptr[(48 + i) * 2 + 1] )
+                goto fail;
+        }
+        for ( ; i < 16; ++i )
+        {
+            if ( ptr[(48 + i) * 2] != 2 * i ||
+                 ptr[(48 + i) * 2 + 1] )
+                goto fail;
+        }
+        printf("okay\n");
+
+        printf("%-40s", "Testing vpexpandw 16*2(%ecx),%zmm5{%k3}{z}...");
+        asm volatile ( "vpternlogd $0x81, %%zmm5, %%zmm5, %%zmm5\n"
+                       put_insn(vpexpandw,
+                                "vpexpandw 16*2(%0), %%zmm5%{%%k3%}%{z%}")
+                       :: "c" (NULL) );
+        set_insn(vpexpandw);
+        regs.ecx = (unsigned long)(ptr + 64);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(vpexpandw) )
+            goto fail;
+        asm ( "vmovdqu16 %%zmm4, %%zmm6%{%%k3%}%{z%}\n\t"
+              "vpcmpeqw %%zmm5, %%zmm6, %%k0\n\t"
+              "kmovq %%k0, %0"
+              : "=m" (w) );
+        if ( w != 0xffffffff )
+            goto fail;
+        printf("okay\n");
+    }
+#endif
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -59,6 +59,9 @@
     (type *)((char *)mptr__ - offsetof(type, member)); \
 })
 
+#define hweight32 __builtin_popcount
+#define hweight64 __builtin_popcountll
+
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
 extern uint32_t mxcsr_mask;
@@ -138,6 +141,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
+#define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -482,6 +482,8 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
+    [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -489,6 +491,10 @@ static const struct ext0f38_table {
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x88] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_dq },
+    [0x89] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_dq },
+    [0x8a] = { .simd_size = simd_packed_fp, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
+    [0x8b] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
@@ -1900,6 +1906,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
+#define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -8905,6 +8912,36 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x62): /* vpexpand{b,w} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x63): /* vpcompress{b,w} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        elem_bytes = 1 << evex.w;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x88): /* vexpandp{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x89): /* vpexpand{d,q} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8a): /* vcompressp{s,d} [xyz]mm,[xyz]mm/mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8b): /* vpcompress{d,q} [xyz]mm,[xyz]mm/mem{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(false);
+        /*
+         * For the respective code below the main switch() to work we need to
+         * compact op_mask here: Memory accesses are non-sparse even if the
+         * mask register has sparsely set bits.
+         */
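+        /*
+         * E.g. a vpcompressd store from a ZMM register (16 dword
+         * elements) with op_mask 0x55aa (8 bits set) yields
+         * op_bytes = 8 * 4 = 32 and a compacted op_mask of 0x00ff.
+         */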
+        if ( likely(fault_suppression) )
+        {
+            n = 1 << ((b & 8 ? 2 : 4) + evex.lr - evex.w);
+            EXPECT(elem_bytes > 0);
+            ASSERT(op_bytes == n * elem_bytes);
+            op_mask &= ~0ULL >> (64 - n);
+            n = hweight64(op_mask);
+            op_bytes = n * elem_bytes;
+            if ( n )
+                op_mask = ~0ULL >> (64 - n);
+        }
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -109,6 +109,7 @@
 
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
+#define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
 /* CPUID level 0x80000007.edx */
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -228,6 +228,7 @@ XEN_CPUFEATURE(AVX512_VBMI,   6*32+ 1) /
 XEN_CPUFEATURE(UMIP,          6*32+ 2) /*S  User Mode Instruction Prevention */
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
+XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
 
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -266,10 +266,10 @@ def crunch_numbers(state):
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
                   AVX512_VPOPCNTDQ],
 
-        # AVX512 extensions acting solely on vectors of bytes/words are made
+        # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask
         # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_VBMI],
+        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors





* [PATCH v8 31/50] x86emul: support remaining misc AVX512{F, BW} insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (29 preceding siblings ...)
  2019-03-15 10:56   ` [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
@ 2019-03-15 10:58   ` Jan Beulich
  2019-06-18 16:42     ` [Xen-devel] " Andrew Cooper
  2019-03-15 10:58   ` [PATCH v8 32/50] x86emul: support AVX512F gather insns Jan Beulich
                     ` (18 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:58 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512BW in the insn emulator, and leaves just
the scatter/gather insns open in the AVX512F set.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v5: New.
---
TBD: The *blendm* inline functions don't reliably produce the intended
     insns, as the respective masked moves are an equally good fit from
     the compiler's perspective when it looks for an instruction
     matching the intended operation. We'd need to switch to inline
     assembly if we wanted to guarantee that these insns actually get
     tested. Thoughts?
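
For reference, a sketch of what the inline assembly route could look
like: a hypothetical helper, not part of this patch, assuming GCC's "v"
(EVEX-encodable vector register) and "Yk" (mask register k1-k7)
constraints and an -mavx512f build:

    typedef int vsi_t __attribute__((vector_size(64)));

    /* Force vpblendmd: r[i] = (k >> i) & 1 ? y[i] : x[i]. */
    static inline vsi_t blendm_d(vsi_t x, vsi_t y, unsigned short k)
    {
        vsi_t r;

        asm ( "vpblendmd %2, %1, %0%{%3%}"
              : "=v" (r) : "v" (x), "v" (y), "Yk" (k) );
        return r;
    }

(Note that in the simd.c hunks the mask constants get inverted,
0b0101... -> 0b1010..., alongside the switch to blendm, evidently to
keep mix()'s element selection unchanged.)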

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -105,6 +105,8 @@ enum esz {
 
 static const struct test avx512f_all[] = {
     INSN_FP(add,             0f, 58),
+    INSN(align,        66, 0f3a, 03,    vl,     dq, vl),
+    INSN(blendm,       66, 0f38, 65,    vl,     sd, vl),
     INSN(broadcastss,  66, 0f38, 18,    el,      d, el),
     INSN_FP(cmp,             0f, c2),
     INSN(comisd,       66,   0f, 2f,    el,      q, el),
@@ -207,6 +209,7 @@ static const struct test avx512f_all[] =
     INSN(paddq,        66,   0f, d4,    vl,      q, vl),
     INSN(pand,         66,   0f, db,    vl,     dq, vl),
     INSN(pandn,        66,   0f, df,    vl,     dq, vl),
+    INSN(pblendm,      66, 0f38, 64,    vl,     dq, vl),
 //       pbroadcast,   66, 0f38, 7c,          dq64
     INSN(pbroadcastd,  66, 0f38, 58,    el,      d, el),
     INSN(pbroadcastq,  66, 0f38, 59,    el,      q, el),
@@ -354,6 +357,7 @@ static const struct test avx512f_512[] =
 };
 
 static const struct test avx512bw_all[] = {
+    INSN(dbpsadbw,    66, 0f3a, 42,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 6f,    vl,    b, vl),
     INSN(movdqu8,     f2,   0f, 7f,    vl,    b, vl),
     INSN(movdqu16,    f2,   0f, 6f,    vl,    w, vl),
@@ -373,6 +377,7 @@ static const struct test avx512bw_all[]
     INSN(palignr,     66, 0f3a, 0f,    vl,    b, vl),
     INSN(pavgb,       66,   0f, e0,    vl,    b, vl),
     INSN(pavgw,       66,   0f, e3,    vl,    w, vl),
+    INSN(pblendm,     66, 0f38, 66,    vl,   bw, vl),
     INSN(pbroadcastb, 66, 0f38, 78,    el,    b, el),
 //       pbroadcastb, 66, 0f38, 7a,           b
     INSN(pbroadcastw, 66, 0f38, 79,    el_2,  b, vl),
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -297,7 +297,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  endif
-#  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  define mix(x, y) B(blendmps_, _mask, x, y, (0b1010101010101010 & ALL_TRUE))
 #  define scale(x, y) BR(scalefps, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
@@ -370,7 +370,7 @@ static inline vec_t movlhps(vec_t x, vec
 #   define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #   define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  endif
-#  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  define mix(x, y) B(blendmpd_, _mask, x, y, 0b10101010)
 #  define scale(x, y) BR(scalefpd, _mask, x, y, undef(), ~0)
 #  if VEC_SIZE == 64 && defined(__AVX512ER__)
 #   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
@@ -564,8 +564,9 @@ static inline vec_t movlhps(vec_t x, vec
                              0b00011011, (vsi_t)undef(), ~0))
 #   define swap2(x) ((vec_t)B_(permvarsi, _mask, (vsi_t)(x), (vsi_t)(inv - 1), (vsi_t)undef(), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdqa32_, _mask, (vsi_t)(x), (vsi_t)(y), \
-                              (0b0101010101010101 & ((1 << ELEM_COUNT) - 1))))
+#  define mix(x, y) ((vec_t)B(blendmd_, _mask, (vsi_t)(x), (vsi_t)(y), \
+                              (0b1010101010101010 & ((1 << ELEM_COUNT) - 1))))
+#  define rotr(x, n) ((vec_t)B(alignd, _mask, (vsi_t)(x), (vsi_t)(x), n, (vsi_t)undef(), ~0))
 #  define shrink1(x) ((half_t)B(pmovqd, _mask, (vdi_t)(x), (vsi_half_t){}, ~0))
 # elif INT_SIZE == 8 || UINT_SIZE == 8
 #  define broadcast(x) ({ \
@@ -602,7 +603,8 @@ static inline vec_t movlhps(vec_t x, vec
                              0b01001110, (vsi_t)undef(), ~0))
 #   define swap2(x) ((vec_t)B(permvardi, _mask, (vdi_t)(x), (vdi_t)(inv - 1), (vdi_t)undef(), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdqa64_, _mask, (vdi_t)(x), (vdi_t)(y), 0b01010101))
+#  define mix(x, y) ((vec_t)B(blendmq_, _mask, (vdi_t)(x), (vdi_t)(y), 0b10101010))
+#  define rotr(x, n) ((vec_t)B(alignq, _mask, (vdi_t)(x), (vdi_t)(x), n, (vdi_t)undef(), ~0))
 #  if VEC_SIZE == 32
 #   define swap3(x) ((vec_t)B_(permdi, _mask, (vdi_t)(x), 0b00011011, (vdi_t)undef(), ~0))
 #  elif VEC_SIZE == 64
@@ -654,8 +656,8 @@ static inline vec_t movlhps(vec_t x, vec
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varqi, _mask, (vqi_t)(x), interleave_hi, (vqi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varqi, _mask, interleave_lo, (vqi_t)(x), (vqi_t)(y), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdquqi, _mask, (vqi_t)(x), (vqi_t)(y), \
-                              (0b0101010101010101010101010101010101010101010101010101010101010101LL & ALL_TRUE)))
+#  define mix(x, y) ((vec_t)B(blendmb_, _mask, (vqi_t)(x), (vqi_t)(y), \
+                              (0b1010101010101010101010101010101010101010101010101010101010101010LL & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovwb, _mask, (vhi_t)(x), (vqi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovdb, _mask, (vsi_t)(x), (vqi_quarter_t){}, ~0))
 #  define shrink3(x) ((eighth_t)B(pmovqb, _mask, (vdi_t)(x), (vqi_eighth_t){}, ~0))
@@ -687,8 +689,8 @@ static inline vec_t movlhps(vec_t x, vec
 #   define interleave_hi(x, y) ((vec_t)B(vpermi2varhi, _mask, (vhi_t)(x), interleave_hi, (vhi_t)(y), ~0))
 #   define interleave_lo(x, y) ((vec_t)B(vpermt2varhi, _mask, interleave_lo, (vhi_t)(x), (vhi_t)(y), ~0))
 #  endif
-#  define mix(x, y) ((vec_t)B(movdquhi, _mask, (vhi_t)(x), (vhi_t)(y), \
-                              (0b01010101010101010101010101010101 & ALL_TRUE)))
+#  define mix(x, y) ((vec_t)B(blendmw_, _mask, (vhi_t)(x), (vhi_t)(y), \
+                              (0b10101010101010101010101010101010 & ALL_TRUE)))
 #  define shrink1(x) ((half_t)B(pmovdw, _mask, (vsi_t)(x), (vhi_half_t){}, ~0))
 #  define shrink2(x) ((quarter_t)B(pmovqw, _mask, (vdi_t)(x), (vhi_quarter_t){}, ~0))
 #  define swap2(x) ((vec_t)B(permvarhi, _mask, (vhi_t)(x), (vhi_t)(inv - 1), (vhi_t)undef(), ~0))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -484,6 +484,7 @@ static const struct ext0f38_table {
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
     [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
+    [0x64 ... 0x66] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -550,6 +551,7 @@ static const struct ext0f3a_table {
     [0x00] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x01] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x02] = { .simd_size = simd_packed_int },
+    [0x03] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x04 ... 0x05] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x06] = { .simd_size = simd_packed_fp },
     [0x08 ... 0x09] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -581,8 +583,7 @@ static const struct ext0f3a_table {
     [0x3b] = { .simd_size = simd_256, .to_mem = 1, .two_op = 1, .d8s = d8s_vl_by_2 },
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
-    [0x42] = { .simd_size = simd_packed_int },
-    [0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x42 ... 0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x44] = { .simd_size = simd_packed_int },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -6204,6 +6205,8 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x64): /* vpblendm{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x65): /* vblendmp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
@@ -6961,6 +6964,7 @@ x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x0b): /* vpmulhrsw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1c): /* vpabsb [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x1d): /* vpabsw [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x66): /* vpblendm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         host_and_vcpu_must_have(avx512bw);
         generate_exception_if(evex.brs, EXC_UD);
         elem_bytes = 1 << (b & 1);
@@ -8130,10 +8134,12 @@ x86_emulate(
         goto simd_0f_to_gpr;
 
     CASE_SIMD_PACKED_FP(_EVEX, 0x0f, 0xc6): /* vshufp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        fault_suppression = false;
         generate_exception_if(evex.w != (evex.pfx & VEX_PREFIX_DOUBLE_MASK),
                               EXC_UD);
         /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x03): /* valign{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x25): /* vpternlog{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     avx512f_imm8_no_sae:
         host_and_vcpu_must_have(avx512f);
@@ -9471,6 +9477,9 @@ x86_emulate(
         insn_bytes = PFX_BYTES + 4;
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x42): /* vdbpsadbw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(evex.w, EXC_UD);
+        /* fall through */
     case X86EMUL_OPC_EVEX_66(0x0f3a, 0x0f): /* vpalignr $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
         fault_suppression = false;
         goto avx512bw_imm;





* [PATCH v8 32/50] x86emul: support AVX512F gather insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (30 preceding siblings ...)
  2019-03-15 10:58   ` [PATCH v8 31/50] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
@ 2019-03-15 10:58   ` Jan Beulich
  2019-06-19 12:05     ` [Xen-devel] " Andrew Cooper
  2019-03-15 10:59   ` [PATCH v8 33/50] x86emul: add high register S/G test cases Jan Beulich
                     ` (17 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:58 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This requires getting modrm_reg and sib_index set correctly in the EVEX
case, to account for the high 16 [XYZ]MM registers. Extend the
adjustments to modrm_rm as well, such that x86_insn_modrm() would
correctly report register numbers (this was a latent issue only as we
don't currently have callers of that function which would care about an
EVEX case). The adjustment in turn requires dropping the assertion from
decode_gpr() as well as re-introducing the explicit masking, as we now
need to actively mask off the high bit when a GPR is meant.
_decode_gpr() invocations also need slight adjustments, when invoked in
generic code ahead of the main switch(). All other uses of modrm_reg and
modrm_rm already get suitably masked where necessary.
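
Schematically, a worked instance of the composition the first
x86_emulate.c hunk below adds (the raw EVEX bits are stored inverted,
hence the negation): with ModR/M.reg = 5, the REX.R-equivalent bit set,
and the extra EVEX bit selecting the high register bank,

    modrm_reg = ((rex_prefix & 4) << 1)               /* 0b01000 */
                | ((modrm & 0x38) >> 3)               /* 0b00101 */
                | ((evex_encoded() && !evex.R) << 4); /* 0b10000 */
    /* => 0b11101 = 29, i.e. %zmm29 */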

There was also an encoding mistake in the EVEX Disp8 test code: it
encoded <disp8>(%rdx,%rdx) instead of <disp8>(%rdx,%riz), which was
benign (due to %rdx getting set to zero) to all non-vSIB tests. In the
vSIB case, however, this meant <disp8>(%rdx,%zmm2) instead of the
intended <disp8>(%rdx,%zmm4).
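
Broken down into the SIB byte's scale:index:base fields (2:3:3 bits):

    0x12 = 00 010 010b : scale 1, index 2 (%rdx, or %zmm2 under vSIB),
                         base 2 (%rdx)
    0x22 = 00 100 010b : scale 1, index 4 (none/%riz, or %zmm4 under
                         vSIB), base 2 (%rdx)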

Likewise the access count check wasn't entirely correct for the S/G
case: In the quad-word-index but dword-data case only half the number
of full vector elements get accessed.
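
Concretely, with the disp8 test's 64-byte vectors and 4-byte data
elements, the corrected expectation in the evex-disp8.c hunk below works
out as

    (vsz / esz) >> (opc & 1 & !evex.w) = (64 / 4) >> 1 = 8

for vpgatherqd (opc 0x91), rather than the 16 a dword-index form of the
same vector length sees.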

As an unrelated change in the main test harness source file, distinguish
the "n/a" messages by bitness.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Re-base.
v7: Fix ByteOp register decode. Re-base.
v6: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -18,7 +18,7 @@ CFLAGS += $(CFLAGS_xeninclude)
 
 SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
 FMA := fma4 fma
-SG := avx2-sg
+SG := avx2-sg avx512f-sg avx512vl-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
 
 OPMASK := avx512f avx512dq avx512bw
@@ -66,6 +66,14 @@ xop-flts := $(avx-flts)
 avx512f-vecs := 64 16 32
 avx512f-ints := 4 8
 avx512f-flts := 4 8
+avx512f-sg-vecs := 64
+avx512f-sg-idxs := 4 8
+avx512f-sg-ints := $(avx512f-ints)
+avx512f-sg-flts := $(avx512f-flts)
+avx512vl-sg-vecs := 16 32
+avx512vl-sg-idxs := $(avx512f-sg-idxs)
+avx512vl-sg-ints := $(avx512f-ints)
+avx512vl-sg-flts := $(avx512f-flts)
 avx512bw-vecs := $(avx512f-vecs)
 avx512bw-ints := 1 2
 avx512bw-flts :=
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -176,6 +176,8 @@ static const struct test avx512f_all[] =
     INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
     INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
     INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
+    INSN(gatherd,      66, 0f38, 92,    vl,     sd, el),
+    INSN(gatherq,      66, 0f38, 93,    vl,     sd, el),
     INSN(getexp,       66, 0f38, 42,    vl,     sd, vl),
     INSN(getexp,       66, 0f38, 43,    el,     sd, el),
     INSN(getmant,      66, 0f3a, 26,    vl,     sd, vl),
@@ -229,6 +231,8 @@ static const struct test avx512f_all[] =
     INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
     INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
     INSN(pexpand,      66, 0f38, 89,    vl,     dq, el),
+    INSN(pgatherd,     66, 0f38, 90,    vl,     dq, el),
+    INSN(pgatherq,     66, 0f38, 91,    vl,     dq, el),
     INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
     INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
     INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
@@ -698,7 +702,7 @@ static void test_one(const struct test *
     instr[3] = evex.raw[2];
     instr[4] = test->opc;
     instr[5] = 0x44 | (test->ext << 3); /* ModR/M */
-    instr[6] = 0x12; /* SIB: base rDX, index none / xMM4 */
+    instr[6] = 0x22; /* SIB: base rDX, index none / xMM4 */
     instr[7] = 1; /* Disp8 */
     instr[8] = 0; /* immediate, if any */
 
@@ -718,7 +722,8 @@ static void test_one(const struct test *
          if ( accessed[i] )
              goto fail;
     for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
-         if ( accessed[i] != (sg ? vsz / esz : 1) )
+         if ( accessed[i] != (sg ? (vsz / esz) >> (test->opc & 1 & !evex.w)
+                                 : 1) )
              goto fail;
     for ( ; i < ARRAY_SIZE(accessed); ++i )
          if ( accessed[i] )
--- a/tools/tests/x86_emulator/simd-sg.c
+++ b/tools/tests/x86_emulator/simd-sg.c
@@ -35,13 +35,78 @@ typedef long long __attribute__((vector_
 #define ITEM_COUNT (VEC_SIZE / ELEM_SIZE < IVEC_SIZE / IDX_SIZE ? \
                     VEC_SIZE / ELEM_SIZE : IVEC_SIZE / IDX_SIZE)
 
-#if VEC_SIZE == 16
-# define to_bool(cmp) __builtin_ia32_ptestc128(cmp, (vec_t){} == 0)
-#else
-# define to_bool(cmp) __builtin_ia32_ptestc256(cmp, (vec_t){} == 0)
-#endif
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if ELEM_SIZE == 4
+#  if IDX_SIZE == 4 || defined(__AVX512VL__)
+#   define to_mask(msk) B(ptestmd, , (vsi_t)(msk), (vsi_t)(msk), ~0)
+#   define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+#  else
+#   define widen(x) __builtin_ia32_pmovzxdq512_mask((vsi_t)(x), (idi_t){}, ~0)
+#   define to_mask(msk) __builtin_ia32_ptestmq512(widen(msk), widen(msk), ~0)
+#   define eq(x, y) (__builtin_ia32_pcmpeqq512_mask(widen(x), widen(y), ~0) == ALL_TRUE)
+#  endif
+#  define BG_(dt, it, reg, mem, idx, msk, scl) \
+    __builtin_ia32_gather##it##dt(reg, mem, idx, to_mask(msk), scl)
+# else
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+#  define BG_(dt, it, reg, mem, idx, msk, scl) \
+    __builtin_ia32_gather##it##dt(reg, mem, idx, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), scl)
+# endif
+/*
+ * Instead of replicating the main IDX_SIZE conditional below three times, use
+ * a double layer of macro invocations, allowing the relevant macro argument
+ * tokens to be macro-expanded before getting pasted together.
+ */
+# define BG(dt, it, reg, mem, idx, msk, scl) BG_(dt, it, reg, mem, idx, msk, scl)
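+/*
+ * E.g. BG(v16si, si, ...) expands, via the extra layer's argument
+ * expansion, to __builtin_ia32_gathersiv16si(...); with VEC_MAX < 64 the
+ * "si" token is first rewritten to "3si" below, yielding
+ * __builtin_ia32_gather3siv4si(...) and friends instead.
+ */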
+# if VEC_MAX < 64
+/*
+ * The sub-512-bit built-ins have an extra "3" infix, presumably because the
+ * 512-bit names were chosen without the AVX512VL extension in mind (the
+ * latter's names would otherwise collide with the AVX2 ones).
+ */
+#  define si 3si
+#  define di 3di
+# endif
+# if VEC_MAX == 16
+#  define v8df v2df
+#  define v8di v2di
+#  define v16sf v4sf
+#  define v16si v4si
+# elif VEC_MAX == 32
+#  define v8df v4df
+#  define v8di v4di
+#  define v16sf v8sf
+#  define v16si v8si
+# endif
+# if IDX_SIZE == 4
+#  if INT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16si, si, reg, mem, idx, msk, scl)
+#  elif INT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, si, (vdi_t)(reg), mem, idx, msk, scl))
+#  elif FLOAT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16sf, si, reg, mem, idx, msk, scl)
+#  elif FLOAT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) BG(v8df, si, reg, mem, idx, msk, scl)
+#  endif
+# elif IDX_SIZE == 8
+#  if INT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16si, di, reg, mem, (idi_t)(idx), msk, scl)
+#  elif INT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, di, (vdi_t)(reg), mem, (idi_t)(idx), msk, scl))
+#  elif FLOAT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16sf, di, reg, mem, (idi_t)(idx), msk, scl)
+#  elif FLOAT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) BG(v8df, di, reg, mem, (idi_t)(idx), msk, scl)
+#  endif
+# endif
+#elif defined(__AVX2__)
+# if VEC_SIZE == 16
+#  define to_bool(cmp) __builtin_ia32_ptestc128(cmp, (vec_t){} == 0)
+# else
+#  define to_bool(cmp) __builtin_ia32_ptestc256(cmp, (vec_t){} == 0)
+# endif
 
-#if defined(__AVX2__)
 # if VEC_MAX == 16
 #  if IDX_SIZE == 4
 #   if INT_SIZE == 4
@@ -111,6 +176,10 @@ typedef long long __attribute__((vector_
 # endif
 #endif
 
+#ifndef eq
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
 #define GLUE_(x, y) x ## y
 #define GLUE(x, y) GLUE_(x, y)
 
@@ -119,6 +188,7 @@ typedef long long __attribute__((vector_
 #define PUT8(n)  PUT4(n),   PUT4((n) +  4)
 #define PUT16(n) PUT8(n),   PUT8((n) +  8)
 #define PUT32(n) PUT16(n), PUT16((n) + 16)
+#define PUT64(n) PUT32(n), PUT32((n) + 32)
 
 const typeof((vec_t){}[0]) array[] = {
     GLUE(PUT, VEC_MAX)(1),
@@ -174,7 +244,7 @@ int sg_test(void)
 
     y = gather(full, array + ITEM_COUNT, -idx, full, ELEM_SIZE);
 #if ITEM_COUNT == ELEM_COUNT
-    if ( !to_bool(y == x - 1) )
+    if ( !eq(y, x - 1) )
         return __LINE__;
 #else
     for ( i = 0; i < ITEM_COUNT; ++i )
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,8 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512dq-opmask.h"
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
+#include "avx512f-sg.h"
+#include "avx512vl-sg.h"
 #include "avx512bw.h"
 #include "avx512dq.h"
 #include "avx512er.h"
@@ -90,11 +92,13 @@ static bool simd_check_avx512f(void)
     return cpu_has_avx512f;
 }
 #define simd_check_avx512f_opmask simd_check_avx512f
+#define simd_check_avx512f_sg simd_check_avx512f
 
 static bool simd_check_avx512f_vl(void)
 {
     return cpu_has_avx512f && cpu_has_avx512vl;
 }
+#define simd_check_avx512vl_sg simd_check_avx512f_vl
 
 static bool simd_check_avx512dq(void)
 {
@@ -291,6 +295,14 @@ static const struct {
     SIMD(AVX512F u32x16,      avx512f,      64u4),
     SIMD(AVX512F s64x8,       avx512f,      64i8),
     SIMD(AVX512F u64x8,       avx512f,      64u8),
+    SIMD(AVX512F S/G f32[16x32], avx512f_sg, 64x4f4),
+    SIMD(AVX512F S/G f64[ 8x32], avx512f_sg, 64x4f8),
+    SIMD(AVX512F S/G f32[ 8x64], avx512f_sg, 64x8f4),
+    SIMD(AVX512F S/G f64[ 8x64], avx512f_sg, 64x8f8),
+    SIMD(AVX512F S/G i32[16x32], avx512f_sg, 64x4i4),
+    SIMD(AVX512F S/G i64[ 8x32], avx512f_sg, 64x4i8),
+    SIMD(AVX512F S/G i32[ 8x64], avx512f_sg, 64x8i4),
+    SIMD(AVX512F S/G i64[ 8x64], avx512f_sg, 64x8i8),
     AVX512VL(VL f32x4,        avx512f,      16f4),
     AVX512VL(VL f64x2,        avx512f,      16f8),
     AVX512VL(VL f32x8,        avx512f,      32f4),
@@ -303,6 +315,22 @@ static const struct {
     AVX512VL(VL u64x2,        avx512f,      16u8),
     AVX512VL(VL s64x4,        avx512f,      32i8),
     AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512VL S/G f32[4x32], avx512vl_sg, 16x4f4),
+    SIMD(AVX512VL S/G f64[2x32], avx512vl_sg, 16x4f8),
+    SIMD(AVX512VL S/G f32[2x64], avx512vl_sg, 16x8f4),
+    SIMD(AVX512VL S/G f64[2x64], avx512vl_sg, 16x8f8),
+    SIMD(AVX512VL S/G f32[8x32], avx512vl_sg, 32x4f4),
+    SIMD(AVX512VL S/G f64[4x32], avx512vl_sg, 32x4f8),
+    SIMD(AVX512VL S/G f32[4x64], avx512vl_sg, 32x8f4),
+    SIMD(AVX512VL S/G f64[4x64], avx512vl_sg, 32x8f8),
+    SIMD(AVX512VL S/G i32[4x32], avx512vl_sg, 16x4i4),
+    SIMD(AVX512VL S/G i64[2x32], avx512vl_sg, 16x4i8),
+    SIMD(AVX512VL S/G i32[2x64], avx512vl_sg, 16x8i4),
+    SIMD(AVX512VL S/G i64[2x64], avx512vl_sg, 16x8i8),
+    SIMD(AVX512VL S/G i32[8x32], avx512vl_sg, 32x4i4),
+    SIMD(AVX512VL S/G i64[4x32], avx512vl_sg, 32x4i8),
+    SIMD(AVX512VL S/G i32[4x64], avx512vl_sg, 32x8i4),
+    SIMD(AVX512VL S/G i64[4x64], avx512vl_sg, 32x8i8),
     SIMD(AVX512BW s8x64,     avx512bw,      64i1),
     SIMD(AVX512BW u8x64,     avx512bw,      64u1),
     SIMD(AVX512BW s16x32,    avx512bw,      64i2),
@@ -4260,7 +4288,7 @@ int main(int argc, char **argv)
 
         if ( !blobs[j].size )
         {
-            printf("%-39s n/a\n", blobs[j].name);
+            printf("%-39s n/a (%u-bit)\n", blobs[j].name, blobs[j].bitness);
             continue;
         }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -499,7 +499,7 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
-    [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
+    [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x9a] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -3054,7 +3054,8 @@ x86_decode(
 
         d &= ~ModRM;
 #undef ModRM /* Only its aliases are valid to use from here on. */
-        modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3);
+        modrm_reg = ((rex_prefix & 4) << 1) | ((modrm & 0x38) >> 3) |
+                    ((evex_encoded() && !evex.R) << 4);
         modrm_rm  = modrm & 0x07;
 
         /*
@@ -3224,7 +3225,8 @@ x86_decode(
         if ( modrm_mod == 3 )
         {
             generate_exception_if(d & vSIB, EXC_UD);
-            modrm_rm |= (rex_prefix & 1) << 3;
+            modrm_rm |= ((rex_prefix & 1) << 3) |
+                        (evex_encoded() && !evex.x) << 4;
             ea.type = OP_REG;
         }
         else if ( ad_bytes == 2 )
@@ -3289,7 +3291,10 @@ x86_decode(
 
                 state->sib_index = ((sib >> 3) & 7) | ((rex_prefix << 2) & 8);
                 state->sib_scale = (sib >> 6) & 3;
-                if ( state->sib_index != 4 && !(d & vSIB) )
+                if ( unlikely(d & vSIB) )
+                    state->sib_index |= (mode_64bit() && evex_encoded() &&
+                                         !evex.RX) << 4;
+                else if ( state->sib_index != 4 )
                 {
                     ea.mem.off = *decode_gpr(state->regs, state->sib_index);
                     ea.mem.off <<= state->sib_scale;
@@ -3592,7 +3597,7 @@ x86_emulate(
     generate_exception_if(state->not_64bit && mode_64bit(), EXC_UD);
 
     if ( ea.type == OP_REG )
-        ea.reg = _decode_gpr(&_regs, modrm_rm, (d & ByteOp) && !rex_prefix);
+        ea.reg = _decode_gpr(&_regs, modrm_rm, (d & ByteOp) && !rex_prefix && !vex.opcx);
 
     memset(mmvalp, 0xaa /* arbitrary */, sizeof(*mmvalp));
 
@@ -3606,7 +3611,7 @@ x86_emulate(
         src.type = OP_REG;
         if ( d & ByteOp )
         {
-            src.reg = _decode_gpr(&_regs, modrm_reg, !rex_prefix);
+            src.reg = _decode_gpr(&_regs, modrm_reg, !rex_prefix && !vex.opcx);
             src.val = *(uint8_t *)src.reg;
             src.bytes = 1;
         }
@@ -3704,7 +3709,7 @@ x86_emulate(
         dst.type = OP_REG;
         if ( d & ByteOp )
         {
-            dst.reg = _decode_gpr(&_regs, modrm_reg, !rex_prefix);
+            dst.reg = _decode_gpr(&_regs, modrm_reg, !rex_prefix && !vex.opcx);
             dst.val = *(uint8_t *)dst.reg;
             dst.bytes = 1;
         }
@@ -9119,6 +9124,130 @@ x86_emulate(
         put_stub(stub);
 
         state->simd_size = simd_none;
+        break;
+    }
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x90): /* vpgatherd{d,q} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x91): /* vpgatherq{d,q} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x92): /* vgatherdp{s,d} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x93): /* vgatherqp{s,d} mem,[xyz]mm{k} */
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+        bool done = false;
+
+        ASSERT(ea.type == OP_MEM);
+        generate_exception_if((!evex.opmsk || evex.brs || evex.z ||
+                               evex.reg != 0xf ||
+                               modrm_reg == state->sib_index),
+                              EXC_UD);
+        avx512_vlen_check(false);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read destination and index registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x7f; /* vmovdqa{32,64} */
+        /*
+         * The register writeback below has to retain masked-off elements, but
+         * needs to clear upper portions in the index-wider-than-data cases.
+         * Therefore read (and write below) the full register. The alternative
+         * would have been to fiddle with the mask register used.
+         */
+        pevex->opmsk = 0;
+        /* Use (%rax) as destination and modrm_reg as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
+
+        pevex->pfx = vex_f3; /* vmovdqu{32,64} */
+        pevex->w = b & 1;
+        /* Switch to sib_index as source. */
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        opc[1] = (state->sib_index & 7) << 3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the destination and mask values. */
+        n = 1 << (2 + evex.lr - ((b & 1) | evex.w));
+        op_bytes = 4 << evex.w;
+        memset((void *)mmvalp + n * op_bytes, 0, 64 - n * op_bytes);
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = ops->read(ea.mem.seg,
+                           truncate_ea(ea.mem.off + (idx << state->sib_scale)),
+                           (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
+            if ( rc != X86EMUL_OKAY )
+            {
+                /*
+                 * If we've made some progress and the access did not fault,
+                 * force a retry instead. This is for example necessary to
+                 * cope with the limited capacity of HVM's MMIO cache.
+                 */
+                if ( rc != X86EMUL_EXCEPTION && done )
+                    rc = X86EMUL_RETRY;
+                break;
+            }
+
+            op_mask &= ~(1 << i);
+            done = true;
+
+#ifdef __XEN__
+            if ( op_mask && local_events_need_delivery() )
+            {
+                rc = X86EMUL_RETRY;
+                break;
+            }
+#endif
+        }
+
+        /* Write destination and mask registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x6f; /* vmovdqa{32,64} */
+        pevex->opmsk = 0;
+        /* Use modrm_reg as destination and (%rax) as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
+
+        /*
+         * kmovw: This is VEX-encoded, so we can't use pevex. Avoid copy_VEX() etc
+         * as well, since we can easily use the 2-byte VEX form here.
+         */
+        opc -= EVEX_PFX_BYTES;
+        opc[0] = 0xc5;
+        opc[1] = 0xf8;
+        opc[2] = 0x90;
+        /* Use (%rax) as source. */
+        opc[3] = evex.opmsk << 3;
+        opc[4] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+        put_stub(stub);
+
+        state->simd_size = simd_none;
         break;
     }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -662,8 +662,6 @@ static inline unsigned long *decode_gpr(
     BUILD_BUG_ON(ARRAY_SIZE(cpu_user_regs_gpr_offsets) &
                  (ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1));
 
-    ASSERT(modrm < ARRAY_SIZE(cpu_user_regs_gpr_offsets));
-
     /* Note that this also acts as array_access_nospec() stand-in. */
     modrm &= ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1;
 





* [PATCH v8 33/50] x86emul: add high register S/G test cases
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (31 preceding siblings ...)
  2019-03-15 10:58   ` [PATCH v8 32/50] x86emul: support AVX512F gather insns Jan Beulich
@ 2019-03-15 10:59   ` Jan Beulich
  2019-06-19 12:07     ` [Xen-devel] " Andrew Cooper
  2019-03-15 10:59   ` [PATCH v8 34/50] x86emul: support AVX512F scatter insns Jan Beulich
                     ` (16 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:59 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

In order to verify that in particular the index register decoding works
correctly in the S/G emulation paths, add dedicated (64-bit only) cases
preventing the compiler from using the lower registers. Unlike in the
generic SIMD case, where occasional uses of %xmm or %ymm registers in
generated code cause various internal compiler errors when use of all
of the lower 16 registers is disallowed (apparently due to insn
templates trying to use AVX2 encodings), doing so here in the AVX512F
case works fine.

While the main goal here is the AVX512F case, add an AVX2 variant as
well.
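
As an illustration of what the -ffixed- flags achieve (a stand-alone
sketch, not part of the patch; the AVX2 flavor is used since its
intrinsic is the more widely known one): with the lower registers
reserved, the compiler can only allocate %ymm0 and %ymm8..%ymm15, so
the emitted gather has to encode high register numbers, exercising the
VEX.R / VEX.X handling in the emulator's decode paths.

    /* cc -O2 -mavx2 -ffixed-ymm1 ... -ffixed-ymm7 sketch.c */
    #include <immintrin.h>

    __m256i gather32(const int *base, __m256i idx)
    {
        /* Destination, index, and mask of the resulting vpgatherdd
         * all have to come out of the %ymm0/%ymm8..%ymm15 pool. */
        return _mm256_i32gather_epi32(base, idx, 4);
    }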

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -147,6 +147,12 @@ $(foreach flavor,$(SIMD) $(FMA),$(eval $
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
+first-string = $(shell for s in $(1); do echo "$$s"; break; done)
+
+avx2-sg-cflags-x86_64    := "-D_high $(foreach n,7 6 5 4 3 2 1,-ffixed-ymm$(n)) $(call first-string,$(avx2-sg-cflags))"
+avx512f-sg-cflags-x86_64 := "-D_higher $(foreach n,7 6 5 4 3 2 1,-ffixed-zmm$(n)) $(call first-string,$(avx512f-sg-cflags))"
+avx512f-sg-cflags-x86_64 += "-D_highest $(foreach n,15 14 13 12 11 10 9 8,-ffixed-zmm$(n)) $(call first-string,$(avx512f-sg-cflags-x86_64))"
+
 $(addsuffix .h,$(TESTCASES)): %.h: %.c testcase.mk Makefile
 	rm -f $@.new $*.bin
 	$(foreach arch,$(filter-out $(XEN_COMPILE_ARCH),x86_32) $(XEN_COMPILE_ARCH), \
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -266,6 +266,9 @@ static const struct {
     SIMD(AVX2 S/G i64[4x32],  avx2_sg,    32x4i8),
     SIMD(AVX2 S/G i32[4x64],  avx2_sg,    32x8i4),
     SIMD(AVX2 S/G i64[4x64],  avx2_sg,    32x8i8),
+#ifdef __x86_64__
+    SIMD_(64, AVX2 S/G %ymm8+, avx2_sg,     high),
+#endif
     SIMD(XOP 128bit single,       xop,      16f4),
     SIMD(XOP 256bit single,       xop,      32f4),
     SIMD(XOP 128bit double,       xop,      16f8),
@@ -303,6 +306,10 @@ static const struct {
     SIMD(AVX512F S/G i64[ 8x32], avx512f_sg, 64x4i8),
     SIMD(AVX512F S/G i32[ 8x64], avx512f_sg, 64x8i4),
     SIMD(AVX512F S/G i64[ 8x64], avx512f_sg, 64x8i8),
+#ifdef __x86_64__
+    SIMD_(64, AVX512F S/G %zmm8+, avx512f_sg, higher),
+    SIMD_(64, AVX512F S/G %zmm16+, avx512f_sg, highest),
+#endif
     AVX512VL(VL f32x4,        avx512f,      16f4),
     AVX512VL(VL f64x2,        avx512f,      16f8),
     AVX512VL(VL f32x8,        avx512f,      32f4),






* [PATCH v8 34/50] x86emul: support AVX512F scatter insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (32 preceding siblings ...)
  2019-03-15 10:59   ` [PATCH v8 33/50] x86emul: add high register S/G test cases Jan Beulich
@ 2019-03-15 10:59   ` Jan Beulich
  2019-03-15 11:00   ` [PATCH v8 35/50] x86emul: support AVX512PF insns Jan Beulich
                     ` (15 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 10:59 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

This completes support of AVX512F in the insn emulator.

Note that in the test harness a little bit of trickery is needed to
work around the not fully consistent naming of the AVX512VL gather and
scatter built-ins: to suppress macro expansion of the "di" and "si"
tokens, they get constructed by token concatenation in BS(), unlike in
BG().
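
The preprocessor rule this relies on, as a self-contained sketch (all
names below are made up): a macro argument gets fully expanded before
substitution, unless the corresponding parameter is an operand of ##
(or #).

    #define si 3si                    /* pretend "si" carries a "3" infix */
    #define NAME_(it) builtin_ ## it  /* ## operand: argument not expanded */
    #define BG(it)    NAME_(it)       /* plain position: argument expanded */
    #define BS(it)    NAME_(it##i)    /* "si" only ever built by pasting */

    int BG(si);   /* -> int builtin_3si;  "si" expanded, infix applied */
    int BS(s);    /* -> int builtin_si;   expansion suppressed */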

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: I couldn't really decide whether to duplicate code or merge scatter
     into gather emulation.
---
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -270,6 +270,8 @@ static const struct test avx512f_all[] =
     INSN(prolv,        66, 0f38, 15,    vl,     dq, vl),
     INSNX(pror,        66,   0f, 72, 0, vl,     dq, vl),
     INSN(prorv,        66, 0f38, 14,    vl,     dq, vl),
+    INSN(pscatterd,    66, 0f38, a0,    vl,     dq, el),
+    INSN(pscatterq,    66, 0f38, a1,    vl,     dq, el),
     INSN(pshufd,       66,   0f, 70,    vl,      d, vl),
     INSN(pslld,        66,   0f, f2,    el_4,    d, vl),
     INSNX(pslld,       66,   0f, 72, 6, vl,      d, vl),
@@ -305,6 +307,8 @@ static const struct test avx512f_all[] =
     INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
     INSN(scalef,       66, 0f38, 2c,    vl,     sd, vl),
     INSN(scalef,       66, 0f38, 2d,    el,     sd, el),
+    INSN(scatterd,     66, 0f38, a2,    vl,     sd, el),
+    INSN(scatterq,     66, 0f38, a3,    vl,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
--- a/tools/tests/x86_emulator/simd-sg.c
+++ b/tools/tests/x86_emulator/simd-sg.c
@@ -48,10 +48,14 @@ typedef long long __attribute__((vector_
 #  endif
 #  define BG_(dt, it, reg, mem, idx, msk, scl) \
     __builtin_ia32_gather##it##dt(reg, mem, idx, to_mask(msk), scl)
+#  define BS_(dt, it, mem, idx, reg, msk, scl) \
+    __builtin_ia32_scatter##it##dt(mem, to_mask(msk), idx, reg, scl)
 # else
 #  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
 #  define BG_(dt, it, reg, mem, idx, msk, scl) \
     __builtin_ia32_gather##it##dt(reg, mem, idx, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), scl)
+#  define BS_(dt, it, mem, idx, reg, msk, scl) \
+    __builtin_ia32_scatter##it##dt(mem, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), idx, reg, scl)
 # endif
 /*
  * Instead of replicating the main IDX_SIZE conditional below three times, use
@@ -59,6 +63,7 @@ typedef long long __attribute__((vector_
  * respective relevant macro argument tokens.
  */
 # define BG(dt, it, reg, mem, idx, msk, scl) BG_(dt, it, reg, mem, idx, msk, scl)
+# define BS(dt, it, mem, idx, reg, msk, scl) BS_(dt, it##i, mem, idx, reg, msk, scl)
 # if VEC_MAX < 64
 /*
  * The sub-512-bit built-ins have an extra "3" infix, presumably because the
@@ -82,22 +87,30 @@ typedef long long __attribute__((vector_
 # if IDX_SIZE == 4
 #  if INT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16si, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16si, s, mem, idx, reg, msk, scl)
 #  elif INT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, si, (vdi_t)(reg), mem, idx, msk, scl))
+#   define scatter(mem, idx, reg, msk, scl) BS(v8di, s, mem, idx, (vdi_t)(reg), msk, scl)
 #  elif FLOAT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16sf, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16sf, s, mem, idx, reg, msk, scl)
 #  elif FLOAT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) BG(v8df, si, reg, mem, idx, msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v8df, s, mem, idx, reg, msk, scl)
 #  endif
 # elif IDX_SIZE == 8
 #  if INT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16si, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16si, d, mem, (idi_t)(idx), reg, msk, scl)
 #  elif INT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, di, (vdi_t)(reg), mem, (idi_t)(idx), msk, scl))
+#   define scatter(mem, idx, reg, msk, scl) BS(v8di, d, mem, (idi_t)(idx), (vdi_t)(reg), msk, scl)
 #  elif FLOAT_SIZE == 4
 #   define gather(reg, mem, idx, msk, scl) BG(v16sf, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v16sf, d, mem, (idi_t)(idx), reg, msk, scl)
 #  elif FLOAT_SIZE == 8
 #   define gather(reg, mem, idx, msk, scl) BG(v8df, di, reg, mem, (idi_t)(idx), msk, scl)
+#   define scatter(mem, idx, reg, msk, scl) BS(v8df, d, mem, (idi_t)(idx), reg, msk, scl)
 #  endif
 # endif
 #elif defined(__AVX2__)
@@ -195,6 +208,8 @@ const typeof((vec_t){}[0]) array[] = {
     GLUE(PUT, VEC_MAX)(VEC_MAX + 1)
 };
 
+typeof((vec_t){}[0]) out[VEC_MAX * 2];
+
 int sg_test(void)
 {
     unsigned int i;
@@ -275,5 +290,41 @@ int sg_test(void)
 # endif
 #endif
 
+#ifdef scatter
+
+    for ( i = 0; i < sizeof(out) / sizeof(*out); ++i )
+        out[i] = 0;
+
+    for ( i = 0; i < ITEM_COUNT; ++i )
+        x[i] = i + 1;
+
+    touch(x);
+
+    scatter(out, (idx_t){}, x, (vec_t){ 1 } != 0, 1);
+    if ( out[0] != 1 )
+        return __LINE__;
+    for ( i = 1; i < ITEM_COUNT; ++i )
+        if ( out[i] )
+            return __LINE__;
+
+    scatter(out, (idx_t){}, x, full, 1);
+    if ( out[0] != ITEM_COUNT )
+        return __LINE__;
+    for ( i = 1; i < ITEM_COUNT; ++i )
+        if ( out[i] )
+            return __LINE__;
+
+    scatter(out, idx, x, full, ELEM_SIZE);
+    for ( i = 1; i <= ITEM_COUNT; ++i )
+        if ( out[i] != i )
+            return __LINE__;
+
+    scatter(out, inv, x, full, ELEM_SIZE);
+    for ( i = 1; i <= ITEM_COUNT; ++i )
+        if ( out[i] != ITEM_COUNT + 1 - i )
+            return __LINE__;
+
+#endif
+
     return 0;
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -508,6 +508,7 @@ static const struct ext0f38_table {
     [0x9d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x9e] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x9f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xa0 ... 0xa3] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xa6 ... 0xa8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xa9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xaa] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -9330,6 +9331,102 @@ x86_emulate(
             avx512_vlen_check(true);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa0): /* vpscatterd{d,q} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa1): /* vpscatterq{d,q} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa2): /* vscatterdp{s,d} [xyz]mm,mem{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa3): /* vscatterqp{s,d} [xyz]mm,mem{k} */
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+        bool done = false;
+
+        ASSERT(ea.type == OP_MEM);
+        fail_if(!ops->write);
+        generate_exception_if((!evex.opmsk || evex.brs || evex.z ||
+                               evex.reg != 0xf ||
+                               modrm_reg == state->sib_index),
+                              EXC_UD);
+        avx512_vlen_check(false);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read source and index registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x7f; /* vmovdqa{32,64} */
+        /* Use (%rax) as destination and modrm_reg as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
+
+        pevex->pfx = vex_f3; /* vmovdqu{32,64} */
+        pevex->w = b & 1;
+        /* Switch to sib_index as source. */
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        opc[1] = (state->sib_index & 7) << 3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the mask value. */
+        n = 1 << (2 + evex.lr - ((b & 1) | evex.w));
+        op_bytes = 4 << evex.w;
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = ops->write(ea.mem.seg,
+                            truncate_ea(ea.mem.off + (idx << state->sib_scale)),
+                            (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
+            if ( rc != X86EMUL_OKAY )
+            {
+                /* See comment in gather emulation. */
+                if ( rc != X86EMUL_EXCEPTION && done )
+                    rc = X86EMUL_RETRY;
+                break;
+            }
+
+            op_mask &= ~(1 << i);
+            done = true;
+
+#ifdef __XEN__
+            if ( op_mask && local_events_need_delivery() )
+            {
+                rc = X86EMUL_RETRY;
+                break;
+            }
+#endif
+        }
+
+        /* Write mask register. See comment in gather emulation. */
+        opc = get_stub(stub);
+        opc[0] = 0xc5;
+        opc[1] = 0xf8;
+        opc[2] = 0x90;
+        /* Use (%rax) as source. */
+        opc[3] = evex.opmsk << 3;
+        opc[4] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+        put_stub(stub);
+
+        state->simd_size = simd_none;
+        break;
+    }
+
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xc9):     /* sha1msg1 xmm/m128,xmm */
     case X86EMUL_OPC(0x0f38, 0xca):     /* sha1msg2 xmm/m128,xmm */





* [PATCH v8 35/50] x86emul: support AVX512PF insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (33 preceding siblings ...)
  2019-03-15 10:59   ` [PATCH v8 34/50] x86emul: support AVX512F scatter insns Jan Beulich
@ 2019-03-15 11:00   ` Jan Beulich
  2019-03-15 11:00   ` [PATCH v8 36/50] x86emul: support AVX512CD insns Jan Beulich
                     ` (14 subsequent siblings)
  49 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Some adjustments are necessary to the EVEX Disp8 scaling test code to
account for the zero-byte reads/writes, which get issued in the test
harness only.
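
The harness side of this boils down to recording a minimum size of
one, so a zero-length access still marks its (Disp8-scaled) offset in
the accessed[] array; a minimal sketch of the arithmetic used in the
hunks below:

    /* bytes + !bytes maps 0 to 1 and leaves non-zero sizes alone. */
    unsigned int recorded(unsigned int bytes)
    {
        return bytes + !bytes;   /* 0 -> 1, n -> n for n > 0 */
    }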

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: #GP/#SS don't arise here. Add previously missed change to
    emul_test_init().
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -520,6 +520,17 @@ static const struct test avx512er_512[]
     INSN(rsqrt28, 66, 0f38, cd, el, sd, el),
 };
 
+static const struct test avx512pf_512[] = {
+    INSNX(gatherpf0d,  66, 0f38, c6, 1, vl, sd, el),
+    INSNX(gatherpf0q,  66, 0f38, c7, 1, vl, sd, el),
+    INSNX(gatherpf1d,  66, 0f38, c6, 2, vl, sd, el),
+    INSNX(gatherpf1q,  66, 0f38, c7, 2, vl, sd, el),
+    INSNX(scatterpf0d, 66, 0f38, c6, 5, vl, sd, el),
+    INSNX(scatterpf0q, 66, 0f38, c7, 5, vl, sd, el),
+    INSNX(scatterpf1d, 66, 0f38, c6, 6, vl, sd, el),
+    INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -580,7 +591,7 @@ static bool record_access(enum x86_segme
 static int read(enum x86_segment seg, unsigned long offset, void *p_data,
                 unsigned int bytes, struct x86_emulate_ctxt *ctxt)
 {
-    if ( !record_access(seg, offset, bytes) )
+    if ( !record_access(seg, offset, bytes + !bytes) )
         return X86EMUL_UNHANDLEABLE;
     memset(p_data, 0, bytes);
     return X86EMUL_OKAY;
@@ -589,7 +600,7 @@ static int read(enum x86_segment seg, un
 static int write(enum x86_segment seg, unsigned long offset, void *p_data,
                  unsigned int bytes, struct x86_emulate_ctxt *ctxt)
 {
-    if ( !record_access(seg, offset, bytes) )
+    if ( !record_access(seg, offset, bytes + !bytes) )
         return X86EMUL_UNHANDLEABLE;
     return X86EMUL_OKAY;
 }
@@ -597,7 +608,7 @@ static int write(enum x86_segment seg, u
 static void test_one(const struct test *test, enum vl vl,
                      unsigned char *instr, struct x86_emulate_ctxt *ctxt)
 {
-    unsigned int vsz, esz, i;
+    unsigned int vsz, esz, i, n;
     int rc;
     bool sg = strstr(test->mnemonic, "gather") ||
               strstr(test->mnemonic, "scatter");
@@ -725,10 +736,20 @@ static void test_one(const struct test *
     for ( i = 0; i < (test->scale == SC_vl ? vsz : esz); ++i )
          if ( accessed[i] )
              goto fail;
-    for ( ; i < (test->scale == SC_vl ? vsz : esz) + (sg ? esz : vsz); ++i )
+
+    n = test->scale == SC_vl ? vsz : esz;
+    if ( !sg )
+        n += vsz;
+    else if ( !strstr(test->mnemonic, "pf") )
+        n += esz;
+    else
+        ++n;
+
+    for ( ; i < n; ++i )
          if ( accessed[i] != (sg ? (vsz / esz) >> (test->opc & 1 & !evex.w)
                                  : 1) )
              goto fail;
+
     for ( ; i < ARRAY_SIZE(accessed); ++i )
          if ( accessed[i] )
              goto fail;
@@ -887,6 +908,8 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
     RUN(avx512er, 512);
+#define cpu_has_avx512pf cpu_has_avx512f
+    RUN(avx512pf, 512);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
 }
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -73,6 +73,7 @@ bool emul_test_init(void)
      */
     cp.basic.movbe = true;
     cp.feat.adx = true;
+    cp.feat.avx512pf = cp.feat.avx512f;
     cp.feat.rdpid = true;
     cp.extd.clzero = true;
 
@@ -135,12 +136,14 @@ int emul_test_cpuid(
         res->c |= 1U << 22;
 
     /*
-     * The emulator doesn't itself use ADCX/ADOX/RDPID, so we can always run
-     * the respective tests.
+     * The emulator doesn't itself use ADCX/ADOX/RDPID nor the S/G prefetch
+     * insns, so we can always run the respective tests.
      */
     if ( leaf == 7 && subleaf == 0 )
     {
         res->b |= 1U << 19;
+        if ( res->b & (1U << 16) )
+            res->b |= 1U << 26;
         res->c |= 1U << 22;
     }
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -525,6 +525,7 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xc6 ... 0xc7] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xc9] = { .simd_size = simd_other },
     [0xca] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -1903,6 +1904,7 @@ static bool vcpu_has(
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
+#define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
 #define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
@@ -9425,6 +9427,94 @@ x86_emulate(
 
         state->simd_size = simd_none;
         break;
+    }
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc6):
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc7):
+    {
+#ifndef __XEN__
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+#endif
+
+        ASSERT(ea.type == OP_MEM);
+        generate_exception_if((!cpu_has_avx512f || !evex.opmsk || evex.brs ||
+                               evex.z || evex.reg != 0xf || evex.lr != 2),
+                              EXC_UD);
+
+        switch ( modrm_reg & 7 )
+        {
+        case 1: /* vgatherpf0{d,q}p{s,d} mem{k} */
+        case 2: /* vgatherpf1{d,q}p{s,d} mem{k} */
+        case 5: /* vscatterpf0{d,q}p{s,d} mem{k} */
+        case 6: /* vscatterpf1{d,q}p{s,d} mem{k} */
+            vcpu_must_have(avx512pf);
+            break;
+        default:
+            generate_exception(EXC_UD);
+        }
+
+        get_fpu(X86EMUL_FPU_zmm);
+
+#ifndef __XEN__
+        /*
+         * For the test harness perform zero byte memory accesses, such that
+         * in particular correct Disp8 scaling can be verified.
+         */
+        fail_if((modrm_reg & 4) && !ops->write);
+
+        /* Read index register. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        /* vmovdqu{32,64} */
+        opc[0] = 0x7f;
+        pevex->pfx = vex_f3;
+        pevex->w = b & 1;
+        /* Use (%rax) as destination and sib_index as source. */
+        pevex->b = 1;
+        opc[1] = (state->sib_index & 7) << 3;
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the mask value. */
+        n = 1 << (4 - ((b & 1) | evex.w));
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; rc == X86EMUL_OKAY && op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = (modrm_reg & 4
+                  ? ops->write
+                  : ops->read)(ea.mem.seg,
+                               truncate_ea(ea.mem.off +
+                                           (idx << state->sib_scale)),
+                               NULL, 0, ctxt);
+            if ( rc == X86EMUL_EXCEPTION )
+            {
+                /* Squash memory access related exceptions. */
+                x86_emul_reset_event(ctxt);
+                rc = X86EMUL_OKAY;
+            }
+
+            op_mask &= ~(1 << i);
+        }
+#endif
+
+        state->simd_size = simd_none;
+        break;
     }
 
     case X86EMUL_OPC(0x0f38, 0xc8):     /* sha1nexte xmm/m128,xmm */





* [PATCH v8 36/50] x86emul: support AVX512CD insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (34 preceding siblings ...)
  2019-03-15 11:00   ` [PATCH v8 35/50] x86emul: support AVX512PF insns Jan Beulich
@ 2019-03-15 11:00   ` Jan Beulich
  2019-06-19 12:13     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:01   ` [PATCH v8 37/50] x86emul: complete support of AVX512_VBMI insns Jan Beulich
                     ` (13 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:00 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Since the insns here and in particular their memory access patterns
follow the usual scheme, I didn't think it was necessary to add
contrived tests specifically for them, beyond the Disp8 scaling ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -458,6 +458,13 @@ static const struct test avx512bw_128[]
     INSN(pinsrw, 66,   0f, c4, el, w, el),
 };
 
+static const struct test avx512cd_all[] = {
+//       pbroadcastmb2q, f3, 0f38, 2a,      q
+//       pbroadcastmw2d, f3, 0f38, 3a,      d
+    INSN(pconflict,      66, 0f38, c4, vl, dq, vl),
+    INSN(plzcnt,         66, 0f38, 44, vl, dq, vl),
+};
+
 static const struct test avx512dq_all[] = {
     INSN_PFP(and,              0f, 54),
     INSN_PFP(andn,             0f, 55),
@@ -903,6 +910,7 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512f, 512);
     RUN(avx512bw, all);
     RUN(avx512bw, 128);
+    RUN(avx512cd, all);
     RUN(avx512dq, all);
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -138,6 +138,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
 #define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
+#define cpu_has_avx512cd  (cp.feat.avx512cd && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -473,6 +473,7 @@ static const struct ext0f38_table {
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x42] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x44] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -525,6 +526,7 @@ static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xc4] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0xc6 ... 0xc7] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xc9] = { .simd_size = simd_other },
@@ -1906,6 +1908,7 @@ static bool vcpu_has(
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
 #define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
+#define vcpu_has_avx512cd()    vcpu_has(         7, EBX, 28, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
@@ -8816,6 +8819,20 @@ x86_emulate(
         evex.opcx = vex_0f;
         goto vmovdqa;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x2a): /* vpbroadcastmb2q k,[xyz]mm */
+    case X86EMUL_OPC_EVEX_F3(0x0f38, 0x3a): /* vpbroadcastmw2d k,[xyz]mm */
+        generate_exception_if((ea.type != OP_REG || evex.opmsk ||
+                               evex.w == ((b >> 4) & 1)),
+                              EXC_UD);
+        d |= TwoOp;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc4): /* vpconflict{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        fault_suppression = false;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x44): /* vplzcnt{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512cd);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2c): /* vmaskmovps mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2d): /* vmaskmovpd mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x2e): /* vmaskmovps {x,y}mm,{x,y}mm,mem */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -103,6 +103,7 @@
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
 #define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
+#define cpu_has_avx512cd        boot_cpu_has(X86_FEATURE_AVX512CD)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)





* [PATCH v8 37/50] x86emul: complete support of AVX512_VBMI insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (35 preceding siblings ...)
  2019-03-15 11:00   ` [PATCH v8 36/50] x86emul: support AVX512CD insns Jan Beulich
@ 2019-03-15 11:01   ` Jan Beulich
  2019-06-19 12:16     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:01   ` [PATCH v8 38/50] x86emul: support of AVX512* population count insns Jan Beulich
                     ` (12 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:01 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also add testing of the insns support for which was added before. Sadly
gcc's command line option naming (-mavx512vbmi) is not in line with
Intel's naming of the feature (AVX512_VBMI), which makes it necessary
to mis-name things in the test harness.

Since the only new insn here and in particular its memory access pattern
follows the usual scheme, I didn't think it was necessary to add a
contrived test specifically for it, beyond the Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@ vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er avx512vbmi
 FMA := fma4 fma
 SG := avx2-sg avx512f-sg avx512vl-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -83,6 +83,9 @@ avx512dq-flts := $(avx512f-flts)
 avx512er-vecs := 64
 avx512er-ints :=
 avx512er-flts := 4 8
+avx512vbmi-vecs := $(avx512bw-vecs)
+avx512vbmi-ints := $(avx512bw-ints)
+avx512vbmi-flts := $(avx512bw-flts)
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -542,6 +542,7 @@ static const struct test avx512_vbmi_all
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
     INSN(permt2b,       66, 0f38, 7d, vl, b, vl),
+    INSN(pmultishiftqb, 66, 0f38, 83, vl, q, vl),
 };
 
 static const struct test avx512_vbmi2_all[] = {
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -27,6 +27,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw.h"
 #include "avx512dq.h"
 #include "avx512er.h"
+#include "avx512vbmi.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -127,6 +128,16 @@ static bool simd_check_avx512bw_vl(void)
     return cpu_has_avx512bw && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx512vbmi(void)
+{
+    return cpu_has_avx512_vbmi;
+}
+
+static bool simd_check_avx512vbmi_vl(void)
+{
+    return cpu_has_avx512_vbmi && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -372,6 +383,18 @@ static const struct {
     SIMD(AVX512ER f32x16,    avx512er,      64f4),
     SIMD(AVX512ER f64 scalar,avx512er,        f8),
     SIMD(AVX512ER f64x8,     avx512er,      64f8),
+    SIMD(AVX512_VBMI s8x64,  avx512vbmi,    64i1),
+    SIMD(AVX512_VBMI u8x64,  avx512vbmi,    64u1),
+    SIMD(AVX512_VBMI s16x32, avx512vbmi,    64i2),
+    SIMD(AVX512_VBMI u16x32, avx512vbmi,    64u2),
+    AVX512VL(_VBMI+VL s8x16, avx512vbmi,    16i1),
+    AVX512VL(_VBMI+VL u8x16, avx512vbmi,    16u1),
+    AVX512VL(_VBMI+VL s8x32, avx512vbmi,    32i1),
+    AVX512VL(_VBMI+VL u8x32, avx512vbmi,    32u1),
+    AVX512VL(_VBMI+VL s16x8, avx512vbmi,    16i2),
+    AVX512VL(_VBMI+VL u16x8, avx512vbmi,    16u2),
+    AVX512VL(_VBMI+VL s16x16, avx512vbmi,   32i2),
+    AVX512VL(_VBMI+VL u16x16, avx512vbmi,   32u2),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -493,6 +493,7 @@ static const struct ext0f38_table {
     [0x7a ... 0x7c] = { .simd_size = simd_none, .two_op = 1 },
     [0x7d ... 0x7e] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x7f] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
+    [0x83] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x88] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_dq },
     [0x89] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_dq },
     [0x8a] = { .simd_size = simd_packed_fp, .to_mem = 1, .two_op = 1, .d8s = d8s_dq },
@@ -9023,6 +9024,12 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x83): /* vpmultishiftqb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        host_and_vcpu_must_have(avx512_vbmi);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8c): /* vpmaskmov{d,q} mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x8e): /* vpmaskmov{d,q} {x,y}mm,{x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);





* [PATCH v8 38/50] x86emul: support of AVX512* population count insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (36 preceding siblings ...)
  2019-03-15 11:01   ` [PATCH v8 37/50] x86emul: complete support of AVX512_VBMI insns Jan Beulich
@ 2019-03-15 11:01   ` Jan Beulich
  2019-06-19 12:22     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:02   ` [PATCH v8 39/50] x86emul: support of AVX512_IFMA insns Jan Beulich
                     ` (11 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:01 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Plus the only other AVX512_BITALG one.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -538,6 +538,11 @@ static const struct test avx512pf_512[]
     INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
 };
 
+static const struct test avx512_bitalg_all[] = {
+    INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
+    INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -550,6 +555,10 @@ static const struct test avx512_vbmi2_al
     INSN(pexpand,   66, 0f38, 62, vl, bw, el),
 };
 
+static const struct test avx512_vpopcntdq_all[] = {
+    INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -919,6 +928,8 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512er, 512);
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
+    RUN(avx512_bitalg, all);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
+    RUN(avx512_vpopcntdq, all);
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -143,6 +143,8 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
+#define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,6 +479,7 @@ static const struct ext0f38_table {
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -501,6 +502,7 @@ static const struct ext0f38_table {
     [0x8c] = { .simd_size = simd_packed_int },
     [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
+    [0x8f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
     [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -1915,6 +1917,8 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
+#define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -8923,6 +8927,19 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8f): /* vpshufbitqmb [xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(evex.w || !evex.r || !evex.R || evex.z, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x54): /* vpopcnt{b,w} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_bitalg);
+        generate_exception_if(evex.brs, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x55): /* vpopcnt{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vpopcntdq);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,{x,y}mm */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -111,6 +111,8 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
+#define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
 /* CPUID level 0x80000007.edx */
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
 
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -269,7 +269,7 @@ def crunch_numbers(state):
         # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask
         # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
+        AVX512BW: [AVX512_VBMI, AVX512_BITALG, AVX512_VBMI2],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors





* [PATCH v8 39/50] x86emul: support of AVX512_IFMA insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (37 preceding siblings ...)
  2019-03-15 11:01   ` [PATCH v8 38/50] x86emul: support of AVX512* population count insns Jan Beulich
@ 2019-03-15 11:02   ` Jan Beulich
  2019-06-19 12:23     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:02   ` [PATCH v8 40/50] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
                     ` (10 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:02 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Once again take the liberty of also correcting the (public interface)
name of the AVX512_IFMA feature flag to match the SDM, on the
assumption that no external consumer has actually been using that flag
so far.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Reject EVEX.W=0.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -543,6 +543,11 @@ static const struct test avx512_bitalg_a
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
 };
 
+static const struct test avx512_ifma_all[] = {
+    INSN(pmadd52huq, 66, 0f38, b5, vl, q, vl),
+    INSN(pmadd52luq, 66, 0f38, b4, vl, q, vl),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -929,6 +934,7 @@ void evex_disp8_test(void *instr, struct
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
     RUN(avx512_bitalg, all);
+    RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
     RUN(avx512_vpopcntdq, all);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -137,6 +137,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_bmi2       cp.feat.bmi2
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
+#define cpu_has_avx512_ifma (cp.feat.avx512_ifma && xcr0_mask(0xe6))
 #define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512cd  (cp.feat.avx512cd && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -521,6 +521,7 @@ static const struct ext0f38_table {
     [0xad] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xae] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xaf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xb4 ... 0xb5] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xb6 ... 0xb8] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xb9] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xba] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -1907,6 +1908,7 @@ static bool vcpu_has(
 #define vcpu_has_rdseed()      vcpu_has(         7, EBX, 18, ctxt, ops)
 #define vcpu_has_adx()         vcpu_has(         7, EBX, 19, ctxt, ops)
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
+#define vcpu_has_avx512_ifma() vcpu_has(         7, EBX, 21, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
 #define vcpu_has_avx512pf()    vcpu_has(         7, EBX, 26, ctxt, ops)
@@ -9470,6 +9472,12 @@ x86_emulate(
         break;
     }
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb4): /* vpmadd52luq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb5): /* vpmadd52huq [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_ifma);
+        generate_exception_if(!evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xc6):
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xc7):
     {
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -102,6 +102,7 @@
 #define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
+#define cpu_has_avx512_ifma     boot_cpu_has(X86_FEATURE_AVX512_IFMA)
 #define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_avx512cd        boot_cpu_has(X86_FEATURE_AVX512CD)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -212,7 +212,7 @@ XEN_CPUFEATURE(AVX512DQ,      5*32+17) /
 XEN_CPUFEATURE(RDSEED,        5*32+18) /*A  RDSEED instruction */
 XEN_CPUFEATURE(ADX,           5*32+19) /*A  ADCX, ADOX instructions */
 XEN_CPUFEATURE(SMAP,          5*32+20) /*S  Supervisor Mode Access Prevention */
-XEN_CPUFEATURE(AVX512IFMA,    5*32+21) /*A  AVX-512 Integer Fused Multiply Add */
+XEN_CPUFEATURE(AVX512_IFMA,   5*32+21) /*A  AVX-512 Integer Fused Multiply Add */
 XEN_CPUFEATURE(CLFLUSHOPT,    5*32+23) /*A  CLFLUSHOPT instruction */
 XEN_CPUFEATURE(CLWB,          5*32+24) /*A  CLWB instruction */
 XEN_CPUFEATURE(AVX512PF,      5*32+26) /*A  AVX-512 Prefetch Instructions */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -262,7 +262,7 @@ def crunch_numbers(state):
         # (which in practice depends on the EVEX prefix to encode) as well
         # as mask registers, and the instructions themselves. All further
         # AVX512 features are built on top of AVX512F
-        AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
+        AVX512F: [AVX512DQ, AVX512_IFMA, AVX512PF, AVX512ER, AVX512CD,
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
                   AVX512_VPOPCNTDQ],
 





* [PATCH v8 40/50] x86emul: support remaining AVX512_VBMI2 insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (38 preceding siblings ...)
  2019-03-15 11:02   ` [PATCH v8 39/50] x86emul: support of AVX512_IFMA insns Jan Beulich
@ 2019-03-15 11:02   ` Jan Beulich
  2019-06-19 12:25     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:04   ` [PATCH v8 41/50] x86emul: support AVX512_4FMAPS insns Jan Beulich
                     ` (9 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:02 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As in a few cases before, since the insns here, and in particular their
memory access patterns, follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
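
For illustration, a minimal scalar sketch of the per-element operation
as I read the spec (helper names made up; not emulator code): the two
source elements are concatenated and shifted, with one half of the
double-width result retained.

    #include <stdint.h>

    /* vpshldw-style lane: shift the concatenation left, keep the upper half. */
    static uint16_t pshldw(uint16_t hi, uint16_t lo, unsigned int imm)
    {
        uint32_t t = ((uint32_t)hi << 16) | lo;

        return (t << (imm & 15)) >> 16;
    }

    /* vpshrdw-style lane: shift the concatenation right, keep the lower half. */
    static uint16_t pshrdw(uint16_t hi, uint16_t lo, unsigned int imm)
    {
        uint32_t t = ((uint32_t)hi << 16) | lo;

        return t >> (imm & 15);
    }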

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Re-base over change earlier in the series.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -558,6 +558,14 @@ static const struct test avx512_vbmi_all
 static const struct test avx512_vbmi2_all[] = {
     INSN(pcompress, 66, 0f38, 63, vl, bw, el),
     INSN(pexpand,   66, 0f38, 62, vl, bw, el),
+    INSN(pshld,     66, 0f3a, 71, vl, dq, vl),
+    INSN(pshldv,    66, 0f38, 71, vl, dq, vl),
+    INSN(pshldvw,   66, 0f38, 70, vl,  w, vl),
+    INSN(pshldw,    66, 0f3a, 70, vl,  w, vl),
+    INSN(pshrd,     66, 0f3a, 73, vl, dq, vl),
+    INSN(pshrdv,    66, 0f38, 73, vl, dq, vl),
+    INSN(pshrdvw,   66, 0f38, 72, vl,  w, vl),
+    INSN(pshrdw,    66, 0f3a, 72, vl,  w, vl),
 };
 
 static const struct test avx512_vpopcntdq_all[] = {
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -487,6 +487,7 @@ static const struct ext0f38_table {
     [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
     [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
     [0x64 ... 0x66] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x70 ... 0x73] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x75 ... 0x76] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x77] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0x78] = { .simd_size = simd_other, .two_op = 1 },
@@ -611,6 +612,7 @@ static const struct ext0f3a_table {
     [0x6a ... 0x6b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x6c ... 0x6d] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x6e ... 0x6f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
+    [0x70 ... 0x73] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x78 ... 0x79] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x7a ... 0x7b] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0x7c ... 0x7d] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -8993,6 +8995,16 @@ x86_emulate(
         }
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x70): /* vpshldvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x72): /* vpshrdvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        elem_bytes = 2;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x71): /* vpshldv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x73): /* vpshrdv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x75): /* vpermi2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x7d): /* vpermt2{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8d): /* vperm{b,w} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
@@ -10293,6 +10305,16 @@ x86_emulate(
         avx512_vlen_check(true);
         goto simd_imm8_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x70): /* vpshldw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x72): /* vpshrdw $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        generate_exception_if(!evex.w, EXC_UD);
+        elem_bytes = 2;
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x71): /* vpshld{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x73): /* vpshrd{d,q} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vbmi2);
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC(0x0f3a, 0xcc):     /* sha1rnds4 $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(sha);
         op_bytes = 16;






* [PATCH v8 41/50] x86emul: support AVX512_4FMAPS insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (39 preceding siblings ...)
  2019-03-15 11:02   ` [PATCH v8 40/50] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
@ 2019-03-15 11:04   ` Jan Beulich
  2019-06-19 14:58     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:04   ` [PATCH v8 42/50] x86emul: support AVX512_4VNNIW insns Jan Beulich
                     ` (8 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:04 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Correct vcpu_has_*() insertion point.
v7: Re-base.
v6: New.
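
For reference, the expected values in the tests below follow from each
destination element accumulating four products, one per register of the
+3 source group, with the m128 operand supplying the four multipliers.
A scalar sketch of one unmasked v4fmaddps element (helper name made up;
illustration only, derived from the test vectors):

    /* dst[i] += regs_i[j] * m128[j] for j = 0..3 */
    static float v4fmaddps_elem(float dst, const float regs_i[4],
                                const float m128[4])
    {
        unsigned int j;

        for ( j = 0; j < 4; ++j )
            dst += regs_i[j] * m128[j];

        return dst;
    }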

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -538,6 +538,13 @@ static const struct test avx512pf_512[]
     INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
 };
 
+static const struct test avx512_4fmaps_512[] = {
+    INSN(4fmaddps,  f2, 0f38, 9a, el_4, d, vl),
+    INSN(4fmaddss,  f2, 0f38, 9b, el_4, d, vl),
+    INSN(4fnmaddps, f2, 0f38, aa, el_4, d, vl),
+    INSN(4fnmaddss, f2, 0f38, ab, el_4, d, vl),
+};
+
 static const struct test avx512_bitalg_all[] = {
     INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
@@ -941,6 +948,7 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512er, 512);
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
+    RUN(avx512_4fmaps, 512);
     RUN(avx512_bitalg, all);
     RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -4274,6 +4274,81 @@ int main(int argc, char **argv)
     }
 #endif
 
+    printf("%-40s", "Testing v4fmaddps 32(%ecx),%zmm4,%zmm4{%k5}...");
+    if ( stack_exec && cpu_has_avx512_4fmaps )
+    {
+        decl_insn(v4fmaddps);
+        static const struct {
+            float f[16];
+        } in = {{
+            1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
+        }}, out = {{
+            1 + 1 * 9 + 2 * 10 + 3 * 11 + 4 * 12,
+            2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+            16 + 16 * 9 + 17 * 10 + 18 * 11 + 19 * 12
+        }};
+
+        asm volatile ( "vmovups %1, %%zmm4\n\t"
+                       "vbroadcastss %%xmm4, %%zmm7\n\t"
+                       "vaddps %%zmm4, %%zmm7, %%zmm5\n\t"
+                       "vaddps %%zmm5, %%zmm7, %%zmm6\n\t"
+                       "vaddps %%zmm6, %%zmm7, %%zmm7\n\t"
+                       "kmovw %2, %%k5\n"
+                       put_insn(v4fmaddps,
+                                "v4fmaddps 32(%0), %%zmm4, %%zmm4%{%%k5%}")
+                       :: "c" (NULL), "m" (in), "rmk" (0x8001) );
+
+        set_insn(v4fmaddps);
+        regs.ecx = (unsigned long)&in;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(v4fmaddps) )
+            goto fail;
+
+        asm ( "vcmpeqps %1, %%zmm4, %%k0\n\t"
+              "kmovw %%k0, %0" : "=g" (rc) : "m" (out) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing v4fnmaddss 16(%edx),%zmm4,%zmm4{%k3}...");
+    if ( stack_exec && cpu_has_avx512_4fmaps )
+    {
+        decl_insn(v4fnmaddss);
+        static const struct {
+            float f[16];
+        } in = {{
+            1, 2, 3, 4, 5, 6, 7, 8
+        }}, out = {{
+            1 - 1 * 5 - 2 * 6 - 3 * 7 - 4 * 8, 2, 3, 4
+        }};
+
+        asm volatile ( "vmovups %1, %%xmm4\n\t"
+                       "vaddss %%xmm4, %%xmm4, %%xmm5\n\t"
+                       "vaddss %%xmm5, %%xmm4, %%xmm6\n\t"
+                       "vaddss %%xmm6, %%xmm4, %%xmm7\n\t"
+                       "kmovw %2, %%k3\n"
+                       put_insn(v4fnmaddss,
+                                "v4fnmaddss 16(%0), %%xmm4, %%xmm4%{%%k3%}")
+                       :: "d" (NULL), "m" (in), "rmk" (1) );
+
+        set_insn(v4fnmaddss);
+        regs.edx = (unsigned long)&in;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(v4fnmaddss) )
+            goto fail;
+
+        asm ( "vcmpeqps %1, %%zmm4, %%k0\n\t"
+              "kmovw %%k0, %0" : "=g" (rc) : "m" (out) );
+        if ( rc != 0xffff )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
 #undef decl_insn
 #undef put_insn
 #undef set_insn
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -146,6 +146,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
+#define cpu_has_avx512_4fmaps (cp.feat.avx512_4fmaps && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1924,6 +1924,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
+#define vcpu_has_avx512_4fmaps() vcpu_has(       7, EDX,  3, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
 #define vcpu_must_have(feat) \
@@ -3205,6 +3206,18 @@ x86_decode(
                                                    state);
                     state->simd_size = simd_other;
                 }
+
+                switch ( b )
+                {
+                /* v4f{,n}madd{p,s}s need special casing */
+                case 0x9a: case 0x9b: case 0xaa: case 0xab:
+                    if ( evex.pfx == vex_f2 )
+                    {
+                        disp8scale = 4;
+                        state->simd_size = simd_128;
+                    }
+                    break;
+                }
             }
             break;
 
@@ -9388,6 +9401,24 @@ x86_emulate(
             avx512_vlen_check(true);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9a): /* v4fmaddps m128,zmm+3,zmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0xaa): /* v4fnmaddps m128,zmm+3,zmm{k} */
+        host_and_vcpu_must_have(avx512_4fmaps);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.brs ||
+                               evex.lr != 2),
+                              EXC_UD);
+        op_mask = op_mask & 0xffff ? 0xf : 0;
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9b): /* v4fmaddss m128,xmm+3,xmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0xab): /* v4fnmaddss m128,xmm+3,xmm{k} */
+        host_and_vcpu_must_have(avx512_4fmaps);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.brs ||
+                               evex.lr == 3),
+                              EXC_UD);
+        op_mask = op_mask & 1 ? 0xf : 0;
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa0): /* vpscatterd{d,q} [xyz]mm,mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa1): /* vpscatterq{d,q} [xyz]mm,mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xa2): /* vscatterdp{s,d} [xyz]mm,mem{k} */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -116,6 +116,9 @@
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
+/* CPUID level 0x00000007:0.edx */
+#define cpu_has_avx512_4fmaps   boot_cpu_has(X86_FEATURE_AVX512_4FMAPS)
+
 /* CPUID level 0x80000007.edx */
 #define cpu_has_itsc            boot_cpu_has(X86_FEATURE_ITSC)
 





* [PATCH v8 42/50] x86emul: support AVX512_4VNNIW insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (40 preceding siblings ...)
  2019-03-15 11:04   ` [PATCH v8 41/50] x86emul: support AVX512_4FMAPS insns Jan Beulich
@ 2019-03-15 11:04   ` Jan Beulich
  2019-06-19 14:58     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:04   ` [PATCH v8 43/50] x86emul: support AVX512_VNNI insns Jan Beulich
                     ` (7 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:04 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As in a few cases before, since the insns here, and in particular their
memory access patterns, follow the AVX512_4FMAPS scheme, I didn't think
it was necessary to add contrived tests specifically for them, beyond
the Disp8 scaling ones.
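
For illustration (my reading of the spec; not emulator code), each
destination dword of vp4dpwssd accumulates, from each of the four
registers in the source group, a dot product of two signed-word pairs;
the vp4dpwssds form additionally saturates the accumulation. One such
step, with a made-up helper name:

    #include <stdint.h>

    /* One register's contribution to one dword lane of vp4dpwssd. */
    static int32_t dpwssd_step(int32_t acc, int16_t s0, int16_t s1,
                               int16_t m0, int16_t m1)
    {
        return acc + s0 * m0 + s1 * m1;
    }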

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Correct vcpu_has_*() insertion point.
v7: Re-base.
v6: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -545,6 +545,11 @@ static const struct test avx512_4fmaps_5
     INSN(4fnmaddss, f2, 0f38, ab, el_4, d, vl),
 };
 
+static const struct test avx512_4vnniw_512[] = {
+    INSN(p4dpwssd,  f2, 0f38, 52, el_4, d, vl),
+    INSN(p4dpwssds, f2, 0f38, 53, el_4, d, vl),
+};
+
 static const struct test avx512_bitalg_all[] = {
     INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
     INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
@@ -949,6 +954,7 @@ void evex_disp8_test(void *instr, struct
 #define cpu_has_avx512pf cpu_has_avx512f
     RUN(avx512pf, 512);
     RUN(avx512_4fmaps, 512);
+    RUN(avx512_4vnniw, 512);
     RUN(avx512_bitalg, all);
     RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -146,6 +146,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
+#define cpu_has_avx512_4vnniw (cp.feat.avx512_4vnniw && xcr0_mask(0xe6))
 #define cpu_has_avx512_4fmaps (cp.feat.avx512_4fmaps && xcr0_mask(0xe6))
 
 #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,6 +479,7 @@ static const struct ext0f38_table {
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x52 ... 0x53] = { .simd_size = simd_128, .d8s = 4 },
     [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
@@ -1924,6 +1925,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
+#define vcpu_has_avx512_4vnniw() vcpu_has(       7, EDX,  2, ctxt, ops)
 #define vcpu_has_avx512_4fmaps() vcpu_has(       7, EDX,  3, ctxt, ops)
 #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
 
@@ -8944,6 +8946,15 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x52): /* vp4dpwssd m128,zmm+3,zmm{k} */
+    case X86EMUL_OPC_EVEX_F2(0x0f38, 0x53): /* vp4dpwssds m128,zmm+3,zmm{k} */
+        host_and_vcpu_must_have(avx512_4vnniw);
+        generate_exception_if((ea.type != OP_MEM || evex.w || evex.brs ||
+                               evex.lr != 2),
+                              EXC_UD);
+        op_mask = op_mask & 0xffff ? 0xf : 0;
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x8f): /* vpshufbitqmb [xyz]mm/mem,[xyz]mm,k{k} */
         generate_exception_if(evex.w || !evex.r || !evex.R || evex.z, EXC_UD);
         /* fall through */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -117,6 +117,7 @@
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
 
 /* CPUID level 0x00000007:0.edx */
+#define cpu_has_avx512_4vnniw   boot_cpu_has(X86_FEATURE_AVX512_4VNNIW)
 #define cpu_has_avx512_4fmaps   boot_cpu_has(X86_FEATURE_AVX512_4FMAPS)
 
 /* CPUID level 0x80000007.edx */






* [PATCH v8 43/50] x86emul: support AVX512_VNNI insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (41 preceding siblings ...)
  2019-03-15 11:04   ` [PATCH v8 42/50] x86emul: support AVX512_4VNNIW insns Jan Beulich
@ 2019-03-15 11:04   ` Jan Beulich
  2019-06-19 15:01     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:05   ` [PATCH v8 44/50] x86emul: support VPCLMULQDQ insns Jan Beulich
                     ` (6 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:04 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As in a few cases before, since the insns here, and in particular their
memory access patterns, follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
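
For illustration, a scalar sketch of one vpdpbusd dword lane as I read
the spec (made-up helper name; vpdpbusds differs only in saturating the
accumulation): four unsigned bytes from one source are multiplied with
the corresponding signed bytes of the other, and the products are
summed into the destination dword.

    #include <stdint.h>

    static int32_t dpbusd_lane(int32_t acc, const uint8_t u[4],
                               const int8_t s[4])
    {
        unsigned int j;

        for ( j = 0; j < 4; ++j )
            acc += u[j] * s[j];

        return acc;
    }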

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Re-base.
v7: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -580,6 +580,13 @@ static const struct test avx512_vbmi2_al
     INSN(pshrdw,    66, 0f3a, 72, vl,  w, vl),
 };
 
+static const struct test avx512_vnni_all[] = {
+    INSN(pdpbusd,  66, 0f38, 50, vl, d, vl),
+    INSN(pdpbusds, 66, 0f38, 51, vl, d, vl),
+    INSN(pdpwssd,  66, 0f38, 52, vl, d, vl),
+    INSN(pdpwssds, 66, 0f38, 53, vl, d, vl),
+};
+
 static const struct test avx512_vpopcntdq_all[] = {
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
@@ -959,5 +966,6 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512_ifma, all);
     RUN(avx512_vbmi, all);
     RUN(avx512_vbmi2, all);
+    RUN(avx512_vnni, all);
     RUN(avx512_vpopcntdq, all);
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
 #define cpu_has_avx512_4vnniw (cp.feat.avx512_4vnniw && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,7 +479,7 @@ static const struct ext0f38_table {
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
-    [0x52 ... 0x53] = { .simd_size = simd_128, .d8s = 4 },
+    [0x50 ... 0x53] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
@@ -1922,6 +1922,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
 #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
@@ -3211,6 +3212,8 @@ x86_decode(
 
                 switch ( b )
                 {
+                /* vp4dpwssd{,s} need special casing */
+                case 0x52: case 0x53:
                 /* v4f{,n}madd{p,s}s need special casing */
                 case 0x9a: case 0x9b: case 0xaa: case 0xab:
                     if ( evex.pfx == vex_f2 )
@@ -9412,6 +9415,14 @@ x86_emulate(
             avx512_vlen_check(true);
         goto simd_zmm;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x50): /* vpdpbusd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x51): /* vpdpbusds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x52): /* vpdpwssd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x53): /* vpdpwssds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vnni);
+        generate_exception_if(evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9a): /* v4fmaddps m128,zmm+3,zmm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0xaa): /* v4fnmaddps m128,zmm+3,zmm{k} */
         host_and_vcpu_must_have(avx512_4fmaps);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -112,6 +112,7 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
 #define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
 #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
 XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
 XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -264,7 +264,7 @@ def crunch_numbers(state):
         # AVX512 features are built on top of AVX512F
         AVX512F: [AVX512DQ, AVX512_IFMA, AVX512PF, AVX512ER, AVX512CD,
                   AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
-                  AVX512_VPOPCNTDQ],
+                  AVX512_VNNI, AVX512_VPOPCNTDQ],
 
         # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask





* [PATCH v8 44/50] x86emul: support VPCLMULQDQ insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (42 preceding siblings ...)
  2019-03-15 11:04   ` [PATCH v8 43/50] x86emul: support AVX512_VNNI insns Jan Beulich
@ 2019-03-15 11:05   ` Jan Beulich
  2019-06-21 12:52     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:06   ` [PATCH v8 45/50] x86emul: support VAES insns Jan Beulich
                     ` (5 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:05 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As to the feature dependency adjustment: while strictly speaking AVX is
a sufficient prereq (to have YMM registers), 256-bit vectors of
integers were only fully introduced with AVX2. Sadly gcc can't be used
as a reference here: it doesn't provide any AVX512-independent built-in
at all.

Along the lines of PCLMULQDQ: since the insns here, and in particular
their memory access patterns, follow the usual scheme, I didn't think
it was necessary to add a contrived test specifically for them, beyond
the Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: Should VPCLMULQDQ also depend on PCLMULQDQ?
---
v8: No need to set fault_suppression to false.
v7: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -591,6 +591,10 @@ static const struct test avx512_vpopcntd
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
 
+static const struct test vpclmulqdq_all[] = {
+    INSN(pclmulqdq, 66, 0f3a, 44, vl, q_nb, vl)
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -968,4 +972,9 @@ void evex_disp8_test(void *instr, struct
     RUN(avx512_vbmi2, all);
     RUN(avx512_vnni, all);
     RUN(avx512_vpopcntdq, all);
+
+    if ( cpu_has_avx512f )
+    {
+        RUN(vpclmulqdq, all);
+    }
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_vpclmulqdq (cp.feat.vpclmulqdq && xcr0_mask(6))
 #define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -594,7 +594,7 @@ static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42 ... 0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x44] = { .simd_size = simd_packed_int },
+    [0x44] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -1922,6 +1922,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_vpclmulqdq()  vcpu_has(         7, ECX, 10, ctxt, ops)
 #define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
@@ -10219,13 +10220,19 @@ x86_emulate(
         goto opmask_shift_imm;
 
     case X86EMUL_OPC_66(0x0f3a, 0x44):     /* pclmulqdq $imm8,xmm/m128,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         host_and_vcpu_must_have(pclmulqdq);
         if ( vex.opcx == vex_none )
             goto simd_0f3a_common;
-        generate_exception_if(vex.l, EXC_UD);
+        if ( vex.l )
+            host_and_vcpu_must_have(vpclmulqdq);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm */
+        host_and_vcpu_must_have(vpclmulqdq);
+        generate_exception_if(evex.brs || evex.opmsk, EXC_UD);
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x4a): /* vblendvps {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x4b): /* vblendvpd {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.w, EXC_UD);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -112,6 +112,7 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_vpclmulqdq      boot_cpu_has(X86_FEATURE_VPCLMULQDQ)
 #define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
 #define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -121,7 +121,7 @@ XEN_CPUFEATURE(PBE,           0*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000001.ecx, word 1 */
 XEN_CPUFEATURE(SSE3,          1*32+ 0) /*A  Streaming SIMD Extensions-3 */
-XEN_CPUFEATURE(PCLMULQDQ,     1*32+ 1) /*A  Carry-less mulitplication */
+XEN_CPUFEATURE(PCLMULQDQ,     1*32+ 1) /*A  Carry-less multiplication */
 XEN_CPUFEATURE(DTES64,        1*32+ 2) /*   64-bit Debug Store */
 XEN_CPUFEATURE(MONITOR,       1*32+ 3) /*   Monitor/Mwait support */
 XEN_CPUFEATURE(DSCPL,         1*32+ 4) /*   CPL Qualified Debug Store */
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(VPCLMULQDQ,    6*32+10) /*A  Vector Carry-less Multiplication Instrs */
 XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
 XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -255,8 +255,9 @@ def crunch_numbers(state):
 
         # This is just the dependency between AVX512 and AVX2 of XSTATE
         # feature flags.  If want to use AVX512, AVX2 must be supported and
-        # enabled.
-        AVX2: [AVX512F],
+        # enabled.  Certain later extensions, acting on 256-bit vectors of
+        # integers, better depend on AVX2 than AVX.
+        AVX2: [AVX512F, VPCLMULQDQ],
 
         # AVX512F is taken to mean hardware support for 512bit registers
         # (which in practice depends on the EVEX prefix to encode) as well





* [PATCH v8 45/50] x86emul: support VAES insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (43 preceding siblings ...)
  2019-03-15 11:05   ` [PATCH v8 44/50] x86emul: support VPCLMULQDQ insns Jan Beulich
@ 2019-03-15 11:06   ` Jan Beulich
  2019-06-21 12:57     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:06   ` [PATCH v8 46/50] x86emul: support GFNI insns Jan Beulich
                     ` (4 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:06 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

As to the feature dependency adjustment: just like for VPCLMULQDQ,
while strictly speaking AVX is a sufficient prereq (to have YMM
registers), 256-bit vectors of integers were only fully introduced with
AVX2.

A new test case (also covering AESNI) will be added to the harness by a
subsequent patch.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: Should VAES also depend on AESNI?
---
v8: No need to set fault_suppression to false.
v7: New.

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -591,6 +591,18 @@ static const struct test avx512_vpopcntd
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
 
+/*
+ * The uses of b in this table are simply (one of) the shortest form(s) of
+ * saying "no broadcast" without introducing a 128-bit granularity enumerator.
+ * Due to all of the insns being WIG, w, d_nb, and q_nb would all also fit.
+ */
+static const struct test vaes_all[] = {
+    INSN(aesdec,     66, 0f38, de, vl, b, vl),
+    INSN(aesdeclast, 66, 0f38, df, vl, b, vl),
+    INSN(aesenc,     66, 0f38, dc, vl, b, vl),
+    INSN(aesenclast, 66, 0f38, dd, vl, b, vl),
+};
+
 static const struct test vpclmulqdq_all[] = {
     INSN(pclmulqdq, 66, 0f3a, 44, vl, q_nb, vl)
 };
@@ -975,6 +987,7 @@ void evex_disp8_test(void *instr, struct
 
     if ( cpu_has_avx512f )
     {
+        RUN(vaes, all);
         RUN(vpclmulqdq, all);
     }
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_vaes      (cp.feat.vaes && xcr0_mask(6))
 #define cpu_has_vpclmulqdq (cp.feat.vpclmulqdq && xcr0_mask(6))
 #define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -541,7 +541,7 @@ static const struct ext0f38_table {
     [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
-    [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
+    [0xdc ... 0xdf] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xf0] = { .two_op = 1 },
     [0xf1] = { .to_mem = 1, .two_op = 1 },
     [0xf2 ... 0xf3] = {},
@@ -1922,6 +1922,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_vaes()        vcpu_has(         7, ECX,  9, ctxt, ops)
 #define vcpu_has_vpclmulqdq()  vcpu_has(         7, ECX, 10, ctxt, ops)
 #define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
@@ -8935,13 +8936,9 @@ x86_emulate(
     case X86EMUL_OPC_66(0x0f38, 0xdb):     /* aesimc xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdb): /* vaesimc xmm/m128,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdc):     /* aesenc xmm/m128,xmm,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xdc): /* vaesenc xmm/m128,xmm,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdd):     /* aesenclast xmm/m128,xmm,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xdd): /* vaesenclast xmm/m128,xmm,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xde):     /* aesdec xmm/m128,xmm,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xde): /* vaesdec xmm/m128,xmm,xmm */
     case X86EMUL_OPC_66(0x0f38, 0xdf):     /* aesdeclast xmm/m128,xmm,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0xdf): /* vaesdeclast xmm/m128,xmm,xmm */
         host_and_vcpu_must_have(aesni);
         if ( vex.opcx == vex_none )
             goto simd_0f38_common;
@@ -9655,6 +9652,24 @@ x86_emulate(
         host_and_vcpu_must_have(avx512er);
         goto simd_zmm_scalar_sae;
 
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xdc):  /* vaesenc {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xdd):  /* vaesenclast {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xde):  /* vaesdec {x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xdf):  /* vaesdeclast {x,y}mm/mem,{x,y}mm,{x,y}mm */
+        if ( !vex.l )
+            host_and_vcpu_must_have(aesni);
+        else
+            host_and_vcpu_must_have(vaes);
+        goto simd_0f_avx;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xdc): /* vaesenc [xyz]mm/mem,[xyz]mm,[xyz]mm */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xdd): /* vaesenclast [xyz]mm/mem,[xyz]mm,[xyz]mm */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xde): /* vaesdec [xyz]mm/mem,[xyz]mm,[xyz]mm */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xdf): /* vaesdeclast [xyz]mm/mem,[xyz]mm,[xyz]mm */
+        host_and_vcpu_must_have(vaes);
+        generate_exception_if(evex.brs || evex.opmsk, EXC_UD);
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC(0x0f38, 0xf0): /* movbe m,r */
     case X86EMUL_OPC(0x0f38, 0xf1): /* movbe r,m */
         vcpu_must_have(movbe);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -112,6 +112,7 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_vaes            boot_cpu_has(X86_FEATURE_VAES)
 #define cpu_has_vpclmulqdq      boot_cpu_has(X86_FEATURE_VPCLMULQDQ)
 #define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
 #define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(VAES,          6*32+ 9) /*A  Vector AES Instrs */
 XEN_CPUFEATURE(VPCLMULQDQ,    6*32+10) /*A  Vector Carry-less Multiplication Instrs */
 XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
 XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -257,7 +257,7 @@ def crunch_numbers(state):
         # feature flags.  If want to use AVX512, AVX2 must be supported and
         # enabled.  Certain later extensions, acting on 256-bit vectors of
         # integers, better depend on AVX2 than AVX.
-        AVX2: [AVX512F, VPCLMULQDQ],
+        AVX2: [AVX512F, VAES, VPCLMULQDQ],
 
         # AVX512F is taken to mean hardware support for 512bit registers
         # (which in practice depends on the EVEX prefix to encode) as well





* [PATCH v8 46/50] x86emul: support GFNI insns
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (44 preceding siblings ...)
  2019-03-15 11:06   ` [PATCH v8 45/50] x86emul: support VAES insns Jan Beulich
@ 2019-03-15 11:06   ` Jan Beulich
  2019-06-21 13:19     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:07   ` [PATCH v8 47/50] x86emul: restore ordering within main switch statement Jan Beulich
                     ` (3 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:06 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Note that revision 035 of the ISA extensions document is ambiguous
regarding fault suppression for VGF2P8MULB: the text says it's
supported, while the exception class listed is E4NF. Given the wording
here and for the other two insns, I'm inclined to trust the text more
than the exception reference; this reading was also confirmed
informally.

As to the feature dependency adjustment: while strictly speaking SSE is
a sufficient prereq (to have XMM registers), vectors of bytes and
qwords were only introduced with SSE2. gcc, for example, uses a similar
connection in its respective intrinsics header.
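
For illustration, gf2p8mulb multiplies bytes in GF(2^8) modulo the AES
polynomial x^8 + x^4 + x^3 + x + 1; this is also what the new test's
inverse-transform check relies on. A scalar sketch of one byte lane
(made-up helper name; not emulator code):

    #include <stdint.h>

    static uint8_t gf2p8mul_byte(uint8_t a, uint8_t b)
    {
        uint8_t r = 0;

        while ( b )
        {
            if ( b & 1 )
                r ^= a;
            /* Multiply a by x, reducing modulo x^8 + x^4 + x^3 + x + 1. */
            a = (a << 1) ^ ((a & 0x80) ? 0x1b : 0);
            b >>= 1;
        }

        return r;
    }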

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: Add {evex}-producing vgf2p8mulb alias to simd.h. Add missing simd.h
    dependency. Re-base.
v7: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -19,7 +19,8 @@ CFLAGS += $(CFLAGS_xeninclude)
 SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er avx512vbmi
 FMA := fma4 fma
 SG := avx2-sg avx512f-sg avx512vl-sg
-TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
+GF := sse2-gf avx2-gf avx512bw-gf
+TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(GF)
 
 OPMASK := avx512f avx512dq avx512bw
 
@@ -142,12 +143,17 @@ $(1)-cflags := \
 	   $(foreach flt,$($(1)-flts), \
 	     "-D_$(vec)x$(idx)f$(flt) -m$(1:-sg=) $(call non-sse,$(1)) -Os -DVEC_MAX=$(vec) -DIDX_SIZE=$(idx) -DFLOAT_SIZE=$(flt)")))
 endef
+define simd-gf-defs
+$(1)-cflags := $(foreach vec,$($(1:-gf=)-vecs), \
+	         "-D_$(vec) -mgfni -m$(1:-gf=) $(call non-sse,$(1)) -Os -DVEC_SIZE=$(vec)")
+endef
 define opmask-defs
 $(1)-opmask-cflags := $(foreach vec,$($(1)-opmask-vecs), "-D_$(vec) -m$(1) -Os -DSIZE=$(vec)")
 endef
 
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
+$(foreach flavor,$(GF),$(eval $(call simd-gf-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
 first-string = $(shell for s in $(1); do echo "$$s"; break; done)
@@ -197,7 +203,10 @@ $(addsuffix .c,$(FMA)):
 $(addsuffix .c,$(SG)):
 	ln -sf simd-sg.c $@
 
-$(addsuffix .h,$(SIMD) $(FMA) $(SG)): simd.h
+$(addsuffix .c,$(GF)):
+	ln -sf simd-gf.c $@
+
+$(addsuffix .h,$(SIMD) $(FMA) $(SG) $(GF)): simd.h
 
 xop.h avx512f.h: simd-fma.c
 
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -591,6 +591,12 @@ static const struct test avx512_vpopcntd
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
 
+static const struct test gfni_all[] = {
+    INSN(gf2p8affineinvqb, 66, 0f3a, cf, vl, q, vl),
+    INSN(gf2p8affineqb,    66, 0f3a, ce, vl, q, vl),
+    INSN(gf2p8mulb,        66, 0f38, cf, vl, b, vl),
+};
+
 /*
  * The uses of b in this table are simply (one of) the shortest form(s) of
  * saying "no broadcast" without introducing a 128-bit granularity enumerator.
@@ -987,6 +993,7 @@ void evex_disp8_test(void *instr, struct
 
     if ( cpu_has_avx512f )
     {
+        RUN(gfni, all);
         RUN(vaes, all);
         RUN(vpclmulqdq, all);
     }
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -371,6 +371,7 @@ OVR(cvttsd2siq);
 OVR(cvttss2si);
 OVR(cvttss2sil);
 OVR(cvttss2siq);
+OVR(gf2p8mulb);
 OVR(movddup);
 OVR(movntdq);
 OVR(movntdqa);
--- /dev/null
+++ b/tools/tests/x86_emulator/simd-gf.c
@@ -0,0 +1,80 @@
+#define UINT_SIZE 1
+
+#include "simd.h"
+ENTRY(gf_test);
+
+#if VEC_SIZE == 16
+# define GF(op, s, a...) __builtin_ia32_vgf2p8 ## op ## _v16qi ## s(a)
+#elif VEC_SIZE == 32
+# define GF(op, s, a...) __builtin_ia32_vgf2p8 ## op ## _v32qi ## s(a)
+#elif VEC_SIZE == 64
+# define GF(op, s, a...) __builtin_ia32_vgf2p8 ## op ## _v64qi ## s(a)
+#endif
+
+#ifdef __AVX512BW__
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# define mul(x, y) GF(mulb, _mask, (vqi_t)(x), (vqi_t)(y), (vqi_t)undef(), ~0)
+# define transform(m, dir, x, c) ({ \
+    vec_t t_; \
+    asm ( "vgf2p8affine" #dir "qb %[imm], %[matrix]%{1to%c[n]%}, %[src], %[dst]" \
+          : [dst] "=v" (t_) \
+          : [matrix] "m" (m), [src] "v" (x), [imm] "i" (c), [n] "i" (VEC_SIZE / 8) ); \
+    t_; \
+})
+#else
+# if defined(__AVX2__)
+#  define bcstq(x) ({ \
+    vdi_t t_; \
+    asm ( "vpbroadcastq %1, %0" : "=x" (t_) : "m" (x) ); \
+    t_; \
+})
+#  define to_bool(cmp) B(ptestc, , cmp, (vdi_t){} == 0)
+# else
+#  define bcstq(x) ((vdi_t){x, x})
+#  define to_bool(cmp) (__builtin_ia32_pmovmskb128(cmp) == 0xffff)
+# endif
+# define eq(x, y) to_bool((x) == (y))
+# define mul(x, y) GF(mulb, , (vqi_t)(x), (vqi_t)(y))
+# define transform(m, dir, x, c) ({ \
+    vdi_t m_ = bcstq(m); \
+    touch(m_); \
+    ((vec_t)GF(affine ## dir ## qb, , (vqi_t)(x), (vqi_t)m_, c)); \
+})
+#endif
+
+const unsigned __attribute__((mode(DI))) ident = 0x0102040810204080ULL;
+
+int gf_test(void)
+{
+    unsigned int i;
+    vec_t src, one;
+
+    for ( i = 0; i < ELEM_COUNT; ++i )
+    {
+        src[i] = i;
+        one[i] = 1;
+    }
+
+    /* Special case for first iteration. */
+    one[0] = 0;
+
+    do {
+        vec_t inv = transform(ident, inv, src, 0);
+
+        touch(src);
+        touch(inv);
+        if ( !eq(mul(src, inv), one) ) return __LINE__;
+
+        touch(src);
+        touch(inv);
+        if ( !eq(mul(inv, src), one) ) return __LINE__;
+
+        one[0] = 1;
+
+        src += ELEM_COUNT;
+        i += ELEM_COUNT;
+    } while ( i < 256 );
+
+    return 0;
+}
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -11,12 +11,14 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "3dnow.h"
 #include "sse.h"
 #include "sse2.h"
+#include "sse2-gf.h"
 #include "sse4.h"
 #include "avx.h"
 #include "fma4.h"
 #include "fma.h"
 #include "avx2.h"
 #include "avx2-sg.h"
+#include "avx2-gf.h"
 #include "xop.h"
 #include "avx512f-opmask.h"
 #include "avx512dq-opmask.h"
@@ -25,6 +27,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-sg.h"
 #include "avx512vl-sg.h"
 #include "avx512bw.h"
+#include "avx512bw-gf.h"
 #include "avx512dq.h"
 #include "avx512er.h"
 #include "avx512vbmi.h"
@@ -138,6 +141,26 @@ static bool simd_check_avx512vbmi_vl(voi
     return cpu_has_avx512_vbmi && cpu_has_avx512vl;
 }
 
+static bool simd_check_sse2_gf(void)
+{
+    return cpu_has_gfni && cpu_has_sse2;
+}
+
+static bool simd_check_avx2_gf(void)
+{
+    return cpu_has_gfni && cpu_has_avx2;
+}
+
+static bool simd_check_avx512bw_gf(void)
+{
+    return cpu_has_gfni && cpu_has_avx512bw;
+}
+
+static bool simd_check_avx512bw_gf_vl(void)
+{
+    return cpu_has_gfni && cpu_has_avx512vl;
+}
+
 static void simd_set_regs(struct cpu_user_regs *regs)
 {
     if ( cpu_has_mmx )
@@ -395,6 +418,12 @@ static const struct {
     AVX512VL(_VBMI+VL u16x8, avx512vbmi,    16u2),
     AVX512VL(_VBMI+VL s16x16, avx512vbmi,   32i2),
     AVX512VL(_VBMI+VL u16x16, avx512vbmi,   32u2),
+    SIMD(GFNI (legacy),       sse2_gf,        16),
+    SIMD(GFNI (VEX/x16),      avx2_gf,        16),
+    SIMD(GFNI (VEX/x32),      avx2_gf,        32),
+    SIMD(GFNI (EVEX/x64), avx512bw_gf,        64),
+    AVX512VL(VL+GFNI (x16), avx512bw_gf,      16),
+    AVX512VL(VL+GFNI (x32), avx512bw_gf,      32),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_gfni       cp.feat.gfni
 #define cpu_has_vaes      (cp.feat.vaes && xcr0_mask(6))
 #define cpu_has_vpclmulqdq (cp.feat.vpclmulqdq && xcr0_mask(6))
 #define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -540,6 +540,7 @@ static const struct ext0f38_table {
     [0xcb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xcf] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xf0] = { .two_op = 1 },
@@ -619,6 +620,7 @@ static const struct ext0f3a_table {
     [0x7c ... 0x7d] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x7e ... 0x7f] = { .simd_size = simd_scalar_opc, .four_op = 1 },
     [0xcc] = { .simd_size = simd_other },
+    [0xce ... 0xcf] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0xdf] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xf0] = {},
 };
@@ -1922,6 +1924,7 @@ static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_gfni()        vcpu_has(         7, ECX,  8, ctxt, ops)
 #define vcpu_has_vaes()        vcpu_has(         7, ECX,  9, ctxt, ops)
 #define vcpu_has_vpclmulqdq()  vcpu_has(         7, ECX, 10, ctxt, ops)
 #define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
@@ -9652,6 +9655,21 @@ x86_emulate(
         host_and_vcpu_must_have(avx512er);
         goto simd_zmm_scalar_sae;
 
+    case X86EMUL_OPC_66(0x0f38, 0xcf):      /* gf2p8mulb xmm/m128,xmm */
+        host_and_vcpu_must_have(gfni);
+        goto simd_0f38_common;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0xcf):  /* vgf2p8mulb {x,y}mm/mem,{x,y}mm,{x,y}mm */
+        host_and_vcpu_must_have(gfni);
+        generate_exception_if(vex.w, EXC_UD);
+        goto simd_0f_avx;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcf): /* vgf2p8mulb [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(gfni);
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
+        elem_bytes = 1;
+        goto avx512f_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdc):  /* vaesenc {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xdd):  /* vaesenclast {x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0xde):  /* vaesdec {x,y}mm/mem,{x,y}mm,{x,y}mm */
@@ -10395,6 +10413,24 @@ x86_emulate(
         op_bytes = 16;
         goto simd_0f3a_common;
 
+    case X86EMUL_OPC_66(0x0f3a, 0xce):      /* gf2p8affineqb $imm8,xmm/m128,xmm */
+    case X86EMUL_OPC_66(0x0f3a, 0xcf):      /* gf2p8affineinvqb $imm8,xmm/m128,xmm */
+        host_and_vcpu_must_have(gfni);
+        goto simd_0f3a_common;
+
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0xce):  /* vgf2p8affineqb $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0xcf):  /* vgf2p8affineinvqb $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
+        host_and_vcpu_must_have(gfni);
+        generate_exception_if(!vex.w, EXC_UD);
+        goto simd_0f_imm8_avx;
+
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0xce): /* vgf2p8affineqb $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0xcf): /* vgf2p8affineinvqb $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(gfni);
+        generate_exception_if(!evex.w, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_66(0x0f3a, 0xdf):     /* aeskeygenassist $imm8,xmm/m128,xmm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0xdf): /* vaeskeygenassist $imm8,xmm/m128,xmm */
         host_and_vcpu_must_have(aesni);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -112,6 +112,7 @@
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_gfni            boot_cpu_has(X86_FEATURE_GFNI)
 #define cpu_has_vaes            boot_cpu_has(X86_FEATURE_VAES)
 #define cpu_has_vpclmulqdq      boot_cpu_has(X86_FEATURE_VPCLMULQDQ)
 #define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@ XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(GFNI,          6*32+ 8) /*A  Galois Field Instrs */
 XEN_CPUFEATURE(VAES,          6*32+ 9) /*A  Vector AES Instrs */
 XEN_CPUFEATURE(VPCLMULQDQ,    6*32+10) /*A  Vector Carry-less Multiplication Instrs */
 XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -197,7 +197,7 @@ def crunch_numbers(state):
         # %XMM support, without specific inter-dependencies.  Additionally
         # AMD has a special mis-alignment sub-mode.
         SSE: [SSE2, SSE3, SSSE3, SSE4A, MISALIGNSSE,
-              AESNI, PCLMULQDQ, SHA],
+              AESNI, PCLMULQDQ, SHA, GFNI],
 
         # SSE2 was re-specified as core instructions for 64bit.
         SSE2: [LM],





* [PATCH v8 47/50] x86emul: restore ordering within main switch statement
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (45 preceding siblings ...)
  2019-03-15 11:06   ` [PATCH v8 46/50] x86emul: support GFNI insns Jan Beulich
@ 2019-03-15 11:07   ` Jan Beulich
  2019-06-21 13:20     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:07   ` [PATCH v8 48/50] x86emul: add an AES/VAES test case to the harness Jan Beulich
                     ` (2 subsequent siblings)
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Incremental additions and/or mistakes have led to some code blocks
sitting in "unexpected" places. Re-sort the case blocks (opcode space;
major opcode; 66/F3/F2 prefix; legacy/VEX/EVEX encoding).

As an exception, the opcode space 0x0f EVEX-encoded VPEXTRW is left in
its current place, to keep it close to the "pextr" label.

Pure code movement.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: New.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -7129,15 +7129,6 @@ x86_emulate(
         ASSERT(!state->simd_size);
         break;
 
-    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7e): /* vmovq xmm/m64,xmm */
-    case X86EMUL_OPC_EVEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
-        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.brs,
-                              EXC_UD);
-        host_and_vcpu_must_have(avx512f);
-        d |= TwoOp;
-        op_bytes = 8;
-        goto simd_zmm;
-
     case X86EMUL_OPC_66(0x0f, 0xe7):     /* movntdq xmm,m128 */
     case X86EMUL_OPC_VEX_66(0x0f, 0xe7): /* vmovntdq {x,y}mm,mem */
         generate_exception_if(ea.type != OP_MEM, EXC_UD);
@@ -7535,6 +7526,15 @@ x86_emulate(
         op_bytes = 8;
         goto simd_0f_int;
 
+    case X86EMUL_OPC_EVEX_F3(0x0f, 0x7e): /* vmovq xmm/m64,xmm */
+    case X86EMUL_OPC_EVEX_66(0x0f, 0xd6): /* vmovq xmm,xmm/m64 */
+        generate_exception_if(evex.lr || !evex.w || evex.opmsk || evex.brs,
+                              EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        d |= TwoOp;
+        op_bytes = 8;
+        goto simd_zmm;
+
     case X86EMUL_OPC(0x0f, 0x80) ... X86EMUL_OPC(0x0f, 0x8f): /* jcc (near) */
         if ( test_cc(b, _regs.eflags) )
             jmp_rel((int32_t)src.val);
@@ -8635,63 +8635,6 @@ x86_emulate(
         dst.type = OP_NONE;
         break;
 
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        host_and_vcpu_must_have(avx512bw);
-        generate_exception_if(!evex.w || evex.brs, EXC_UD);
-        elem_bytes = 2;
-        goto avx512f_no_sae;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
-        op_bytes = elem_bytes;
-        generate_exception_if(evex.w || evex.brs, EXC_UD);
-    avx512_broadcast:
-        /*
-         * For the respective code below the main switch() to work we need to
-         * fold op_mask here: A source element gets read whenever any of its
-         * respective destination elements' mask bits is set.
-         */
-        if ( fault_suppression )
-        {
-            n = 1 << ((b & 3) - evex.w);
-            EXPECT(elem_bytes > 0);
-            ASSERT(op_bytes == n * elem_bytes);
-            for ( i = n; i < (16 << evex.lr) / elem_bytes; i += n )
-                op_mask |= (op_mask >> i) & ((1 << n) - 1);
-        }
-        goto avx512f_no_sae;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
-                                            /* vbroadcastf64x4 m256,zmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
-                                            /* vbroadcasti64x4 m256,zmm{k} */
-        generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
-        /* fall through */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
-                                            /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
-        generate_exception_if(!evex.lr, EXC_UD);
-        /* fall through */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
-                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
-        if ( b == 0x59 )
-            op_bytes = 8;
-        generate_exception_if(evex.brs, EXC_UD);
-        if ( !evex.w )
-            host_and_vcpu_must_have(avx512dq);
-        goto avx512_broadcast;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
-                                            /* vbroadcastf64x2 m128,{y,z}mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
-                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
-        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.brs,
-                              EXC_UD);
-        if ( evex.w )
-            host_and_vcpu_must_have(avx512dq);
-        goto avx512_broadcast;
-
     case X86EMUL_OPC_66(0x0f38, 0x20): /* pmovsxbw xmm/m64,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x21): /* pmovsxbd xmm/m32,xmm */
     case X86EMUL_OPC_66(0x0f38, 0x22): /* pmovsxbq xmm/m16,xmm */
@@ -8725,47 +8668,14 @@ x86_emulate(
         host_and_vcpu_must_have(sse4_1);
         goto simd_0f38_common;
 
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x13): /* vcvtph2ps xmm/mem,{x,y}mm */
-        generate_exception_if(vex.w, EXC_UD);
-        host_and_vcpu_must_have(f16c);
-        op_bytes = 8 << vex.l;
-        goto simd_0f_ymm;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
-        generate_exception_if(evex.w || (ea.type != OP_REG && evex.brs), EXC_UD);
-        host_and_vcpu_must_have(avx512f);
-        if ( !evex.brs )
-            avx512_vlen_check(false);
-        op_bytes = 8 << evex.lr;
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x10): /* vpsrlvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x11): /* vpsravw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x12): /* vpsllvw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512bw);
+        generate_exception_if(!evex.w || evex.brs, EXC_UD);
         elem_bytes = 2;
-        goto simd_zmm;
-
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
-        generate_exception_if(!vex.l || vex.w, EXC_UD);
-        goto simd_0f_avx2;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
-        generate_exception_if(!evex.lr, EXC_UD);
-        fault_suppression = false;
         goto avx512f_no_sae;
 
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x23): /* vpmovsxwd xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x25): /* vpmovsxdq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x30): /* vpmovzxbw xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x33): /* vpmovzxwd xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x35): /* vpmovzxdq xmm/mem,{x,y}mm */
-        op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
-        goto simd_0f_int;
-
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x10): /* vpmovuswb [xyz]mm,{x,y}mm/mem{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x20): /* vpmovsxbw {x,y}mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x20): /* vpmovswb [xyz]mm,{x,y}mm/mem{k} */
@@ -8811,6 +8721,96 @@ x86_emulate(
         elem_bytes = (b & 7) < 3 ? 1 : (b & 7) != 5 ? 2 : 4;
         goto avx512f_no_sae;
 
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x13): /* vcvtph2ps xmm/mem,{x,y}mm */
+        generate_exception_if(vex.w, EXC_UD);
+        host_and_vcpu_must_have(f16c);
+        op_bytes = 8 << vex.l;
+        goto simd_0f_ymm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x13): /* vcvtph2ps {x,y}mm/mem,[xyz]mm{k} */
+        generate_exception_if(evex.w || (ea.type != OP_REG && evex.brs), EXC_UD);
+        host_and_vcpu_must_have(avx512f);
+        if ( !evex.brs )
+            avx512_vlen_check(false);
+        op_bytes = 8 << evex.lr;
+        elem_bytes = 2;
+        goto simd_zmm;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x16): /* vpermps ymm/m256,ymm,ymm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x36): /* vpermd ymm/m256,ymm,ymm */
+        generate_exception_if(!vex.l || vex.w, EXC_UD);
+        goto simd_0f_avx2;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x16): /* vpermp{s,d} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x36): /* vperm{d,q} {y,z}mm/mem,{y,z}mm,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        fault_suppression = false;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x18): /* vbroadcastss xmm/m32,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,[xyz]mm{k} */
+        op_bytes = elem_bytes;
+        generate_exception_if(evex.w || evex.brs, EXC_UD);
+    avx512_broadcast:
+        /*
+         * For the respective code below the main switch() to work we need to
+         * fold op_mask here: A source element gets read whenever any of its
+         * respective destination elements' mask bits is set.
+         */
+        if ( fault_suppression )
+        {
+            n = 1 << ((b & 3) - evex.w);
+            EXPECT(elem_bytes > 0);
+            ASSERT(op_bytes == n * elem_bytes);
+            for ( i = n; i < (16 << evex.lr) / elem_bytes; i += n )
+                op_mask |= (op_mask >> i) & ((1 << n) - 1);
+        }
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1b): /* vbroadcastf32x8 m256,zmm{k} */
+                                            /* vbroadcastf64x4 m256,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5b): /* vbroadcasti32x8 m256,zmm{k} */
+                                            /* vbroadcasti64x4 m256,zmm{k} */
+        generate_exception_if(ea.type != OP_MEM || evex.lr != 2, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x19): /* vbroadcastsd xmm/m64,{y,z}mm{k} */
+                                            /* vbroadcastf32x2 xmm/m64,{y,z}mm{k} */
+        generate_exception_if(!evex.lr, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,[xyz]mm{k} */
+                                            /* vbroadcasti32x2 xmm/m64,[xyz]mm{k} */
+        if ( b == 0x59 )
+            op_bytes = 8;
+        generate_exception_if(evex.brs, EXC_UD);
+        if ( !evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x1a): /* vbroadcastf32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcastf64x2 m128,{y,z}mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x5a): /* vbroadcasti32x4 m128,{y,z}mm{k} */
+                                            /* vbroadcasti64x2 m128,{y,z}mm{k} */
+        generate_exception_if(ea.type != OP_MEM || !evex.lr || evex.brs,
+                              EXC_UD);
+        if ( evex.w )
+            host_and_vcpu_must_have(avx512dq);
+        goto avx512_broadcast;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x20): /* vpmovsxbw xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x21): /* vpmovsxbd xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x22): /* vpmovsxbq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x23): /* vpmovsxwd xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x24): /* vpmovsxwq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x25): /* vpmovsxdq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x30): /* vpmovzxbw xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x31): /* vpmovzxbd xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x32): /* vpmovzxbq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x33): /* vpmovzxwd xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x34): /* vpmovzxwq xmm/mem,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x35): /* vpmovzxdq xmm/mem,{x,y}mm */
+        op_bytes = 16 >> (pmov_convert_delta[b & 7] - vex.l);
+        goto simd_0f_int;
+
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x29): /* vpmov{b,w}2m [xyz]mm,k */
     case X86EMUL_OPC_EVEX_F3(0x0f38, 0x39): /* vpmov{d,q}2m [xyz]mm,k */
         generate_exception_if(!evex.r || !evex.R, EXC_UD);
@@ -8918,6 +8918,52 @@ x86_emulate(
         break;
     }
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2c): /* vscalefp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x42): /* vgetexpp{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512f);
+        if ( ea.type != OP_REG || !evex.brs )
+            avx512_vlen_check(false);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2d): /* vscalefs{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x43): /* vgetexps{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+    simd_zmm_scalar_sae:
+        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
+        if ( !evex.brs )
+            avx512_vlen_check(true);
+        goto simd_zmm;
+
     case X86EMUL_OPC_66(0x0f38, 0x37): /* pcmpgtq xmm/m128,xmm */
         host_and_vcpu_must_have(sse4_2);
         goto simd_0f38_common;
@@ -8950,6 +8996,31 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x50): /* vpdpbusd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x51): /* vpdpbusds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x52): /* vpdpwssd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x53): /* vpdpwssds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vnni);
+        generate_exception_if(evex.w, EXC_UD);
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,{x,y}mm */
+        op_bytes = 1 << ((!(b & 0x20) * 2) + (b & 1));
+        /* fall through */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x46): /* vpsravd {x,y}mm/mem,{x,y}mm,{x,y}mm */
+        generate_exception_if(vex.w, EXC_UD);
+        goto simd_0f_avx2;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(true);
+        goto simd_zmm;
+
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0x52): /* vp4dpwssd m128,zmm+3,zmm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0x53): /* vp4dpwssds m128,zmm+3,zmm{k} */
         host_and_vcpu_must_have(avx512_4vnniw);
@@ -8972,23 +9043,6 @@ x86_emulate(
         host_and_vcpu_must_have(avx512_vpopcntdq);
         goto avx512f_no_sae;
 
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,{x,y}mm */
-        op_bytes = 1 << ((!(b & 0x20) * 2) + (b & 1));
-        /* fall through */
-    case X86EMUL_OPC_VEX_66(0x0f38, 0x46): /* vpsravd {x,y}mm/mem,{x,y}mm,{x,y}mm */
-        generate_exception_if(vex.w, EXC_UD);
-        goto simd_0f_avx2;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
-        host_and_vcpu_must_have(avx512f);
-        generate_exception_if(evex.brs, EXC_UD);
-        avx512_vlen_check(true);
-        goto simd_zmm;
-
     case X86EMUL_OPC_VEX_66(0x0f38, 0x5a): /* vbroadcasti128 m128,ymm */
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
@@ -9370,60 +9424,6 @@ x86_emulate(
         host_and_vcpu_must_have(fma);
         goto simd_0f_ymm;
 
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2c): /* vscalefp{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x42): /* vgetexpp{s,d} [xyz]mm/mem,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x96): /* vfmaddsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x97): /* vfmsubadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x98): /* vfmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9a): /* vfmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9c): /* vfnmadd132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9e): /* vfnmsub132p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa6): /* vfmaddsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa7): /* vfmsubadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa8): /* vfmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaa): /* vfmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xac): /* vfnmadd213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xae): /* vfnmsub213p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb6): /* vfmaddsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb7): /* vfmsubadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb8): /* vfmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xba): /* vfmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbc): /* vfnmadd231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbe): /* vfnmsub231p{s,d} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        host_and_vcpu_must_have(avx512f);
-        if ( ea.type != OP_REG || !evex.brs )
-            avx512_vlen_check(false);
-        goto simd_zmm;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x2d): /* vscalefs{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x43): /* vgetexps{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x99): /* vfmadd132s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9b): /* vfmsub132s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9d): /* vfnmadd132s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x9f): /* vfnmsub132s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xa9): /* vfmadd213s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xab): /* vfmsub213s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xad): /* vfnmadd213s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xaf): /* vfnmsub213s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xb9): /* vfmadd231s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbb): /* vfmsub231s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
-        host_and_vcpu_must_have(avx512f);
-    simd_zmm_scalar_sae:
-        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
-        if ( !evex.brs )
-            avx512_vlen_check(true);
-        goto simd_zmm;
-
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x50): /* vpdpbusd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x51): /* vpdpbusds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x52): /* vpdpwssd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-    case X86EMUL_OPC_EVEX_66(0x0f38, 0x53): /* vpdpwssds [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
-        host_and_vcpu_must_have(avx512_vnni);
-        generate_exception_if(evex.w, EXC_UD);
-        goto avx512f_no_sae;
-
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0x9a): /* v4fmaddps m128,zmm+3,zmm{k} */
     case X86EMUL_OPC_EVEX_F2(0x0f38, 0xaa): /* v4fnmaddps m128,zmm+3,zmm{k} */
         host_and_vcpu_must_have(avx512_4fmaps);
@@ -10266,11 +10266,6 @@ x86_emulate(
         generate_exception_if(evex.brs || evex.opmsk, EXC_UD);
         goto avx512f_imm8_no_sae;
 
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x4a): /* vblendvps {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x4b): /* vblendvpd {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
-        generate_exception_if(vex.w, EXC_UD);
-        goto simd_0f_imm8_avx;
-
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x48): /* vpermil2ps $imm,{x,y}mm/mem,{x,y}mm,{x,y}mm,{x,y}mm */
                                            /* vpermil2ps $imm,{x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x49): /* vpermil2pd $imm,{x,y}mm/mem,{x,y}mm,{x,y}mm,{x,y}mm */
@@ -10278,6 +10273,11 @@ x86_emulate(
         host_and_vcpu_must_have(xop);
         goto simd_0f_imm8_ymm;
 
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x4a): /* vblendvps {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x4b): /* vblendvpd {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
+        generate_exception_if(vex.w, EXC_UD);
+        goto simd_0f_imm8_avx;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x4c): /* vpblendvb {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_int_imm8;





* [PATCH v8 48/50] x86emul: add an AES/VAES test case to the harness
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (46 preceding siblings ...)
  2019-03-15 11:07   ` [PATCH v8 47/50] x86emul: restore ordering within main switch statement Jan Beulich
@ 2019-03-15 11:07   ` Jan Beulich
  2019-06-21 13:36     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:08   ` [PATCH v8 49/50] x86emul: add a SHA " Jan Beulich
  2019-03-15 11:08   ` [PATCH v8 50/50] x86emul: add a PCLMUL/VPCLMUL " Jan Beulich
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:07 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -19,8 +19,9 @@ CFLAGS += $(CFLAGS_xeninclude)
 SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er avx512vbmi
 FMA := fma4 fma
 SG := avx2-sg avx512f-sg avx512vl-sg
+AES := ssse3-aes avx-aes avx2-vaes avx512bw-vaes
 GF := sse2-gf avx2-gf avx512bw-gf
-TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(GF)
+TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(AES) $(GF)
 
 OPMASK := avx512f avx512dq avx512bw
 
@@ -143,6 +144,10 @@ $(1)-cflags := \
 	   $(foreach flt,$($(1)-flts), \
 	     "-D_$(vec)x$(idx)f$(flt) -m$(1:-sg=) $(call non-sse,$(1)) -Os -DVEC_MAX=$(vec) -DIDX_SIZE=$(idx) -DFLOAT_SIZE=$(flt)")))
 endef
+define simd-aes-defs
+$(1)-cflags := $(foreach vec,$($(patsubst %-aes,sse,$(1))-vecs) $($(patsubst %-vaes,%,$(1))-vecs), \
+	         "-D_$(vec) -maes $(addprefix -m,$(subst -,$(space),$(1))) $(call non-sse,$(1)) -Os -DVEC_SIZE=$(vec)")
+endef
 define simd-gf-defs
 $(1)-cflags := $(foreach vec,$($(1:-gf=)-vecs), \
 	         "-D_$(vec) -mgfni -m$(1:-gf=) $(call non-sse,$(1)) -Os -DVEC_SIZE=$(vec)")
@@ -153,6 +158,7 @@ endef
 
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
+$(foreach flavor,$(AES),$(eval $(call simd-aes-defs,$(flavor))))
 $(foreach flavor,$(GF),$(eval $(call simd-gf-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
@@ -203,10 +209,13 @@ $(addsuffix .c,$(FMA)):
 $(addsuffix .c,$(SG)):
 	ln -sf simd-sg.c $@
 
+$(addsuffix .c,$(AES)):
+	ln -sf simd-aes.c $@
+
 $(addsuffix .c,$(GF)):
 	ln -sf simd-gf.c $@
 
-$(addsuffix .h,$(SIMD) $(FMA) $(SG) $(GF)): simd.h
+$(addsuffix .h,$(SIMD) $(FMA) $(SG) $(AES) $(GF)): simd.h
 
 xop.h avx512f.h: simd-fma.c
 
--- /dev/null
+++ b/tools/tests/x86_emulator/simd-aes.c
@@ -0,0 +1,102 @@
+#define UINT_SIZE 1
+
+#include "simd.h"
+ENTRY(aes_test);
+
+#if VEC_SIZE == 16
+# define AES(op, a...) __builtin_ia32_vaes ## op ## _v16qi(a)
+# define imc(x) ((vec_t)__builtin_ia32_aesimc128((vdi_t)(x)))
+#elif VEC_SIZE == 32
+# define AES(op, a...) __builtin_ia32_vaes ## op ## _v32qi(a)
+# define imc(x) ({ \
+    vec_t r_; \
+    unsigned char __attribute__((vector_size(16))) t_; \
+    asm ( "vaesimc (%3), %x0\n\t" \
+          "vaesimc 16(%3), %1\n\t" \
+          "vinserti128 $1, %1, %0, %0" \
+          : "=&v" (r_), "=&v" (t_) \
+          : "m" (x), "r" (&(x)) ); \
+    r_; \
+})
+#elif VEC_SIZE == 64
+# define AES(op, a...) __builtin_ia32_vaes ## op ## _v64qi(a)
+# define imc(x) ({ \
+    vec_t r_; \
+    unsigned char __attribute__((vector_size(16))) t_; \
+    asm ( "vaesimc (%3), %x0\n\t" \
+          "vaesimc 1*16(%3), %1\n\t" \
+          "vinserti32x4 $1, %1, %0, %0\n\t" \
+          "vaesimc 2*16(%3), %1\n\t" \
+          "vinserti32x4 $2, %1, %0, %0\n\t" \
+          "vaesimc 3*16(%3), %1\n\t" \
+          "vinserti32x4 $3, %1, %0, %0" \
+          : "=&v" (r_), "=&v" (t_) \
+          : "m" (x), "r" (&(x)) ); \
+    r_; \
+})
+#endif
+
+#ifdef __AVX512BW__
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# define eq(x, y) (B(pcmpeqb, _mask, (vqi_t)(x), (vqi_t)(y), -1) == ALL_TRUE)
+# define aes(op, x, y) ((vec_t)AES(op, (vqi_t)(x), (vqi_t)(y)))
+#else
+# if defined(__AVX2__) && VEC_SIZE == 32
+#  define to_bool(cmp) B(ptestc, , cmp, (vdi_t){} == 0)
+#  define aes(op, x, y) ((vec_t)AES(op, (vqi_t)(x), (vqi_t)(y)))
+# else
+#  define to_bool(cmp) (__builtin_ia32_pmovmskb128(cmp) == 0xffff)
+#  define aes(op, x, y) ((vec_t)__builtin_ia32_aes ## op ## 128((vdi_t)(x), (vdi_t)(y)))
+# endif
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
+int aes_test(void)
+{
+    unsigned int i;
+    vec_t src, zero = {};
+
+    for ( i = 0; i < ELEM_COUNT; ++i )
+        src[i] = i;
+
+    do {
+        vec_t x, y;
+
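+        /*
+         * AESENCLAST performs ShiftRows and SubBytes (plus the round key
+         * XOR), while AESDEC performs InvShiftRows, InvSubBytes, and
+         * InvMixColumns (plus the round key XOR).  With an all-zero round
+         * key the two therefore compose to just InvMixColumns, i.e. to
+         * what AESIMC computes directly.
+         */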
+        touch(src);
+        x = imc(src);
+        touch(src);
+
+        touch(zero);
+        y = aes(enclast, src, zero);
+        touch(zero);
+        y = aes(dec, y, zero);
+
+        if ( !eq(x, y) ) return __LINE__;
+
+        touch(zero);
+        x = aes(declast, src, zero);
+        touch(zero);
+        y = aes(enc, x, zero);
+        touch(y);
+        x = imc(y);
+
+        if ( !eq(x, src) ) return __LINE__;
+
+#if VEC_SIZE == 16
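+        /*
+         * With a zero RCON, AESKEYGENASSIST's "rotated" result dwords are
+         * plain byte rotations of the unrotated ones; the shuffle below
+         * reproduces exactly that relationship, so x and y have to match.
+         */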
+        touch(src);
+        x = (vec_t)__builtin_ia32_aeskeygenassist128((vdi_t)src, 0);
+        touch(src);
+        y = (vec_t)__builtin_ia32_pshufb128((vqi_t)x,
+                                            (vqi_t){  7,  4,  5,  6,
+                                                      1,  2,  3,  0,
+                                                     15, 12, 13, 14,
+                                                      9, 10, 11,  8 });
+        if ( !eq(x, y) ) return __LINE__;
+#endif
+
+        src += ELEM_COUNT;
+        i += ELEM_COUNT;
+    } while ( i <= 256 );
+
+    return 0;
+}
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -340,6 +340,10 @@ REN(pandn, , d);
 REN(por, , d);
 REN(pxor, , d);
 #  endif
+OVR(aesdec);
+OVR(aesdeclast);
+OVR(aesenc);
+OVR(aesenclast);
 OVR(cvtpd2dqx);
 OVR(cvtpd2dqy);
 OVR(cvtpd2psx);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -12,12 +12,15 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "sse.h"
 #include "sse2.h"
 #include "sse2-gf.h"
+#include "ssse3-aes.h"
 #include "sse4.h"
 #include "avx.h"
+#include "avx-aes.h"
 #include "fma4.h"
 #include "fma.h"
 #include "avx2.h"
 #include "avx2-sg.h"
+#include "avx2-vaes.h"
 #include "avx2-gf.h"
 #include "xop.h"
 #include "avx512f-opmask.h"
@@ -27,6 +30,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f-sg.h"
 #include "avx512vl-sg.h"
 #include "avx512bw.h"
+#include "avx512bw-vaes.h"
 #include "avx512bw-gf.h"
 #include "avx512dq.h"
 #include "avx512er.h"
@@ -91,6 +95,16 @@ static bool simd_check_xop(void)
     return cpu_has_xop;
 }
 
+static bool simd_check_ssse3_aes(void)
+{
+    return cpu_has_aesni && cpu_has_ssse3;
+}
+
+static bool simd_check_avx_aes(void)
+{
+    return cpu_has_aesni && cpu_has_avx;
+}
+
 static bool simd_check_avx512f(void)
 {
     return cpu_has_avx512f;
@@ -141,6 +155,22 @@ static bool simd_check_avx512vbmi_vl(voi
     return cpu_has_avx512_vbmi && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx2_vaes(void)
+{
+    return cpu_has_aesni && cpu_has_vaes && cpu_has_avx2;
+}
+
+static bool simd_check_avx512bw_vaes(void)
+{
+    return cpu_has_aesni && cpu_has_vaes && cpu_has_avx512bw;
+}
+
+static bool simd_check_avx512bw_vaes_vl(void)
+{
+    return cpu_has_aesni && cpu_has_vaes &&
+           cpu_has_avx512bw && cpu_has_avx512vl;
+}
+
 static bool simd_check_sse2_gf(void)
 {
     return cpu_has_gfni && cpu_has_sse2;
@@ -319,6 +349,8 @@ static const struct {
     SIMD(XOP i16x16,              xop,      32i2),
     SIMD(XOP i32x8,               xop,      32i4),
     SIMD(XOP i64x4,               xop,      32i8),
+    SIMD(AES (legacy),      ssse3_aes,        16),
+    SIMD(AES (VEX/x16),       avx_aes,        16),
     SIMD(OPMASK/w,     avx512f_opmask,         2),
     SIMD(OPMASK+DQ/b, avx512dq_opmask,         1),
     SIMD(OPMASK+DQ/w, avx512dq_opmask,         2),
@@ -418,6 +450,10 @@ static const struct {
     AVX512VL(_VBMI+VL u16x8, avx512vbmi,    16u2),
     AVX512VL(_VBMI+VL s16x16, avx512vbmi,   32i2),
     AVX512VL(_VBMI+VL u16x16, avx512vbmi,   32u2),
+    SIMD(VAES (VEX/x32),    avx2_vaes,        32),
+    SIMD(VAES (EVEX/x64), avx512bw_vaes,      64),
+    AVX512VL(VL+VAES (x16), avx512bw_vaes,    16),
+    AVX512VL(VL+VAES (x32), avx512bw_vaes,    32),
     SIMD(GFNI (legacy),       sse2_gf,        16),
     SIMD(GFNI (VEX/x16),      avx2_gf,        16),
     SIMD(GFNI (VEX/x32),      avx2_gf,        32),
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -125,10 +125,12 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_sse        cp.basic.sse
 #define cpu_has_sse2       cp.basic.sse2
 #define cpu_has_sse3       cp.basic.sse3
+#define cpu_has_ssse3      cp.basic.ssse3
 #define cpu_has_fma       (cp.basic.fma && xcr0_mask(6))
 #define cpu_has_sse4_1     cp.basic.sse4_1
 #define cpu_has_sse4_2     cp.basic.sse4_2
 #define cpu_has_popcnt     cp.basic.popcnt
+#define cpu_has_aesni      cp.basic.aesni
 #define cpu_has_avx       (cp.basic.avx  && xcr0_mask(6))
 #define cpu_has_f16c      (cp.basic.f16c && xcr0_mask(6))
 





* [PATCH v8 49/50] x86emul: add a SHA test case to the harness
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (47 preceding siblings ...)
  2019-03-15 11:07   ` [PATCH v8 48/50] x86emul: add an AES/VAES test case to the harness Jan Beulich
@ 2019-03-15 11:08   ` Jan Beulich
  2019-06-21 13:51     ` [Xen-devel] " Andrew Cooper
  2019-03-15 11:08   ` [PATCH v8 50/50] x86emul: add a PCLMUL/VPCLMUL " Jan Beulich
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also use this for AVX512VL VPRO{L,R}{,V}D testing, as well as for some
further shift testing.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: New.
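
As a plain-C reference, this is the scalar building block behind both the
constant-count and the variable-count (VPRO{L,R}VD) rotate checks (a
sketch, per 32-bit element; ror is analogous with the shifts swapped):

    static inline uint32_t rol32(uint32_t x, uint32_t n)
    {
        n &= 31;
        return n ? (x << n) | (x >> (32 - n)) : x;
    }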

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -20,8 +20,9 @@ SIMD := 3dnow sse sse2 sse4 avx avx2 xop
 FMA := fma4 fma
 SG := avx2-sg avx512f-sg avx512vl-sg
 AES := ssse3-aes avx-aes avx2-vaes avx512bw-vaes
+SHA := sse4-sha avx-sha avx512f-sha
 GF := sse2-gf avx2-gf avx512bw-gf
-TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(AES) $(GF)
+TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(AES) $(SHA) $(GF)
 
 OPMASK := avx512f avx512dq avx512bw
 
@@ -148,6 +149,10 @@ define simd-aes-defs
 $(1)-cflags := $(foreach vec,$($(patsubst %-aes,sse,$(1))-vecs) $($(patsubst %-vaes,%,$(1))-vecs), \
 	         "-D_$(vec) -maes $(addprefix -m,$(subst -,$(space),$(1))) $(call non-sse,$(1)) -Os -DVEC_SIZE=$(vec)")
 endef
+define simd-sha-defs
+$(1)-cflags := $(foreach vec,$(sse-vecs), \
+	         "-D_$(vec) $(addprefix -m,$(subst -,$(space),$(1))) -Os -DVEC_SIZE=$(vec)")
+endef
 define simd-gf-defs
 $(1)-cflags := $(foreach vec,$($(1:-gf=)-vecs), \
 	         "-D_$(vec) -mgfni -m$(1:-gf=) $(call non-sse,$(1)) -Os -DVEC_SIZE=$(vec)")
@@ -159,6 +164,7 @@ endef
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
 $(foreach flavor,$(AES),$(eval $(call simd-aes-defs,$(flavor))))
+$(foreach flavor,$(SHA),$(eval $(call simd-sha-defs,$(flavor))))
 $(foreach flavor,$(GF),$(eval $(call simd-gf-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
 
@@ -212,10 +218,13 @@ $(addsuffix .c,$(SG)):
 $(addsuffix .c,$(AES)):
 	ln -sf simd-aes.c $@
 
+$(addsuffix .c,$(SHA)):
+	ln -sf simd-sha.c $@
+
 $(addsuffix .c,$(GF)):
 	ln -sf simd-gf.c $@
 
-$(addsuffix .h,$(SIMD) $(FMA) $(SG) $(AES) $(GF)): simd.h
+$(addsuffix .h,$(SIMD) $(FMA) $(SG) $(AES) $(SHA) $(GF)): simd.h
 
 xop.h avx512f.h: simd-fma.c
 
--- /dev/null
+++ b/tools/tests/x86_emulator/simd-sha.c
@@ -0,0 +1,392 @@
+#define INT_SIZE 4
+
+#include "simd.h"
+ENTRY(sha_test);
+
+#define SHA(op, a...) __builtin_ia32_sha ## op(a)
+
+#ifdef __AVX512F__
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# define eq(x, y) (B(pcmpeqd, _mask, x, y, -1) == ALL_TRUE)
+# define blend(x, y, sel) B(movdqa32_, _mask, y, x, sel)
+# define rot_c(f, r, x, n) B(pro ## f ## d, _mask, x, n, undef(), ~0)
+# define rot_s(f, r, x, n) ({ /* gcc does not support embedded broadcast */ \
+    vec_t r_; \
+    asm ( "vpro" #f "vd %2%{1to%c3%}, %1, %0" \
+          : "=v" (r_) \
+          : "v" (x), "m" (n), "i" (ELEM_COUNT) ); \
+    r_; \
+})
+# define rot_v(d, x, n) B(pro ## d ## vd, _mask, x, n, undef(), ~0)
+# define shift_s(d, x, n) ({ \
+    vec_t r_; \
+    asm ( "vps" #d "lvd %2%{1to%c3%}, %1, %0" \
+          : "=v" (r_) \
+          : "v" (x), "m" (n), "i" (ELEM_COUNT) ); \
+    r_; \
+})
+# define vshift(d, x, n) ({ /* gcc does not allow memory operands */ \
+    vec_t r_; \
+    asm ( "vps" #d "ldq %2, %1, %0" \
+          : "=v" (r_) : "m" (x), "i" ((n) * ELEM_SIZE) ); \
+    r_; \
+})
+#else
+# define to_bool(cmp) (__builtin_ia32_pmovmskb128(cmp) == 0xffff)
+# define eq(x, y) to_bool((x) == (y))
+# define blend(x, y, sel) \
+    ((vec_t)__builtin_ia32_pblendw128((vhi_t)(x), (vhi_t)(y), \
+                                      ((sel) & 1 ? 0x03 : 0) | \
+                                      ((sel) & 2 ? 0x0c : 0) | \
+                                      ((sel) & 4 ? 0x30 : 0) | \
+                                      ((sel) & 8 ? 0xc0 : 0)))
+# define rot_c(f, r, x, n) (sh ## f ## _c(x, n) | sh ## r ## _c(x, 32 - (n)))
+# define rot_s(f, r, x, n) ({ /* gcc does not allow memory operands */ \
+    vec_t r_, t_, n_ = (vec_t){ 32 } - (n); \
+    asm ( "ps" #f "ld %2, %0; ps" #r "ld %3, %1; por %1, %0" \
+          : "=&x" (r_), "=&x" (t_) \
+          : "m" (n), "m" (n_), "0" (x), "1" (x) ); \
+    r_; \
+})
+static inline unsigned int rotl(unsigned int x, unsigned int n)
+{
+    return (x << (n & 0x1f)) | (x >> ((32 - n) & 0x1f));
+}
+static inline unsigned int rotr(unsigned int x, unsigned int n)
+{
+    return (x >> (n & 0x1f)) | (x << ((32 - n) & 0x1f));
+}
+# define rot_v(d, x, n) ({ \
+    vec_t t_; \
+    unsigned int i_; \
+    for ( i_ = 0; i_ < ELEM_COUNT; ++i_ ) \
+        t_[i_] = rot ## d((x)[i_], (n)[i_]); \
+    t_; \
+})
+# define shift_s(d, x, n) ({ \
+    vec_t r_; \
+    asm ( "ps" #d "ld %1, %0" : "=&x" (r_) : "m" (n), "0" (x) ); \
+    r_; \
+})
+# define vshift(d, x, n) \
+    (vec_t)(__builtin_ia32_ps ## d ## ldqi128((vdi_t)(x), (n) * ELEM_SIZE * 8))
+#endif
+
+#define alignr(x, y, n) ((vec_t)__builtin_ia32_palignr128((vdi_t)(x), (vdi_t)(y), (n) * 8))
+#define hadd(x, y) __builtin_ia32_phaddd128(x, y)
+#define rol_c(x, n) rot_c(l, r, x, n)
+#define rol_s(x, n) rot_s(l, r, x, n)
+#define rol_v(x, n...) rot_v(l, x, n)
+#define ror_c(x, n) rot_c(r, l, x, n)
+#define ror_s(x, n) rot_s(r, l, x, n)
+#define ror_v(x, n...) rot_v(r, x, n)
+#define shl_c(x, n) __builtin_ia32_pslldi128(x, n)
+#define shl_s(x, n) shift_s(l, x, n)
+#define shr_c(x, n) __builtin_ia32_psrldi128(x, n)
+#define shr_s(x, n) shift_s(r, x, n)
+#define shuf(x, s) __builtin_ia32_pshufd(x, s)
+#define swap(x) shuf(x, 0b00011011)
+#define vshl(x, n) vshift(l, x, n)
+#define vshr(x, n) vshift(r, x, n)
+
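+/*
+ * sigma0(w) = ROR7(w) ^ ROR18(w) ^ SHR3(w); ROL14 stands in for ROR18 so
+ * that the left-rotate forms get exercised as well.
+ */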
+static inline vec_t sha256_sigma0(vec_t w)
+{
+    vec_t res;
+
+    touch(w);
+    res = ror_c(w, 7);
+    touch(w);
+    res ^= rol_c(w, 14);
+    touch(w);
+    res ^= shr_c(w, 3);
+    touch(w);
+
+    return res;
+}
+
+static inline vec_t sha256_sigma1(vec_t w)
+{
+    vec_t _17 = { 17 }, _19 = { 19 }, _10 = { 10 };
+
+    return ror_s(w, _17) ^ ror_s(w, _19) ^ shr_s(w, _10);
+}
+
+static inline vec_t sha256_Sigma0(vec_t w)
+{
+    vec_t res, n1 = { 0, 0, 2, 2 }, n2 = { 0, 0, 13, 13 }, n3 = { 0, 0, 10, 10 };
+
+    touch(n1);
+    res = ror_v(w, n1);
+    touch(n2);
+    res ^= ror_v(w, n2);
+    touch(n3);
+
+    return res ^ rol_v(w, n3);
+}
+
+static inline vec_t sha256_Sigma1(vec_t w)
+{
+    return ror_c(w, 6) ^ ror_c(w, 11) ^ rol_c(w, 7);
+}
+
+int sha_test(void)
+{
+    unsigned int i;
+    vec_t src, one = { 1 };
+    vqi_t raw = {};
+
+    for ( i = 1; i < VEC_SIZE; ++i )
+        raw[i] = i;
+    src = (vec_t)raw;
+
+    for ( i = 0; i < 256; i += VEC_SIZE )
+    {
+        vec_t x, y, tmp, hash = -src;
+        vec_t a, b, c, d, e, g, h;
+        unsigned int k, r;
+
+        touch(src);
+        x = SHA(1msg1, hash, src);
+        touch(src);
+        y = hash ^ alignr(hash, src, 8);
+        touch(src);
+
+        if ( !eq(x, y) ) return __LINE__;
+
+        touch(src);
+        x = SHA(1msg2, hash, src);
+        touch(src);
+        tmp = hash ^ alignr(src, hash, 12);
+        touch(tmp);
+        y = rol_c(tmp, 1);
+        tmp = hash ^ alignr(src, y, 12);
+        touch(tmp);
+        y = rol_c(tmp, 1);
+
+        if ( !eq(x, y) ) return __LINE__;
+
+        touch(src);
+        x = SHA(1msg2, hash, src);
+        touch(src);
+        tmp = rol_s(hash ^ alignr(src, hash, 12), one);
+        y = rol_s(hash ^ alignr(src, tmp, 12), one);
+
+        if ( !eq(x, y) ) return __LINE__;
+
+        touch(src);
+        x = SHA(1nexte, hash, src);
+        touch(src);
+        touch(hash);
+        tmp = rol_c(hash, 30);
+        tmp[2] = tmp[1] = tmp[0] = 0;
+
+        if ( !eq(x, src + tmp) ) return __LINE__;
+
+        /*
+         * SHA1RNDS4
+         *
+         * SRC1 = { A0, B0, C0, D0 }
+         * SRC2 = W' = { W[0]+E0, W[1], W[2], W[3] }
+         *
+         * (NB that the notation is not C-like, i.e. elements are listed
+         * high-to-low everywhere in this comment.)
+         *
+         * In order to pick a simple rounds function, an immediate value of
+         * 1 is used; 3 would also be a possibility.
+         *
+         * Applying
+         *
+         * A1 = ROL5(A0) + (B0 ^ C0 ^ D0) + W'[0] + K
+         * E1 = D0
+         * D1 = C0
+         * C1 = ROL30(B0)
+         * B1 = A0
+         *
+         * iteratively four times and resolving round variable values to
+         * A<n> and B0, C0, and D0 we get
+         *
+         * A4 = ROL5(A3) + (A2 ^ ROL30(A1) ^ ROL30(A0)) + W'[3] + ROL30(B0) + K
+         * A3 = ROL5(A2) + (A1 ^ ROL30(A0) ^ ROL30(B0)) + W'[2] +       C0  + K
+         * A2 = ROL5(A1) + (A0 ^ ROL30(B0) ^       C0 ) + W'[1] +       D0  + K
+         * A1 = ROL5(A0) + (B0 ^       C0  ^       D0 ) + W'[0]             + K
+         *
+         * (respective per-column variable names:
+         *  y         a      b          c           d      src           e    k
+         * )
+         *
+         * with
+         *
+         * B4 = A3
+         * C4 = ROL30(A2)
+         * D4 = ROL30(A1)
+         * E4 = ROL30(A0)
+         *
+         * and hence
+         *
+         * DST = { A4, A3, ROL30(A2), ROL30(A1) }
+         */
+
+        touch(src);
+        x = SHA(1rnds4, hash, src, 1);
+        touch(src);
+
+        a = vshr(hash, 3);
+        b = vshr(hash, 2);
+        touch(hash);
+        d = rol_c(hash, 30);
+        touch(hash);
+        d = blend(d, hash, 0b0011);
+        c = vshr(d, 1);
+        e = vshl(d, 1);
+        tmp = (vec_t){};
+        k = rol_c(SHA(1rnds4, tmp, tmp, 1), 2)[0];
+
+        for ( r = 0; r < 4; ++r )
+        {
+            y = rol_c(a, 5) + (b ^ c ^ d) + swap(src) + e + k;
+
+            switch ( r )
+            {
+            case 0:
+                c[3] = rol_c(y, 30)[0];
+                /* fall through */
+            case 1:
+                b[r + 2] = y[r];
+                /* fall through */
+            case 2:
+                a[r + 1] = y[r];
+                break;
+            }
+
+            switch ( r )
+            {
+            case 3:
+                if ( a[3] != y[2] ) return __LINE__;
+                /* fall through */
+            case 2:
+                if ( a[2] != y[1] ) return __LINE__;
+                if ( b[3] != y[1] ) return __LINE__;
+                /* fall through */
+            case 1:
+                if ( a[1] != y[0] ) return __LINE__;
+                if ( b[2] != y[0] ) return __LINE__;
+                if ( c[3] != rol_c(y, 30)[0] ) return __LINE__;
+                break;
+            }
+        }
+
+        a = blend(rol_c(y, 30), y, 0b1100);
+
+        if ( !eq(x, a) ) return __LINE__;
+
+        touch(src);
+        x = SHA(256msg1, hash, src);
+        touch(src);
+        y = hash + sha256_sigma0(alignr(src, hash, 4));
+
+        if ( !eq(x, y) ) return __LINE__;
+
+        touch(src);
+        x = SHA(256msg2, hash, src);
+        touch(src);
+        tmp = hash + sha256_sigma1(alignr(hash, src, 8));
+        y = hash + sha256_sigma1(alignr(tmp, src, 8));
+
+        if ( !eq(x, y) ) return __LINE__;
+
+        /*
+         * SHA256RNDS2
+         *
+         * SRC1 = { C0, D0, G0, H0 }
+         * SRC2 = { A0, B0, E0, F0 }
+         * XMM0 = W' = { ?, ?, WK1, WK0 }
+         *
+         * (NB that the notation again is not C-like, i.e. elements are listed
+         * high-to-low everywhere in this comment.)
+         *
+         * Ch(E,F,G) = (E & F) ^ (~E & G)
+         * Maj(A,B,C) = (A & B) ^ (A & C) ^ (B & C)
+         *
+         * Σ0(A) = ROR2(A) ^ ROR13(A) ^ ROR22(A)
+         * Σ1(E) = ROR6(E) ^ ROR11(E) ^ ROR25(E)
+         *
+         * Applying
+         *
+         * A1 = Ch(E0, F0, G0) + Σ1(E0) + WK0 + H0 + Maj(A0, B0, C0) + Σ0(A0)
+         * B1 = A0
+         * C1 = B0
+         * D1 = C0
+         * E1 = Ch(E0, F0, G0) + Σ1(E0) + WK0 + H0 + D0
+         * F1 = E0
+         * G1 = F0
+         * H1 = G0
+         *
+         * iteratively four times and resolving round variable values to
+         * A<n> / E<n> and B0, C0, D0, F0, G0, and H0 we get
+         *
+         * A2 = Ch(E1, E0, F0) + Σ1(E1) + WK1 + G0 + Maj(A1, A0, B0) + Σ0(A1)
+         * A1 = Ch(E0, F0, G0) + Σ1(E0) + WK0 + H0 + Maj(A0, B0, C0) + Σ0(A0)
+         * E2 = Ch(E1, E0, F0) + Σ1(E1) + WK1 + G0 + C0
+         * E1 = Ch(E0, F0, G0) + Σ1(E0) + WK0 + H0 + D0
+         *
+         * with
+         *
+         * B2 = A1
+         * F2 = E1
+         *
+         * and hence
+         *
+         * DST = { A2, A1, E2, E1 }
+         *
+         * which we can simplify a little, by letting A0, B0, and E0 be zero
+         * and F0 = ~G0, and by then utilizing
+         *
+         * Ch(0, 0, x) = x
+         * Ch(x, 0, y) = ~x & y
+         * Maj(x, 0, 0) = Maj(0, x, 0) = Maj(0, 0, x) = 0
+         *
+         * A2 = (~E1 & F0) + Σ1(E1) + WK1 + G0 + Σ0(A1)
+         * A1 = (~E0 & G0) + Σ1(E0) + WK0 + H0 + Σ0(A0)
+         * E2 = (~E1 & F0) + Σ1(E1) + WK1 + G0 + C0
+         * E1 = (~E0 & G0) + Σ1(E0) + WK0 + H0 + D0
+         *
+         * (respective per-column variable names:
+         *  y      e    g        e    src    h    d
+         * )
+         */
+
+        tmp = (vec_t){ ~hash[1] };
+        touch(tmp);
+        x = SHA(256rnds2, hash, tmp, src);
+        touch(tmp);
+
+        e = y = (vec_t){};
+        d = alignr(y, hash, 8);
+        g = (vec_t){ hash[1], tmp[0], hash[1], tmp[0] };
+        h = shuf(hash, 0b01000100);
+
+        for ( r = 0; r < 2; ++r )
+        {
+            y = (~e & g) + sha256_Sigma1(e) + shuf(src, 0b01000100) +
+                h + sha256_Sigma0(d);
+
+            if ( !r )
+            {
+                d[3] = y[2];
+                e[3] = e[1] = y[0];
+            }
+            else if ( d[3] != y[2] )
+                return __LINE__;
+            else if ( e[1] != y[0] )
+                return __LINE__;
+            else if ( e[3] != y[0] )
+                return __LINE__;
+        }
+
+        if ( !eq(x, y) ) return __LINE__;
+
+        src += 0x01010101 * VEC_SIZE;
+    }
+
+    return 0;
+}
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -14,8 +14,10 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "sse2-gf.h"
 #include "ssse3-aes.h"
 #include "sse4.h"
+#include "sse4-sha.h"
 #include "avx.h"
 #include "avx-aes.h"
+#include "avx-sha.h"
 #include "fma4.h"
 #include "fma.h"
 #include "avx2.h"
@@ -28,6 +30,7 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512bw-opmask.h"
 #include "avx512f.h"
 #include "avx512f-sg.h"
+#include "avx512f-sha.h"
 #include "avx512vl-sg.h"
 #include "avx512bw.h"
 #include "avx512bw-vaes.h"
@@ -155,6 +158,21 @@ static bool simd_check_avx512vbmi_vl(voi
     return cpu_has_avx512_vbmi && cpu_has_avx512vl;
 }
 
+static bool simd_check_sse4_sha(void)
+{
+    return cpu_has_sha && cpu_has_sse4_2;
+}
+
+static bool simd_check_avx_sha(void)
+{
+    return cpu_has_sha && cpu_has_avx;
+}
+
+static bool simd_check_avx512f_sha_vl(void)
+{
+    return cpu_has_sha && cpu_has_avx512vl;
+}
+
 static bool simd_check_avx2_vaes(void)
 {
     return cpu_has_aesni && cpu_has_vaes && cpu_has_avx2;
@@ -450,6 +468,9 @@ static const struct {
     AVX512VL(_VBMI+VL u16x8, avx512vbmi,    16u2),
     AVX512VL(_VBMI+VL s16x16, avx512vbmi,   32i2),
     AVX512VL(_VBMI+VL u16x16, avx512vbmi,   32u2),
+    SIMD(SHA,                sse4_sha,        16),
+    SIMD(AVX+SHA,             avx_sha,        16),
+    AVX512VL(VL+SHA,      avx512f_sha,        16),
     SIMD(VAES (VEX/x32),    avx2_vaes,        32),
     SIMD(VAES (EVEX/x64), avx512bw_vaes,      64),
     AVX512VL(VL+VAES (x16), avx512bw_vaes,    16),
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -142,6 +142,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_ifma (cp.feat.avx512_ifma && xcr0_mask(0xe6))
 #define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512cd  (cp.feat.avx512cd && xcr0_mask(0xe6))
+#define cpu_has_sha        cp.feat.sha
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))





* [PATCH v8 50/50] x86emul: add a PCLMUL/VPCLMUL test case to the harness
  2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
                     ` (48 preceding siblings ...)
  2019-03-15 11:08   ` [PATCH v8 49/50] x86emul: add a SHA " Jan Beulich
@ 2019-03-15 11:08   ` Jan Beulich
  2019-06-21 13:58     ` [Xen-devel] " Andrew Cooper
  49 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-15 11:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Roger Pau Monne

Also use this for AVX512_VBMI2 VPSH{L,R}D{,V}{D,Q,W} testing (only the
quad word right shifts actually get used; the assumption is that their
"left" counterparts as well as the double word and word forms then work
too).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v8: New.
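
For reference, the per-lane operation the VPSHRD tests expect, written as
plain C (a sketch; shown for the 64-bit right-shift form only, with a
count below 64 assumed):

    static inline uint64_t shrd64(uint64_t lo, uint64_t hi, unsigned int n)
    {
        /* Bits shifted out at the top of 'lo' are refilled from 'hi'. */
        return n ? (lo >> n) | (hi << (64 - n)) : lo;
    }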

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -20,9 +20,10 @@ SIMD := 3dnow sse sse2 sse4 avx avx2 xop
 FMA := fma4 fma
 SG := avx2-sg avx512f-sg avx512vl-sg
 AES := ssse3-aes avx-aes avx2-vaes avx512bw-vaes
+CLMUL := ssse3-pclmul avx-pclmul avx2-vpclmulqdq avx512bw-vpclmulqdq avx512vbmi2-vpclmulqdq
 SHA := sse4-sha avx-sha avx512f-sha
 GF := sse2-gf avx2-gf avx512bw-gf
-TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(AES) $(SHA) $(GF)
+TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(AES) $(CLMUL) $(SHA) $(GF)
 
 OPMASK := avx512f avx512dq avx512bw
 
@@ -89,6 +90,7 @@ avx512er-flts := 4 8
 avx512vbmi-vecs := $(avx512bw-vecs)
 avx512vbmi-ints := $(avx512bw-ints)
 avx512vbmi-flts := $(avx512bw-flts)
+avx512vbmi2-vecs := $(avx512bw-vecs)
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
@@ -149,6 +151,10 @@ define simd-aes-defs
 $(1)-cflags := $(foreach vec,$($(patsubst %-aes,sse,$(1))-vecs) $($(patsubst %-vaes,%,$(1))-vecs), \
 	         "-D_$(vec) -maes $(addprefix -m,$(subst -,$(space),$(1))) $(call non-sse,$(1)) -Os -DVEC_SIZE=$(vec)")
 endef
+define simd-clmul-defs
+$(1)-cflags := $(foreach vec,$($(patsubst %-pclmul,sse,$(1))-vecs) $($(patsubst %-vpclmulqdq,%,$(1))-vecs), \
+	         "-D_$(vec) -mpclmul $(addprefix -m,$(subst -,$(space),$(1))) $(call non-sse,$(1)) -Os -DVEC_SIZE=$(vec)")
+endef
 define simd-sha-defs
 $(1)-cflags := $(foreach vec,$(sse-vecs), \
 	         "-D_$(vec) $(addprefix -m,$(subst -,$(space),$(1))) -Os -DVEC_SIZE=$(vec)")
@@ -164,6 +170,7 @@ endef
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
 $(foreach flavor,$(AES),$(eval $(call simd-aes-defs,$(flavor))))
+$(foreach flavor,$(CLMUL),$(eval $(call simd-clmul-defs,$(flavor))))
 $(foreach flavor,$(SHA),$(eval $(call simd-sha-defs,$(flavor))))
 $(foreach flavor,$(GF),$(eval $(call simd-gf-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
@@ -218,13 +225,16 @@ $(addsuffix .c,$(SG)):
 $(addsuffix .c,$(AES)):
 	ln -sf simd-aes.c $@
 
+$(addsuffix .c,$(CLMUL)):
+	ln -sf simd-clmul.c $@
+
 $(addsuffix .c,$(SHA)):
 	ln -sf simd-sha.c $@
 
 $(addsuffix .c,$(GF)):
 	ln -sf simd-gf.c $@
 
-$(addsuffix .h,$(SIMD) $(FMA) $(SG) $(AES) $(SHA) $(GF)): simd.h
+$(addsuffix .h,$(SIMD) $(FMA) $(SG) $(AES) $(CLMUL) $(SHA) $(GF)): simd.h
 
 xop.h avx512f.h: simd-fma.c
 
--- /dev/null
+++ b/tools/tests/x86_emulator/simd-clmul.c
@@ -0,0 +1,150 @@
+#define UINT_SIZE 8
+
+#include "simd.h"
+ENTRY(clmul_test);
+
+#ifdef __AVX512F__ /* AVX512BW may get enabled only below */
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+# define lane_shr_unit(x) \
+    ((vec_t)B(palignr, _mask, (vdi_t)(x), (vdi_t)(x), 64, (vdi_t){}, \
+              0x00ff00ff00ff00ffULL & (~0ULL >> (64 - VEC_SIZE))))
+#else
+# if defined(__AVX2__) && VEC_SIZE == 32
+#  define to_bool(cmp) B(ptestc, , cmp, (vdi_t){} == 0)
+# else
+#  define to_bool(cmp) (__builtin_ia32_pmovmskb128(cmp) == 0xffff)
+# endif
+# define eq(x, y) to_bool((x) == (y))
+# define lane_shr_unit(x) ((vec_t)B(palignr, , (vdi_t){}, (vdi_t)(x), 64))
+#endif
+
+#define CLMUL(op, x, y, c) (vec_t)(__builtin_ia32_ ## op((vdi_t)(x), (vdi_t)(y), c))
+
+#if VEC_SIZE == 16
+# define clmul(x, y, c) CLMUL(pclmulqdq128, x, y, c)
+# define vpshrd __builtin_ia32_vpshrd_v2di
+#elif VEC_SIZE == 32
+# define clmul(x, y, c) CLMUL(vpclmulqdq_v4di, x, y, c)
+# define vpshrd __builtin_ia32_vpshrd_v4di
+#elif VEC_SIZE == 64
+# define clmul(x, y, c) CLMUL(vpclmulqdq_v8di, x, y, c)
+# define vpshrd __builtin_ia32_vpshrd_v8di
+#endif
+
+#define clmul_ll(x, y) clmul(x, y, 0x00)
+#define clmul_hl(x, y) clmul(x, y, 0x01)
+#define clmul_lh(x, y) clmul(x, y, 0x10)
+#define clmul_hh(x, y) clmul(x, y, 0x11)
+
+#if defined(__AVX512VBMI2__)
+# pragma GCC target ( "avx512bw" )
+# define lane_shr_i(x, n) ({ \
+    vec_t h_ = lane_shr_unit(x); \
+    touch(h_); \
+    (n) < 64 ? (vec_t)vpshrd((vdi_t)(x), (vdi_t)(h_), n) : h_ >> ((n) - 64); \
+})
+# define lane_shr_v(x, n) ({ \
+    vec_t t_ = (x), h_ = lane_shr_unit(x); \
+    typeof(t_[0]) n_ = (n); \
+    if ( (n) < 64 ) \
+        /* gcc does not support embedded broadcast */ \
+        asm ( "vpshrdvq %2%{1to%c3%}, %1, %0" \
+              : "+v" (t_) : "v" (h_), "m" (n_), "i" (ELEM_COUNT) ); \
+    else \
+        t_ = h_ >> ((n) - 64); \
+    t_; \
+})
+#else
+# define lane_shr_i lane_shr_v
+# define lane_shr_v(x, n) ({ \
+    vec_t t_ = (n) > 0 ? lane_shr_unit(x) : (x); \
+    (n) < 64 ? ((x) >> (n)) | (t_ << (-(n) & 0x3f)) \
+             : t_ >> ((n) - 64); \
+})
+#endif
+
+int clmul_test(void)
+{
+    unsigned int i;
+    vec_t src;
+    vqi_t raw = {};
+
+    for ( i = 1; i < VEC_SIZE; ++i )
+        raw[i] = i;
+    src = (vec_t)raw;
+
+    for ( i = 0; i < 256; i += VEC_SIZE )
+    {
+        vec_t x = {}, y, z, lo, hi;
+        unsigned int j;
+
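+        /* Carry-less multiplication by an all-zeroes vector must yield zero. */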
+        touch(x);
+        y = clmul_ll(src, x);
+        touch(x);
+
+        if ( !eq(y, x) ) return __LINE__;
+
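+        /* A low-qword multiplier of 1 must reproduce the source qwords. */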
+        for ( j = 0; j < ELEM_COUNT; j += 2 )
+            x[j] = 1;
+
+        touch(src);
+        y = clmul_ll(x, src);
+        touch(src);
+        z = clmul_lh(x, src);
+        touch(src);
+
+        for ( j = 0; j < ELEM_COUNT; j += 2 )
+            y[j + 1] = z[j];
+
+        if ( !eq(y, src) ) return __LINE__;
+
+        /*
+         * Besides the obvious property of the low and high half products
+         * being the same either direction, the "square" of a number has the
+         * property of simply being the original bit pattern with a zero bit
+         * inserted between any two bits (e.g. the carry-less square of
+         * 0b1011 is 0b1000101). This is what the code below checks.
+         */
+
+        x = src;
+        touch(src);
+        y = clmul_lh(x, src);
+        touch(src);
+        z = clmul_hl(x, src);
+
+        if ( !eq(y, z) ) return __LINE__;
+
+        touch(src);
+        y = lo = clmul_ll(x, src);
+        touch(src);
+        z = hi = clmul_hh(x, src);
+        touch(src);
+
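+        /*
+         * Walk the "squares" two result bits at a time; each pair encodes
+         * a single bit of the original source pattern.
+         */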
+        for ( j = 0; j < 64; ++j )
+        {
+            vec_t l = lane_shr_v(lo, 2 * j);
+            vec_t h = lane_shr_v(hi, 2 * j);
+            unsigned int n;
+
+            if ( !eq(l, y) ) return __LINE__;
+            if ( !eq(h, z) ) return __LINE__;
+
+            x = src >> j;
+
+            for ( n = 0; n < ELEM_COUNT; n += 2 )
+            {
+                if ( (x[n + 0] & 1) != (l[n] & 3) ) return __LINE__;
+                if ( (x[n + 1] & 1) != (h[n] & 3) ) return __LINE__;
+            }
+
+            touch(y);
+            y = lane_shr_i(y, 2);
+            touch(z);
+            z = lane_shr_i(z, 2);
+        }
+
+        src += 0x0101010101010101ULL * VEC_SIZE;
+    }
+
+    return 0;
+}
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -381,6 +381,7 @@ OVR(movntdq);
 OVR(movntdqa);
 OVR(movshdup);
 OVR(movsldup);
+OVR(pclmulqdq);
 OVR(permd);
 OVR(permq);
 OVR(pmovsxbd);
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -13,16 +13,19 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "sse2.h"
 #include "sse2-gf.h"
 #include "ssse3-aes.h"
+#include "ssse3-pclmul.h"
 #include "sse4.h"
 #include "sse4-sha.h"
 #include "avx.h"
 #include "avx-aes.h"
+#include "avx-pclmul.h"
 #include "avx-sha.h"
 #include "fma4.h"
 #include "fma.h"
 #include "avx2.h"
 #include "avx2-sg.h"
 #include "avx2-vaes.h"
+#include "avx2-vpclmulqdq.h"
 #include "avx2-gf.h"
 #include "xop.h"
 #include "avx512f-opmask.h"
@@ -34,10 +37,12 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512vl-sg.h"
 #include "avx512bw.h"
 #include "avx512bw-vaes.h"
+#include "avx512bw-vpclmulqdq.h"
 #include "avx512bw-gf.h"
 #include "avx512dq.h"
 #include "avx512er.h"
 #include "avx512vbmi.h"
+#include "avx512vbmi2-vpclmulqdq.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -108,6 +113,16 @@ static bool simd_check_avx_aes(void)
     return cpu_has_aesni && cpu_has_avx;
 }
 
+static bool simd_check_ssse3_pclmul(void)
+{
+    return cpu_has_pclmulqdq && cpu_has_ssse3;
+}
+
+static bool simd_check_avx_pclmul(void)
+{
+    return cpu_has_pclmulqdq && cpu_has_avx;
+}
+
 static bool simd_check_avx512f(void)
 {
     return cpu_has_avx512f;
@@ -189,6 +204,31 @@ static bool simd_check_avx512bw_vaes_vl(
            cpu_has_avx512bw && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx2_vpclmulqdq(void)
+{
+    return cpu_has_vpclmulqdq && cpu_has_avx2;
+}
+
+static bool simd_check_avx512bw_vpclmulqdq(void)
+{
+    return cpu_has_vpclmulqdq && cpu_has_avx512bw;
+}
+
+static bool simd_check_avx512bw_vpclmulqdq_vl(void)
+{
+    return cpu_has_vpclmulqdq && cpu_has_avx512bw && cpu_has_avx512vl;
+}
+
+static bool simd_check_avx512vbmi2_vpclmulqdq(void)
+{
+    return cpu_has_avx512_vbmi2 && simd_check_avx512bw_vpclmulqdq();
+}
+
+static bool simd_check_avx512vbmi2_vpclmulqdq_vl(void)
+{
+    return cpu_has_avx512_vbmi2 && simd_check_avx512bw_vpclmulqdq_vl();
+}
+
 static bool simd_check_sse2_gf(void)
 {
     return cpu_has_gfni && cpu_has_sse2;
@@ -369,6 +409,8 @@ static const struct {
     SIMD(XOP i64x4,               xop,      32i8),
     SIMD(AES (legacy),      ssse3_aes,        16),
     SIMD(AES (VEX/x16),       avx_aes,        16),
+    SIMD(PCLMUL (legacy), ssse3_pclmul,       16),
+    SIMD(PCLMUL (VEX/x2),  avx_pclmul,        16),
     SIMD(OPMASK/w,     avx512f_opmask,         2),
     SIMD(OPMASK+DQ/b, avx512dq_opmask,         1),
     SIMD(OPMASK+DQ/w, avx512dq_opmask,         2),
@@ -475,6 +517,13 @@ static const struct {
     SIMD(VAES (EVEX/x64), avx512bw_vaes,      64),
     AVX512VL(VL+VAES (x16), avx512bw_vaes,    16),
     AVX512VL(VL+VAES (x32), avx512bw_vaes,    32),
+    SIMD(VPCLMUL (VEX/x4), avx2_vpclmulqdq,  32),
+    SIMD(VPCLMUL (EVEX/x8), avx512bw_vpclmulqdq, 64),
+    AVX512VL(VL+VPCLMUL (x4), avx512bw_vpclmulqdq, 16),
+    AVX512VL(VL+VPCLMUL (x8), avx512bw_vpclmulqdq, 32),
+    SIMD(AVX512_VBMI2+VPCLMUL (x8), avx512vbmi2_vpclmulqdq, 64),
+    AVX512VL(_VBMI2+VL+VPCLMUL (x2), avx512vbmi2_vpclmulqdq, 16),
+    AVX512VL(_VBMI2+VL+VPCLMUL (x4), avx512vbmi2_vpclmulqdq, 32),
     SIMD(GFNI (legacy),       sse2_gf,        16),
     SIMD(GFNI (VEX/x16),      avx2_gf,        16),
     SIMD(GFNI (VEX/x32),      avx2_gf,        32),
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -125,6 +125,7 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_sse        cp.basic.sse
 #define cpu_has_sse2       cp.basic.sse2
 #define cpu_has_sse3       cp.basic.sse3
+#define cpu_has_pclmulqdq  cp.basic.pclmulqdq
 #define cpu_has_ssse3      cp.basic.ssse3
 #define cpu_has_fma       (cp.basic.fma && xcr0_mask(6))
 #define cpu_has_sse4_1     cp.basic.sse4_1





* Re: [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2019-03-14 17:15       ` Jan Beulich
@ 2019-03-15 16:39         ` Andrew Cooper
  2019-03-18  9:45           ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-03-15 16:39 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 14/03/2019 17:15, Jan Beulich wrote:
>>>> On 14.03.19 at 17:38, <andrew.cooper3@citrix.com> wrote:
>> On 19/12/2018 14:38, Jan Beulich wrote:
>>> @@ -8444,6 +8465,45 @@ x86_emulate(
>>>          generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
>>>          goto simd_0f_avx2;
>>>  
>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
>>> +        host_and_vcpu_must_have(avx512bw);
>>> +        generate_exception_if(evex.w || evex.brs, EXC_UD);
>>> +        op_bytes = elem_bytes = 1 << (b & 1);
>>> +        /* See the comment at the avx512_broadcast label. */
>>> +        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
>>> +        goto avx512f_no_sae;
>>> +
>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
>>> +        host_and_vcpu_must_have(avx512bw);
>>> +        generate_exception_if(evex.w, EXC_UD);
>>> +        /* fall through */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
>>> +        generate_exception_if((ea.type != OP_REG || evex.brs ||
>>> +                               evex.reg != 0xf || !evex.RX),
>>> +                              EXC_UD);
>> generate_exception_if(ea.type != OP_REG || evex.brs ||
>>                       evex.reg != 0xf || !evex.RX, EXC_UD);
>>
>> ?
> Well, no - I don't really want the second argument on a continued
> line of the first one.

Why not?  We do this elsewhere, and it doesn't help readability to split
them.

>  Multiple full arguments on one line are fine
> with me. If you want me to drop just the inner parentheses, that
> would be fine (as mentioned in another context on this series, I've
> mainly added them because of what I understood your editor's
> behavior is when splitting lines like this, and I may have easily
> misunderstood).

It's only when your preferred indentation deviates from BSD style.  If
there are already brackets around, the chances are it's fine, because the
align-with-brackets rule takes over.

The main issue is with return statements, where BSD style would require
the continuation of the statement to start under the second r of return.

A second more problematic issue is

exp ? one
    : two

because all indentation styles I'm aware of would move the second line
to the left.  This one can't be fixed in any reasonable way by adding
brackets if there were previously none, because even if it was
bracketed, the else case would still move left.

>
>> Otherwise, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Let me know on what variant(s) of the above this holds.

I'd prefer not to split EXC_UD onto a line of its own, because I don't
think it helps the readability of the surrounding code, but I'm not
sufficiently bothered to make it a hard requirement.

~Andrew


* Re: [PATCH v8 01/50] x86emul: no need to set fault_suppression to false for VMOVNT*
  2019-03-15 10:36   ` [PATCH v8 01/50] x86emul: no need to set fault_suppression to false for VMOVNT* Jan Beulich
@ 2019-03-15 16:52     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-03-15 16:52 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:36, Jan Beulich wrote:
> When evex.opmsk is required to be zero there's no need for this, as it
> won't have been set to true in the first place.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 02/50] x86emul: support AVX512{F, BW, DQ} extract insns
  2019-03-15 10:36   ` [PATCH v8 02/50] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
@ 2019-03-15 17:51     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-03-15 17:51 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:36, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 07/50] x86emul: support AVX512{F, BW} zero- and sign-extending moves
  2019-03-15 10:40   ` [PATCH v8 07/50] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
@ 2019-03-15 18:02     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-03-15 18:02 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:40, Jan Beulich wrote:
> Note that the testing in simd.c doesn't really follow the ISA extension
> pattern - to fit the scheme, extensions from byte and word granular
> vectors can (currently) sensibly only happen in the AVX512BW case (and
> hence respective abstraction macros will be added there rather than
> here).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 08/50] x86emul: support AVX512{F, BW} down conversion moves
  2019-03-15 10:40   ` [PATCH v8 08/50] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
@ 2019-03-15 18:10     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-03-15 18:10 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:40, Jan Beulich wrote:
> Note that the vpmov{,s,us}{d,q}w table entries in evex-disp8.c are
> slightly different from what one would expect, due to them requiring
> EVEX.W to be zero.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 09/50] x86emul: support AVX512{F, BW} integer unpack insns
  2019-03-15 10:41   ` [PATCH v8 09/50] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
@ 2019-03-15 18:21     ` Andrew Cooper
  2019-03-18  9:55       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-03-15 18:21 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:41, Jan Beulich wrote:
> @@ -6681,6 +6681,12 @@ x86_emulate(
>      case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm */
>          generate_exception_if(evex.opmsk, EXC_UD);
>          /* fall through */
> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
> +        op_bytes = 16 << evex.lr;
> +        /* fall through */

If this setting of op_bytes is safe to do for vpsadbw, how does the
emulation currently work?

~Andrew

>      case X86EMUL_OPC_EVEX_66(0x0f, 0xd1): /* vpsrlw xmm/m128,[xyz]mm,[xyz]mm{k} */
>      case X86EMUL_OPC_EVEX_66(0x0f, 0xe1): /* vpsraw xmm/m128,[xyz]mm,[xyz]mm{k} */
>      case X86EMUL_OPC_EVEX_66(0x0f, 0xf1): /* vpsllw xmm/m128,[xyz]mm,[xyz]mm{k} */



* Re: [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns
  2019-03-15 16:39         ` Andrew Cooper
@ 2019-03-18  9:45           ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-03-18  9:45 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 15.03.19 at 17:39, <andrew.cooper3@citrix.com> wrote:
> On 14/03/2019 17:15, Jan Beulich wrote:
>>>>> On 14.03.19 at 17:38, <andrew.cooper3@citrix.com> wrote:
>>> On 19/12/2018 14:38, Jan Beulich wrote:
>>>> @@ -8444,6 +8465,45 @@ x86_emulate(
>>>>          generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
>>>>          goto simd_0f_avx2;
>>>>  
>>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,[xyz]mm{k} */
>>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x79): /* vpbroadcastw xmm/m16,[xyz]mm{k} */
>>>> +        host_and_vcpu_must_have(avx512bw);
>>>> +        generate_exception_if(evex.w || evex.brs, EXC_UD);
>>>> +        op_bytes = elem_bytes = 1 << (b & 1);
>>>> +        /* See the comment at the avx512_broadcast label. */
>>>> +        op_mask |= !(b & 1 ? !(uint32_t)op_mask : !op_mask);
>>>> +        goto avx512f_no_sae;
>>>> +
>>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7a): /* vpbroadcastb r32,[xyz]mm{k} */
>>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7b): /* vpbroadcastw r32,[xyz]mm{k} */
>>>> +        host_and_vcpu_must_have(avx512bw);
>>>> +        generate_exception_if(evex.w, EXC_UD);
>>>> +        /* fall through */
>>>> +    case X86EMUL_OPC_EVEX_66(0x0f38, 0x7c): /* vpbroadcast{d,q} reg,[xyz]mm{k} */
>>>> +        generate_exception_if((ea.type != OP_REG || evex.brs ||
>>>> +                               evex.reg != 0xf || !evex.RX),
>>>> +                              EXC_UD);
>>> generate_exception_if(ea.type != OP_REG || evex.brs ||
>>>                       evex.reg != 0xf || !evex.RX, EXC_UD);
>>>
>>> ?
>> Well, no - I don't really want the second argument on a continued
>> line of the first one.
> 
> Why not?  We do this elsewhere, and it doesn't help readability to split
> them.

Well, readability is always a subjective thing. To me, doing what
you suggest would result in an increased risk of not realizing that
there's a second argument at the end of the second line (granted
it's not that bad here, but it would get quite bad with more than two
arguments, and a mixture of splitting and adding further ones on
the continued arguments' lines).

>>  Multiple full arguments on one line are fine
>> with me. If you want me to drop just the inner parentheses, that
>> would be fine (as mentioned in another context on this series, I've
>> mainly added them because of what I understood your editor's
>> behavior is when splitting lines like this, and I may have easily
>> misunderstood).
> 
> It's only when your preferred indentation deviates from BSD style.  If
> there are already brackets around, the chances are it's fine, because the
> align-with-brackets rule takes over.
> 
> The main issue is with return statements, where BSD style would require
> the continuation of the statement to start under the second r of return.

Ah, I see. So in the case above the inner parentheses aren't
really necessary from the editor's pov. I still think they're useful
to make visually clear at the first glance the difference between
continued arguments and the start of further ones.

>>> Otherwise, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Let me know on what variant(s) of the above this holds.
> 
> I'd prefer not to split EXC_UD onto a line of its own, because I don't
> think it helps the readability of the surrounding code, but I'm not
> sufficiently bothered to make it a hard requirement.

Thanks.

Jan




* Re: [PATCH v8 09/50] x86emul: support AVX512{F, BW} integer unpack insns
  2019-03-15 18:21     ` Andrew Cooper
@ 2019-03-18  9:55       ` Jan Beulich
  2019-05-20 12:11         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-03-18  9:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 15.03.19 at 19:21, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:41, Jan Beulich wrote:
>> @@ -6681,6 +6681,12 @@ x86_emulate(
>>      case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm */
>>          generate_exception_if(evex.opmsk, EXC_UD);
>>          /* fall through */
>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>> +        op_bytes = 16 << evex.lr;
>> +        /* fall through */
> 
> If this setting of op_bytes is safe to do for vpsadbw, how does the
> emulation currently work?

The setting is redundant for VPSADBW (there it gets set by virtue
of its table entry saying simd_packed_int), but it's necessary for
VUNPCK* as their table entries use simd_other, which is necessary
because of the memory access pattern of PUNPCKL*. In fact the
PUNPCKH* entries could equally well use simd_packed_int, but
that would then call for their case labels to get moved away from
the PUNPCKL* ones, and I slightly prefer them to be kept together.

Jan




* Re: [PATCH v8 10/50] x86emul: support AVX512{F, BW, _VBMI} full permute insns
  2019-03-15 10:41   ` [PATCH v8 10/50] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
@ 2019-05-17 16:50     ` Andrew Cooper
  2019-05-17 16:50       ` [Xen-devel] " Andrew Cooper
  2019-05-20  6:55       ` Jan Beulich
  0 siblings, 2 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-05-17 16:50 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:41, Jan Beulich wrote:
> Take the liberty and also correct the (public interface) name of the
> AVX512_VBMI feature flag, on the assumption that no external consumer
> has actually been using that flag so far.

I've been giving this some thought, and I think putting these in the
public interface was a mistake.

They are a representation of other people's stable ABI, and are only used
in tools interfaces as far as Xen is concerned (SYSCTL_get_featureset,
DOMCTL_get_cpu_policy).

The only external representations are xen_cpuid_leaf_t/xen_msr_entry_t
which don't need constants to go with them.

Given the advent of libx86, I think the feature definitions ought to
move there, and be removed from Xen's public interface.  Along with
that, I see no reason to keep their variable-prefix-ness, which will
allow them to be searchable again.

>  Furthermore make it have
> AVX512BW instead of AVX512F as a prerequisite, for requiring full
> 64-bit mask registers (the upper 48 bits of which can't be accessed
> other than through XSAVE/XRSTOR without AVX512BW support).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

As for the rest, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> but
we perhaps want to sort out the position in the public interface before
changing AVX512VBMI.

~Andrew


* Re: [PATCH v8 11/50] x86emul: support AVX512{F, BW} integer shuffle insns
  2019-03-15 10:43   ` [PATCH v8 11/50] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
@ 2019-05-17 17:01     ` Andrew Cooper
  2019-05-17 17:01       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-17 17:01 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:43, Jan Beulich wrote:
> Also include vshuff{32x4,64x2} as being very similar to vshufi{32x4,64x2}.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 12/50] x86emul: support AVX512{BW, DQ} mask move insns
  2019-03-15 10:43   ` [PATCH v8 12/50] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
@ 2019-05-17 17:02     ` Andrew Cooper
  2019-05-17 17:02       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-17 17:02 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:43, Jan Beulich wrote:
> Entries to the tables in evex-disp8.c are added despite these insns not
> allowing for memory operands, with the goal of the tables giving a
> complete picture of the supported EVEX-encoded insns in the end.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 13/50] x86emul: basic AVX512BW testing
  2019-03-15 10:43   ` [PATCH v8 13/50] x86emul: basic AVX512BW testing Jan Beulich
@ 2019-05-17 17:03     ` Andrew Cooper
  2019-05-17 17:03       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-17 17:03 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:43, Jan Beulich wrote:
> Test various of the insns which have been implemented already.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 14/50] x86emul: basic AVX512DQ testing
  2019-03-15 10:44   ` [PATCH v8 14/50] x86emul: basic AVX512DQ testing Jan Beulich
@ 2019-05-17 17:03     ` Andrew Cooper
  2019-05-17 17:03       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-17 17:03 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:44, Jan Beulich wrote:
> Test various of the insns which have been implemented already.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 10/50] x86emul: support AVX512{F, BW, _VBMI} full permute insns
  2019-05-17 16:50     ` Andrew Cooper
  2019-05-17 16:50       ` [Xen-devel] " Andrew Cooper
@ 2019-05-20  6:55       ` Jan Beulich
  2019-05-20  6:55         ` [Xen-devel] " Jan Beulich
  2019-05-20 12:10         ` Andrew Cooper
  1 sibling, 2 replies; 465+ messages in thread
From: Jan Beulich @ 2019-05-20  6:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 17.05.19 at 18:50, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:41, Jan Beulich wrote:
>> Take the liberty and also correct the (public interface) name of the
>> AVX512_VBMI feature flag, on the assumption that no external consumer
>> has actually been using that flag so far.
> 
> I've been giving this some thought, and I think putting these in the
> public interface was a mistake.
> 
> They are a representation of other people's stable ABI, and are only used
> in tools interfaces as far as Xen is concerned (SYSCTL_get_featureset,
> DOMCTL_get_cpu_policy).
> 
> The only external representations are xen_cpuid_leaf_t/xen_msr_entry_t
> which don't need constants to go with them.

While I don't really mind removing them from the public interface
again, I think you're overlooking one point of why they've
been put there originally: We wanted them to also serve as
documentation of the forward compatible nature of the A/S/H
annotations, i.e. such that we wouldn't lightly remove something
that was previously exposed.

>>  Furthermore make it have
>> AVX512BW instead of AVX512F as a prerequisite, for requiring full
>> 64-bit mask registers (the upper 48 bits of which can't be accessed
>> other than through XSAVE/XRSTOR without AVX512BW support).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> As for the rest, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> but
> we perhaps want to sort out the position in the public interface before
> changing AVX512VBMI.

Thanks; I don't think the order of events matters much, but if this is
going to happen soon, I don't mind waiting (and re-basing).

There's an important point here, though, that you haven't said a word
on, despite my having mentioned it to you on irc already after you
had given your ack to "x86/CPUID: support leaf 7 subleaf 1 /
AVX512_BF16" making a similar dependency as the patch here upon
AVX512BW rather than AVX512F. Recall that I had to split off the
change below from another patch of mine, because of our
disagreement on the intended dependencies. If we make the sub-
features requiring wider than 16-bit mask registers dependent
upon AVX512BW (as suggested by the patch here), then I think
we also want to change the SSEn dependencies as done/suggested
in the patch below (albeit of course there's no dependency on
changed register width there, just one on permitted element types,
so I could also see a basis to accept the dependency here as is, but
view things differently for SSEn).

Seeing your further acks (thanks!), just as a side note: Are you
aware that you've skipped patch 9 in the series, without which the
ones you've acked can't go in anyway (i.e. regardless of whether
I'm to wait for the public interface adjustment above)?

Jan

x86/cpuid: correct dependencies of post-SSE ISA extensions

Move AESNI, PCLMULQDQ, and SHA to SSE2, as all of them act on vectors of
integers, whereas plain SSE supports vectors of single precision floats
only. This is in line with how e.g. binutils and gcc treat them.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v?: Re-base over split out PCLMULQDQ addition.
---
TBD: On the same basis, SSE3, SSSE3 and SSE4A should probably also
depend on SSE2 rather than SSE. In fact making this a chain SSE -> SSE2
-> SSE3 -> { SSSE3, SSE4A } would probably be best, and get us in line
with both binutils and gcc. But I think I did suggest so when the
dependencies were introduced, and this wasn't liked for a reason I
forgot.
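
Expressed in gen-cpuid.py terms, that chain would look something like
this (a sketch only - it goes beyond what the patch below actually does):

        SSE: [SSE2, MISALIGNSSE],
        SSE2: [LM, SSE3, AESNI, PCLMULQDQ, SHA],
        SSE3: [SSSE3, SSE4A, SSE4_1],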

--- unstable.orig/xen/tools/gen-cpuid.py
+++ unstable/xen/tools/gen-cpuid.py
@@ -196,11 +196,12 @@ def crunch_numbers(state):
         # instructions.  Several futher instruction sets are built on core
         # %XMM support, without specific inter-dependencies.  Additionally
         # AMD has a special mis-alignment sub-mode.
-        SSE: [SSE2, SSE3, SSSE3, SSE4A, MISALIGNSSE,
-              AESNI, PCLMULQDQ, SHA],
+        SSE: [SSE2, SSE3, SSSE3, SSE4A, MISALIGNSSE],
 
-        # SSE2 was re-specified as core instructions for 64bit.
-        SSE2: [LM],
+        # SSE2 was re-specified as core instructions for 64bit.  Also ISA
+        # extensions dealing with vectors of integers are added here rather
+        # than to SSE.
+        SSE2: [LM, AESNI, PCLMULQDQ, SHA],
 
         # SSE4.1 explicitly depends on SSE3 and SSSE3
         SSE3: [SSE4_1],



* Re: [PATCH v8 10/50] x86emul: support AVX512{F, BW, _VBMI} full permute insns
  2019-05-20  6:55       ` Jan Beulich
  2019-05-20  6:55         ` [Xen-devel] " Jan Beulich
@ 2019-05-20 12:10         ` Andrew Cooper
  2019-05-20 12:10           ` [Xen-devel] " Andrew Cooper
  1 sibling, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-20 12:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 20/05/2019 07:55, Jan Beulich wrote:
>>>> On 17.05.19 at 18:50, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 10:41, Jan Beulich wrote:
>>> Take the liberty and also correct the (public interface) name of the
>>> AVX512_VBMI feature flag, on the assumption that no external consumer
>>> has actually been using that flag so far.
>> I've been giving this some thought, and I think putting these in the
>> public interface was a mistake.
>>
>> They are a representation of other people's stable ABI, and are only used
>> in tools interfaces as far as Xen is concerned (SYSCTL_get_featureset,
>> DOMCTL_get_cpu_policy).
>>
>> The only external representations are xen_cpuid_leaf_t/xen_msr_entry_t
>> which don't need constants to go with them.
> While I don't really mind removing them from the public interface
> again, I think you're overlooking one point of why they've
> been put there originally: We wanted them to also serve as
> documentation of the forward compatible nature of the A/S/H
> annotations, i.e. such that we wouldn't lightly remove something
> that was previously exposed.

I don't recall that being a consideration, but the file doesn't need to
be in xen/public/ for us to apply general "don't change this lightly"
rules to the annotations.

I don't see it being different to other areas of backwards compatibility
we maintain in the main codebase.

>
>>>  Furthermore make it have
>>> AVX512BW instead of AVX512F as a prerequisite, for requiring full
>>> 64-bit mask registers (the upper 48 bits of which can't be accessed
>>> other than through XSAVE/XRSTOR without AVX512BW support).
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> As for the rest, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> but
>> we perhaps want to sort out the position in the public interface before
>> changing AVX512VBMI.
> Thanks; I don't think the order of events matters much, but if this is
> going to happen soon, I don't mind waiting (and re-basing).

I don't have any immediate plans - it's not on the critical path to
DOMCTL_set_cpu_policy work.

OTOH, it doesn't look like it would be hard to do, and it would make it
clear that we are fine to rename constants.  I can look into this if
you'd like.

> There's an important point here, though, that you haven't said a word
> on, despite my having mentioned it to you on irc already after you
> had given your ack to "x86/CPUID: support leaf 7 subleaf 1 /
> AVX512_BF16" making a similar dependency as the patch here upon
> AVX512BW rather than AVX512F. Recall that I had to split off the
> change below from another patch of mine, because of our
> disagreement on the intended dependencies. If we make the sub-
> features requiring wider than 16-bit mask registers dependent
> upon AVX512BW (as suggested by the patch here), then I think
> we also want to change the SSEn dependencies as done/suggested
> in the patch below (albeit of course there's no dependency on
> changed register width there, just one on permitted element types,
> so I could also see a basis to accept the dependency here as is, but
> view things differently for SSEn).

Reply below.

> Seeing your further acks (thanks!), just as a side note: Are you
> aware that you've skipped patch 9 in the series, without which the
> ones you've acked can't go in anyway (i.e. regardless of whether
> I'm to wait for the public interface adjustment above)?

Oops - let me go and see.

>
> Jan
>
> x86/cpuid: correct dependencies of post-SSE ISA extensions
>
> Move AESNI, PCLMULQDQ, and SHA to SSE2, as all of them act on vectors of
> integers, whereas plain SSE supports vectors of single precision floats
> only. This is in line with how e.g. binutils and gcc treat them.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

This isn't worth the effort that's been spent arguing over it, but I'd request
s/correct/adjust/ in the subject as "correctness" here is subjective.

> ---
> v?: Re-base over split out PCLMULQDQ addition.
> ---
> TBD: On the same basis, SSE3, SSSE3 and SSE4A should probably also
> depend on SSE2 rather than SSE. In fact making this a chain SSE -> SSE2
> -> SSE3 -> { SSSE3, SSE4A } would probably be best, and get us in line
> with both binutils and gcc. But I think I did suggest so when the
> dependencies were introduced, and this wasn't liked for a reason I
> forgot.

The code is only as it is because you insisted upon it being this way. 
I'll remind you that my first submission looked rather closer to this.

~Andrew


* Re: [PATCH v8 09/50] x86emul: support AVX512{F, BW} integer unpack insns
  2019-03-18  9:55       ` Jan Beulich
@ 2019-05-20 12:11         ` Andrew Cooper
  2019-05-20 12:11           ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-20 12:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 18/03/2019 09:55, Jan Beulich wrote:
>>>> On 15.03.19 at 19:21, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 10:41, Jan Beulich wrote:
>>> @@ -6681,6 +6681,12 @@ x86_emulate(
>>>      case X86EMUL_OPC_EVEX_66(0x0f, 0xf6): /* vpsadbw [xyz]mm/mem,[xyz]mm,[xyz]mm */
>>>          generate_exception_if(evex.opmsk, EXC_UD);
>>>          /* fall through */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x60): /* vpunpcklbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x61): /* vpunpcklwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x68): /* vpunpckhbw [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f, 0x69): /* vpunpckhwd [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>>> +        op_bytes = 16 << evex.lr;
>>> +        /* fall through */
>> If this setting of op_bytes is safe to do for vpsadbw, how does the
>> emulation currently work?
> The setting is redundant for VPSADBW (there it gets set by virtue
> of its table entry saying simd_packed_int), but it's necessary for
> VUNPCK* as their table entries use simd_other, which is necessary
> because of the memory access pattern of PUNPCKL*. In fact the
> PUNPCKH* entries could equally well use simd_packed_int, but
> that would then call for their case labels to get moved away from
> the PUNPCKL* ones, and I slightly prefer them to be kept together.

Ok.

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 15/50] x86emul: support AVX512F move high/low insns
  2019-03-15 10:44   ` [PATCH v8 15/50] x86emul: support AVX512F move high/low insns Jan Beulich
@ 2019-05-21 10:59     ` Andrew Cooper
  2019-05-21 10:59       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 10:59 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:44, Jan Beulich wrote:
> No explicit test harness additions other than the overrides, as the
> compiler already makes use of the insns.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 16/50] x86emul: support AVX512F move duplicate insns
  2019-03-15 10:45   ` [PATCH v8 16/50] x86emul: support AVX512F move duplicate insns Jan Beulich
@ 2019-05-21 11:09     ` Andrew Cooper
  2019-05-21 11:09       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 11:09 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:45, Jan Beulich wrote:
> Judging from insn prefixes, these are scalar insns, but their (memory)
> operands are vector ones (with the exception of 128-bit VMOVDDUP). For
> this, some adjustments to the disp8scale calculation code are needed.
>
> No explicit test harness additions other than the overrides, as the
> compiler already makes use of the insns.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 17/50] x86emul: support AVX512{F, BW, _VBMI} permute insns
  2019-03-15 10:46   ` [PATCH v8 17/50] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
@ 2019-05-21 11:24     ` Andrew Cooper
  2019-05-21 11:24       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 11:24 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:46, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 18/50] x86emul: support AVX512BW pack insns
  2019-03-15 10:46   ` [PATCH v8 18/50] x86emul: support AVX512BW pack insns Jan Beulich
@ 2019-05-21 11:26     ` Andrew Cooper
  2019-05-21 11:26       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 11:26 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:46, Jan Beulich wrote:
> No further test harness additions - what is there is good enough for
> these rather "regular" insns.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 19/50] x86emul: support AVX512F floating-point conversion insns
  2019-03-15 10:47   ` [PATCH v8 19/50] x86emul: support AVX512F floating-point conversion insns Jan Beulich
@ 2019-05-21 11:33     ` Andrew Cooper
  2019-05-21 11:33       ` [Xen-devel] " Andrew Cooper
  2019-05-21 15:46       ` Jan Beulich
  0 siblings, 2 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 11:33 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:47, Jan Beulich wrote:
> @@ -9312,7 +9386,8 @@ x86_emulate(
>  
>          if ( ea.type == OP_MEM )
>          {
> -            rc = ops->write(ea.mem.seg, ea.mem.off, mmvalp, 8 << vex.l, ctxt);
> +            rc = ops->write(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
> +                            (void *)mmvalp + first_byte, op_bytes, ctxt);
>              if ( rc != X86EMUL_OKAY )
>              {
>                  asm volatile ( "ldmxcsr %0" :: "m" (mxcsr) );

This hunk doesn't appear to fit with the rest of the patch, because it
isn't the first use of first_byte.

Have we been subtly broken before?

~Andrew


* Re: [PATCH v8 20/50] x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns
  2019-03-15 10:47   ` [PATCH v8 20/50] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
@ 2019-05-21 11:37     ` Andrew Cooper
  2019-05-21 11:37       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 11:37 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:47, Jan Beulich wrote:
> ... including the two AVX512DQ forms which shared encodings, just with
> EVEX.W set there.
>
> VCVTDQ2PD, sharing its main opcode with others, needs a "manual"
> override of disp8scale.
>
> The simd_size changes for the twobyte_table[] entries are benign to
> pre-existing code, but allow decode_disp8scale() to work as is here.
>
> The placement of the 0xe6 case block, wrong at this point, is once
> again in anticipation of further additions of case labels.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 21/50] x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns
  2019-03-15 10:52   ` [PATCH v8 21/50] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
@ 2019-05-21 11:44     ` Andrew Cooper
  2019-05-21 11:44       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 11:44 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:52, Jan Beulich wrote:
> VCVT{,T}S{S,D}2SI use EVEX.W for their destination (register) rather
> than their (possibly memory) source operand size and hence need a
> "manual" override of disp8scale.
>
> While the SDM claims that EVEX.L'L needs to be zero for the 32-bit forms
> of VCVT{,U}SI2SD (exception type E10NF), observations on my test system
> do not confirm this (and I've got informal confirmation that this is a
> doc mistake). Nevertheless, to be on the safe side, force evex.lr to be
> zero in this case though when constructing the stub.
>
> Slightly adjust the scalar to_int() in the test harness, to increase the
> chances of the operand ending up in memory.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 22/50] x86emul: support AVX512DQ packed quad-int/FP conversion insns
  2019-03-15 10:52   ` [PATCH v8 22/50] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
@ 2019-05-21 11:53     ` Andrew Cooper
  2019-05-21 11:53       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 11:53 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:52, Jan Beulich wrote:
> VCVT{,T}PS2QQ, sharing their main opcodes with others, once again need
> "manual" overrides of disp8scale.
>
> While not directly related here, also add a scalar variant of to_wint()
> to the test harness.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 23/50] x86emul: support AVX512{F, DQ} uint-to-FP conversion insns
  2019-03-15 10:53   ` [PATCH v8 23/50] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
@ 2019-05-21 11:58     ` Andrew Cooper
  2019-05-21 11:58       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 11:58 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:53, Jan Beulich wrote:
> Some "manual" overrides of disp8scale are needed here again. In
> particular code ends up simpler when using d8s_dq64 in the
> twobyte_table[] entry.
>
> Test harness additions will be done once the reverse conversions are
> also available.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 24/50] x86emul: support AVX512{F, DQ} FP-to-uint conversion insns
  2019-03-15 10:54   ` [PATCH v8 24/50] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
@ 2019-05-21 12:09     ` Andrew Cooper
  2019-05-21 12:09       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 12:09 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:54, Jan Beulich wrote:
> Along the lines of prior patches, VCVT{,T}PS2UQQ as well as
> VCVT{,T}S{S,D}2USI need "manual" overrides of disp8scale.
>
> The twobyte_table[] entries get altered, with their prior values
> now put in place in x86_decode_twobyte().
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 25/50] x86emul: support remaining AVX512F legacy-equivalent insns
  2019-03-15 10:54   ` [PATCH v8 25/50] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
@ 2019-05-21 13:06     ` Andrew Cooper
  2019-05-21 13:06       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 13:06 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:54, Jan Beulich wrote:
> Plus their AVX512BW counterparts.
>
> Take the opportunity and also eliminate a pair of open coded instances
> of scalar_1op().
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 26/50] x86emul: support remaining AVX512BW legacy-equivalent insns
  2019-03-15 10:54   ` [PATCH v8 26/50] x86emul: support remaining AVX512BW " Jan Beulich
@ 2019-05-21 13:08     ` Andrew Cooper
  2019-05-21 13:08       ` [Xen-devel] " Andrew Cooper
  2019-05-21 13:34       ` Jan Beulich
  2019-05-23 16:10     ` Andrew Cooper
  1 sibling, 2 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-05-21 13:08 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:54, Jan Beulich wrote:
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -435,7 +435,10 @@ static const struct ext0f38_table {
>      disp8scale_t d8s:4;
>  } ext0f38_table[256] = {
>      [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
> -    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
> +    [0x01 ... 0x03] = { .simd_size = simd_packed_int },
> +    [0x04] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
> +    [0x05 ... 0x0b] = { .simd_size = simd_packed_int },
> +    [0x0b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },

It doesn't look as if you mean 0x0b twice here, although it's quite
possible that GCC elides it silently (Clang definitely won't).

~Andrew


* Re: [PATCH v8 26/50] x86emul: support remaining AVX512BW legacy-equivalent insns
  2019-05-21 13:08     ` Andrew Cooper
  2019-05-21 13:08       ` [Xen-devel] " Andrew Cooper
@ 2019-05-21 13:34       ` Jan Beulich
  2019-05-21 13:34         ` [Xen-devel] " Jan Beulich
  1 sibling, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-05-21 13:34 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 21.05.19 at 15:08, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:54, Jan Beulich wrote:
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -435,7 +435,10 @@ static const struct ext0f38_table {
>>      disp8scale_t d8s:4;
>>  } ext0f38_table[256] = {
>>      [0x00] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
>> -    [0x01 ... 0x0b] = { .simd_size = simd_packed_int },
>> +    [0x01 ... 0x03] = { .simd_size = simd_packed_int },
>> +    [0x04] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
>> +    [0x05 ... 0x0b] = { .simd_size = simd_packed_int },
>> +    [0x0b] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
> 
> It doesn't look as if you mean 0x0b twice here, although its quite
> possible that GCC elides it silently (Clang definitely won't).

Indeed, I failed to edit the upper bound of the range in question -
with gcc the last initializer for an array slot wins. Fixed for v9.
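
A standalone illustration of that gcc behaviour (a sketch of mine, not from
the patch): gcc diagnoses the override only with -Woverride-init (enabled by
-Wextra), whereas Clang, as noted above, warns about it by default.

#include <stdio.h>

/* Range designators ([a ... b]) are a GNU extension, as used in
 * ext0f38_table[].  Slot 2 is initialized twice; the last one wins. */
static const int tbl[4] = {
    [0 ... 2] = 1,
    [2] = 7,
};

int main(void)
{
    printf("%d %d %d\n", tbl[0], tbl[1], tbl[2]); /* prints: 1 1 7 */
    return 0;
}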

Jan




* Re: [PATCH v8 19/50] x86emul: support AVX512F floating-point conversion insns
  2019-05-21 11:33     ` Andrew Cooper
  2019-05-21 11:33       ` [Xen-devel] " Andrew Cooper
@ 2019-05-21 15:46       ` Jan Beulich
  2019-05-21 15:46         ` [Xen-devel] " Jan Beulich
  2019-05-23 16:08         ` Andrew Cooper
  1 sibling, 2 replies; 465+ messages in thread
From: Jan Beulich @ 2019-05-21 15:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 21.05.19 at 13:33, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:47, Jan Beulich wrote:
>> @@ -9312,7 +9386,8 @@ x86_emulate(
>>  
>>          if ( ea.type == OP_MEM )
>>          {
>> -            rc = ops->write(ea.mem.seg, ea.mem.off, mmvalp, 8 << vex.l, ctxt);
>> +            rc = ops->write(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
>> +                            (void *)mmvalp + first_byte, op_bytes, ctxt);
>>              if ( rc != X86EMUL_OKAY )
>>              {
>>                  asm volatile ( "ldmxcsr %0" :: "m" (mxcsr) );
> 
> This hunk doesn't appear to fit with the rest of the patch, because it
> isn't the first use of first_byte.
> 
> Have we been subtly broken before?

I don't think so, no, but I admit I'm not sure I understand what
you're saying above. The use of first_byte here is of course not
the first use - it gets set in the hunk further up. The AVX form of
VCVTPS2PH does not support fault suppression (as that's an
AVX512 feature), and hence no such adjustment was needed
here before.
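
For reference, a minimal sketch of the adjustment being discussed, with
hypothetical names rather than the emulator's actual interface: under fault
suppression the leading masked-off bytes are skipped on both sides of the
copy, which is what pairing ea.mem.off + first_byte with mmvalp + first_byte
achieves above.

#include <string.h>

/* Hypothetical helper: 'first_byte' counts the leading bytes suppressed
 * by the opmask, 'op_bytes' the bytes actually written.  Destination
 * offset and result buffer advance by the same amount. */
void masked_write_back(unsigned char *mem, const unsigned char *buf,
                       unsigned int first_byte, unsigned int op_bytes)
{
    memcpy(mem + first_byte, buf + first_byte, op_bytes);
}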

Jan




* Re: [PATCH v8 19/50] x86emul: support AVX512F floating-point conversion insns
  2019-05-21 15:46       ` Jan Beulich
  2019-05-21 15:46         ` [Xen-devel] " Jan Beulich
@ 2019-05-23 16:08         ` Andrew Cooper
  2019-05-23 16:08           ` [Xen-devel] " Andrew Cooper
  1 sibling, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-23 16:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 21/05/2019 16:46, Jan Beulich wrote:
>>>> On 21.05.19 at 13:33, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 10:47, Jan Beulich wrote:
>>> @@ -9312,7 +9386,8 @@ x86_emulate(
>>>  
>>>          if ( ea.type == OP_MEM )
>>>          {
>>> -            rc = ops->write(ea.mem.seg, ea.mem.off, mmvalp, 8 << vex.l, ctxt);
>>> +            rc = ops->write(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
>>> +                            (void *)mmvalp + first_byte, op_bytes, ctxt);
>>>              if ( rc != X86EMUL_OKAY )
>>>              {
>>>                  asm volatile ( "ldmxcsr %0" :: "m" (mxcsr) );
>> This hunk doesn't appear to fit with the rest of the patch, because it
>> isn't the first use of first_byte.
>>
>> Have we been subtly broken before?
> I don't think so, no, but I admit I'm not sure I understand what
> you're saying above. The use of first_byte here is of course not
> the first use - it gets set in the hunk further up. The AVX form of
> VCVTPS2PH does not support fault suppression (as that's an
> AVX512 feature), and hence no such adjustment was needed
> here before.

Ah - what I hadn't spotted was that this is the special case for
vcvtps2ph, so this change is fine in context.

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v8 26/50] x86emul: support remaining AVX512BW legacy-equivalent insns
  2019-03-15 10:54   ` [PATCH v8 26/50] x86emul: support remaining AVX512BW " Jan Beulich
  2019-05-21 13:08     ` Andrew Cooper
@ 2019-05-23 16:10     ` Andrew Cooper
  2019-05-23 16:10       ` [Xen-devel] " Andrew Cooper
  1 sibling, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-23 16:10 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:54, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

With the identified issue fixed, Acked-by: Andrew Cooper
<andrew.cooper3@citrix.com>


* Re: [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns
  2019-03-15 10:55   ` [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
@ 2019-05-23 16:15     ` Andrew Cooper
  2019-05-23 16:15       ` [Xen-devel] " Andrew Cooper
  2019-05-24  6:43       ` Jan Beulich
  0 siblings, 2 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-05-23 16:15 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:55, Jan Beulich wrote:
> Also include the only other AVX512ER insn pair, VEXP2P{D,S}.
>
> Note that despite the replacement of the SHA insns' table slots there's
> no need to special case their decoding: Their insn-specific code already
> sets op_bytes (as was required due to simd_other), and TwoOp is of no
> relevance for legacy encoded SIMD insns.
>
> The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
> to be on the safe side. The SDM does not clarify behavior there, and
> it's even more ambiguous here (without AVX512VL in the picture).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Seeing as I have some ER hardware, is there an easy way to get
GCC/binutils to emit a weird L'L field, or will this involve some manual
opcode generation to test?

~Andrew


* Re: [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns
  2019-05-23 16:15     ` Andrew Cooper
  2019-05-23 16:15       ` [Xen-devel] " Andrew Cooper
@ 2019-05-24  6:43       ` Jan Beulich
  2019-05-24  6:43         ` [Xen-devel] " Jan Beulich
  2019-05-24 20:48         ` Andrew Cooper
  1 sibling, 2 replies; 465+ messages in thread
From: Jan Beulich @ 2019-05-24  6:43 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 23.05.19 at 18:15, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:55, Jan Beulich wrote:
>> Also include the only other AVX512ER insn pair, VEXP2P{D,S}.
>>
>> Note that despite the replacement of the SHA insns' table slots there's
>> no need to special case their decoding: Their insn-specific code already
>> sets op_bytes (as was required due to simd_other), and TwoOp is of no
>> relevance for legacy encoded SIMD insns.
>>
>> The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
>> to be on the safe side. The SDM does not clarify behavior there, and
>> it's even more ambiguous here (without AVX512VL in the picture).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks, also for the others.

> Seeing as I have some ER hardware, is there an easy way to get
> GCC/binutils to emit a weird L'L field, or will this involve some manual
> opcode generation to test?

gcc does not provide any control at all, afaict. binutils allows "weird"
VEX.L or EVEX.L'L only for insns it believes ignore that field. So yes,
I'm afraid this will involve using .byte.

Jan




* Re: [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns
  2019-05-24  6:43       ` Jan Beulich
  2019-05-24  6:43         ` [Xen-devel] " Jan Beulich
@ 2019-05-24 20:48         ` Andrew Cooper
  2019-05-24 20:48           ` [Xen-devel] " Andrew Cooper
  2019-05-27  8:02           ` Jan Beulich
  1 sibling, 2 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-05-24 20:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 24/05/2019 07:43, Jan Beulich wrote:
>>>> On 23.05.19 at 18:15, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 10:55, Jan Beulich wrote:
>>> Also include the only other AVX512ER insn pair, VEXP2P{D,S}.
>>>
>>> Note that despite the replacement of the SHA insns' table slots there's
>>> no need to special case their decoding: Their insn-specific code already
>>> sets op_bytes (as was required due to simd_other), and TwoOp is of no
>>> relevance for legacy encoded SIMD insns.
>>>
>>> The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
>>> to be on the safe side. The SDM does not clarify behavior there, and
>>> it's even more ambiguous here (without AVX512VL in the picture).
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Thanks, also for the others.
>
>> Seeing as I have some ER hardware, is there an easy way to get
>> GCC/binutils to emit a weird L'L field, or will this involve some manual
>> opcode generation to test?
> gcc does not provide any control at all, afaict. binutils allows "weird"
> VEX.L or EVEX.L'L only for insns it believes ignore that field. So yes,
> I'm afraid this will involve using .byte.

Ok.  Given a test program of:

#include <stdio.h>

int main(void)
{
    printf("Real:\n");
    asm volatile ("vrcp14sd %xmm0,%xmm0,%xmm0");

    printf("Bytes:\n");    /* the same insn, hand-encoded; EVEX.L'L = 0 */
    asm volatile (".byte 0x62, 0xf2, 0xfd, 0x08, 0x4d, 0xc0");

    printf("Bad 0x28:\n"); /* EVEX.L'L = 1 */
    asm volatile (".byte 0x62, 0xf2, 0xfd, 0x28, 0x4d, 0xc0");

    printf("Bad 0x48:\n"); /* EVEX.L'L = 2 */
    asm volatile (".byte 0x62, 0xf2, 0xfd, 0x48, 0x4d, 0xc0");

    printf("Bad 0x68:\n"); /* EVEX.L'L = 3 */
    asm volatile (".byte 0x62, 0xf2, 0xfd, 0x68, 0x4d, 0xc0");

    return 0;
}

Then the L'L = 3 case (0x68 at the end) does indeed take #UD for both
KNL and KNM.

~Andrew


* Re: [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns
  2019-05-24 20:48         ` Andrew Cooper
  2019-05-24 20:48           ` [Xen-devel] " Andrew Cooper
@ 2019-05-27  8:02           ` Jan Beulich
  2019-05-27  8:02             ` [Xen-devel] " Jan Beulich
  2019-05-29 10:00             ` Andrew Cooper
  1 sibling, 2 replies; 465+ messages in thread
From: Jan Beulich @ 2019-05-27  8:02 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 24.05.19 at 22:48, <andrew.cooper3@citrix.com> wrote:
> On 24/05/2019 07:43, Jan Beulich wrote:
>>>>> On 23.05.19 at 18:15, <andrew.cooper3@citrix.com> wrote:
>>> On 15/03/2019 10:55, Jan Beulich wrote:
>>>> Also include the only other AVX512ER insn pair, VEXP2P{D,S}.
>>>>
>>>> Note that despite the replacement of the SHA insns' table slots there's
>>>> no need to special case their decoding: Their insn-specific code already
>>>> sets op_bytes (as was required due to simd_other), and TwoOp is of no
>>>> relevance for legacy encoded SIMD insns.
>>>>
>>>> The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
>>>> to be on the safe side. The SDM does not clarify behavior there, and
>>>> it's even more ambiguous here (without AVX512VL in the picture).
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Thanks, also for the others.
>>
>>> Seeing as I have some ER hardware, is there an easy way to get
>>> GCC/binutils to emit a weird L'L field, or will this involve some manual
>>> opcode generation to test?
>> gcc does not provide any control at all, afaict. binutils allows "weird"
>> VEX.L or EVEX.L'L only for insns it believes ignore that field. So yes,
>> I'm afraid this will involve using .byte.
> 
> Ok.  Given a test program of:
> 
> {
> printf("Real:\n");
> asm volatile ("vrcp14sd %xmm0,%xmm0,%xmm0");
> 
> printf("Bytes:\n");
> asm volatile (".byte 0x62, 0xf2, 0xfd, 0x08, 0x4d, 0xc0");
> 
> printf("Bad 0x28:\n");
> asm volatile (".byte 0x62, 0xf2, 0xfd, 0x28, 0x4d, 0xc0");
> 
> printf("Bad 0x48:\n");
> asm volatile (".byte 0x62, 0xf2, 0xfd, 0x48, 0x4d, 0xc0");
> 
> printf("Bad 0x68:\n");
> asm volatile (".byte 0x62, 0xf2, 0xfd, 0x68, 0x4d, 0xc0");
> }
> 
> Then the L'L = 3 case (0x68 at the end) does indeed take #UD for both
> KNL and KNM.

And by implication I take it that the L'L=1 and L'L=2 cases indeed do not
#UD there?

Thanks for having tried this out,
Jan




* Re: [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns
  2019-05-27  8:02           ` Jan Beulich
  2019-05-27  8:02             ` [Xen-devel] " Jan Beulich
@ 2019-05-29 10:00             ` Andrew Cooper
  2019-05-29 10:00               ` [Xen-devel] " Andrew Cooper
  1 sibling, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-05-29 10:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 27/05/2019 09:02, Jan Beulich wrote:
>>>> On 24.05.19 at 22:48, <andrew.cooper3@citrix.com> wrote:
>> On 24/05/2019 07:43, Jan Beulich wrote:
>>>>>> On 23.05.19 at 18:15, <andrew.cooper3@citrix.com> wrote:
>>>> On 15/03/2019 10:55, Jan Beulich wrote:
>>>>> Also include the only other AVX512ER insn pair, VEXP2P{D,S}.
>>>>>
>>>>> Note that despite the replacement of the SHA insns' table slots there's
>>>>> no need to special case their decoding: Their insn-specific code already
>>>>> sets op_bytes (as was required due to simd_other), and TwoOp is of no
>>>>> relevance for legacy encoded SIMD insns.
>>>>>
>>>>> The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
>>>>> to be on the safe side. The SDM does not clarify behavior there, and
>>>>> it's even more ambiguous here (without AVX512VL in the picture).
>>>>>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Thanks, also for the others.
>>>
>>>> Seeing as I have some ER hardware, is there an easy way to get
>>>> GCC/binutils to emit a weird L'L field, or will this involve some manual
>>>> opcode generation to test?
>>> gcc does not provide any control at all, afaict. binutils allows "weird"
>>> VEX.L or EVEX.L'L only for insns it believes ignore that field. So yes,
>>> I'm afraid this will involve using .byte.
>> Ok.  Given a test program of:
>>
>> {
>> printf("Real:\n");
>> asm volatile ("vrcp14sd %xmm0,%xmm0,%xmm0");
>>
>> printf("Bytes:\n");
>> asm volatile (".byte 0x62, 0xf2, 0xfd, 0x08, 0x4d, 0xc0");
>>
>> printf("Bad 0x28:\n");
>> asm volatile (".byte 0x62, 0xf2, 0xfd, 0x28, 0x4d, 0xc0");
>>
>> printf("Bad 0x48:\n");
>> asm volatile (".byte 0x62, 0xf2, 0xfd, 0x48, 0x4d, 0xc0");
>>
>> printf("Bad 0x68:\n");
>> asm volatile (".byte 0x62, 0xf2, 0xfd, 0x68, 0x4d, 0xc0");
>> }
>>
>> Then the L'L = 3 case (0x68 at the end) does indeed take #UD for both
>> KNL and KNM.
> And by implication I take it that the L'L=1 and L'L=2 cases indeed do not
> #UD there?

Correct.  It would appear that, unhelpfully, the L'L==3-is-reserved rule
takes precedence over the L'L-is-ignored rule for this class of
instruction.
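
For reference, the four .byte sequences above differ only in bits 6:5 of
the last EVEX payload byte, i.e. the L'L field.  A minimal sketch of
extracting it, assuming the usual 0x62 P0 P1 P2 prefix layout:

    #include <stdint.h>

    /* L'L lives in bits 6:5 of EVEX P2 (the 4th prefix byte). */
    static unsigned int evex_ll(const uint8_t pfx[4])
    {
        return (pfx[3] >> 5) & 3; /* 0x08->0, 0x28->1, 0x48->2, 0x68->3 */
    }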

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns
  2019-03-15 10:56   ` [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns Jan Beulich
@ 2019-05-29 12:51     ` Andrew Cooper
  2019-05-29 12:51       ` [Xen-devel] " Andrew Cooper
  2019-05-29 13:15       ` Jan Beulich
  2019-06-10 14:03     ` Andrew Cooper
  1 sibling, 2 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-05-29 12:51 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:56, Jan Beulich wrote:
> @@ -9681,6 +9696,21 @@ x86_emulate(
>          op_bytes = 4;
>          goto simd_imm8_zmm;
>  
> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
> +        host_and_vcpu_must_have(avx512f);
> +        if ( ea.type != OP_REG || !evex.brs )
> +            avx512_vlen_check(false);
> +        goto simd_imm8_zmm;
> +
> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
> +        host_and_vcpu_must_have(avx512f);
> +        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);

Why the ea.type != OP_REG restriction?  These four instructions do take
memory operands.

~Andrew

> +        if ( !evex.brs )
> +            avx512_vlen_check(true);
> +        goto simd_imm8_zmm;
> +
>      case X86EMUL_OPC_VEX_66(0x0f3a, 0x30): /* kshiftr{b,w} $imm8,k,k */
>      case X86EMUL_OPC_VEX_66(0x0f3a, 0x32): /* kshiftl{b,w} $imm8,k,k */
>          if ( !vex.w )
>
>
>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns
  2019-05-29 12:51     ` Andrew Cooper
  2019-05-29 12:51       ` [Xen-devel] " Andrew Cooper
@ 2019-05-29 13:15       ` Jan Beulich
  2019-05-29 13:15         ` [Xen-devel] " Jan Beulich
  2019-06-10 14:01         ` Andrew Cooper
  1 sibling, 2 replies; 465+ messages in thread
From: Jan Beulich @ 2019-05-29 13:15 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

>>> On 29.05.19 at 14:51, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:56, Jan Beulich wrote:
>> @@ -9681,6 +9696,21 @@ x86_emulate(
>>          op_bytes = 4;
>>          goto simd_imm8_zmm;
>>  
>> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
>> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>> +        host_and_vcpu_must_have(avx512f);
>> +        if ( ea.type != OP_REG || !evex.brs )
>> +            avx512_vlen_check(false);
>> +        goto simd_imm8_zmm;
>> +
>> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
>> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
>> +        host_and_vcpu_must_have(avx512f);
>> +        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
> 
> Why the ea.type != OP_REG restriction?  These four instructions do take
> memory operands.

Did you perhaps read the && as || ? Scalar operations (not just the
ones here) don't support broadcast (with a memory operand), but
may support embedded rounding (or, like here, just SAE; with a
register operand). The exact same construct exists e.g. at the
simd_zmm_scalar_sae label (and the block of case labels ahead of
it actually also gets added to by the patch here).
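
To spell out the intended truth table (a sketch only, annotating the
check quoted above):

    /*
     * Register operand, EVEX.b clear: no #UD, vector length checked.
     * Register operand, EVEX.b set:   no #UD, SAE / embedded rounding.
     * Memory operand,   EVEX.b clear: no #UD, vector length checked.
     * Memory operand,   EVEX.b set:   #UD (no broadcast for scalars).
     */
    generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);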

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns
  2019-05-29 13:15       ` Jan Beulich
  2019-05-29 13:15         ` [Xen-devel] " Jan Beulich
@ 2019-06-10 14:01         ` Andrew Cooper
  1 sibling, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-10 14:01 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 29/05/2019 14:15, Jan Beulich wrote:
>>>> On 29.05.19 at 14:51, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 10:56, Jan Beulich wrote:
>>> @@ -9681,6 +9696,21 @@ x86_emulate(
>>>          op_bytes = 4;
>>>          goto simd_imm8_zmm;
>>>  
>>> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x26): /* vgetmantp{s,d} $imm8,[xyz]mm/mem,[xyz]mm{k} */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x54): /* vfixupimmp{s,d} $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
>>> +        host_and_vcpu_must_have(avx512f);
>>> +        if ( ea.type != OP_REG || !evex.brs )
>>> +            avx512_vlen_check(false);
>>> +        goto simd_imm8_zmm;
>>> +
>>> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x27): /* vgetmants{s,d} $imm8,xmm/mem,xmm,xmm{k} */
>>> +    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x55): /* vfixupimms{s,d} $imm8,xmm/mem,xmm,xmm{k} */
>>> +        host_and_vcpu_must_have(avx512f);
>>> +        generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
>> Why the ea.type != OP_REG restriction?  These four instructions do take
>> memory operands.
> Did you perhaps read the && as || ?

Oops - I did.

Sorry - I blame the jetlag.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns
  2019-03-15 10:56   ` [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns Jan Beulich
  2019-05-29 12:51     ` Andrew Cooper
@ 2019-06-10 14:03     ` Andrew Cooper
  1 sibling, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-10 14:03 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:56, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 29/50] x86emul: support AVX512DQ floating point manipulation insns
  2019-03-15 10:56   ` [PATCH v8 29/50] x86emul: support AVX512DQ " Jan Beulich
@ 2019-06-10 14:06     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-10 14:06 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:56, Jan Beulich wrote:
> This completes support of AVX512DQ in the insn emulator.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns
  2019-03-15 10:56   ` [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
@ 2019-06-10 14:51     ` Andrew Cooper
  2019-06-11 10:20       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-10 14:51 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:56, Jan Beulich wrote:
> +#if __GNUC__ > 7 /* can't check for __AVX512VBMI2__ here */

Why not?

> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -266,10 +266,10 @@ def crunch_numbers(state):
>                    AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
>                    AVX512_VPOPCNTDQ],
>  
> -        # AVX512 extensions acting solely on vectors of bytes/words are made
> +        # AVX512 extensions acting (solely) on vectors of bytes/words are made

Why the change here?

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns
  2019-06-10 14:51     ` [Xen-devel] " Andrew Cooper
@ 2019-06-11 10:20       ` Jan Beulich
  2019-06-18 16:24         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-06-11 10:20 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

>>> On 10.06.19 at 16:51, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:56, Jan Beulich wrote:
>> +#if __GNUC__ > 7 /* can't check for __AVX512VBMI2__ here */
> 
> Why not?

Because that would require passing -mavx512vbmi2 (or enabling the
feature via #pragma) which in turn would need gating on compiler
version, or else the harness couldn't be built with gcc7 at all anymore.
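
For illustration only, the #pragma route would be a sketch along the
lines of:

    #if __GNUC__ > 7
    # pragma GCC push_options
    # pragma GCC target("avx512vbmi2")
    /* ... code using the VBMI2 built-ins; __AVX512VBMI2__ now defined ... */
    # pragma GCC pop_options
    #endif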

>> --- a/xen/tools/gen-cpuid.py
>> +++ b/xen/tools/gen-cpuid.py
>> @@ -266,10 +266,10 @@ def crunch_numbers(state):
>>                    AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
>>                    AVX512_VPOPCNTDQ],
>>  
>> -        # AVX512 extensions acting solely on vectors of bytes/words are made
>> +        # AVX512 extensions acting (solely) on vectors of bytes/words are made

Because VBMI2 doesn't act _solely_ on vectors of bytes/words.
There are also shift insns acting on vectors of dwords/qwords.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns
  2019-06-11 10:20       ` Jan Beulich
@ 2019-06-18 16:24         ` Andrew Cooper
  2019-06-19  6:38           ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-18 16:24 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 11/06/2019 11:20, Jan Beulich wrote:
>>>> On 10.06.19 at 16:51, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 10:56, Jan Beulich wrote:
>>> +#if __GNUC__ > 7 /* can't check for __AVX512VBMI2__ here */
>> Why not?
> Because that would require passing -mavx512vbmi2 (or enabling the
> feature via #pragma) which in turn would need gating on compiler
> version, or else the harness couldn't be built with gcc7 at all anymore.

Is that really a problem?  We already require a
practically-bleeding-edge binutils.

Irrespective, are you saying that __AVX512VBMI2__ really is conditional
on -mavx512vbmi2 being passed?  If so, what's wrong with using cc-option
to gain it conditionally?

>
>>> --- a/xen/tools/gen-cpuid.py
>>> +++ b/xen/tools/gen-cpuid.py
>>> @@ -266,10 +266,10 @@ def crunch_numbers(state):
>>>                    AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
>>>                    AVX512_VPOPCNTDQ],
>>>  
>>> -        # AVX512 extensions acting solely on vectors of bytes/words are made
>>> +        # AVX512 extensions acting (solely) on vectors of bytes/words are made
> Because VBMI2 doesn't act _solely_ on vectors of bytes/words.
> There are also shift insns acting on vectors of dwords/qwords.

In which case I'd expect s/solely// here.  Putting solely in brackets
doesn't convey any information relevant to "not really solely any more".

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 31/50] x86emul: support remaining misc AVX512{F, BW} insns
  2019-03-15 10:58   ` [PATCH v8 31/50] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
@ 2019-06-18 16:42     ` Andrew Cooper
  2019-06-19  6:44       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-18 16:42 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:58, Jan Beulich wrote:
> This completes support of AVX512BW in the insn emulator, and leaves just
> the scatter/gather ones open in the AVX512F set.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v5: New.
> ---
> TBD: The *blendm* inline functions don't reliably produce the intended
>      insns, as the respective moves are about as good a fit for the
>      compiler when looking for a match for the intended operation. We'd
>      need to switch to inline assembly if we wanted to guarantee the
>      testing of those insns. Thoughts?

Do you mean __builtin_ia32_blendm_* here?  I guess it depends on the
writemask as to whether mov instructions would be more efficient.

I think we can probably rely on fuzzing to cover the difference, seeing
as there is nothing interesting or nonstandard about these
instructions.  That certainly seems preferable to a custom bodge.

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns
  2019-06-18 16:24         ` Andrew Cooper
@ 2019-06-19  6:38           ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-19  6:38 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 18.06.19 at 18:24, <andrew.cooper3@citrix.com> wrote:
> On 11/06/2019 11:20, Jan Beulich wrote:
>>>>> On 10.06.19 at 16:51, <andrew.cooper3@citrix.com> wrote:
>>> On 15/03/2019 10:56, Jan Beulich wrote:
>>>> +#if __GNUC__ > 7 /* can't check for __AVX512VBMI2__ here */
>>> Why not?
>> Because that would require passing -mavx512vbmi2 (or enabling the
>> feature via #pragma) which in turn would need gating on compiler
>> version, or else the harness couldn't be built with gcc7 at all anymore.
> 
> Is that really a problem?  We already require a
> practically-bleeding-edge binutils.

"Bleeding edge" would be 2.32 or newer. {evex} support was added in
2.29, which sort of matches up with gcc 7.
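
For instance (illustrative only), gas 2.29 and newer accept a
pseudo-prefix to force the EVEX encoding of an otherwise VEX-encodable
insn:

    asm volatile ( "{evex} vmovaps %%xmm1, %%xmm2" ::: "xmm2" );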

> Irrespective, are you saying that __AVX512VBMI2__ really is conditional
> on -mavx512vbmi2 being passed?

Yes. How else would you expect things to work?

> If so, what's wrong with using cc-option to gain it conditionally?

We don't want to enable all sorts of AVX- and AVX512-isms in the
compilation. Besides our inline assembly we specifically want the
compiler to only produce generic code. It was intentional after all
that for the main harness object we've never added any
-m<extension> option.

>>>> --- a/xen/tools/gen-cpuid.py
>>>> +++ b/xen/tools/gen-cpuid.py
>>>> @@ -266,10 +266,10 @@ def crunch_numbers(state):
>>>>                    AVX512BW, AVX512VL, AVX512_4VNNIW, AVX512_4FMAPS,
>>>>                    AVX512_VPOPCNTDQ],
>>>>  
>>>> -        # AVX512 extensions acting solely on vectors of bytes/words are made
>>>> +        # AVX512 extensions acting (solely) on vectors of bytes/words are made
>> Because VBMI2 doesn't act _solely_ on vectors of bytes/words.
>> There are also shift insns acting on vectors of dwords/qwords.
> 
> In which case I'd expect s/solely// here.  Putting solely in brackets
> doesn't convey any information relevant to "not really solely any more".

Well, okay.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 31/50] x86emul: support remaining misc AVX512{F, BW} insns
  2019-06-18 16:42     ` [Xen-devel] " Andrew Cooper
@ 2019-06-19  6:44       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-19  6:44 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 18.06.19 at 18:42, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:58, Jan Beulich wrote:
>> This completes support of AVX512BW in the insn emulator, and leaves just
>> the scatter/gather ones open in the AVX512F set.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v5: New.
>> ---
>> TBD: The *blendm* inline functions don't reliably produce the intended
>>      insns, as the respective moves are about as good a fit for the
>>      compiler when looking for a match for the intended operation. We'd
>>      need to switch to inline assembly if we wanted to guarantee the
>>      testing of those insns. Thoughts?
> 
> Do you mean __builtin_ia32_blendm_* here?

Yes, or their wrappers in the *intrin.h headers (which we don't use for
various reasons).

>  I guess it depends on the
> writemask as to whether mov instructions would be more efficient.

Right, but we don't use write masks here. Their use in general is imo
rather cumbersome in C level code anyway, and it wouldn't fit the
general structure of simd_test() very well if we wanted to make use
of them.
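
To illustrate the ambiguity (a sketch using the intrinsics wrappers,
which the harness itself avoids; needs -mavx512f to build):

    #include <immintrin.h>

    static __m512i blend(__mmask16 k, __m512i a, __m512i b)
    {
        /*
         * Selects b where the respective mask bit is set, a elsewhere.
         * The compiler may emit this as vpblendmd, but a masked
         * vmovdqa32 matches the operation just as exactly.
         */
        return _mm512_mask_blend_epi32(k, a, b);
    }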

> I think we can probably rely on fuzzing to cover the difference, seeing
> as there is nothing interesting or nonstandard about these
> instructions.  That certainly seems preferable to a custom bode.
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 32/50] x86emul: support AVX512F gather insns
  2019-03-15 10:58   ` [PATCH v8 32/50] x86emul: support AVX512F gather insns Jan Beulich
@ 2019-06-19 12:05     ` Andrew Cooper
  2019-06-19 12:43       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 12:05 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:58, Jan Beulich wrote:
> This requires getting modrm_reg and sib_index set correctly in the EVEX
> case, to account for the high 16 [XYZ]MM registers. Extend the
> adjustments to modrm_rm as well, such that x86_insn_modrm() would
> correctly report register numbers (this was a latent issue only as we
> don't currently have callers of that function which would care about an
> EVEX case). The adjustment in turn requires dropping the assertion from
> decode_gpr() as well as re-introducing the explicit masking, as we now
> need to actively mask off the high bit when a GPR is meant.
> _decode_gpr() invocations also need slight adjustments, when invoked in
> generic code ahead of the main switch(). All other uses of modrm_reg and
> modrm_rm already get suitably masked where necessary.
>
> There was also an encoding mistake in the EVEX Disp8 test code, which
> was benign (due to %rdx getting set to zero) to all non-vSIB tests as it
> mistakenly encoded <disp8>(%rdx,%rdx) instead of <disp8>(%rdx,%riz). In
> the vSIB case this meant <disp8>(%rdx,%zmm2) instead of the intended
> <disp8>(%rdx,%zmm4).
>
> Likewise the access count check wasn't entirely correct for the S/G
> case: In the quad-word-index but dword-data case only half the number
> of full vector elements get accessed.
>
> As an unrelated change in the main test harness source file distinguish
> the "n/a" messages by bitness.

There is a very large amount going on here, and too much for a single patch.

I think the modrm fixes want splitting out because those alone are
subtle.  It's reasonable to keep the emulator test fixes in with the new
functionality which notices the bug, and it is a one-liner.

> --- a/xen/arch/x86/x86_emulate/x86_emulate.h
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.h
> @@ -662,8 +662,6 @@ static inline unsigned long *decode_gpr(
>      BUILD_BUG_ON(ARRAY_SIZE(cpu_user_regs_gpr_offsets) &
>                   (ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1));
>  
> -    ASSERT(modrm < ARRAY_SIZE(cpu_user_regs_gpr_offsets));
> -
>      /* Note that this also acts as array_access_nospec() stand-in. */

I can't locate this comment anywhere in Xen's history.

The comment currently in tree is:

/* For safety in release builds.  Debug builds will hit the ASSERT() */

which will need adjusting to make it clear that we may see modrm >
ARRAY_SIZE() and that this truncation operation is the correct action to
take.
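
Something along these lines perhaps (a sketch; the masking relies on the
power-of-two array size enforced by the BUILD_BUG_ON() above):

    /*
     * With EVEX encodings modrm can exceed the array size, as the
     * extra encoding bit selects [XYZ]MM16-31.  When a GPR is meant,
     * masking that bit off is the correct action; the masking also
     * acts as array_access_nospec() stand-in.
     */
    modrm &= ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1;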

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 33/50] x86emul: add high register S/G test cases
  2019-03-15 10:59   ` [PATCH v8 33/50] x86emul: add high register S/G test cases Jan Beulich
@ 2019-06-19 12:07     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 12:07 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 10:59, Jan Beulich wrote:
> In order to verify that in particular the index register decoding works
> correctly in the S/G emulation paths, add dedicated (64-bit only) cases
> disallowing the compiler to use the lower registers. Other than in the
> generic SIMD case, where occasional uses of %xmm or %ymm registers in
> generated code cause various internal compiler errors when disallowing
> use of all of the lower 16 registers (apparently due to insn templates
> trying to use AVX2 encodings), doing so here in the AVX512F case looks
> to be fine.
>
> While the main goal here is the AVX512F case, add an AVX2 variant as
> well.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 36/50] x86emul: support AVX512CD insns
  2019-03-15 11:00   ` [PATCH v8 36/50] x86emul: support AVX512CD insns Jan Beulich
@ 2019-06-19 12:13     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 12:13 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:00, Jan Beulich wrote:
> Since the insns here and in particular their memory access patterns
> follow the usual scheme I didn't think it was necessary to add
> contrived tests specifically for them, beyond the Disp8 scaling ones.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 37/50] x86emul: complete support of AVX512_VBMI insns
  2019-03-15 11:01   ` [PATCH v8 37/50] x86emul: complete support of AVX512_VBMI insns Jan Beulich
@ 2019-06-19 12:16     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 12:16 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:01, Jan Beulich wrote:
> Also add testing of the insns whose support was added before. Sadly gcc's
> command line option naming is not in line with Intel's naming of the
> feature, which makes it necessary to mis-name things in the test harness.
>
> Since the only new insn here and in particular its memory access pattern
> follows the usual scheme, I didn't think it was necessary to add a
> contrived test specifically for it, beyond the Disp8 scaling one.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 38/50] x86emul: support of AVX512* population count insns
  2019-03-15 11:01   ` [PATCH v8 38/50] x86emul: support of AVX512* population count insns Jan Beulich
@ 2019-06-19 12:22     ` Andrew Cooper
  2019-06-19 12:48       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 12:22 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:01, Jan Beulich wrote:
> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -269,7 +269,7 @@ def crunch_numbers(state):
>          # AVX512 extensions acting (solely) on vectors of bytes/words are made
>          # dependents of AVX512BW (as to requiring wider than 16-bit mask
>          # registers), despite the SDM not formally making this connection.
> -        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
> +        AVX512BW: [AVX512_VBMI, AVX512_BITALG, AVX512_VBMI2],

This ordering looks suspect.  BITALG should go last, given its position
in the feature leaf.

With this fixed, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 39/50] x86emul: support of AVX512_IFMA insns
  2019-03-15 11:02   ` [PATCH v8 39/50] x86emul: support of AVX512_IFMA insns Jan Beulich
@ 2019-06-19 12:23     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 12:23 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:02, Jan Beulich wrote:
> Once again take the liberty and also correct the (public interface) name
> of the AVX512_IFMA feature flag to match the SDM, on the assumption that
> no external consumer has actually been using that flag so far.
>
> As in a few cases before, since the insns here and in particular their
> memory access patterns follow the usual scheme, I didn't think it was
> necessary to add a contrived test specifically for them, beyond the
> Disp8 scaling one.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 40/50] x86emul: support remaining AVX512_VBMI2 insns
  2019-03-15 11:02   ` [PATCH v8 40/50] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
@ 2019-06-19 12:25     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 12:25 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:02, Jan Beulich wrote:
> As in a few cases before, since the insns here and in particular their
> memory access patterns follow the usual scheme, I didn't think it was
> necessary to add a contrived test specifically for them, beyond the
> Disp8 scaling one.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 32/50] x86emul: support AVX512F gather insns
  2019-06-19 12:05     ` [Xen-devel] " Andrew Cooper
@ 2019-06-19 12:43       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-19 12:43 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 19.06.19 at 14:05, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:58, Jan Beulich wrote:
>> This requires getting modrm_reg and sib_index set correctly in the EVEX
>> case, to account for the high 16 [XYZ]MM registers. Extend the
>> adjustments to modrm_rm as well, such that x86_insn_modrm() would
>> correctly report register numbers (this was a latent issue only as we
>> don't currently have callers of that function which would care about an
>> EVEX case). The adjustment in turn requires dropping the assertion from
>> decode_gpr() as well as re-introducing the explicit masking, as we now
>> need to actively mask off the high bit when a GPR is meant.
>> _decode_gpr() invocations also need slight adjustments, when invoked in
>> generic code ahead of the main switch(). All other uses of modrm_reg and
>> modrm_rm already get suitably masked where necessary.
>>
>> There was also an encoding mistake in the EVEX Disp8 test code, which
>> was benign (due to %rdx getting set to zero) to all non-vSIB tests as it
>> mistakenly encoded <disp8>(%rdx,%rdx) instead of <disp8>(%rdx,%riz). In
>> the vSIB case this meant <disp8>(%rdx,%zmm2) instead of the intended
>> <disp8>(%rdx,%zmm4).
>>
>> Likewise the access count check wasn't entirely correct for the S/G
>> case: In the quad-word-index but dword-data case only half the number
>> of full vector elements get accessed.
>>
>> As an unrelated change in the main test harness source file distinguish
>> the "n/a" messages by bitness.
> 
> There is a very large amount going on here, and too much for a single patch.
> 
> I think the modrm fixes want splitting out because those alone are
> subtle.  It's reasonable to keep the emulator test fixes in with the new
> functionality which notices the bug, and it is a one-liner.

Splitting out the ModR/M handling changes is certainly possible. I'll
do so since you ask for it, but I had deliberately not done so
because to me they looked "out of context" without the actual uses.

>> --- a/xen/arch/x86/x86_emulate/x86_emulate.h
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.h
>> @@ -662,8 +662,6 @@ static inline unsigned long *decode_gpr(
>>      BUILD_BUG_ON(ARRAY_SIZE(cpu_user_regs_gpr_offsets) &
>>                   (ARRAY_SIZE(cpu_user_regs_gpr_offsets) - 1));
>>  
>> -    ASSERT(modrm < ARRAY_SIZE(cpu_user_regs_gpr_offsets));
>> -
>>      /* Note that this also acts as array_access_nospec() stand-in. */
> 
> I can't locate this comment anywhere in Xen's history.
> 
> The comment currently in tree is:
> 
> /* For safety in release builds.  Debug builds will hit the ASSERT() */
> 
> which will need adjusting to make it clear that we may modrm >
> ARRAY_SIZE() and that this truncation operation is the correct action to
> take.

See "x86emul: avoid speculative out of bounds accesses" which, together
with 3 other patches in the same original series, is still awaiting your review.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 38/50] x86emul: support of AVX512* population count insns
  2019-06-19 12:22     ` [Xen-devel] " Andrew Cooper
@ 2019-06-19 12:48       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-19 12:48 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 19.06.19 at 14:22, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 11:01, Jan Beulich wrote:
>> --- a/xen/tools/gen-cpuid.py
>> +++ b/xen/tools/gen-cpuid.py
>> @@ -269,7 +269,7 @@ def crunch_numbers(state):
>>          # AVX512 extensions acting (solely) on vectors of bytes/words are made
>>          # dependents of AVX512BW (as to requiring wider than 16-bit mask
>>          # registers), despite the SDM not formally making this connection.
>> -        AVX512BW: [AVX512_VBMI, AVX512_VBMI2],
>> +        AVX512BW: [AVX512_VBMI, AVX512_BITALG, AVX512_VBMI2],
> 
> This ordering looks suspect.  BITALG should go last, given its position
> in the feature leaf.

After re-basing, this now is:

@@ -268,7 +268,7 @@ def crunch_numbers(state):
         # AVX512 extensions acting (solely) on vectors of bytes/words are made
         # dependents of AVX512BW (as to requiring wider than 16-bit mask
         # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_BF16, AVX512_VBMI, AVX512_VBMI2],
+        AVX512BW: [AVX512_BF16, AVX512_BITALG, AVX512_VBMI, AVX512_VBMI2],
 
         # The features:
         #   * Single Thread Indirect Branch Predictors

I don't think ordering based on (potentially unrelated) leaves should
be a criterion here. Instead, as indicated before, I think we'd be
better off using alphabetical sorting for such longer dependency
lists. I'd be happy to re-sort the AVX512F one as well.

> With this fixed, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

I'm not going to apply this without further clarification by you as
per the above point.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 41/50] x86emul: support AVX512_4FMAPS insns
  2019-03-15 11:04   ` [PATCH v8 41/50] x86emul: support AVX512_4FMAPS insns Jan Beulich
@ 2019-06-19 14:58     ` Andrew Cooper
  2019-06-21  6:50       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 14:58 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:04, Jan Beulich wrote:
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -1924,6 +1924,7 @@ static bool vcpu_has(
>  #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
>  #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
>  #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
> +#define vcpu_has_avx512_4fmaps() vcpu_has(       7, EDX,  3, ctxt, ops)
>  #define vcpu_has_clzero()      vcpu_has(0x80000008, EBX,  0, ctxt, ops)
>  
>  #define vcpu_must_have(feat) \
> @@ -3205,6 +3206,18 @@ x86_decode(
>                                                     state);
>                      state->simd_size = simd_other;
>                  }
> +
> +                switch ( b )
> +                {
> +                /* v4f{,n}madd{p,s}s need special casing */

Why?  Given that this isn't boilerplate addition of new instructions, it
needs a discussion in the commit message.

~Andrew

> +                case 0x9a: case 0x9b: case 0xaa: case 0xab:
> +                    if ( evex.pfx == vex_f2 )
> +                    {
> +                        disp8scale = 4;
> +                        state->simd_size = simd_128;
> +                    }
> +                    break;
> +                }
>              }
>              break;
>  


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 42/50] x86emul: support AVX512_4VNNIW insns
  2019-03-15 11:04   ` [PATCH v8 42/50] x86emul: support AVX512_4VNNIW insns Jan Beulich
@ 2019-06-19 14:58     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 14:58 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:04, Jan Beulich wrote:
> As in a few cases before, since the insns here and in particular their
> memory access patterns follow the AVX512_4FMAPS scheme, I didn't think
> it was necessary to add contrived tests specifically for them, beyond
> the Disp8 scaling ones.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 43/50] x86emul: support AVX512_VNNI insns
  2019-03-15 11:04   ` [PATCH v8 43/50] x86emul: support AVX512_VNNI insns Jan Beulich
@ 2019-06-19 15:01     ` Andrew Cooper
  2019-06-21  6:55       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-19 15:01 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:04, Jan Beulich wrote:
> --- a/tools/tests/x86_emulator/x86-emulate.h
> +++ b/tools/tests/x86_emulator/x86-emulate.h
> @@ -144,6 +144,7 @@ static inline bool xcr0_mask(uint64_t ma
>  #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
>  #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
>  #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
> +#define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
>  #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
>  #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
>  #define cpu_has_avx512_4vnniw (cp.feat.avx512_4vnniw && xcr0_mask(0xe6))
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -479,7 +479,7 @@ static const struct ext0f38_table {
>      [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
>      [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
>      [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
> -    [0x52 ... 0x53] = { .simd_size = simd_128, .d8s = 4 },
> +    [0x50 ... 0x53] = { .simd_size = simd_packed_int, .d8s = d8s_vl },

Hang on - is the previous patch correct?  Shouldn't it have
simd_packed_int/d8s_vl from the getgo?

>      [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
>      [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
>      [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
> @@ -1922,6 +1922,7 @@ static bool vcpu_has(
>  #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
>  #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
>  #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
> +#define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
>  #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
>  #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
>  #define vcpu_has_rdpid()       vcpu_has(         7, ECX, 22, ctxt, ops)
> @@ -3211,6 +3212,8 @@ x86_decode(
>  
>                  switch ( b )
>                  {
> +                /* vp4dpwssd{,s} need special casing */

Special cases should be discussed.

~Andrew


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 41/50] x86emul: support AVX512_4FMAPS insns
  2019-06-19 14:58     ` [Xen-devel] " Andrew Cooper
@ 2019-06-21  6:50       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-21  6:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 19.06.19 at 16:58, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 11:04, Jan Beulich wrote:
>> @@ -3205,6 +3206,18 @@ x86_decode(
>>                                                     state);
>>                      state->simd_size = simd_other;
>>                  }
>> +
>> +                switch ( b )
>> +                {
>> +                /* v4f{,n}madd{p,s}s need special casing */
> 
> Why?  Given that this isn't boilerplate addition of new instructions, it
> needs a discussion in the commit message.

I've added

A decoder adjustment is needed here because of the current sharing of
table entries between different (implied) opcode prefixes: The same
major opcodes are used for vfmsub{132,213}{p,s}{s,d}, which have a
different memory operand size and different Disp8 scaling.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 43/50] x86emul: support AVX512_VNNI insns
  2019-06-19 15:01     ` [Xen-devel] " Andrew Cooper
@ 2019-06-21  6:55       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-21  6:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 19.06.19 at 17:01, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 11:04, Jan Beulich wrote:
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -479,7 +479,7 @@ static const struct ext0f38_table {
>>      [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
>>      [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
>>      [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
>> -    [0x52 ... 0x53] = { .simd_size = simd_128, .d8s = 4 },
>> +    [0x50 ... 0x53] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
> 
> Hang on - is the previous patch correct?  Shouldn't it have
> simd_packed_int/d8s_vl from the getgo?

Yes, it is. The situation here is the opposite to what we have with
4FMAPS: The special F2-prefixed encoding gets introduced first
(previous patch), and gets converted to a decode special case
here (for consistency, such that the "normal" 66-prefixed operand
characteristics appear in the table).

>> @@ -3211,6 +3212,8 @@ x86_decode(
>>  
>>                  switch ( b )
>>                  {
>> +                /* vp4dpwssd{,s} need special casing */
> 
> Special cases should be discussed.

I'll clone what I've added to the 4FMAPS patch.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 44/50] x86emul: support VPCLMULQDQ insns
  2019-03-15 11:05   ` [PATCH v8 44/50] x86emul: support VPCLMULQDQ insns Jan Beulich
@ 2019-06-21 12:52     ` Andrew Cooper
  2019-06-21 13:44       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 12:52 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:05, Jan Beulich wrote:
> As to the feature dependency adjustment, while strictly speaking AVX is
> a sufficient prereq (to have YMM registers), 256-bit vectors of integers
> have got fully introduced with AVX2 only. Sadly gcc can't be used as a
> reference here: They don't provide any AVX512-independent built-in at
> all.
>
> Along the lines of PCLMULQDQ, since the insns here and in particular
> their memory access patterns follow the usual scheme, I didn't think it
> was necessary to add a contrived test specifically for them, beyond the
> Disp8 scaling one.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: Should VPCLMULQDQ also depend on PCLMULQDQ?

I think so, yes.  These are all 64 by 64 multiplies with a 128 bit
result, and an imm8 to choose which quadwords get used, so both these
features will be using the silicon block in the vector pipeline.  The
only difference is whether its wired through from the legacy SSE
instructions, or the [E]VEX instructions.

I certainly don't expect to ever see hardware with VPCLMULQDQ but
lacking PCLMULQDQ.
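
For reference, the per-lane operation both features implement is a
carry-less (GF(2)) multiply; a plain-C sketch of one 64x64 step:

    #include <stdint.h>

    /* 64x64 -> 128 bit multiply with XOR instead of addition, i.e. the
     * core of [v]pclmulqdq (res[0] = low, res[1] = high quadword). */
    static void clmul64(uint64_t a, uint64_t b, uint64_t res[2])
    {
        res[0] = res[1] = 0;
        for ( unsigned int i = 0; i < 64; ++i )
            if ( (b >> i) & 1 )
            {
                res[0] ^= a << i;
                if ( i )
                    res[1] ^= a >> (64 - i);
            }
    }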

> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -255,8 +255,9 @@ def crunch_numbers(state):
>  
>          # This is just the dependency between AVX512 and AVX2 of XSTATE
>          # feature flags.  If want to use AVX512, AVX2 must be supported and
> -        # enabled.
> -        AVX2: [AVX512F],
> +        # enabled.  Certain later extensions, acting on 256-bit vectors of
> +        # integers, better depend on AVX2 than AVX.
> +        AVX2: [AVX512F, VPCLMULQDQ],

Hmm - this is awkward, because in practice, there won't be any hardware
in existence with VPCLMULQDQ and AVX2 but lacking AVX512.

However, the VEX encoding is legitimate in the absence of the EVEX
encoding, and doesn't depend on the AVX512 XCR0 state.

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 45/50] x86emul: support VAES insns
  2019-03-15 11:06   ` [PATCH v8 45/50] x86emul: support VAES insns Jan Beulich
@ 2019-06-21 12:57     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 12:57 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:06, Jan Beulich wrote:
> As to the feature dependency adjustment, just like for VPCLMULQDQ while
> strictly speaking AVX is a sufficient prereq (to have YMM registers),
> 256-bit vectors of integers have got fully introduced with AVX2 only.
>
> A new test case (also covering AESNI) will be added to the harness by a
> subsequent patch.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: Should VAES also depend on AESNI?

It should match whatever we decide for VPCLMULQDQ vs PCLMULQDQ, so as
long as we're consistent,

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 465+ messages in thread

* Re: [Xen-devel] [PATCH v8 46/50] x86emul: support GFNI insns
  2019-03-15 11:06   ` [PATCH v8 46/50] x86emul: support GFNI insns Jan Beulich
@ 2019-06-21 13:19     ` Andrew Cooper
  2019-06-21 13:33       ` Andrew Cooper
  2019-06-21 14:00       ` Jan Beulich
  0 siblings, 2 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 13:19 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:06, Jan Beulich wrote:
> Note that the ISA extensions document revision 035 is ambiguous
> regarding fault suppression for VGF2P8MULB: Text says it's supported,
> while the exception specification listed is E4NF. Given the wording here
> and for the other two insns I'm inclined to trust the text more than the
> exception reference, which was also confirmed informally.

Version 037 has the exception reference as E4 rather than E4NF, so I
think this entire paragraph is stale now and can be dropped.

(On a tangent, AVX512_VP2INTERSECT now exists in the extensions doc.)

> As to the feature dependency adjustment, while strictly speaking SSE is
> a sufficient prereq (to have XMM registers), vectors of bytes and qwords
> have got introduced only with SSE2. gcc, for example, uses a similar
> connection in its respective intrinsics header.

This is stale now that you've moved the other integer dependences to SSE2.

Feature wise, GFNI is rather awkward.  The single feature bit covers
legacy SSE, and VEX and EVEX encodings.  This is clearly a brand new
functional block in the vector pipeline which has been wired into all
instruction paths (probably because that is easier than trying to
exclude the legacy SSE part).

> @@ -138,6 +141,26 @@ static bool simd_check_avx512vbmi_vl(voi
>      return cpu_has_avx512_vbmi && cpu_has_avx512vl;
>  }
>  
> +static bool simd_check_sse2_gf(void)
> +{
> +    return cpu_has_gfni && cpu_has_sse2;

This dependency doesn't match the manual.  The legacy encoding needs
GFNI alone.

gen-cpuid.py is trying to reduce the ability to create totally
implausible configurations via levelling, but for software checks, we
should follow the manual to the letter.

> +}
> +
> +static bool simd_check_avx2_gf(void)
> +{
> +    return cpu_has_gfni && cpu_has_avx2;

Here, the dependency is only on AVX, which I think is probably trying to
express a dependency on xcr0.ymm

> +}
> +
> +static bool simd_check_avx512bw_gf(void)
> +{
> +    return cpu_has_gfni && cpu_has_avx512bw;

I don't see any BW interaction anywhere (in the manual), despite the
fact it operates on a datatype of int8.

~Andrew

> +}
> +
> +static bool simd_check_avx512bw_gf_vl(void)
> +{
> +    return cpu_has_gfni && cpu_has_avx512vl;
> +}
> +
>  static void simd_set_regs(struct cpu_user_regs *regs)
>  {
>      if ( cpu_has_mmx )
>



* Re: [Xen-devel] [PATCH v8 47/50] x86emul: restore ordering within main switch statement
  2019-03-15 11:07   ` [PATCH v8 47/50] x86emul: restore ordering within main switch statement Jan Beulich
@ 2019-06-21 13:20     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 13:20 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:07, Jan Beulich wrote:
> Incremental additions and/or mistakes have led to some code blocks
> sitting in "unexpected" places. Re-sort the case blocks (opcode space;
> major opcode; 66/F3/F2 prefix; legacy/VEX/EVEX encoding).
>
> As an exception the opcode space 0x0f EVEX-encoded VPEXTRW is left at
> its current place, to keep it close to the "pextr" label.
>
> Pure code movement.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [Xen-devel] [PATCH v8 46/50] x86emul: support GFNI insns
  2019-06-21 13:19     ` [Xen-devel] " Andrew Cooper
@ 2019-06-21 13:33       ` Andrew Cooper
  2019-06-21 14:00       ` Jan Beulich
  1 sibling, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 13:33 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 21/06/2019 14:19, Andrew Cooper wrote:
>> +}
>> +
>> +static bool simd_check_avx512bw_gf(void)
>> +{
>> +    return cpu_has_gfni && cpu_has_avx512bw;
> I don't see any BW interaction anywhere (in the manual), despite the
> fact it operates on a datatype of int8.

Actually, it is an integer operation (and looks suspiciously like a
lookup table) rather than a vector operation, so it probably doesn't
count as using the int8 datatype, which explains the lack of
interaction with BW.
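
To make the lookup-table-like nature concrete, a scalar C sketch of
what GF2P8MULB computes for each byte pair: a carry-less multiply
reduced modulo the AES field polynomial x^8 + x^4 + x^3 + x + 1
(0x11b):

    #include <stdint.h>

    static uint8_t gf2p8mul_byte(uint8_t a, uint8_t b)
    {
        uint8_t r = 0;
        unsigned int i;

        for ( i = 0; i < 8; ++i )
        {
            if ( b & 1 )
                r ^= a;
            b >>= 1;
            /* Multiply a by x in GF(2^8), reducing on overflow. */
            a = (a << 1) ^ ((a & 0x80) ? 0x1b : 0);
        }

        return r;
    }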

~Andrew


* Re: [Xen-devel] [PATCH v8 48/50] x86emul: add an AES/VAES test case to the harness
  2019-03-15 11:07   ` [PATCH v8 48/50] x86emul: add an AES/VAES test case to the harness Jan Beulich
@ 2019-06-21 13:36     ` Andrew Cooper
  2019-06-21 14:04       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 13:36 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:07, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

This feels like it should be folded with patch 45 (or, perhaps easier,
patch 45 moved later and folded into this one; the exact ordering of
patches really doesn't matter).

> @@ -91,6 +95,16 @@ static bool simd_check_xop(void)
>      return cpu_has_xop;
>  }
>  
> +static bool simd_check_ssse3_aes(void)
> +{
> +    return cpu_has_aesni && cpu_has_ssse3;
> +}
> +
> +static bool simd_check_avx_aes(void)
> +{
> +    return cpu_has_aesni && cpu_has_avx;
> +}
> +
>  static bool simd_check_avx512f(void)
>  {
>      return cpu_has_avx512f;
> @@ -141,6 +155,22 @@ static bool simd_check_avx512vbmi_vl(voi
>      return cpu_has_avx512_vbmi && cpu_has_avx512vl;
>  }
>  
> +static bool simd_check_avx2_vaes(void)
> +{
> +    return cpu_has_aesni && cpu_has_vaes && cpu_has_avx2;
> +}
> +
> +static bool simd_check_avx512bw_vaes(void)
> +{
> +    return cpu_has_aesni && cpu_has_vaes && cpu_has_avx512bw;
> +}
> +
> +static bool simd_check_avx512bw_vaes_vl(void)
> +{
> +    return cpu_has_aesni && cpu_has_vaes &&
> +           cpu_has_avx512bw && cpu_has_avx512vl;
> +}

I've got the same concerns WRT feature tests as with the previous
patch.  Everything else LGTM.

~Andrew


* Re: [Xen-devel] [PATCH v8 44/50] x86emul: support VPCLMULQDQ insns
  2019-06-21 12:52     ` [Xen-devel] " Andrew Cooper
@ 2019-06-21 13:44       ` Jan Beulich
  0 siblings, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-21 13:44 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 21.06.19 at 14:52, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 11:05, Jan Beulich wrote:
>> As to the feature dependency adjustment, while strictly speaking AVX is
>> a sufficient prereq (to have YMM registers), 256-bit vectors of integers
>> have got fully introduced with AVX2 only. Sadly gcc can't be used as a
>> reference here: They don't provide any AVX512-independent built-in at
>> all.
>>
>> Along the lines of PCLMULQDQ, since the insns here and in particular
>> their memory access patterns follow the usual scheme, I didn't think it
>> was necessary to add a contrived test specifically for them, beyond the
>> Disp8 scaling one.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> TBD: Should VPCLMULQDQ also depend on PCLMULQDQ?
> 
> I think so, yes.  These are all 64 by 64 multiplies with a 128 bit
> result, and an imm8 to choose which quadwords get used, so both these
> features will be using the silicon block in the vector pipeline.  The
> only difference is whether it's wired through from the legacy SSE
> instructions, or the [E]VEX instructions.
> 
> I certainly don't expect to ever see hardware with VPCLMULQDQ but
> lacking PCLMULQDQ.
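
For reference, a scalar C sketch of the shared underlying operation, a
64x64 -> 128 bit carry-less multiply (the imm8 only selects which
source quadwords feed it):

    #include <stdint.h>

    static void clmul64(uint64_t a, uint64_t b, uint64_t res[2])
    {
        unsigned int i;

        res[0] = res[1] = 0;

        for ( i = 0; i < 64; ++i )
            if ( (b >> i) & 1 )
            {
                /* XOR (carry-less add) of a shifted left by i. */
                res[0] ^= a << i;
                if ( i )
                    res[1] ^= a >> (64 - i);
            }
    }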

Okay, will do, and I'll assume that ...

>> --- a/xen/tools/gen-cpuid.py
>> +++ b/xen/tools/gen-cpuid.py
>> @@ -255,8 +255,9 @@ def crunch_numbers(state):
>>  
>>          # This is just the dependency between AVX512 and AVX2 of XSTATE
>>          # feature flags.  If want to use AVX512, AVX2 must be supported and
>> -        # enabled.
>> -        AVX2: [AVX512F],
>> +        # enabled.  Certain later extensions, acting on 256-bit vectors of
>> +        # integers, better depend on AVX2 than AVX.
>> +        AVX2: [AVX512F, VPCLMULQDQ],
> 
> Hmm - this is awkward, because in practice, there won't be any hardware
> in existence with VPCLMULQDQ and AVX2 but lacking AVX512.
> 
> However, the VEX encoding is legitimate in the absence of the EVEX
> encoding, and doesn't depend on the AVX512 XCR0 state.
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

... this is meant for a patch with that addition to the dependency tree.

Btw - for GFNI and its VEX/EVEX forms this gets even more interesting.
It didn't even occur to me that there could be new hardware supporting
GFNI but no AVX* whatsoever, but I've been told the current split
between SDM and ISA extensions doc is to reflect exactly this.

Jan




* Re: [Xen-devel] [PATCH v8 49/50] x86emul: add a SHA test case to the harness
  2019-03-15 11:08   ` [PATCH v8 49/50] x86emul: add a SHA " Jan Beulich
@ 2019-06-21 13:51     ` Andrew Cooper
  2019-06-21 14:10       ` Jan Beulich
  0 siblings, 1 reply; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 13:51 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:08, Jan Beulich wrote:
> +        /*
> +         * SHA256RNDS2
> +         *
> +         * SRC1 = { C0, D0, G0, H0 }
> +         * SRC2 = { A0, B0, E0, F0 }
> +         * XMM0 = W' = { ?, ?, WK1, WK0 }
> +         *
> +         * (NB that the notation again is not C-like, i.e. elements are listed
> +         * high-to-low everywhere in this comment.)
> +         *
> +         * Ch(E,F,G) = (E & F) ^ (~E & G)
> +         * Maj(A,B,C) = (A & B) ^ (A & C) ^ (B & C)
> +         *
> +         * Σ0(A) = ROR2(A) ^ ROR13(A) ^ ROR22(A)
> +         * Σ1(E) = ROR6(E) ^ ROR11(E) ^ ROR25(E)

There look to be some encoding problems here (and later on in the
comment).  I can't tell whether they are a consequence of the email or
something present in the underlying patch.

> --- a/tools/tests/x86_emulator/test_x86_emulator.c
> +++ b/tools/tests/x86_emulator/test_x86_emulator.c
> @@ -155,6 +158,21 @@ static bool simd_check_avx512vbmi_vl(voi
>      return cpu_has_avx512_vbmi && cpu_has_avx512vl;
>  }
>  
> +static bool simd_check_sse4_sha(void)
> +{
> +    return cpu_has_sha && cpu_has_sse4_2;

The legacy instruction isn't listed as having any dependency other than
the SHA bit.

> +}
> +
> +static bool simd_check_avx_sha(void)
> +{
> +    return cpu_has_sha && cpu_has_avx;

I can't locate any [E]VEX encoding information for the SHA
instructions.  Is this a side effect of the rest of the test algorithm,
or am I missing something in the manual?

~Andrew


* Re: [Xen-devel] [PATCH v8 50/50] x86emul: add a PCLMUL/VPCLMUL test case to the harness
  2019-03-15 11:08   ` [PATCH v8 50/50] x86emul: add a PCLMUL/VPCLMUL " Jan Beulich
@ 2019-06-21 13:58     ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 13:58 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 15/03/2019 11:08, Jan Beulich wrote:
> Also use this for AVX512_VBMI2 VPSH{L,R}D{,V}{D,Q,W} testing (only the
> quad word right shifts get actually used; the assumption is that their
> "left" counterparts as well as the double word and word forms then work
> as well).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> (subject to all
related concerns raised on earlier patches being resolved in a
consistent manner)


* Re: [Xen-devel] [PATCH v8 46/50] x86emul: support GFNI insns
  2019-06-21 13:19     ` [Xen-devel] " Andrew Cooper
  2019-06-21 13:33       ` Andrew Cooper
@ 2019-06-21 14:00       ` Jan Beulich
  2019-06-21 14:20         ` Andrew Cooper
  1 sibling, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-06-21 14:00 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 21.06.19 at 15:19, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 11:06, Jan Beulich wrote:
>> Note that the ISA extensions document revision 035 is ambiguous
>> regarding fault suppression for VGF2P8MULB: Text says it's supported,
>> while the exception specification listed is E4NF. Given the wording here
>> and for the other two insns I'm inclined to trust the text more than the
>> exception reference, which was also confirmed informally.
> 
> Version 037 has the exception reference as E4 rather than E4NF, so I
> think this entire paragraph is stale now and can be dropped.

Oh, indeed, they've corrected that.

> (On a tangent, AVX512_VP2INTERSECT now exists in the extensions doc.)

And I have it implemented, but no way to test until sde supports it.

>> As to the feature dependency adjustment, while strictly speaking SSE is
>> a sufficient prereq (to have XMM registers), vectors of bytes and qwords
>> have got introduced only with SSE2. gcc, for example, uses a similar
>> connection in its respective intrinsics header.
> 
> This is stale now that you've moved the other integer dependences to SSE2.

Hmm, it's redundant with that other change, but not really stale.
I can drop it if you want.

>> @@ -138,6 +141,26 @@ static bool simd_check_avx512vbmi_vl(voi
>>      return cpu_has_avx512_vbmi && cpu_has_avx512vl;
>>  }
>>  
>> +static bool simd_check_sse2_gf(void)
>> +{
>> +    return cpu_has_gfni && cpu_has_sse2;
> 
> This dependency doesn't match the manual.  The legacy encoding needs
> GFNI alone.
> 
> gen-cpuid.py is trying to reduce the ability to create totally
> implausible configurations via levelling, but for software checks, we
> should follow the manual to the letter.

This is test harness code - I'd rather be a little more strict here than
having to needlessly spend time fixing an issue in there. Furthermore
this matches how gcc enforces dependencies.

>> +}
>> +
>> +static bool simd_check_avx2_gf(void)
>> +{
>> +    return cpu_has_gfni && cpu_has_avx2;
> 
> Here, the dependency is only on AVX, which I think is probably trying to
> express a dependency on xcr0.ymm

Mostly as per above, except that here gcc (imo wrongly) enables just
AVX.

>> +}
>> +
>> +static bool simd_check_avx512bw_gf(void)
>> +{
>> +    return cpu_has_gfni && cpu_has_avx512bw;
> 
> I don't see any BW interaction anywhere (in the manual), despite the
> fact it operates on a datatype of int8.

But by operating on vectors of bytes, it requires 64-bit-wide mask
registers, which is the connection to BW. Again, gcc also does so.
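
To illustrate that connection with the gcc/Intel intrinsics (a sketch;
gcc wants -mgfni -mavx512bw here precisely because the per-byte mask
is an __mmask64):

    #include <immintrin.h>

    static __m512i masked_gfmul(__mmask64 k, __m512i a, __m512i b)
    {
        /* Zero-masking per-byte GF(2^8) multiply: one mask bit per byte. */
        return _mm512_maskz_gf2p8mul_epi8(k, a, b);
    }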

Jan




* Re: [Xen-devel] [PATCH v8 48/50] x86emul: add an AES/VAES test case to the harness
  2019-06-21 13:36     ` [Xen-devel] " Andrew Cooper
@ 2019-06-21 14:04       ` Jan Beulich
  2019-06-21 14:20         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-06-21 14:04 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 21.06.19 at 15:36, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 11:07, Jan Beulich wrote:
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> This feels like it should be folded with patch 45 (or, perhaps easier,
> patch 45 moved later and folded into this one; the exact ordering of
> patches really doesn't matter).

Not really imo - we've had AESNI support even before this series.
Test harness coverage for everything gets added here. Apart
from this the late placement in the series is attributed to its history:
It took me a while after having finished the main AVX512 work to
find time to actually come up with at least marginally sensible tests
here.

I'd prefer if things remained split the way they are.

>> @@ -91,6 +95,16 @@ static bool simd_check_xop(void)
>>      return cpu_has_xop;
>>  }
>>  
>> +static bool simd_check_ssse3_aes(void)
>> +{
>> +    return cpu_has_aesni && cpu_has_ssse3;
>> +}
>> +
>> +static bool simd_check_avx_aes(void)
>> +{
>> +    return cpu_has_aesni && cpu_has_avx;
>> +}
>> +
>>  static bool simd_check_avx512f(void)
>>  {
>>      return cpu_has_avx512f;
>> @@ -141,6 +155,22 @@ static bool simd_check_avx512vbmi_vl(voi
>>      return cpu_has_avx512_vbmi && cpu_has_avx512vl;
>>  }
>>  
>> +static bool simd_check_avx2_vaes(void)
>> +{
>> +    return cpu_has_aesni && cpu_has_vaes && cpu_has_avx2;
>> +}
>> +
>> +static bool simd_check_avx512bw_vaes(void)
>> +{
>> +    return cpu_has_aesni && cpu_has_vaes && cpu_has_avx512bw;
>> +}
>> +
>> +static bool simd_check_avx512bw_vaes_vl(void)
>> +{
>> +    return cpu_has_aesni && cpu_has_vaes &&
>> +           cpu_has_avx512bw && cpu_has_avx512vl;
>> +}
> 
> I've got the same concerns WRT feature tests as with the previous
> patch.  Everything else LGTM.

Right - let's settle on that aspect there first.

Jan




* Re: [Xen-devel] [PATCH v8 49/50] x86emul: add a SHA test case to the harness
  2019-06-21 13:51     ` [Xen-devel] " Andrew Cooper
@ 2019-06-21 14:10       ` Jan Beulich
  2019-06-21 14:23         ` Andrew Cooper
  0 siblings, 1 reply; 465+ messages in thread
From: Jan Beulich @ 2019-06-21 14:10 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 21.06.19 at 15:51, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 11:08, Jan Beulich wrote:
>> +        /*
>> +         * SHA256RNDS2
>> +         *
>> +         * SRC1 = { C0, D0, G0, H0 }
>> +         * SRC2 = { A0, B0, E0, F0 }
>> +         * XMM0 = W' = { ?, ?, WK1, WK0 }
>> +         *
>> +         * (NB that the notation again is not C-like, i.e. elements are listed
>> +         * high-to-low everywhere in this comment.)
>> +         *
>> +         * Ch(E,F,G) = (E & F) ^ (~E & G)
>> +         * Maj(A,B,C) = (A & B) ^ (A & C) ^ (B & C)
>> +         *
>> +         * Σ0(A) = ROR2(A) ^ ROR13(A) ^ ROR22(A)
>> +         * Σ1(E) = ROR6(E) ^ ROR11(E) ^ ROR25(E)
> 
> This looks like some encoding problems (and later on in the comment).  I
> can't tell whether it is a consequence of the email or something present
> in the underlying patch. 

Oh - I forgot to properly enforce UTF8 while composing the mail.
This is what it'll look like when committed:

        /*
         * SHA256RNDS2
         *
         * SRC1 = { C0, D0, G0, H0 }
         * SRC2 = { A0, B0, E0, F0 }
         * XMM0 = W' = { ?, ?, WK1, WK0 }
         *
         * (NB that the notation again is not C-like, i.e. elements are listed
         * high-to-low everywhere in this comment.)
         *
         * Ch(E,F,G) = (E & F) ^ (~E & G)
         * Maj(A,B,C) = (A & B) ^ (A & C) ^ (B & C)
         *
         * Σ0(A) = ROR2(A) ^ ROR13(A) ^ ROR22(A)
         * Σ1(E) = ROR6(E) ^ ROR11(E) ^ ROR25(E)
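
For reference, the same functions rendered as scalar C (a sketch;
ror32() is a plain 32-bit rotate right, valid for 1 <= n <= 31):

    #include <stdint.h>

    static inline uint32_t ror32(uint32_t x, unsigned int n)
    {
        return (x >> n) | (x << (32 - n));
    }

    static inline uint32_t Ch(uint32_t e, uint32_t f, uint32_t g)
    {
        return (e & f) ^ (~e & g);
    }

    static inline uint32_t Maj(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) ^ (a & c) ^ (b & c);
    }

    static inline uint32_t sigma0(uint32_t a)   /* Σ0 */
    {
        return ror32(a, 2) ^ ror32(a, 13) ^ ror32(a, 22);
    }

    static inline uint32_t sigma1(uint32_t e)   /* Σ1 */
    {
        return ror32(e, 6) ^ ror32(e, 11) ^ ror32(e, 25);
    }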

>> +static bool simd_check_avx_sha(void)
>> +{
>> +    return cpu_has_sha && cpu_has_avx;
> 
> I can't locate any [E]VEX encoding information for the SHA
> instructions.  Is this a side effect of the rest of the test algorithm,
> or am I missing something in the manual?

There are only legacy forms as of now. Note the single sentence in
the commit message:

"Also use this for AVX512VL VPRO{L,R}{,V}D as well as some further shifts
 testing."

The AVX and AVX512F flavors of the test are good test cases for other
insns; they are still only utilizing legacy encoded SHA insns.

Jan



* Re: [Xen-devel] [PATCH v8 46/50] x86emul: support GFNI insns
  2019-06-21 14:00       ` Jan Beulich
@ 2019-06-21 14:20         ` Andrew Cooper
  2019-06-21 15:02           ` Jan Beulich
  2019-06-25  6:48           ` Jan Beulich
  0 siblings, 2 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 14:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 21/06/2019 15:00, Jan Beulich wrote:
>> On 15/03/2019 11:06, Jan Beulich wrote:
>> (On a tangent, AVX512_VP2INTERSECT now exists in the extensions doc.)
> And I have it implemented, but no way to test until sde supports it.

Fair enough.

>>> @@ -138,6 +141,26 @@ static bool simd_check_avx512vbmi_vl(voi
>>>      return cpu_has_avx512_vbmi && cpu_has_avx512vl;
>>>  }
>>>  
>>> +static bool simd_check_sse2_gf(void)
>>> +{
>>> +    return cpu_has_gfni && cpu_has_sse2;
>> This dependency doesn't match the manual.  The legacy encoding needs
>> GFNI alone.
>>
>> gen-cpuid.py is trying to reduce the ability to create totally
>> implausible configurations via levelling, but for software checks, we
>> should follow the manual to the letter.
> This is test harness code - I'd rather be a little more strict here than
> having to needlessly spend time fixing an issue in there. Furthermore
> this matches how gcc enforces dependencies.
>>> +}
>>> +
>>> +static bool simd_check_avx2_gf(void)
>>> +{
>>> +    return cpu_has_gfni && cpu_has_avx2;
>> Here, the dependency is only on AVX, which I think is probably trying to
>> express a dependency on xcr0.ymm
> Mostly as per above, except that here gcc (imo wrongly) enables just
> AVX.
>
>>> +}
>>> +
>>> +static bool simd_check_avx512bw_gf(void)
>>> +{
>>> +    return cpu_has_gfni && cpu_has_avx512bw;
>> I don't see any BW interaction anywhere (in the manual), despite the
>> fact it operates on a datatype of int8.
> But by operating on vectors of bytes, it requires 64 bits wide mask
> registers, which is the connection to BW. Again gcc also does so.

To be honest, it doesn't matter what GCC does.

What matters is the expectation of arbitrary library/application code,
written in adherence to the Intel manual, when running with a levelled
CPUID policy, because *that* is the set of corner cases where things
might end up exploding.

I see your point about needing a full width mask register, which to me
suggests that the extension manual is documenting the dependency
incorrectly.

It also means that I need to change how we do feature dependency
derivation, because this is the first example of a conditional
dependency.  I.e. AVX512F but not AVX512BW implies no GFNI even if
hardware has it, but on the same hardware when levelling AVX512F out,
GFNI could be used via its legacy or VEX version.

~Andrew


* Re: [Xen-devel] [PATCH v8 48/50] x86emul: add an AES/VAES test case to the harness
  2019-06-21 14:04       ` Jan Beulich
@ 2019-06-21 14:20         ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 14:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 21/06/2019 15:04, Jan Beulich wrote:
>>>> On 21.06.19 at 15:36, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 11:07, Jan Beulich wrote:
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> This feels like it should be folded with patch 45 (or perhaps easier, 45
>> moved later and folded into this one.  The exact ordering of patches
>> really doesn't matter).
> Not really imo - we've had AESNI support even before this series.
> Test harness coverage for everything gets added here. Apart
> from this the late placement in the series is attributed to its history:
> It took me a while after having finished the main AVX512 work to
> find time to actually come up with at least marginally sensible tests
> here.
>
> I'd prefer if things remained split the way they are.

Fair enough.  Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [Xen-devel] [PATCH v8 49/50] x86emul: add a SHA test case to the harness
  2019-06-21 14:10       ` Jan Beulich
@ 2019-06-21 14:23         ` Andrew Cooper
  0 siblings, 0 replies; 465+ messages in thread
From: Andrew Cooper @ 2019-06-21 14:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

On 21/06/2019 15:10, Jan Beulich wrote:
>>>> On 21.06.19 at 15:51, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 11:08, Jan Beulich wrote:
>>> +        /*
>>> +         * SHA256RNDS2
>>> +         *
>>> +         * SRC1 = { C0, D0, G0, H0 }
>>> +         * SRC2 = { A0, B0, E0, F0 }
>>> +         * XMM0 = W' = { ?, ?, WK1, WK0 }
>>> +         *
>>> +         * (NB that the notation again is not C-like, i.e. elements are listed
>>> +         * high-to-low everywhere in this comment.)
>>> +         *
>>> +         * Ch(E,F,G) = (E & F) ^ (~E & G)
>>> +         * Maj(A,B,C) = (A & B) ^ (A & C) ^ (B & C)
>>> +         *
>>> +         * Σ0(A) = ROR2(A) ^ ROR13(A) ^ ROR22(A)
>>> +         * Σ1(E) = ROR6(E) ^ ROR11(E) ^ ROR25(E)
>> There look to be some encoding problems here (and later on in the
>> comment).  I can't tell whether they are a consequence of the email or
>> something present in the underlying patch.
> Oh - I forgot to properly enforce UTF8 while composing the mail.
> This is what it'll look like when committed:
>
>         /*
>          * SHA256RNDS2
>          *
>          * SRC1 = { C0, D0, G0, H0 }
>          * SRC2 = { A0, B0, E0, F0 }
>          * XMM0 = W' = { ?, ?, WK1, WK0 }
>          *
>          * (NB that the notation again is not C-like, i.e. elements are listed
>          * high-to-low everywhere in this comment.)
>          *
>          * Ch(E,F,G) = (E & F) ^ (~E & G)
>          * Maj(A,B,C) = (A & B) ^ (A & C) ^ (B & C)
>          *
>          * Σ0(A) = ROR2(A) ^ ROR13(A) ^ ROR22(A)
>          * Σ1(E) = ROR6(E) ^ ROR11(E) ^ ROR25(E)

That reads much better.

>
>>> +static bool simd_check_avx_sha(void)
>>> +{
>>> +    return cpu_has_sha && cpu_has_avx;
>> I can't locate any [E]VEX encoding information for the SHA
>> instructions.  Is this a side effect of the rest of the test algorithm,
>> or am I missing something in the manual?
> There are only legacy forms as of now. Note the single sentence in
> the commit message:
>
> "Also use this for AVX512VL VPRO{L,R}{,V}D as well as some further shifts
>  testing."
>
> The AVX and AVX512F flavors of the test are good test cases for other
> insns; they are still only utilizing legacy encoded SHA insns.

Ok.  Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [Xen-devel] [PATCH v8 46/50] x86emul: support GFNI insns
  2019-06-21 14:20         ` Andrew Cooper
@ 2019-06-21 15:02           ` Jan Beulich
  2019-06-25  6:48           ` Jan Beulich
  1 sibling, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-21 15:02 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 21.06.19 at 16:20, <andrew.cooper3@citrix.com> wrote:
> On 21/06/2019 15:00, Jan Beulich wrote:
>>> On 15/03/2019 11:06, Jan Beulich wrote:
>>> (On a tangent, AVX512_VP2INTERSECT now exists in the extensions doc.)
>> And I have it implemented, but no way to test until sde supports it.
> 
> Fair enough.
> 
>>>> @@ -138,6 +141,26 @@ static bool simd_check_avx512vbmi_vl(voi
>>>>      return cpu_has_avx512_vbmi && cpu_has_avx512vl;
>>>>  }
>>>>  
>>>> +static bool simd_check_sse2_gf(void)
>>>> +{
>>>> +    return cpu_has_gfni && cpu_has_sse2;
>>> This dependency doesn't match the manual.  The legacy encoding needs
>>> GFNI alone.
>>>
>>> gen-cpuid.py is trying to reduce the ability to create totally
>>> implausible configurations via levelling, but for software checks, we
>>> should follow the manual to the letter.
>> This is test harness code - I'd rather be a little more strict here than
>> having to needlessly spend time fixing an issue in there. Furthermore
>> this matches how gcc enforces dependencies.
>>>> +}
>>>> +
>>>> +static bool simd_check_avx2_gf(void)
>>>> +{
>>>> +    return cpu_has_gfni && cpu_has_avx2;
>>> Here, the dependency is only on AVX, which I think is probably trying to
>>> express a dependency on xcr0.ymm
>> Mostly as per above, except that here gcc (imo wrongly) enables just
>> AVX.
>>
>>>> +}
>>>> +
>>>> +static bool simd_check_avx512bw_gf(void)
>>>> +{
>>>> +    return cpu_has_gfni && cpu_has_avx512bw;
>>> I don't see any BW interaction anywhere (in the manual), despite the
>>> fact it operates on a datatype of int8.
>> But by operating on vectors of bytes, it requires 64 bits wide mask
>> registers, which is the connection to BW. Again gcc also does so.
> 
> To be honest, it doesn't matter what GCC does.
> 
> What matters is the expectation of arbitrary library/application code,
> written in adherence to the Intel manual, when running with a levelled
> CPUID policy, because *that* is the set of corner cases where things
> might end up exploding.

I agree, and I would agree to making the changes you've asked for if
it were code in the main emulator. But as said - you've commented on
test harness qualification functions.

> I see your point about needing a full width mask register, which to me
> suggests that the extension manual is documenting the dependency
> incorrectly.
> 
> It also means that I need to change how we do feature dependency
> derivation, because this is the first example of a conditional
> dependency.  I.e. AVX512F but not AVX512BW implies no GFNI even if
> hardware has it, but on the same hardware when levelling AVX512F out,
> GFNI could be used via its legacy or VEX version.

Right, the way GFNI is specified isn't overly helpful. Otoh - why the
former? The legacy and VEX versions are usable in that case, too.

Jan




* Re: [Xen-devel] [PATCH v8 46/50] x86emul: support GFNI insns
  2019-06-21 14:20         ` Andrew Cooper
  2019-06-21 15:02           ` Jan Beulich
@ 2019-06-25  6:48           ` Jan Beulich
  1 sibling, 0 replies; 465+ messages in thread
From: Jan Beulich @ 2019-06-25  6:48 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Wei Liu, Roger Pau Monne

>>> On 21.06.19 at 16:20, <andrew.cooper3@citrix.com> wrote:
> On 21/06/2019 15:00, Jan Beulich wrote:
>>> On 15/03/2019 11:06, Jan Beulich wrote:
>>>> @@ -138,6 +141,26 @@ static bool simd_check_avx512vbmi_vl(voi
>>>>      return cpu_has_avx512_vbmi && cpu_has_avx512vl;
>>>>  }
>>>>  
>>>> +static bool simd_check_sse2_gf(void)
>>>> +{
>>>> +    return cpu_has_gfni && cpu_has_sse2;
>>> This dependency doesn't match the manual.  The legacy encoding needs
>>> GFNI alone.
>>>
>>> gen-cpuid.py is trying to reduce the ability to create totally
>>> implausible configurations via levelling, but for software checks, we
>>> should follow the manual to the letter.
>> This is test harness code - I'd rather be a little more strict here than
>> having to needlessly spend time fixing an issue in there. Furthermore
>> this matches how gcc enforces dependencies.
>>>> +}
>>>> +
>>>> +static bool simd_check_avx2_gf(void)
>>>> +{
>>>> +    return cpu_has_gfni && cpu_has_avx2;
>>> Here, the dependency is only on AVX, which I think is probably trying to
>>> express a dependency on xcr0.ymm
>> Mostly as per above, except that here gcc (imo wrongly) enables just
>> AVX.
>>
>>>> +}
>>>> +
>>>> +static bool simd_check_avx512bw_gf(void)
>>>> +{
>>>> +    return cpu_has_gfni && cpu_has_avx512bw;
>>> I don't see any BW interaction anywhere (in the manual), despite the
>>> fact it operates on a datatype of int8.
>> But by operating on vectors of bytes, it requires 64 bits wide mask
>> registers, which is the connection to BW. Again gcc also does so.
> 
> To be honest, it doesn't matter what GCC does.

Coming back to this - it very much matters _here_ (i.e. in test harness
code): For one, the checks above need to be in line with the -m<isa>
options we pass to the compiler. I.e. if anything, the question on the
Makefile additions might be why I enable SSE2, AVX2, and AVX512BW
respectively.

While I could simply claim that this is my choice as to producing
sensible test case binary blobs, there's a compiler aspect _there_:
gcc's intrinsics header enables SSE2, AVX, and AVX512F / AVX512BW
around the definitions of the inline wrappers around the builtins.
This in turn is because the availability of the respective builtins
depends on these ISAs being enabled alongside GFNI itself. We
therefore may not use any ISA level _lower_ than what the
builtins require. As mentioned before, seeing them use SSE2 as
prereq for the legacy encoded insns, I question their use of AVX
instead of AVX2, and hence I'd prefer to stick with AVX2; if nothing
else then simply because of what I've said in the first half sentence
of this paragraph. (My guess is that it's the availability of
{,V}MOVDQ{A,U}* that determined their choice, rather than the
general availability of vector operations on the given vector and
element types and sizes. I'm sure this would lead to some rather
"funny" generated code when inputs come from other
transformations rather than straight from memory.)
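
To illustrate the constraint: with gcc the builtin is only exposed
when the wrapping ISA is enabled as well, e.g. via a target attribute
(a sketch; the function name is made up):

    #include <immintrin.h>

    /* Legacy-encoded form: gcc's header insists on SSE2 next to GFNI. */
    __attribute__((target("gfni,sse2")))
    static __m128i gfmul128(__m128i a, __m128i b)
    {
        return _mm_gf2p8mul_epi8(a, b);
    }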

Jan




Thread overview: 465+ messages
2018-08-09  8:15 [PATCH 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
2018-08-09  8:23 ` [PATCH 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
2018-08-09  8:24 ` [PATCH 2/6] x86emul: extend MASKMOV{Q,DQU} tests Jan Beulich
2018-08-09  8:24 ` [PATCH 3/6] x86emul: support AVX512 opmask insns Jan Beulich
2018-08-09  8:25 ` [PATCH 4/6] x86emul: clean up AVX2 insn use in test harness Jan Beulich
2018-08-09  8:25 ` [PATCH 5/6] x86emul: correct EVEX decoding Jan Beulich
2018-08-09  8:26 ` [PATCH 6/6] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
2018-08-29 14:20 ` [PATCH v2 0/6] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
2018-08-29 14:23   ` [PATCH v2 1/6] x86emul: fix FMA scalar operand sizes Jan Beulich
2018-09-03 16:43     ` Andrew Cooper
2018-09-04  7:52       ` Jan Beulich
2018-08-29 14:23   ` [PATCH v2 2/6] x86emul: extend MASKMOV{Q,DQU} tests Jan Beulich
2018-09-03 16:44     ` [PATCH v2 2/6] x86emul: extend MASKMOV{Q, DQU} tests Andrew Cooper
2018-08-29 14:24   ` [PATCH v2 3/6] x86emul: support AVX512 opmask insns Jan Beulich
2018-09-03 17:57     ` Andrew Cooper
2018-09-04  7:58       ` Jan Beulich
2018-08-29 14:24   ` [PATCH v2 4/6] x86emul: clean up AVX2 insn use in test harness Jan Beulich
2018-09-03 18:04     ` Andrew Cooper
2018-08-29 14:25   ` [PATCH v2 5/6] x86emul: correct EVEX decoding Jan Beulich
2018-09-04 10:48     ` Andrew Cooper
2018-09-04 12:48       ` Jan Beulich
2018-08-29 14:25   ` [PATCH v2 6/6] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
2018-09-04 11:02     ` Andrew Cooper
2018-09-04 12:50       ` Jan Beulich
2018-09-18 11:46 ` [PATCH v3 00/34] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
2018-09-18 11:53   ` [PATCH v3 01/34] x86emul: support AVX512 opmask insns Jan Beulich
2018-10-25 18:32     ` Andrew Cooper
2018-10-26  9:03       ` Jan Beulich
2018-10-26 11:29         ` Andrew Cooper
2018-10-26 11:59           ` Jan Beulich
2018-10-26 12:19             ` Andrew Cooper
2018-10-26 12:34               ` Jan Beulich
2018-09-18 11:53   ` [PATCH v3 02/34] x86/HVM: grow MMIO cache data size to 64 bytes Jan Beulich
2018-09-18 16:05     ` Paul Durrant
2018-10-25 18:36     ` Andrew Cooper
2018-10-26  9:04       ` Jan Beulich
2018-10-26 11:29         ` Andrew Cooper
2018-09-18 11:55   ` [PATCH v3 03/34] x86emul: correct EVEX decoding Jan Beulich
2018-09-18 11:55   ` [PATCH v3 04/34] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
2018-09-18 11:56   ` [PATCH v3 05/34] x86emul: support basic AVX512 moves Jan Beulich
2018-09-18 11:57   ` [PATCH v3 06/34] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
2018-09-18 11:57   ` [PATCH v3 07/34] x86emul: use AVX512 logic for emulating V{, P}MASKMOV* Jan Beulich
2018-09-18 11:58   ` [PATCH v3 08/34] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
2018-09-18 11:59   ` [PATCH v3 09/34] x86emul: support AVX512DQ logic " Jan Beulich
2018-09-18 11:59   ` [PATCH v3 10/34] x86emul: support AVX512F "normal" FP compare insns Jan Beulich
2018-09-18 12:00   ` [PATCH v3 11/34] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
2018-09-18 12:00   ` [PATCH v3 12/34] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
2018-09-18 12:01   ` [PATCH v3 13/34] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
2018-09-18 12:02   ` [PATCH v3 14/34] x86emul: support AVX512{F, DQ} FP broadcast insns Jan Beulich
2018-09-18 12:03   ` [PATCH v3 15/34] x86emul: support AVX512F v{, u}comis{d, s} insns Jan Beulich
2018-09-18 12:03   ` [PATCH v3 16/34] x86emul/test: introduce eq() Jan Beulich
2018-09-18 12:04   ` [PATCH v3 17/34] x86emul: support AVX512{F, BW} packed integer compare insns Jan Beulich
2018-09-18 12:05   ` [PATCH v3 18/34] x86emul: support AVX512{F, BW} packed integer arithmetic insns Jan Beulich
2018-09-18 12:05   ` [PATCH v3 19/34] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
2018-09-18 12:05   ` [PATCH v3 20/34] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
2018-09-18 12:06   ` [PATCH v3 21/34] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
2018-09-18 12:07   ` [PATCH v3 22/34] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
2018-09-18 12:07   ` [PATCH v3 23/34] x86emul: basic AVX512F testing Jan Beulich
2018-09-18 12:08   ` [PATCH v3 24/34] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
2018-09-18 12:09   ` [PATCH v3 25/34] x86emul: basic AVX512VL testing Jan Beulich
2018-09-18 12:09   ` [PATCH v3 26/34] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
2018-09-18 12:09   ` [PATCH v3 27/34] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
2018-09-18 12:10   ` [PATCH v3 28/34] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
2018-09-18 12:11   ` [PATCH v3 29/34] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
2018-09-18 12:11   ` [PATCH v3 29/34] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
2018-09-18 12:12   ` [PATCH v3 30/34] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
2018-09-18 12:14   ` [PATCH v3 32/34] x86emul: basic AVX512BW testing Jan Beulich
2018-09-18 12:14   ` [PATCH v3 33/34] x86emul: basic AVX512DQ testing Jan Beulich
2018-09-18 12:14   ` [PATCH v3 34/34] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
2018-09-25 13:14 ` [PATCH v4 00/44] x86emul: fixes, improvements, and beginnings of AVX512 support Jan Beulich
2018-09-25 13:25   ` [PATCH v4 01/44] x86emul: support AVX512 opmask insns Jan Beulich
2018-09-26  6:06     ` Jan Beulich
2018-09-25 13:26   ` [PATCH v4 02/44] x86/HVM: grow MMIO cache data size to 64 bytes Jan Beulich
2018-09-25 13:27   ` [PATCH v4 03/44] x86emul: correct EVEX decoding Jan Beulich
2018-10-26 14:33     ` Andrew Cooper
2018-09-25 13:28   ` [PATCH v4 04/44] x86emul: generalize vector length handling for AVX512/EVEX Jan Beulich
2018-10-26 16:10     ` Andrew Cooper
2018-09-25 13:28   ` [PATCH v4 05/44] x86emul: support basic AVX512 moves Jan Beulich
2018-11-13 17:12     ` Andrew Cooper
2018-11-14 14:35       ` Jan Beulich
2018-11-14 16:26         ` Andrew Cooper
2018-11-15  9:54           ` Jan Beulich
2018-09-25 13:29   ` [PATCH v4 06/44] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
2018-11-12 17:42     ` Andrew Cooper
2018-11-13 11:12       ` Jan Beulich
2018-11-13 15:45         ` Andrew Cooper
2018-11-14 14:17           ` Jan Beulich
2018-11-14 14:42             ` Andrew Cooper
2018-11-14 14:58               ` Jan Beulich
2018-09-25 13:29   ` [PATCH v4 07/44] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
2018-11-12 17:50     ` Andrew Cooper
2018-11-13 11:42       ` Jan Beulich
2018-11-13 15:58         ` Andrew Cooper
2018-11-14  8:42           ` Jan Beulich
2018-09-25 13:30   ` [PATCH v4 08/44] x86emul: use AVX512 logic for emulating V{, P}MASKMOV* Jan Beulich
2018-11-13 18:12     ` Andrew Cooper
2018-09-25 13:31   ` [PATCH v4 09/44] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
2018-11-13 18:21     ` Andrew Cooper
2018-09-25 13:32   ` [PATCH v4 10/44] x86emul: support AVX512DQ logic " Jan Beulich
2018-11-13 18:56     ` Andrew Cooper
2018-09-25 13:32   ` [PATCH v4 11/44] x86emul: support AVX512F "normal" FP compare insns Jan Beulich
2018-11-13 19:04     ` Andrew Cooper
2018-11-14 14:41       ` Jan Beulich
2018-11-14 14:45         ` Andrew Cooper
2018-09-25 13:33   ` [PATCH v4 12/44] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
2018-11-13 19:17     ` Andrew Cooper
2018-09-25 13:33   ` [PATCH v4 13/44] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
2018-11-13 19:28     ` Andrew Cooper
2018-09-25 13:34   ` [PATCH v4 14/44] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
2018-11-13 19:30     ` Andrew Cooper
2018-09-25 13:35   ` [PATCH v4 15/44] x86emul: support AVX512{F, DQ} FP broadcast insns Jan Beulich
2018-11-13 19:37     ` Andrew Cooper
2018-09-25 13:35   ` [PATCH v4 16/44] x86emul: support AVX512F v{, u}comis{d, s} insns Jan Beulich
2018-11-13 19:39     ` Andrew Cooper
2018-09-25 13:36   ` [PATCH v4 17/44] x86emul/test: introduce eq() Jan Beulich
2018-10-26 11:31     ` Andrew Cooper
2018-09-25 13:37   ` [PATCH v4 18/44] x86emul: support AVX512{F, BW} packed integer compare insns Jan Beulich
2018-09-25 13:37   ` [PATCH v4 19/44] x86emul: support AVX512{F, BW} packed integer arithmetic insns Jan Beulich
2018-09-25 13:38   ` [PATCH v4 20/44] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
2018-09-25 13:39   ` [PATCH v4 21/44] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
2018-09-25 13:40   ` [PATCH v4 22/44] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
2018-09-25 13:40   ` [PATCH v4 23/44] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
2018-09-25 13:41   ` [PATCH v4 24/44] x86emul: basic AVX512F testing Jan Beulich
2018-09-25 13:41   ` [PATCH v4 25/44] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
2018-09-25 13:42   ` [PATCH v4 26/44] x86emul: basic AVX512VL testing Jan Beulich
2018-09-25 13:43   ` [PATCH v4 27/44] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
2018-09-25 13:43   ` [PATCH v4 28/44] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
2018-09-25 13:44   ` [PATCH v4 29/44] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
2018-09-25 13:44   ` [PATCH v4 30/44] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
2018-09-25 13:46   ` [PATCH v4 31/44] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
2018-09-25 13:46   ` [PATCH v4 32/44] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
2018-09-25 13:47   ` [PATCH v4 33/44] x86emul: basic AVX512BW testing Jan Beulich
2018-09-25 13:48   ` [PATCH v4 34/44] x86emul: basic AVX512DQ testing Jan Beulich
2018-09-25 13:48   ` [PATCH v4 35/44] x86emul: support AVX512F move high/low insns Jan Beulich
2018-09-25 13:49   ` [PATCH v4 36/44] x86emul: support AVX512F move duplicate insns Jan Beulich
2018-09-25 13:49   ` [PATCH v4 37/44] x86emul: support AVX512{F, BW, VBMI} permute insns Jan Beulich
2018-09-25 13:50   ` [PATCH v4 38/44] x86emul: support AVX512BW pack insns Jan Beulich
2018-09-25 13:51   ` [PATCH v4 39/44] x86emul: support AVX512F floating-point conversion insns Jan Beulich
2018-09-25 13:52   ` [PATCH v4 40/44] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
2018-09-25 13:53   ` [PATCH v4 41/44] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
2018-09-25 13:53   ` [PATCH v4 42/44] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
2018-09-25 13:54   ` [PATCH v4 43/44] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
2018-09-25 13:55   ` [PATCH v4 44/44] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
2018-11-19 10:00 ` [PATCH v5 00/47] x86emul: fair parts of AVX512 support Jan Beulich
2018-11-19 10:13   ` [PATCH v5 01/47] x86emul: introduce IMPOSSIBLE() Jan Beulich
2018-11-19 18:11     ` Andrew Cooper
2018-11-20  8:12       ` Jan Beulich
2018-11-19 10:13   ` [PATCH v5 02/47] x86emul: support basic AVX512 moves Jan Beulich
2018-11-19 18:35     ` Andrew Cooper
2018-11-19 10:14   ` [PATCH v5 03/47] x86emul: test for correct EVEX Disp8 scaling Jan Beulich
2018-11-19 18:35     ` Andrew Cooper
2018-11-19 10:14   ` [PATCH v5 04/47] x86emul: also allow running the 32-bit harness on a 64-bit distro Jan Beulich
2018-11-19 18:40     ` Andrew Cooper
2018-11-19 10:15   ` [PATCH v5 05/47] x86emul: use AVX512 logic for emulating V{, P}MASKMOV* Jan Beulich
2018-11-19 10:16   ` [PATCH v5 06/47] x86emul: support AVX512F legacy-equivalent arithmetic FP insns Jan Beulich
2018-11-19 10:16   ` [PATCH v5 07/47] x86emul: support AVX512DQ logic " Jan Beulich
2018-11-19 10:17   ` [PATCH v5 08/47] x86emul: support basic AVX512F FP compare insns Jan Beulich
2018-11-19 10:17   ` [PATCH v5 09/47] x86emul: support AVX512F misc legacy-equivalent FP insns Jan Beulich
2018-11-19 10:18   ` [PATCH v5 10/47] x86emul: support AVX512F fused-multiply-add insns Jan Beulich
2018-11-19 10:18   ` [PATCH v5 11/47] x86emul: support AVX512F legacy-equivalent logic insns Jan Beulich
2018-11-19 10:19   ` [PATCH v5 12/47] x86emul: support AVX512{F, DQ} FP broadcast insns Jan Beulich
2018-11-19 18:44     ` Andrew Cooper
2018-11-19 10:20   ` [PATCH v5 13/47] x86emul: support AVX512F v{, u}comis{d, s} insns Jan Beulich
2018-11-19 10:21   ` [PATCH v5 14/47] x86emul: support AVX512{F, BW} packed integer compare insns Jan Beulich
2018-11-19 19:09     ` Andrew Cooper
2018-11-19 10:21   ` [PATCH v5 15/47] x86emul: support AVX512{F, BW} packed integer arithmetic insns Jan Beulich
2018-11-19 19:18     ` Andrew Cooper
2018-11-19 10:22   ` [PATCH v5 16/47] x86emul: use simd_128 also for legacy vector shift insns Jan Beulich
2018-11-19 19:21     ` Andrew Cooper
2018-11-19 10:22   ` [PATCH v5 17/47] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
2018-11-19 10:23   ` [PATCH v5 18/47] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
2018-11-19 10:23   ` [PATCH v5 19/47] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
2018-11-19 10:24   ` [PATCH v5 20/47] x86emul: basic AVX512F testing Jan Beulich
2018-11-19 10:25   ` [PATCH v5 21/47] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
2018-11-19 10:25   ` [PATCH v5 22/47] x86emul: basic AVX512VL testing Jan Beulich
2018-11-19 10:26   ` [PATCH v5 23/47] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
2018-11-19 10:26   ` [PATCH v5 24/47] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
2018-11-19 10:28   ` [PATCH v5 25/47] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
2018-11-19 10:28   ` [PATCH v5 26/47] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
2018-11-19 10:29   ` [PATCH v5 27/47] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
2018-11-19 10:30   ` [PATCH v5 28/47] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
2018-11-19 10:30   ` [PATCH v5 29/47] x86emul: basic AVX512BW testing Jan Beulich
2018-11-19 10:31   ` [PATCH v5 30/47] x86emul: basic AVX512DQ testing Jan Beulich
2018-11-19 10:31   ` [PATCH v5 31/47] x86emul: support AVX512F move high/low insns Jan Beulich
2018-11-19 10:31   ` [PATCH v5 32/47] x86emul: support AVX512F move duplicate insns Jan Beulich
2018-11-19 10:32   ` [PATCH v5 33/47] x86emul: support AVX512{F, BW, VBMI} permute insns Jan Beulich
2018-11-19 10:33   ` [PATCH v5 34/47] x86emul: support AVX512BW pack insns Jan Beulich
2018-11-19 10:33   ` [PATCH v5 35/47] x86emul: support AVX512F floating-point conversion insns Jan Beulich
2018-11-19 10:34   ` [PATCH v5 36/47] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
2018-11-19 10:35   ` [PATCH v5 37/47] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
2018-11-19 10:36   ` [PATCH v5 38/47] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
2018-11-19 10:37   ` [PATCH v5 39/47] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
2018-11-19 10:37   ` [PATCH v5 40/47] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
2018-11-19 10:38   ` [PATCH v5 41/47] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
2018-11-19 10:38   ` [PATCH v5 42/47] x86emul: support remaining AVX512BW " Jan Beulich
2018-11-19 10:39   ` [PATCH v5 43/47] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
2018-11-19 10:39   ` [PATCH v5 44/47] x86emul: support AVX512F floating point manipulation insns Jan Beulich
2018-11-19 10:40   ` [PATCH v5 45/47] x86emul: support AVX512DQ " Jan Beulich
2018-11-19 10:40   ` [PATCH v5 46/47] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
2018-11-19 10:41   ` [PATCH v5 47/47] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
2018-12-06  9:45 ` [PATCH v6 00/42] x86emul: fair parts of AVX512 support Jan Beulich
2018-12-06  9:51   ` [PATCH v6 01/42] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
2018-12-06  9:51   ` [PATCH v6 02/42] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
2018-12-06  9:51   ` [PATCH v6 03/42] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
2018-12-06  9:52   ` [PATCH v6 04/42] x86emul: basic AVX512F testing Jan Beulich
2018-12-06  9:53   ` [PATCH v6 05/42] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
2018-12-06  9:53   ` [PATCH v6 06/42] x86emul: basic AVX512VL testing Jan Beulich
2018-12-06  9:53   ` [PATCH v6 07/42] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
2018-12-06  9:54   ` [PATCH v6 08/42] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
2018-12-06  9:54   ` [PATCH v6 09/42] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
2018-12-06  9:55   ` [PATCH v6 10/42] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
2018-12-06  9:55   ` [PATCH v6 11/42] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
2018-12-06  9:55   ` [PATCH v6 12/42] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
2018-12-06  9:56   ` [PATCH v6 13/42] x86emul: basic AVX512BW testing Jan Beulich
2018-12-06  9:57   ` [PATCH v6 14/42] x86emul: basic AVX512DQ testing Jan Beulich
2018-12-06  9:57   ` [PATCH v6 15/42] x86emul: support AVX512F move high/low insns Jan Beulich
2018-12-06  9:58   ` [PATCH v6 16/42] x86emul: support AVX512F move duplicate insns Jan Beulich
2018-12-06  9:59   ` [PATCH v6 17/42] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
2018-12-06  9:59   ` [PATCH v6 18/42] x86emul: support AVX512BW pack insns Jan Beulich
2018-12-06 10:00   ` [PATCH v6 19/42] x86emul: support AVX512F floating-point conversion insns Jan Beulich
2018-12-06 10:00   ` [PATCH v6 20/42] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
2018-12-06 10:01   ` [PATCH v6 21/42] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
2018-12-06 10:01   ` [PATCH v6 22/42] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
2018-12-06 10:02   ` [PATCH v6 23/42] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
2018-12-06 10:02   ` [PATCH v6 24/42] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
2018-12-06 10:04   ` [PATCH v6 25/42] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
2018-12-06 10:04   ` [PATCH v6 26/42] x86emul: support remaining AVX512BW " Jan Beulich
2018-12-06 10:04   ` [PATCH v6 27/42] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
2018-12-06 10:05   ` [PATCH v6 28/42] x86emul: support AVX512F floating point manipulation insns Jan Beulich
2018-12-06 10:05   ` [PATCH v6 29/42] x86emul: support AVX512DQ " Jan Beulich
2018-12-06 10:06   ` [PATCH v6 30/42] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
2018-12-06 10:07   ` [PATCH v6 31/42] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
2018-12-06 10:07   ` [PATCH v6 32/42] x86emul: support AVX512F gather insns Jan Beulich
2018-12-06 10:08   ` [PATCH v6 33/42] x86emul: add high register S/G test cases Jan Beulich
2018-12-06 10:08   ` [PATCH v6 34/42] x86emul: support AVX512F scatter insns Jan Beulich
2018-12-06 10:09   ` [PATCH v6 35/42] x86emul: support AVX512PF insns Jan Beulich
2018-12-06 10:09   ` [PATCH v6 36/42] x86emul: support AVX512CD insns Jan Beulich
2018-12-06 10:10   ` [PATCH v6 37/42] x86emul: complete support of AVX512_VBMI insns Jan Beulich
2018-12-06 10:10   ` [PATCH v6 38/42] x86emul: support of AVX512* population count insns Jan Beulich
2018-12-06 10:11   ` [PATCH v6 39/42] x86emul: support of AVX512_IFMA insns Jan Beulich
2018-12-06 10:11   ` [PATCH v6 40/42] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
2018-12-06 10:12   ` [PATCH v6 41/42] x86emul: support AVX512_4FMAPS insns Jan Beulich
2018-12-06 10:12   ` [PATCH v6 42/42] x86emul: support AVX512_4VNNIW insns Jan Beulich
2018-12-19 14:17 ` [PATCH v7 00/49] x86emul: remaining AVX512 support Jan Beulich
2018-12-19 14:34   ` Jan Beulich
2018-12-19 14:35   ` [PATCH v7 01/49] x86emul: rename evex.br to evex.brs Jan Beulich
2018-12-19 15:14     ` Andrew Cooper
2018-12-19 14:36   ` [PATCH v7 02/49] x86emul: support AVX512{F, BW} shift/rotate insns Jan Beulich
2018-12-19 16:00     ` Andrew Cooper
2018-12-19 14:36   ` [PATCH v7 03/49] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
2018-12-19 18:20     ` Andrew Cooper
2018-12-20  7:49       ` Jan Beulich
2018-12-19 14:37   ` [PATCH v7 04/49] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
2019-03-14 15:35     ` Andrew Cooper
2018-12-19 14:37   ` [PATCH v7 05/49] x86emul: basic AVX512F testing Jan Beulich
2019-03-14 15:42     ` Andrew Cooper
2018-12-19 14:38   ` [PATCH v7 06/49] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
2019-03-14 16:38     ` Andrew Cooper
2019-03-14 17:15       ` Jan Beulich
2019-03-15 16:39         ` Andrew Cooper
2019-03-18  9:45           ` Jan Beulich
2018-12-19 14:38   ` [PATCH v7 07/49] x86emul: basic AVX512VL testing Jan Beulich
2019-03-14 16:39     ` Andrew Cooper
2018-12-19 14:41   ` [PATCH v7 08/49] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
2018-12-19 14:41   ` [PATCH v7 09/49] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
2018-12-19 14:42   ` [PATCH v7 10/49] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
2018-12-19 14:42   ` [PATCH v7 11/49] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
2018-12-19 14:43   ` [PATCH v7 12/49] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
2018-12-19 14:43   ` [PATCH v7 13/49] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
2018-12-19 14:44   ` [PATCH v7 14/49] x86emul: basic AVX512BW testing Jan Beulich
2018-12-19 14:46   ` [PATCH v7 15/49] x86emul: basic AVX512DQ testing Jan Beulich
2018-12-19 14:46   ` [PATCH v7 16/49] x86emul: support AVX512F move high/low insns Jan Beulich
2018-12-19 14:47   ` [PATCH v7 17/49] x86emul: support AVX512F move duplicate insns Jan Beulich
2018-12-19 14:47   ` [PATCH v7 18/49] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
2018-12-19 14:48   ` [PATCH v7 19/49] x86emul: support AVX512BW pack insns Jan Beulich
2018-12-19 14:48   ` [PATCH v7 20/49] x86emul: support AVX512F floating-point conversion insns Jan Beulich
2018-12-19 14:48   ` [PATCH v7 21/49] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
2018-12-19 14:51   ` [PATCH v7 22/49] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
2018-12-19 14:51   ` [PATCH v7 23/49] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
2018-12-19 14:52   ` [PATCH v7 24/49] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
2018-12-19 14:52   ` [PATCH v7 25/49] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
2018-12-19 14:53   ` [PATCH v7 26/49] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
2018-12-19 14:53   ` [PATCH v7 27/49] x86emul: support remaining AVX512BW " Jan Beulich
2018-12-19 14:54   ` [PATCH v7 28/49] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
2018-12-19 14:55   ` [PATCH v7 29/49] x86emul: support AVX512F floating point manipulation insns Jan Beulich
2018-12-19 14:55   ` [PATCH v7 30/49] x86emul: support AVX512DQ " Jan Beulich
2018-12-19 14:56   ` [PATCH v7 31/49] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
2018-12-19 14:56   ` [PATCH v7 32/49] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
2018-12-19 14:57   ` [PATCH v7 33/49] x86emul: support AVX512F gather insns Jan Beulich
2018-12-19 14:57   ` [PATCH v7 34/49] x86emul: add high register S/G test cases Jan Beulich
2018-12-19 14:58   ` [PATCH v7 35/49] x86emul: support AVX512F scatter insns Jan Beulich
2018-12-19 14:59   ` [PATCH v7 36/49] x86emul: support AVX512PF insns Jan Beulich
2018-12-19 14:59   ` [PATCH v7 37/49] x86emul: support AVX512CD insns Jan Beulich
2018-12-19 15:00   ` [PATCH v7 38/49] x86emul: complete support of AVX512_VBMI insns Jan Beulich
2018-12-19 15:00   ` [PATCH v7 39/49] x86emul: support of AVX512* population count insns Jan Beulich
2018-12-19 15:01   ` [PATCH v7 40/49] x86emul: support of AVX512_IFMA insns Jan Beulich
2018-12-19 15:01   ` [PATCH v7 41/49] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
2018-12-19 15:02   ` [PATCH v7 42/49] x86emul: support AVX512_4FMAPS insns Jan Beulich
2018-12-19 15:05   ` [PATCH v7 43/49] x86emul: support AVX512_4VNNIW insns Jan Beulich
2018-12-19 15:06   ` [PATCH v7 44/49] x86emul: support AVX512_VNNI insns Jan Beulich
2018-12-19 15:06   ` [PATCH v7 45/49] x86emul: support VPCLMULQDQ insns Jan Beulich
2018-12-19 15:07   ` [PATCH v7 46/49] x86emul: support VAES insns Jan Beulich
2018-12-19 15:07   ` [PATCH v7 47/49] x86emul: support GFNI insns Jan Beulich
2018-12-19 15:07   ` [PATCH v7 48/49] x86emul: restore ordering within main switch statement Jan Beulich
2018-12-19 15:08   ` [PATCH v7 49/49] tools: re-sync CPUID leaf 7 tables Jan Beulich
2019-03-14 11:07     ` Andrew Cooper
2019-03-15 10:30 ` [PATCH v8 00/50] x86emul: remaining AVX512 support Jan Beulich
2019-03-15 10:36   ` [PATCH v8 01/50] x86emul: no need to set fault_suppression to false for VMOVNT* Jan Beulich
2019-03-15 16:52     ` Andrew Cooper
2019-03-15 10:36   ` [PATCH v8 02/50] x86emul: support AVX512{F, BW, DQ} extract insns Jan Beulich
2019-03-15 17:51     ` Andrew Cooper
2019-03-15 10:37   ` [PATCH v8 03/50] x86emul: support AVX512{F, BW, DQ} insert insns Jan Beulich
2019-03-15 10:38   ` [PATCH v8 04/50] x86emul: basic AVX512F testing Jan Beulich
2019-03-15 10:39   ` [PATCH v8 05/50] x86emul: support AVX512{F, BW, DQ} integer broadcast insns Jan Beulich
2019-03-15 10:39   ` [PATCH v8 06/50] x86emul: basic AVX512VL testing Jan Beulich
2019-03-15 10:40   ` [PATCH v8 07/50] x86emul: support AVX512{F, BW} zero- and sign-extending moves Jan Beulich
2019-03-15 18:02     ` Andrew Cooper
2019-03-15 10:40   ` [PATCH v8 08/50] x86emul: support AVX512{F, BW} down conversion moves Jan Beulich
2019-03-15 18:10     ` Andrew Cooper
2019-03-15 10:41   ` [PATCH v8 09/50] x86emul: support AVX512{F, BW} integer unpack insns Jan Beulich
2019-03-15 18:21     ` Andrew Cooper
2019-03-18  9:55       ` Jan Beulich
2019-05-20 12:11         ` Andrew Cooper
2019-05-20 12:11           ` [Xen-devel] " Andrew Cooper
2019-03-15 10:41   ` [PATCH v8 10/50] x86emul: support AVX512{F, BW, _VBMI} full permute insns Jan Beulich
2019-05-17 16:50     ` Andrew Cooper
2019-05-17 16:50       ` [Xen-devel] " Andrew Cooper
2019-05-20  6:55       ` Jan Beulich
2019-05-20  6:55         ` [Xen-devel] " Jan Beulich
2019-05-20 12:10         ` Andrew Cooper
2019-05-20 12:10           ` [Xen-devel] " Andrew Cooper
2019-03-15 10:43   ` [PATCH v8 11/50] x86emul: support AVX512{F, BW} integer shuffle insns Jan Beulich
2019-05-17 17:01     ` Andrew Cooper
2019-05-17 17:01       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:43   ` [PATCH v8 12/50] x86emul: support AVX512{BW, DQ} mask move insns Jan Beulich
2019-05-17 17:02     ` Andrew Cooper
2019-05-17 17:02       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:43   ` [PATCH v8 13/50] x86emul: basic AVX512BW testing Jan Beulich
2019-05-17 17:03     ` Andrew Cooper
2019-05-17 17:03       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:44   ` [PATCH v8 14/50] x86emul: basic AVX512DQ testing Jan Beulich
2019-05-17 17:03     ` Andrew Cooper
2019-05-17 17:03       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:44   ` [PATCH v8 15/50] x86emul: support AVX512F move high/low insns Jan Beulich
2019-05-21 10:59     ` Andrew Cooper
2019-05-21 10:59       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:45   ` [PATCH v8 16/50] x86emul: support AVX512F move duplicate insns Jan Beulich
2019-05-21 11:09     ` Andrew Cooper
2019-05-21 11:09       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:46   ` [PATCH v8 17/50] x86emul: support AVX512{F, BW, _VBMI} permute insns Jan Beulich
2019-05-21 11:24     ` Andrew Cooper
2019-05-21 11:24       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:46   ` [PATCH v8 18/50] x86emul: support AVX512BW pack insns Jan Beulich
2019-05-21 11:26     ` Andrew Cooper
2019-05-21 11:26       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:47   ` [PATCH v8 19/50] x86emul: support AVX512F floating-point conversion insns Jan Beulich
2019-05-21 11:33     ` Andrew Cooper
2019-05-21 11:33       ` [Xen-devel] " Andrew Cooper
2019-05-21 15:46       ` Jan Beulich
2019-05-21 15:46         ` [Xen-devel] " Jan Beulich
2019-05-23 16:08         ` Andrew Cooper
2019-05-23 16:08           ` [Xen-devel] " Andrew Cooper
2019-03-15 10:47   ` [PATCH v8 20/50] x86emul: support AVX512F legacy-equivalent packed int/FP " Jan Beulich
2019-05-21 11:37     ` Andrew Cooper
2019-05-21 11:37       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:52   ` [PATCH v8 21/50] x86emul: support AVX512F legacy-equivalent scalar " Jan Beulich
2019-05-21 11:44     ` Andrew Cooper
2019-05-21 11:44       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:52   ` [PATCH v8 22/50] x86emul: support AVX512DQ packed quad-int/FP " Jan Beulich
2019-05-21 11:53     ` Andrew Cooper
2019-05-21 11:53       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:53   ` [PATCH v8 23/50] x86emul: support AVX512{F, DQ} uint-to-FP " Jan Beulich
2019-05-21 11:58     ` Andrew Cooper
2019-05-21 11:58       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:54   ` [PATCH v8 24/50] x86emul: support AVX512{F, DQ} FP-to-uint " Jan Beulich
2019-05-21 12:09     ` Andrew Cooper
2019-05-21 12:09       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:54   ` [PATCH v8 25/50] x86emul: support remaining AVX512F legacy-equivalent insns Jan Beulich
2019-05-21 13:06     ` Andrew Cooper
2019-05-21 13:06       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:54   ` [PATCH v8 26/50] x86emul: support remaining AVX512BW " Jan Beulich
2019-05-21 13:08     ` Andrew Cooper
2019-05-21 13:08       ` [Xen-devel] " Andrew Cooper
2019-05-21 13:34       ` Jan Beulich
2019-05-21 13:34         ` [Xen-devel] " Jan Beulich
2019-05-23 16:10     ` Andrew Cooper
2019-05-23 16:10       ` [Xen-devel] " Andrew Cooper
2019-03-15 10:55   ` [PATCH v8 27/50] x86emul: support AVX512{F, ER} reciprocal insns Jan Beulich
2019-05-23 16:15     ` Andrew Cooper
2019-05-23 16:15       ` [Xen-devel] " Andrew Cooper
2019-05-24  6:43       ` Jan Beulich
2019-05-24  6:43         ` [Xen-devel] " Jan Beulich
2019-05-24 20:48         ` Andrew Cooper
2019-05-24 20:48           ` [Xen-devel] " Andrew Cooper
2019-05-27  8:02           ` Jan Beulich
2019-05-27  8:02             ` [Xen-devel] " Jan Beulich
2019-05-29 10:00             ` Andrew Cooper
2019-05-29 10:00               ` [Xen-devel] " Andrew Cooper
2019-03-15 10:56   ` [PATCH v8 28/50] x86emul: support AVX512F floating point manipulation insns Jan Beulich
2019-05-29 12:51     ` Andrew Cooper
2019-05-29 12:51       ` [Xen-devel] " Andrew Cooper
2019-05-29 13:15       ` Jan Beulich
2019-05-29 13:15         ` [Xen-devel] " Jan Beulich
2019-06-10 14:01         ` Andrew Cooper
2019-06-10 14:03     ` Andrew Cooper
2019-03-15 10:56   ` [PATCH v8 29/50] x86emul: support AVX512DQ " Jan Beulich
2019-06-10 14:06     ` [Xen-devel] " Andrew Cooper
2019-03-15 10:56   ` [PATCH v8 30/50] x86emul: support AVX512{F, _VBMI2} compress/expand insns Jan Beulich
2019-06-10 14:51     ` [Xen-devel] " Andrew Cooper
2019-06-11 10:20       ` Jan Beulich
2019-06-18 16:24         ` Andrew Cooper
2019-06-19  6:38           ` Jan Beulich
2019-03-15 10:58   ` [PATCH v8 31/50] x86emul: support remaining misc AVX512{F, BW} insns Jan Beulich
2019-06-18 16:42     ` [Xen-devel] " Andrew Cooper
2019-06-19  6:44       ` Jan Beulich
2019-03-15 10:58   ` [PATCH v8 32/50] x86emul: support AVX512F gather insns Jan Beulich
2019-06-19 12:05     ` [Xen-devel] " Andrew Cooper
2019-06-19 12:43       ` Jan Beulich
2019-03-15 10:59   ` [PATCH v8 33/50] x86emul: add high register S/G test cases Jan Beulich
2019-06-19 12:07     ` [Xen-devel] " Andrew Cooper
2019-03-15 10:59   ` [PATCH v8 34/50] x86emul: support AVX512F scatter insns Jan Beulich
2019-03-15 11:00   ` [PATCH v8 35/50] x86emul: support AVX512PF insns Jan Beulich
2019-03-15 11:00   ` [PATCH v8 36/50] x86emul: support AVX512CD insns Jan Beulich
2019-06-19 12:13     ` [Xen-devel] " Andrew Cooper
2019-03-15 11:01   ` [PATCH v8 37/50] x86emul: complete support of AVX512_VBMI insns Jan Beulich
2019-06-19 12:16     ` [Xen-devel] " Andrew Cooper
2019-03-15 11:01   ` [PATCH v8 38/50] x86emul: support of AVX512* population count insns Jan Beulich
2019-06-19 12:22     ` [Xen-devel] " Andrew Cooper
2019-06-19 12:48       ` Jan Beulich
2019-03-15 11:02   ` [PATCH v8 39/50] x86emul: support of AVX512_IFMA insns Jan Beulich
2019-06-19 12:23     ` [Xen-devel] " Andrew Cooper
2019-03-15 11:02   ` [PATCH v8 40/50] x86emul: support remaining AVX512_VBMI2 insns Jan Beulich
2019-06-19 12:25     ` [Xen-devel] " Andrew Cooper
2019-03-15 11:04   ` [PATCH v8 41/50] x86emul: support AVX512_4FMAPS insns Jan Beulich
2019-06-19 14:58     ` [Xen-devel] " Andrew Cooper
2019-06-21  6:50       ` Jan Beulich
2019-03-15 11:04   ` [PATCH v8 42/50] x86emul: support AVX512_4VNNIW insns Jan Beulich
2019-06-19 14:58     ` [Xen-devel] " Andrew Cooper
2019-03-15 11:04   ` [PATCH v8 43/50] x86emul: support AVX512_VNNI insns Jan Beulich
2019-06-19 15:01     ` [Xen-devel] " Andrew Cooper
2019-06-21  6:55       ` Jan Beulich
2019-03-15 11:05   ` [PATCH v8 44/50] x86emul: support VPCLMULQDQ insns Jan Beulich
2019-06-21 12:52     ` [Xen-devel] " Andrew Cooper
2019-06-21 13:44       ` Jan Beulich
2019-03-15 11:06   ` [PATCH v8 45/50] x86emul: support VAES insns Jan Beulich
2019-06-21 12:57     ` [Xen-devel] " Andrew Cooper
2019-03-15 11:06   ` [PATCH v8 46/50] x86emul: support GFNI insns Jan Beulich
2019-06-21 13:19     ` [Xen-devel] " Andrew Cooper
2019-06-21 13:33       ` Andrew Cooper
2019-06-21 14:00       ` Jan Beulich
2019-06-21 14:20         ` Andrew Cooper
2019-06-21 15:02           ` Jan Beulich
2019-06-25  6:48           ` Jan Beulich
2019-03-15 11:07   ` [PATCH v8 47/50] x86emul: restore ordering within main switch statement Jan Beulich
2019-06-21 13:20     ` [Xen-devel] " Andrew Cooper
2019-03-15 11:07   ` [PATCH v8 48/50] x86emul: add an AES/VAES test case to the harness Jan Beulich
2019-06-21 13:36     ` [Xen-devel] " Andrew Cooper
2019-06-21 14:04       ` Jan Beulich
2019-06-21 14:20         ` Andrew Cooper
2019-03-15 11:08   ` [PATCH v8 49/50] x86emul: add a SHA " Jan Beulich
2019-06-21 13:51     ` [Xen-devel] " Andrew Cooper
2019-06-21 14:10       ` Jan Beulich
2019-06-21 14:23         ` Andrew Cooper
2019-03-15 11:08   ` [PATCH v8 50/50] x86emul: add a PCLMUL/VPCLMUL " Jan Beulich
2019-06-21 13:58     ` [Xen-devel] " Andrew Cooper
