All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v1 0/5] s390x/tcg: Vector Instruction Support Part 3
@ 2019-05-15 20:31 David Hildenbrand
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: David Hildenbrand @ 2019-05-15 20:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Cornelia Huck, David Hildenbrand, Thomas Huth,
	Richard Henderson

This is the third part of vector instruction support for s390x. It is based
on part 2, which is will send a pull-request for to Conny soon.

Part 1: Vector Support Instructions
Part 2: Vector Integer Instructions
Part 3: Vector String Instructions
Part 4: Vector Floating-Point Instructions

The current state can be found at (kept updated):
    https://github.com/davidhildenbrand/qemu/tree/vx

With the current state I can boot Linux kernel + user space compiled with
SIMD support. This allows to boot distributions compiled exclusively for
z13, requiring SIMD support. Also, it is now possible to build a complete
kernel using rpmbuild as quite some issues have been sorted out.

In this part, all Vector String Instructions introduced with the
"Vector Facility" are added.

David Hildenbrand (5):
  s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  s390x/tcg: Implement VECTOR FIND ELEMENT EQUAL
  s390x/tcg: Implement VECTOR FIND ELEMENT NOT EQUAL
  s390x/tcg: Implement VECTOR ISOLATE STRING
  s390x/tcg: Implement VECTOR STRING RANGE COMPARE

 target/s390x/Makefile.objs       |   2 +-
 target/s390x/helper.h            |  32 +++
 target/s390x/insn-data.def       |  13 ++
 target/s390x/translate_vx.inc.c  | 164 ++++++++++++++
 target/s390x/vec_string_helper.c | 358 +++++++++++++++++++++++++++++++
 5 files changed, 568 insertions(+), 1 deletion(-)
 create mode 100644 target/s390x/vec_string_helper.c

-- 
2.20.1



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-15 20:31 [Qemu-devel] [PATCH v1 0/5] s390x/tcg: Vector Instruction Support Part 3 David Hildenbrand
@ 2019-05-15 20:31 ` David Hildenbrand
  2019-05-17 16:16   ` Richard Henderson
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 2/5] s390x/tcg: Implement VECTOR FIND " David Hildenbrand
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-15 20:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Cornelia Huck, David Hildenbrand, Thomas Huth,
	Richard Henderson

Complicated stuff. Provide two variants, one for the CC and one without
the CC. The CC is returned via cpu_env.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/Makefile.objs       |  2 +-
 target/s390x/helper.h            |  8 +++
 target/s390x/insn-data.def       |  5 ++
 target/s390x/translate_vx.inc.c  | 31 ++++++++++
 target/s390x/vec_string_helper.c | 97 ++++++++++++++++++++++++++++++++
 5 files changed, 142 insertions(+), 1 deletion(-)
 create mode 100644 target/s390x/vec_string_helper.c

diff --git a/target/s390x/Makefile.objs b/target/s390x/Makefile.objs
index 993ac93ed6..0a38281a14 100644
--- a/target/s390x/Makefile.objs
+++ b/target/s390x/Makefile.objs
@@ -1,7 +1,7 @@
 obj-y += cpu.o cpu_models.o cpu_features.o gdbstub.o interrupt.o helper.o
 obj-$(CONFIG_TCG) += translate.o cc_helper.o excp_helper.o fpu_helper.o
 obj-$(CONFIG_TCG) += int_helper.o mem_helper.o misc_helper.o crypto_helper.o
-obj-$(CONFIG_TCG) += vec_helper.o vec_int_helper.o
+obj-$(CONFIG_TCG) += vec_helper.o vec_int_helper.o vec_string_helper.o
 obj-$(CONFIG_SOFTMMU) += machine.o ioinst.o arch_dump.o mmu_helper.o diag.o
 obj-$(CONFIG_SOFTMMU) += sigp.o
 obj-$(CONFIG_KVM) += kvm.o
diff --git a/target/s390x/helper.h b/target/s390x/helper.h
index 7755a96c33..c45328cf73 100644
--- a/target/s390x/helper.h
+++ b/target/s390x/helper.h
@@ -211,6 +211,14 @@ DEF_HELPER_FLAGS_4(gvec_vscbi8, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
 DEF_HELPER_FLAGS_4(gvec_vscbi16, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
 DEF_HELPER_4(gvec_vtm, void, ptr, cptr, env, i32)
 
+/* === Vector String Instructions === */
+DEF_HELPER_FLAGS_4(gvec_vfae8, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_FLAGS_4(gvec_vfae16, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_FLAGS_4(gvec_vfae32, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_5(gvec_vfae_cc8, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_5(gvec_vfae_cc16, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_5(gvec_vfae_cc32, void, ptr, cptr, cptr, env, i32)
+
 #ifndef CONFIG_USER_ONLY
 DEF_HELPER_3(servc, i32, env, i64, i64)
 DEF_HELPER_4(diag, void, env, i32, i32, i32)
diff --git a/target/s390x/insn-data.def b/target/s390x/insn-data.def
index e61475bdc4..070ce2a471 100644
--- a/target/s390x/insn-data.def
+++ b/target/s390x/insn-data.def
@@ -1191,6 +1191,11 @@
 /* VECTOR TEST UNDER MASK */
     F(0xe7d8, VTM,     VRR_a, V,   0, 0, 0, 0, vtm, 0, IF_VEC)
 
+/* === Vector String Instructions === */
+
+/* VECTOR FIND ANY ELEMENT EQUAL */
+    F(0xe782, VFAE,    VRR_b, V,   0, 0, 0, 0, vfae, 0, IF_VEC)
+
 #ifndef CONFIG_USER_ONLY
 /* COMPARE AND SWAP AND PURGE */
     E(0xb250, CSP,     RRE,   Z,   r1_32u, ra2, r1_P, 0, csp, 0, MO_TEUL, IF_PRIV)
diff --git a/target/s390x/translate_vx.inc.c b/target/s390x/translate_vx.inc.c
index 7e0bfcb190..022990dda3 100644
--- a/target/s390x/translate_vx.inc.c
+++ b/target/s390x/translate_vx.inc.c
@@ -2353,3 +2353,34 @@ static DisasJumpType op_vtm(DisasContext *s, DisasOps *o)
     set_cc_static(s);
     return DISAS_NEXT;
 }
+
+static DisasJumpType op_vfae(DisasContext *s, DisasOps *o)
+{
+    const uint8_t es = get_field(s->fields, m4);
+    const uint8_t m5 = get_field(s->fields, m5);
+    static gen_helper_gvec_3_ptr * const cc[3] = {
+        gen_helper_gvec_vfae_cc8,
+        gen_helper_gvec_vfae_cc16,
+        gen_helper_gvec_vfae_cc32,
+    };
+    static gen_helper_gvec_3 * const nocc[3] = {
+        gen_helper_gvec_vfae8,
+        gen_helper_gvec_vfae16,
+        gen_helper_gvec_vfae32,
+    };
+
+    if (es > ES_32) {
+        gen_program_exception(s, PGM_SPECIFICATION);
+        return DISAS_NORETURN;
+    }
+
+    if (m5 & 1) {
+        gen_gvec_3_ptr(get_field(s->fields, v1), get_field(s->fields, v2),
+                       get_field(s->fields, v3), cpu_env, m5, cc[es]);
+        set_cc_static(s);
+    } else {
+        gen_gvec_3_ool(get_field(s->fields, v1), get_field(s->fields, v2),
+                       get_field(s->fields, v3), m5, nocc[es]);
+    }
+    return DISAS_NEXT;
+}
diff --git a/target/s390x/vec_string_helper.c b/target/s390x/vec_string_helper.c
new file mode 100644
index 0000000000..8a4e65b70f
--- /dev/null
+++ b/target/s390x/vec_string_helper.c
@@ -0,0 +1,97 @@
+/*
+ * QEMU TCG support -- s390x vector string instruction support
+ *
+ * Copyright (C) 2019 Red Hat Inc
+ *
+ * Authors:
+ *   David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "cpu.h"
+#include "internal.h"
+#include "vec.h"
+#include "tcg/tcg-gvec-desc.h"
+#include "exec/helper-proto.h"
+
+#define DEF_VFAE(BITS)                                                         \
+static int vfae##BITS(void *v1, const void *v2, const void *v3, uint8_t m5)    \
+{                                                                              \
+    const bool in = extract32(m5, 3, 1);                                       \
+    const bool rt = extract32(m5, 2, 1);                                       \
+    const bool zs = extract32(m5, 1, 1);                                       \
+    S390Vector tmp = {};                                                       \
+    int first_byte = 16;                                                       \
+    int cc = 3; /* no match */                                                 \
+    int i, j;                                                                  \
+                                                                               \
+    for (i = 0; i < (128 / BITS); i++) {                                       \
+        const uint##BITS##_t data = s390_vec_read_element##BITS(v2, i);        \
+        bool any_equal = false;                                                \
+                                                                               \
+        if (zs && !data) {                                                     \
+            if (cc == 3) {                                                     \
+                first_byte = i * (BITS / 8);                                   \
+                cc = 0; /* match for zero */                                   \
+            } else if (cc != 0) {                                              \
+                cc = 2; /* matching elements before match for zero */          \
+            }                                                                  \
+            if (!rt) {                                                         \
+                break;                                                         \
+            }                                                                  \
+        }                                                                      \
+                                                                               \
+        /* try to match with any other element from the other vector */        \
+        for (j = 0; j < (128 / BITS); j++) {                                   \
+            if (data == s390_vec_read_element##BITS(v3, j)) {                  \
+                any_equal = true;                                              \
+                break;                                                         \
+            }                                                                  \
+        }                                                                      \
+                                                                               \
+        /* invert the result if requested */                                   \
+        any_equal = in ^ any_equal;                                            \
+        if (cc == 3 && any_equal) {                                            \
+            first_byte = i * (BITS / 8);                                       \
+            cc = 1; /* matching elements, no match for zero */                 \
+            if (!zs && !rt) {                                                  \
+                break;                                                         \
+            }                                                                  \
+        }                                                                      \
+        /* indicate bit vector if requested */                                 \
+        if (rt && any_equal) {                                                 \
+            s390_vec_write_element##BITS(&tmp, i, (uint##BITS##_t)-1ull);      \
+        }                                                                      \
+    }                                                                          \
+    if (!rt) {                                                                 \
+        s390_vec_write_element8(&tmp, 7, first_byte);                          \
+    }                                                                          \
+    *(S390Vector *)v1 = tmp;                                                   \
+    return cc;                                                                 \
+}
+DEF_VFAE(8)
+DEF_VFAE(16)
+DEF_VFAE(32)
+
+#define DEF_VFAE_HELPER(BITS)                                                  \
+void HELPER(gvec_vfae##BITS)(void *v1, const void *v2, const void *v3,         \
+                             uint32_t desc)                                    \
+{                                                                              \
+    vfae##BITS(v1, v2, v3, simd_data(desc));                                   \
+}
+DEF_VFAE_HELPER(8)
+DEF_VFAE_HELPER(16)
+DEF_VFAE_HELPER(32)
+
+#define DEF_VFAE_CC_HELPER(BITS)                                               \
+void HELPER(gvec_vfae_cc##BITS)(void *v1, const void *v2, const void *v3,      \
+                                CPUS390XState *env, uint32_t desc)             \
+{                                                                              \
+    env->cc_op = vfae##BITS(v1, v2, v3, simd_data(desc));                      \
+}
+DEF_VFAE_CC_HELPER(8)
+DEF_VFAE_CC_HELPER(16)
+DEF_VFAE_CC_HELPER(32)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCH v1 2/5] s390x/tcg: Implement VECTOR FIND ELEMENT EQUAL
  2019-05-15 20:31 [Qemu-devel] [PATCH v1 0/5] s390x/tcg: Vector Instruction Support Part 3 David Hildenbrand
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL David Hildenbrand
@ 2019-05-15 20:31 ` David Hildenbrand
  2019-05-17 16:47   ` Richard Henderson
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 3/5] s390x/tcg: Implement VECTOR FIND ELEMENT NOT EQUAL David Hildenbrand
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-15 20:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Cornelia Huck, David Hildenbrand, Thomas Huth,
	Richard Henderson

Implement it similar to VECTOR FIND ANY ELEMENT EQUAL.

The zero-check seems to have precedence in case we have
"data1 == data2 == 0". The description in the PoP is a little bi
confusing.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/helper.h            |  6 ++++
 target/s390x/insn-data.def       |  2 ++
 target/s390x/translate_vx.inc.c  | 31 +++++++++++++++++
 target/s390x/vec_string_helper.c | 59 ++++++++++++++++++++++++++++++++
 4 files changed, 98 insertions(+)

diff --git a/target/s390x/helper.h b/target/s390x/helper.h
index c45328cf73..a1b169b666 100644
--- a/target/s390x/helper.h
+++ b/target/s390x/helper.h
@@ -218,6 +218,12 @@ DEF_HELPER_FLAGS_4(gvec_vfae32, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
 DEF_HELPER_5(gvec_vfae_cc8, void, ptr, cptr, cptr, env, i32)
 DEF_HELPER_5(gvec_vfae_cc16, void, ptr, cptr, cptr, env, i32)
 DEF_HELPER_5(gvec_vfae_cc32, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_FLAGS_4(gvec_vfee8, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_FLAGS_4(gvec_vfee16, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_FLAGS_4(gvec_vfee32, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_5(gvec_vfee_cc8, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_5(gvec_vfee_cc16, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_5(gvec_vfee_cc32, void, ptr, cptr, cptr, env, i32)
 
 #ifndef CONFIG_USER_ONLY
 DEF_HELPER_3(servc, i32, env, i64, i64)
diff --git a/target/s390x/insn-data.def b/target/s390x/insn-data.def
index 070ce2a471..d8907ef6a5 100644
--- a/target/s390x/insn-data.def
+++ b/target/s390x/insn-data.def
@@ -1195,6 +1195,8 @@
 
 /* VECTOR FIND ANY ELEMENT EQUAL */
     F(0xe782, VFAE,    VRR_b, V,   0, 0, 0, 0, vfae, 0, IF_VEC)
+/* VECTOR FIND ELEMENT EQUAL */
+    F(0xe780, VFEE,    VRR_b, V,   0, 0, 0, 0, vfee, 0, IF_VEC)
 
 #ifndef CONFIG_USER_ONLY
 /* COMPARE AND SWAP AND PURGE */
diff --git a/target/s390x/translate_vx.inc.c b/target/s390x/translate_vx.inc.c
index 022990dda3..848f6d7163 100644
--- a/target/s390x/translate_vx.inc.c
+++ b/target/s390x/translate_vx.inc.c
@@ -2384,3 +2384,34 @@ static DisasJumpType op_vfae(DisasContext *s, DisasOps *o)
     }
     return DISAS_NEXT;
 }
+
+static DisasJumpType op_vfee(DisasContext *s, DisasOps *o)
+{
+    const uint8_t es = get_field(s->fields, m4);
+    const uint8_t m5 = get_field(s->fields, m5);
+    static gen_helper_gvec_3_ptr * const cc[3] = {
+        gen_helper_gvec_vfee_cc8,
+        gen_helper_gvec_vfee_cc16,
+        gen_helper_gvec_vfee_cc32,
+    };
+    static gen_helper_gvec_3 * const nocc[3] = {
+        gen_helper_gvec_vfee8,
+        gen_helper_gvec_vfee16,
+        gen_helper_gvec_vfee32,
+    };
+
+    if (es > ES_32 || m5 & ~0x3) {
+        gen_program_exception(s, PGM_SPECIFICATION);
+        return DISAS_NORETURN;
+    }
+
+    if (m5 & 1) {
+        gen_gvec_3_ptr(get_field(s->fields, v1), get_field(s->fields, v2),
+                       get_field(s->fields, v3), cpu_env, m5, cc[es]);
+        set_cc_static(s);
+    } else {
+        gen_gvec_3_ool(get_field(s->fields, v1), get_field(s->fields, v2),
+                       get_field(s->fields, v3), m5, nocc[es]);
+    }
+    return DISAS_NEXT;
+}
diff --git a/target/s390x/vec_string_helper.c b/target/s390x/vec_string_helper.c
index 8a4e65b70f..6a5d05271c 100644
--- a/target/s390x/vec_string_helper.c
+++ b/target/s390x/vec_string_helper.c
@@ -95,3 +95,62 @@ void HELPER(gvec_vfae_cc##BITS)(void *v1, const void *v2, const void *v3,      \
 DEF_VFAE_CC_HELPER(8)
 DEF_VFAE_CC_HELPER(16)
 DEF_VFAE_CC_HELPER(32)
+
+#define DEF_VFEE(BITS)                                                         \
+static int vfee##BITS(void *v1, const void *v2, const void *v3, uint8_t m5)    \
+{                                                                              \
+    const bool zs = extract32(m5, 1, 1);                                       \
+    S390Vector tmp = {};                                                       \
+    int first_byte = 16;                                                       \
+    int cc = 3; /* no match */                                                 \
+    int i;                                                                     \
+                                                                               \
+    for (i = 0; i < (128 / BITS); i++) {                                       \
+        const uint##BITS##_t data1 = s390_vec_read_element##BITS(v2, i);       \
+        const uint##BITS##_t data2 = s390_vec_read_element##BITS(v3, i);       \
+                                                                               \
+        if (zs && !data1) {                                                    \
+            if (cc == 3) {                                                     \
+                first_byte = i * (BITS / 8);                                   \
+                cc = 0; /* match for zero */                                   \
+            } else {                                                           \
+                cc = 2; /* matching elements before match for zero */          \
+            }                                                                  \
+            break;                                                             \
+        }                                                                      \
+                                                                               \
+        if (cc == 3 && data1 == data2) {                                       \
+            first_byte = i * (BITS / 8);                                       \
+            cc = 1; /* matching elements, no match for zero */                 \
+            if (!zs) {                                                         \
+                break;                                                         \
+            }                                                                  \
+        }                                                                      \
+    }                                                                          \
+    s390_vec_write_element8(&tmp, 7, first_byte);                              \
+    *(S390Vector *)v1 = tmp;                                                   \
+    return cc;                                                                 \
+}
+DEF_VFEE(8)
+DEF_VFEE(16)
+DEF_VFEE(32)
+
+#define DEF_VFEE_HELPER(BITS)                                                  \
+void HELPER(gvec_vfee##BITS)(void *v1, const void *v2, const void *v3,         \
+                             uint32_t desc)                                    \
+{                                                                              \
+    vfee##BITS(v1, v2, v3, simd_data(desc));                                   \
+}
+DEF_VFEE_HELPER(8)
+DEF_VFEE_HELPER(16)
+DEF_VFEE_HELPER(32)
+
+#define DEF_VFEE_CC_HELPER(BITS)                                               \
+void HELPER(gvec_vfee_cc##BITS)(void *v1, const void *v2, const void *v3,      \
+                                CPUS390XState *env, uint32_t desc)             \
+{                                                                              \
+    env->cc_op = vfee##BITS(v1, v2, v3, simd_data(desc));                      \
+}
+DEF_VFEE_CC_HELPER(8)
+DEF_VFEE_CC_HELPER(16)
+DEF_VFEE_CC_HELPER(32)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCH v1 3/5] s390x/tcg: Implement VECTOR FIND ELEMENT NOT EQUAL
  2019-05-15 20:31 [Qemu-devel] [PATCH v1 0/5] s390x/tcg: Vector Instruction Support Part 3 David Hildenbrand
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL David Hildenbrand
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 2/5] s390x/tcg: Implement VECTOR FIND " David Hildenbrand
@ 2019-05-15 20:31 ` David Hildenbrand
  2019-05-17 17:56   ` Richard Henderson
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 4/5] s390x/tcg: Implement VECTOR ISOLATE STRING David Hildenbrand
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 5/5] s390x/tcg: Implement VECTOR STRING RANGE COMPARE David Hildenbrand
  4 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-15 20:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Cornelia Huck, David Hildenbrand, Thomas Huth,
	Richard Henderson

Similar to VECTOR FIND ELEMENT EQUAL, however the search also stops on
any inequality. A match for inequality seems to have precedence over
a match for zero, because both elements have to be zero.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/helper.h            |  6 ++++
 target/s390x/insn-data.def       |  2 ++
 target/s390x/translate_vx.inc.c  | 31 +++++++++++++++++++
 target/s390x/vec_string_helper.c | 53 ++++++++++++++++++++++++++++++++
 4 files changed, 92 insertions(+)

diff --git a/target/s390x/helper.h b/target/s390x/helper.h
index a1b169b666..fb50b404db 100644
--- a/target/s390x/helper.h
+++ b/target/s390x/helper.h
@@ -224,6 +224,12 @@ DEF_HELPER_FLAGS_4(gvec_vfee32, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
 DEF_HELPER_5(gvec_vfee_cc8, void, ptr, cptr, cptr, env, i32)
 DEF_HELPER_5(gvec_vfee_cc16, void, ptr, cptr, cptr, env, i32)
 DEF_HELPER_5(gvec_vfee_cc32, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_FLAGS_4(gvec_vfene8, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_FLAGS_4(gvec_vfene16, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_FLAGS_4(gvec_vfene32, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
+DEF_HELPER_5(gvec_vfene_cc8, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_5(gvec_vfene_cc16, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_5(gvec_vfene_cc32, void, ptr, cptr, cptr, env, i32)
 
 #ifndef CONFIG_USER_ONLY
 DEF_HELPER_3(servc, i32, env, i64, i64)
diff --git a/target/s390x/insn-data.def b/target/s390x/insn-data.def
index d8907ef6a5..d03c1ee0b3 100644
--- a/target/s390x/insn-data.def
+++ b/target/s390x/insn-data.def
@@ -1197,6 +1197,8 @@
     F(0xe782, VFAE,    VRR_b, V,   0, 0, 0, 0, vfae, 0, IF_VEC)
 /* VECTOR FIND ELEMENT EQUAL */
     F(0xe780, VFEE,    VRR_b, V,   0, 0, 0, 0, vfee, 0, IF_VEC)
+/* VECTOR FIND ELEMENT NOT EQUAL */
+    F(0xe781, VFENE,   VRR_b, V,   0, 0, 0, 0, vfene, 0, IF_VEC)
 
 #ifndef CONFIG_USER_ONLY
 /* COMPARE AND SWAP AND PURGE */
diff --git a/target/s390x/translate_vx.inc.c b/target/s390x/translate_vx.inc.c
index 848f6d7163..e36cc5c401 100644
--- a/target/s390x/translate_vx.inc.c
+++ b/target/s390x/translate_vx.inc.c
@@ -2415,3 +2415,34 @@ static DisasJumpType op_vfee(DisasContext *s, DisasOps *o)
     }
     return DISAS_NEXT;
 }
+
+static DisasJumpType op_vfene(DisasContext *s, DisasOps *o)
+{
+    const uint8_t es = get_field(s->fields, m4);
+    const uint8_t m5 = get_field(s->fields, m5);
+    static gen_helper_gvec_3_ptr * const cc[3] = {
+        gen_helper_gvec_vfene_cc8,
+        gen_helper_gvec_vfene_cc16,
+        gen_helper_gvec_vfene_cc32,
+    };
+    static gen_helper_gvec_3 * const nocc[3] = {
+        gen_helper_gvec_vfene8,
+        gen_helper_gvec_vfene16,
+        gen_helper_gvec_vfene32,
+    };
+
+    if (es > ES_32 || m5 & ~0x3) {
+        gen_program_exception(s, PGM_SPECIFICATION);
+        return DISAS_NORETURN;
+    }
+
+    if (m5 & 1) {
+        gen_gvec_3_ptr(get_field(s->fields, v1), get_field(s->fields, v2),
+                       get_field(s->fields, v3), cpu_env, m5, cc[es]);
+        set_cc_static(s);
+    } else {
+        gen_gvec_3_ool(get_field(s->fields, v1), get_field(s->fields, v2),
+                       get_field(s->fields, v3), m5, nocc[es]);
+    }
+    return DISAS_NEXT;
+}
diff --git a/target/s390x/vec_string_helper.c b/target/s390x/vec_string_helper.c
index 6a5d05271c..181f044fe5 100644
--- a/target/s390x/vec_string_helper.c
+++ b/target/s390x/vec_string_helper.c
@@ -154,3 +154,56 @@ void HELPER(gvec_vfee_cc##BITS)(void *v1, const void *v2, const void *v3,      \
 DEF_VFEE_CC_HELPER(8)
 DEF_VFEE_CC_HELPER(16)
 DEF_VFEE_CC_HELPER(32)
+
+#define DEF_VFENE(BITS)                                                        \
+static int vfene##BITS(void *v1, const void *v2, const void *v3, uint8_t m5)   \
+{                                                                              \
+    const bool zs = extract32(m5, 1, 1);                                       \
+    S390Vector tmp = {};                                                       \
+    int first_byte = 16;                                                       \
+    int cc = 3; /* no match */                                                 \
+    int i;                                                                     \
+                                                                               \
+    for (i = 0; i < (128 / BITS); i++) {                                       \
+        const uint##BITS##_t data1 = s390_vec_read_element##BITS(v2, i);       \
+        const uint##BITS##_t data2 = s390_vec_read_element##BITS(v3, i);       \
+                                                                               \
+        if (data1 != data2) {                                                  \
+            first_byte = i * (BITS / 8);                                       \
+            cc = data1 < data2 ? 1 : 2; /* inequality found */                 \
+            break;                                                             \
+        }                                                                      \
+                                                                               \
+        if (zs && !data1) {                                                    \
+            first_byte = i * (BITS / 8);                                       \
+            cc = 0; /* match for zero */                                       \
+            break;                                                             \
+        }                                                                      \
+    }                                                                          \
+    s390_vec_write_element8(&tmp, 7, first_byte);                              \
+    *(S390Vector *)v1 = tmp;                                                   \
+    return cc;                                                                 \
+}
+DEF_VFENE(8)
+DEF_VFENE(16)
+DEF_VFENE(32)
+
+#define DEF_VFENE_HELPER(BITS)                                                 \
+void HELPER(gvec_vfene##BITS)(void *v1, const void *v2, const void *v3,        \
+                              uint32_t desc)                                   \
+{                                                                              \
+    vfene##BITS(v1, v2, v3, simd_data(desc));                                  \
+}
+DEF_VFENE_HELPER(8)
+DEF_VFENE_HELPER(16)
+DEF_VFENE_HELPER(32)
+
+#define DEF_VFENE_CC_HELPER(BITS)                                              \
+void HELPER(gvec_vfene_cc##BITS)(void *v1, const void *v2, const void *v3,     \
+                                CPUS390XState *env, uint32_t desc)             \
+{                                                                              \
+    env->cc_op = vfene##BITS(v1, v2, v3, simd_data(desc));                     \
+}
+DEF_VFENE_CC_HELPER(8)
+DEF_VFENE_CC_HELPER(16)
+DEF_VFENE_CC_HELPER(32)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCH v1 4/5] s390x/tcg: Implement VECTOR ISOLATE STRING
  2019-05-15 20:31 [Qemu-devel] [PATCH v1 0/5] s390x/tcg: Vector Instruction Support Part 3 David Hildenbrand
                   ` (2 preceding siblings ...)
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 3/5] s390x/tcg: Implement VECTOR FIND ELEMENT NOT EQUAL David Hildenbrand
@ 2019-05-15 20:31 ` David Hildenbrand
  2019-05-17 18:20   ` Richard Henderson
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 5/5] s390x/tcg: Implement VECTOR STRING RANGE COMPARE David Hildenbrand
  4 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-15 20:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Cornelia Huck, David Hildenbrand, Thomas Huth,
	Richard Henderson

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/helper.h            |  6 +++++
 target/s390x/insn-data.def       |  2 ++
 target/s390x/translate_vx.inc.c  | 34 ++++++++++++++++++++++++++
 target/s390x/vec_string_helper.c | 41 ++++++++++++++++++++++++++++++++
 4 files changed, 83 insertions(+)

diff --git a/target/s390x/helper.h b/target/s390x/helper.h
index fb50b404db..1f9f0b463b 100644
--- a/target/s390x/helper.h
+++ b/target/s390x/helper.h
@@ -230,6 +230,12 @@ DEF_HELPER_FLAGS_4(gvec_vfene32, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, i32)
 DEF_HELPER_5(gvec_vfene_cc8, void, ptr, cptr, cptr, env, i32)
 DEF_HELPER_5(gvec_vfene_cc16, void, ptr, cptr, cptr, env, i32)
 DEF_HELPER_5(gvec_vfene_cc32, void, ptr, cptr, cptr, env, i32)
+DEF_HELPER_FLAGS_3(gvec_vistr8, TCG_CALL_NO_RWG, void, ptr, cptr, i32)
+DEF_HELPER_FLAGS_3(gvec_vistr16, TCG_CALL_NO_RWG, void, ptr, cptr, i32)
+DEF_HELPER_FLAGS_3(gvec_vistr32, TCG_CALL_NO_RWG, void, ptr, cptr, i32)
+DEF_HELPER_4(gvec_vistr_cc8, void, ptr, cptr, env, i32)
+DEF_HELPER_4(gvec_vistr_cc16, void, ptr, cptr, env, i32)
+DEF_HELPER_4(gvec_vistr_cc32, void, ptr, cptr, env, i32)
 
 #ifndef CONFIG_USER_ONLY
 DEF_HELPER_3(servc, i32, env, i64, i64)
diff --git a/target/s390x/insn-data.def b/target/s390x/insn-data.def
index d03c1ee0b3..b4a6b59608 100644
--- a/target/s390x/insn-data.def
+++ b/target/s390x/insn-data.def
@@ -1199,6 +1199,8 @@
     F(0xe780, VFEE,    VRR_b, V,   0, 0, 0, 0, vfee, 0, IF_VEC)
 /* VECTOR FIND ELEMENT NOT EQUAL */
     F(0xe781, VFENE,   VRR_b, V,   0, 0, 0, 0, vfene, 0, IF_VEC)
+/* VECTOR ISOLATE STRING */
+    F(0xe75c, VISTR,   VRR_a, V,   0, 0, 0, 0, vistr, 0, IF_VEC)
 
 #ifndef CONFIG_USER_ONLY
 /* COMPARE AND SWAP AND PURGE */
diff --git a/target/s390x/translate_vx.inc.c b/target/s390x/translate_vx.inc.c
index e36cc5c401..437b416b4a 100644
--- a/target/s390x/translate_vx.inc.c
+++ b/target/s390x/translate_vx.inc.c
@@ -188,6 +188,9 @@ static void get_vec_element_ptr_i64(TCGv_ptr ptr, uint8_t reg, TCGv_i64 enr,
 #define gen_gvec_2s(v1, v2, c, gen) \
     tcg_gen_gvec_2s(vec_full_reg_offset(v1), vec_full_reg_offset(v2), \
                     16, 16, c, gen)
+#define gen_gvec_2_ool(v1, v2, data, fn) \
+    tcg_gen_gvec_2_ool(vec_full_reg_offset(v1), vec_full_reg_offset(v2), \
+                       16, 16, data, fn)
 #define gen_gvec_2i_ool(v1, v2, c, data, fn) \
     tcg_gen_gvec_2i_ool(vec_full_reg_offset(v1), vec_full_reg_offset(v2), \
                         c, 16, 16, data, fn)
@@ -2446,3 +2449,34 @@ static DisasJumpType op_vfene(DisasContext *s, DisasOps *o)
     }
     return DISAS_NEXT;
 }
+
+static DisasJumpType op_vistr(DisasContext *s, DisasOps *o)
+{
+    const uint8_t es = get_field(s->fields, m4);
+    const uint8_t m5 = get_field(s->fields, m5);
+    static gen_helper_gvec_2_ptr * const cc[3] = {
+        gen_helper_gvec_vistr_cc8,
+        gen_helper_gvec_vistr_cc16,
+        gen_helper_gvec_vistr_cc32,
+    };
+    static gen_helper_gvec_2 * const nocc[3] = {
+        gen_helper_gvec_vistr8,
+        gen_helper_gvec_vistr16,
+        gen_helper_gvec_vistr32,
+    };
+
+    if (es > ES_32 || m5 & ~0x1) {
+        gen_program_exception(s, PGM_SPECIFICATION);
+        return DISAS_NORETURN;
+    }
+
+    if (m5 & 1) {
+        gen_gvec_2_ptr(get_field(s->fields, v1), get_field(s->fields, v2),
+                       cpu_env, 0, cc[es]);
+        set_cc_static(s);
+    } else {
+        gen_gvec_2_ool(get_field(s->fields, v1), get_field(s->fields, v2), 0,
+                       nocc[es]);
+    }
+    return DISAS_NEXT;
+}
diff --git a/target/s390x/vec_string_helper.c b/target/s390x/vec_string_helper.c
index 181f044fe5..2e998c21a2 100644
--- a/target/s390x/vec_string_helper.c
+++ b/target/s390x/vec_string_helper.c
@@ -207,3 +207,44 @@ void HELPER(gvec_vfene_cc##BITS)(void *v1, const void *v2, const void *v3,     \
 DEF_VFENE_CC_HELPER(8)
 DEF_VFENE_CC_HELPER(16)
 DEF_VFENE_CC_HELPER(32)
+
+#define DEF_VISTR(BITS)                                                        \
+static int vistr##BITS(void *v1, const void *v2)                               \
+{                                                                              \
+    S390Vector tmp = {};                                                       \
+    int i, cc = 3;                                                             \
+                                                                               \
+    for (i = 0; i < (128 / BITS); i++) {                                       \
+        const uint##BITS##_t data = s390_vec_read_element##BITS(v2, i);        \
+                                                                               \
+        if (!data) {                                                           \
+            cc = 0;                                                            \
+            break;                                                             \
+        }                                                                      \
+        s390_vec_write_element##BITS(&tmp, i, data);                           \
+    }                                                                          \
+    *(S390Vector *)v1 = tmp;                                                   \
+    return cc;                                                                 \
+}
+DEF_VISTR(8)
+DEF_VISTR(16)
+DEF_VISTR(32)
+
+#define DEF_VISTR_HELPER(BITS)                                                 \
+void HELPER(gvec_vistr##BITS)(void *v1, const void *v2, uint32_t desc)         \
+{                                                                              \
+    vistr##BITS(v1, v2);                                                       \
+}
+DEF_VISTR_HELPER(8)
+DEF_VISTR_HELPER(16)
+DEF_VISTR_HELPER(32)
+
+#define DEF_VISTR_CC_HELPER(BITS)                                              \
+void HELPER(gvec_vistr_cc##BITS)(void *v1, const void *v2,                     \
+                                 CPUS390XState *env, uint32_t desc)            \
+{                                                                              \
+    env->cc_op = vistr##BITS(v1, v2);                                          \
+}
+DEF_VISTR_CC_HELPER(8)
+DEF_VISTR_CC_HELPER(16)
+DEF_VISTR_CC_HELPER(32)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCH v1 5/5] s390x/tcg: Implement VECTOR STRING RANGE COMPARE
  2019-05-15 20:31 [Qemu-devel] [PATCH v1 0/5] s390x/tcg: Vector Instruction Support Part 3 David Hildenbrand
                   ` (3 preceding siblings ...)
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 4/5] s390x/tcg: Implement VECTOR ISOLATE STRING David Hildenbrand
@ 2019-05-15 20:31 ` David Hildenbrand
  2019-05-17 18:37   ` Richard Henderson
  4 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-15 20:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Cornelia Huck, David Hildenbrand, Thomas Huth,
	Richard Henderson

Crazy stuff. Implement it similar to VECTOR FIND ANY ELEMENT
EQUAL.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/helper.h            |   6 ++
 target/s390x/insn-data.def       |   2 +
 target/s390x/translate_vx.inc.c  |  37 +++++++++++
 target/s390x/vec_string_helper.c | 108 +++++++++++++++++++++++++++++++
 4 files changed, 153 insertions(+)

diff --git a/target/s390x/helper.h b/target/s390x/helper.h
index 1f9f0b463b..f2743ccd97 100644
--- a/target/s390x/helper.h
+++ b/target/s390x/helper.h
@@ -236,6 +236,12 @@ DEF_HELPER_FLAGS_3(gvec_vistr32, TCG_CALL_NO_RWG, void, ptr, cptr, i32)
 DEF_HELPER_4(gvec_vistr_cc8, void, ptr, cptr, env, i32)
 DEF_HELPER_4(gvec_vistr_cc16, void, ptr, cptr, env, i32)
 DEF_HELPER_4(gvec_vistr_cc32, void, ptr, cptr, env, i32)
+DEF_HELPER_FLAGS_5(gvec_vstrc8, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, cptr, i32)
+DEF_HELPER_FLAGS_5(gvec_vstrc16, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, cptr, i32)
+DEF_HELPER_FLAGS_5(gvec_vstrc32, TCG_CALL_NO_RWG, void, ptr, cptr, cptr, cptr, i32)
+DEF_HELPER_6(gvec_vstrc_cc8, void, ptr, cptr, cptr, cptr, env, i32)
+DEF_HELPER_6(gvec_vstrc_cc16, void, ptr, cptr, cptr, cptr, env, i32)
+DEF_HELPER_6(gvec_vstrc_cc32, void, ptr, cptr, cptr, cptr, env, i32)
 
 #ifndef CONFIG_USER_ONLY
 DEF_HELPER_3(servc, i32, env, i64, i64)
diff --git a/target/s390x/insn-data.def b/target/s390x/insn-data.def
index b4a6b59608..a2969fab58 100644
--- a/target/s390x/insn-data.def
+++ b/target/s390x/insn-data.def
@@ -1201,6 +1201,8 @@
     F(0xe781, VFENE,   VRR_b, V,   0, 0, 0, 0, vfene, 0, IF_VEC)
 /* VECTOR ISOLATE STRING */
     F(0xe75c, VISTR,   VRR_a, V,   0, 0, 0, 0, vistr, 0, IF_VEC)
+/* VECTOR STRING RANGE COMPARE */
+    F(0xe78a, VSTRC,   VRR_d, V,   0, 0, 0, 0, vstrc, 0, IF_VEC)
 
 #ifndef CONFIG_USER_ONLY
 /* COMPARE AND SWAP AND PURGE */
diff --git a/target/s390x/translate_vx.inc.c b/target/s390x/translate_vx.inc.c
index 437b416b4a..62a8d4d738 100644
--- a/target/s390x/translate_vx.inc.c
+++ b/target/s390x/translate_vx.inc.c
@@ -217,6 +217,10 @@ static void get_vec_element_ptr_i64(TCGv_ptr ptr, uint8_t reg, TCGv_i64 enr,
     tcg_gen_gvec_4_ool(vec_full_reg_offset(v1), vec_full_reg_offset(v2), \
                        vec_full_reg_offset(v3), vec_full_reg_offset(v4), \
                        16, 16, data, fn)
+#define gen_gvec_4_ptr(v1, v2, v3, v4, ptr, data, fn) \
+    tcg_gen_gvec_4_ptr(vec_full_reg_offset(v1), vec_full_reg_offset(v2), \
+                       vec_full_reg_offset(v3), vec_full_reg_offset(v4), \
+                       ptr, 16, 16, data, fn)
 #define gen_gvec_dup_i64(es, v1, c) \
     tcg_gen_gvec_dup_i64(es, vec_full_reg_offset(v1), 16, 16, c)
 #define gen_gvec_mov(v1, v2) \
@@ -2480,3 +2484,36 @@ static DisasJumpType op_vistr(DisasContext *s, DisasOps *o)
     }
     return DISAS_NEXT;
 }
+
+static DisasJumpType op_vstrc(DisasContext *s, DisasOps *o)
+{
+    const uint8_t es = get_field(s->fields, m5);
+    const uint8_t m6 = get_field(s->fields, m6);
+    static gen_helper_gvec_4_ptr * const cc[3] = {
+        gen_helper_gvec_vstrc_cc8,
+        gen_helper_gvec_vstrc_cc16,
+        gen_helper_gvec_vstrc_cc32,
+    };
+    static gen_helper_gvec_4 * const nocc[3] = {
+        gen_helper_gvec_vstrc8,
+        gen_helper_gvec_vstrc16,
+        gen_helper_gvec_vstrc32,
+    };
+
+    if (es > ES_32) {
+        gen_program_exception(s, PGM_SPECIFICATION);
+        return DISAS_NORETURN;
+    }
+
+    if (m6 & 1) {
+        gen_gvec_4_ptr(get_field(s->fields, v1), get_field(s->fields, v2),
+                       get_field(s->fields, v3), get_field(s->fields, v4),
+                       cpu_env, m6, cc[es]);
+        set_cc_static(s);
+    } else {
+        gen_gvec_4_ool(get_field(s->fields, v1), get_field(s->fields, v2),
+                       get_field(s->fields, v3), get_field(s->fields, v4), m6,
+                       nocc[es]);
+    }
+    return DISAS_NEXT;
+}
diff --git a/target/s390x/vec_string_helper.c b/target/s390x/vec_string_helper.c
index 2e998c21a2..6d6dbfa061 100644
--- a/target/s390x/vec_string_helper.c
+++ b/target/s390x/vec_string_helper.c
@@ -248,3 +248,111 @@ void HELPER(gvec_vistr_cc##BITS)(void *v1, const void *v2,                     \
 DEF_VISTR_CC_HELPER(8)
 DEF_VISTR_CC_HELPER(16)
 DEF_VISTR_CC_HELPER(32)
+
+#define DEF_ELEMENT_COMPARE(BITS)                                              \
+static bool element_compare##BITS(uint##BITS##_t data, uint##BITS##_t l,       \
+                                  uint##BITS##_t c)                            \
+{                                                                              \
+    const bool equal = extract32(c, BITS - 1, 1);                              \
+    const bool lower = extract32(c, BITS - 2, 1);                              \
+    const bool higher = extract32(c, BITS - 3, 1);                             \
+                                                                               \
+    if (equal && data == l) {                                                  \
+        return true;                                                           \
+    } else if (lower && data < l) {                                            \
+        return true;                                                           \
+    } else if (higher && data > l) {                                           \
+        return true;                                                           \
+    }                                                                          \
+    return false;                                                              \
+}
+DEF_ELEMENT_COMPARE(8)
+DEF_ELEMENT_COMPARE(16)
+DEF_ELEMENT_COMPARE(32)
+
+#define DEF_VSTRC(BITS)                                                        \
+static int vstrc##BITS(void *v1, const void *v2, const void *v3,               \
+                       const void *v4, uint8_t m6)                             \
+{                                                                              \
+    const bool in = extract32(m6, 3, 1);                                       \
+    const bool rt = extract32(m6, 2, 1);                                       \
+    const bool zs = extract32(m6, 1, 1);                                       \
+    S390Vector tmp = {};                                                       \
+    int first_byte = 16;                                                       \
+    int cc = 3; /* no match */                                                 \
+    int i, j;                                                                  \
+                                                                               \
+    for (i = 0; i < (128 / BITS); i++) {                                       \
+        const uint##BITS##_t data = s390_vec_read_element##BITS(v2, i);        \
+        bool any_comp = false;                                                 \
+                                                                               \
+        if (zs && !data) {                                                     \
+            if (cc == 3) {                                                     \
+                first_byte = i * (BITS / 8);                                   \
+                cc = 0; /* match for zero */                                   \
+            } else if (cc != 0) {                                              \
+                cc = 2; /* matching elements before match for zero */          \
+            }                                                                  \
+            if (!rt) {                                                         \
+                break;                                                         \
+            }                                                                  \
+        }                                                                      \
+                                                                               \
+        /* compare against every even-odd range pair */                        \
+        for (j = 0; j < (128 / BITS); j += 2) {                                \
+            const uint##BITS##_t l1 = s390_vec_read_element##BITS(v3, j);      \
+            const uint##BITS##_t c1 = s390_vec_read_element##BITS(v4, j);      \
+            const uint##BITS##_t l2 = s390_vec_read_element##BITS(v3, j + 1);  \
+            const uint##BITS##_t c2 = s390_vec_read_element##BITS(v4, j + 1);  \
+                                                                               \
+            if (element_compare##BITS(data, l1, c1) &&                         \
+                element_compare##BITS(data, l2, c2)) {                         \
+                any_comp = true;                                               \
+                break;                                                         \
+            }                                                                  \
+        }                                                                      \
+                                                                               \
+        /* invert the result if requested */                                   \
+        any_comp = in ^ any_comp;                                              \
+        if (cc == 3 && any_comp) {                                             \
+            first_byte = i * (BITS / 8);                                       \
+            cc = 1; /* matching elements, no match for zero */                 \
+            if (!zs && !rt) {                                                  \
+                break;                                                         \
+            }                                                                  \
+        }                                                                      \
+        /* indicate bit vector if requested */                                 \
+        if (rt && any_comp) {                                                  \
+            s390_vec_write_element##BITS(&tmp, i, (uint##BITS##_t)-1ull);      \
+        }                                                                      \
+    }                                                                          \
+    if (!rt) {                                                                 \
+        s390_vec_write_element8(&tmp, 7, first_byte);                          \
+    }                                                                          \
+    *(S390Vector *)v1 = tmp;                                                   \
+    return cc;                                                                 \
+}
+DEF_VSTRC(8)
+DEF_VSTRC(16)
+DEF_VSTRC(32)
+
+#define DEF_VSTRC_HELPER(BITS)                                                 \
+void HELPER(gvec_vstrc##BITS)(void *v1, const void *v2, const void *v3,        \
+                               const void *v4, uint32_t desc)                  \
+{                                                                              \
+    vstrc##BITS(v1, v2, v3, v4, simd_data(desc));                              \
+}
+DEF_VSTRC_HELPER(8)
+DEF_VSTRC_HELPER(16)
+DEF_VSTRC_HELPER(32)
+
+#define DEF_VSTRC_CC_HELPER(BITS)                                              \
+void HELPER(gvec_vstrc_cc##BITS)(void *v1, const void *v2, const void *v3,     \
+                                 const void *v4, CPUS390XState *env,           \
+                                 uint32_t desc)                                \
+{                                                                              \
+    env->cc_op = vstrc##BITS(v1, v2, v3, v4, simd_data(desc));                 \
+}
+DEF_VSTRC_CC_HELPER(8)
+DEF_VSTRC_CC_HELPER(16)
+DEF_VSTRC_CC_HELPER(32)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL David Hildenbrand
@ 2019-05-17 16:16   ` Richard Henderson
  2019-05-20  9:51     ` David Hildenbrand
  2019-05-22 11:01     ` David Hildenbrand
  0 siblings, 2 replies; 27+ messages in thread
From: Richard Henderson @ 2019-05-17 16:16 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 5/15/19 1:31 PM, David Hildenbrand wrote:
> +#define DEF_VFAE(BITS)                                                         \
> +static int vfae##BITS(void *v1, const void *v2, const void *v3, uint8_t m5)    


First, because this *is* complicated stuff, can we find a way to use inline
functions instead of an undebuggable macro for this?  Perhaps a different set
of wrappers than s390_vec_read_element##BITS, which always return uint32_t, so
that they have a constant signature?

> +        if (zs && !data) {
> +            if (cc == 3) {
> +                first_byte = i * (BITS / 8);
> +                cc = 0; /* match for zero */
> +            } else if (cc != 0) {
> +                cc = 2; /* matching elements before match for zero */
> +            }
> +            if (!rt) {
> +                break;
> +            }
> +        }    

So here we are computing the second intermediate result.

> +        /* try to match with any other element from the other vector */
> +        for (j = 0; j < (128 / BITS); j++) {
> +            if (data == s390_vec_read_element##BITS(v3, j)) {
> +                any_equal = true;
> +                break;
> +            }
> +        }

And here the first intermediate result,

> +        /* invert the result if requested */
> +        any_equal = in ^ any_equal;

... inverted, if requested,

> +        if (cc == 3 && any_equal) {
> +            first_byte = i * (BITS / 8);
> +            cc = 1; /* matching elements, no match for zero */
> +            if (!zs && !rt) {
> +                break;
> +            }
> +        }

> +        /* indicate bit vector if requested */
> +        if (rt && any_equal) {
> +            s390_vec_write_element##BITS(&tmp, i, (uint##BITS##_t)-1ull);
> +        }

... writing out (some of) the results of the first intermediate result.

> +    }
> +    if (!rt) {
> +        s390_vec_write_element8(&tmp, 7, first_byte);
> +    }

... writing out the rest of the first intermediate result.

I wonder if it wouldn't be clearer, within the loop, to do

	if (any_equal) {
	    if (cc == 3) {
		first_byte = ...;
		cc = 1;
	    }
	    if (rt) {
		write element -1;
	    } else if (!zs) {
		break;
	    }
	}

I also think that, if we create a bunch more of these wrappers:

> +DEF_VFAE_HELPER(8)
> +DEF_VFAE_HELPER(16)
> +DEF_VFAE_HELPER(32)

then RT and ZS can be passed in as constant parameters to the above, and then
the compiler will fold away all of the stuff that's not needed for each
different case.  Which, I think, is significant.  These are practically
different instructions with the different modifiers.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/5] s390x/tcg: Implement VECTOR FIND ELEMENT EQUAL
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 2/5] s390x/tcg: Implement VECTOR FIND " David Hildenbrand
@ 2019-05-17 16:47   ` Richard Henderson
  2019-05-17 17:42     ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-05-17 16:47 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 5/15/19 1:31 PM, David Hildenbrand wrote:
> +#define DEF_VFEE(BITS)                                                         

Same comment wrt inline functions applies.

Here, because there's one result, writing to byte 7, I wonder if it isn't
clearer to write the loop

    first_equal = n;
    first_zero = n;
    for (i = n - 1; i >= 0; --i) {
        if (data1 == data2) {
            first_equal = i;
        }
        if (data1 == 0) {
            first_zero = i;
        }
    }

// As an aside, there are bit tricks for the above,
// but let's stay simple(r) for now.

    if (zs) {
        if (first_equal < first_zero) {
            cc = (first_zero < n ? 2 : 1);
        } else {
            first_equal = first_zero;
            cc = (first_zero < n ? 0 : 3);
        }
    } else {
        cc = (first_equal < n ? 1 : 3);
    }
    s390_vec_write_element64(v1, 0, first_equal);
    s390_vec_write_element64(v1, 1, 0);

Note that you don't need S390Vector tmp, since the result is written after all
of the inputs are consumed.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/5] s390x/tcg: Implement VECTOR FIND ELEMENT EQUAL
  2019-05-17 16:47   ` Richard Henderson
@ 2019-05-17 17:42     ` Richard Henderson
  2019-05-20  9:17       ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-05-17 17:42 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 5/17/19 9:47 AM, Richard Henderson wrote:
>     first_equal = n;
>     first_zero = n;
>     for (i = n - 1; i >= 0; --i) {
>         if (data1 == data2) {
>             first_equal = i;
>         }
>         if (data1 == 0) {
>             first_zero = i;
>         }
>     }
> 
> // As an aside, there are bit tricks for the above,
> // but let's stay simple(r) for now.

What the hell, it's not /that/ tricky.


/*
 * Returns a bit set in the MSB of each element that is zero,
 * as defined by the mask M.
 */
static inline uint64_t zero_search(uint64_t a, uint64_t m)
{
    return ~(((a & m) + m) | a | m);
}

/*
 * Returns the byte offset for the first match, or 16 for no match.
 */
static inline int match_index(uint64_t c0, uint64_t c1)
{
    return (c0 ? clz64(c0) : clz64(c1) + 64) >> 3;
}

Use

  dup_const(MO_8, 0x7f)
  dup_const(MO_16, 0x7fff)
  dup_const(MO_32, 0x7fffffff)

for the M parameter for the different element sizes.

    uint64_t a0, a1, b0, b1, e0, e1, z0, z1;

    a0 = s390_vec_read_element64(v2, 0);
    a1 = s390_vec_read_element64(v2, 1);
    b0 = s390_vec_read_element64(v3, 0);
    b1 = s390_vec_read_element64(v3, 1);
    e0 = zero_search(a0 ^ b0, m);
    e1 = zero_search(a1 ^ b1, m);
    first_equal = match_index(e0, e1);

    if (zs) {
        z0 = zero_search(a0, m);
        z1 = zero_search(a1, m);
        first_zero = match_index(z0, z1);
    ...


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 3/5] s390x/tcg: Implement VECTOR FIND ELEMENT NOT EQUAL
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 3/5] s390x/tcg: Implement VECTOR FIND ELEMENT NOT EQUAL David Hildenbrand
@ 2019-05-17 17:56   ` Richard Henderson
  2019-05-20  9:48     ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-05-17 17:56 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 5/15/19 1:31 PM, David Hildenbrand wrote:
> Similar to VECTOR FIND ELEMENT EQUAL, however the search also stops on
> any inequality. A match for inequality seems to have precedence over
> a match for zero, because both elements have to be zero.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  target/s390x/helper.h            |  6 ++++
>  target/s390x/insn-data.def       |  2 ++
>  target/s390x/translate_vx.inc.c  | 31 +++++++++++++++++++
>  target/s390x/vec_string_helper.c | 53 ++++++++++++++++++++++++++++++++
>  4 files changed, 92 insertions(+)

Like the previous, only with

static inline uint64_t nonzero_search(uint64_t a, uint64_t m)
{
    return (((a & m) + m) | a) & ~m;
}

for the inequality.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 4/5] s390x/tcg: Implement VECTOR ISOLATE STRING
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 4/5] s390x/tcg: Implement VECTOR ISOLATE STRING David Hildenbrand
@ 2019-05-17 18:20   ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2019-05-17 18:20 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 5/15/19 1:31 PM, David Hildenbrand wrote:
> +#define DEF_VISTR(BITS)                                                        \
> +static int vistr##BITS(void *v1, const void *v2)
> +{
> +    S390Vector tmp = {};
> +    int i, cc = 3;
> +
> +    for (i = 0; i < (128 / BITS); i++) {
> +        const uint##BITS##_t data = s390_vec_read_element##BITS(v2, i);
> +
> +        if (!data) {
> +            cc = 0;
> +            break;
> +        }
> +        s390_vec_write_element##BITS(&tmp, i, data);
> +    }
> +    *(S390Vector *)v1 = tmp;
> +    return cc;
> +}
> +DEF_VISTR(8)
> +DEF_VISTR(16)
> +DEF_VISTR(32)

Based on previous, this becomes

    cc = 3;
    a0 = s390_vec_read_element64(v2, 0);
    a1 = s390_vec_read_element64(v2, 1);

    z0 = zero_search(a0, m);
    if (z0) {
        i = clz64(z0);
        a0 &= ~(UINT64_MAX >> i);
        a1 = 0;
        cc = 0;
    } else {
        z1 = zero_search(a1, m);
        if (z1) {
            i = clz64(z1);
            a1 &= ~(UINT64_MAX >> i);
            cc = 0;
        }
    }

    s390_vec_write_element64(v1, 0, a0);
    s390_vec_write_element64(v1, 1, a1);
    return cc;


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 5/5] s390x/tcg: Implement VECTOR STRING RANGE COMPARE
  2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 5/5] s390x/tcg: Implement VECTOR STRING RANGE COMPARE David Hildenbrand
@ 2019-05-17 18:37   ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2019-05-17 18:37 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 5/15/19 1:31 PM, David Hildenbrand wrote:> +    const bool equal =
extract32(c, BITS - 1, 1);> +    const bool lower = extract32(c, BITS - 2, 1);>
+    const bool higher = extract32(c, BITS - 3, 1);> +> +    if (equal && data
== l) {> +        return true;> +    } else if (lower && data < l) {> +
return true;> +    } else if (higher && data > l) {> +        return true;> +
  }> +    return false;
Only one of the data comparisons can be true.  Better as

    if (data < l) {
        return lower;
    } else if (data > l) {
        return higher;
    } else {
        return equal;
    }

Otherwise, again, let's see if we can turn these into inlines.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 2/5] s390x/tcg: Implement VECTOR FIND ELEMENT EQUAL
  2019-05-17 17:42     ` Richard Henderson
@ 2019-05-20  9:17       ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2019-05-20  9:17 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 17.05.19 19:42, Richard Henderson wrote:
> On 5/17/19 9:47 AM, Richard Henderson wrote:
>>     first_equal = n;
>>     first_zero = n;
>>     for (i = n - 1; i >= 0; --i) {
>>         if (data1 == data2) {
>>             first_equal = i;
>>         }
>>         if (data1 == 0) {
>>             first_zero = i;
>>         }
>>     }
>>
>> // As an aside, there are bit tricks for the above,
>> // but let's stay simple(r) for now.
> 
> What the hell, it's not /that/ tricky.
> 
> 
> /*
>  * Returns a bit set in the MSB of each element that is zero,
>  * as defined by the mask M.
>  */
> static inline uint64_t zero_search(uint64_t a, uint64_t m)
> {
>     return ~(((a & m) + m) | a | m);
> }
> 
> /*
>  * Returns the byte offset for the first match, or 16 for no match.
>  */
> static inline int match_index(uint64_t c0, uint64_t c1)
> {
>     return (c0 ? clz64(c0) : clz64(c1) + 64) >> 3;
> }
> 
> Use
> 
>   dup_const(MO_8, 0x7f)
>   dup_const(MO_16, 0x7fff)
>   dup_const(MO_32, 0x7fffffff)
> 
> for the M parameter for the different element sizes.
> 
>     uint64_t a0, a1, b0, b1, e0, e1, z0, z1;
> 
>     a0 = s390_vec_read_element64(v2, 0);
>     a1 = s390_vec_read_element64(v2, 1);
>     b0 = s390_vec_read_element64(v3, 0);
>     b1 = s390_vec_read_element64(v3, 1);
>     e0 = zero_search(a0 ^ b0, m);
>     e1 = zero_search(a1 ^ b1, m);
>     first_equal = match_index(e0, e1);
> 
>     if (zs) {
>         z0 = zero_search(a0, m);
>         z1 = zero_search(a1, m);
>         first_zero = match_index(z0, z1);
>     ...
> 
> 
> r~
> 


Crazy stuff, seems to work (not that I am surprised :D )

I now have:

+static int vfee(void *v1, const void *v2, const void *v3, bool zs,
uint8_t es)
+{
+    const uint64_t mask = dup_const(es, -1ull >> (65 - (1 << es) * 8));
+    uint64_t a0, a1, b0, b1, e0, e1, z0, z1;
+    uint64_t first_zero = 16;
+    uint64_t first_equal;
+
+    a0 = s390_vec_read_element64(v2, 0);
+    a1 = s390_vec_read_element64(v2, 1);
+    b0 = s390_vec_read_element64(v3, 0);
+    b1 = s390_vec_read_element64(v3, 1);
+    e0 = zero_search(a0 ^ b0, mask);
+    e1 = zero_search(a1 ^ b1, mask);
+    first_equal = match_index(e0, e1);
+
+    if (zs) {
+        z0 = zero_search(a0, mask);
+        z1 = zero_search(a1, mask);
+        first_zero = match_index(z0, z1);
+    }
+
+    /* zero out the destination vector */
+    s390_vec_write_element64(v1, 0, 0);
+    s390_vec_write_element64(v1, 1, 0);
+
+    if (first_zero == 16 && first_equal == 16) {
+        s390_vec_write_element8(v1, 7, 16);
+        return 3; /* no match */
+    } else if (first_zero == 16) {
+        s390_vec_write_element8(v1, 7, first_equal);
+        return 1; /* matching elements, no match for zero */
+    } else if (first_equal < first_zero) {
+        s390_vec_write_element8(v1, 7, first_equal);
+        return 2; /* matching elements before match for zero */
+    }
+    s390_vec_write_element8(v1, 7, first_zero);
+    return 0; /* match for zero */
+}


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 3/5] s390x/tcg: Implement VECTOR FIND ELEMENT NOT EQUAL
  2019-05-17 17:56   ` Richard Henderson
@ 2019-05-20  9:48     ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2019-05-20  9:48 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 17.05.19 19:56, Richard Henderson wrote:
> On 5/15/19 1:31 PM, David Hildenbrand wrote:
>> Similar to VECTOR FIND ELEMENT EQUAL, however the search also stops on
>> any inequality. A match for inequality seems to have precedence over
>> a match for zero, because both elements have to be zero.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  target/s390x/helper.h            |  6 ++++
>>  target/s390x/insn-data.def       |  2 ++
>>  target/s390x/translate_vx.inc.c  | 31 +++++++++++++++++++
>>  target/s390x/vec_string_helper.c | 53 ++++++++++++++++++++++++++++++++
>>  4 files changed, 92 insertions(+)
> 
> Like the previous, only with
> 
> static inline uint64_t nonzero_search(uint64_t a, uint64_t m)
> {
>     return (((a & m) + m) | a) & ~m;
> }
> 
> for the inequality.
> 
> 
> r~
> 

It's a little bit more tricky. because we have to identify the smaller
element. Right now I have this

+static int vfene(void *v1, const void *v2, const void *v3, bool zs,
uint8_t es)
+{
+    const uint64_t mask = dup_const(es, -1ull >> (65 - (1 << es) * 8));
+    uint64_t a0, a1, b0, b1, e0, e1, z0, z1;
+    uint64_t first_zero = 16;
+    uint64_t first_inequal;
+    bool smaller = false;
+
+    a0 = s390_vec_read_element64(v2, 0);
+    a1 = s390_vec_read_element64(v2, 1);
+    b0 = s390_vec_read_element64(v3, 0);
+    b1 = s390_vec_read_element64(v3, 1);
+    e0 = nonzero_search(a0 ^ b0, mask);
+    e1 = nonzero_search(a1 ^ b1, mask);
+    first_inequal = match_index(e0, e1);
+
+    /* identify the smaller element */
+    if (first_inequal < 16) {
+        uint8_t enr = first_inequal / (1 << es);
+        uint32_t a = s390_vec_read_element(v2, enr, es);
+        uint32_t b = s390_vec_read_element(v3, enr, es);
+        smaller = a < b;
+    }
+
+    if (zs) {
+        z0 = zero_search(a0, mask);
+        z1 = zero_search(a1, mask);
+        first_zero = match_index(z0, z1);
+    }
+
+    /* zero out the destination vector */
+    s390_vec_write_element64(v1, 0, 0);
+    s390_vec_write_element64(v1, 1, 0);
+
+    if (first_zero == 16 && first_inequal == 16) {
+        s390_vec_write_element8(v1, 7, 16);
+        return 3;
+    } else if (first_zero < first_inequal) {
+        s390_vec_write_element8(v1, 7, first_zero);
+        return 0;
+    }
+    s390_vec_write_element8(v1, 7, first_inequal);
+    return smaller ? 1 : 2;
+}


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-17 16:16   ` Richard Henderson
@ 2019-05-20  9:51     ` David Hildenbrand
  2019-05-22 11:01     ` David Hildenbrand
  1 sibling, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2019-05-20  9:51 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 17.05.19 18:16, Richard Henderson wrote:
> On 5/15/19 1:31 PM, David Hildenbrand wrote:
>> +#define DEF_VFAE(BITS)                                                         \
>> +static int vfae##BITS(void *v1, const void *v2, const void *v3, uint8_t m5)    
> 
> 
> First, because this *is* complicated stuff, can we find a way to use inline
> functions instead of an undebuggable macro for this?  Perhaps a different set
> of wrappers than s390_vec_read_element##BITS, which always return uint32_t, so
> that they have a constant signature?

For vfene I have for now

+static inline uint64_t s390_vec_read_element(const S390Vector *v,
uint8_t enr,
+                                             uint8_t es)
+{
+    switch (es) {
+    case MO_8:
+        return s390_vec_read_element8(v, enr);
+    case MO_16:
+        return s390_vec_read_element16(v, enr);
+    case MO_32:
+        return s390_vec_read_element32(v, enr);
+    case MO_64:
+        return s390_vec_read_element64(v, enr);
+    default:
+        g_assert_not_reached();
+    }
+}
+

Which we could reuse here.

I'll try to look into using a inline function instead, passing in the
element size and other flags, so the compiler can specialize.

Thanks!

> 
>> +        if (zs && !data) {
>> +            if (cc == 3) {
>> +                first_byte = i * (BITS / 8);
>> +                cc = 0; /* match for zero */
>> +            } else if (cc != 0) {
>> +                cc = 2; /* matching elements before match for zero */
>> +            }
>> +            if (!rt) {
>> +                break;
>> +            }
>> +        }    
> 
> So here we are computing the second intermediate result.
> 
>> +        /* try to match with any other element from the other vector */
>> +        for (j = 0; j < (128 / BITS); j++) {
>> +            if (data == s390_vec_read_element##BITS(v3, j)) {
>> +                any_equal = true;
>> +                break;
>> +            }
>> +        }
> 
> And here the first intermediate result,
> 
>> +        /* invert the result if requested */
>> +        any_equal = in ^ any_equal;
> 
> ... inverted, if requested,
> 
>> +        if (cc == 3 && any_equal) {
>> +            first_byte = i * (BITS / 8);
>> +            cc = 1; /* matching elements, no match for zero */
>> +            if (!zs && !rt) {
>> +                break;
>> +            }
>> +        }
> 
>> +        /* indicate bit vector if requested */
>> +        if (rt && any_equal) {
>> +            s390_vec_write_element##BITS(&tmp, i, (uint##BITS##_t)-1ull);
>> +        }
> 
> ... writing out (some of) the results of the first intermediate result.
> 
>> +    }
>> +    if (!rt) {
>> +        s390_vec_write_element8(&tmp, 7, first_byte);
>> +    }
> 
> ... writing out the rest of the first intermediate result.
> 
> I wonder if it wouldn't be clearer, within the loop, to do
> 
> 	if (any_equal) {
> 	    if (cc == 3) {
> 		first_byte = ...;
> 		cc = 1;
> 	    }
> 	    if (rt) {
> 		write element -1;
> 	    } else if (!zs) {
> 		break;
> 	    }
> 	}
> 
> I also think that, if we create a bunch more of these wrappers:
> 
>> +DEF_VFAE_HELPER(8)
>> +DEF_VFAE_HELPER(16)
>> +DEF_VFAE_HELPER(32)
> 
> then RT and ZS can be passed in as constant parameters to the above, and then
> the compiler will fold away all of the stuff that's not needed for each
> different case.  Which, I think, is significant.  These are practically
> different instructions with the different modifiers.
> 
> 
> r~
> 


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-17 16:16   ` Richard Henderson
  2019-05-20  9:51     ` David Hildenbrand
@ 2019-05-22 11:01     ` David Hildenbrand
  2019-05-22 11:09       ` Richard Henderson
  1 sibling, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-22 11:01 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson


> I also think that, if we create a bunch more of these wrappers:
> 
>> +DEF_VFAE_HELPER(8)
>> +DEF_VFAE_HELPER(16)
>> +DEF_VFAE_HELPER(32)
> 
> then RT and ZS can be passed in as constant parameters to the above, and then
> the compiler will fold away all of the stuff that's not needed for each
> different case.  Which, I think, is significant.  These are practically
> different instructions with the different modifiers.
> 

So, we have 4 flags, resulting in 16 variants. Times 3 element sizes ...
48 helpers in total. Do we really want to go down that path?

I can also go ahead any try to identify the most frequent users (in
Linux) and only specialize that one.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-22 11:01     ` David Hildenbrand
@ 2019-05-22 11:09       ` Richard Henderson
  2019-05-22 11:16         ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-05-22 11:09 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 5/22/19 7:01 AM, David Hildenbrand wrote:
> 
>> I also think that, if we create a bunch more of these wrappers:
>>
>>> +DEF_VFAE_HELPER(8)
>>> +DEF_VFAE_HELPER(16)
>>> +DEF_VFAE_HELPER(32)
>>
>> then RT and ZS can be passed in as constant parameters to the above, and then
>> the compiler will fold away all of the stuff that's not needed for each
>> different case.  Which, I think, is significant.  These are practically
>> different instructions with the different modifiers.
>>
> 
> So, we have 4 flags, resulting in 16 variants. Times 3 element sizes ...
> 48 helpers in total. Do we really want to go down that path?

Maybe?

> I can also go ahead any try to identify the most frequent users (in
> Linux) and only specialize that one.

Also plausible.  I guess it would be good to know, anyway.

I think RT probably makes the largest difference to the layout of the function,
so maybe that's the one we pick.  We could also leave our options open and make
the 3 non-CC flags be parameters to the inline function, just extract them from
the M4 parameter at the one higher level.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-22 11:09       ` Richard Henderson
@ 2019-05-22 11:16         ` David Hildenbrand
  2019-05-22 15:59           ` Richard Henderson
  2019-05-23 10:58           ` Alex Bennée
  0 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2019-05-22 11:16 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Thomas Huth, Richard Henderson

On 22.05.19 13:09, Richard Henderson wrote:
> On 5/22/19 7:01 AM, David Hildenbrand wrote:
>>
>>> I also think that, if we create a bunch more of these wrappers:
>>>
>>>> +DEF_VFAE_HELPER(8)
>>>> +DEF_VFAE_HELPER(16)
>>>> +DEF_VFAE_HELPER(32)
>>>
>>> then RT and ZS can be passed in as constant parameters to the above, and then
>>> the compiler will fold away all of the stuff that's not needed for each
>>> different case.  Which, I think, is significant.  These are practically
>>> different instructions with the different modifiers.
>>>
>>
>> So, we have 4 flags, resulting in 16 variants. Times 3 element sizes ...
>> 48 helpers in total. Do we really want to go down that path?
> 
> Maybe?

Hope my fingers won't bleed from all the copy-pasting ;)

> 
>> I can also go ahead any try to identify the most frequent users (in
>> Linux) and only specialize that one.
> 
> Also plausible.  I guess it would be good to know, anyway.

I'll dump the parameters when booting Linux. My gut feeling is that the
cc option is basically never used ...

> 
> I think RT probably makes the largest difference to the layout of the function,

Yes. I think the RT and ZS make the biggest difference. IN - not really
that heavy.

Maybe use different variants for RT and ZS for the !CC casen only.

> so maybe that's the one we pick.  We could also leave our options open and make
> the 3 non-CC flags be parameters to the inline function, just extract them from
> the M4 parameter at the one higher level.

That one, I have already done :)

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-22 11:16         ` David Hildenbrand
@ 2019-05-22 15:59           ` Richard Henderson
  2019-05-22 18:16             ` David Hildenbrand
  2019-05-23 10:58           ` Alex Bennée
  1 sibling, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-05-22 15:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, qemu-s390x, Cornelia Huck,
	qemu-devel@nongnu.org Developers, Richard Henderson

On Wed, 22 May 2019 at 07:16, David Hildenbrand <david@redhat.com> wrote:
> > Also plausible.  I guess it would be good to know, anyway.
>
> I'll dump the parameters when booting Linux. My gut feeling is that the
> cc option is basically never used ...

It looks like our intuition is wrong about that.

rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r vfaezbs * | wc -l
15

These set cc, use zs, and do not use rt.

rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r 'vfaeb' * | wc -l
3

These do not set cc, do not use zs, and do use rt.

Those are the only two VFAE forms used by glibc (note that the same
variants as 'f' are used by the wide-character strings).


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-22 15:59           ` Richard Henderson
@ 2019-05-22 18:16             ` David Hildenbrand
  2019-05-22 18:46               ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-22 18:16 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Thomas Huth, qemu-s390x, Cornelia Huck,
	qemu-devel@nongnu.org Developers, Richard Henderson

On 22.05.19 17:59, Richard Henderson wrote:
> On Wed, 22 May 2019 at 07:16, David Hildenbrand <david@redhat.com> wrote:
>>> Also plausible.  I guess it would be good to know, anyway.
>>
>> I'll dump the parameters when booting Linux. My gut feeling is that the
>> cc option is basically never used ...
> 
> It looks like our intuition is wrong about that.

Thanks for checking!

> 
> rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r vfaezbs * | wc -l
> 15
> 
> These set cc, use zs, and do not use rt.
> 
> rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r 'vfaeb' * | wc -l
> 3
> 
> These do not set cc, do not use zs, and do use rt.
> 
> Those are the only two VFAE forms used by glibc (note that the same
> variants as 'f' are used by the wide-character strings).
> 

I guess "rt" and "cc" make the biggest difference. Maybe special case
these two, result in 4 variants for each of the 3 element sizes?

> 
> r~
> 


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-22 18:16             ` David Hildenbrand
@ 2019-05-22 18:46               ` Richard Henderson
  2019-05-23  7:50                 ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-05-22 18:46 UTC (permalink / raw)
  To: David Hildenbrand, Richard Henderson
  Cc: Thomas Huth, qemu-s390x, Cornelia Huck, qemu-devel@nongnu.org Developers

On 5/22/19 2:16 PM, David Hildenbrand wrote:
> On 22.05.19 17:59, Richard Henderson wrote:
>> On Wed, 22 May 2019 at 07:16, David Hildenbrand <david@redhat.com> wrote:
>>>> Also plausible.  I guess it would be good to know, anyway.
>>>
>>> I'll dump the parameters when booting Linux. My gut feeling is that the
>>> cc option is basically never used ...
>>
>> It looks like our intuition is wrong about that.
> 
> Thanks for checking!
> 
>>
>> rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r vfaezbs * | wc -l
>> 15
>>
>> These set cc, use zs, and do not use rt.
>>
>> rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r 'vfaeb' * | wc -l
>> 3
>>
>> These do not set cc, do not use zs, and do use rt.
>>
>> Those are the only two VFAE forms used by glibc (note that the same
>> variants as 'f' are used by the wide-character strings).
>>
> 
> I guess "rt" and "cc" make the biggest difference. Maybe special case
> these two, result in 4 variants for each of the 3 element sizes?

Sounds good.


r~



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-22 18:46               ` Richard Henderson
@ 2019-05-23  7:50                 ` David Hildenbrand
  2019-05-23 12:27                   ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-23  7:50 UTC (permalink / raw)
  To: Richard Henderson, Richard Henderson
  Cc: Thomas Huth, qemu-s390x, Cornelia Huck, qemu-devel@nongnu.org Developers

On 22.05.19 20:46, Richard Henderson wrote:
> On 5/22/19 2:16 PM, David Hildenbrand wrote:
>> On 22.05.19 17:59, Richard Henderson wrote:
>>> On Wed, 22 May 2019 at 07:16, David Hildenbrand <david@redhat.com> wrote:
>>>>> Also plausible.  I guess it would be good to know, anyway.
>>>>
>>>> I'll dump the parameters when booting Linux. My gut feeling is that the
>>>> cc option is basically never used ...
>>>
>>> It looks like our intuition is wrong about that.
>>
>> Thanks for checking!
>>
>>>
>>> rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r vfaezbs * | wc -l
>>> 15
>>>
>>> These set cc, use zs, and do not use rt.
>>>
>>> rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r 'vfaeb' * | wc -l
>>> 3
>>>
>>> These do not set cc, do not use zs, and do use rt.
>>>
>>> Those are the only two VFAE forms used by glibc (note that the same
>>> variants as 'f' are used by the wide-character strings).
>>>
>>
>> I guess "rt" and "cc" make the biggest difference. Maybe special case
>> these two, result in 4 variants for each of the 3 element sizes?
> 
> Sounds good.
> 

So .... after all it might not be necessary, at least not for this
helper :) Using your crazy helper functions, I have this right now:

/*
 * Returns the number of bits composing one element.
 */
static uint8_t get_element_bits(uint8_t es)
{
    return (1 << es) * BITS_PER_BYTE;
}

/*
 * Returns the bitmask for a single element.
 */
static uint64_t get_single_element_mask(uint8_t es)
{
    return -1ull >> (64 - get_element_bits(es));
}

/*
 * Returns the bitmask for a single element (excluding the MSB).
 */
static uint64_t get_single_element_lsbs_mask(uint8_t es)
{
    return -1ull >> (65 - get_element_bits(es));
}

/*
 * Returns the bitmasks for multiple elements (excluding the MSBs).
 */
static uint64_t get_element_lsbs_mask(uint8_t es)
{
    return dup_const(es, get_single_element_lsbs_mask(es));
}

static int vfae(void *v1, const void *v2, const void *v3, bool in,
                bool rt, bool zs, uint8_t es)
{
    const uint64_t mask = get_element_lsbs_mask(es);
    const int bits = get_element_bits(es);
    uint64_t a0, a1, b0, b1, e0, e1, t0, t1, z0, z1;
    uint64_t first_zero = 16;
    uint64_t first_equal;
    int i;

    a0 = s390_vec_read_element64(v2, 0);
    a1 = s390_vec_read_element64(v2, 1);
    b0 = s390_vec_read_element64(v3, 0);
    b1 = s390_vec_read_element64(v3, 1);
    e0 = 0;
    e1 = 0;
    /* compare against equality with every other element */
    for (i = 0; i < 64; i += bits) {
        t0 = i ? rol64(b0, i) : b0;
        t1 = i ? rol64(b1, i) : b1;
        e0 |= zero_search(a0 ^ t0, mask);
        e0 |= zero_search(a0 ^ t1, mask);
        e1 |= zero_search(a1 ^ t0, mask);
        e1 |= zero_search(a1 ^ t1, mask);
    }
    /* invert the result if requested - invert only the MSBs */
    if (in) {
        e0 = ~e0 & ~mask;
        e1 = ~e1 & ~mask;
    }
    first_equal = match_index(e0, e1);

    if (zs) {
        z0 = zero_search(a0, mask);
        z1 = zero_search(a1, mask);
        first_zero = match_index(z0, z1);
    }

    if (rt) {
        e0 = (e0 >> (bits - 1)) * get_single_element_mask(es);
        e1 = (e1 >> (bits - 1)) * get_single_element_mask(es);
        s390_vec_write_element64(v1, 0, e0);
        s390_vec_write_element64(v1, 1, e1);
    } else {
        s390_vec_write_element64(v1, 0, MIN(first_equal, first_zero));
        s390_vec_write_element64(v1, 1, 0);
    }

    if (first_zero == 16 && first_equal == 16) {
        return 3; /* no match */
    } else if (first_zero == 16) {
        return 1; /* matching elements, no match for zero */
    } else if (first_equal < first_zero) {
        return 2; /* matching elements before match for zero */
    }
    return 0; /* match for zero */
}


At least the kernel boots with it - am i missing something or does this
indeed work?

Cheers!


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-22 11:16         ` David Hildenbrand
  2019-05-22 15:59           ` Richard Henderson
@ 2019-05-23 10:58           ` Alex Bennée
  1 sibling, 0 replies; 27+ messages in thread
From: Alex Bennée @ 2019-05-23 10:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-s390x, Cornelia Huck, Richard Henderson, Thomas Huth,
	Richard Henderson


David Hildenbrand <david@redhat.com> writes:

> On 22.05.19 13:09, Richard Henderson wrote:
>> On 5/22/19 7:01 AM, David Hildenbrand wrote:
>>>
>>>> I also think that, if we create a bunch more of these wrappers:
>>>>
>>>>> +DEF_VFAE_HELPER(8)
>>>>> +DEF_VFAE_HELPER(16)
>>>>> +DEF_VFAE_HELPER(32)
>>>>
>>>> then RT and ZS can be passed in as constant parameters to the above, and then
>>>> the compiler will fold away all of the stuff that's not needed for each
>>>> different case.  Which, I think, is significant.  These are practically
>>>> different instructions with the different modifiers.
>>>>
>>>
>>> So, we have 4 flags, resulting in 16 variants. Times 3 element sizes ...
>>> 48 helpers in total. Do we really want to go down that path?
>>
>> Maybe?
>
> Hope my fingers won't bleed from all the copy-pasting ;)

An alternative is to generalise the code into a helper and then just use
macros to instantiate a series of calls to it (c.f. softfloat). The idea
is you can use flatten/inline to keep it efficient but you don't have a
bunch of logic obscured by macro stuff.


--
Alex Bennée


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-23  7:50                 ` David Hildenbrand
@ 2019-05-23 12:27                   ` Richard Henderson
  2019-05-23 12:34                     ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-05-23 12:27 UTC (permalink / raw)
  To: David Hildenbrand, Richard Henderson
  Cc: Thomas Huth, qemu-s390x, Cornelia Huck, qemu-devel@nongnu.org Developers

On 5/23/19 3:50 AM, David Hildenbrand wrote:
> /*
>  * Returns the number of bits composing one element.
>  */
> static uint8_t get_element_bits(uint8_t es)
> {
>     return (1 << es) * BITS_PER_BYTE;
> }
> 
> /*
>  * Returns the bitmask for a single element.
>  */
> static uint64_t get_single_element_mask(uint8_t es)
> {
>     return -1ull >> (64 - get_element_bits(es));
> }
> 
> /*
>  * Returns the bitmask for a single element (excluding the MSB).
>  */
> static uint64_t get_single_element_lsbs_mask(uint8_t es)
> {
>     return -1ull >> (65 - get_element_bits(es));
> }
> 
> /*
>  * Returns the bitmasks for multiple elements (excluding the MSBs).
>  */
> static uint64_t get_element_lsbs_mask(uint8_t es)
> {
>     return dup_const(es, get_single_element_lsbs_mask(es));
> }
> 
> static int vfae(void *v1, const void *v2, const void *v3, bool in,
>                 bool rt, bool zs, uint8_t es)
> {
>     const uint64_t mask = get_element_lsbs_mask(es);
>     const int bits = get_element_bits(es);
>     uint64_t a0, a1, b0, b1, e0, e1, t0, t1, z0, z1;
>     uint64_t first_zero = 16;
>     uint64_t first_equal;
>     int i;
> 
>     a0 = s390_vec_read_element64(v2, 0);
>     a1 = s390_vec_read_element64(v2, 1);
>     b0 = s390_vec_read_element64(v3, 0);
>     b1 = s390_vec_read_element64(v3, 1);
>     e0 = 0;
>     e1 = 0;
>     /* compare against equality with every other element */
>     for (i = 0; i < 64; i += bits) {
>         t0 = i ? rol64(b0, i) : b0;
>         t1 = i ? rol64(b1, i) : b1;
>         e0 |= zero_search(a0 ^ t0, mask);
>         e0 |= zero_search(a0 ^ t1, mask);
>         e1 |= zero_search(a1 ^ t0, mask);
>         e1 |= zero_search(a1 ^ t1, mask);
>     }

I don't see that this is doing what you want.  You're shifting one element of B
down, but not broadcasting it so that it is compared against every element of A.

I'd expect something like

	t0 = dup_const(es, b0 >> i);
	t1 = dup_const(es, b1 >> i);

(I also don't see what rol is getting you that shift doesn't.)


>     /* invert the result if requested - invert only the MSBs */
>     if (in) {
>         e0 = ~e0 & ~mask;
>         e1 = ~e1 & ~mask;
>     }
>     first_equal = match_index(e0, e1);
> 
>     if (zs) {
>         z0 = zero_search(a0, mask);
>         z1 = zero_search(a1, mask);
>         first_zero = match_index(z0, z1);
>     }
> 
>     if (rt) {
>         e0 = (e0 >> (bits - 1)) * get_single_element_mask(es);
>         e1 = (e1 >> (bits - 1)) * get_single_element_mask(es);
>         s390_vec_write_element64(v1, 0, e0);
>         s390_vec_write_element64(v1, 1, e1);
>     } else {
>         s390_vec_write_element64(v1, 0, MIN(first_equal, first_zero));
>         s390_vec_write_element64(v1, 1, 0);
>     }
> 
>     if (first_zero == 16 && first_equal == 16) {
>         return 3; /* no match */
>     } else if (first_zero == 16) {
>         return 1; /* matching elements, no match for zero */
>     } else if (first_equal < first_zero) {
>         return 2; /* matching elements before match for zero */
>     }
>     return 0; /* match for zero */
> }

The rest of this looks good.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-23 12:27                   ` Richard Henderson
@ 2019-05-23 12:34                     ` David Hildenbrand
  2019-05-23 12:59                       ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-23 12:34 UTC (permalink / raw)
  To: Richard Henderson, Richard Henderson
  Cc: Thomas Huth, qemu-s390x, Cornelia Huck, qemu-devel@nongnu.org Developers

On 23.05.19 14:27, Richard Henderson wrote:
> On 5/23/19 3:50 AM, David Hildenbrand wrote:
>> /*
>>  * Returns the number of bits composing one element.
>>  */
>> static uint8_t get_element_bits(uint8_t es)
>> {
>>     return (1 << es) * BITS_PER_BYTE;
>> }
>>
>> /*
>>  * Returns the bitmask for a single element.
>>  */
>> static uint64_t get_single_element_mask(uint8_t es)
>> {
>>     return -1ull >> (64 - get_element_bits(es));
>> }
>>
>> /*
>>  * Returns the bitmask for a single element (excluding the MSB).
>>  */
>> static uint64_t get_single_element_lsbs_mask(uint8_t es)
>> {
>>     return -1ull >> (65 - get_element_bits(es));
>> }
>>
>> /*
>>  * Returns the bitmasks for multiple elements (excluding the MSBs).
>>  */
>> static uint64_t get_element_lsbs_mask(uint8_t es)
>> {
>>     return dup_const(es, get_single_element_lsbs_mask(es));
>> }
>>
>> static int vfae(void *v1, const void *v2, const void *v3, bool in,
>>                 bool rt, bool zs, uint8_t es)
>> {
>>     const uint64_t mask = get_element_lsbs_mask(es);
>>     const int bits = get_element_bits(es);
>>     uint64_t a0, a1, b0, b1, e0, e1, t0, t1, z0, z1;
>>     uint64_t first_zero = 16;
>>     uint64_t first_equal;
>>     int i;
>>
>>     a0 = s390_vec_read_element64(v2, 0);
>>     a1 = s390_vec_read_element64(v2, 1);
>>     b0 = s390_vec_read_element64(v3, 0);
>>     b1 = s390_vec_read_element64(v3, 1);
>>     e0 = 0;
>>     e1 = 0;
>>     /* compare against equality with every other element */
>>     for (i = 0; i < 64; i += bits) {
>>         t0 = i ? rol64(b0, i) : b0;
>>         t1 = i ? rol64(b1, i) : b1;
>>         e0 |= zero_search(a0 ^ t0, mask);
>>         e0 |= zero_search(a0 ^ t1, mask);
>>         e1 |= zero_search(a1 ^ t0, mask);
>>         e1 |= zero_search(a1 ^ t1, mask);
>>     }
> 
> I don't see that this is doing what you want.  You're shifting one element of B
> down, but not broadcasting it so that it is compared against every element of A.
> 
> I'd expect something like
> 
> 	t0 = dup_const(es, b0 >> i);
> 	t1 = dup_const(es, b1 >> i);
> 
> (I also don't see what rol is getting you that shift doesn't.)

Let's assume

a0 = [0, 1, 2, 3]
a1 = [4, 5, 6, 7]

b0 = [8, 8, 8, 4]
b1 = [8, 8, 8, 8]

What I would check is

First iteration

a0 == [8, 8, 8, 4] -> no match
a0 == [8, 8, 8, 8] -> no match
a1 == [8, 8, 8, 4] -> no match
a1 == [8, 8, 8, 8] -> no match

Second iteration

a0 == [8, 8, 4, 8] -> no match
a0 == [8, 8, 8, 8] -> no match
a1 == [8, 8, 4, 8]
a1 == [8, 8, 8, 8] -> no match

...

Last iteration

a0 == [4, 8, 8, 8] -> no match
a0 == [8, 8, 8, 8] -> no match
a1 == [4, 8, 8, 8] -> match in first element
a1 == [8, 8, 8, 8] -> no match

What am i missing?


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-23 12:34                     ` David Hildenbrand
@ 2019-05-23 12:59                       ` David Hildenbrand
  2019-05-23 13:50                         ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2019-05-23 12:59 UTC (permalink / raw)
  To: Richard Henderson, Richard Henderson
  Cc: Thomas Huth, qemu-s390x, Cornelia Huck, qemu-devel@nongnu.org Developers

On 23.05.19 14:34, David Hildenbrand wrote:
> On 23.05.19 14:27, Richard Henderson wrote:
>> On 5/23/19 3:50 AM, David Hildenbrand wrote:
>>> /*
>>>  * Returns the number of bits composing one element.
>>>  */
>>> static uint8_t get_element_bits(uint8_t es)
>>> {
>>>     return (1 << es) * BITS_PER_BYTE;
>>> }
>>>
>>> /*
>>>  * Returns the bitmask for a single element.
>>>  */
>>> static uint64_t get_single_element_mask(uint8_t es)
>>> {
>>>     return -1ull >> (64 - get_element_bits(es));
>>> }
>>>
>>> /*
>>>  * Returns the bitmask for a single element (excluding the MSB).
>>>  */
>>> static uint64_t get_single_element_lsbs_mask(uint8_t es)
>>> {
>>>     return -1ull >> (65 - get_element_bits(es));
>>> }
>>>
>>> /*
>>>  * Returns the bitmasks for multiple elements (excluding the MSBs).
>>>  */
>>> static uint64_t get_element_lsbs_mask(uint8_t es)
>>> {
>>>     return dup_const(es, get_single_element_lsbs_mask(es));
>>> }
>>>
>>> static int vfae(void *v1, const void *v2, const void *v3, bool in,
>>>                 bool rt, bool zs, uint8_t es)
>>> {
>>>     const uint64_t mask = get_element_lsbs_mask(es);
>>>     const int bits = get_element_bits(es);
>>>     uint64_t a0, a1, b0, b1, e0, e1, t0, t1, z0, z1;
>>>     uint64_t first_zero = 16;
>>>     uint64_t first_equal;
>>>     int i;
>>>
>>>     a0 = s390_vec_read_element64(v2, 0);
>>>     a1 = s390_vec_read_element64(v2, 1);
>>>     b0 = s390_vec_read_element64(v3, 0);
>>>     b1 = s390_vec_read_element64(v3, 1);
>>>     e0 = 0;
>>>     e1 = 0;
>>>     /* compare against equality with every other element */
>>>     for (i = 0; i < 64; i += bits) {
>>>         t0 = i ? rol64(b0, i) : b0;
>>>         t1 = i ? rol64(b1, i) : b1;
>>>         e0 |= zero_search(a0 ^ t0, mask);
>>>         e0 |= zero_search(a0 ^ t1, mask);
>>>         e1 |= zero_search(a1 ^ t0, mask);
>>>         e1 |= zero_search(a1 ^ t1, mask);
>>>     }
>>
>> I don't see that this is doing what you want.  You're shifting one element of B
>> down, but not broadcasting it so that it is compared against every element of A.
>>
>> I'd expect something like
>>
>> 	t0 = dup_const(es, b0 >> i);
>> 	t1 = dup_const(es, b1 >> i);
>>
>> (I also don't see what rol is getting you that shift doesn't.)
> 
> Let's assume
> 
> a0 = [0, 1, 2, 3]
> a1 = [4, 5, 6, 7]
> 
> b0 = [8, 8, 8, 4]
> b1 = [8, 8, 8, 8]
> 
> What I would check is
> 
> First iteration
> 
> a0 == [8, 8, 8, 4] -> no match
> a0 == [8, 8, 8, 8] -> no match
> a1 == [8, 8, 8, 4] -> no match
> a1 == [8, 8, 8, 8] -> no match
> 
> Second iteration
> 
> a0 == [8, 8, 4, 8] -> no match
> a0 == [8, 8, 8, 8] -> no match
> a1 == [8, 8, 4, 8]
> a1 == [8, 8, 8, 8] -> no match
> 
> ...
> 
> Last iteration
> 
> a0 == [4, 8, 8, 8] -> no match
> a0 == [8, 8, 8, 8] -> no match
> a1 == [4, 8, 8, 8] -> match in first element
> a1 == [8, 8, 8, 8] -> no match
> 
> What am i missing?
> 
> 

I guess I can simplify to

t0 = rol64(b0, i);
t1 = rol64(b1, i);

My approach: Compare all elements of B at a time
Your approach: Compare a single element of B at a time

If I'm not wrong, it boils down to to whether

rol64() or dup_const(es, b0 >> i) is faster ;)

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL
  2019-05-23 12:59                       ` David Hildenbrand
@ 2019-05-23 13:50                         ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2019-05-23 13:50 UTC (permalink / raw)
  To: David Hildenbrand, Richard Henderson
  Cc: Thomas Huth, qemu-s390x, Cornelia Huck, qemu-devel@nongnu.org Developers

On 5/23/19 8:59 AM, David Hildenbrand wrote:
> I guess I can simplify to
> 
> t0 = rol64(b0, i);
> t1 = rol64(b1, i);

Yes.

> My approach: Compare all elements of B at a time
> Your approach: Compare a single element of B at a time

Ah, I get it now.  This makes total sense, and does make the broadcast unnecessary.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-05-23 13:53 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-15 20:31 [Qemu-devel] [PATCH v1 0/5] s390x/tcg: Vector Instruction Support Part 3 David Hildenbrand
2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 1/5] s390x/tcg: Implement VECTOR FIND ANY ELEMENT EQUAL David Hildenbrand
2019-05-17 16:16   ` Richard Henderson
2019-05-20  9:51     ` David Hildenbrand
2019-05-22 11:01     ` David Hildenbrand
2019-05-22 11:09       ` Richard Henderson
2019-05-22 11:16         ` David Hildenbrand
2019-05-22 15:59           ` Richard Henderson
2019-05-22 18:16             ` David Hildenbrand
2019-05-22 18:46               ` Richard Henderson
2019-05-23  7:50                 ` David Hildenbrand
2019-05-23 12:27                   ` Richard Henderson
2019-05-23 12:34                     ` David Hildenbrand
2019-05-23 12:59                       ` David Hildenbrand
2019-05-23 13:50                         ` Richard Henderson
2019-05-23 10:58           ` Alex Bennée
2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 2/5] s390x/tcg: Implement VECTOR FIND " David Hildenbrand
2019-05-17 16:47   ` Richard Henderson
2019-05-17 17:42     ` Richard Henderson
2019-05-20  9:17       ` David Hildenbrand
2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 3/5] s390x/tcg: Implement VECTOR FIND ELEMENT NOT EQUAL David Hildenbrand
2019-05-17 17:56   ` Richard Henderson
2019-05-20  9:48     ` David Hildenbrand
2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 4/5] s390x/tcg: Implement VECTOR ISOLATE STRING David Hildenbrand
2019-05-17 18:20   ` Richard Henderson
2019-05-15 20:31 ` [Qemu-devel] [PATCH v1 5/5] s390x/tcg: Implement VECTOR STRING RANGE COMPARE David Hildenbrand
2019-05-17 18:37   ` Richard Henderson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.