All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 00/13] tcg/s390x: misc patches
@ 2022-12-02  6:51 Richard Henderson
  2022-12-02  6:51 ` [PATCH v3 01/13] tcg/s390x: Use register pair allocation for div and mulu2 Richard Henderson
                   ` (12 more replies)
  0 siblings, 13 replies; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

Based-on: 20221202053958.223890-1-richard.henderson@linaro.org
("[PATCH for-8.0 v3 00/34] tcg misc patches")

This contains two patches that exercise the register pair patches
within the "tcg misc patches" patch set.  Then a couple of misc
cleanups, then support for the MIE2, MIE3, and POPCOUNT features.

One thing to play with: the middle-end can expand ctz based on
either clz or ctpop, and for z16 we now have both.  I've got an
idea that for s390x the most general case of the expansion would
be better with clz.  Which leads me to wonder if there's a better
way to manage such expansions, but I haven't thought about it
too much yet.


r~


Richard Henderson (13):
  tcg/s390x: Use register pair allocation for div and mulu2
  tcg/s390x: Remove TCG_REG_TB
  tcg/s390x: Use LARL+AGHI for odd addresses
  tcg/s390x: Distinguish RRF-a and RRF-c formats
  tcg/s390x: Distinguish RIE formats
  tcg/s390x: Support MIE2 multiply single instructions
  tcg/s390x: Support MIE2 MGRK instruction
  tcg/s390x: Support MIE3 logical operations
  tcg/s390x: Create tgen_cmp2 to simplify movcond
  tcg/s390x: Generalize movcond implementation
  tcg/s390x: Support SELGR instruction in movcond
  tcg/s390x: Use tgen_movcond_int in tgen_clz
  tcg/s390x: Implement ctpop operation

 tcg/s390x/tcg-target-con-set.h |  11 +-
 tcg/s390x/tcg-target-con-str.h |   8 +-
 tcg/s390x/tcg-target.h         |  35 +-
 tcg/s390x/tcg-target.c.inc     | 663 ++++++++++++++++++++-------------
 4 files changed, 430 insertions(+), 287 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 01/13] tcg/s390x: Use register pair allocation for div and mulu2
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 15:49   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB Richard Henderson
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

Previously we hard-coded R2 and R3.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target-con-set.h |  4 ++--
 tcg/s390x/tcg-target-con-str.h |  8 +------
 tcg/s390x/tcg-target.c.inc     | 43 +++++++++++++++++++++++++---------
 3 files changed, 35 insertions(+), 20 deletions(-)

diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
index 426dd92e51..00ba727b70 100644
--- a/tcg/s390x/tcg-target-con-set.h
+++ b/tcg/s390x/tcg-target-con-set.h
@@ -29,8 +29,8 @@ C_O1_I2(v, v, v)
 C_O1_I3(v, v, v, v)
 C_O1_I4(r, r, ri, r, 0)
 C_O1_I4(r, r, ri, rI, 0)
-C_O2_I2(b, a, 0, r)
-C_O2_I3(b, a, 0, 1, r)
+C_O2_I2(o, m, 0, r)
+C_O2_I3(o, m, 0, 1, r)
 C_O2_I4(r, r, 0, 1, rA, r)
 C_O2_I4(r, r, 0, 1, ri, r)
 C_O2_I4(r, r, 0, 1, r, r)
diff --git a/tcg/s390x/tcg-target-con-str.h b/tcg/s390x/tcg-target-con-str.h
index 8bb0358ae5..76446aecae 100644
--- a/tcg/s390x/tcg-target-con-str.h
+++ b/tcg/s390x/tcg-target-con-str.h
@@ -11,13 +11,7 @@
 REGS('r', ALL_GENERAL_REGS)
 REGS('L', ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
 REGS('v', ALL_VECTOR_REGS)
-/*
- * A (single) even/odd pair for division.
- * TODO: Add something to the register allocator to allow
- * this kind of regno+1 pairing to be done more generally.
- */
-REGS('a', 1u << TCG_REG_R2)
-REGS('b', 1u << TCG_REG_R3)
+REGS('o', 0xaaaa) /* odd numbered general regs */
 
 /*
  * Define constraint letters for constants:
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index b9ba7b605e..cb00bb6999 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -2264,10 +2264,18 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
     case INDEX_op_div2_i32:
-        tcg_out_insn(s, RR, DR, TCG_REG_R2, args[4]);
+        tcg_debug_assert(args[0] == args[2]);
+        tcg_debug_assert(args[1] == args[3]);
+        tcg_debug_assert((args[1] & 1) == 0);
+        tcg_debug_assert(args[0] == args[1] + 1);
+        tcg_out_insn(s, RR, DR, args[1], args[4]);
         break;
     case INDEX_op_divu2_i32:
-        tcg_out_insn(s, RRE, DLR, TCG_REG_R2, args[4]);
+        tcg_debug_assert(args[0] == args[2]);
+        tcg_debug_assert(args[1] == args[3]);
+        tcg_debug_assert((args[1] & 1) == 0);
+        tcg_debug_assert(args[0] == args[1] + 1);
+        tcg_out_insn(s, RRE, DLR, args[1], args[4]);
         break;
 
     case INDEX_op_shl_i32:
@@ -2521,17 +2529,30 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
     case INDEX_op_div2_i64:
-        /* ??? We get an unnecessary sign-extension of the dividend
-           into R3 with this definition, but as we do in fact always
-           produce both quotient and remainder using INDEX_op_div_i64
-           instead requires jumping through even more hoops.  */
-        tcg_out_insn(s, RRE, DSGR, TCG_REG_R2, args[4]);
+        /*
+         * ??? We get an unnecessary sign-extension of the dividend
+         * into op0 with this definition, but as we do in fact always
+         * produce both quotient and remainder using INDEX_op_div_i64
+         * instead requires jumping through even more hoops.
+         */
+        tcg_debug_assert(args[0] == args[2]);
+        tcg_debug_assert(args[1] == args[3]);
+        tcg_debug_assert((args[1] & 1) == 0);
+        tcg_debug_assert(args[0] == args[1] + 1);
+        tcg_out_insn(s, RRE, DSGR, args[1], args[4]);
         break;
     case INDEX_op_divu2_i64:
-        tcg_out_insn(s, RRE, DLGR, TCG_REG_R2, args[4]);
+        tcg_debug_assert(args[0] == args[2]);
+        tcg_debug_assert(args[1] == args[3]);
+        tcg_debug_assert((args[1] & 1) == 0);
+        tcg_debug_assert(args[0] == args[1] + 1);
+        tcg_out_insn(s, RRE, DLGR, args[1], args[4]);
         break;
     case INDEX_op_mulu2_i64:
-        tcg_out_insn(s, RRE, MLGR, TCG_REG_R2, args[3]);
+        tcg_debug_assert(args[0] == args[2]);
+        tcg_debug_assert((args[1] & 1) == 0);
+        tcg_debug_assert(args[0] == args[1] + 1);
+        tcg_out_insn(s, RRE, MLGR, args[1], args[3]);
         break;
 
     case INDEX_op_shl_i64:
@@ -3226,10 +3247,10 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_div2_i64:
     case INDEX_op_divu2_i32:
     case INDEX_op_divu2_i64:
-        return C_O2_I3(b, a, 0, 1, r);
+        return C_O2_I3(o, m, 0, 1, r);
 
     case INDEX_op_mulu2_i64:
-        return C_O2_I2(b, a, 0, r);
+        return C_O2_I2(o, m, 0, r);
 
     case INDEX_op_add2_i32:
     case INDEX_op_sub2_i32:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
  2022-12-02  6:51 ` [PATCH v3 01/13] tcg/s390x: Use register pair allocation for div and mulu2 Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 19:29   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 03/13] tcg/s390x: Use LARL+AGHI for odd addresses Richard Henderson
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and
several follow-up patches.  The primary motivation is to reduce the
less-tested code paths, pre-z10.  Secondarily, this allows the
unconditional use of TCG_TARGET_HAS_direct_jump, which might be more
important for performance than any slight increase in code size.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.h     |   2 +-
 tcg/s390x/tcg-target.c.inc | 176 +++++--------------------------------
 2 files changed, 23 insertions(+), 155 deletions(-)

diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index 22d70d431b..645f522058 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -103,7 +103,7 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_mulsh_i32      0
 #define TCG_TARGET_HAS_extrl_i64_i32  0
 #define TCG_TARGET_HAS_extrh_i64_i32  0
-#define TCG_TARGET_HAS_direct_jump    HAVE_FACILITY(GEN_INST_EXT)
+#define TCG_TARGET_HAS_direct_jump    1
 #define TCG_TARGET_HAS_qemu_st8_i32   0
 
 #define TCG_TARGET_HAS_div2_i64       1
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index cb00bb6999..8a4bec0a28 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -65,12 +65,6 @@
 /* A scratch register that may be be used throughout the backend.  */
 #define TCG_TMP0        TCG_REG_R1
 
-/* A scratch register that holds a pointer to the beginning of the TB.
-   We don't need this when we have pc-relative loads with the general
-   instructions extension facility.  */
-#define TCG_REG_TB      TCG_REG_R12
-#define USE_REG_TB      (!HAVE_FACILITY(GEN_INST_EXT))
-
 #ifndef CONFIG_SOFTMMU
 #define TCG_GUEST_BASE_REG TCG_REG_R13
 #endif
@@ -813,8 +807,8 @@ static bool maybe_out_small_movi(TCGContext *s, TCGType type,
 }
 
 /* load a register with an immediate value */
-static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
-                             tcg_target_long sval, bool in_prologue)
+static void tcg_out_movi(TCGContext *s, TCGType type,
+                         TCGReg ret, tcg_target_long sval)
 {
     tcg_target_ulong uval;
 
@@ -853,14 +847,6 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
             tcg_out_insn(s, RIL, LARL, ret, off);
             return;
         }
-    } else if (USE_REG_TB && !in_prologue) {
-        ptrdiff_t off = tcg_tbrel_diff(s, (void *)sval);
-        if (off == sextract64(off, 0, 20)) {
-            /* This is certain to be an address within TB, and therefore
-               OFF will be negative; don't try RX_LA.  */
-            tcg_out_insn(s, RXY, LAY, ret, TCG_REG_TB, TCG_REG_NONE, off);
-            return;
-        }
     }
 
     /* A 32-bit unsigned value can be loaded in 2 insns.  And given
@@ -876,10 +862,6 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
     if (HAVE_FACILITY(GEN_INST_EXT)) {
         tcg_out_insn(s, RIL, LGRL, ret, 0);
         new_pool_label(s, sval, R_390_PC32DBL, s->code_ptr - 2, 2);
-    } else if (USE_REG_TB && !in_prologue) {
-        tcg_out_insn(s, RXY, LG, ret, TCG_REG_TB, TCG_REG_NONE, 0);
-        new_pool_label(s, sval, R_390_20, s->code_ptr - 2,
-                       tcg_tbrel_diff(s, NULL));
     } else {
         TCGReg base = ret ? ret : TCG_TMP0;
         tcg_out_insn(s, RIL, LARL, base, 0);
@@ -888,12 +870,6 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
     }
 }
 
-static void tcg_out_movi(TCGContext *s, TCGType type,
-                         TCGReg ret, tcg_target_long sval)
-{
-    tcg_out_movi_int(s, type, ret, sval, false);
-}
-
 /* Emit a load/store type instruction.  Inputs are:
    DATA:     The register to be loaded or stored.
    BASE+OFS: The effective address.
@@ -1020,35 +996,6 @@ static inline bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
     return false;
 }
 
-/* load data from an absolute host address */
-static void tcg_out_ld_abs(TCGContext *s, TCGType type,
-                           TCGReg dest, const void *abs)
-{
-    intptr_t addr = (intptr_t)abs;
-
-    if (HAVE_FACILITY(GEN_INST_EXT) && !(addr & 1)) {
-        ptrdiff_t disp = tcg_pcrel_diff(s, abs) >> 1;
-        if (disp == (int32_t)disp) {
-            if (type == TCG_TYPE_I32) {
-                tcg_out_insn(s, RIL, LRL, dest, disp);
-            } else {
-                tcg_out_insn(s, RIL, LGRL, dest, disp);
-            }
-            return;
-        }
-    }
-    if (USE_REG_TB) {
-        ptrdiff_t disp = tcg_tbrel_diff(s, abs);
-        if (disp == sextract64(disp, 0, 20)) {
-            tcg_out_ld(s, type, dest, TCG_REG_TB, disp);
-            return;
-        }
-    }
-
-    tcg_out_movi(s, TCG_TYPE_PTR, dest, addr & ~0xffff);
-    tcg_out_ld(s, type, dest, dest, addr & 0xffff);
-}
-
 static inline void tcg_out_risbg(TCGContext *s, TCGReg dest, TCGReg src,
                                  int msb, int lsb, int ofs, int z)
 {
@@ -1243,17 +1190,7 @@ static void tgen_andi(TCGContext *s, TCGType type, TCGReg dest, uint64_t val)
         return;
     }
 
-    /* Use the constant pool if USE_REG_TB, but not for small constants.  */
-    if (USE_REG_TB) {
-        if (!maybe_out_small_movi(s, type, TCG_TMP0, val)) {
-            tcg_out_insn(s, RXY, NG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
-            new_pool_label(s, val & valid, R_390_20, s->code_ptr - 2,
-                           tcg_tbrel_diff(s, NULL));
-            return;
-        }
-    } else {
-        tcg_out_movi(s, type, TCG_TMP0, val);
-    }
+    tcg_out_movi(s, type, TCG_TMP0, val);
     if (type == TCG_TYPE_I32) {
         tcg_out_insn(s, RR, NR, dest, TCG_TMP0);
     } else {
@@ -1297,24 +1234,11 @@ static void tgen_ori(TCGContext *s, TCGType type, TCGReg dest, uint64_t val)
         }
     }
 
-    /* Use the constant pool if USE_REG_TB, but not for small constants.  */
-    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
-        if (type == TCG_TYPE_I32) {
-            tcg_out_insn(s, RR, OR, dest, TCG_TMP0);
-        } else {
-            tcg_out_insn(s, RRE, OGR, dest, TCG_TMP0);
-        }
-    } else if (USE_REG_TB) {
-        tcg_out_insn(s, RXY, OG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
-        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
-                       tcg_tbrel_diff(s, NULL));
+    tcg_out_movi(s, type, TCG_TMP0, val);
+    if (type == TCG_TYPE_I32) {
+        tcg_out_insn(s, RR, OR, dest, TCG_TMP0);
     } else {
-        /* Perform the OR via sequential modifications to the high and
-           low parts.  Do this via recursion to handle 16-bit vs 32-bit
-           masks in each half.  */
-        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
-        tgen_ori(s, type, dest, val & 0x00000000ffffffffull);
-        tgen_ori(s, type, dest, val & 0xffffffff00000000ull);
+        tcg_out_insn(s, RRE, OGR, dest, TCG_TMP0);
     }
 }
 
@@ -1332,26 +1256,11 @@ static void tgen_xori(TCGContext *s, TCGType type, TCGReg dest, uint64_t val)
         }
     }
 
-    /* Use the constant pool if USE_REG_TB, but not for small constants.  */
-    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
-        if (type == TCG_TYPE_I32) {
-            tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
-        } else {
-            tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
-        }
-    } else if (USE_REG_TB) {
-        tcg_out_insn(s, RXY, XG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
-        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
-                       tcg_tbrel_diff(s, NULL));
+    tcg_out_movi(s, type, TCG_TMP0, val);
+    if (type == TCG_TYPE_I32) {
+        tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
     } else {
-        /* Perform the xor by parts.  */
-        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
-        if (val & 0xffffffff) {
-            tcg_out_insn(s, RIL, XILF, dest, val);
-        }
-        if (val > 0xffffffff) {
-            tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
-        }
+        tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
     }
 }
 
@@ -1395,19 +1304,6 @@ static int tgen_cmp(TCGContext *s, TCGType type, TCGCond c, TCGReg r1,
         if (maybe_out_small_movi(s, type, TCG_TMP0, c2)) {
             c2 = TCG_TMP0;
             /* fall through to reg-reg */
-        } else if (USE_REG_TB) {
-            if (type == TCG_TYPE_I32) {
-                op = (is_unsigned ? RXY_CLY : RXY_CY);
-                tcg_out_insn_RXY(s, op, r1, TCG_REG_TB, TCG_REG_NONE, 0);
-                new_pool_label(s, (uint32_t)c2, R_390_20, s->code_ptr - 2,
-                               4 - tcg_tbrel_diff(s, NULL));
-            } else {
-                op = (is_unsigned ? RXY_CLG : RXY_CG);
-                tcg_out_insn_RXY(s, op, r1, TCG_REG_TB, TCG_REG_NONE, 0);
-                new_pool_label(s, c2, R_390_20, s->code_ptr - 2,
-                               tcg_tbrel_diff(s, NULL));
-            }
-            goto exit;
         } else {
             if (type == TCG_TYPE_I32) {
                 op = (is_unsigned ? RIL_CLRL : RIL_CRL);
@@ -2101,43 +1997,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
 
     case INDEX_op_goto_tb:
         a0 = args[0];
-        if (s->tb_jmp_insn_offset) {
-            /*
-             * branch displacement must be aligned for atomic patching;
-             * see if we need to add extra nop before branch
-             */
-            if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
-                tcg_out16(s, NOP);
-            }
-            tcg_debug_assert(!USE_REG_TB);
-            tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
-            s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
-            s->code_ptr += 2;
-        } else {
-            /* load address stored at s->tb_jmp_target_addr + a0 */
-            tcg_out_ld_abs(s, TCG_TYPE_PTR, TCG_REG_TB,
-                           tcg_splitwx_to_rx(s->tb_jmp_target_addr + a0));
-            /* and go there */
-            tcg_out_insn(s, RR, BCR, S390_CC_ALWAYS, TCG_REG_TB);
+        tcg_debug_assert(s->tb_jmp_insn_offset);
+        /*
+         * branch displacement must be aligned for atomic patching;
+         * see if we need to add extra nop before branch
+         */
+        if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
+            tcg_out16(s, NOP);
         }
+        tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
+        s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
+        tcg_out32(s, 0);
         set_jmp_reset_offset(s, a0);
-
-        /* For the unlinked path of goto_tb, we need to reset
-           TCG_REG_TB to the beginning of this TB.  */
-        if (USE_REG_TB) {
-            int ofs = -tcg_current_code_size(s);
-            /* All TB are restricted to 64KiB by unwind info. */
-            tcg_debug_assert(ofs == sextract64(ofs, 0, 20));
-            tcg_out_insn(s, RXY, LAY, TCG_REG_TB,
-                         TCG_REG_TB, TCG_REG_NONE, ofs);
-        }
         break;
 
     case INDEX_op_goto_ptr:
         a0 = args[0];
-        if (USE_REG_TB) {
-            tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_TB, a0);
-        }
         tcg_out_insn(s, RR, BCR, S390_CC_ALWAYS, a0);
         break;
 
@@ -3405,9 +3280,6 @@ static void tcg_target_init(TCGContext *s)
     /* XXX many insns can't be used with R0, so we better avoid it for now */
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_R0);
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
-    if (USE_REG_TB) {
-        tcg_regset_set_reg(s->reserved_regs, TCG_REG_TB);
-    }
 }
 
 #define FRAME_SIZE  ((int)(TCG_TARGET_CALL_STACK_OFFSET          \
@@ -3428,16 +3300,12 @@ static void tcg_target_qemu_prologue(TCGContext *s)
 
 #ifndef CONFIG_SOFTMMU
     if (guest_base >= 0x80000) {
-        tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_GUEST_BASE_REG, guest_base, true);
+        tcg_out_movi(s, TCG_TYPE_PTR, TCG_GUEST_BASE_REG, guest_base);
         tcg_regset_set_reg(s->reserved_regs, TCG_GUEST_BASE_REG);
     }
 #endif
 
     tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
-    if (USE_REG_TB) {
-        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_TB,
-                    tcg_target_call_iarg_regs[1]);
-    }
 
     /* br %r3 (go to TB) */
     tcg_out_insn(s, RR, BCR, S390_CC_ALWAYS, tcg_target_call_iarg_regs[1]);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 03/13] tcg/s390x: Use LARL+AGHI for odd addresses
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
  2022-12-02  6:51 ` [PATCH v3 01/13] tcg/s390x: Use register pair allocation for div and mulu2 Richard Henderson
  2022-12-02  6:51 ` [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 19:42   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 04/13] tcg/s390x: Distinguish RRF-a and RRF-c formats Richard Henderson
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

Add one instead of dropping odd addresses to the constant pool.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.c.inc | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index 8a4bec0a28..34de5c5ebe 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -811,6 +811,7 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
                          TCGReg ret, tcg_target_long sval)
 {
     tcg_target_ulong uval;
+    ptrdiff_t pc_off;
 
     /* Try all 32-bit insns that can load it in one go.  */
     if (maybe_out_small_movi(s, type, ret, sval)) {
@@ -839,14 +840,14 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
         }
     }
 
-    /* Try for PC-relative address load.  For odd addresses,
-       attempt to use an offset from the start of the TB.  */
-    if ((sval & 1) == 0) {
-        ptrdiff_t off = tcg_pcrel_diff(s, (void *)sval) >> 1;
-        if (off == (int32_t)off) {
-            tcg_out_insn(s, RIL, LARL, ret, off);
-            return;
+    /* Try for PC-relative address load.  For odd addresses, add one. */
+    pc_off = tcg_pcrel_diff(s, (void *)sval) >> 1;
+    if (pc_off == (int32_t)pc_off) {
+        tcg_out_insn(s, RIL, LARL, ret, pc_off);
+        if (sval & 1) {
+            tcg_out_insn(s, RI, AGHI, ret, 1);
         }
+        return;
     }
 
     /* A 32-bit unsigned value can be loaded in 2 insns.  And given
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 04/13] tcg/s390x: Distinguish RRF-a and RRF-c formats
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (2 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 03/13] tcg/s390x: Use LARL+AGHI for odd addresses Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 19:45   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 05/13] tcg/s390x: Distinguish RIE formats Richard Henderson
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

One has 3 register arguments; the other has 2 plus an m3 field.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.c.inc | 57 +++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index 34de5c5ebe..a489b3359e 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -177,18 +177,19 @@ typedef enum S390Opcode {
     RRE_SLBGR   = 0xb989,
     RRE_XGR     = 0xb982,
 
-    RRF_LOCR    = 0xb9f2,
-    RRF_LOCGR   = 0xb9e2,
-    RRF_NRK     = 0xb9f4,
-    RRF_NGRK    = 0xb9e4,
-    RRF_ORK     = 0xb9f6,
-    RRF_OGRK    = 0xb9e6,
-    RRF_SRK     = 0xb9f9,
-    RRF_SGRK    = 0xb9e9,
-    RRF_SLRK    = 0xb9fb,
-    RRF_SLGRK   = 0xb9eb,
-    RRF_XRK     = 0xb9f7,
-    RRF_XGRK    = 0xb9e7,
+    RRFa_NRK    = 0xb9f4,
+    RRFa_NGRK   = 0xb9e4,
+    RRFa_ORK    = 0xb9f6,
+    RRFa_OGRK   = 0xb9e6,
+    RRFa_SRK    = 0xb9f9,
+    RRFa_SGRK   = 0xb9e9,
+    RRFa_SLRK   = 0xb9fb,
+    RRFa_SLGRK  = 0xb9eb,
+    RRFa_XRK    = 0xb9f7,
+    RRFa_XGRK   = 0xb9e7,
+
+    RRFc_LOCR   = 0xb9f2,
+    RRFc_LOCGR  = 0xb9e2,
 
     RR_AR       = 0x1a,
     RR_ALR      = 0x1e,
@@ -543,8 +544,14 @@ static void tcg_out_insn_RRE(TCGContext *s, S390Opcode op,
     tcg_out32(s, (op << 16) | (r1 << 4) | r2);
 }
 
-static void tcg_out_insn_RRF(TCGContext *s, S390Opcode op,
-                             TCGReg r1, TCGReg r2, int m3)
+static void tcg_out_insn_RRFa(TCGContext *s, S390Opcode op,
+                              TCGReg r1, TCGReg r2, TCGReg r3)
+{
+    tcg_out32(s, (op << 16) | (r3 << 12) | (r1 << 4) | r2);
+}
+
+static void tcg_out_insn_RRFc(TCGContext *s, S390Opcode op,
+                              TCGReg r1, TCGReg r2, int m3)
 {
     tcg_out32(s, (op << 16) | (m3 << 12) | (r1 << 4) | r2);
 }
@@ -1419,7 +1426,7 @@ static void tgen_setcond(TCGContext *s, TCGType type, TCGCond cond,
         /* Emit: d = 0, t = 1, d = (cc ? t : d).  */
         tcg_out_movi(s, TCG_TYPE_I64, dest, 0);
         tcg_out_movi(s, TCG_TYPE_I64, TCG_TMP0, 1);
-        tcg_out_insn(s, RRF, LOCGR, dest, TCG_TMP0, cc);
+        tcg_out_insn(s, RRFc, LOCGR, dest, TCG_TMP0, cc);
     } else {
         /* Emit: d = 1; if (cc) goto over; d = 0; over:  */
         tcg_out_movi(s, type, dest, 1);
@@ -1438,7 +1445,7 @@ static void tgen_movcond(TCGContext *s, TCGType type, TCGCond c, TCGReg dest,
         if (v3const) {
             tcg_out_insn(s, RIE, LOCGHI, dest, v3, cc);
         } else {
-            tcg_out_insn(s, RRF, LOCGR, dest, v3, cc);
+            tcg_out_insn(s, RRFc, LOCGR, dest, v3, cc);
         }
     } else {
         c = tcg_invert_cond(c);
@@ -1468,7 +1475,7 @@ static void tgen_clz(TCGContext *s, TCGReg dest, TCGReg a1,
         }
         if (HAVE_FACILITY(LOAD_ON_COND)) {
             /* Emit: if (one bit found) dest = r0.  */
-            tcg_out_insn(s, RRF, LOCGR, dest, TCG_REG_R0, 2);
+            tcg_out_insn(s, RRFc, LOCGR, dest, TCG_REG_R0, 2);
         } else {
             /* Emit: if (no one bit found) goto over; dest = r0; over:  */
             tcg_out_insn(s, RI, BRC, 8, (4 + 4) >> 1);
@@ -2085,7 +2092,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         } else if (a0 == a1) {
             tcg_out_insn(s, RR, SR, a0, a2);
         } else {
-            tcg_out_insn(s, RRF, SRK, a0, a1, a2);
+            tcg_out_insn(s, RRFa, SRK, a0, a1, a2);
         }
         break;
 
@@ -2097,7 +2104,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         } else if (a0 == a1) {
             tcg_out_insn(s, RR, NR, a0, a2);
         } else {
-            tcg_out_insn(s, RRF, NRK, a0, a1, a2);
+            tcg_out_insn(s, RRFa, NRK, a0, a1, a2);
         }
         break;
     case INDEX_op_or_i32:
@@ -2108,7 +2115,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         } else if (a0 == a1) {
             tcg_out_insn(s, RR, OR, a0, a2);
         } else {
-            tcg_out_insn(s, RRF, ORK, a0, a1, a2);
+            tcg_out_insn(s, RRFa, ORK, a0, a1, a2);
         }
         break;
     case INDEX_op_xor_i32:
@@ -2119,7 +2126,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         } else if (a0 == a1) {
             tcg_out_insn(s, RR, XR, args[0], args[2]);
         } else {
-            tcg_out_insn(s, RRF, XRK, a0, a1, a2);
+            tcg_out_insn(s, RRFa, XRK, a0, a1, a2);
         }
         break;
 
@@ -2347,7 +2354,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         } else if (a0 == a1) {
             tcg_out_insn(s, RRE, SGR, a0, a2);
         } else {
-            tcg_out_insn(s, RRF, SGRK, a0, a1, a2);
+            tcg_out_insn(s, RRFa, SGRK, a0, a1, a2);
         }
         break;
 
@@ -2359,7 +2366,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         } else if (a0 == a1) {
             tcg_out_insn(s, RRE, NGR, args[0], args[2]);
         } else {
-            tcg_out_insn(s, RRF, NGRK, a0, a1, a2);
+            tcg_out_insn(s, RRFa, NGRK, a0, a1, a2);
         }
         break;
     case INDEX_op_or_i64:
@@ -2370,7 +2377,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         } else if (a0 == a1) {
             tcg_out_insn(s, RRE, OGR, a0, a2);
         } else {
-            tcg_out_insn(s, RRF, OGRK, a0, a1, a2);
+            tcg_out_insn(s, RRFa, OGRK, a0, a1, a2);
         }
         break;
     case INDEX_op_xor_i64:
@@ -2381,7 +2388,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         } else if (a0 == a1) {
             tcg_out_insn(s, RRE, XGR, a0, a2);
         } else {
-            tcg_out_insn(s, RRF, XGRK, a0, a1, a2);
+            tcg_out_insn(s, RRFa, XGRK, a0, a1, a2);
         }
         break;
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 05/13] tcg/s390x: Distinguish RIE formats
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (3 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 04/13] tcg/s390x: Distinguish RRF-a and RRF-c formats Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 19:47   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 06/13] tcg/s390x: Support MIE2 multiply single instructions Richard Henderson
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

There are multiple variations, with different fields.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.c.inc | 47 +++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index a489b3359e..d02b433271 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -133,16 +133,19 @@ typedef enum S390Opcode {
     RI_OILL     = 0xa50b,
     RI_TMLL     = 0xa701,
 
-    RIE_CGIJ    = 0xec7c,
-    RIE_CGRJ    = 0xec64,
-    RIE_CIJ     = 0xec7e,
-    RIE_CLGRJ   = 0xec65,
-    RIE_CLIJ    = 0xec7f,
-    RIE_CLGIJ   = 0xec7d,
-    RIE_CLRJ    = 0xec77,
-    RIE_CRJ     = 0xec76,
-    RIE_LOCGHI  = 0xec46,
-    RIE_RISBG   = 0xec55,
+    RIEb_CGRJ    = 0xec64,
+    RIEb_CLGRJ   = 0xec65,
+    RIEb_CLRJ    = 0xec77,
+    RIEb_CRJ     = 0xec76,
+
+    RIEc_CGIJ    = 0xec7c,
+    RIEc_CIJ     = 0xec7e,
+    RIEc_CLGIJ   = 0xec7d,
+    RIEc_CLIJ    = 0xec7f,
+
+    RIEf_RISBG   = 0xec55,
+
+    RIEg_LOCGHI  = 0xec46,
 
     RRE_AGR     = 0xb908,
     RRE_ALGR    = 0xb90a,
@@ -561,7 +564,7 @@ static void tcg_out_insn_RI(TCGContext *s, S390Opcode op, TCGReg r1, int i2)
     tcg_out32(s, (op << 16) | (r1 << 20) | (i2 & 0xffff));
 }
 
-static void tcg_out_insn_RIE(TCGContext *s, S390Opcode op, TCGReg r1,
+static void tcg_out_insn_RIEg(TCGContext *s, S390Opcode op, TCGReg r1,
                              int i2, int m3)
 {
     tcg_out16(s, (op & 0xff00) | (r1 << 4) | m3);
@@ -1008,9 +1011,9 @@ static inline void tcg_out_risbg(TCGContext *s, TCGReg dest, TCGReg src,
                                  int msb, int lsb, int ofs, int z)
 {
     /* Format RIE-f */
-    tcg_out16(s, (RIE_RISBG & 0xff00) | (dest << 4) | src);
+    tcg_out16(s, (RIEf_RISBG & 0xff00) | (dest << 4) | src);
     tcg_out16(s, (msb << 8) | (z << 7) | lsb);
-    tcg_out16(s, (ofs << 8) | (RIE_RISBG & 0xff));
+    tcg_out16(s, (ofs << 8) | (RIEf_RISBG & 0xff));
 }
 
 static void tgen_ext8s(TCGContext *s, TCGType type, TCGReg dest, TCGReg src)
@@ -1350,7 +1353,7 @@ static void tgen_setcond(TCGContext *s, TCGType type, TCGCond cond,
         /* Emit: d = 0, d = (cc ? 1 : d).  */
         cc = tgen_cmp(s, type, cond, c1, c2, c2const, false);
         tcg_out_movi(s, TCG_TYPE_I64, dest, 0);
-        tcg_out_insn(s, RIE, LOCGHI, dest, 1, cc);
+        tcg_out_insn(s, RIEg, LOCGHI, dest, 1, cc);
         return;
     }
 
@@ -1443,7 +1446,7 @@ static void tgen_movcond(TCGContext *s, TCGType type, TCGCond c, TCGReg dest,
     if (HAVE_FACILITY(LOAD_ON_COND)) {
         cc = tgen_cmp(s, type, c, c1, c2, c2const, false);
         if (v3const) {
-            tcg_out_insn(s, RIE, LOCGHI, dest, v3, cc);
+            tcg_out_insn(s, RIEg, LOCGHI, dest, v3, cc);
         } else {
             tcg_out_insn(s, RRFc, LOCGR, dest, v3, cc);
         }
@@ -1530,6 +1533,7 @@ static void tgen_compare_branch(TCGContext *s, S390Opcode opc, int cc,
                                 TCGReg r1, TCGReg r2, TCGLabel *l)
 {
     tcg_out_reloc(s, s->code_ptr + 1, R_390_PC16DBL, l, 2);
+    /* Format RIE-b */
     tcg_out16(s, (opc & 0xff00) | (r1 << 4) | r2);
     tcg_out16(s, 0);
     tcg_out16(s, cc << 12 | (opc & 0xff));
@@ -1539,6 +1543,7 @@ static void tgen_compare_imm_branch(TCGContext *s, S390Opcode opc, int cc,
                                     TCGReg r1, int i2, TCGLabel *l)
 {
     tcg_out_reloc(s, s->code_ptr + 1, R_390_PC16DBL, l, 2);
+    /* Format RIE-c */
     tcg_out16(s, (opc & 0xff00) | (r1 << 4) | cc);
     tcg_out16(s, 0);
     tcg_out16(s, (i2 << 8) | (opc & 0xff));
@@ -1558,8 +1563,8 @@ static void tgen_brcond(TCGContext *s, TCGType type, TCGCond c,
 
         if (!c2const) {
             opc = (type == TCG_TYPE_I32
-                   ? (is_unsigned ? RIE_CLRJ : RIE_CRJ)
-                   : (is_unsigned ? RIE_CLGRJ : RIE_CGRJ));
+                   ? (is_unsigned ? RIEb_CLRJ : RIEb_CRJ)
+                   : (is_unsigned ? RIEb_CLGRJ : RIEb_CGRJ));
             tgen_compare_branch(s, opc, cc, r1, c2, l);
             return;
         }
@@ -1570,18 +1575,18 @@ static void tgen_brcond(TCGContext *s, TCGType type, TCGCond c,
            larger comparison range afforded by COMPARE IMMEDIATE.  */
         if (type == TCG_TYPE_I32) {
             if (is_unsigned) {
-                opc = RIE_CLIJ;
+                opc = RIEc_CLIJ;
                 in_range = (uint32_t)c2 == (uint8_t)c2;
             } else {
-                opc = RIE_CIJ;
+                opc = RIEc_CIJ;
                 in_range = (int32_t)c2 == (int8_t)c2;
             }
         } else {
             if (is_unsigned) {
-                opc = RIE_CLGIJ;
+                opc = RIEc_CLGIJ;
                 in_range = (uint64_t)c2 == (uint8_t)c2;
             } else {
-                opc = RIE_CGIJ;
+                opc = RIEc_CGIJ;
                 in_range = (int64_t)c2 == (int8_t)c2;
             }
         }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 06/13] tcg/s390x: Support MIE2 multiply single instructions
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (4 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 05/13] tcg/s390x: Distinguish RIE formats Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 20:02   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 07/13] tcg/s390x: Support MIE2 MGRK instruction Richard Henderson
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

The MIE2 facility adds 3-operand versions of multiply.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target-con-set.h |  1 +
 tcg/s390x/tcg-target.h         |  1 +
 tcg/s390x/tcg-target.c.inc     | 34 ++++++++++++++++++++++++----------
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
index 00ba727b70..33a82e3286 100644
--- a/tcg/s390x/tcg-target-con-set.h
+++ b/tcg/s390x/tcg-target-con-set.h
@@ -23,6 +23,7 @@ C_O1_I2(r, 0, ri)
 C_O1_I2(r, 0, rI)
 C_O1_I2(r, 0, rJ)
 C_O1_I2(r, r, ri)
+C_O1_I2(r, r, rJ)
 C_O1_I2(r, rZ, r)
 C_O1_I2(v, v, r)
 C_O1_I2(v, v, v)
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index 645f522058..bfd623a639 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -63,6 +63,7 @@ typedef enum TCGReg {
 #define FACILITY_FAST_BCR_SER         FACILITY_LOAD_ON_COND
 #define FACILITY_DISTINCT_OPS         FACILITY_LOAD_ON_COND
 #define FACILITY_LOAD_ON_COND2        53
+#define FACILITY_MISC_INSN_EXT2       58
 #define FACILITY_VECTOR               129
 #define FACILITY_VECTOR_ENH1          135
 
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index d02b433271..cd39b2a208 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -180,6 +180,8 @@ typedef enum S390Opcode {
     RRE_SLBGR   = 0xb989,
     RRE_XGR     = 0xb982,
 
+    RRFa_MSRKC  = 0xb9fd,
+    RRFa_MSGRKC = 0xb9ed,
     RRFa_NRK    = 0xb9f4,
     RRFa_NGRK   = 0xb9e4,
     RRFa_ORK    = 0xb9f6,
@@ -2140,14 +2142,18 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
     case INDEX_op_mul_i32:
+        a0 = args[0], a1 = args[1], a2 = (int32_t)args[2];
         if (const_args[2]) {
-            if ((int32_t)args[2] == (int16_t)args[2]) {
-                tcg_out_insn(s, RI, MHI, args[0], args[2]);
+            tcg_out_mov(s, TCG_TYPE_I32, a0, a1);
+            if (a2 == (int16_t)a2) {
+                tcg_out_insn(s, RI, MHI, a0, a2);
             } else {
-                tcg_out_insn(s, RIL, MSFI, args[0], args[2]);
+                tcg_out_insn(s, RIL, MSFI, a0, a2);
             }
+        } else if (a0 == a1) {
+            tcg_out_insn(s, RRE, MSR, a0, a2);
         } else {
-            tcg_out_insn(s, RRE, MSR, args[0], args[2]);
+            tcg_out_insn(s, RRFa, MSRKC, a0, a1, a2);
         }
         break;
 
@@ -2405,14 +2411,18 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
     case INDEX_op_mul_i64:
+        a0 = args[0], a1 = args[1], a2 = args[2];
         if (const_args[2]) {
-            if (args[2] == (int16_t)args[2]) {
-                tcg_out_insn(s, RI, MGHI, args[0], args[2]);
+            tcg_out_mov(s, TCG_TYPE_I64, a0, a1);
+            if (a2 == (int16_t)a2) {
+                tcg_out_insn(s, RI, MGHI, a0, a2);
             } else {
-                tcg_out_insn(s, RIL, MSGFI, args[0], args[2]);
+                tcg_out_insn(s, RIL, MSGFI, a0, a2);
             }
+        } else if (a0 == a1) {
+            tcg_out_insn(s, RRE, MSGR, a0, a2);
         } else {
-            tcg_out_insn(s, RRE, MSGR, args[0], args[2]);
+            tcg_out_insn(s, RRFa, MSGRKC, a0, a1, a2);
         }
         break;
 
@@ -3072,12 +3082,16 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
            MULTIPLY SINGLE IMMEDIATE with a signed 32-bit, otherwise we
            have only MULTIPLY HALFWORD IMMEDIATE, with a signed 16-bit.  */
         return (HAVE_FACILITY(GEN_INST_EXT)
-                ? C_O1_I2(r, 0, ri)
+                ? (HAVE_FACILITY(MISC_INSN_EXT2)
+                   ? C_O1_I2(r, r, ri)
+                   : C_O1_I2(r, 0, ri))
                 : C_O1_I2(r, 0, rI));
 
     case INDEX_op_mul_i64:
         return (HAVE_FACILITY(GEN_INST_EXT)
-                ? C_O1_I2(r, 0, rJ)
+                ? (HAVE_FACILITY(MISC_INSN_EXT2)
+                   ? C_O1_I2(r, r, rJ)
+                   : C_O1_I2(r, 0, rJ))
                 : C_O1_I2(r, 0, rI));
 
     case INDEX_op_shl_i32:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 07/13] tcg/s390x: Support MIE2 MGRK instruction
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (5 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 06/13] tcg/s390x: Support MIE2 multiply single instructions Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 20:02   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 08/13] tcg/s390x: Support MIE3 logical operations Richard Henderson
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

The MIE2 facility adds a 3-operand signed 64x64->128 multiply.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target-con-set.h | 1 +
 tcg/s390x/tcg-target.h         | 2 +-
 tcg/s390x/tcg-target.c.inc     | 8 ++++++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
index 33a82e3286..b1a89a88ba 100644
--- a/tcg/s390x/tcg-target-con-set.h
+++ b/tcg/s390x/tcg-target-con-set.h
@@ -31,6 +31,7 @@ C_O1_I3(v, v, v, v)
 C_O1_I4(r, r, ri, r, 0)
 C_O1_I4(r, r, ri, rI, 0)
 C_O2_I2(o, m, 0, r)
+C_O2_I2(o, m, r, r)
 C_O2_I3(o, m, 0, 1, r)
 C_O2_I4(r, r, 0, 1, rA, r)
 C_O2_I4(r, r, 0, 1, ri, r)
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index bfd623a639..09cf6e60fc 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -136,7 +136,7 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_add2_i64       1
 #define TCG_TARGET_HAS_sub2_i64       1
 #define TCG_TARGET_HAS_mulu2_i64      1
-#define TCG_TARGET_HAS_muls2_i64      0
+#define TCG_TARGET_HAS_muls2_i64      HAVE_FACILITY(MISC_INSN_EXT2)
 #define TCG_TARGET_HAS_muluh_i64      0
 #define TCG_TARGET_HAS_mulsh_i64      0
 
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index cd39b2a208..7315602331 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -180,6 +180,7 @@ typedef enum S390Opcode {
     RRE_SLBGR   = 0xb989,
     RRE_XGR     = 0xb982,
 
+    RRFa_MGRK   = 0xb9ec,
     RRFa_MSRKC  = 0xb9fd,
     RRFa_MSGRKC = 0xb9ed,
     RRFa_NRK    = 0xb9f4,
@@ -2452,6 +2453,11 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         tcg_debug_assert(args[0] == args[1] + 1);
         tcg_out_insn(s, RRE, MLGR, args[1], args[3]);
         break;
+    case INDEX_op_muls2_i64:
+        tcg_debug_assert((args[1] & 1) == 0);
+        tcg_debug_assert(args[0] == args[1] + 1);
+        tcg_out_insn(s, RRFa, MGRK, args[1], args[2], args[3]);
+        break;
 
     case INDEX_op_shl_i64:
         op = RSY_SLLG;
@@ -3153,6 +3159,8 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
 
     case INDEX_op_mulu2_i64:
         return C_O2_I2(o, m, 0, r);
+    case INDEX_op_muls2_i64:
+        return C_O2_I2(o, m, r, r);
 
     case INDEX_op_add2_i32:
     case INDEX_op_sub2_i32:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 08/13] tcg/s390x: Support MIE3 logical operations
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (6 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 07/13] tcg/s390x: Support MIE2 MGRK instruction Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 20:08   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 09/13] tcg/s390x: Create tgen_cmp2 to simplify movcond Richard Henderson
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

This is andc, orc, nand, nor, eqv.
We can use nor for implementing not.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target-con-set.h |   1 +
 tcg/s390x/tcg-target.h         |  25 +++++----
 tcg/s390x/tcg-target.c.inc     | 100 +++++++++++++++++++++++++++++++++
 3 files changed, 114 insertions(+), 12 deletions(-)

diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
index b1a89a88ba..dc271a6d11 100644
--- a/tcg/s390x/tcg-target-con-set.h
+++ b/tcg/s390x/tcg-target-con-set.h
@@ -22,6 +22,7 @@ C_O1_I1(v, vr)
 C_O1_I2(r, 0, ri)
 C_O1_I2(r, 0, rI)
 C_O1_I2(r, 0, rJ)
+C_O1_I2(r, r, r)
 C_O1_I2(r, r, ri)
 C_O1_I2(r, r, rJ)
 C_O1_I2(r, rZ, r)
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index 09cf6e60fc..191c6a073e 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -64,6 +64,7 @@ typedef enum TCGReg {
 #define FACILITY_DISTINCT_OPS         FACILITY_LOAD_ON_COND
 #define FACILITY_LOAD_ON_COND2        53
 #define FACILITY_MISC_INSN_EXT2       58
+#define FACILITY_MISC_INSN_EXT3       61
 #define FACILITY_VECTOR               129
 #define FACILITY_VECTOR_ENH1          135
 
@@ -81,13 +82,13 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_ext16u_i32     1
 #define TCG_TARGET_HAS_bswap16_i32    1
 #define TCG_TARGET_HAS_bswap32_i32    1
-#define TCG_TARGET_HAS_not_i32        0
+#define TCG_TARGET_HAS_not_i32        HAVE_FACILITY(MISC_INSN_EXT3)
 #define TCG_TARGET_HAS_neg_i32        1
-#define TCG_TARGET_HAS_andc_i32       0
-#define TCG_TARGET_HAS_orc_i32        0
-#define TCG_TARGET_HAS_eqv_i32        0
-#define TCG_TARGET_HAS_nand_i32       0
-#define TCG_TARGET_HAS_nor_i32        0
+#define TCG_TARGET_HAS_andc_i32       HAVE_FACILITY(MISC_INSN_EXT3)
+#define TCG_TARGET_HAS_orc_i32        HAVE_FACILITY(MISC_INSN_EXT3)
+#define TCG_TARGET_HAS_eqv_i32        HAVE_FACILITY(MISC_INSN_EXT3)
+#define TCG_TARGET_HAS_nand_i32       HAVE_FACILITY(MISC_INSN_EXT3)
+#define TCG_TARGET_HAS_nor_i32        HAVE_FACILITY(MISC_INSN_EXT3)
 #define TCG_TARGET_HAS_clz_i32        0
 #define TCG_TARGET_HAS_ctz_i32        0
 #define TCG_TARGET_HAS_ctpop_i32      0
@@ -118,13 +119,13 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_bswap16_i64    1
 #define TCG_TARGET_HAS_bswap32_i64    1
 #define TCG_TARGET_HAS_bswap64_i64    1
-#define TCG_TARGET_HAS_not_i64        0
+#define TCG_TARGET_HAS_not_i64        HAVE_FACILITY(MISC_INSN_EXT3)
 #define TCG_TARGET_HAS_neg_i64        1
-#define TCG_TARGET_HAS_andc_i64       0
-#define TCG_TARGET_HAS_orc_i64        0
-#define TCG_TARGET_HAS_eqv_i64        0
-#define TCG_TARGET_HAS_nand_i64       0
-#define TCG_TARGET_HAS_nor_i64        0
+#define TCG_TARGET_HAS_andc_i64       HAVE_FACILITY(MISC_INSN_EXT3)
+#define TCG_TARGET_HAS_orc_i64        HAVE_FACILITY(MISC_INSN_EXT3)
+#define TCG_TARGET_HAS_eqv_i64        HAVE_FACILITY(MISC_INSN_EXT3)
+#define TCG_TARGET_HAS_nand_i64       HAVE_FACILITY(MISC_INSN_EXT3)
+#define TCG_TARGET_HAS_nor_i64        HAVE_FACILITY(MISC_INSN_EXT3)
 #define TCG_TARGET_HAS_clz_i64        HAVE_FACILITY(EXT_IMM)
 #define TCG_TARGET_HAS_ctz_i64        0
 #define TCG_TARGET_HAS_ctpop_i64      0
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index 7315602331..4dcdad04c5 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -183,8 +183,18 @@ typedef enum S390Opcode {
     RRFa_MGRK   = 0xb9ec,
     RRFa_MSRKC  = 0xb9fd,
     RRFa_MSGRKC = 0xb9ed,
+    RRFa_NCRK   = 0xb9f5,
+    RRFa_NCGRK  = 0xb9e5,
+    RRFa_NNRK   = 0xb974,
+    RRFa_NNGRK  = 0xb964,
+    RRFa_NORK   = 0xb976,
+    RRFa_NOGRK  = 0xb966,
     RRFa_NRK    = 0xb9f4,
     RRFa_NGRK   = 0xb9e4,
+    RRFa_NXRK   = 0xb977,
+    RRFa_NXGRK  = 0xb967,
+    RRFa_OCRK   = 0xb975,
+    RRFa_OCGRK  = 0xb965,
     RRFa_ORK    = 0xb9f6,
     RRFa_OGRK   = 0xb9e6,
     RRFa_SRK    = 0xb9f9,
@@ -2138,9 +2148,46 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         }
         break;
 
+    case INDEX_op_andc_i32:
+        a0 = args[0], a1 = args[1], a2 = (uint32_t)args[2];
+        if (const_args[2]) {
+            tcg_out_mov(s, TCG_TYPE_I32, a0, a1);
+            tgen_andi(s, TCG_TYPE_I32, a0, ~a2);
+	} else {
+            tcg_out_insn(s, RRFa, NCRK, a0, a1, a2);
+	}
+        break;
+    case INDEX_op_orc_i32:
+        a0 = args[0], a1 = args[1], a2 = (uint32_t)args[2];
+        if (const_args[2]) {
+            tcg_out_mov(s, TCG_TYPE_I32, a0, a1);
+            tgen_ori(s, TCG_TYPE_I32, a0, ~a2);
+        } else {
+            tcg_out_insn(s, RRFa, OCRK, a0, a1, a2);
+        }
+        break;
+    case INDEX_op_eqv_i32:
+        a0 = args[0], a1 = args[1], a2 = (uint32_t)args[2];
+        if (const_args[2]) {
+            tcg_out_mov(s, TCG_TYPE_I32, a0, a1);
+            tgen_xori(s, TCG_TYPE_I32, a0, ~a2);
+        } else {
+            tcg_out_insn(s, RRFa, NXRK, a0, a1, a2);
+        }
+        break;
+    case INDEX_op_nand_i32:
+        tcg_out_insn(s, RRFa, NNRK, args[0], args[1], args[2]);
+        break;
+    case INDEX_op_nor_i32:
+        tcg_out_insn(s, RRFa, NORK, args[0], args[1], args[2]);
+        break;
+
     case INDEX_op_neg_i32:
         tcg_out_insn(s, RR, LCR, args[0], args[1]);
         break;
+    case INDEX_op_not_i32:
+        tcg_out_insn(s, RRFa, NORK, args[0], args[1], args[1]);
+        break;
 
     case INDEX_op_mul_i32:
         a0 = args[0], a1 = args[1], a2 = (int32_t)args[2];
@@ -2404,9 +2451,46 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         }
         break;
 
+    case INDEX_op_andc_i64:
+        a0 = args[0], a1 = args[1], a2 = args[2];
+        if (const_args[2]) {
+            tcg_out_mov(s, TCG_TYPE_I64, a0, a1);
+            tgen_andi(s, TCG_TYPE_I64, a0, ~a2);
+        } else {
+            tcg_out_insn(s, RRFa, NCGRK, a0, a1, a2);
+        }
+        break;
+    case INDEX_op_orc_i64:
+        a0 = args[0], a1 = args[1], a2 = args[2];
+        if (const_args[2]) {
+            tcg_out_mov(s, TCG_TYPE_I64, a0, a1);
+            tgen_ori(s, TCG_TYPE_I64, a0, ~a2);
+        } else {
+            tcg_out_insn(s, RRFa, OCGRK, a0, a1, a2);
+        }
+        break;
+    case INDEX_op_eqv_i64:
+        a0 = args[0], a1 = args[1], a2 = args[2];
+        if (const_args[2]) {
+            tcg_out_mov(s, TCG_TYPE_I64, a0, a1);
+            tgen_xori(s, TCG_TYPE_I64, a0, ~a2);
+        } else {
+            tcg_out_insn(s, RRFa, NXGRK, a0, a1, a2);
+        }
+        break;
+    case INDEX_op_nand_i64:
+        tcg_out_insn(s, RRFa, NNGRK, args[0], args[1], args[2]);
+        break;
+    case INDEX_op_nor_i64:
+        tcg_out_insn(s, RRFa, NOGRK, args[0], args[1], args[2]);
+        break;
+
     case INDEX_op_neg_i64:
         tcg_out_insn(s, RRE, LCGR, args[0], args[1]);
         break;
+    case INDEX_op_not_i64:
+        tcg_out_insn(s, RRFa, NOGRK, args[0], args[1], args[1]);
+        break;
     case INDEX_op_bswap64_i64:
         tcg_out_insn(s, RRE, LRVGR, args[0], args[1]);
         break;
@@ -3083,6 +3167,20 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
                 ? C_O1_I2(r, r, ri)
                 : C_O1_I2(r, 0, ri));
 
+    case INDEX_op_andc_i32:
+    case INDEX_op_andc_i64:
+    case INDEX_op_orc_i32:
+    case INDEX_op_orc_i64:
+    case INDEX_op_eqv_i32:
+    case INDEX_op_eqv_i64:
+        return C_O1_I2(r, r, ri);
+
+    case INDEX_op_nand_i32:
+    case INDEX_op_nand_i64:
+    case INDEX_op_nor_i32:
+    case INDEX_op_nor_i64:
+        return C_O1_I2(r, r, r);
+
     case INDEX_op_mul_i32:
         /* If we have the general-instruction-extensions, then we have
            MULTIPLY SINGLE IMMEDIATE with a signed 32-bit, otherwise we
@@ -3118,6 +3216,8 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_bswap64_i64:
     case INDEX_op_neg_i32:
     case INDEX_op_neg_i64:
+    case INDEX_op_not_i32:
+    case INDEX_op_not_i64:
     case INDEX_op_ext8s_i32:
     case INDEX_op_ext8s_i64:
     case INDEX_op_ext8u_i32:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 09/13] tcg/s390x: Create tgen_cmp2 to simplify movcond
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (7 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 08/13] tcg/s390x: Support MIE3 logical operations Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 20:14   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 10/13] tcg/s390x: Generalize movcond implementation Richard Henderson
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

Return both regular and inverted condition codes from tgen_cmp2.
This lets us choose after the fact which comparision we want.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.c.inc | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index 4dcdad04c5..fad86f01f3 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -1288,10 +1288,11 @@ static void tgen_xori(TCGContext *s, TCGType type, TCGReg dest, uint64_t val)
     }
 }
 
-static int tgen_cmp(TCGContext *s, TCGType type, TCGCond c, TCGReg r1,
-                    TCGArg c2, bool c2const, bool need_carry)
+static int tgen_cmp2(TCGContext *s, TCGType type, TCGCond c, TCGReg r1,
+                     TCGArg c2, bool c2const, bool need_carry, int *inv_cc)
 {
     bool is_unsigned = is_unsigned_cond(c);
+    TCGCond inv_c = tcg_invert_cond(c);
     S390Opcode op;
 
     if (c2const) {
@@ -1302,6 +1303,7 @@ static int tgen_cmp(TCGContext *s, TCGType type, TCGCond c, TCGReg r1,
                 } else {
                     tcg_out_insn(s, RRE, LTGR, r1, r1);
                 }
+                *inv_cc = tcg_cond_to_ltr_cond[inv_c];
                 return tcg_cond_to_ltr_cond[c];
             }
         }
@@ -1352,9 +1354,17 @@ static int tgen_cmp(TCGContext *s, TCGType type, TCGCond c, TCGReg r1,
     }
 
  exit:
+    *inv_cc = tcg_cond_to_s390_cond[inv_c];
     return tcg_cond_to_s390_cond[c];
 }
 
+static int tgen_cmp(TCGContext *s, TCGType type, TCGCond c, TCGReg r1,
+                    TCGArg c2, bool c2const, bool need_carry)
+{
+    int inv_cc;
+    return tgen_cmp2(s, type, c, r1, c2, c2const, need_carry, &inv_cc);
+}
+
 static void tgen_setcond(TCGContext *s, TCGType type, TCGCond cond,
                          TCGReg dest, TCGReg c1, TCGArg c2, int c2const)
 {
@@ -1455,20 +1465,19 @@ static void tgen_movcond(TCGContext *s, TCGType type, TCGCond c, TCGReg dest,
                          TCGReg c1, TCGArg c2, int c2const,
                          TCGArg v3, int v3const)
 {
-    int cc;
+    int cc, inv_cc;
+
+    cc = tgen_cmp2(s, type, c, c1, c2, c2const, false, &inv_cc);
+
     if (HAVE_FACILITY(LOAD_ON_COND)) {
-        cc = tgen_cmp(s, type, c, c1, c2, c2const, false);
         if (v3const) {
             tcg_out_insn(s, RIEg, LOCGHI, dest, v3, cc);
         } else {
             tcg_out_insn(s, RRFc, LOCGR, dest, v3, cc);
         }
     } else {
-        c = tcg_invert_cond(c);
-        cc = tgen_cmp(s, type, c, c1, c2, c2const, false);
-
         /* Emit: if (cc) goto over; dest = r3; over:  */
-        tcg_out_insn(s, RI, BRC, cc, (4 + 4) >> 1);
+        tcg_out_insn(s, RI, BRC, inv_cc, (4 + 4) >> 1);
         tcg_out_insn(s, RRE, LGR, dest, v3);
     }
 }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 10/13] tcg/s390x: Generalize movcond implementation
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (8 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 09/13] tcg/s390x: Create tgen_cmp2 to simplify movcond Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 20:39   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 11/13] tcg/s390x: Support SELGR instruction in movcond Richard Henderson
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

Generalize movcond to support pre-computed conditions, and the same
set of arguments at all times.  This will be assumed by a following
patch, which needs to reuse tgen_movcond_int.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target-con-set.h |  3 +-
 tcg/s390x/tcg-target.c.inc     | 78 ++++++++++++++++++++++++++--------
 2 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
index dc271a6d11..86cdc248f2 100644
--- a/tcg/s390x/tcg-target-con-set.h
+++ b/tcg/s390x/tcg-target-con-set.h
@@ -29,8 +29,7 @@ C_O1_I2(r, rZ, r)
 C_O1_I2(v, v, r)
 C_O1_I2(v, v, v)
 C_O1_I3(v, v, v, v)
-C_O1_I4(r, r, ri, r, 0)
-C_O1_I4(r, r, ri, rI, 0)
+C_O1_I4(r, r, ri, rI, r)
 C_O2_I2(o, m, 0, r)
 C_O2_I2(o, m, r, r)
 C_O2_I3(o, m, 0, 1, r)
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index fad86f01f3..b2adbbe7de 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -1461,25 +1461,69 @@ static void tgen_setcond(TCGContext *s, TCGType type, TCGCond cond,
     }
 }
 
+static void tgen_movcond_int(TCGContext *s, TCGType type, TCGReg dest,
+                             TCGArg v3, int v3const, TCGReg v4,
+                             int cc, int inv_cc)
+{
+    TCGReg src;
+
+    /* If dest != v4, LGR+LOCGHI is larger than LGHI+{LOCGR,SELGR}. */
+    if (HAVE_FACILITY(LOAD_ON_COND2) && v3const && dest == v4) {
+        /* Emit: if (cc) dest = v3. */
+        tcg_out_insn(s, RIEg, LOCGHI, dest, v3, cc);
+        return;
+    }
+
+    if (HAVE_FACILITY(LOAD_ON_COND)) {
+        if (dest == v4) {
+            if (v3const) {
+                tcg_out_insn(s, RI, LGHI, TCG_TMP0, v3);
+                src = TCG_TMP0;
+            } else {
+                src = v3;
+            }
+        } else {
+            if (v3const) {
+                tcg_out_insn(s, RI, LGHI, dest, v3);
+            } else {
+                tcg_out_mov(s, type, dest, v3);
+            }
+            cc = inv_cc;
+            src = v4;
+        }
+        /* Emit: if (cc) dest = src. */
+        tcg_out_insn(s, RRFc, LOCGR, dest, src, cc);
+        return;
+    }
+
+    if (v3const) {
+        tcg_out_mov(s, type, dest, v4);
+        /* Emit: if (!cc) goto over; dest = r3; over:  */
+        tcg_out_insn(s, RI, BRC, inv_cc, (4 + 4) >> 1);
+        tcg_out_insn(s, RI, LGHI, dest, v3);
+        return;
+    }
+
+    if (dest == v4) {
+        src = v3;
+    } else {
+        tcg_out_mov(s, type, dest, v3);
+        inv_cc = cc;
+        src = v4;
+    }
+    /* Emit: if (!cc) goto over; dest = src; over:  */
+    tcg_out_insn(s, RI, BRC, inv_cc, (4 + 4) >> 1);
+    tcg_out_insn(s, RRE, LGR, dest, src);
+}
+
 static void tgen_movcond(TCGContext *s, TCGType type, TCGCond c, TCGReg dest,
                          TCGReg c1, TCGArg c2, int c2const,
-                         TCGArg v3, int v3const)
+                         TCGArg v3, int v3const, TCGReg v4)
 {
     int cc, inv_cc;
 
     cc = tgen_cmp2(s, type, c, c1, c2, c2const, false, &inv_cc);
-
-    if (HAVE_FACILITY(LOAD_ON_COND)) {
-        if (v3const) {
-            tcg_out_insn(s, RIEg, LOCGHI, dest, v3, cc);
-        } else {
-            tcg_out_insn(s, RRFc, LOCGR, dest, v3, cc);
-        }
-    } else {
-        /* Emit: if (cc) goto over; dest = r3; over:  */
-        tcg_out_insn(s, RI, BRC, inv_cc, (4 + 4) >> 1);
-        tcg_out_insn(s, RRE, LGR, dest, v3);
-    }
+    tgen_movcond_int(s, type, dest, v3, v3const, v4, cc, inv_cc);
 }
 
 static void tgen_clz(TCGContext *s, TCGReg dest, TCGReg a1,
@@ -2352,7 +2396,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
     case INDEX_op_movcond_i32:
         tgen_movcond(s, TCG_TYPE_I32, args[5], args[0], args[1],
-                     args[2], const_args[2], args[3], const_args[3]);
+                     args[2], const_args[2], args[3], const_args[3], args[4]);
         break;
 
     case INDEX_op_qemu_ld_i32:
@@ -2644,7 +2688,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
     case INDEX_op_movcond_i64:
         tgen_movcond(s, TCG_TYPE_I64, args[5], args[0], args[1],
-                     args[2], const_args[2], args[3], const_args[3]);
+                     args[2], const_args[2], args[3], const_args[3], args[4]);
         break;
 
     OP_32_64(deposit):
@@ -3256,9 +3300,7 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
 
     case INDEX_op_movcond_i32:
     case INDEX_op_movcond_i64:
-        return (HAVE_FACILITY(LOAD_ON_COND2)
-                ? C_O1_I4(r, r, ri, rI, 0)
-                : C_O1_I4(r, r, ri, r, 0));
+        return C_O1_I4(r, r, ri, rI, r);
 
     case INDEX_op_div2_i32:
     case INDEX_op_div2_i64:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 11/13] tcg/s390x: Support SELGR instruction in movcond
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (9 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 10/13] tcg/s390x: Generalize movcond implementation Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 20:41   ` Ilya Leoshkevich
  2022-12-02  6:51 ` [PATCH v3 12/13] tcg/s390x: Use tgen_movcond_int in tgen_clz Richard Henderson
  2022-12-02  6:52 ` [PATCH v3 13/13] tcg/s390x: Implement ctpop operation Richard Henderson
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

The new select instruction provides two separate register inputs,
whereas the old load-on-condition instruction overlaps one of the
register inputs with the destination.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.c.inc | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index b2adbbe7de..1e4947b598 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -204,6 +204,8 @@ typedef enum S390Opcode {
     RRFa_XRK    = 0xb9f7,
     RRFa_XGRK   = 0xb9e7,
 
+    RRFam_SELGR = 0xb9e3,
+
     RRFc_LOCR   = 0xb9f2,
     RRFc_LOCGR  = 0xb9e2,
 
@@ -560,12 +562,20 @@ static void tcg_out_insn_RRE(TCGContext *s, S390Opcode op,
     tcg_out32(s, (op << 16) | (r1 << 4) | r2);
 }
 
+/* RRF-a without the m4 field */
 static void tcg_out_insn_RRFa(TCGContext *s, S390Opcode op,
                               TCGReg r1, TCGReg r2, TCGReg r3)
 {
     tcg_out32(s, (op << 16) | (r3 << 12) | (r1 << 4) | r2);
 }
 
+/* RRF-a with the m4 field */
+static void tcg_out_insn_RRFam(TCGContext *s, S390Opcode op,
+                               TCGReg r1, TCGReg r2, TCGReg r3, int m4)
+{
+    tcg_out32(s, (op << 16) | (r3 << 12) | (m4 << 8) | (r1 << 4) | r2);
+}
+
 static void tcg_out_insn_RRFc(TCGContext *s, S390Opcode op,
                               TCGReg r1, TCGReg r2, int m3)
 {
@@ -1474,6 +1484,17 @@ static void tgen_movcond_int(TCGContext *s, TCGType type, TCGReg dest,
         return;
     }
 
+    /* Note that while MIE3 implies LOC, it does not imply LOC2. */
+    if (HAVE_FACILITY(MISC_INSN_EXT3)) {
+        if (v3const) {
+            tcg_out_insn(s, RI, LGHI, TCG_TMP0, v3);
+            v3 = TCG_TMP0;
+        }
+        /* Emit: dest = cc ? v3 : v4. */
+        tcg_out_insn(s, RRFam, SELGR, dest, v3, v4, cc);
+        return;
+    }
+
     if (HAVE_FACILITY(LOAD_ON_COND)) {
         if (dest == v4) {
             if (v3const) {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 12/13] tcg/s390x: Use tgen_movcond_int in tgen_clz
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (10 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 11/13] tcg/s390x: Support SELGR instruction in movcond Richard Henderson
@ 2022-12-02  6:51 ` Richard Henderson
  2022-12-06 20:49   ` Ilya Leoshkevich
  2022-12-02  6:52 ` [PATCH v3 13/13] tcg/s390x: Implement ctpop operation Richard Henderson
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

Reuse code from movcond to conditionally copy a2 to dest,
based on the condition codes produced by FLOGR.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target-con-set.h |  1 +
 tcg/s390x/tcg-target.c.inc     | 26 +++++++++++---------------
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
index 86cdc248f2..cfc0d405b3 100644
--- a/tcg/s390x/tcg-target-con-set.h
+++ b/tcg/s390x/tcg-target-con-set.h
@@ -24,6 +24,7 @@ C_O1_I2(r, 0, rI)
 C_O1_I2(r, 0, rJ)
 C_O1_I2(r, r, r)
 C_O1_I2(r, r, ri)
+C_O1_I2(r, r, rI)
 C_O1_I2(r, r, rJ)
 C_O1_I2(r, rZ, r)
 C_O1_I2(v, v, r)
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index 1e4947b598..23cbb10168 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -1557,21 +1557,15 @@ static void tgen_clz(TCGContext *s, TCGReg dest, TCGReg a1,
 
     if (a2const && a2 == 64) {
         tcg_out_mov(s, TCG_TYPE_I64, dest, TCG_REG_R0);
-    } else {
-        if (a2const) {
-            tcg_out_movi(s, TCG_TYPE_I64, dest, a2);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_I64, dest, a2);
-        }
-        if (HAVE_FACILITY(LOAD_ON_COND)) {
-            /* Emit: if (one bit found) dest = r0.  */
-            tcg_out_insn(s, RRFc, LOCGR, dest, TCG_REG_R0, 2);
-        } else {
-            /* Emit: if (no one bit found) goto over; dest = r0; over:  */
-            tcg_out_insn(s, RI, BRC, 8, (4 + 4) >> 1);
-            tcg_out_insn(s, RRE, LGR, dest, TCG_REG_R0);
-        }
+        return;
     }
+
+    /*
+     * Conditions from FLOGR are:
+     *   2 -> one bit found
+     *   8 -> no one bit found
+     */
+    tgen_movcond_int(s, TCG_TYPE_I64, dest, a2, a2const, TCG_REG_R0, 8, 2);
 }
 
 static void tgen_deposit(TCGContext *s, TCGReg dest, TCGReg src,
@@ -3224,11 +3218,13 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_rotl_i64:
     case INDEX_op_rotr_i32:
     case INDEX_op_rotr_i64:
-    case INDEX_op_clz_i64:
     case INDEX_op_setcond_i32:
     case INDEX_op_setcond_i64:
         return C_O1_I2(r, r, ri);
 
+    case INDEX_op_clz_i64:
+        return C_O1_I2(r, r, rI);
+
     case INDEX_op_sub_i32:
     case INDEX_op_sub_i64:
     case INDEX_op_and_i32:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 13/13] tcg/s390x: Implement ctpop operation
  2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
                   ` (11 preceding siblings ...)
  2022-12-02  6:51 ` [PATCH v3 12/13] tcg/s390x: Use tgen_movcond_int in tgen_clz Richard Henderson
@ 2022-12-02  6:52 ` Richard Henderson
  2022-12-06 21:10   ` Ilya Leoshkevich
  12 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-02  6:52 UTC (permalink / raw)
  To: qemu-devel; +Cc: thuth, iii

There is an older form that produces per-byte results,
and a newer form that produces per-register results,
and a vector form that produces per-element results.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.h     |  5 ++--
 tcg/s390x/tcg-target.c.inc | 51 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index 191c6a073e..5d184d8e14 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -62,6 +62,7 @@ typedef enum TCGReg {
 #define FACILITY_LOAD_ON_COND         45
 #define FACILITY_FAST_BCR_SER         FACILITY_LOAD_ON_COND
 #define FACILITY_DISTINCT_OPS         FACILITY_LOAD_ON_COND
+#define FACILITY_POPCOUNT             FACILITY_LOAD_ON_COND
 #define FACILITY_LOAD_ON_COND2        53
 #define FACILITY_MISC_INSN_EXT2       58
 #define FACILITY_MISC_INSN_EXT3       61
@@ -91,7 +92,7 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_nor_i32        HAVE_FACILITY(MISC_INSN_EXT3)
 #define TCG_TARGET_HAS_clz_i32        0
 #define TCG_TARGET_HAS_ctz_i32        0
-#define TCG_TARGET_HAS_ctpop_i32      0
+#define TCG_TARGET_HAS_ctpop_i32      HAVE_FACILITY(POPCOUNT)
 #define TCG_TARGET_HAS_deposit_i32    HAVE_FACILITY(GEN_INST_EXT)
 #define TCG_TARGET_HAS_extract_i32    HAVE_FACILITY(GEN_INST_EXT)
 #define TCG_TARGET_HAS_sextract_i32   0
@@ -128,7 +129,7 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_nor_i64        HAVE_FACILITY(MISC_INSN_EXT3)
 #define TCG_TARGET_HAS_clz_i64        HAVE_FACILITY(EXT_IMM)
 #define TCG_TARGET_HAS_ctz_i64        0
-#define TCG_TARGET_HAS_ctpop_i64      0
+#define TCG_TARGET_HAS_ctpop_i64      HAVE_FACILITY(POPCOUNT)
 #define TCG_TARGET_HAS_deposit_i64    HAVE_FACILITY(GEN_INST_EXT)
 #define TCG_TARGET_HAS_extract_i64    HAVE_FACILITY(GEN_INST_EXT)
 #define TCG_TARGET_HAS_sextract_i64   0
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index 23cbb10168..7744c6ad54 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -208,6 +208,7 @@ typedef enum S390Opcode {
 
     RRFc_LOCR   = 0xb9f2,
     RRFc_LOCGR  = 0xb9e2,
+    RRFc_POPCNT = 0xb9e1,
 
     RR_AR       = 0x1a,
     RR_ALR      = 0x1e,
@@ -259,6 +260,7 @@ typedef enum S390Opcode {
     RXY_LRVG    = 0xe30f,
     RXY_LRVH    = 0xe31f,
     RXY_LY      = 0xe358,
+    RXY_MSG     = 0xe30c,
     RXY_NG      = 0xe380,
     RXY_OG      = 0xe381,
     RXY_STCY    = 0xe372,
@@ -276,6 +278,7 @@ typedef enum S390Opcode {
     RX_L        = 0x58,
     RX_LA       = 0x41,
     RX_LH       = 0x48,
+    RX_MS       = 0x71,
     RX_ST       = 0x50,
     RX_STC      = 0x42,
     RX_STH      = 0x40,
@@ -1568,6 +1571,45 @@ static void tgen_clz(TCGContext *s, TCGReg dest, TCGReg a1,
     tgen_movcond_int(s, TCG_TYPE_I64, dest, a2, a2const, TCG_REG_R0, 8, 2);
 }
 
+static void tgen_ctpop(TCGContext *s, TCGType type, TCGReg dest, TCGReg src)
+{
+    /* With MIE3, and bit 0 of m4 set, we get the complete result. */
+    if (HAVE_FACILITY(MISC_INSN_EXT3)) {
+        if (type == TCG_TYPE_I32) {
+            tgen_ext32u(s, dest, src);
+            src = dest;
+        }
+        tcg_out_insn(s, RRFc, POPCNT, dest, src, 8);
+        return;
+    }
+
+    /* Without MIE3, each byte gets the count of bits for the byte. */
+    tcg_out_insn(s, RRFc, POPCNT, dest, src, 0);
+
+    /* Multiply to sum each byte at the top of the word. */
+    if (type == TCG_TYPE_I32 && HAVE_FACILITY(GEN_INST_EXT)) {
+        tcg_out_insn(s, RIL, MSFI, dest, 0x01010101);
+    } else {
+        /* No space to save: share the constant between TCG_TYPE_I32/I64. */
+        tcg_out_insn(s, RIL, LARL, TCG_TMP0, 0);
+        new_pool_label(s, 0x0101010101010101ull,
+                       R_390_PC32DBL, s->code_ptr - 2, 2);
+
+        if (type == TCG_TYPE_I32) {
+            tcg_out_insn(s, RX, MS, dest, TCG_TMP0, TCG_REG_NONE, 0);
+        } else {
+            tcg_out_insn(s, RXY, MSG, dest, TCG_TMP0, TCG_REG_NONE, 0);
+        }
+    }
+
+    /* Shift result down from the top byte. */
+    if (type == TCG_TYPE_I32) {
+        tcg_out_sh32(s, RS_SRL, dest, TCG_REG_NONE, 24);
+    } else {
+        tcg_out_sh64(s, RSY_SRLG, dest, dest, TCG_REG_NONE, 56);
+    }
+}
+
 static void tgen_deposit(TCGContext *s, TCGReg dest, TCGReg src,
                          int ofs, int len, int z)
 {
@@ -2733,6 +2775,13 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         tgen_clz(s, args[0], args[1], args[2], const_args[2]);
         break;
 
+    case INDEX_op_ctpop_i32:
+        tgen_ctpop(s, TCG_TYPE_I32, args[0], args[1]);
+        break;
+    case INDEX_op_ctpop_i64:
+        tgen_ctpop(s, TCG_TYPE_I64, args[0], args[1]);
+        break;
+
     case INDEX_op_mb:
         /* The host memory model is quite strong, we simply need to
            serialize the instruction stream.  */
@@ -3302,6 +3351,8 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_extu_i32_i64:
     case INDEX_op_extract_i32:
     case INDEX_op_extract_i64:
+    case INDEX_op_ctpop_i32:
+    case INDEX_op_ctpop_i64:
         return C_O1_I1(r, r);
 
     case INDEX_op_qemu_ld_i32:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 01/13] tcg/s390x: Use register pair allocation for div and mulu2
  2022-12-02  6:51 ` [PATCH v3 01/13] tcg/s390x: Use register pair allocation for div and mulu2 Richard Henderson
@ 2022-12-06 15:49   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 15:49 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:48PM -0800, Richard Henderson wrote:
> Previously we hard-coded R2 and R3.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target-con-set.h |  4 ++--
>  tcg/s390x/tcg-target-con-str.h |  8 +------
>  tcg/s390x/tcg-target.c.inc     | 43 +++++++++++++++++++++++++---------
>  3 files changed, 35 insertions(+), 20 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-02  6:51 ` [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB Richard Henderson
@ 2022-12-06 19:29   ` Ilya Leoshkevich
  2022-12-06 22:22     ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 19:29 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:49PM -0800, Richard Henderson wrote:
> This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and
> several follow-up patches.  The primary motivation is to reduce the
> less-tested code paths, pre-z10.  Secondarily, this allows the
> unconditional use of TCG_TARGET_HAS_direct_jump, which might be more
> important for performance than any slight increase in code size.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target.h     |   2 +-
>  tcg/s390x/tcg-target.c.inc | 176 +++++--------------------------------
>  2 files changed, 23 insertions(+), 155 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>

I have a few questions/ideas for the future below.
 
> diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
> index 22d70d431b..645f522058 100644
> --- a/tcg/s390x/tcg-target.h
> +++ b/tcg/s390x/tcg-target.h
> @@ -103,7 +103,7 @@ extern uint64_t s390_facilities[3];
>  #define TCG_TARGET_HAS_mulsh_i32      0
>  #define TCG_TARGET_HAS_extrl_i64_i32  0
>  #define TCG_TARGET_HAS_extrh_i64_i32  0
> -#define TCG_TARGET_HAS_direct_jump    HAVE_FACILITY(GEN_INST_EXT)
> +#define TCG_TARGET_HAS_direct_jump    1

This change doesn't seem to affect that, but what is the minimum
supported s390x qemu host? z900?

>  #define TCG_TARGET_HAS_qemu_st8_i32   0
>  
>  #define TCG_TARGET_HAS_div2_i64       1
> diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
> index cb00bb6999..8a4bec0a28 100644
> --- a/tcg/s390x/tcg-target.c.inc
> +++ b/tcg/s390x/tcg-target.c.inc
> @@ -65,12 +65,6 @@
>  /* A scratch register that may be be used throughout the backend.  */
>  #define TCG_TMP0        TCG_REG_R1
>  
> -/* A scratch register that holds a pointer to the beginning of the TB.
> -   We don't need this when we have pc-relative loads with the general
> -   instructions extension facility.  */
> -#define TCG_REG_TB      TCG_REG_R12
> -#define USE_REG_TB      (!HAVE_FACILITY(GEN_INST_EXT))
> -
>  #ifndef CONFIG_SOFTMMU
>  #define TCG_GUEST_BASE_REG TCG_REG_R13
>  #endif
> @@ -813,8 +807,8 @@ static bool maybe_out_small_movi(TCGContext *s, TCGType type,
>  }
>  
>  /* load a register with an immediate value */
> -static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
> -                             tcg_target_long sval, bool in_prologue)
> +static void tcg_out_movi(TCGContext *s, TCGType type,
> +                         TCGReg ret, tcg_target_long sval)
>  {
>      tcg_target_ulong uval;
>  
> @@ -853,14 +847,6 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>              tcg_out_insn(s, RIL, LARL, ret, off);
>              return;
>          }
> -    } else if (USE_REG_TB && !in_prologue) {
> -        ptrdiff_t off = tcg_tbrel_diff(s, (void *)sval);
> -        if (off == sextract64(off, 0, 20)) {
> -            /* This is certain to be an address within TB, and therefore
> -               OFF will be negative; don't try RX_LA.  */
> -            tcg_out_insn(s, RXY, LAY, ret, TCG_REG_TB, TCG_REG_NONE, off);
> -            return;
> -        }
>      }
>  
>      /* A 32-bit unsigned value can be loaded in 2 insns.  And given
> @@ -876,10 +862,6 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>      if (HAVE_FACILITY(GEN_INST_EXT)) {
>          tcg_out_insn(s, RIL, LGRL, ret, 0);
>          new_pool_label(s, sval, R_390_PC32DBL, s->code_ptr - 2, 2);
> -    } else if (USE_REG_TB && !in_prologue) {
> -        tcg_out_insn(s, RXY, LG, ret, TCG_REG_TB, TCG_REG_NONE, 0);
> -        new_pool_label(s, sval, R_390_20, s->code_ptr - 2,
> -                       tcg_tbrel_diff(s, NULL));
>      } else {
>          TCGReg base = ret ? ret : TCG_TMP0;
>          tcg_out_insn(s, RIL, LARL, base, 0);
> @@ -888,12 +870,6 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>      }
>  }

I did some benchmarking of various ways to load constants in context of
GCC in the past, and it turned out that LLIHF+OILF is more efficient
than literal pool [1].

> -static void tcg_out_movi(TCGContext *s, TCGType type,
> -                         TCGReg ret, tcg_target_long sval)
> -{
> -    tcg_out_movi_int(s, type, ret, sval, false);
> -}
> -
>  /* Emit a load/store type instruction.  Inputs are:
>     DATA:     The register to be loaded or stored.
>     BASE+OFS: The effective address.
> @@ -1020,35 +996,6 @@ static inline bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
>      return false;
>  }
>  
> -/* load data from an absolute host address */
> -static void tcg_out_ld_abs(TCGContext *s, TCGType type,
> -                           TCGReg dest, const void *abs)
> -{
> -    intptr_t addr = (intptr_t)abs;
> -
> -    if (HAVE_FACILITY(GEN_INST_EXT) && !(addr & 1)) {
> -        ptrdiff_t disp = tcg_pcrel_diff(s, abs) >> 1;
> -        if (disp == (int32_t)disp) {
> -            if (type == TCG_TYPE_I32) {
> -                tcg_out_insn(s, RIL, LRL, dest, disp);
> -            } else {
> -                tcg_out_insn(s, RIL, LGRL, dest, disp);
> -            }
> -            return;
> -        }
> -    }
> -    if (USE_REG_TB) {
> -        ptrdiff_t disp = tcg_tbrel_diff(s, abs);
> -        if (disp == sextract64(disp, 0, 20)) {
> -            tcg_out_ld(s, type, dest, TCG_REG_TB, disp);
> -            return;
> -        }
> -    }
> -
> -    tcg_out_movi(s, TCG_TYPE_PTR, dest, addr & ~0xffff);
> -    tcg_out_ld(s, type, dest, dest, addr & 0xffff);
> -}
> -
>  static inline void tcg_out_risbg(TCGContext *s, TCGReg dest, TCGReg src,
>                                   int msb, int lsb, int ofs, int z)
>  {
> @@ -1243,17 +1190,7 @@ static void tgen_andi(TCGContext *s, TCGType type, TCGReg dest, uint64_t val)
>          return;
>      }
>  
> -    /* Use the constant pool if USE_REG_TB, but not for small constants.  */
> -    if (USE_REG_TB) {
> -        if (!maybe_out_small_movi(s, type, TCG_TMP0, val)) {
> -            tcg_out_insn(s, RXY, NG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
> -            new_pool_label(s, val & valid, R_390_20, s->code_ptr - 2,
> -                           tcg_tbrel_diff(s, NULL));
> -            return;
> -        }
> -    } else {
> -        tcg_out_movi(s, type, TCG_TMP0, val);
> -    }
> +    tcg_out_movi(s, type, TCG_TMP0, val);
>      if (type == TCG_TYPE_I32) {
>          tcg_out_insn(s, RR, NR, dest, TCG_TMP0);
>      } else {
> @@ -1297,24 +1234,11 @@ static void tgen_ori(TCGContext *s, TCGType type, TCGReg dest, uint64_t val)
>          }
>      }
>  
> -    /* Use the constant pool if USE_REG_TB, but not for small constants.  */
> -    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
> -        if (type == TCG_TYPE_I32) {
> -            tcg_out_insn(s, RR, OR, dest, TCG_TMP0);
> -        } else {
> -            tcg_out_insn(s, RRE, OGR, dest, TCG_TMP0);
> -        }
> -    } else if (USE_REG_TB) {
> -        tcg_out_insn(s, RXY, OG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
> -        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
> -                       tcg_tbrel_diff(s, NULL));
> +    tcg_out_movi(s, type, TCG_TMP0, val);
> +    if (type == TCG_TYPE_I32) {
> +        tcg_out_insn(s, RR, OR, dest, TCG_TMP0);
>      } else {
> -        /* Perform the OR via sequential modifications to the high and
> -           low parts.  Do this via recursion to handle 16-bit vs 32-bit
> -           masks in each half.  */
> -        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
> -        tgen_ori(s, type, dest, val & 0x00000000ffffffffull);
> -        tgen_ori(s, type, dest, val & 0xffffffff00000000ull);
> +        tcg_out_insn(s, RRE, OGR, dest, TCG_TMP0);
>      }
>  }
>  
> @@ -1332,26 +1256,11 @@ static void tgen_xori(TCGContext *s, TCGType type, TCGReg dest, uint64_t val)
>          }
>      }
>  
> -    /* Use the constant pool if USE_REG_TB, but not for small constants.  */
> -    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
> -        if (type == TCG_TYPE_I32) {
> -            tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
> -        } else {
> -            tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
> -        }
> -    } else if (USE_REG_TB) {
> -        tcg_out_insn(s, RXY, XG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
> -        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
> -                       tcg_tbrel_diff(s, NULL));
> +    tcg_out_movi(s, type, TCG_TMP0, val);
> +    if (type == TCG_TYPE_I32) {
> +        tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
>      } else {
> -        /* Perform the xor by parts.  */
> -        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
> -        if (val & 0xffffffff) {
> -            tcg_out_insn(s, RIL, XILF, dest, val);
> -        }
> -        if (val > 0xffffffff) {
> -            tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
> -        }
> +        tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
>      }
>  }

Wouldn't it be worth keeping XILF/XIFH here?
I don't have any numbers right now, but it looks more compact/efficient
than a load + XGR.
Same for OGR above; I even wonder if both implementations could be
unified.

> @@ -1395,19 +1304,6 @@ static int tgen_cmp(TCGContext *s, TCGType type, TCGCond c, TCGReg r1,
>          if (maybe_out_small_movi(s, type, TCG_TMP0, c2)) {
>              c2 = TCG_TMP0;
>              /* fall through to reg-reg */
> -        } else if (USE_REG_TB) {
> -            if (type == TCG_TYPE_I32) {
> -                op = (is_unsigned ? RXY_CLY : RXY_CY);
> -                tcg_out_insn_RXY(s, op, r1, TCG_REG_TB, TCG_REG_NONE, 0);
> -                new_pool_label(s, (uint32_t)c2, R_390_20, s->code_ptr - 2,
> -                               4 - tcg_tbrel_diff(s, NULL));
> -            } else {
> -                op = (is_unsigned ? RXY_CLG : RXY_CG);
> -                tcg_out_insn_RXY(s, op, r1, TCG_REG_TB, TCG_REG_NONE, 0);
> -                new_pool_label(s, c2, R_390_20, s->code_ptr - 2,
> -                               tcg_tbrel_diff(s, NULL));
> -            }
> -            goto exit;
>          } else {
>              if (type == TCG_TYPE_I32) {
>                  op = (is_unsigned ? RIL_CLRL : RIL_CRL);
> @@ -2101,43 +1997,22 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>  
>      case INDEX_op_goto_tb:
>          a0 = args[0];
> -        if (s->tb_jmp_insn_offset) {
> -            /*
> -             * branch displacement must be aligned for atomic patching;
> -             * see if we need to add extra nop before branch
> -             */
> -            if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
> -                tcg_out16(s, NOP);
> -            }
> -            tcg_debug_assert(!USE_REG_TB);
> -            tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
> -            s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
> -            s->code_ptr += 2;
> -        } else {
> -            /* load address stored at s->tb_jmp_target_addr + a0 */
> -            tcg_out_ld_abs(s, TCG_TYPE_PTR, TCG_REG_TB,
> -                           tcg_splitwx_to_rx(s->tb_jmp_target_addr + a0));
> -            /* and go there */
> -            tcg_out_insn(s, RR, BCR, S390_CC_ALWAYS, TCG_REG_TB);
> +        tcg_debug_assert(s->tb_jmp_insn_offset);
> +        /*
> +         * branch displacement must be aligned for atomic patching;
> +         * see if we need to add extra nop before branch
> +         */
> +        if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
> +            tcg_out16(s, NOP);
>          }
> +        tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
> +        s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
> +        tcg_out32(s, 0);
>          set_jmp_reset_offset(s, a0);

This seems to work in practice, but I don't think patching branch
offsets is allowed by PoP (in a multi-threaded environment). For
example, I had to do [2] in order to work around this limitation in
ftrace.

> -
> -        /* For the unlinked path of goto_tb, we need to reset
> -           TCG_REG_TB to the beginning of this TB.  */
> -        if (USE_REG_TB) {
> -            int ofs = -tcg_current_code_size(s);
> -            /* All TB are restricted to 64KiB by unwind info. */
> -            tcg_debug_assert(ofs == sextract64(ofs, 0, 20));
> -            tcg_out_insn(s, RXY, LAY, TCG_REG_TB,
> -                         TCG_REG_TB, TCG_REG_NONE, ofs);
> -        }
>          break;
>  
>      case INDEX_op_goto_ptr:
>          a0 = args[0];
> -        if (USE_REG_TB) {
> -            tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_TB, a0);
> -        }
>          tcg_out_insn(s, RR, BCR, S390_CC_ALWAYS, a0);
>          break;
>  
> @@ -3405,9 +3280,6 @@ static void tcg_target_init(TCGContext *s)
>      /* XXX many insns can't be used with R0, so we better avoid it for now */
>      tcg_regset_set_reg(s->reserved_regs, TCG_REG_R0);
>      tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
> -    if (USE_REG_TB) {
> -        tcg_regset_set_reg(s->reserved_regs, TCG_REG_TB);
> -    }
>  }

A third benefit seems to be that we now have one more register to
allocate.

>  
>  #define FRAME_SIZE  ((int)(TCG_TARGET_CALL_STACK_OFFSET          \
> @@ -3428,16 +3300,12 @@ static void tcg_target_qemu_prologue(TCGContext *s)
>  
>  #ifndef CONFIG_SOFTMMU
>      if (guest_base >= 0x80000) {
> -        tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_GUEST_BASE_REG, guest_base, true);
> +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_GUEST_BASE_REG, guest_base);
>          tcg_regset_set_reg(s->reserved_regs, TCG_GUEST_BASE_REG);
>      }
>  #endif
>  
>      tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);
> -    if (USE_REG_TB) {
> -        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_TB,
> -                    tcg_target_call_iarg_regs[1]);
> -    }
>  
>      /* br %r3 (go to TB) */
>      tcg_out_insn(s, RR, BCR, S390_CC_ALWAYS, tcg_target_call_iarg_regs[1]);
> -- 
> 2.34.1

[1] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=9776b4653bc4f8b568ea49fea4a7d7460e58b68a
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de5012b41e5c900a8a3875c7a825394c5f624c05


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 03/13] tcg/s390x: Use LARL+AGHI for odd addresses
  2022-12-02  6:51 ` [PATCH v3 03/13] tcg/s390x: Use LARL+AGHI for odd addresses Richard Henderson
@ 2022-12-06 19:42   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 19:42 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:50PM -0800, Richard Henderson wrote:
> Add one instead of dropping odd addresses to the constant pool.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target.c.inc | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 04/13] tcg/s390x: Distinguish RRF-a and RRF-c formats
  2022-12-02  6:51 ` [PATCH v3 04/13] tcg/s390x: Distinguish RRF-a and RRF-c formats Richard Henderson
@ 2022-12-06 19:45   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 19:45 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:51PM -0800, Richard Henderson wrote:
> One has 3 register arguments; the other has 2 plus an m3 field.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target.c.inc | 57 +++++++++++++++++++++-----------------
>  1 file changed, 32 insertions(+), 25 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 05/13] tcg/s390x: Distinguish RIE formats
  2022-12-02  6:51 ` [PATCH v3 05/13] tcg/s390x: Distinguish RIE formats Richard Henderson
@ 2022-12-06 19:47   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 19:47 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:52PM -0800, Richard Henderson wrote:
> There are multiple variations, with different fields.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target.c.inc | 47 +++++++++++++++++++++-----------------
>  1 file changed, 26 insertions(+), 21 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 06/13] tcg/s390x: Support MIE2 multiply single instructions
  2022-12-02  6:51 ` [PATCH v3 06/13] tcg/s390x: Support MIE2 multiply single instructions Richard Henderson
@ 2022-12-06 20:02   ` Ilya Leoshkevich
  2022-12-06 20:20     ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 20:02 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel

On Thu, Dec 01, 2022 at 10:51:53PM -0800, Richard Henderson wrote:
> The MIE2 facility adds 3-operand versions of multiply.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target-con-set.h |  1 +
>  tcg/s390x/tcg-target.h         |  1 +
>  tcg/s390x/tcg-target.c.inc     | 34 ++++++++++++++++++++++++----------
>  3 files changed, 26 insertions(+), 10 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>

I have one small suggestion, see below.

> diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
> index 00ba727b70..33a82e3286 100644
> --- a/tcg/s390x/tcg-target-con-set.h
> +++ b/tcg/s390x/tcg-target-con-set.h
> @@ -23,6 +23,7 @@ C_O1_I2(r, 0, ri)
>  C_O1_I2(r, 0, rI)
>  C_O1_I2(r, 0, rJ)
>  C_O1_I2(r, r, ri)
> +C_O1_I2(r, r, rJ)
>  C_O1_I2(r, rZ, r)
>  C_O1_I2(v, v, r)
>  C_O1_I2(v, v, v)
> diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
> index 645f522058..bfd623a639 100644
> --- a/tcg/s390x/tcg-target.h
> +++ b/tcg/s390x/tcg-target.h
> @@ -63,6 +63,7 @@ typedef enum TCGReg {
>  #define FACILITY_FAST_BCR_SER         FACILITY_LOAD_ON_COND
>  #define FACILITY_DISTINCT_OPS         FACILITY_LOAD_ON_COND
>  #define FACILITY_LOAD_ON_COND2        53
> +#define FACILITY_MISC_INSN_EXT2       58
>  #define FACILITY_VECTOR               129
>  #define FACILITY_VECTOR_ENH1          135
>  
> diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
> index d02b433271..cd39b2a208 100644
> --- a/tcg/s390x/tcg-target.c.inc
> +++ b/tcg/s390x/tcg-target.c.inc
> @@ -180,6 +180,8 @@ typedef enum S390Opcode {
>      RRE_SLBGR   = 0xb989,
>      RRE_XGR     = 0xb982,
>  
> +    RRFa_MSRKC  = 0xb9fd,
> +    RRFa_MSGRKC = 0xb9ed,
>      RRFa_NRK    = 0xb9f4,
>      RRFa_NGRK   = 0xb9e4,
>      RRFa_ORK    = 0xb9f6,
> @@ -2140,14 +2142,18 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>          break;
>  
>      case INDEX_op_mul_i32:
> +        a0 = args[0], a1 = args[1], a2 = (int32_t)args[2];
>          if (const_args[2]) {
> -            if ((int32_t)args[2] == (int16_t)args[2]) {
> -                tcg_out_insn(s, RI, MHI, args[0], args[2]);
> +            tcg_out_mov(s, TCG_TYPE_I32, a0, a1);

Should we consider a0 == a1 case here as well, in order to get rid of
this extra move when possible?

> +            if (a2 == (int16_t)a2) {
> +                tcg_out_insn(s, RI, MHI, a0, a2);
>              } else {
> -                tcg_out_insn(s, RIL, MSFI, args[0], args[2]);
> +                tcg_out_insn(s, RIL, MSFI, a0, a2);
>              }
> +        } else if (a0 == a1) {
> +            tcg_out_insn(s, RRE, MSR, a0, a2);
>          } else {
> -            tcg_out_insn(s, RRE, MSR, args[0], args[2]);
> +            tcg_out_insn(s, RRFa, MSRKC, a0, a1, a2);
>          }
>          break;
>  
> @@ -2405,14 +2411,18 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>          break;
>  
>      case INDEX_op_mul_i64:
> +        a0 = args[0], a1 = args[1], a2 = args[2];
>          if (const_args[2]) {
> -            if (args[2] == (int16_t)args[2]) {
> -                tcg_out_insn(s, RI, MGHI, args[0], args[2]);
> +            tcg_out_mov(s, TCG_TYPE_I64, a0, a1);

Same here.

> +            if (a2 == (int16_t)a2) {
> +                tcg_out_insn(s, RI, MGHI, a0, a2);
>              } else {
> -                tcg_out_insn(s, RIL, MSGFI, args[0], args[2]);
> +                tcg_out_insn(s, RIL, MSGFI, a0, a2);
>              }
> +        } else if (a0 == a1) {
> +            tcg_out_insn(s, RRE, MSGR, a0, a2);
>          } else {
> -            tcg_out_insn(s, RRE, MSGR, args[0], args[2]);
> +            tcg_out_insn(s, RRFa, MSGRKC, a0, a1, a2);
>          }
>          break;
>  
> @@ -3072,12 +3082,16 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
>             MULTIPLY SINGLE IMMEDIATE with a signed 32-bit, otherwise we
>             have only MULTIPLY HALFWORD IMMEDIATE, with a signed 16-bit.  */
>          return (HAVE_FACILITY(GEN_INST_EXT)
> -                ? C_O1_I2(r, 0, ri)
> +                ? (HAVE_FACILITY(MISC_INSN_EXT2)
> +                   ? C_O1_I2(r, r, ri)
> +                   : C_O1_I2(r, 0, ri))
>                  : C_O1_I2(r, 0, rI));
>  
>      case INDEX_op_mul_i64:
>          return (HAVE_FACILITY(GEN_INST_EXT)
> -                ? C_O1_I2(r, 0, rJ)
> +                ? (HAVE_FACILITY(MISC_INSN_EXT2)
> +                   ? C_O1_I2(r, r, rJ)
> +                   : C_O1_I2(r, 0, rJ))
>                  : C_O1_I2(r, 0, rI));
>  
>      case INDEX_op_shl_i32:
> -- 
> 2.34.1
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 07/13] tcg/s390x: Support MIE2 MGRK instruction
  2022-12-02  6:51 ` [PATCH v3 07/13] tcg/s390x: Support MIE2 MGRK instruction Richard Henderson
@ 2022-12-06 20:02   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 20:02 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:54PM -0800, Richard Henderson wrote:
> The MIE2 facility adds a 3-operand signed 64x64->128 multiply.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target-con-set.h | 1 +
>  tcg/s390x/tcg-target.h         | 2 +-
>  tcg/s390x/tcg-target.c.inc     | 8 ++++++++
>  3 files changed, 10 insertions(+), 1 deletion(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 08/13] tcg/s390x: Support MIE3 logical operations
  2022-12-02  6:51 ` [PATCH v3 08/13] tcg/s390x: Support MIE3 logical operations Richard Henderson
@ 2022-12-06 20:08   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 20:08 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:55PM -0800, Richard Henderson wrote:
> This is andc, orc, nand, nor, eqv.
> We can use nor for implementing not.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target-con-set.h |   1 +
>  tcg/s390x/tcg-target.h         |  25 +++++----
>  tcg/s390x/tcg-target.c.inc     | 100 +++++++++++++++++++++++++++++++++
>  3 files changed, 114 insertions(+), 12 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 09/13] tcg/s390x: Create tgen_cmp2 to simplify movcond
  2022-12-02  6:51 ` [PATCH v3 09/13] tcg/s390x: Create tgen_cmp2 to simplify movcond Richard Henderson
@ 2022-12-06 20:14   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 20:14 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:56PM -0800, Richard Henderson wrote:
> Return both regular and inverted condition codes from tgen_cmp2.
> This lets us choose after the fact which comparision we want.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target.c.inc | 25 +++++++++++++++++--------
>  1 file changed, 17 insertions(+), 8 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 06/13] tcg/s390x: Support MIE2 multiply single instructions
  2022-12-06 20:02   ` Ilya Leoshkevich
@ 2022-12-06 20:20     ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2022-12-06 20:20 UTC (permalink / raw)
  To: Ilya Leoshkevich; +Cc: qemu-devel@nongnu.org Developers

[-- Attachment #1: Type: text/plain, Size: 2566 bytes --]

On Tue, 6 Dec 2022, 14:02 Ilya Leoshkevich, <iii@linux.ibm.com> wrote:

> On Thu, Dec 01, 2022 at 10:51:53PM -0800, Richard Henderson wrote:
> > The MIE2 facility adds 3-operand versions of multiply.
> >
> > Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> > ---
> >  tcg/s390x/tcg-target-con-set.h |  1 +
> >  tcg/s390x/tcg-target.h         |  1 +
> >  tcg/s390x/tcg-target.c.inc     | 34 ++++++++++++++++++++++++----------
> >  3 files changed, 26 insertions(+), 10 deletions(-)
>
> Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
>
> I have one small suggestion, see below.
>
> > diff --git a/tcg/s390x/tcg-target-con-set.h
> b/tcg/s390x/tcg-target-con-set.h
> > index 00ba727b70..33a82e3286 100644
> > --- a/tcg/s390x/tcg-target-con-set.h
> > +++ b/tcg/s390x/tcg-target-con-set.h
> > @@ -23,6 +23,7 @@ C_O1_I2(r, 0, ri)
> >  C_O1_I2(r, 0, rI)
> >  C_O1_I2(r, 0, rJ)
> >  C_O1_I2(r, r, ri)
> > +C_O1_I2(r, r, rJ)
> >  C_O1_I2(r, rZ, r)
> >  C_O1_I2(v, v, r)
> >  C_O1_I2(v, v, v)
> > diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
> > index 645f522058..bfd623a639 100644
> > --- a/tcg/s390x/tcg-target.h
> > +++ b/tcg/s390x/tcg-target.h
> > @@ -63,6 +63,7 @@ typedef enum TCGReg {
> >  #define FACILITY_FAST_BCR_SER         FACILITY_LOAD_ON_COND
> >  #define FACILITY_DISTINCT_OPS         FACILITY_LOAD_ON_COND
> >  #define FACILITY_LOAD_ON_COND2        53
> > +#define FACILITY_MISC_INSN_EXT2       58
> >  #define FACILITY_VECTOR               129
> >  #define FACILITY_VECTOR_ENH1          135
> >
> > diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
> > index d02b433271..cd39b2a208 100644
> > --- a/tcg/s390x/tcg-target.c.inc
> > +++ b/tcg/s390x/tcg-target.c.inc
> > @@ -180,6 +180,8 @@ typedef enum S390Opcode {
> >      RRE_SLBGR   = 0xb989,
> >      RRE_XGR     = 0xb982,
> >
> > +    RRFa_MSRKC  = 0xb9fd,
> > +    RRFa_MSGRKC = 0xb9ed,
> >      RRFa_NRK    = 0xb9f4,
> >      RRFa_NGRK   = 0xb9e4,
> >      RRFa_ORK    = 0xb9f6,
> > @@ -2140,14 +2142,18 @@ static inline void tcg_out_op(TCGContext *s,
> TCGOpcode opc,
> >          break;
> >
> >      case INDEX_op_mul_i32:
> > +        a0 = args[0], a1 = args[1], a2 = (int32_t)args[2];
> >          if (const_args[2]) {
> > -            if ((int32_t)args[2] == (int16_t)args[2]) {
> > -                tcg_out_insn(s, RI, MHI, args[0], args[2]);
> > +            tcg_out_mov(s, TCG_TYPE_I32, a0, a1);
>
> Should we consider a0 == a1 case here as well, in order to get rid of
> this extra move when possible?
>

tcg_out_mov already does that.


r~

[-- Attachment #2: Type: text/html, Size: 3599 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 10/13] tcg/s390x: Generalize movcond implementation
  2022-12-02  6:51 ` [PATCH v3 10/13] tcg/s390x: Generalize movcond implementation Richard Henderson
@ 2022-12-06 20:39   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 20:39 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:57PM -0800, Richard Henderson wrote:
> Generalize movcond to support pre-computed conditions, and the same
> set of arguments at all times.  This will be assumed by a following
> patch, which needs to reuse tgen_movcond_int.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target-con-set.h |  3 +-
>  tcg/s390x/tcg-target.c.inc     | 78 ++++++++++++++++++++++++++--------
>  2 files changed, 61 insertions(+), 20 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 11/13] tcg/s390x: Support SELGR instruction in movcond
  2022-12-02  6:51 ` [PATCH v3 11/13] tcg/s390x: Support SELGR instruction in movcond Richard Henderson
@ 2022-12-06 20:41   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 20:41 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:58PM -0800, Richard Henderson wrote:
> The new select instruction provides two separate register inputs,
> whereas the old load-on-condition instruction overlaps one of the
> register inputs with the destination.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target.c.inc | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 12/13] tcg/s390x: Use tgen_movcond_int in tgen_clz
  2022-12-02  6:51 ` [PATCH v3 12/13] tcg/s390x: Use tgen_movcond_int in tgen_clz Richard Henderson
@ 2022-12-06 20:49   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 20:49 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:51:59PM -0800, Richard Henderson wrote:
> Reuse code from movcond to conditionally copy a2 to dest,
> based on the condition codes produced by FLOGR.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target-con-set.h |  1 +
>  tcg/s390x/tcg-target.c.inc     | 26 +++++++++++---------------
>  2 files changed, 12 insertions(+), 15 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 13/13] tcg/s390x: Implement ctpop operation
  2022-12-02  6:52 ` [PATCH v3 13/13] tcg/s390x: Implement ctpop operation Richard Henderson
@ 2022-12-06 21:10   ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-06 21:10 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Thu, Dec 01, 2022 at 10:52:00PM -0800, Richard Henderson wrote:
> There is an older form that produces per-byte results,
> and a newer form that produces per-register results,
> and a vector form that produces per-element results.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target.h     |  5 ++--
>  tcg/s390x/tcg-target.c.inc | 51 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+), 2 deletions(-)

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-06 19:29   ` Ilya Leoshkevich
@ 2022-12-06 22:22     ` Richard Henderson
  2022-12-07  0:42       ` Richard Henderson
                         ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Richard Henderson @ 2022-12-06 22:22 UTC (permalink / raw)
  To: Ilya Leoshkevich, qemu-devel; +Cc: thuth

On 12/6/22 13:29, Ilya Leoshkevich wrote:
> On Thu, Dec 01, 2022 at 10:51:49PM -0800, Richard Henderson wrote:
>> This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and
>> several follow-up patches.  The primary motivation is to reduce the
>> less-tested code paths, pre-z10.  Secondarily, this allows the
>> unconditional use of TCG_TARGET_HAS_direct_jump, which might be more
>> important for performance than any slight increase in code size.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   tcg/s390x/tcg-target.h     |   2 +-
>>   tcg/s390x/tcg-target.c.inc | 176 +++++--------------------------------
>>   2 files changed, 23 insertions(+), 155 deletions(-)
> 
> Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
> 
> I have a few questions/ideas for the future below.
>   
>> diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
>> index 22d70d431b..645f522058 100644
>> --- a/tcg/s390x/tcg-target.h
>> +++ b/tcg/s390x/tcg-target.h
>> @@ -103,7 +103,7 @@ extern uint64_t s390_facilities[3];
>>   #define TCG_TARGET_HAS_mulsh_i32      0
>>   #define TCG_TARGET_HAS_extrl_i64_i32  0
>>   #define TCG_TARGET_HAS_extrh_i64_i32  0
>> -#define TCG_TARGET_HAS_direct_jump    HAVE_FACILITY(GEN_INST_EXT)
>> +#define TCG_TARGET_HAS_direct_jump    1
> 
> This change doesn't seem to affect that, but what is the minimum
> supported s390x qemu host? z900?

Possibly z990, if I'm reading the gcc processor_flags_table[] correctly; 
long-displacement-facility is definitely a minimum.

We probably should revisit what the minimum for TCG should be, assert those features at 
startup, and drop the corresponding runtime tests.

> I did some benchmarking of various ways to load constants in context of
> GCC in the past, and it turned out that LLIHF+OILF is more efficient
> than literal pool [1].

Interesting.  If we include extended-immediate-facility (base_GEN9_GA1, z9-109?) in the 
bare minimum that would definitely simplify a few things.

>> -    /* Use the constant pool if USE_REG_TB, but not for small constants.  */
>> -    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
>> -        if (type == TCG_TYPE_I32) {
>> -            tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
>> -        } else {
>> -            tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
>> -        }
>> -    } else if (USE_REG_TB) {
>> -        tcg_out_insn(s, RXY, XG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
>> -        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
>> -                       tcg_tbrel_diff(s, NULL));
>> +    tcg_out_movi(s, type, TCG_TMP0, val);
>> +    if (type == TCG_TYPE_I32) {
>> +        tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
>>       } else {
>> -        /* Perform the xor by parts.  */
>> -        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
>> -        if (val & 0xffffffff) {
>> -            tcg_out_insn(s, RIL, XILF, dest, val);
>> -        }
>> -        if (val > 0xffffffff) {
>> -            tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
>> -        }
>> +        tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
>>       }
>>   }
> 
> Wouldn't it be worth keeping XILF/XIFH here?

I don't know.  It's difficult for me to guess whether a dependency chain like

     val -> xor -> xor

(3 insns with serial dependencies) is better than

     val   --> xor
     load  -/

(3 insns, but only one serial dependency) is better.  But there may also be instruction 
fusion going on at the micro-architectural level, so that there's really only one xor.

If you have suggestions, I'm all ears.

> I don't have any numbers right now, but it looks more compact/efficient
> than a load + XGR.

If we assume general-instruction-extension-facility (z10?), LGRL + XGR is smaller than 
XILF + XIFH, ignoring the constant pool entry which might be shared, and modulo the µarch 
questions above.


> Same for OGR above; I even wonder if both implementations could be
> unified.

Sadly not, because of OILL et al.  There are no 16-bit xor immediate insns.

>> +        /*
>> +         * branch displacement must be aligned for atomic patching;
>> +         * see if we need to add extra nop before branch
>> +         */
>> +        if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
>> +            tcg_out16(s, NOP);
>>           }
>> +        tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
>> +        s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
>> +        tcg_out32(s, 0);
>>           set_jmp_reset_offset(s, a0);
> 
> This seems to work in practice, but I don't think patching branch
> offsets is allowed by PoP (in a multi-threaded environment). For
> example, I had to do [2] in order to work around this limitation in
> ftrace.

Really?  How does the processor distinguish between overwriting opcode/condition vs 
overwriting immediate operand when invalidating cached instructions?

If overwriting operand truly isn't correct, then I think we have to use indirect branch 
always for goto_tb.

> A third benefit seems to be that we now have one more register to
> allocate.

Yes.  It's call clobbered, so it isn't live so often, but sometimes.


r~


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-06 22:22     ` Richard Henderson
@ 2022-12-07  0:42       ` Richard Henderson
  2022-12-08 14:04         ` Ilya Leoshkevich
  2022-12-07  7:45       ` Thomas Huth
  2022-12-07 22:09       ` Ilya Leoshkevich
  2 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-07  0:42 UTC (permalink / raw)
  To: Ilya Leoshkevich, qemu-devel; +Cc: thuth

On 12/6/22 16:22, Richard Henderson wrote:
>> Wouldn't it be worth keeping XILF/XIFH here?
> 
> I don't know.  It's difficult for me to guess whether a dependency chain like
> 
>      val -> xor -> xor
> 
> (3 insns with serial dependencies) is better than
> 
>      val   --> xor
>      load  -/
> 
> (3 insns, but only one serial dependency) is better.  But there may also be instruction 
> fusion going on at the micro-architectural level, so that there's really only one xor.
> 
> If you have suggestions, I'm all ears.

Related microarchitectural question:

If a 32-bit insn and a 64-bit insn have a similar size encoding (and perhaps even if they 
don't), is it better to produce a 64-bit output so that the hw doesn't have a false 
dependency on the upper 32-bits of the register?

Just wondering whether most of the distinction between 32-bit and 64-bit opcodes ought to 
be discarded, simplifying code generation.  The only items that seem most likely to have 
real execution time differences are multiply and divide.


r~


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-06 22:22     ` Richard Henderson
  2022-12-07  0:42       ` Richard Henderson
@ 2022-12-07  7:45       ` Thomas Huth
  2022-12-07 14:55         ` Richard Henderson
  2022-12-07 22:09       ` Ilya Leoshkevich
  2 siblings, 1 reply; 36+ messages in thread
From: Thomas Huth @ 2022-12-07  7:45 UTC (permalink / raw)
  To: Richard Henderson, Ilya Leoshkevich, qemu-devel

On 06/12/2022 23.22, Richard Henderson wrote:
> On 12/6/22 13:29, Ilya Leoshkevich wrote:
>> On Thu, Dec 01, 2022 at 10:51:49PM -0800, Richard Henderson wrote:
>>> This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and
>>> several follow-up patches.  The primary motivation is to reduce the
>>> less-tested code paths, pre-z10.  Secondarily, this allows the
>>> unconditional use of TCG_TARGET_HAS_direct_jump, which might be more
>>> important for performance than any slight increase in code size.
>>>
>>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>>> ---
>>>   tcg/s390x/tcg-target.h     |   2 +-
>>>   tcg/s390x/tcg-target.c.inc | 176 +++++--------------------------------
>>>   2 files changed, 23 insertions(+), 155 deletions(-)
>>
>> Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
>>
>> I have a few questions/ideas for the future below.
>>> diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
>>> index 22d70d431b..645f522058 100644
>>> --- a/tcg/s390x/tcg-target.h
>>> +++ b/tcg/s390x/tcg-target.h
>>> @@ -103,7 +103,7 @@ extern uint64_t s390_facilities[3];
>>>   #define TCG_TARGET_HAS_mulsh_i32      0
>>>   #define TCG_TARGET_HAS_extrl_i64_i32  0
>>>   #define TCG_TARGET_HAS_extrh_i64_i32  0
>>> -#define TCG_TARGET_HAS_direct_jump    HAVE_FACILITY(GEN_INST_EXT)
>>> +#define TCG_TARGET_HAS_direct_jump    1
>>
>> This change doesn't seem to affect that, but what is the minimum
>> supported s390x qemu host? z900?
> 
> Possibly z990, if I'm reading the gcc processor_flags_table[] correctly; 
> long-displacement-facility is definitely a minimum.
> 
> We probably should revisit what the minimum for TCG should be, assert those 
> features at startup, and drop the corresponding runtime tests.

If we consider the official IBM support statement:

https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Mainframe%20Life%20Cycle%20History%20V2.10%20-%20Sept%2013%202022_1.pdf

... that would mean that the z10 and all older machines are not supported 
anymore.

  Thomas



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-07  7:45       ` Thomas Huth
@ 2022-12-07 14:55         ` Richard Henderson
  2022-12-07 20:40           ` Ilya Leoshkevich
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2022-12-07 14:55 UTC (permalink / raw)
  To: Thomas Huth, Ilya Leoshkevich, qemu-devel

On 12/7/22 01:45, Thomas Huth wrote:
> On 06/12/2022 23.22, Richard Henderson wrote:
>> On 12/6/22 13:29, Ilya Leoshkevich wrote:
>>> This change doesn't seem to affect that, but what is the minimum
>>> supported s390x qemu host? z900?
>>
>> Possibly z990, if I'm reading the gcc processor_flags_table[] correctly; 
>> long-displacement-facility is definitely a minimum.
>>
>> We probably should revisit what the minimum for TCG should be, assert those features at 
>> startup, and drop the corresponding runtime tests.
> 
> If we consider the official IBM support statement:
> 
> https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Mainframe%20Life%20Cycle%20History%20V2.10%20-%20Sept%2013%202022_1.pdf
> 
> ... that would mean that the z10 and all older machines are not supported anymore.

Thanks for the pointer.  It would appear that z114 exits support at the end of this month, 
which would leave z12 as minimum supported cpu.

Even assuming z196 gets us extended-immediate, general-insn-extension, load-on-condition, 
and distinct-operands, which are all quite important to TCG, and constitute almost all of 
the current runtime checks.

The other metric would be matching the set of supported cpus from the set of supported os 
distributions, but I would be ready to believe z196 is below the minimum there too.


r~


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-07 14:55         ` Richard Henderson
@ 2022-12-07 20:40           ` Ilya Leoshkevich
  2022-12-07 21:20             ` Christian Borntraeger
  0 siblings, 1 reply; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-07 20:40 UTC (permalink / raw)
  To: Richard Henderson, Thomas Huth, qemu-devel, Christian Borntraeger

On Wed, 2022-12-07 at 08:55 -0600, Richard Henderson wrote:
> On 12/7/22 01:45, Thomas Huth wrote:
> > On 06/12/2022 23.22, Richard Henderson wrote:
> > > On 12/6/22 13:29, Ilya Leoshkevich wrote:
> > > > This change doesn't seem to affect that, but what is the
> > > > minimum
> > > > supported s390x qemu host? z900?
> > > 
> > > Possibly z990, if I'm reading the gcc processor_flags_table[]
> > > correctly; 
> > > long-displacement-facility is definitely a minimum.
> > > 
> > > We probably should revisit what the minimum for TCG should be,
> > > assert those features at 
> > > startup, and drop the corresponding runtime tests.
> > 
> > If we consider the official IBM support statement:
> > 
> > https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Mainframe%20Life%20Cycle%20History%20V2.10%20-%20Sept%2013%202022_1.pdf
> > 
> > ... that would mean that the z10 and all older machines are not
> > supported anymore.
> 
> Thanks for the pointer.  It would appear that z114 exits support at
> the end of this month, 
> which would leave z12 as minimum supported cpu.
> 
> Even assuming z196 gets us extended-immediate, general-insn-
> extension, load-on-condition, 
> and distinct-operands, which are all quite important to TCG, and
> constitute almost all of 
> the current runtime checks.
> 
> The other metric would be matching the set of supported cpus from the
> set of supported os 
> distributions, but I would be ready to believe z196 is below the
> minimum there too.
> 
> 
> r~

I think it should be safe to raise the minimum required hardware for
TCG to z196:

* The oldest supported RHEL is v7, it requires z196:

https://access.redhat.com/product-life-cycles/
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/installation_guide/chap-installation-planning-s390

* The oldest supported SLES is v12, it requires z196:

https://www.suse.com/de-de/lifecycle/
https://documentation.suse.com/sles/12-SP5/html/SLES-all/cha-zseries.html

* The oldest supported Ubuntu is v16.04, it requires zEC12+:
https://wiki.ubuntu.com/S390X

Best regards,
Ilya


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-07 20:40           ` Ilya Leoshkevich
@ 2022-12-07 21:20             ` Christian Borntraeger
  0 siblings, 0 replies; 36+ messages in thread
From: Christian Borntraeger @ 2022-12-07 21:20 UTC (permalink / raw)
  To: Ilya Leoshkevich, Richard Henderson, Thomas Huth, qemu-devel



Am 07.12.22 um 21:40 schrieb Ilya Leoshkevich:
> On Wed, 2022-12-07 at 08:55 -0600, Richard Henderson wrote:
>> On 12/7/22 01:45, Thomas Huth wrote:
>>> On 06/12/2022 23.22, Richard Henderson wrote:
>>>> On 12/6/22 13:29, Ilya Leoshkevich wrote:
>>>>> This change doesn't seem to affect that, but what is the
>>>>> minimum
>>>>> supported s390x qemu host? z900?
>>>>
>>>> Possibly z990, if I'm reading the gcc processor_flags_table[]
>>>> correctly;
>>>> long-displacement-facility is definitely a minimum.
>>>>
>>>> We probably should revisit what the minimum for TCG should be,
>>>> assert those features at
>>>> startup, and drop the corresponding runtime tests.
>>>
>>> If we consider the official IBM support statement:
>>>
>>> https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Mainframe%20Life%20Cycle%20History%20V2.10%20-%20Sept%2013%202022_1.pdf
>>>
>>> ... that would mean that the z10 and all older machines are not
>>> supported anymore.
>>
>> Thanks for the pointer.  It would appear that z114 exits support at
>> the end of this month,
>> which would leave z12 as minimum supported cpu.
>>
>> Even assuming z196 gets us extended-immediate, general-insn-
>> extension, load-on-condition,
>> and distinct-operands, which are all quite important to TCG, and
>> constitute almost all of
>> the current runtime checks.
>>
>> The other metric would be matching the set of supported cpus from the
>> set of supported os
>> distributions, but I would be ready to believe z196 is below the
>> minimum there too.
>>
>>
>> r~
> 
> I think it should be safe to raise the minimum required hardware for
> TCG to z196:

We recently raised the minimum hardware for the Linux kernel to be z10, so that would be super-safe, but z196 is certainly a sane minimum.
> 
> * The oldest supported RHEL is v7, it requires z196:
> 
> https://access.redhat.com/product-life-cycles/
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/installation_guide/chap-installation-planning-s390
> 
> * The oldest supported SLES is v12, it requires z196:
> 
> https://www.suse.com/de-de/lifecycle/
> https://documentation.suse.com/sles/12-SP5/html/SLES-all/cha-zseries.html
> 
> * The oldest supported Ubuntu is v16.04, it requires zEC12+:
> https://wiki.ubuntu.com/S390X
> 
> Best regards,
> Ilya


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-06 22:22     ` Richard Henderson
  2022-12-07  0:42       ` Richard Henderson
  2022-12-07  7:45       ` Thomas Huth
@ 2022-12-07 22:09       ` Ilya Leoshkevich
  2 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-07 22:09 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Tue, 2022-12-06 at 16:22 -0600, Richard Henderson wrote:
> On 12/6/22 13:29, Ilya Leoshkevich wrote:
> > On Thu, Dec 01, 2022 at 10:51:49PM -0800, Richard Henderson wrote:
> > > This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and
> > > several follow-up patches.  The primary motivation is to reduce
> > > the
> > > less-tested code paths, pre-z10.  Secondarily, this allows the
> > > unconditional use of TCG_TARGET_HAS_direct_jump, which might be
> > > more
> > > important for performance than any slight increase in code size.
> > > 
> > > Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> > > ---
> > >   tcg/s390x/tcg-target.h     |   2 +-
> > >   tcg/s390x/tcg-target.c.inc | 176 +++++-------------------------
> > > -------
> > >   2 files changed, 23 insertions(+), 155 deletions(-)
> > 
> > Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
> > 
> > I have a few questions/ideas for the future below.  
> > > 
...

> > 
> > > -    /* Use the constant pool if USE_REG_TB, but not for small
> > > constants.  */
> > > -    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
> > > -        if (type == TCG_TYPE_I32) {
> > > -            tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
> > > -        } else {
> > > -            tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
> > > -        }
> > > -    } else if (USE_REG_TB) {
> > > -        tcg_out_insn(s, RXY, XG, dest, TCG_REG_TB, TCG_REG_NONE,
> > > 0);
> > > -        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
> > > -                       tcg_tbrel_diff(s, NULL));
> > > +    tcg_out_movi(s, type, TCG_TMP0, val);
> > > +    if (type == TCG_TYPE_I32) {
> > > +        tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
> > >       } else {
> > > -        /* Perform the xor by parts.  */
> > > -        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
> > > -        if (val & 0xffffffff) {
> > > -            tcg_out_insn(s, RIL, XILF, dest, val);
> > > -        }
> > > -        if (val > 0xffffffff) {
> > > -            tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
> > > -        }
> > > +        tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
> > >       }
> > >   }
> > 
> > Wouldn't it be worth keeping XILF/XIFH here?
> 
> I don't know.  It's difficult for me to guess whether a dependency
> chain like
> 
>      val -> xor -> xor
> 
> (3 insns with serial dependencies) is better than
> 
>      val   --> xor
>      load  -/
> 
> (3 insns, but only one serial dependency) is better.  But there may
> also be instruction 
> fusion going on at the micro-architectural level, so that there's
> really only one xor.
> 
> If you have suggestions, I'm all ears.

I ran some experiments, and it turned out you were right: xilf+xihf is
slower exactly because of dependency chains.
So no need to change anything here and sorry for the noise.

> > I don't have any numbers right now, but it looks more
> > compact/efficient
> > than a load + XGR.
> 
> If we assume general-instruction-extension-facility (z10?), LGRL +
> XGR is smaller than 
> XILF + XIFH, ignoring the constant pool entry which might be shared,
> and modulo the µarch 
> questions above.
> 
> 
> > Same for OGR above; I even wonder if both implementations could be
> > unified.
> 
> Sadly not, because of OILL et al.  There are no 16-bit xor immediate
> insns.
> 
> > > +        /*
> > > +         * branch displacement must be aligned for atomic
> > > patching;
> > > +         * see if we need to add extra nop before branch
> > > +         */
> > > +        if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
> > > +            tcg_out16(s, NOP);
> > >           }
> > > +        tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
> > > +        s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
> > > +        tcg_out32(s, 0);
> > >           set_jmp_reset_offset(s, a0);
> > 
> > This seems to work in practice, but I don't think patching branch
> > offsets is allowed by PoP (in a multi-threaded environment). For
> > example, I had to do [2] in order to work around this limitation in
> > ftrace.
> 
> Really?  How does the processor distinguish between overwriting
> opcode/condition vs 
> overwriting immediate operand when invalidating cached instructions?

The problem is that nothing in PoP prevents a CPU from fetching
instruction bytes one by one, in random order and more than one time.
It's not that this is necessarily going to happen, rather, it's just
that this has never been verified for all instructions and formally
stated. The observation that I use in the ftrace patch is not
formalized either, but it's specific to a single instruction and should
hold in practice for the existing hardware to the best of my knowledge.

> If overwriting operand truly isn't correct, then I think we have to
> use indirect branch 
> always for goto_tb.
> 
> > A third benefit seems to be that we now have one more register to
> > allocate.
> 
> Yes.  It's call clobbered, so it isn't live so often, but sometimes.
> 
> 
> r~



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
  2022-12-07  0:42       ` Richard Henderson
@ 2022-12-08 14:04         ` Ilya Leoshkevich
  0 siblings, 0 replies; 36+ messages in thread
From: Ilya Leoshkevich @ 2022-12-08 14:04 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: thuth

On Tue, 2022-12-06 at 18:42 -0600, Richard Henderson wrote:
> On 12/6/22 16:22, Richard Henderson wrote:
> > > Wouldn't it be worth keeping XILF/XIFH here?
> > 
> > I don't know.  It's difficult for me to guess whether a dependency
> > chain like
> > 
> >      val -> xor -> xor
> > 
> > (3 insns with serial dependencies) is better than
> > 
> >      val   --> xor
> >      load  -/
> > 
> > (3 insns, but only one serial dependency) is better.  But there may
> > also be instruction 
> > fusion going on at the micro-architectural level, so that there's
> > really only one xor.
> > 
> > If you have suggestions, I'm all ears.
> 
> Related microarchitectural question:
> 
> If a 32-bit insn and a 64-bit insn have a similar size encoding (and
> perhaps even if they 
> don't), is it better to produce a 64-bit output so that the hw
> doesn't have a false 
> dependency on the upper 32-bits of the register?
> 
> Just wondering whether most of the distinction between 32-bit and 64-
> bit opcodes ought to 
> be discarded, simplifying code generation.  The only items that seem
> most likely to have 
> real execution time differences are multiply and divide.

I think this will definitely lead to false dependencies if one produces
32 bits in one place, and then consumes 64 in the other. But if this
idea is applied consistently, then there should be hopefully not so
many such instances?

Two thing come to mind here: memory and CC generation.

The first is probably not very important: we can implement 32-bit loads
with lgf, which sign-extends to 64 bits.

CC generation can be tricker: for conditional jumps it's still going to
be okay, since the code would produce 64-bit values and consume 32-bit
ones, but if the back-end needs a CC from a 32-bit addition, then we
would probably need to sign-extend the result in order to get rid of a
false dependency later on. However, based on a quick inspection I could
not find any such instances.

So using 64-bit operations instead of 32-bit ones would be an
interesting experiment.

Best regards,
Ilya


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2022-12-08 14:05 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-02  6:51 [PATCH v3 00/13] tcg/s390x: misc patches Richard Henderson
2022-12-02  6:51 ` [PATCH v3 01/13] tcg/s390x: Use register pair allocation for div and mulu2 Richard Henderson
2022-12-06 15:49   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB Richard Henderson
2022-12-06 19:29   ` Ilya Leoshkevich
2022-12-06 22:22     ` Richard Henderson
2022-12-07  0:42       ` Richard Henderson
2022-12-08 14:04         ` Ilya Leoshkevich
2022-12-07  7:45       ` Thomas Huth
2022-12-07 14:55         ` Richard Henderson
2022-12-07 20:40           ` Ilya Leoshkevich
2022-12-07 21:20             ` Christian Borntraeger
2022-12-07 22:09       ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 03/13] tcg/s390x: Use LARL+AGHI for odd addresses Richard Henderson
2022-12-06 19:42   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 04/13] tcg/s390x: Distinguish RRF-a and RRF-c formats Richard Henderson
2022-12-06 19:45   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 05/13] tcg/s390x: Distinguish RIE formats Richard Henderson
2022-12-06 19:47   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 06/13] tcg/s390x: Support MIE2 multiply single instructions Richard Henderson
2022-12-06 20:02   ` Ilya Leoshkevich
2022-12-06 20:20     ` Richard Henderson
2022-12-02  6:51 ` [PATCH v3 07/13] tcg/s390x: Support MIE2 MGRK instruction Richard Henderson
2022-12-06 20:02   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 08/13] tcg/s390x: Support MIE3 logical operations Richard Henderson
2022-12-06 20:08   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 09/13] tcg/s390x: Create tgen_cmp2 to simplify movcond Richard Henderson
2022-12-06 20:14   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 10/13] tcg/s390x: Generalize movcond implementation Richard Henderson
2022-12-06 20:39   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 11/13] tcg/s390x: Support SELGR instruction in movcond Richard Henderson
2022-12-06 20:41   ` Ilya Leoshkevich
2022-12-02  6:51 ` [PATCH v3 12/13] tcg/s390x: Use tgen_movcond_int in tgen_clz Richard Henderson
2022-12-06 20:49   ` Ilya Leoshkevich
2022-12-02  6:52 ` [PATCH v3 13/13] tcg/s390x: Implement ctpop operation Richard Henderson
2022-12-06 21:10   ` Ilya Leoshkevich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.