* [PATCH v4 0/5] tcg/sparc: Unaligned access for user-only
@ 2022-02-04  7:00 Richard Henderson
  2022-02-04  7:00 ` [PATCH v4 1/5] tcg/sparc: Add scratch argument to tcg_out_movi_int Richard Henderson
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Richard Henderson @ 2022-02-04  7:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.bennee

Changes from v3:
  * Rebase on master, two patches merged.


r~


Richard Henderson (5):
  tcg/sparc: Add scratch argument to tcg_out_movi_int
  tcg/sparc: Improve code gen for shifted 32-bit constants
  tcg/sparc: Use the constant pool for 64-bit constants
  tcg/sparc: Add tcg_out_jmpl_const for better tail calls
  tcg/sparc: Support unaligned access for user-only

 tcg/sparc/tcg-target.c.inc | 390 +++++++++++++++++++++++++++++++++----
 1 file changed, 354 insertions(+), 36 deletions(-)

-- 
2.25.1




* [PATCH v4 1/5] tcg/sparc: Add scratch argument to tcg_out_movi_int
  2022-02-04  7:00 [PATCH v4 0/5] tcg/sparc: Unaligned access for user-only Richard Henderson
@ 2022-02-04  7:00 ` Richard Henderson
  2022-02-04 17:35   ` Peter Maydell
  2022-02-04  7:00 ` [PATCH v4 2/5] tcg/sparc: Improve code gen for shifted 32-bit constants Richard Henderson
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Richard Henderson @ 2022-02-04  7:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.bennee

This will allow us to control exactly what scratch register is
used for loading the constant.  Also, fix a theoretical problem
in recursing through tcg_out_movi, which may provide a different
value for in_prologue.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc/tcg-target.c.inc | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 0c062c60eb..7e3758b798 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -414,7 +414,8 @@ static void tcg_out_movi_imm13(TCGContext *s, TCGReg ret, int32_t arg)
 }
 
 static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
-                             tcg_target_long arg, bool in_prologue)
+                             tcg_target_long arg, bool in_prologue,
+                             TCGReg scratch)
 {
     tcg_target_long hi, lo = (int32_t)arg;
     tcg_target_long test, lsb;
@@ -471,22 +472,24 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
     /* A 64-bit constant decomposed into 2 32-bit pieces.  */
     if (check_fit_i32(lo, 13)) {
         hi = (arg - lo) >> 32;
-        tcg_out_movi(s, TCG_TYPE_I32, ret, hi);
+        tcg_out_movi_int(s, TCG_TYPE_I32, ret, hi, in_prologue, scratch);
         tcg_out_arithi(s, ret, ret, 32, SHIFT_SLLX);
         tcg_out_arithi(s, ret, ret, lo, ARITH_ADD);
     } else {
+        tcg_debug_assert(scratch != TCG_REG_G0);
         hi = arg >> 32;
-        tcg_out_movi(s, TCG_TYPE_I32, ret, hi);
-        tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_T2, lo);
+        tcg_out_movi_int(s, TCG_TYPE_I32, ret, hi, in_prologue, scratch);
+        tcg_out_movi_int(s, TCG_TYPE_I32, scratch, lo, in_prologue, TCG_REG_G0);
         tcg_out_arithi(s, ret, ret, 32, SHIFT_SLLX);
-        tcg_out_arith(s, ret, ret, TCG_REG_T2, ARITH_OR);
+        tcg_out_arith(s, ret, ret, scratch, ARITH_OR);
     }
 }
 
 static void tcg_out_movi(TCGContext *s, TCGType type,
                          TCGReg ret, tcg_target_long arg)
 {
-    tcg_out_movi_int(s, type, ret, arg, false);
+    tcg_debug_assert(ret != TCG_REG_T2);
+    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);
 }
 
 static void tcg_out_ldst_rr(TCGContext *s, TCGReg data, TCGReg a1,
@@ -837,7 +840,7 @@ static void tcg_out_call_nodelay(TCGContext *s, const tcg_insn_unit *dest,
     } else {
         uintptr_t desti = (uintptr_t)dest;
         tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_REG_T1,
-                         desti & ~0xfff, in_prologue);
+                         desti & ~0xfff, in_prologue, TCG_REG_O7);
         tcg_out_arithi(s, TCG_REG_O7, TCG_REG_T1, desti & 0xfff, JMPL);
     }
 }
@@ -1013,7 +1016,8 @@ static void tcg_target_qemu_prologue(TCGContext *s)
 
 #ifndef CONFIG_SOFTMMU
     if (guest_base != 0) {
-        tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_GUEST_BASE_REG, guest_base, true);
+        tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_GUEST_BASE_REG, guest_base,
+                         true, TCG_REG_T1);
         tcg_regset_set_reg(s->reserved_regs, TCG_GUEST_BASE_REG);
     }
 #endif
-- 
2.25.1




* [PATCH v4 2/5] tcg/sparc: Improve code gen for shifted 32-bit constants
  2022-02-04  7:00 [PATCH v4 0/5] tcg/sparc: Unaligned access for user-only Richard Henderson
  2022-02-04  7:00 ` [PATCH v4 1/5] tcg/sparc: Add scratch argument to tcg_out_movi_int Richard Henderson
@ 2022-02-04  7:00 ` Richard Henderson
  2022-02-04  7:56   ` Philippe Mathieu-Daudé via
  2022-02-04 17:39   ` Peter Maydell
  2022-02-04  7:00 ` [PATCH v4 3/5] tcg/sparc: Use the constant pool for 64-bit constants Richard Henderson
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Richard Henderson @ 2022-02-04  7:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.bennee

We had code for checking for 13 and 21-bit shifted constants,
but we can do better and allow 32-bit shifted constants.
This is still 2 insns shorter than the full 64-bit sequence.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc/tcg-target.c.inc | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 7e3758b798..6349f750cc 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -456,17 +456,17 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
         return;
     }
 
-    /* A 21-bit constant, shifted.  */
+    /* A 32-bit constant, shifted.  */
     lsb = ctz64(arg);
     test = (tcg_target_long)arg >> lsb;
-    if (check_fit_tl(test, 13)) {
-        tcg_out_movi_imm13(s, ret, test);
-        tcg_out_arithi(s, ret, ret, lsb, SHIFT_SLLX);
-        return;
-    } else if (lsb > 10 && test == extract64(test, 0, 21)) {
+    if (lsb > 10 && test == extract64(test, 0, 21)) {
         tcg_out_sethi(s, ret, test << 10);
         tcg_out_arithi(s, ret, ret, lsb - 10, SHIFT_SLLX);
         return;
+    } else if (test == (uint32_t)test || test == (int32_t)test) {
+        tcg_out_movi_int(s, TCG_TYPE_I64, ret, test, in_prologue, scratch);
+        tcg_out_arithi(s, ret, ret, lsb, SHIFT_SLLX);
+        return;
     }
 
     /* A 64-bit constant decomposed into 2 32-bit pieces.  */
-- 
2.25.1




* [PATCH v4 3/5] tcg/sparc: Use the constant pool for 64-bit constants
  2022-02-04  7:00 [PATCH v4 0/5] tcg/sparc: Unaligned access for user-only Richard Henderson
  2022-02-04  7:00 ` [PATCH v4 1/5] tcg/sparc: Add scratch argument to tcg_out_movi_int Richard Henderson
  2022-02-04  7:00 ` [PATCH v4 2/5] tcg/sparc: Improve code gen for shifted 32-bit constants Richard Henderson
@ 2022-02-04  7:00 ` Richard Henderson
  2022-02-04 18:18   ` Peter Maydell
  2022-02-04  7:00 ` [PATCH v4 4/5] tcg/sparc: Add tcg_out_jmpl_const for better tail calls Richard Henderson
  2022-02-04  7:00 ` [PATCH v4 5/5] tcg/sparc: Support unaligned access for user-only Richard Henderson
  4 siblings, 1 reply; 17+ messages in thread
From: Richard Henderson @ 2022-02-04  7:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.bennee

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc/tcg-target.c.inc | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 6349f750cc..47bdf314a0 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -332,6 +332,13 @@ static bool patch_reloc(tcg_insn_unit *src_rw, int type,
         insn &= ~INSN_OFF19(-1);
         insn |= INSN_OFF19(pcrel);
         break;
+    case R_SPARC_13:
+        if (!check_fit_ptr(value, 13)) {
+            return false;
+        }
+        insn &= ~INSN_IMM13(-1);
+        insn |= INSN_IMM13(value);
+        break;
     default:
         g_assert_not_reached();
     }
@@ -469,6 +476,14 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
         return;
     }
 
+    /* Use the constant pool, if possible. */
+    if (!in_prologue && USE_REG_TB) {
+        new_pool_label(s, arg, R_SPARC_13, s->code_ptr,
+                       tcg_tbrel_diff(s, NULL));
+        tcg_out32(s, LDX | INSN_RD(ret) | INSN_RS1(TCG_REG_TB));
+        return;
+    }
+
     /* A 64-bit constant decomposed into 2 32-bit pieces.  */
     if (check_fit_i32(lo, 13)) {
         hi = (arg - lo) >> 32;
-- 
2.25.1




* [PATCH v4 4/5] tcg/sparc: Add tcg_out_jmpl_const for better tail calls
  2022-02-04  7:00 [PATCH v4 0/5] tcg/sparc: Unaligned access for user-only Richard Henderson
                   ` (2 preceding siblings ...)
  2022-02-04  7:00 ` [PATCH v4 3/5] tcg/sparc: Use the constant pool for 64-bit constants Richard Henderson
@ 2022-02-04  7:00 ` Richard Henderson
  2022-02-04 18:34   ` Peter Maydell
  2022-02-04  7:00 ` [PATCH v4 5/5] tcg/sparc: Support unaligned access for user-only Richard Henderson
  4 siblings, 1 reply; 17+ messages in thread
From: Richard Henderson @ 2022-02-04  7:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.bennee

Due to mapping changes, we now rarely place the code_gen_buffer
near the main executable.  Which means that direct calls will
now rarely be in range.

So, always use indirect calls for tail calls, which allows us to
avoid clobbering %o7, and therefore we need not save and restore it.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc/tcg-target.c.inc | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 47bdf314a0..82a9d88816 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -845,6 +845,19 @@ static void tcg_out_addsub2_i64(TCGContext *s, TCGReg rl, TCGReg rh,
     tcg_out_mov(s, TCG_TYPE_I64, rl, tmp);
 }
 
+static void tcg_out_jmpl_const(TCGContext *s, const tcg_insn_unit *dest,
+                               bool in_prologue, bool tail_call)
+{
+    uintptr_t desti = (uintptr_t)dest;
+
+    /* Be careful not to clobber %o7 for a tail call. */
+    tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_REG_T1,
+                     desti & ~0xfff, in_prologue,
+                     tail_call ? TCG_REG_G2 : TCG_REG_O7);
+    tcg_out_arithi(s, tail_call ? TCG_REG_G0 : TCG_REG_O7,
+                   TCG_REG_T1, desti & 0xfff, JMPL);
+}
+
 static void tcg_out_call_nodelay(TCGContext *s, const tcg_insn_unit *dest,
                                  bool in_prologue)
 {
@@ -853,10 +866,7 @@ static void tcg_out_call_nodelay(TCGContext *s, const tcg_insn_unit *dest,
     if (disp == (int32_t)disp) {
         tcg_out32(s, CALL | (uint32_t)disp >> 2);
     } else {
-        uintptr_t desti = (uintptr_t)dest;
-        tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_REG_T1,
-                         desti & ~0xfff, in_prologue, TCG_REG_O7);
-        tcg_out_arithi(s, TCG_REG_O7, TCG_REG_T1, desti & 0xfff, JMPL);
+        tcg_out_jmpl_const(s, dest, in_prologue, false);
     }
 }
 
@@ -947,11 +957,10 @@ static void build_trampolines(TCGContext *s)
 
         /* Set the retaddr operand.  */
         tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
-        /* Set the env operand.  */
-        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O0, TCG_AREG0);
         /* Tail call.  */
-        tcg_out_call_nodelay(s, qemu_ld_helpers[i], true);
-        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O7, ra);
+        tcg_out_jmpl_const(s, qemu_ld_helpers[i], true, true);
+        /* delay slot -- set the env argument */
+        tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
     }
 
     for (i = 0; i < ARRAY_SIZE(qemu_st_helpers); ++i) {
@@ -993,14 +1002,14 @@ static void build_trampolines(TCGContext *s)
         if (ra >= TCG_REG_O6) {
             tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_O7, TCG_REG_CALL_STACK,
                        TCG_TARGET_CALL_STACK_OFFSET);
-            ra = TCG_REG_G1;
+        } else {
+            tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
         }
-        tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
-        /* Set the env operand.  */
-        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O0, TCG_AREG0);
+
         /* Tail call.  */
-        tcg_out_call_nodelay(s, qemu_st_helpers[i], true);
-        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O7, ra);
+        tcg_out_jmpl_const(s, qemu_st_helpers[i], true, true);
+        /* delay slot -- set the env argument */
+        tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
     }
 }
 #endif
-- 
2.25.1




* [PATCH v4 5/5] tcg/sparc: Support unaligned access for user-only
  2022-02-04  7:00 [PATCH v4 0/5] tcg/sparc: Unaligned access for user-only Richard Henderson
                   ` (3 preceding siblings ...)
  2022-02-04  7:00 ` [PATCH v4 4/5] tcg/sparc: Add tcg_out_jmpl_const for better tail calls Richard Henderson
@ 2022-02-04  7:00 ` Richard Henderson
  2022-02-04 19:07   ` Peter Maydell
  4 siblings, 1 reply; 17+ messages in thread
From: Richard Henderson @ 2022-02-04  7:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.bennee

This is kinda sorta the opposite of the other tcg hosts, where
we get (normal) alignment checks for free with host SIGBUS and
need to add code to support unaligned accesses.

This inline code expansion is somewhat large, but it takes quite
a few instructions to make a function call to a helper anyway.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc/tcg-target.c.inc | 308 +++++++++++++++++++++++++++++++++++--
 1 file changed, 299 insertions(+), 9 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 82a9d88816..a0d206380c 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -211,6 +211,7 @@ static const int tcg_target_call_oarg_regs[] = {
 #define ARITH_ADD  (INSN_OP(2) | INSN_OP3(0x00))
 #define ARITH_ADDCC (INSN_OP(2) | INSN_OP3(0x10))
 #define ARITH_AND  (INSN_OP(2) | INSN_OP3(0x01))
+#define ARITH_ANDCC (INSN_OP(2) | INSN_OP3(0x11))
 #define ARITH_ANDN (INSN_OP(2) | INSN_OP3(0x05))
 #define ARITH_OR   (INSN_OP(2) | INSN_OP3(0x02))
 #define ARITH_ORCC (INSN_OP(2) | INSN_OP3(0x12))
@@ -997,7 +998,7 @@ static void build_trampolines(TCGContext *s)
             /* Skip the oi argument.  */
             ra += 1;
         }
-                
+
         /* Set the retaddr operand.  */
         if (ra >= TCG_REG_O6) {
             tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_O7, TCG_REG_CALL_STACK,
@@ -1012,6 +1013,38 @@ static void build_trampolines(TCGContext *s)
         tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
     }
 }
+#else
+static const tcg_insn_unit *qemu_unalign_ld_trampoline;
+static const tcg_insn_unit *qemu_unalign_st_trampoline;
+
+static void build_trampolines(TCGContext *s)
+{
+    for (int ld = 0; ld < 2; ++ld) {
+        void *helper;
+
+        while ((uintptr_t)s->code_ptr & 15) {
+            tcg_out_nop(s);
+        }
+
+        if (ld) {
+            helper = helper_unaligned_ld;
+            qemu_unalign_ld_trampoline = tcg_splitwx_to_rx(s->code_ptr);
+        } else {
+            helper = helper_unaligned_st;
+            qemu_unalign_st_trampoline = tcg_splitwx_to_rx(s->code_ptr);
+        }
+
+        if (!SPARC64 && TARGET_LONG_BITS == 64) {
+            /* Install the high part of the address.  */
+            tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O2, 32, SHIFT_SRLX);
+        }
+
+        /* Tail call.  */
+        tcg_out_jmpl_const(s, helper, true, true);
+        /* delay slot -- set the env argument */
+        tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
+    }
+}
 #endif
 
 /* Generate global QEMU prologue and epilogue code */
@@ -1062,9 +1095,7 @@ static void tcg_target_qemu_prologue(TCGContext *s)
     /* delay slot */
     tcg_out_movi_imm13(s, TCG_REG_O0, 0);
 
-#ifdef CONFIG_SOFTMMU
     build_trampolines(s);
-#endif
 }
 
 static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
@@ -1149,18 +1180,22 @@ static TCGReg tcg_out_tlb_load(TCGContext *s, TCGReg addr, int mem_index,
 static const int qemu_ld_opc[(MO_SSIZE | MO_BSWAP) + 1] = {
     [MO_UB]   = LDUB,
     [MO_SB]   = LDSB,
+    [MO_UB | MO_LE] = LDUB,
+    [MO_SB | MO_LE] = LDSB,
 
     [MO_BEUW] = LDUH,
     [MO_BESW] = LDSH,
     [MO_BEUL] = LDUW,
     [MO_BESL] = LDSW,
     [MO_BEUQ] = LDX,
+    [MO_BESQ] = LDX,
 
     [MO_LEUW] = LDUH_LE,
     [MO_LESW] = LDSH_LE,
     [MO_LEUL] = LDUW_LE,
     [MO_LESL] = LDSW_LE,
     [MO_LEUQ] = LDX_LE,
+    [MO_LESQ] = LDX_LE,
 };
 
 static const int qemu_st_opc[(MO_SIZE | MO_BSWAP) + 1] = {
@@ -1179,11 +1214,12 @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
                             MemOpIdx oi, bool is_64)
 {
     MemOp memop = get_memop(oi);
+    tcg_insn_unit *label_ptr;
+
 #ifdef CONFIG_SOFTMMU
     unsigned memi = get_mmuidx(oi);
     TCGReg addrz, param;
     const tcg_insn_unit *func;
-    tcg_insn_unit *label_ptr;
 
     addrz = tcg_out_tlb_load(s, addr, memi, memop,
                              offsetof(CPUTLBEntry, addr_read));
@@ -1247,13 +1283,190 @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
 
     *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
 #else
+    TCGReg index = (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0);
+    unsigned a_bits = get_alignment_bits(memop);
+    unsigned s_bits = memop & MO_SIZE;
+    unsigned t_bits;
+    TCGReg orig_addr = addr;
+
     if (SPARC64 && TARGET_LONG_BITS == 32) {
         tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
         addr = TCG_REG_T1;
     }
-    tcg_out_ldst_rr(s, data, addr,
-                    (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0),
+
+    /*
+     * Normal case: alignment equal to access size.
+     */
+    if (a_bits == s_bits) {
+        tcg_out_ldst_rr(s, data, addr, index,
+                        qemu_ld_opc[memop & (MO_BSWAP | MO_SSIZE)]);
+        return;
+    }
+
+    /*
+     * Test for at least natural alignment, and assume most accesses
+     * will be aligned -- perform a straight load in the delay slot.
+     * This is required to preserve atomicity for aligned accesses.
+     */
+    t_bits = MAX(a_bits, s_bits);
+    tcg_debug_assert(t_bits < 13);
+    tcg_out_arithi(s, TCG_REG_G0, addr, (1u << t_bits) - 1, ARITH_ANDCC);
+
+    /* beq,a,pt %icc, label */
+    label_ptr = s->code_ptr;
+    tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT | BPCC_ICC, 0);
+    /* delay slot */
+    tcg_out_ldst_rr(s, data, addr, index,
                     qemu_ld_opc[memop & (MO_BSWAP | MO_SSIZE)]);
+
+    /*
+     * Overalignment: When we're asking for really large alignment,
+     * the actual access is always done above and all we need to do
+     * here is invoke the handler for SIGBUS.
+     */
+    if (a_bits >= s_bits) {
+        TCGReg arg_low = TCG_REG_O1 + (!SPARC64 && TARGET_LONG_BITS == 64);
+        tcg_out_call_nodelay(s, qemu_unalign_ld_trampoline, false);
+        /* delay slot -- move to low part of argument reg */
+        tcg_out_mov_delay(s, arg_low, addr);
+        goto done;
+    }
+
+    /*
+     * Underalignment: use multiple loads to perform the operation.
+     *
+     * Force full address into T1 early; avoids problems with
+     * overlap between @addr and @data.
+     *
+     * Note that the little-endian instructions use the immediate
+     * asi form, and must use tcg_out_ldst_rr.
+     */
+    tcg_out_arith(s, TCG_REG_T1, addr, index, ARITH_ADD);
+
+    switch ((unsigned)memop) {
+    case MO_BEUW | MO_UNALN:
+    case MO_BESW | MO_UNALN:
+    case MO_BEUL | MO_ALIGN_2:
+    case MO_BESL | MO_ALIGN_2:
+    case MO_BEUQ | MO_ALIGN_4:
+        /* Two loads: shift and combine. */
+        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0,
+                        qemu_ld_opc[a_bits | MO_BE | (memop & MO_SIGN)]);
+        tcg_out_ldst(s, data, TCG_REG_T1, 1 << a_bits,
+                        qemu_ld_opc[a_bits | MO_BE]);
+        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 8 << a_bits, SHIFT_SLLX);
+        tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
+        break;
+
+    case MO_LEUW | MO_UNALN:
+    case MO_LESW | MO_UNALN:
+    case MO_LEUL | MO_ALIGN_2:
+    case MO_LESL | MO_ALIGN_2:
+    case MO_LEUQ | MO_ALIGN_4:
+        /* Similarly, with shifts adjusted for little-endian. */
+        tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0,
+                        qemu_ld_opc[a_bits | MO_LE]);
+        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 1 << a_bits, ARITH_ADD);
+        tcg_out_ldst_rr(s, data, TCG_REG_T1, TCG_REG_G0,
+                        qemu_ld_opc[a_bits | MO_LE | (memop & MO_SIGN)]);
+        tcg_out_arithi(s, data, data, 8 << a_bits, SHIFT_SLLX);
+        tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
+        break;
+
+    case MO_BEUL | MO_UNALN:
+    case MO_BESL | MO_UNALN:
+        /*
+         * Naively, this would require 4 loads, 3 shifts, 3 ors.
+         * Use two 32-bit aligned loads, combine, and extract.
+         */
+        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 3, ARITH_ANDN);
+        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0, LDUW);
+        tcg_out_ldst(s, TCG_REG_T1, TCG_REG_T1, 4, LDUW);
+        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 32, SHIFT_SLLX);
+        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_T2, ARITH_OR);
+        tcg_out_arithi(s, TCG_REG_T2, orig_addr, 3, ARITH_AND);
+        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 3, SHIFT_SLL);
+        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_T2, SHIFT_SLLX);
+        tcg_out_arithi(s, data, TCG_REG_T1, 32,
+                       memop & MO_SIGN ? SHIFT_SRAX : SHIFT_SRLX);
+        break;
+
+    case MO_LEUL | MO_UNALN:
+    case MO_LESL | MO_UNALN:
+        /* Similarly, with shifts adjusted for little-endian. */
+        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 3, ARITH_ANDN);
+        tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, LDUW_LE);
+        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 4, ARITH_ADD);
+        tcg_out_ldst_rr(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_G0, LDUW_LE);
+        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 32, SHIFT_SLLX);
+        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_T2, ARITH_OR);
+        tcg_out_arithi(s, TCG_REG_T2, orig_addr, 3, ARITH_AND);
+        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 3, SHIFT_SLL);
+        tcg_out_arith(s, data, TCG_REG_T1, TCG_REG_T2, SHIFT_SRLX);
+        if (is_64) {
+            tcg_out_arithi(s, data, data, 0,
+                           memop & MO_SIGN ? SHIFT_SRA : SHIFT_SRL);
+        }
+        break;
+
+    case MO_BEUQ | MO_UNALN:
+        /* Similarly for 64-bit. */
+        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 7, ARITH_ANDN);
+        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0, LDX);
+        tcg_out_ldst(s, TCG_REG_T1, TCG_REG_T1, 8, LDX);
+        tcg_out_arithi(s, data, orig_addr, 7, ARITH_AND);
+        tcg_out_arithi(s, data, data, 3, SHIFT_SLL);
+        tcg_out_arith(s, TCG_REG_T2, TCG_REG_T2, data, SHIFT_SLLX);
+        tcg_out_arithi(s, data, data, 64, ARITH_SUB);
+        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, data, SHIFT_SRLX);
+        tcg_out_arith(s, data, TCG_REG_T1, TCG_REG_T2, ARITH_OR);
+        break;
+
+    case MO_LEUQ | MO_UNALN:
+        /* Similarly for little-endian. */
+        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 7, ARITH_ANDN);
+        tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, LDX_LE);
+        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 8, ARITH_ADD);
+        tcg_out_ldst_rr(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_G0, LDX_LE);
+        tcg_out_arithi(s, data, orig_addr, 7, ARITH_AND);
+        tcg_out_arithi(s, data, data, 3, SHIFT_SLL);
+        tcg_out_arith(s, TCG_REG_T2, TCG_REG_T2, data, SHIFT_SRLX);
+        tcg_out_arithi(s, data, data, 64, ARITH_SUB);
+        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, data, SHIFT_SLLX);
+        tcg_out_arith(s, data, TCG_REG_T1, TCG_REG_T2, ARITH_OR);
+        break;
+
+    case MO_BEUQ | MO_ALIGN_2:
+        /*
+         * An extra test to verify alignment 2 is 5 insns, which
+         * is more than we would save by using the slightly smaller
+         * unaligned sequence above.
+         */
+        tcg_out_ldst(s, data, TCG_REG_T1, 0, LDUH);
+        for (int i = 2; i < 8; i += 2) {
+            tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, LDUW);
+            tcg_out_arithi(s, data, data, 16, SHIFT_SLLX);
+            tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
+        }
+        break;
+
+    case MO_LEUQ | MO_ALIGN_2:
+        /* Similarly for little-endian. */
+        tcg_out_ldst_rr(s, data, TCG_REG_T1, TCG_REG_G0, LDUH_LE);
+        for (int i = 2; i < 8; i += 2) {
+            tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 2, ARITH_ADD);
+            tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, LDUW_LE);
+            tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, i * 8, SHIFT_SLLX);
+            tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
+        }
+        break;
+
+    default:
+        g_assert_not_reached();
+    }
+
+ done:
+    *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
 #endif /* CONFIG_SOFTMMU */
 }
 
@@ -1261,11 +1474,12 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
                             MemOpIdx oi)
 {
     MemOp memop = get_memop(oi);
+    tcg_insn_unit *label_ptr;
+
 #ifdef CONFIG_SOFTMMU
     unsigned memi = get_mmuidx(oi);
     TCGReg addrz, param;
     const tcg_insn_unit *func;
-    tcg_insn_unit *label_ptr;
 
     addrz = tcg_out_tlb_load(s, addr, memi, memop,
                              offsetof(CPUTLBEntry, addr_write));
@@ -1302,13 +1516,89 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
 
     *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
 #else
+    TCGReg index = (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0);
+    unsigned a_bits = get_alignment_bits(memop);
+    unsigned s_bits = memop & MO_SIZE;
+    unsigned t_bits;
+
     if (SPARC64 && TARGET_LONG_BITS == 32) {
         tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
         addr = TCG_REG_T1;
     }
-    tcg_out_ldst_rr(s, data, addr,
-                    (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0),
+
+    /*
+     * Normal case: alignment equal to access size.
+     */
+    if (a_bits == s_bits) {
+        tcg_out_ldst_rr(s, data, addr, index,
+                        qemu_st_opc[memop & (MO_BSWAP | MO_SIZE)]);
+        return;
+    }
+
+    /*
+     * Test for at least natural alignment, and assume most accesses
+     * will be aligned -- perform a straight store in the delay slot.
+     * This is required to preserve atomicity for aligned accesses.
+     */
+    t_bits = MAX(a_bits, s_bits);
+    tcg_debug_assert(t_bits < 13);
+    tcg_out_arithi(s, TCG_REG_G0, addr, (1u << t_bits) - 1, ARITH_ANDCC);
+
+    /* beq,a,pt %icc, label */
+    label_ptr = s->code_ptr;
+    tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT | BPCC_ICC, 0);
+    /* delay slot */
+    tcg_out_ldst_rr(s, data, addr, index,
                     qemu_st_opc[memop & (MO_BSWAP | MO_SIZE)]);
+
+    if (a_bits >= s_bits) {
+        TCGReg arg_low = TCG_REG_O1 + (!SPARC64 && TARGET_LONG_BITS == 64);
+        /* Overalignment: only need to call helper for SIGBUS. */
+        tcg_out_call_nodelay(s, qemu_unalign_st_trampoline, false);
+        /* delay slot -- move to low part of argument reg */
+        tcg_out_mov_delay(s, arg_low, addr);
+    } else {
+        /* Underalignment: store by pieces of minimum alignment. */
+        int st_opc, a_size, s_size, i;
+
+        /*
+         * Force full address into T1 early; avoids problems with
+         * overlap between @addr and @data.
+         */
+        tcg_out_arith(s, TCG_REG_T1, addr, index, ARITH_ADD);
+
+        a_size = 1 << a_bits;
+        s_size = 1 << (memop & MO_SIZE);
+        if ((memop & MO_BSWAP) == MO_BE) {
+            st_opc = qemu_st_opc[a_bits + MO_BE];
+            for (i = 0; i < s_size; i += a_size) {
+                TCGReg d = data;
+                int shift = (s_size - a_size - i) * 8;
+                if (shift) {
+                    d = TCG_REG_T2;
+                    tcg_out_arithi(s, d, data, shift, SHIFT_SRLX);
+                }
+                tcg_out_ldst(s, d, TCG_REG_T1, i, st_opc);
+            }
+        } else if (a_bits == 0) {
+            tcg_out_ldst(s, data, TCG_REG_T1, 0, STB);
+            for (i = 1; i < s_size; i++) {
+                tcg_out_arithi(s, TCG_REG_T2, data, i * 8, SHIFT_SRLX);
+                tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, STB);
+            }
+        } else {
+            /* Note that ST*A with immediate asi must use indexed address. */
+            st_opc = qemu_st_opc[a_bits + MO_LE];
+            tcg_out_ldst_rr(s, data, TCG_REG_T1, TCG_REG_G0, st_opc);
+            for (i = a_size; i < s_size; i += a_size) {
+                tcg_out_arithi(s, TCG_REG_T2, data, i * 8, SHIFT_SRLX);
+                tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, a_size, ARITH_ADD);
+                tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, st_opc);
+            }
+        }
+    }
+
+    *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
 #endif /* CONFIG_SOFTMMU */
 }
 
-- 
2.25.1




* Re: [PATCH v4 2/5] tcg/sparc: Improve code gen for shifted 32-bit constants
  2022-02-04  7:00 ` [PATCH v4 2/5] tcg/sparc: Improve code gen for shifted 32-bit constants Richard Henderson
@ 2022-02-04  7:56   ` Philippe Mathieu-Daudé via
  2022-02-04 17:39   ` Peter Maydell
  1 sibling, 0 replies; 17+ messages in thread
From: Philippe Mathieu-Daudé via @ 2022-02-04  7:56 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee

On 4/2/22 08:00, Richard Henderson wrote:
> We had code for checking for 13 and 21-bit shifted constants,
> but we can do better and allow 32-bit shifted constants.
> This is still 2 insns shorter than the full 64-bit sequence.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   tcg/sparc/tcg-target.c.inc | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>



* Re: [PATCH v4 1/5] tcg/sparc: Add scratch argument to tcg_out_movi_int
  2022-02-04  7:00 ` [PATCH v4 1/5] tcg/sparc: Add scratch argument to tcg_out_movi_int Richard Henderson
@ 2022-02-04 17:35   ` Peter Maydell
  2022-02-04 20:42     ` Richard Henderson
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Maydell @ 2022-02-04 17:35 UTC (permalink / raw)
  To: Richard Henderson; +Cc: alex.bennee, qemu-devel

On Fri, 4 Feb 2022 at 07:53, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> This will allow us to control exactly what scratch register is
> used for loading the constant.  Also, fix a theoretical problem
> in recursing through tcg_out_movi, which may provide a different
> value for in_prologue.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc/tcg-target.c.inc | 20 ++++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)

>  static void tcg_out_movi(TCGContext *s, TCGType type,
>                           TCGReg ret, tcg_target_long arg)
>  {
> -    tcg_out_movi_int(s, type, ret, arg, false);
> +    tcg_debug_assert(ret != TCG_REG_T2);
> +    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);

Here we assert that 'ret' isn't TCG_REG_T2, but in
tcg_out_addsub2_i64() we do:

           tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_T2, bh);
and
           tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_T2, bh + (is_sub ? -1 : 1));

Otherwise looks OK.

thanks
-- PMM



* Re: [PATCH v4 2/5] tcg/sparc: Improve code gen for shifted 32-bit constants
  2022-02-04  7:00 ` [PATCH v4 2/5] tcg/sparc: Improve code gen for shifted 32-bit constants Richard Henderson
  2022-02-04  7:56   ` Philippe Mathieu-Daudé via
@ 2022-02-04 17:39   ` Peter Maydell
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Maydell @ 2022-02-04 17:39 UTC (permalink / raw)
  To: Richard Henderson; +Cc: alex.bennee, qemu-devel

On Fri, 4 Feb 2022 at 07:27, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> We had code for checking for 13 and 21-bit shifted constants,
> but we can do better and allow 32-bit shifted constants.
> This is still 2 insns shorter than the full 64-bit sequence.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM



* Re: [PATCH v4 3/5] tcg/sparc: Use the constant pool for 64-bit constants
  2022-02-04  7:00 ` [PATCH v4 3/5] tcg/sparc: Use the constant pool for 64-bit constants Richard Henderson
@ 2022-02-04 18:18   ` Peter Maydell
  2022-02-04 20:41     ` Richard Henderson
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Maydell @ 2022-02-04 18:18 UTC (permalink / raw)
  To: Richard Henderson; +Cc: alex.bennee, qemu-devel

On Fri, 4 Feb 2022 at 07:53, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc/tcg-target.c.inc | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
> index 6349f750cc..47bdf314a0 100644
> --- a/tcg/sparc/tcg-target.c.inc
> +++ b/tcg/sparc/tcg-target.c.inc
> @@ -332,6 +332,13 @@ static bool patch_reloc(tcg_insn_unit *src_rw, int type,
>          insn &= ~INSN_OFF19(-1);
>          insn |= INSN_OFF19(pcrel);
>          break;
> +    case R_SPARC_13:
> +        if (!check_fit_ptr(value, 13)) {
> +            return false;
> +        }

This code seems to contemplate that the offset might not fit
into the 13-bit immediate field (unlike the other two reloc
cases in this function, which just assert() that it fits)...

> +        insn &= ~INSN_IMM13(-1);
> +        insn |= INSN_IMM13(value);
> +        break;
>      default:
>          g_assert_not_reached();
>      }
> @@ -469,6 +476,14 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>          return;
>      }
>
> +    /* Use the constant pool, if possible. */
> +    if (!in_prologue && USE_REG_TB) {
> +        new_pool_label(s, arg, R_SPARC_13, s->code_ptr,
> +                       tcg_tbrel_diff(s, NULL));
> +        tcg_out32(s, LDX | INSN_RD(ret) | INSN_RS1(TCG_REG_TB));
> +        return;
> +    }

...but this code doesn't seem to have any mechanism for
falling back to something else if it won't fit.

> +
>      /* A 64-bit constant decomposed into 2 32-bit pieces.  */
>      if (check_fit_i32(lo, 13)) {
>          hi = (arg - lo) >> 32;
> --
> 2.25.1

thanks
-- PMM



* Re: [PATCH v4 4/5] tcg/sparc: Add tcg_out_jmpl_const for better tail calls
  2022-02-04  7:00 ` [PATCH v4 4/5] tcg/sparc: Add tcg_out_jmpl_const for better tail calls Richard Henderson
@ 2022-02-04 18:34   ` Peter Maydell
  2022-02-04 20:44     ` Richard Henderson
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Maydell @ 2022-02-04 18:34 UTC (permalink / raw)
  To: Richard Henderson; +Cc: alex.bennee, qemu-devel

On Fri, 4 Feb 2022 at 07:54, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Due to mapping changes, we now rarely place the code_gen_buffer
> near the main executable.  Which means that direct calls will
> now rarely be in range.
>
> So, always use indirect calls for tail calls, which allows us to
> avoid clobbering %o7, and therefore we need not save and restore it.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

As an aside: I found it a bit confusing that tcg_out_mov()
takes a 'type' argument but ignores it, whereas tcg_out_mov_delay()
doesn't take a 'type' argument at all.

thanks
-- PMM



* Re: [PATCH v4 5/5] tcg/sparc: Support unaligned access for user-only
  2022-02-04  7:00 ` [PATCH v4 5/5] tcg/sparc: Support unaligned access for user-only Richard Henderson
@ 2022-02-04 19:07   ` Peter Maydell
  2022-02-04 20:55     ` Richard Henderson
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Maydell @ 2022-02-04 19:07 UTC (permalink / raw)
  To: Richard Henderson; +Cc: alex.bennee, qemu-devel

On Fri, 4 Feb 2022 at 08:16, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> This is kinda sorta the opposite of the other tcg hosts, where
> we get (normal) alignment checks for free with host SIGBUS and
> need to add code to support unaligned accesses.
>
> This inline code expansion is somewhat large, but it takes quite
> a few instructions to make a function call to a helper anyway.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc/tcg-target.c.inc | 308 +++++++++++++++++++++++++++++++++++--
>  1 file changed, 299 insertions(+), 9 deletions(-)
>
> diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
> index 82a9d88816..a0d206380c 100644
> --- a/tcg/sparc/tcg-target.c.inc
> +++ b/tcg/sparc/tcg-target.c.inc
> @@ -211,6 +211,7 @@ static const int tcg_target_call_oarg_regs[] = {
>  #define ARITH_ADD  (INSN_OP(2) | INSN_OP3(0x00))
>  #define ARITH_ADDCC (INSN_OP(2) | INSN_OP3(0x10))
>  #define ARITH_AND  (INSN_OP(2) | INSN_OP3(0x01))
> +#define ARITH_ANDCC (INSN_OP(2) | INSN_OP3(0x11))
>  #define ARITH_ANDN (INSN_OP(2) | INSN_OP3(0x05))
>  #define ARITH_OR   (INSN_OP(2) | INSN_OP3(0x02))
>  #define ARITH_ORCC (INSN_OP(2) | INSN_OP3(0x12))
> @@ -997,7 +998,7 @@ static void build_trampolines(TCGContext *s)
>              /* Skip the oi argument.  */
>              ra += 1;
>          }
> -
> +
>          /* Set the retaddr operand.  */
>          if (ra >= TCG_REG_O6) {
>              tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_O7, TCG_REG_CALL_STACK,

Stray whitespace change.

> @@ -1012,6 +1013,38 @@ static void build_trampolines(TCGContext *s)
>          tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
>      }
>  }
> +#else
> +static const tcg_insn_unit *qemu_unalign_ld_trampoline;
> +static const tcg_insn_unit *qemu_unalign_st_trampoline;
> +
> +static void build_trampolines(TCGContext *s)
> +{
> +    for (int ld = 0; ld < 2; ++ld) {
> +        void *helper;
> +
> +        while ((uintptr_t)s->code_ptr & 15) {
> +            tcg_out_nop(s);
> +        }
> +
> +        if (ld) {
> +            helper = helper_unaligned_ld;
> +            qemu_unalign_ld_trampoline = tcg_splitwx_to_rx(s->code_ptr);
> +        } else {
> +            helper = helper_unaligned_st;
> +            qemu_unalign_st_trampoline = tcg_splitwx_to_rx(s->code_ptr);
> +        }
> +
> +        if (!SPARC64 && TARGET_LONG_BITS == 64) {
> +            /* Install the high part of the address.  */
> +            tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O2, 32, SHIFT_SRLX);
> +        }
> +
> +        /* Tail call.  */
> +        tcg_out_jmpl_const(s, helper, true, true);
> +        /* delay slot -- set the env argument */
> +        tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
> +    }
> +}
>  #endif
>
>  /* Generate global QEMU prologue and epilogue code */
> @@ -1062,9 +1095,7 @@ static void tcg_target_qemu_prologue(TCGContext *s)
>      /* delay slot */
>      tcg_out_movi_imm13(s, TCG_REG_O0, 0);
>
> -#ifdef CONFIG_SOFTMMU
>      build_trampolines(s);
> -#endif
>  }
>
>  static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
> @@ -1149,18 +1180,22 @@ static TCGReg tcg_out_tlb_load(TCGContext *s, TCGReg addr, int mem_index,
>  static const int qemu_ld_opc[(MO_SSIZE | MO_BSWAP) + 1] = {
>      [MO_UB]   = LDUB,
>      [MO_SB]   = LDSB,
> +    [MO_UB | MO_LE] = LDUB,
> +    [MO_SB | MO_LE] = LDSB,
>
>      [MO_BEUW] = LDUH,
>      [MO_BESW] = LDSH,
>      [MO_BEUL] = LDUW,
>      [MO_BESL] = LDSW,
>      [MO_BEUQ] = LDX,
> +    [MO_BESQ] = LDX,
>
>      [MO_LEUW] = LDUH_LE,
>      [MO_LESW] = LDSH_LE,
>      [MO_LEUL] = LDUW_LE,
>      [MO_LESL] = LDSW_LE,
>      [MO_LEUQ] = LDX_LE,
> +    [MO_LESQ] = LDX_LE,
>  };
>
>  static const int qemu_st_opc[(MO_SIZE | MO_BSWAP) + 1] = {
> @@ -1179,11 +1214,12 @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
>                              MemOpIdx oi, bool is_64)
>  {
>      MemOp memop = get_memop(oi);
> +    tcg_insn_unit *label_ptr;
> +
>  #ifdef CONFIG_SOFTMMU
>      unsigned memi = get_mmuidx(oi);
>      TCGReg addrz, param;
>      const tcg_insn_unit *func;
> -    tcg_insn_unit *label_ptr;
>
>      addrz = tcg_out_tlb_load(s, addr, memi, memop,
>                               offsetof(CPUTLBEntry, addr_read));
> @@ -1247,13 +1283,190 @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
>
>      *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
>  #else
> +    TCGReg index = (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0);
> +    unsigned a_bits = get_alignment_bits(memop);
> +    unsigned s_bits = memop & MO_SIZE;
> +    unsigned t_bits;
> +    TCGReg orig_addr = addr;
> +
>      if (SPARC64 && TARGET_LONG_BITS == 32) {
>          tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
>          addr = TCG_REG_T1;
>      }
> -    tcg_out_ldst_rr(s, data, addr,
> -                    (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0),
> +
> +    /*
> +     * Normal case: alignment equal to access size.
> +     */
> +    if (a_bits == s_bits) {
> +        tcg_out_ldst_rr(s, data, addr, index,
> +                        qemu_ld_opc[memop & (MO_BSWAP | MO_SSIZE)]);
> +        return;
> +    }
> +
> +    /*
> +     * Test for at least natural alignment, and assume most accesses
> +     * will be aligned -- perform a straight load in the delay slot.
> +     * This is required to preserve atomicity for aligned accesses.
> +     */
> +    t_bits = MAX(a_bits, s_bits);
> +    tcg_debug_assert(t_bits < 13);
> +    tcg_out_arithi(s, TCG_REG_G0, addr, (1u << t_bits) - 1, ARITH_ANDCC);
> +
> +    /* beq,a,pt %icc, label */
> +    label_ptr = s->code_ptr;
> +    tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT | BPCC_ICC, 0);
> +    /* delay slot */
> +    tcg_out_ldst_rr(s, data, addr, index,
>                      qemu_ld_opc[memop & (MO_BSWAP | MO_SSIZE)]);
> +
> +    /*
> +     * Overalignment: When we're asking for really large alignment,
> +     * the actual access is always done above and all we need to do
> +     * here is invoke the handler for SIGBUS.
> +     */

I thought the access was in an annulled delay slot and so won't
be "done above" ?

> +    if (a_bits >= s_bits) {
> +        TCGReg arg_low = TCG_REG_O1 + (!SPARC64 && TARGET_LONG_BITS == 64);
> +        tcg_out_call_nodelay(s, qemu_unalign_ld_trampoline, false);
> +        /* delay slot -- move to low part of argument reg */
> +        tcg_out_mov_delay(s, arg_low, addr);
> +        goto done;
> +    }
> +
> +    /*
> +     * Underalignment: use multiple loads to perform the operation.
> +     *
> +     * Force full address into T1 early; avoids problems with
> +     * overlap between @addr and @data.
> +     *
> +     * Note that the little-endian instructions use the immediate
> +     * asi form, and must use tcg_out_ldst_rr.
> +     */
> +    tcg_out_arith(s, TCG_REG_T1, addr, index, ARITH_ADD);
> +
> +    switch ((unsigned)memop) {
> +    case MO_BEUW | MO_UNALN:
> +    case MO_BESW | MO_UNALN:
> +    case MO_BEUL | MO_ALIGN_2:
> +    case MO_BESL | MO_ALIGN_2:
> +    case MO_BEUQ | MO_ALIGN_4:
> +        /* Two loads: shift and combine. */
> +        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0,
> +                        qemu_ld_opc[a_bits | MO_BE | (memop & MO_SIGN)]);
> +        tcg_out_ldst(s, data, TCG_REG_T1, 1 << a_bits,
> +                        qemu_ld_opc[a_bits | MO_BE]);
> +        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 8 << a_bits, SHIFT_SLLX);

Why are we calculating the offset in memory of the second half of
the data and the amount to shift it by using the alignment-bits
rather than the size-bits? Because of the cases, we know that
here a_bits == s_bits - 1, but I think it would be clearer to
work in terms of the size.

> +        tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
> +        break;
> +
> +    case MO_LEUW | MO_UNALN:
> +    case MO_LESW | MO_UNALN:
> +    case MO_LEUL | MO_ALIGN_2:
> +    case MO_LESL | MO_ALIGN_2:
> +    case MO_LEUQ | MO_ALIGN_4:
> +        /* Similarly, with shifts adjusted for little-endian. */
> +        tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0,
> +                        qemu_ld_opc[a_bits | MO_LE]);
> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 1 << a_bits, ARITH_ADD);
> +        tcg_out_ldst_rr(s, data, TCG_REG_T1, TCG_REG_G0,
> +                        qemu_ld_opc[a_bits | MO_LE | (memop & MO_SIGN)]);
> +        tcg_out_arithi(s, data, data, 8 << a_bits, SHIFT_SLLX);
> +        tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
> +        break;
> +
> +    case MO_BEUL | MO_UNALN:
> +    case MO_BESL | MO_UNALN:
> +        /*
> +         * Naively, this would require 4 loads, 3 shifts, 3 ors.
> +         * Use two 32-bit aligned loads, combine, and extract.
> +         */

Accessing off the end of a buffer just because you know it can't
segfault because it's not going to cross a page boundary will
make Valgrind unhappy :-)

At least, we should mention in the comment that this loads data
not from where the guest asked but that we know it won't access
any data in a page that the guest didn't ask to touch.

> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 3, ARITH_ANDN);
> +        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0, LDUW);

Doesn't this give the wrong fault-address value to the guest
(ie not the address it used for the load, but a rounded-down one)
if we take a SIGSEGV? Or do we fix that up elsewhere?

> +        tcg_out_ldst(s, TCG_REG_T1, TCG_REG_T1, 4, LDUW);
> +        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 32, SHIFT_SLLX);
> +        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_T2, ARITH_OR);
> +        tcg_out_arithi(s, TCG_REG_T2, orig_addr, 3, ARITH_AND);
> +        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 3, SHIFT_SLL);
> +        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_T2, SHIFT_SLLX);
> +        tcg_out_arithi(s, data, TCG_REG_T1, 32,
> +                       memop & MO_SIGN ? SHIFT_SRAX : SHIFT_SRLX);
> +        break;
> +
> +    case MO_LEUL | MO_UNALN:
> +    case MO_LESL | MO_UNALN:
> +        /* Similarly, with shifts adjusted for little-endian. */
> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 3, ARITH_ANDN);
> +        tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, LDUW_LE);
> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 4, ARITH_ADD);
> +        tcg_out_ldst_rr(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_G0, LDUW_LE);
> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 32, SHIFT_SLLX);
> +        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_T2, ARITH_OR);
> +        tcg_out_arithi(s, TCG_REG_T2, orig_addr, 3, ARITH_AND);
> +        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 3, SHIFT_SLL);
> +        tcg_out_arith(s, data, TCG_REG_T1, TCG_REG_T2, SHIFT_SRLX);
> +        if (is_64) {
> +            tcg_out_arithi(s, data, data, 0,
> +                           memop & MO_SIGN ? SHIFT_SRA : SHIFT_SRL);
> +        }
> +        break;
> +
> +    case MO_BEUQ | MO_UNALN:
> +        /* Similarly for 64-bit. */
> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 7, ARITH_ANDN);
> +        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0, LDX);
> +        tcg_out_ldst(s, TCG_REG_T1, TCG_REG_T1, 8, LDX);
> +        tcg_out_arithi(s, data, orig_addr, 7, ARITH_AND);
> +        tcg_out_arithi(s, data, data, 3, SHIFT_SLL);
> +        tcg_out_arith(s, TCG_REG_T2, TCG_REG_T2, data, SHIFT_SLLX);
> +        tcg_out_arithi(s, data, data, 64, ARITH_SUB);
> +        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, data, SHIFT_SRLX);
> +        tcg_out_arith(s, data, TCG_REG_T1, TCG_REG_T2, ARITH_OR);
> +        break;
> +
> +    case MO_LEUQ | MO_UNALN:
> +        /* Similarly for little-endian. */
> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 7, ARITH_ANDN);
> +        tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, LDX_LE);
> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 8, ARITH_ADD);
> +        tcg_out_ldst_rr(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_G0, LDX_LE);
> +        tcg_out_arithi(s, data, orig_addr, 7, ARITH_AND);
> +        tcg_out_arithi(s, data, data, 3, SHIFT_SLL);
> +        tcg_out_arith(s, TCG_REG_T2, TCG_REG_T2, data, SHIFT_SRLX);
> +        tcg_out_arithi(s, data, data, 64, ARITH_SUB);
> +        tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, data, SHIFT_SLLX);
> +        tcg_out_arith(s, data, TCG_REG_T1, TCG_REG_T2, ARITH_OR);
> +        break;
> +
> +    case MO_BEUQ | MO_ALIGN_2:
> +        /*
> +         * An extra test to verify alignment 2 is 5 insns, which
> +         * is more than we would save by using the slightly smaller
> +         * unaligned sequence above.
> +         */
> +        tcg_out_ldst(s, data, TCG_REG_T1, 0, LDUH);
> +        for (int i = 2; i < 8; i += 2) {
> +            tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, LDUW);

Isn't this loading 2 + 3 * 4 == 14 bytes?

> +            tcg_out_arithi(s, data, data, 16, SHIFT_SLLX);
> +            tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
> +        }
> +        break;
> +
> +    case MO_LEUQ | MO_ALIGN_2:
> +        /* Similarly for little-endian. */
> +        tcg_out_ldst_rr(s, data, TCG_REG_T1, TCG_REG_G0, LDUH_LE);
> +        for (int i = 2; i < 8; i += 2) {
> +            tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 2, ARITH_ADD);
> +            tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, LDUW_LE);

Ditto.

> +            tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, i * 8, SHIFT_SLLX);
> +            tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
> +        }
> +        break;
> +
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> + done:
> +    *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
>  #endif /* CONFIG_SOFTMMU */
>  }
>
> @@ -1261,11 +1474,12 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
>                              MemOpIdx oi)
>  {
>      MemOp memop = get_memop(oi);
> +    tcg_insn_unit *label_ptr;
> +
>  #ifdef CONFIG_SOFTMMU
>      unsigned memi = get_mmuidx(oi);
>      TCGReg addrz, param;
>      const tcg_insn_unit *func;
> -    tcg_insn_unit *label_ptr;
>
>      addrz = tcg_out_tlb_load(s, addr, memi, memop,
>                               offsetof(CPUTLBEntry, addr_write));
> @@ -1302,13 +1516,89 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
>
>      *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
>  #else
> +    TCGReg index = (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0);
> +    unsigned a_bits = get_alignment_bits(memop);
> +    unsigned s_bits = memop & MO_SIZE;
> +    unsigned t_bits;
> +
>      if (SPARC64 && TARGET_LONG_BITS == 32) {
>          tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
>          addr = TCG_REG_T1;
>      }
> -    tcg_out_ldst_rr(s, data, addr,
> -                    (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0),
> +
> +    /*
> +     * Normal case: alignment equal to access size.
> +     */
> +    if (a_bits == s_bits) {
> +        tcg_out_ldst_rr(s, data, addr, index,
> +                        qemu_st_opc[memop & (MO_BSWAP | MO_SIZE)]);
> +        return;
> +    }
> +
> +    /*
> +     * Test for at least natural alignment, and assume most accesses
> +     * will be aligned -- perform a straight store in the delay slot.
> +     * This is required to preserve atomicity for aligned accesses.
> +     */
> +    t_bits = MAX(a_bits, s_bits);
> +    tcg_debug_assert(t_bits < 13);
> +    tcg_out_arithi(s, TCG_REG_G0, addr, (1u << t_bits) - 1, ARITH_ANDCC);
> +
> +    /* beq,a,pt %icc, label */
> +    label_ptr = s->code_ptr;
> +    tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT | BPCC_ICC, 0);
> +    /* delay slot */
> +    tcg_out_ldst_rr(s, data, addr, index,
>                      qemu_st_opc[memop & (MO_BSWAP | MO_SIZE)]);
> +
> +    if (a_bits >= s_bits) {
> +        TCGReg arg_low = TCG_REG_O1 + (!SPARC64 && TARGET_LONG_BITS == 64);
> +        /* Overalignment: only need to call helper for SIGBUS. */
> +        tcg_out_call_nodelay(s, qemu_unalign_st_trampoline, false);
> +        /* delay slot -- move to low part of argument reg */
> +        tcg_out_mov_delay(s, arg_low, addr);
> +    } else {
> +        /* Underalignment: store by pieces of minimum alignment. */
> +        int st_opc, a_size, s_size, i;
> +
> +        /*
> +         * Force full address into T1 early; avoids problems with
> +         * overlap between @addr and @data.
> +         */
> +        tcg_out_arith(s, TCG_REG_T1, addr, index, ARITH_ADD);
> +
> +        a_size = 1 << a_bits;
> +        s_size = 1 << (memop & MO_SIZE);

Isn't this "1 << s_bits" ?

> +        if ((memop & MO_BSWAP) == MO_BE) {
> +            st_opc = qemu_st_opc[a_bits + MO_BE];
> +            for (i = 0; i < s_size; i += a_size) {
> +                TCGReg d = data;
> +                int shift = (s_size - a_size - i) * 8;
> +                if (shift) {
> +                    d = TCG_REG_T2;
> +                    tcg_out_arithi(s, d, data, shift, SHIFT_SRLX);
> +                }
> +                tcg_out_ldst(s, d, TCG_REG_T1, i, st_opc);
> +            }
> +        } else if (a_bits == 0) {
> +            tcg_out_ldst(s, data, TCG_REG_T1, 0, STB);
> +            for (i = 1; i < s_size; i++) {
> +                tcg_out_arithi(s, TCG_REG_T2, data, i * 8, SHIFT_SRLX);
> +                tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, STB);
> +            }
> +        } else {
> +            /* Note that ST*A with immediate asi must use indexed address. */
> +            st_opc = qemu_st_opc[a_bits + MO_LE];
> +            tcg_out_ldst_rr(s, data, TCG_REG_T1, TCG_REG_G0, st_opc);
> +            for (i = a_size; i < s_size; i += a_size) {
> +                tcg_out_arithi(s, TCG_REG_T2, data, i * 8, SHIFT_SRLX);
> +                tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, a_size, ARITH_ADD);
> +                tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, st_opc);
> +            }
> +        }
> +    }
> +
> +    *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
>  #endif /* CONFIG_SOFTMMU */

thanks
-- PMM



* Re: [PATCH v4 3/5] tcg/sparc: Use the constant pool for 64-bit constants
  2022-02-04 18:18   ` Peter Maydell
@ 2022-02-04 20:41     ` Richard Henderson
  2022-02-04 22:20       ` Peter Maydell
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Henderson @ 2022-02-04 20:41 UTC (permalink / raw)
  To: Peter Maydell; +Cc: alex.bennee, qemu-devel

On 2/5/22 05:18, Peter Maydell wrote:
> On Fri, 4 Feb 2022 at 07:53, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   tcg/sparc/tcg-target.c.inc | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
>> index 6349f750cc..47bdf314a0 100644
>> --- a/tcg/sparc/tcg-target.c.inc
>> +++ b/tcg/sparc/tcg-target.c.inc
>> @@ -332,6 +332,13 @@ static bool patch_reloc(tcg_insn_unit *src_rw, int type,
>>           insn &= ~INSN_OFF19(-1);
>>           insn |= INSN_OFF19(pcrel);
>>           break;
>> +    case R_SPARC_13:
>> +        if (!check_fit_ptr(value, 13)) {
>> +            return false;
>> +        }
> 
> This code seems to contemplate that the offset might not fit
> into the 13-bit immediate field (unlike the other two reloc
> cases in this function, which just assert() that it fits)...

Ooo, thanks for noticing.  The other entries have not been updated for changes to tcg 
relocations.  They should be returning false instead of asserting.

Returning false here tells the generic code that the relocation didn't fit, so it restarts the TB 
with half the number of guest instructions.
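
For context, the complete case is along these lines (just a sketch; the
INSN_IMM13 helper name is my assumption here, by analogy with INSN_OFF19
above):

    case R_SPARC_13:
        if (!check_fit_ptr(value, 13)) {
            /* Tell the generic code to retry with a smaller TB. */
            return false;
        }
        insn &= ~INSN_IMM13(-1);        /* INSN_IMM13: assumed helper */
        insn |= INSN_IMM13(value);
        break;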

>> +    /* Use the constant pool, if possible. */
>> +    if (!in_prologue && USE_REG_TB) {
>> +        new_pool_label(s, arg, R_SPARC_13, s->code_ptr,
>> +                       tcg_tbrel_diff(s, NULL));
>> +        tcg_out32(s, LDX | INSN_RD(ret) | INSN_RS1(TCG_REG_TB));
>> +        return;
>> +    }
> 
> ...but this code doesn't seem to have any mechanism for
> falling back to something else if it won't fit.

It will fit, because we'll keep trying with smaller TBs until it does.  If for some reason 
a target generates more than 8k for a single guest insn... it will go boom.


r~


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/5] tcg/sparc: Add scratch argument to tcg_out_movi_int
  2022-02-04 17:35   ` Peter Maydell
@ 2022-02-04 20:42     ` Richard Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Henderson @ 2022-02-04 20:42 UTC (permalink / raw)
  To: Peter Maydell; +Cc: alex.bennee, qemu-devel

On 2/5/22 04:35, Peter Maydell wrote:
> On Fri, 4 Feb 2022 at 07:53, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> This will allow us to control exactly what scratch register is
>> used for loading the constant.  Also, fix a theoretical problem
>> in recursing through tcg_out_movi, which may provide a different
>> value for in_prologue.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   tcg/sparc/tcg-target.c.inc | 20 ++++++++++++--------
>>   1 file changed, 12 insertions(+), 8 deletions(-)
> 
>>   static void tcg_out_movi(TCGContext *s, TCGType type,
>>                            TCGReg ret, tcg_target_long arg)
>>   {
>> -    tcg_out_movi_int(s, type, ret, arg, false);
>> +    tcg_debug_assert(ret != TCG_REG_T2);
>> +    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);
> 
> Here we assert that 'ret' isn't TCG_REG_T2, but in
> tcg_out_addsub2_i64() we do:
> 
>             tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_T2, bh);
> and
>             tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_T2, bh + (is_sub ? -1 : 1));
> 
> Otherwise looks OK.

Oops.  Good catch.  I shouldn't have moved the assert.


r~


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 4/5] tcg/sparc: Add tcg_out_jmpl_const for better tail calls
  2022-02-04 18:34   ` Peter Maydell
@ 2022-02-04 20:44     ` Richard Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Henderson @ 2022-02-04 20:44 UTC (permalink / raw)
  To: Peter Maydell; +Cc: alex.bennee, qemu-devel

On 2/5/22 05:34, Peter Maydell wrote:
> On Fri, 4 Feb 2022 at 07:54, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Due to mapping changes, we now rarely place the code_gen_buffer
>> near the main executable.  Which means that direct calls will
>> now rarely be in range.
>>
>> So, always use indirect calls for tail calls, which allows us to
>> avoid clobbering %o7, and therefore we need not save and restore it.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> 
> Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
> 
> As an aside: I found it a bit confusing that tcg_out_mov()
> takes a 'type' argument but ignores it, whereas tcg_out_mov_delay()
> doesn't take a 'type' argument at all.

Mm.  tcg_out_mov() is common to all tcg backends, and some hosts do use the type argument.
I didn't add one to tcg_out_mov_delay() because sparc didn't need one.
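
For reference, the two signatures side by side (a sketch, bodies omitted; the
return type of tcg_out_mov follows the generic backend interface, so treat it
as illustrative rather than authoritative):

    /* Generic interface: takes a type, reports whether the move was handled. */
    static bool tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg);

    /* Sparc-only helper: no type needed, the move lands in a delay slot. */
    static void tcg_out_mov_delay(TCGContext *s, TCGReg ret, TCGReg arg);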


r~


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] tcg/sparc: Support unaligned access for user-only
  2022-02-04 19:07   ` Peter Maydell
@ 2022-02-04 20:55     ` Richard Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Henderson @ 2022-02-04 20:55 UTC (permalink / raw)
  To: Peter Maydell; +Cc: alex.bennee, qemu-devel

On 2/5/22 06:07, Peter Maydell wrote:
>> +    /*
>> +     * Overalignment: When we're asking for really large alignment,
>> +     * the actual access is always done above and all we need to do
>> +     * here is invoke the handler for SIGBUS.
>> +     */
> 
> I thought the access was in an annulled delay slot and so won't
> be "done above" ?

Bad wording, I suppose.  If the alignment check succeeds, the access is done above, in the 
annulled delay slot.  If it fails, there is no access to perform; all that is left is to report 
the failure.
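
To spell out the shape I mean (a sketch only, not the patch's exact code:
ARITH_ANDCC stands in for the backend's and-with-condition-codes opcode, the
bpcc emission is elided, and the ld trampoline name is assumed by analogy
with the store side quoted earlier):

    /* Test the low bits of the guest address. */
    tcg_out_arithi(s, TCG_REG_G0, addr, (1 << a_bits) - 1, ARITH_ANDCC);
    /*
     * ... branch past the SIGBUS path when the low bits are zero; the
     * branch is annulled and the memory access sits in its delay slot,
     * so the access only executes on the aligned (taken) path ...
     */
    /* Fall through: misaligned, so there is only the failure to report. */
    tcg_out_call_nodelay(s, qemu_unalign_ld_trampoline, false);
    /* delay slot -- move to low part of argument reg */
    tcg_out_mov_delay(s, arg_low, addr);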

>> +    switch ((unsigned)memop) {
>> +    case MO_BEUW | MO_UNALN:
>> +    case MO_BESW | MO_UNALN:
>> +    case MO_BEUL | MO_ALIGN_2:
>> +    case MO_BESL | MO_ALIGN_2:
>> +    case MO_BEUQ | MO_ALIGN_4:
>> +        /* Two loads: shift and combine. */
>> +        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0,
>> +                        qemu_ld_opc[a_bits | MO_BE | (memop & MO_SIGN)]);
>> +        tcg_out_ldst(s, data, TCG_REG_T1, 1 << a_bits,
>> +                        qemu_ld_opc[a_bits | MO_BE]);
>> +        tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, 8 << a_bits, SHIFT_SLLX);
> 
> Why are we calculating the offset in memory of the second half of
> the data, and the amount to shift it by, using the alignment bits
> rather than the size bits? Given which cases can reach here, we know
> that a_bits == s_bits - 1, but I think it would be clearer to
> work in terms of the size.

Ok.
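
I.e. something along these lines (a sketch: the loads and the shift mirror
the quoted hunk, and the final OR is my completion of the "combine" step
that the quote cuts off):

    /* Two loads: shift and combine, expressed via the access size. */
    int h_bits = s_bits - 1;            /* == a_bits for all of these cases */
    int h_size = 1 << h_bits;

    tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0,
                 qemu_ld_opc[h_bits | MO_BE | (memop & MO_SIGN)]);
    tcg_out_ldst(s, data, TCG_REG_T1, h_size,
                 qemu_ld_opc[h_bits | MO_BE]);
    tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, h_size * 8, SHIFT_SLLX);
    tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);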

>> +        tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 3, ARITH_ANDN);
>> +        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0, LDUW);
> 
> Doesn't this give the wrong fault-address value to the guest
> (ie not the address it used for the load, but a rounded-down one)
> if we take a SIGSEGV? Or do we fix that up elsewhere?

Oops, no.  Perhaps a single byte load to the zero register would fix that, without having 
to go to full load-by-parts.
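
Something like this, before the address is rounded down (a sketch; LDUB is
assumed to be the backend's zero-extending byte-load opcode, and which byte
of the access to probe is a detail):

    /* Byte probe into %g0, so a fault reports the real guest address. */
    tcg_out_ldst(s, TCG_REG_G0, TCG_REG_T1, 0, LDUB);      /* LDUB: assumed */
    /* Then round down and load the containing word as before. */
    tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, 3, ARITH_ANDN);
    tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, 0, LDUW);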

>> +    case MO_BEUQ | MO_ALIGN_2:
>> +        /*
>> +         * An extra test to verify alignment 2 is 5 insns, which
>> +         * is more than we would save by using the slightly smaller
>> +         * unaligned sequence above.
>> +         */
>> +        tcg_out_ldst(s, data, TCG_REG_T1, 0, LDUH);
>> +        for (int i = 2; i < 8; i += 2) {
>> +            tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, LDUW);
> 
> Isn't this loading 2 + 3 * 4 == 14 bytes?

Oops.  Got confused with the qemu vs sparc "word" there for a moment.
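
For the record, the corrected shape (a sketch; the shift/OR combine after
each piece is elided here, just as it was in the quoted hunk):

    /* 2-byte pieces: 2 + 3 * 2 == 8 bytes in total. */
    tcg_out_ldst(s, data, TCG_REG_T1, 0, LDUH);
    for (int i = 2; i < 8; i += 2) {
        tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, LDUH);
        /* ... shift TCG_REG_T2 into place and OR it into data ... */
    }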


r~


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/5] tcg/sparc: Use the constant pool for 64-bit constants
  2022-02-04 20:41     ` Richard Henderson
@ 2022-02-04 22:20       ` Peter Maydell
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Maydell @ 2022-02-04 22:20 UTC (permalink / raw)
  To: Richard Henderson; +Cc: alex.bennee, qemu-devel

On Fri, 4 Feb 2022 at 20:41, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 2/5/22 05:18, Peter Maydell wrote:
> > On Fri, 4 Feb 2022 at 07:53, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> >> ---
> >>   tcg/sparc/tcg-target.c.inc | 15 +++++++++++++++
> >>   1 file changed, 15 insertions(+)
> >>
> >> diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
> >> index 6349f750cc..47bdf314a0 100644
> >> --- a/tcg/sparc/tcg-target.c.inc
> >> +++ b/tcg/sparc/tcg-target.c.inc
> >> @@ -332,6 +332,13 @@ static bool patch_reloc(tcg_insn_unit *src_rw, int type,
> >>           insn &= ~INSN_OFF19(-1);
> >>           insn |= INSN_OFF19(pcrel);
> >>           break;
> >> +    case R_SPARC_13:
> >> +        if (!check_fit_ptr(value, 13)) {
> >> +            return false;
> >> +        }
> >
> > This code seems to contemplate that the offset might not fit
> > into the 13-bit immediate field (unlike the other two reloc
> > cases in this function, which just assert() that it fits)...
>
> Ooo, thanks for noticing.  The other entries have not been updated for changes to tcg
> relocations.  They should be returning false instead of asserting.
>
> Returning false here tells generic code the relocation didn't fit, and to restart the TB
> with half of the number of guest instructions.
>
> >> +    /* Use the constant pool, if possible. */
> >> +    if (!in_prologue && USE_REG_TB) {
> >> +        new_pool_label(s, arg, R_SPARC_13, s->code_ptr,
> >> +                       tcg_tbrel_diff(s, NULL));
> >> +        tcg_out32(s, LDX | INSN_RD(ret) | INSN_RS1(TCG_REG_TB));
> >> +        return;
> >> +    }
> >
> > ...but this code doesn't seem to have any mechanism for
> > falling back to something else if it won't fit.
>
> It will fit, because we'll keep trying with smaller TBs until it does.  If for some reason
> a target generates more than 8k for a single guest insn... it will go boom.

Ah, I see. Then for this patch
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2022-02-04  7:00 [PATCH v4 0/5] tcg/sparc: Unaligned access for user-only Richard Henderson
2022-02-04  7:00 ` [PATCH v4 1/5] tcg/sparc: Add scratch argument to tcg_out_movi_int Richard Henderson
2022-02-04 17:35   ` Peter Maydell
2022-02-04 20:42     ` Richard Henderson
2022-02-04  7:00 ` [PATCH v4 2/5] tcg/sparc: Improve code gen for shifted 32-bit constants Richard Henderson
2022-02-04  7:56   ` Philippe Mathieu-Daudé via
2022-02-04 17:39   ` Peter Maydell
2022-02-04  7:00 ` [PATCH v4 3/5] tcg/sparc: Use the constant pool for 64-bit constants Richard Henderson
2022-02-04 18:18   ` Peter Maydell
2022-02-04 20:41     ` Richard Henderson
2022-02-04 22:20       ` Peter Maydell
2022-02-04  7:00 ` [PATCH v4 4/5] tcg/sparc: Add tcg_out_jmpl_const for better tail calls Richard Henderson
2022-02-04 18:34   ` Peter Maydell
2022-02-04 20:44     ` Richard Henderson
2022-02-04  7:00 ` [PATCH v4 5/5] tcg/sparc: Support unaligned access for user-only Richard Henderson
2022-02-04 19:07   ` Peter Maydell
2022-02-04 20:55     ` Richard Henderson
